[PATCH v4 0/9] Handle Firmware reported Hardware Errors

Intel-XE Archive on lore.kernel.org

* [PATCH v4 0/9] Handle Firmware reported Hardware Errors
@ 2025-07-09 11:20 Riana Tauro
  2025-07-09 11:20 ` [PATCH v4 1/9] drm: Add a vendor-specific recovery method to device wedged uevent Riana Tauro
                   ` (13 more replies)
  0 siblings, 14 replies; 48+ messages in thread
From: Riana Tauro @ 2025-07-09 11:20 UTC
  To: intel-xe
  Cc: riana.tauro, anshuman.gupta, rodrigo.vivi, lucas.demarchi,
	aravind.iddamsetty, raag.jadav, umesh.nerlige.ramappa,
	frank.scarbrough, sk.anirban

Add support to handle firmware-reported errors. When CSC firmware
errors are encountered, an error interrupt is received by the GFX device as
an MSI interrupt.

The Device Source control registers indicate the source of the error as CSC.
The HEC error status register indicates that the error is firmware reported.
Depending on the type of firmware error, the error cause is written to the HEC
Firmware error register.

On encountering such CSC firmware errors, the graphics device is
wedged and the only way to recover from these errors is a firmware flash.

Add a vendor-specific recovery method to the drm device wedged uevent.
When a firmware flash is required, the device enters runtime survivability
mode and sends a drm device wedged uevent to notify userspace.

$ udevadm monitor --property --kernel
monitor will print the received events for:
KERNEL - the kernel uevent

KERNEL[754.709341] change   /devices/pci0000:00/0000:00:01.0/0000:01:00.0/0000:02:01.0/0000:03:00.0/drm/card0 (drm)
ACTION=change
DEVPATH=/devices/pci0000:00/0000:00:01.0/0000:01:00.0/0000:02:01.0/0000:03:00.0/drm/card0
SUBSYSTEM=drm
WEDGED=vendor-specific
DEVNAME=/dev/dri/card0
DEVTYPE=drm_minor
SEQNUM=5973
MAJOR=226
MINOR=0
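
A userspace consumer of this uevent has to turn the WEDGED property into a
set of recovery methods. The sketch below shows one way that could look,
assuming the value has already been extracted from the uevent environment.
The enum and parse_wedged() are hypothetical names for illustration only and
are not part of this series or of libudev.

```c
#include <stddef.h>
#include <string.h>

/* Hypothetical consumer-side flags; illustrative names, not a kernel or libudev API. */
enum wedged_method {
	WEDGED_NONE            = 1u << 0,
	WEDGED_REBIND          = 1u << 1,
	WEDGED_BUS_RESET       = 1u << 2,
	WEDGED_VENDOR_SPECIFIC = 1u << 3,
	WEDGED_UNKNOWN         = 1u << 4,
};

/* Parse the value of WEDGED= (a comma-separated method list) into a bitmask. */
static unsigned int parse_wedged(const char *value)
{
	static const struct {
		const char *name;
		unsigned int flag;
	} table[] = {
		{ "none", WEDGED_NONE },
		{ "rebind", WEDGED_REBIND },
		{ "bus-reset", WEDGED_BUS_RESET },
		{ "vendor-specific", WEDGED_VENDOR_SPECIFIC },
		{ "unknown", WEDGED_UNKNOWN },
	};
	unsigned int mask = 0;

	while (*value) {
		size_t len = strcspn(value, ",");
		/* Tokens this sketch does not recognise are treated as unknown. */
		unsigned int flag = WEDGED_UNKNOWN;
		size_t i;

		for (i = 0; i < sizeof(table) / sizeof(table[0]); i++) {
			if (strlen(table[i].name) == len &&
			    strncmp(value, table[i].name, len) == 0) {
				flag = table[i].flag;
				break;
			}
		}
		mask |= flag;

		/* Advance past the token and the separating comma, if any. */
		value += len;
		if (*value == ',')
			value++;
	}

	return mask;
}
```

For the event above, parse_wedged("vendor-specific") yields only
WEDGED_VENDOR_SPECIFIC, which a consumer would map to the vendor's documented
procedure (here, a firmware flash) rather than attempting a rebind.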

Bspec: 50875, 53073, 53074, 53075, 53076

IGT: https://patchwork.freedesktop.org/patch/660122/

Rev2: add a fault injection for csc errors
      fix review comments

Rev3: add a vendor-specific recovery method
      add support for runtime survivability mode
      enable runtime survivability mode when csc errors are reported

Rev4: refactor survivability code

Riana Tauro (9):
  drm: Add a vendor-specific recovery method to device wedged uevent
  drm/xe: Set GT as wedged before sending wedged uevent
  drm/xe: Add a helper function to set recovery method
  drm/xe/xe_survivability: Refactor survivability mode
  drm/xe/xe_survivability: Add support for Runtime survivability mode
  drm/xe/doc: Document device wedged and runtime survivability
  drm/xe: Add support to handle hardware errors
  drm/xe/xe_hw_error: Handle CSC Firmware reported Hardware errors
  drm/xe/xe_hw_error: Add fault injection to trigger csc error handler

 Documentation/gpu/drm-uapi.rst                |   9 +-
 Documentation/gpu/xe/index.rst                |   1 +
 Documentation/gpu/xe/xe_device.rst            |  10 +
 Documentation/gpu/xe/xe_pcode.rst             |   6 +-
 drivers/gpu/drm/drm_drv.c                     |   2 +
 drivers/gpu/drm/xe/Makefile                   |   1 +
 drivers/gpu/drm/xe/regs/xe_gsc_regs.h         |   2 +
 drivers/gpu/drm/xe/regs/xe_hw_error_regs.h    |  20 ++
 drivers/gpu/drm/xe/regs/xe_irq_regs.h         |   1 +
 drivers/gpu/drm/xe/xe_debugfs.c               |   2 +
 drivers/gpu/drm/xe/xe_device.c                |  53 ++++-
 drivers/gpu/drm/xe/xe_device.h                |   1 +
 drivers/gpu/drm/xe/xe_device_types.h          |   5 +
 drivers/gpu/drm/xe/xe_heci_gsc.c              |   2 +-
 drivers/gpu/drm/xe/xe_hw_error.c              | 185 ++++++++++++++++++
 drivers/gpu/drm/xe/xe_hw_error.h              |  15 ++
 drivers/gpu/drm/xe/xe_irq.c                   |   4 +
 drivers/gpu/drm/xe/xe_pci.c                   |   6 +-
 drivers/gpu/drm/xe/xe_survivability_mode.c    | 164 +++++++++++++---
 drivers/gpu/drm/xe/xe_survivability_mode.h    |   5 +-
 .../gpu/drm/xe/xe_survivability_mode_types.h  |   8 +
 include/drm/drm_device.h                      |   4 +
 22 files changed, 454 insertions(+), 52 deletions(-)
 create mode 100644 Documentation/gpu/xe/xe_device.rst
 create mode 100644 drivers/gpu/drm/xe/regs/xe_hw_error_regs.h
 create mode 100644 drivers/gpu/drm/xe/xe_hw_error.c
 create mode 100644 drivers/gpu/drm/xe/xe_hw_error.h

-- 
2.47.1



* [PATCH v4 1/9] drm: Add a vendor-specific recovery method to device wedged uevent
  2025-07-09 11:20 [PATCH v4 0/9] Handle Firmware reported Hardware Errors Riana Tauro
@ 2025-07-09 11:20 ` Riana Tauro
  2025-07-09 13:41   ` Simona Vetter
  2025-07-09 11:20 ` [PATCH v4 2/9] drm/xe: Set GT as wedged before sending " Riana Tauro
                   ` (12 subsequent siblings)
  13 siblings, 1 reply; 48+ messages in thread
From: Riana Tauro @ 2025-07-09 11:20 UTC
  To: intel-xe
  Cc: riana.tauro, anshuman.gupta, rodrigo.vivi, lucas.demarchi,
	aravind.iddamsetty, raag.jadav, umesh.nerlige.ramappa,
	frank.scarbrough, sk.anirban, André Almeida,
	Christian König, David Airlie, dri-devel

Certain errors can cause the device to be wedged and may
require a vendor-specific recovery method to restore normal
operation.

Add a recovery method 'WEDGED=vendor-specific' for such errors. Vendors
must provide additional recovery documentation if this method
is used.
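
As a rough illustration of how a recovery bitmask becomes the
comma-separated WEDGED value, in order of less to more side-effects, here is
a simplified userspace model. It is not the drm_drv.c implementation: the
WEDGE_RECOVERY_* macros and both functions are local stand-ins, and the
handling of unrecognised bits (skipped here) is an assumption of this sketch.

```c
#include <stddef.h>
#include <stdio.h>

/* Local stand-ins mirroring the DRM_WEDGE_RECOVERY_* bits from this patch. */
#define WEDGE_RECOVERY_NONE      (1UL << 0)
#define WEDGE_RECOVERY_REBIND    (1UL << 1)
#define WEDGE_RECOVERY_BUS_RESET (1UL << 2)
#define WEDGE_RECOVERY_VENDOR    (1UL << 3)

/* Map a single recovery bit to its uevent name, or NULL if not known. */
static const char *wedge_recovery_name(unsigned long bit)
{
	switch (bit) {
	case WEDGE_RECOVERY_NONE:
		return "none";
	case WEDGE_RECOVERY_REBIND:
		return "rebind";
	case WEDGE_RECOVERY_BUS_RESET:
		return "bus-reset";
	case WEDGE_RECOVERY_VENDOR:
		return "vendor-specific";
	default:
		return NULL;
	}
}

/* Write the comma-separated method list for @method into @buf (@size > 0). */
static void format_wedged(unsigned long method, char *buf, size_t size)
{
	size_t len = 0;
	unsigned int i;

	buf[0] = '\0';
	/* Walk bits from low to high so methods appear least-invasive first. */
	for (i = 0; i < 8 * sizeof(method); i++) {
		const char *name;
		int n;

		if (!(method & (1UL << i)))
			continue;
		name = wedge_recovery_name(1UL << i);
		if (!name)
			continue; /* assumption: skip bits this sketch does not know */
		n = snprintf(buf + len, size - len, "%s%s", len ? "," : "", name);
		if (n < 0 || (size_t)n >= size - len) {
			buf[len] = '\0'; /* drop an entry that did not fit */
			break;
		}
		len += (size_t)n;
	}

	/* No usable method: report unknown, as the uAPI text describes. */
	if (len == 0)
		snprintf(buf, size, "unknown");
}
```

With this model, WEDGE_RECOVERY_VENDOR alone formats as "vendor-specific",
matching the uevent shown in the cover letter.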

v2: fix documentation (Raag)

Cc: André Almeida <andrealmeid@igalia.com>
Cc: Christian König <christian.koenig@amd.com>
Cc: David Airlie <airlied@gmail.com>
Cc: <dri-devel@lists.freedesktop.org>
Suggested-by: Raag Jadav <raag.jadav@intel.com>
Signed-off-by: Riana Tauro <riana.tauro@intel.com>
---
 Documentation/gpu/drm-uapi.rst | 9 +++++----
 drivers/gpu/drm/drm_drv.c      | 2 ++
 include/drm/drm_device.h       | 4 ++++
 3 files changed, 11 insertions(+), 4 deletions(-)

diff --git a/Documentation/gpu/drm-uapi.rst b/Documentation/gpu/drm-uapi.rst
index 263e5a97c080..c33070bdb347 100644
--- a/Documentation/gpu/drm-uapi.rst
+++ b/Documentation/gpu/drm-uapi.rst
@@ -421,10 +421,10 @@ Recovery
 Current implementation defines three recovery methods, out of which, drivers
 can use any one, multiple or none. Method(s) of choice will be sent in the
 uevent environment as ``WEDGED=<method1>[,..,<methodN>]`` in order of less to
-more side-effects. If driver is unsure about recovery or method is unknown
-(like soft/hard system reboot, firmware flashing, physical device replacement
-or any other procedure which can't be attempted on the fly), ``WEDGED=unknown``
-will be sent instead.
+more side-effects. If the recovery method is specific to the vendor,
+``WEDGED=vendor-specific`` will be sent and userspace should refer to vendor
+specific documentation for further recovery steps. If driver is unsure about
+recovery or method is unknown, ``WEDGED=unknown`` will be sent instead.
 
 Userspace consumers can parse this event and attempt recovery as per the
 following expectations.
@@ -435,6 +435,7 @@ following expectations.
     none            optional telemetry collection
     rebind          unbind + bind driver
     bus-reset       unbind + bus reset/re-enumeration + bind
+    vendor-specific vendor specific recovery method
     unknown         consumer policy
     =============== ========================================
 
diff --git a/drivers/gpu/drm/drm_drv.c b/drivers/gpu/drm/drm_drv.c
index cdd591b11488..0ac723a46a91 100644
--- a/drivers/gpu/drm/drm_drv.c
+++ b/drivers/gpu/drm/drm_drv.c
@@ -532,6 +532,8 @@ static const char *drm_get_wedge_recovery(unsigned int opt)
 		return "rebind";
 	case DRM_WEDGE_RECOVERY_BUS_RESET:
 		return "bus-reset";
+	case DRM_WEDGE_RECOVERY_VENDOR:
+		return "vendor-specific";
 	default:
 		return NULL;
 	}
diff --git a/include/drm/drm_device.h b/include/drm/drm_device.h
index 08b3b2467c4c..08a087f149ff 100644
--- a/include/drm/drm_device.h
+++ b/include/drm/drm_device.h
@@ -26,10 +26,14 @@ struct pci_controller;
  * Recovery methods for wedged device in order of less to more side-effects.
  * To be used with drm_dev_wedged_event() as recovery @method. Callers can
  * use any one, multiple (or'd) or none depending on their needs.
+ *
+ * Refer to "Device Wedging" chapter in Documentation/gpu/drm-uapi.rst for more
+ * details.
  */
 #define DRM_WEDGE_RECOVERY_NONE		BIT(0)	/* optional telemetry collection */
 #define DRM_WEDGE_RECOVERY_REBIND	BIT(1)	/* unbind + bind driver */
 #define DRM_WEDGE_RECOVERY_BUS_RESET	BIT(2)	/* unbind + reset bus device + bind */
+#define DRM_WEDGE_RECOVERY_VENDOR	BIT(3)	/* vendor specific recovery method */
 
 /**
  * struct drm_wedge_task_info - information about the guilty task of a wedge dev
-- 
2.47.1



* [PATCH v4 2/9] drm/xe: Set GT as wedged before sending wedged uevent
  2025-07-09 11:20 [PATCH v4 0/9] Handle Firmware reported Hardware Errors Riana Tauro
  2025-07-09 11:20 ` [PATCH v4 1/9] drm: Add a vendor-specific recovery method to device wedged uevent Riana Tauro
@ 2025-07-09 11:20 ` Riana Tauro
  2025-07-09 17:26   ` Matthew Brost
  2025-07-09 11:20 ` [PATCH v4 3/9] drm/xe: Add a helper function to set recovery method Riana Tauro
                   ` (11 subsequent siblings)
  13 siblings, 1 reply; 48+ messages in thread
From: Riana Tauro @ 2025-07-09 11:20 UTC
  To: intel-xe
  Cc: riana.tauro, anshuman.gupta, rodrigo.vivi, lucas.demarchi,
	aravind.iddamsetty, raag.jadav, umesh.nerlige.ramappa,
	frank.scarbrough, sk.anirban, Matthew Brost

Userspace should be notified only after the device has been set as wedged.
Reorder the function calls so that each GT is declared wedged before the
uevent is sent.

Cc: Matthew Brost <matthew.brost@intel.com>
Suggested-by: Raag Jadav <raag.jadav@intel.com>
Signed-off-by: Riana Tauro <riana.tauro@intel.com>
---
 drivers/gpu/drm/xe/xe_device.c | 12 ++++++++----
 1 file changed, 8 insertions(+), 4 deletions(-)

diff --git a/drivers/gpu/drm/xe/xe_device.c b/drivers/gpu/drm/xe/xe_device.c
index 0b73cb72bad1..8a5bb7b6d09b 100644
--- a/drivers/gpu/drm/xe/xe_device.c
+++ b/drivers/gpu/drm/xe/xe_device.c
@@ -1123,8 +1123,10 @@ static void xe_device_wedged_fini(struct drm_device *drm, void *arg)
  * xe_device_declare_wedged - Declare device wedged
  * @xe: xe device instance
  *
- * This is a final state that can only be cleared with a module
+ * This is a final state that can only be cleared with the recovery method
+ * specified in the drm wedged uevent. The default recovery method is
  * re-probe (unbind + bind).
+ *
  * In this state every IOCTL will be blocked so the GT cannot be used.
  * In general it will be called upon any critical error such as gt reset
  * failure or guc loading failure. Userspace will be notified of this state
@@ -1158,13 +1160,15 @@ void xe_device_declare_wedged(struct xe_device *xe)
 			"IOCTLs and executions are blocked. Only a rebind may clear the failure\n"
 			"Please file a _new_ bug report at https://gitlab.freedesktop.org/drm/xe/kernel/issues/new\n",
 			dev_name(xe->drm.dev));
+	}
+
+	for_each_gt(gt, xe, id)
+		xe_gt_declare_wedged(gt);
 
+	if (xe_device_wedged(xe)) {
 		/* Notify userspace of wedged device */
 		drm_dev_wedged_event(&xe->drm,
 				     DRM_WEDGE_RECOVERY_REBIND | DRM_WEDGE_RECOVERY_BUS_RESET,
 				     NULL);
 	}
-
-	for_each_gt(gt, xe, id)
-		xe_gt_declare_wedged(gt);
 }
-- 
2.47.1



* [PATCH v4 3/9] drm/xe: Add a helper function to set recovery method
  2025-07-09 11:20 [PATCH v4 0/9] Handle Firmware reported Hardware Errors Riana Tauro
  2025-07-09 11:20 ` [PATCH v4 1/9] drm: Add a vendor-specific recovery method to device wedged uevent Riana Tauro
  2025-07-09 11:20 ` [PATCH v4 2/9] drm/xe: Set GT as wedged before sending " Riana Tauro
@ 2025-07-09 11:20 ` Riana Tauro
  2025-07-09 11:20 ` [PATCH v4 4/9] drm/xe/xe_survivability: Refactor survivability mode Riana Tauro
                   ` (10 subsequent siblings)
  13 siblings, 0 replies; 48+ messages in thread
From: Riana Tauro @ 2025-07-09 11:20 UTC
  To: intel-xe
  Cc: riana.tauro, anshuman.gupta, rodrigo.vivi, lucas.demarchi,
	aravind.iddamsetty, raag.jadav, umesh.nerlige.ramappa,
	frank.scarbrough, sk.anirban

Add a helper function to set the recovery method. The recovery
method has to be set before declaring the device wedged and sending the
drm wedged uevent. If no method is set, the default unbind/re-bind method
will be used.
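
The intended call order can be sketched with a small standalone model. The
struct and the model_* functions below are hypothetical stand-ins for the
xe_device fields and helpers touched by this patch, and the `sent` field
stands in for the method passed to drm_dev_wedged_event(); none of this is
driver code.

```c
#include <stdbool.h>

/* Local stand-ins for the DRM_WEDGE_RECOVERY_* bits used by the driver. */
#define MODEL_RECOVERY_REBIND    (1UL << 1)
#define MODEL_RECOVERY_BUS_RESET (1UL << 2)
#define MODEL_RECOVERY_VENDOR    (1UL << 3)

/* Hypothetical model of the wedged state kept per device. */
struct model_device {
	bool wedged;          /* device has been declared wedged */
	unsigned long method; /* recovery method chosen by the caller, 0 if unset */
	unsigned long sent;   /* method that would be reported in the uevent */
};

/* Record the recovery method; must be called before declaring wedged. */
static void model_set_wedged_method(struct model_device *dev,
				    unsigned long method)
{
	dev->method = method;
}

/* Declare the device wedged and "send" the event with the chosen method. */
static void model_declare_wedged(struct model_device *dev)
{
	dev->wedged = true;

	/* If no recovery method was set, fall back to the default. */
	if (!dev->method)
		model_set_wedged_method(dev, MODEL_RECOVERY_REBIND |
					MODEL_RECOVERY_BUS_RESET);

	/* Stand-in for notifying userspace with the stored method. */
	dev->sent = dev->method;
}
```

A caller that needs a non-default method sets it first, for example
model_set_wedged_method(&dev, MODEL_RECOVERY_VENDOR) followed by
model_declare_wedged(&dev); a caller that sets nothing gets the default.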

Signed-off-by: Riana Tauro <riana.tauro@intel.com>
---
 drivers/gpu/drm/xe/xe_device.c       | 26 +++++++++++++++++++++-----
 drivers/gpu/drm/xe/xe_device.h       |  1 +
 drivers/gpu/drm/xe/xe_device_types.h |  2 ++
 3 files changed, 24 insertions(+), 5 deletions(-)

diff --git a/drivers/gpu/drm/xe/xe_device.c b/drivers/gpu/drm/xe/xe_device.c
index 8a5bb7b6d09b..70e8bd4d5fd3 100644
--- a/drivers/gpu/drm/xe/xe_device.c
+++ b/drivers/gpu/drm/xe/xe_device.c
@@ -1119,13 +1119,26 @@ static void xe_device_wedged_fini(struct drm_device *drm, void *arg)
 	xe_pm_runtime_put(xe);
 }
 
+/**
+ * xe_device_set_wedged_method - Set wedged recovery method
+ * @xe: xe device instance
+ * @method: recovery method to set
+ *
+ * Set wedged recovery method to be sent in drm wedged uevent.
+ */
+void xe_device_set_wedged_method(struct xe_device *xe, unsigned long method)
+{
+	xe->wedged.method = method;
+}
+
 /**
  * xe_device_declare_wedged - Declare device wedged
  * @xe: xe device instance
  *
  * This is a final state that can only be cleared with the recovery method
- * specified in the drm wedged uevent. The default recovery method is
- * re-probe (unbind + bind).
+ * specified in the drm wedged uevent. The method needs to be set using
+ * xe_device_set_wedged_method before declaring the device as wedged or the
+ * default method of reprobe (unbind/re-bind) will be sent
  *
  * In this state every IOCTL will be blocked so the GT cannot be used.
  * In general it will be called upon any critical error such as gt reset
@@ -1166,9 +1179,12 @@ void xe_device_declare_wedged(struct xe_device *xe)
 		xe_gt_declare_wedged(gt);
 
 	if (xe_device_wedged(xe)) {
+		/* If no wedge recovery method is set, use default */
+		if (!xe->wedged.method)
+			xe_device_set_wedged_method(xe, DRM_WEDGE_RECOVERY_REBIND |
+						    DRM_WEDGE_RECOVERY_BUS_RESET);
+
 		/* Notify userspace of wedged device */
-		drm_dev_wedged_event(&xe->drm,
-				     DRM_WEDGE_RECOVERY_REBIND | DRM_WEDGE_RECOVERY_BUS_RESET,
-				     NULL);
+		drm_dev_wedged_event(&xe->drm, xe->wedged.method, NULL);
 	}
 }
diff --git a/drivers/gpu/drm/xe/xe_device.h b/drivers/gpu/drm/xe/xe_device.h
index f0eb8150f185..c2a67db3ab72 100644
--- a/drivers/gpu/drm/xe/xe_device.h
+++ b/drivers/gpu/drm/xe/xe_device.h
@@ -183,6 +183,7 @@ static inline bool xe_device_wedged(struct xe_device *xe)
 	return atomic_read(&xe->wedged.flag);
 }
 
+void xe_device_set_wedged_method(struct xe_device *xe, unsigned long method);
 void xe_device_declare_wedged(struct xe_device *xe);
 
 struct xe_file *xe_file_get(struct xe_file *xef);
diff --git a/drivers/gpu/drm/xe/xe_device_types.h b/drivers/gpu/drm/xe/xe_device_types.h
index 78c4acafd268..ca300338e8c2 100644
--- a/drivers/gpu/drm/xe/xe_device_types.h
+++ b/drivers/gpu/drm/xe/xe_device_types.h
@@ -572,6 +572,8 @@ struct xe_device {
 		atomic_t flag;
 		/** @wedged.mode: Mode controlled by kernel parameter and debugfs */
 		int mode;
+		/** @wedged.method: Recovery method to be sent in the drm device wedged uevent */
+		unsigned long method;
 	} wedged;
 
 	/** @bo_device: Struct to control async free of BOs */
-- 
2.47.1



* [PATCH v4 4/9] drm/xe/xe_survivability: Refactor survivability mode
  2025-07-09 11:20 [PATCH v4 0/9] Handle Firmware reported Hardware Errors Riana Tauro
                   ` (2 preceding siblings ...)
  2025-07-09 11:20 ` [PATCH v4 3/9] drm/xe: Add a helper function to set recovery method Riana Tauro
@ 2025-07-09 11:20 ` Riana Tauro
  2025-07-09 11:20 ` [PATCH v4 5/9] drm/xe/xe_survivability: Add support for Runtime " Riana Tauro
                   ` (9 subsequent siblings)
  13 siblings, 0 replies; 48+ messages in thread
From: Riana Tauro @ 2025-07-09 11:20 UTC
  To: intel-xe
  Cc: riana.tauro, anshuman.gupta, rodrigo.vivi, lucas.demarchi,
	aravind.iddamsetty, raag.jadav, umesh.nerlige.ramappa,
	frank.scarbrough, sk.anirban

The patches in this series refactor the boot survivability code to
allow adding runtime survivability.
Refactor the existing code to separate the two modes.

This patch renames the functions and separates init and enable.

Signed-off-by: Riana Tauro <riana.tauro@intel.com>
---
 drivers/gpu/drm/xe/xe_device.c                |  2 +-
 drivers/gpu/drm/xe/xe_heci_gsc.c              |  2 +-
 drivers/gpu/drm/xe/xe_pci.c                   |  6 +-
 drivers/gpu/drm/xe/xe_survivability_mode.c    | 93 +++++++++++++------
 drivers/gpu/drm/xe/xe_survivability_mode.h    |  4 +-
 .../gpu/drm/xe/xe_survivability_mode_types.h  |  7 ++
 6 files changed, 81 insertions(+), 33 deletions(-)

diff --git a/drivers/gpu/drm/xe/xe_device.c b/drivers/gpu/drm/xe/xe_device.c
index 70e8bd4d5fd3..f5958137435a 100644
--- a/drivers/gpu/drm/xe/xe_device.c
+++ b/drivers/gpu/drm/xe/xe_device.c
@@ -716,7 +716,7 @@ int xe_device_probe_early(struct xe_device *xe)
 		 * possible, but still return the previous error for error
 		 * propagation
 		 */
-		err = xe_survivability_mode_enable(xe);
+		err = xe_survivability_mode_boot_enable(xe);
 		if (err)
 			return err;
 
diff --git a/drivers/gpu/drm/xe/xe_heci_gsc.c b/drivers/gpu/drm/xe/xe_heci_gsc.c
index 6d7b62724126..a415ca488791 100644
--- a/drivers/gpu/drm/xe/xe_heci_gsc.c
+++ b/drivers/gpu/drm/xe/xe_heci_gsc.c
@@ -197,7 +197,7 @@ int xe_heci_gsc_init(struct xe_device *xe)
 	if (ret)
 		return ret;
 
-	if (!def->use_polling && !xe_survivability_mode_is_enabled(xe)) {
+	if (!def->use_polling && !xe_survivability_mode_is_boot_enabled(xe)) {
 		ret = heci_gsc_irq_setup(xe);
 		if (ret)
 			return ret;
diff --git a/drivers/gpu/drm/xe/xe_pci.c b/drivers/gpu/drm/xe/xe_pci.c
index ffd6ad569b7c..9e9e6abca6a9 100644
--- a/drivers/gpu/drm/xe/xe_pci.c
+++ b/drivers/gpu/drm/xe/xe_pci.c
@@ -730,7 +730,7 @@ static void xe_pci_remove(struct pci_dev *pdev)
 	if (IS_SRIOV_PF(xe))
 		xe_pci_sriov_configure(pdev, 0);
 
-	if (xe_survivability_mode_is_enabled(xe))
+	if (xe_survivability_mode_is_boot_enabled(xe))
 		return;
 
 	xe_device_remove(xe);
@@ -810,7 +810,7 @@ static int xe_pci_probe(struct pci_dev *pdev, const struct pci_device_id *ent)
 	 * flashed through mei. Return success, if survivability mode
 	 * is enabled due to pcode failure or configfs being set
 	 */
-	if (xe_survivability_mode_is_enabled(xe))
+	if (xe_survivability_mode_is_boot_enabled(xe))
 		return 0;
 
 	if (err)
@@ -904,7 +904,7 @@ static int xe_pci_suspend(struct device *dev)
 	struct xe_device *xe = pdev_to_xe_device(pdev);
 	int err;
 
-	if (xe_survivability_mode_is_enabled(xe))
+	if (xe_survivability_mode_is_boot_enabled(xe))
 		return -EBUSY;
 
 	err = xe_pm_suspend(xe);
diff --git a/drivers/gpu/drm/xe/xe_survivability_mode.c b/drivers/gpu/drm/xe/xe_survivability_mode.c
index 1f710b3fc599..fefb027b1c84 100644
--- a/drivers/gpu/drm/xe/xe_survivability_mode.c
+++ b/drivers/gpu/drm/xe/xe_survivability_mode.c
@@ -120,6 +120,14 @@ static void log_survivability_info(struct pci_dev *pdev)
 	}
 }
 
+static int check_boot_failure(struct xe_device *xe)
+{
+	struct xe_survivability *survivability = &xe->survivability;
+
+	return survivability->boot_status == NON_CRITICAL_FAILURE ||
+		survivability->boot_status == CRITICAL_FAILURE;
+}
+
 static ssize_t survivability_mode_show(struct device *dev,
 				       struct device_attribute *attr, char *buff)
 {
@@ -129,6 +137,11 @@ static ssize_t survivability_mode_show(struct device *dev,
 	struct xe_survivability_info *info = survivability->info;
 	int index = 0, count = 0;
 
+	count += sysfs_emit_at(buff, count, "Survivability mode type: Boot\n");
+
+	if (!check_boot_failure(xe))
+		return count;
+
 	for (index = 0; index < MAX_SCRATCH_MMIO; index++) {
 		if (info[index].reg)
 			count += sysfs_emit_at(buff, count, "%s: 0x%x - 0x%x\n", info[index].name,
@@ -150,12 +163,11 @@ static void xe_survivability_mode_fini(void *arg)
 	sysfs_remove_file(&dev->kobj, &dev_attr_survivability_mode.attr);
 }
 
-static int enable_survivability_mode(struct pci_dev *pdev)
+static int create_survivability_sysfs(struct pci_dev *pdev)
 {
 	struct device *dev = &pdev->dev;
 	struct xe_device *xe = pdev_to_xe_device(pdev);
-	struct xe_survivability *survivability = &xe->survivability;
-	int ret = 0;
+	int ret;
 
 	/* create survivability mode sysfs */
 	ret = sysfs_create_file(&dev->kobj, &dev_attr_survivability_mode.attr);
@@ -169,6 +181,20 @@ static int enable_survivability_mode(struct pci_dev *pdev)
 	if (ret)
 		return ret;
 
+	return 0;
+}
+
+static int enable_boot_survivability_mode(struct pci_dev *pdev)
+{
+	struct device *dev = &pdev->dev;
+	struct xe_device *xe = pdev_to_xe_device(pdev);
+	struct xe_survivability *survivability = &xe->survivability;
+	int ret = 0;
+
+	ret = create_survivability_sysfs(pdev);
+	if (ret)
+		return ret;
+
 	/* Make sure xe_heci_gsc_init() knows about survivability mode */
 	survivability->mode = true;
 
@@ -189,15 +215,36 @@ static int enable_survivability_mode(struct pci_dev *pdev)
 	return 0;
 }
 
+static int init_survivability_mode(struct xe_device *xe)
+{
+	struct xe_survivability *survivability = &xe->survivability;
+	struct xe_survivability_info *info;
+
+	survivability->size = MAX_SCRATCH_MMIO;
+
+	info = devm_kcalloc(xe->drm.dev, survivability->size, sizeof(*info),
+			    GFP_KERNEL);
+	if (!info)
+		return -ENOMEM;
+
+	survivability->info = info;
+
+	populate_survivability_info(xe);
+
+	return 0;
+}
+
 /**
- * xe_survivability_mode_is_enabled - check if survivability mode is enabled
+ * xe_survivability_mode_is_boot_enabled - check if boot survivability mode is enabled
  * @xe: xe device instance
  *
- * Returns true if in survivability mode, false otherwise
+ * Returns true if in boot survivability mode, else false
  */
-bool xe_survivability_mode_is_enabled(struct xe_device *xe)
+bool xe_survivability_mode_is_boot_enabled(struct xe_device *xe)
 {
-	return xe->survivability.mode;
+	struct xe_survivability *survivability = &xe->survivability;
+
+	return survivability->mode && survivability->type == XE_SURVIVABILITY_TYPE_BOOT;
 }
 
 /**
@@ -238,44 +285,38 @@ bool xe_survivability_mode_is_requested(struct xe_device *xe)
 	data = xe_mmio_read32(mmio, PCODE_SCRATCH(0));
 	survivability->boot_status = REG_FIELD_GET(BOOT_STATUS, data);
 
-	return survivability->boot_status == NON_CRITICAL_FAILURE ||
-		survivability->boot_status == CRITICAL_FAILURE;
+	return check_boot_failure(xe);
 }
 
 /**
- * xe_survivability_mode_enable - Initialize and enable the survivability mode
+ * xe_survivability_mode_boot_enable - Initialize and enable boot survivability mode
  * @xe: xe device instance
  *
- * Initialize survivability information and enable survivability mode
+ * Initialize survivability information and enable boot survivability mode
  *
- * Return: 0 if survivability mode is enabled or not requested; negative error
+ * Return: 0 if boot survivability mode is enabled or not requested, negative error
  * code otherwise.
  */
-int xe_survivability_mode_enable(struct xe_device *xe)
+int xe_survivability_mode_boot_enable(struct xe_device *xe)
 {
 	struct xe_survivability *survivability = &xe->survivability;
-	struct xe_survivability_info *info;
 	struct pci_dev *pdev = to_pci_dev(xe->drm.dev);
+	int ret;
 
 	if (!xe_survivability_mode_is_requested(xe))
 		return 0;
 
-	survivability->size = MAX_SCRATCH_MMIO;
-
-	info = devm_kcalloc(xe->drm.dev, survivability->size, sizeof(*info),
-			    GFP_KERNEL);
-	if (!info)
-		return -ENOMEM;
-
-	survivability->info = info;
-
-	populate_survivability_info(xe);
+	ret = init_survivability_mode(xe);
+	if (ret)
+		return ret;
 
-	/* Only log debug information and exit if it is a critical failure */
+	/* Log breadcrumbs but do not enter survivability mode for Critical boot errors */
 	if (survivability->boot_status == CRITICAL_FAILURE) {
 		log_survivability_info(pdev);
 		return -ENXIO;
 	}
 
-	return enable_survivability_mode(pdev);
+	survivability->type = XE_SURVIVABILITY_TYPE_BOOT;
+
+	return enable_boot_survivability_mode(pdev);
 }
diff --git a/drivers/gpu/drm/xe/xe_survivability_mode.h b/drivers/gpu/drm/xe/xe_survivability_mode.h
index 02231c2bf008..f6ee283ea5e8 100644
--- a/drivers/gpu/drm/xe/xe_survivability_mode.h
+++ b/drivers/gpu/drm/xe/xe_survivability_mode.h
@@ -10,8 +10,8 @@
 
 struct xe_device;
 
-int xe_survivability_mode_enable(struct xe_device *xe);
-bool xe_survivability_mode_is_enabled(struct xe_device *xe);
+int xe_survivability_mode_boot_enable(struct xe_device *xe);
+bool xe_survivability_mode_is_boot_enabled(struct xe_device *xe);
 bool xe_survivability_mode_is_requested(struct xe_device *xe);
 
 #endif /* _XE_SURVIVABILITY_MODE_H_ */
diff --git a/drivers/gpu/drm/xe/xe_survivability_mode_types.h b/drivers/gpu/drm/xe/xe_survivability_mode_types.h
index 19d433e253df..5dce393498da 100644
--- a/drivers/gpu/drm/xe/xe_survivability_mode_types.h
+++ b/drivers/gpu/drm/xe/xe_survivability_mode_types.h
@@ -9,6 +9,10 @@
 #include <linux/limits.h>
 #include <linux/types.h>
 
+enum xe_survivability_type {
+	XE_SURVIVABILITY_TYPE_BOOT,
+};
+
 struct xe_survivability_info {
 	char name[NAME_MAX];
 	u32 reg;
@@ -30,6 +34,9 @@ struct xe_survivability {
 
 	/** @mode: boolean to indicate survivability mode */
 	bool mode;
+
+	/** @type: survivability type */
+	enum xe_survivability_type type;
 };
 
 #endif /* _XE_SURVIVABILITY_MODE_TYPES_H_ */
-- 
2.47.1



* [PATCH v4 5/9] drm/xe/xe_survivability: Add support for Runtime survivability mode
  2025-07-09 11:20 [PATCH v4 0/9] Handle Firmware reported Hardware Errors Riana Tauro
                   ` (3 preceding siblings ...)
  2025-07-09 11:20 ` [PATCH v4 4/9] drm/xe/xe_survivability: Refactor survivability mode Riana Tauro
@ 2025-07-09 11:20 ` Riana Tauro
  2025-07-09 23:44   ` Umesh Nerlige Ramappa
  2025-07-09 11:20 ` [PATCH v4 6/9] drm/xe/doc: Document device wedged and runtime survivability Riana Tauro
                   ` (8 subsequent siblings)
  13 siblings, 1 reply; 48+ messages in thread
From: Riana Tauro @ 2025-07-09 11:20 UTC
  To: intel-xe
  Cc: riana.tauro, anshuman.gupta, rodrigo.vivi, lucas.demarchi,
	aravind.iddamsetty, raag.jadav, umesh.nerlige.ramappa,
	frank.scarbrough, sk.anirban

Certain runtime firmware errors can leave the device in an unusable
state that requires a firmware flash to restore normal operation.
Runtime Survivability Mode indicates that a firmware flash is necessary by
wedging the device and exposing the survivability mode sysfs.

The presence of the sysfs file below indicates that the device is in
survivability mode:

/sys/bus/pci/devices/<device>/survivability_mode
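
A userspace tool could test for that file before attempting a firmware
flash. The following is a minimal sketch under that assumption; the function
names are hypothetical, and the PCI address passed in (for example
"0000:03:00.0") must come from the caller.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdio.h>
#include <unistd.h>

/* Build the sysfs path for @bdf; returns 0 on success, -1 if @buf is too small. */
static int survivability_mode_path(const char *bdf, char *buf, size_t size)
{
	int n = snprintf(buf, size,
			 "/sys/bus/pci/devices/%s/survivability_mode", bdf);

	return (n < 0 || (size_t)n >= size) ? -1 : 0;
}

/* Report whether the survivability_mode attribute exists for @bdf. */
static bool in_survivability_mode(const char *bdf)
{
	char path[256];

	if (survivability_mode_path(bdf, path, sizeof(path)))
		return false;

	/* The attribute is only created when survivability mode is enabled. */
	return access(path, F_OK) == 0;
}
```

Reading the file afterwards shows the "Survivability mode type" line emitted
by survivability_mode_show(), which distinguishes Boot from Runtime.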

Signed-off-by: Riana Tauro <riana.tauro@intel.com>
---
 drivers/gpu/drm/xe/xe_survivability_mode.c    | 42 ++++++++++++++++++-
 drivers/gpu/drm/xe/xe_survivability_mode.h    |  1 +
 .../gpu/drm/xe/xe_survivability_mode_types.h  |  1 +
 3 files changed, 43 insertions(+), 1 deletion(-)

diff --git a/drivers/gpu/drm/xe/xe_survivability_mode.c b/drivers/gpu/drm/xe/xe_survivability_mode.c
index fefb027b1c84..ca1cfa13525a 100644
--- a/drivers/gpu/drm/xe/xe_survivability_mode.c
+++ b/drivers/gpu/drm/xe/xe_survivability_mode.c
@@ -137,7 +137,8 @@ static ssize_t survivability_mode_show(struct device *dev,
 	struct xe_survivability_info *info = survivability->info;
 	int index = 0, count = 0;
 
-	count += sysfs_emit_at(buff, count, "Survivability mode type: Boot\n");
+	count += sysfs_emit_at(buff, count, "Survivability mode type: %s\n",
+			       survivability->type ? "Runtime" : "Boot");
 
 	if (!check_boot_failure(xe))
 		return count;
@@ -288,6 +289,45 @@ bool xe_survivability_mode_is_requested(struct xe_device *xe)
 	return check_boot_failure(xe);
 }
 
+/**
+ * xe_survivability_mode_runtime_enable - Initialize and enable runtime survivability mode
+ * @xe: xe device instance
+ *
+ * Initialize survivability information and enable runtime survivability mode.
+ * Runtime survivability mode is enabled when certain errors cause the device to be
+ * in non-recoverable state. The device is declared wedged with the appropriate
+ * recovery method and survivability mode sysfs exposed to userspace
+ *
+ * Return: 0 if runtime survivability mode is enabled or not requested, negative error
+ * code otherwise.
+ */
+int xe_survivability_mode_runtime_enable(struct xe_device *xe)
+{
+	struct xe_survivability *survivability = &xe->survivability;
+	struct pci_dev *pdev = to_pci_dev(xe->drm.dev);
+	int ret;
+
+	if (!IS_DGFX(xe) || IS_SRIOV_VF(xe) || xe->info.platform < XE_BATTLEMAGE) {
+		dev_err(&pdev->dev, "Runtime Survivability Mode not supported\n");
+		return -EINVAL;
+	}
+
+	ret = init_survivability_mode(xe);
+	if (ret)
+		return ret;
+
+	ret = create_survivability_sysfs(pdev);
+	if (ret)
+		dev_err(&pdev->dev, "Failed to create survivability mode sysfs\n");
+
+	survivability->type = XE_SURVIVABILITY_TYPE_RUNTIME;
+	xe_device_set_wedged_method(xe, DRM_WEDGE_RECOVERY_VENDOR);
+	xe_device_declare_wedged(xe);
+
+	dev_err(&pdev->dev, "Runtime Survivability mode enabled\n");
+	return 0;
+}
+
 /**
  * xe_survivability_mode_boot_enable - Initialize and enable boot survivability mode
  * @xe: xe device instance
diff --git a/drivers/gpu/drm/xe/xe_survivability_mode.h b/drivers/gpu/drm/xe/xe_survivability_mode.h
index f6ee283ea5e8..1cc94226aa82 100644
--- a/drivers/gpu/drm/xe/xe_survivability_mode.h
+++ b/drivers/gpu/drm/xe/xe_survivability_mode.h
@@ -11,6 +11,7 @@
 struct xe_device;
 
 int xe_survivability_mode_boot_enable(struct xe_device *xe);
+int xe_survivability_mode_runtime_enable(struct xe_device *xe);
 bool xe_survivability_mode_is_boot_enabled(struct xe_device *xe);
 bool xe_survivability_mode_is_requested(struct xe_device *xe);
 
diff --git a/drivers/gpu/drm/xe/xe_survivability_mode_types.h b/drivers/gpu/drm/xe/xe_survivability_mode_types.h
index 5dce393498da..cd65a5d167c9 100644
--- a/drivers/gpu/drm/xe/xe_survivability_mode_types.h
+++ b/drivers/gpu/drm/xe/xe_survivability_mode_types.h
@@ -11,6 +11,7 @@
 
 enum xe_survivability_type {
 	XE_SURVIVABILITY_TYPE_BOOT,
+	XE_SURVIVABILITY_TYPE_RUNTIME,
 };
 
 struct xe_survivability_info {
-- 
2.47.1



* [PATCH v4 6/9] drm/xe/doc: Document device wedged and runtime survivability
  2025-07-09 11:20 [PATCH v4 0/9] Handle Firmware reported Hardware Errors Riana Tauro
                   ` (4 preceding siblings ...)
  2025-07-09 11:20 ` [PATCH v4 5/9] drm/xe/xe_survivability: Add support for Runtime " Riana Tauro
@ 2025-07-09 11:20 ` Riana Tauro
  2025-07-11  5:39   ` Raag Jadav
  2025-07-09 11:20 ` [PATCH v4 7/9] drm/xe: Add support to handle hardware errors Riana Tauro
                   ` (7 subsequent siblings)
  13 siblings, 1 reply; 48+ messages in thread
From: Riana Tauro @ 2025-07-09 11:20 UTC
  To: intel-xe
  Cc: riana.tauro, anshuman.gupta, rodrigo.vivi, lucas.demarchi,
	aravind.iddamsetty, raag.jadav, umesh.nerlige.ramappa,
	frank.scarbrough, sk.anirban

Add documentation for the vendor-specific device wedged recovery method
and runtime survivability.

v2: fix documentation (Raag)

Signed-off-by: Riana Tauro <riana.tauro@intel.com>
---
 Documentation/gpu/xe/index.rst             |  1 +
 Documentation/gpu/xe/xe_device.rst         | 10 +++++++
 Documentation/gpu/xe/xe_pcode.rst          |  6 ++--
 drivers/gpu/drm/xe/xe_device.c             | 17 +++++++++++
 drivers/gpu/drm/xe/xe_survivability_mode.c | 33 +++++++++++++++++-----
 5 files changed, 58 insertions(+), 9 deletions(-)
 create mode 100644 Documentation/gpu/xe/xe_device.rst

diff --git a/Documentation/gpu/xe/index.rst b/Documentation/gpu/xe/index.rst
index 42ba6c263cd0..88b22fad880e 100644
--- a/Documentation/gpu/xe/index.rst
+++ b/Documentation/gpu/xe/index.rst
@@ -25,5 +25,6 @@ DG2, etc is provided to prototype the driver.
    xe_tile
    xe_debugging
    xe_devcoredump
+   xe_device
    xe-drm-usage-stats.rst
    xe_configfs
diff --git a/Documentation/gpu/xe/xe_device.rst b/Documentation/gpu/xe/xe_device.rst
new file mode 100644
index 000000000000..f9b962169919
--- /dev/null
+++ b/Documentation/gpu/xe/xe_device.rst
@@ -0,0 +1,10 @@
+.. SPDX-License-Identifier: (GPL-2.0+ OR MIT)
+
+.. _xe-device-wedging:
+
+==================
+Xe Device Wedging
+==================
+
+.. kernel-doc:: drivers/gpu/drm/xe/xe_device.c
+   :doc: Device Wedging
diff --git a/Documentation/gpu/xe/xe_pcode.rst b/Documentation/gpu/xe/xe_pcode.rst
index 5937ef3599b0..2a43601123cb 100644
--- a/Documentation/gpu/xe/xe_pcode.rst
+++ b/Documentation/gpu/xe/xe_pcode.rst
@@ -13,9 +13,11 @@ Internal API
 .. kernel-doc:: drivers/gpu/drm/xe/xe_pcode.c
    :internal:
 
+.. _xe-survivability-mode:
+
 ==================
-Boot Survivability
+Survivability Mode
 ==================
 
 .. kernel-doc:: drivers/gpu/drm/xe/xe_survivability_mode.c
-   :doc: Xe Boot Survivability
+   :doc: Survivability Mode
diff --git a/drivers/gpu/drm/xe/xe_device.c b/drivers/gpu/drm/xe/xe_device.c
index f5958137435a..b3aa203fc5d2 100644
--- a/drivers/gpu/drm/xe/xe_device.c
+++ b/drivers/gpu/drm/xe/xe_device.c
@@ -1120,6 +1120,23 @@ static void xe_device_wedged_fini(struct drm_device *drm, void *arg)
 }
 
 /**
+ * DOC: Device Wedging
+ *
+ * The Xe driver uses the device wedged uevent documented in Documentation/gpu/drm-uapi.rst.
+ *
+ * When the device is in a wedged state, every IOCTL is blocked and the GT cannot be
+ * used. Certain critical errors, such as GT reset failures and firmware failures, can cause
+ * the device to be wedged. The default recovery mechanism for a wedged state
+ * is re-probe (unbind + bind).
+ *
+ * However, CSC firmware errors require a firmware flash to restore normal device
+ * operation. Since firmware flash is a vendor-specific action, the ``WEDGED=vendor-specific``
+ * recovery method along with :ref:`runtime survivability mode <xe-survivability-mode>`
+ * is used to notify userspace. The user can then initiate a firmware flash to restore
+ * the device to normal operation.
+ */
+
+/*
  * xe_device_set_wedged_method - Set wedged recovery method
  * @xe: xe device instance
  * @method: recovery method to set
diff --git a/drivers/gpu/drm/xe/xe_survivability_mode.c b/drivers/gpu/drm/xe/xe_survivability_mode.c
index ca1cfa13525a..043f29dc3caf 100644
--- a/drivers/gpu/drm/xe/xe_survivability_mode.c
+++ b/drivers/gpu/drm/xe/xe_survivability_mode.c
@@ -21,15 +21,18 @@
 #define MAX_SCRATCH_MMIO 8
 
 /**
- * DOC: Xe Boot Survivability
+ * DOC: Survivability Mode
  *
- * Boot Survivability is a software based workflow for recovering a system in a failed boot state
+ * Survivability Mode is a software based workflow for recovering a system in a failed boot state
  * Here system recoverability is concerned with recovering the firmware responsible for boot.
  *
- * This is implemented by loading the driver with bare minimum (no drm card) to allow the firmware
- * to be flashed through mei and collect telemetry. The driver's probe flow is modified
- * such that it enters survivability mode when pcode initialization is incomplete and boot status
- * denotes a failure.
+ * Boot Survivability
+ * ===================
+ *
+ * Boot Survivability is implemented by loading the driver with bare minimum (no drm card) to allow
+ * the firmware to be flashed through mei and collect telemetry. The driver's probe flow is
+ * modified such that it enters survivability mode when pcode initialization is incomplete and boot
+ * status denotes a failure.
  *
  * Survivability mode can also be entered manually using the survivability mode attribute available
  * through configfs which is beneficial in several usecases. It can be used to address scenarios
@@ -45,7 +48,7 @@
  * Survivability mode is indicated by the below admin-only readable sysfs which provides additional
  * debug information::
  *
- *	/sys/bus/pci/devices/<device>/surivability_mode
+ *	/sys/bus/pci/devices/<device>/survivability_mode
  *
  * Capability Information:
  *	Provides boot status
@@ -55,6 +58,22 @@
  *	Provides history of previous failures
  * Auxiliary Information
  *	Certain failures may have information in addition to postcode information
+ *
+ * Runtime Survivability
+ * =====================
+ *
+ * Certain runtime firmware errors can cause the device to enter a wedged state
+ * (:ref:`xe-device-wedging`) requiring a firmware flash to restore normal operation.
+ * Runtime Survivability Mode indicates that a firmware flash is necessary to recover the device and
+ * is indicated by the presence of survivability mode sysfs::
+ *
+ *	/sys/bus/pci/devices/<device>/survivability_mode
+ *
+ * Survivability mode sysfs provides information about the type of survivability mode.
+ *
+ * When such errors occur, userspace is notified with the drm device wedged uevent and runtime
+ * survivability mode. The user can then initiate a firmware flash to restore the device to normal
+ * operation.
  */
 
 static u32 aux_history_offset(u32 reg_value)
-- 
2.47.1
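
For readers following the documentation above, here is a minimal userspace
sketch of how a uevent listener might check the WEDGED property for the
vendor-specific method. It assumes the property value is a comma-separated
list of recovery methods, as described in Documentation/gpu/drm-uapi.rst.
The helper name wedged_has_method() is hypothetical and not part of this
series or of any library.

```c
#include <stdbool.h>
#include <string.h>

/*
 * Hypothetical helper: return true if `method` appears in the
 * comma-separated value of a "WEDGED=..." uevent property string.
 * Any other property (or a missing key) yields false.
 */
bool wedged_has_method(const char *prop, const char *method)
{
	static const char key[] = "WEDGED=";
	size_t mlen = strlen(method);
	const char *p;

	if (strncmp(prop, key, sizeof(key) - 1) != 0)
		return false;

	p = prop + sizeof(key) - 1;
	while (*p) {
		/* Length of the current list entry, up to ',' or end of string */
		size_t len = strcspn(p, ",");

		if (len == mlen && strncmp(p, method, len) == 0)
			return true;

		p += len;
		if (*p == ',')
			p++;
	}

	return false;
}
```

A listener that sees wedged_has_method(prop, "vendor-specific") return true
for an xe device could then check for the survivability_mode sysfs file and
start its firmware flash flow instead of attempting a rebind.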


^ permalink raw reply related	[flat|nested] 48+ messages in thread

* [PATCH v4 7/9] drm/xe: Add support to handle hardware errors
  2025-07-09 11:20 [PATCH v4 0/9] Handle Firmware reported Hardware Errors Riana Tauro
                   ` (5 preceding siblings ...)
  2025-07-09 11:20 ` [PATCH v4 6/9] drm/xe/doc: Document device wedged and runtime survivability Riana Tauro
@ 2025-07-09 11:20 ` Riana Tauro
  2025-07-10 21:09   ` Umesh Nerlige Ramappa
  2025-07-09 11:20 ` [PATCH v4 8/9] drm/xe/xe_hw_error: Handle CSC Firmware reported Hardware errors Riana Tauro
                   ` (6 subsequent siblings)
  13 siblings, 1 reply; 48+ messages in thread
From: Riana Tauro @ 2025-07-09 11:20 UTC (permalink / raw)
  To: intel-xe
  Cc: riana.tauro, anshuman.gupta, rodrigo.vivi, lucas.demarchi,
	aravind.iddamsetty, raag.jadav, umesh.nerlige.ramappa,
	frank.scarbrough, sk.anirban, Himal Prasad Ghimiray

The Gfx device reports two classes of errors: uncorrectable and
correctable. Depending on severity, uncorrectable errors are
further classified as non-fatal or fatal.

Correctable and non-fatal errors are reported as MSIs, and bits in
the Master Interrupt Register indicate the class of the error.
The source of the error is then read from the Device Error Source
Register. Fatal errors are reported as PCIe errors.
When a PCIe error is asserted, the OS performs a device warm reset,
which causes the driver to reload. The error registers are sticky,
and their values are maintained through a warm reset.

Add basic support to handle these errors.

Bspec: 50875, 53073, 53074, 53075, 53076

Co-developed-by: Himal Prasad Ghimiray <himal.prasad.ghimiray@intel.com>
Signed-off-by: Himal Prasad Ghimiray <himal.prasad.ghimiray@intel.com>
Signed-off-by: Riana Tauro <riana.tauro@intel.com>
---
 drivers/gpu/drm/xe/Makefile                |   1 +
 drivers/gpu/drm/xe/regs/xe_hw_error_regs.h |  15 +++
 drivers/gpu/drm/xe/regs/xe_irq_regs.h      |   1 +
 drivers/gpu/drm/xe/xe_hw_error.c           | 108 +++++++++++++++++++++
 drivers/gpu/drm/xe/xe_hw_error.h           |  15 +++
 drivers/gpu/drm/xe/xe_irq.c                |   4 +
 6 files changed, 144 insertions(+)
 create mode 100644 drivers/gpu/drm/xe/regs/xe_hw_error_regs.h
 create mode 100644 drivers/gpu/drm/xe/xe_hw_error.c
 create mode 100644 drivers/gpu/drm/xe/xe_hw_error.h

diff --git a/drivers/gpu/drm/xe/Makefile b/drivers/gpu/drm/xe/Makefile
index 1d97e5b63f4e..fea8ee3b0785 100644
--- a/drivers/gpu/drm/xe/Makefile
+++ b/drivers/gpu/drm/xe/Makefile
@@ -73,6 +73,7 @@ xe-y += xe_bb.o \
 	xe_hw_engine.o \
 	xe_hw_engine_class_sysfs.o \
 	xe_hw_engine_group.o \
+	xe_hw_error.o \
 	xe_hw_fence.o \
 	xe_irq.o \
 	xe_lrc.o \
diff --git a/drivers/gpu/drm/xe/regs/xe_hw_error_regs.h b/drivers/gpu/drm/xe/regs/xe_hw_error_regs.h
new file mode 100644
index 000000000000..ed9b81fb28a0
--- /dev/null
+++ b/drivers/gpu/drm/xe/regs/xe_hw_error_regs.h
@@ -0,0 +1,15 @@
+/* SPDX-License-Identifier: MIT */
+/*
+ * Copyright © 2025 Intel Corporation
+ */
+
+#ifndef _XE_HW_ERROR_REGS_H_
+#define _XE_HW_ERROR_REGS_H_
+
+#define DEV_ERR_STAT_NONFATAL			0x100178
+#define DEV_ERR_STAT_CORRECTABLE		0x10017c
+#define DEV_ERR_STAT_REG(x)			XE_REG(_PICK_EVEN((x), \
+								  DEV_ERR_STAT_CORRECTABLE, \
+								  DEV_ERR_STAT_NONFATAL))
+
+#endif
diff --git a/drivers/gpu/drm/xe/regs/xe_irq_regs.h b/drivers/gpu/drm/xe/regs/xe_irq_regs.h
index f0ecfcac4003..2758b64cec9e 100644
--- a/drivers/gpu/drm/xe/regs/xe_irq_regs.h
+++ b/drivers/gpu/drm/xe/regs/xe_irq_regs.h
@@ -18,6 +18,7 @@
 #define GFX_MSTR_IRQ				XE_REG(0x190010, XE_REG_OPTION_VF)
 #define   MASTER_IRQ				REG_BIT(31)
 #define   GU_MISC_IRQ				REG_BIT(29)
+#define   ERROR_IRQ(x)				REG_BIT(26 + (x))
 #define   DISPLAY_IRQ				REG_BIT(16)
 #define   GT_DW_IRQ(x)				REG_BIT(x)
 
diff --git a/drivers/gpu/drm/xe/xe_hw_error.c b/drivers/gpu/drm/xe/xe_hw_error.c
new file mode 100644
index 000000000000..0f2590839900
--- /dev/null
+++ b/drivers/gpu/drm/xe/xe_hw_error.c
@@ -0,0 +1,108 @@
+// SPDX-License-Identifier: MIT
+/*
+ * Copyright © 2025 Intel Corporation
+ */
+
+#include "regs/xe_hw_error_regs.h"
+#include "regs/xe_irq_regs.h"
+
+#include "xe_device.h"
+#include "xe_hw_error.h"
+#include "xe_mmio.h"
+
+/* Error categories reported by hardware */
+enum hardware_error {
+	HARDWARE_ERROR_CORRECTABLE = 0,
+	HARDWARE_ERROR_NONFATAL = 1,
+	HARDWARE_ERROR_FATAL = 2,
+	HARDWARE_ERROR_MAX,
+};
+
+static const char *hw_error_to_str(const enum hardware_error hw_err)
+{
+	switch (hw_err) {
+	case HARDWARE_ERROR_CORRECTABLE:
+		return "CORRECTABLE";
+	case HARDWARE_ERROR_NONFATAL:
+		return "NONFATAL";
+	case HARDWARE_ERROR_FATAL:
+		return "FATAL";
+	default:
+		return "UNKNOWN";
+	}
+}
+
+static void hw_error_source_handler(struct xe_tile *tile, const enum hardware_error hw_err)
+{
+	const char *hw_err_str = hw_error_to_str(hw_err);
+	struct xe_device *xe = tile_to_xe(tile);
+	unsigned long flags;
+	u32 err_src;
+
+	if (xe->info.platform != XE_BATTLEMAGE)
+		return;
+
+	spin_lock_irqsave(&xe->irq.lock, flags);
+	err_src = xe_mmio_read32(&tile->mmio, DEV_ERR_STAT_REG(hw_err));
+	if (!err_src) {
+		drm_err_ratelimited(&xe->drm, HW_ERR "Tile%d reported DEV_ERR_STAT_%s blank!\n",
+				    tile->id, hw_err_str);
+		goto unlock;
+	}
+
+	/* TODO: Process errrors per source */
+
+	xe_mmio_write32(&tile->mmio, DEV_ERR_STAT_REG(hw_err), err_src);
+
+unlock:
+	spin_unlock_irqrestore(&xe->irq.lock, flags);
+}
+
+/**
+ * xe_hw_error_irq_handler - irq handling for hw errors
+ * @tile: tile instance
+ * @master_ctl: value read from master interrupt register
+ *
+ * Xe platforms add three error bits to the master interrupt register to support error handling.
+ * These three bits are used to convey the class of error FATAL, NONFATAL, or CORRECTABLE.
+ * To process the interrupt, determine the source of error by reading the Device Error Source
+ * Register that corresponds to the class of error being serviced.
+ */
+void xe_hw_error_irq_handler(struct xe_tile *tile, const u32 master_ctl)
+{
+	enum hardware_error hw_err;
+
+	for (hw_err = 0; hw_err < HARDWARE_ERROR_MAX; hw_err++)
+		if (master_ctl & ERROR_IRQ(hw_err))
+			hw_error_source_handler(tile, hw_err);
+}
+
+/*
+ * Process hardware errors during boot
+ */
+static void process_hw_errors(struct xe_device *xe)
+{
+	struct xe_tile *tile;
+	u32 master_ctl;
+	u8 id;
+
+	for_each_tile(tile, xe, id) {
+		master_ctl = xe_mmio_read32(&tile->mmio, GFX_MSTR_IRQ);
+		xe_hw_error_irq_handler(tile, master_ctl);
+		xe_mmio_write32(&tile->mmio, GFX_MSTR_IRQ, master_ctl);
+	}
+}
+
+/**
+ * xe_hw_error_init - Initialize hw errors
+ * @xe: xe device instance
+ *
+ * Initialize and process hw errors
+ */
+void xe_hw_error_init(struct xe_device *xe)
+{
+	if (!IS_DGFX(xe) || IS_SRIOV_VF(xe))
+		return;
+
+	process_hw_errors(xe);
+}
diff --git a/drivers/gpu/drm/xe/xe_hw_error.h b/drivers/gpu/drm/xe/xe_hw_error.h
new file mode 100644
index 000000000000..d86e28c5180c
--- /dev/null
+++ b/drivers/gpu/drm/xe/xe_hw_error.h
@@ -0,0 +1,15 @@
+/* SPDX-License-Identifier: MIT */
+/*
+ * Copyright © 2025 Intel Corporation
+ */
+#ifndef XE_HW_ERROR_H_
+#define XE_HW_ERROR_H_
+
+#include <linux/types.h>
+
+struct xe_tile;
+struct xe_device;
+
+void xe_hw_error_irq_handler(struct xe_tile *tile, const u32 master_ctl);
+void xe_hw_error_init(struct xe_device *xe);
+#endif
diff --git a/drivers/gpu/drm/xe/xe_irq.c b/drivers/gpu/drm/xe/xe_irq.c
index 5362d3174b06..24ccf3bec52c 100644
--- a/drivers/gpu/drm/xe/xe_irq.c
+++ b/drivers/gpu/drm/xe/xe_irq.c
@@ -18,6 +18,7 @@
 #include "xe_gt.h"
 #include "xe_guc.h"
 #include "xe_hw_engine.h"
+#include "xe_hw_error.h"
 #include "xe_memirq.h"
 #include "xe_mmio.h"
 #include "xe_pxp.h"
@@ -466,6 +467,7 @@ static irqreturn_t dg1_irq_handler(int irq, void *arg)
 		xe_mmio_write32(mmio, GFX_MSTR_IRQ, master_ctl);
 
 		gt_irq_handler(tile, master_ctl, intr_dw, identity);
+		xe_hw_error_irq_handler(tile, master_ctl);
 
 		/*
 		 * Display interrupts (including display backlight operations
@@ -753,6 +755,8 @@ int xe_irq_install(struct xe_device *xe)
 	int nvec = 1;
 	int err;
 
+	xe_hw_error_init(xe);
+
 	xe_irq_reset(xe);
 
 	if (xe_device_has_msix(xe)) {
-- 
2.47.1
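
The mapping from error class to master interrupt bit and status register
in the patch above can be sketched in plain C. This is a standalone
illustration, not driver code. The enum and function names are
hypothetical, the offsets are copied from xe_hw_error_regs.h in this patch,
and the register selection assumes _PICK_EVEN(x, a, b) evaluates to
a + x * (b - a). The resulting offset for the fatal class is only what that
formula yields, since the patch does not name a fatal register.

```c
#include <stdint.h>

/* Error classes, numbered as in the patch's enum hardware_error. */
enum hw_error_class {
	HW_ERROR_CORRECTABLE = 0,
	HW_ERROR_NONFATAL = 1,
	HW_ERROR_FATAL = 2,
	HW_ERROR_MAX,
};

/* Offsets copied from xe_hw_error_regs.h in this patch. */
#define DEV_ERR_STAT_NONFATAL		0x100178
#define DEV_ERR_STAT_CORRECTABLE	0x10017c

/* Master interrupt bit for a class; mirrors ERROR_IRQ(x) = REG_BIT(26 + (x)). */
uint32_t error_irq_bit(enum hw_error_class cls)
{
	return UINT32_C(1) << (26 + (unsigned int)cls);
}

/*
 * Status register offset for a class, assuming _PICK_EVEN(x, a, b)
 * is a + x * (b - a). The stride here is negative (-4 bytes).
 */
uint32_t dev_err_stat_offset(enum hw_error_class cls)
{
	int32_t stride = DEV_ERR_STAT_NONFATAL - DEV_ERR_STAT_CORRECTABLE;

	return (uint32_t)(DEV_ERR_STAT_CORRECTABLE + (int32_t)cls * stride);
}

/* Bitmask of classes (bit n set for class n) flagged in master_ctl. */
uint32_t pending_error_classes(uint32_t master_ctl)
{
	uint32_t pending = 0;
	int cls;

	for (cls = 0; cls < HW_ERROR_MAX; cls++)
		if (master_ctl & error_irq_bit((enum hw_error_class)cls))
			pending |= UINT32_C(1) << cls;

	return pending;
}
```

Under these assumptions the three classes occupy bits 26 to 28 of the
master interrupt register, just below GU_MISC_IRQ at bit 29, and the
status registers step down by 4 bytes per class starting at
DEV_ERR_STAT_CORRECTABLE.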


^ permalink raw reply related	[flat|nested] 48+ messages in thread

* [PATCH v4 8/9] drm/xe/xe_hw_error: Handle CSC Firmware reported Hardware errors
  2025-07-09 11:20 [PATCH v4 0/9] Handle Firmware reported Hardware Errors Riana Tauro
                   ` (6 preceding siblings ...)
  2025-07-09 11:20 ` [PATCH v4 7/9] drm/xe: Add support to handle hardware errors Riana Tauro
@ 2025-07-09 11:20 ` Riana Tauro
  2025-07-11  0:36   ` Umesh Nerlige Ramappa
  2025-07-09 11:20 ` [PATCH v4 9/9] drm/xe/xe_hw_error: Add fault injection to trigger csc error handler Riana Tauro
                   ` (5 subsequent siblings)
  13 siblings, 1 reply; 48+ messages in thread
From: Riana Tauro @ 2025-07-09 11:20 UTC (permalink / raw)
  To: intel-xe
  Cc: riana.tauro, anshuman.gupta, rodrigo.vivi, lucas.demarchi,
	aravind.iddamsetty, raag.jadav, umesh.nerlige.ramappa,
	frank.scarbrough, sk.anirban

Add support to handle CSC firmware reported errors. When CSC firmware
errors are encountered, an error interrupt is received by the GFX device as
an MSI interrupt.

Device Source control registers indicate the source of the error as CSC.
The HEC error status register indicates that the error is firmware reported.
Depending on the type of error, the error cause is written to the HEC
Firmware error register.

On encountering such CSC firmware errors, the graphics device is
non-recoverable from driver context. The only way to recover from these
errors is a firmware flash. The device is then wedged and userspace is
notified with a drm uevent.

v2: use vendor recovery method with
    runtime survivability (Christian, Rodrigo, Raag)

v3: move declare wedged to runtime survivability mode (Rodrigo)

Signed-off-by: Riana Tauro <riana.tauro@intel.com>
---
 drivers/gpu/drm/xe/regs/xe_gsc_regs.h      |  2 +
 drivers/gpu/drm/xe/regs/xe_hw_error_regs.h |  7 ++-
 drivers/gpu/drm/xe/xe_device_types.h       |  3 +
 drivers/gpu/drm/xe/xe_hw_error.c           | 68 +++++++++++++++++++++-
 4 files changed, 78 insertions(+), 2 deletions(-)

diff --git a/drivers/gpu/drm/xe/regs/xe_gsc_regs.h b/drivers/gpu/drm/xe/regs/xe_gsc_regs.h
index 9b66cc972a63..180be82672ab 100644
--- a/drivers/gpu/drm/xe/regs/xe_gsc_regs.h
+++ b/drivers/gpu/drm/xe/regs/xe_gsc_regs.h
@@ -13,6 +13,8 @@
 
 /* Definitions of GSC H/W registers, bits, etc */
 
+#define BMG_GSC_HECI1_BASE	0x373000
+
 #define MTL_GSC_HECI1_BASE	0x00116000
 #define MTL_GSC_HECI2_BASE	0x00117000
 
diff --git a/drivers/gpu/drm/xe/regs/xe_hw_error_regs.h b/drivers/gpu/drm/xe/regs/xe_hw_error_regs.h
index ed9b81fb28a0..c146b9ef44eb 100644
--- a/drivers/gpu/drm/xe/regs/xe_hw_error_regs.h
+++ b/drivers/gpu/drm/xe/regs/xe_hw_error_regs.h
@@ -6,10 +6,15 @@
 #ifndef _XE_HW_ERROR_REGS_H_
 #define _XE_HW_ERROR_REGS_H_
 
+#define HEC_UNCORR_ERR_STATUS(base)                    XE_REG((base) + 0x118)
+#define    UNCORR_FW_REPORTED_ERR                      BIT(6)
+
+#define HEC_UNCORR_FW_ERR_DW0(base)                    XE_REG((base) + 0x124)
+
 #define DEV_ERR_STAT_NONFATAL			0x100178
 #define DEV_ERR_STAT_CORRECTABLE		0x10017c
 #define DEV_ERR_STAT_REG(x)			XE_REG(_PICK_EVEN((x), \
 								  DEV_ERR_STAT_CORRECTABLE, \
 								  DEV_ERR_STAT_NONFATAL))
-
+#define   XE_CSC_ERROR				BIT(17)
 #endif
diff --git a/drivers/gpu/drm/xe/xe_device_types.h b/drivers/gpu/drm/xe/xe_device_types.h
index ca300338e8c2..283d5c88758e 100644
--- a/drivers/gpu/drm/xe/xe_device_types.h
+++ b/drivers/gpu/drm/xe/xe_device_types.h
@@ -241,6 +241,9 @@ struct xe_tile {
 	/** @memirq: Memory Based Interrupts. */
 	struct xe_memirq memirq;
 
+	/** @csc_hw_error_work: worker to report CSC HW errors */
+	struct work_struct csc_hw_error_work;
+
 	/** @pcode: tile's PCODE */
 	struct {
 		/** @pcode.lock: protecting tile's PCODE mailbox data */
diff --git a/drivers/gpu/drm/xe/xe_hw_error.c b/drivers/gpu/drm/xe/xe_hw_error.c
index 0f2590839900..7cc9b8a7fa1a 100644
--- a/drivers/gpu/drm/xe/xe_hw_error.c
+++ b/drivers/gpu/drm/xe/xe_hw_error.c
@@ -3,12 +3,16 @@
  * Copyright © 2025 Intel Corporation
  */
 
+#include "regs/xe_gsc_regs.h"
 #include "regs/xe_hw_error_regs.h"
 #include "regs/xe_irq_regs.h"
 
 #include "xe_device.h"
 #include "xe_hw_error.h"
 #include "xe_mmio.h"
+#include "xe_survivability_mode.h"
+
+#define  HEC_UNCORR_FW_ERR_BITS 4
 
 /* Error categories reported by hardware */
 enum hardware_error {
@@ -18,6 +22,13 @@ enum hardware_error {
 	HARDWARE_ERROR_MAX,
 };
 
+static const char * const hec_uncorrected_fw_errors[] = {
+	"Fatal",
+	"CSE Disabled",
+	"FD Corruption",
+	"Data Corruption"
+};
+
 static const char *hw_error_to_str(const enum hardware_error hw_err)
 {
 	switch (hw_err) {
@@ -32,6 +43,56 @@ static const char *hw_error_to_str(const enum hardware_error hw_err)
 	}
 }
 
+static void csc_hw_error_work(struct work_struct *work)
+{
+	struct xe_tile *tile = container_of(work, typeof(*tile), csc_hw_error_work);
+	struct xe_device *xe = tile_to_xe(tile);
+	int ret;
+
+	ret = xe_survivability_mode_runtime_enable(xe);
+	if (ret)
+		drm_err(&xe->drm, "Failed to enable runtime survivability mode\n");
+}
+
+static void csc_hw_error_handler(struct xe_tile *tile, const enum hardware_error hw_err)
+{
+	const char *hw_err_str = hw_error_to_str(hw_err);
+	struct xe_device *xe = tile_to_xe(tile);
+	struct xe_mmio *mmio = &tile->mmio;
+	u32 base, err_bit, err_src;
+	unsigned long fw_err;
+
+	if (xe->info.platform != XE_BATTLEMAGE)
+		return;
+
+	/* Not supported in BMG */
+	if (hw_err == HARDWARE_ERROR_CORRECTABLE)
+		return;
+
+	base = BMG_GSC_HECI1_BASE;
+	lockdep_assert_held(&xe->irq.lock);
+	err_src = xe_mmio_read32(mmio, HEC_UNCORR_ERR_STATUS(base));
+	if (!err_src) {
+		drm_err_ratelimited(&xe->drm, HW_ERR "Tile%d reported HEC_ERR_STATUS_%s blank\n",
+				    tile->id, hw_err_str);
+		return;
+	}
+
+	if (err_src & UNCORR_FW_REPORTED_ERR) {
+		fw_err = xe_mmio_read32(mmio, HEC_UNCORR_FW_ERR_DW0(base));
+		for_each_set_bit(err_bit, &fw_err, HEC_UNCORR_FW_ERR_BITS) {
+			drm_err_ratelimited(&xe->drm, HW_ERR
+					    "%s: HEC Uncorrected FW %s error reported, bit[%d] is set\n",
+					     hw_err_str, hec_uncorrected_fw_errors[err_bit],
+					     err_bit);
+
+			schedule_work(&tile->csc_hw_error_work);
+		}
+	}
+
+	xe_mmio_write32(mmio, HEC_UNCORR_ERR_STATUS(base), err_src);
+}
+
 static void hw_error_source_handler(struct xe_tile *tile, const enum hardware_error hw_err)
 {
 	const char *hw_err_str = hw_error_to_str(hw_err);
@@ -50,7 +111,8 @@ static void hw_error_source_handler(struct xe_tile *tile, const enum hardware_er
 		goto unlock;
 	}
 
-	/* TODO: Process errrors per source */
+	if (err_src & XE_CSC_ERROR)
+		csc_hw_error_handler(tile, hw_err);
 
 	xe_mmio_write32(&tile->mmio, DEV_ERR_STAT_REG(hw_err), err_src);
 
@@ -101,8 +163,12 @@ static void process_hw_errors(struct xe_device *xe)
  */
 void xe_hw_error_init(struct xe_device *xe)
 {
+	struct xe_tile *tile = xe_device_get_root_tile(xe);
+
 	if (!IS_DGFX(xe) || IS_SRIOV_VF(xe))
 		return;
 
+	INIT_WORK(&tile->csc_hw_error_work, csc_hw_error_work);
+
 	process_hw_errors(xe);
 }
-- 
2.47.1


^ permalink raw reply related	[flat|nested] 48+ messages in thread

* [PATCH v4 9/9] drm/xe/xe_hw_error: Add fault injection to trigger csc error handler
  2025-07-09 11:20 [PATCH v4 0/9] Handle Firmware reported Hardware Errors Riana Tauro
                   ` (7 preceding siblings ...)
  2025-07-09 11:20 ` [PATCH v4 8/9] drm/xe/xe_hw_error: Handle CSC Firmware reported Hardware errors Riana Tauro
@ 2025-07-09 11:20 ` Riana Tauro
  2025-07-11 17:41   ` Umesh Nerlige Ramappa
  2025-07-09 12:28 ` ✗ CI.checkpatch: warning for Handle Firmware reported Hardware Errors (rev4) Patchwork
                   ` (4 subsequent siblings)
  13 siblings, 1 reply; 48+ messages in thread
From: Riana Tauro @ 2025-07-09 11:20 UTC (permalink / raw)
  To: intel-xe
  Cc: riana.tauro, anshuman.gupta, rodrigo.vivi, lucas.demarchi,
	aravind.iddamsetty, raag.jadav, umesh.nerlige.ramappa,
	frank.scarbrough, sk.anirban

Add a debugfs fault handler to trigger the CSC error handler, which
wedges the device and sends a drm uevent.

Signed-off-by: Riana Tauro <riana.tauro@intel.com>
---
 drivers/gpu/drm/xe/xe_debugfs.c  |  2 ++
 drivers/gpu/drm/xe/xe_hw_error.c | 11 +++++++++++
 2 files changed, 13 insertions(+)

diff --git a/drivers/gpu/drm/xe/xe_debugfs.c b/drivers/gpu/drm/xe/xe_debugfs.c
index d83cd6ed3fa8..134610437aea 100644
--- a/drivers/gpu/drm/xe/xe_debugfs.c
+++ b/drivers/gpu/drm/xe/xe_debugfs.c
@@ -29,6 +29,7 @@
 #endif
 
 DECLARE_FAULT_ATTR(gt_reset_failure);
+DECLARE_FAULT_ATTR(inject_csc_hw_error);
 
 static struct xe_device *node_to_xe(struct drm_info_node *node)
 {
@@ -273,4 +274,5 @@ void xe_debugfs_register(struct xe_device *xe)
 	xe_pxp_debugfs_register(xe->pxp);
 
 	fault_create_debugfs_attr("fail_gt_reset", root, &gt_reset_failure);
+	fault_create_debugfs_attr("inject_csc_hw_error", root, &inject_csc_hw_error);
 }
diff --git a/drivers/gpu/drm/xe/xe_hw_error.c b/drivers/gpu/drm/xe/xe_hw_error.c
index 7cc9b8a7fa1a..2d56a93b3a71 100644
--- a/drivers/gpu/drm/xe/xe_hw_error.c
+++ b/drivers/gpu/drm/xe/xe_hw_error.c
@@ -3,6 +3,8 @@
  * Copyright © 2025 Intel Corporation
  */
 
+#include <linux/fault-inject.h>
+
 #include "regs/xe_gsc_regs.h"
 #include "regs/xe_hw_error_regs.h"
 #include "regs/xe_irq_regs.h"
@@ -13,6 +15,7 @@
 #include "xe_survivability_mode.h"
 
 #define  HEC_UNCORR_FW_ERR_BITS 4
+extern struct fault_attr inject_csc_hw_error;
 
 /* Error categories reported by hardware */
 enum hardware_error {
@@ -43,6 +46,11 @@ static const char *hw_error_to_str(const enum hardware_error hw_err)
 	}
 }
 
+static bool fault_inject_csc_hw_error(void)
+{
+	return should_fail(&inject_csc_hw_error, 1);
+}
+
 static void csc_hw_error_work(struct work_struct *work)
 {
 	struct xe_tile *tile = container_of(work, typeof(*tile), csc_hw_error_work);
@@ -134,6 +142,9 @@ void xe_hw_error_irq_handler(struct xe_tile *tile, const u32 master_ctl)
 {
 	enum hardware_error hw_err;
 
+	if (fault_inject_csc_hw_error())
+		schedule_work(&tile->csc_hw_error_work);
+
 	for (hw_err = 0; hw_err < HARDWARE_ERROR_MAX; hw_err++)
 		if (master_ctl & ERROR_IRQ(hw_err))
 			hw_error_source_handler(tile, hw_err);
-- 
2.47.1


^ permalink raw reply related	[flat|nested] 48+ messages in thread

* ✗ CI.checkpatch: warning for Handle Firmware reported Hardware Errors (rev4)
  2025-07-09 11:20 [PATCH v4 0/9] Handle Firmware reported Hardware Errors Riana Tauro
                   ` (8 preceding siblings ...)
  2025-07-09 11:20 ` [PATCH v4 9/9] drm/xe/xe_hw_error: Add fault injection to trigger csc error handler Riana Tauro
@ 2025-07-09 12:28 ` Patchwork
  2025-07-09 12:30 ` ✓ CI.KUnit: success " Patchwork
                   ` (3 subsequent siblings)
  13 siblings, 0 replies; 48+ messages in thread
From: Patchwork @ 2025-07-09 12:28 UTC (permalink / raw)
  To: Riana Tauro; +Cc: intel-xe

== Series Details ==

Series: Handle Firmware reported Hardware Errors (rev4)
URL   : https://patchwork.freedesktop.org/series/149756/
State : warning

== Summary ==

+ KERNEL=/kernel
+ git clone https://gitlab.freedesktop.org/drm/maintainer-tools mt
Cloning into 'mt'...
warning: redirecting to https://gitlab.freedesktop.org/drm/maintainer-tools.git/
+ git -C mt rev-list -n1 origin/master
43254c2aa575037fc031c7ac21b0d031c700b2bf
+ cd /kernel
+ git config --global --add safe.directory /kernel
+ git log -n1
commit 687f6fef4cdc2359f8a5cab02cbdad009dec90a8
Author: Riana Tauro <riana.tauro@intel.com>
Date:   Wed Jul 9 16:50:21 2025 +0530

    drm/xe/xe_hw_error: Add fault injection to trigger csc error handler
    
    Add a debugfs fault handler to trigger csc error handler that
    wedges the device and sends drm uevent
    
    Signed-off-by: Riana Tauro <riana.tauro@intel.com>
+ /mt/dim checkpatch 20adfb60af27bc0e490b2d20609c3158ae2fbd26 drm-intel
bbc7ca5d17df drm: Add a vendor-specific recovery method to device wedged uevent
2c51c0258aa6 drm/xe: Set GT as wedged before sending wedged uevent
236f7ec3b4a6 drm/xe: Add a helper function to set recovery method
9712ceb6cd2a drm/xe/xe_survivability: Refactor survivability mode
51371af8f5e3 drm/xe/xe_survivability: Add support for Runtime survivability mode
62dc45c71895 drm/xe/doc: Document device wedged and runtime survivability
-:25: WARNING:FILE_PATH_CHANGES: added, moved or deleted file(s), does MAINTAINERS need updating?
#25: 
new file mode 100644

total: 0 errors, 1 warnings, 0 checks, 106 lines checked
9be309241b33 drm/xe: Add support to handle hardware errors
-:39: WARNING:FILE_PATH_CHANGES: added, moved or deleted file(s), does MAINTAINERS need updating?
#39: 
new file mode 100644

total: 0 errors, 1 warnings, 0 checks, 174 lines checked
c2731205f14d drm/xe/xe_hw_error: Handle CSC Firmware reported Hardware errors
687f6fef4cdc drm/xe/xe_hw_error: Add fault injection to trigger csc error handler



^ permalink raw reply	[flat|nested] 48+ messages in thread

* ✓ CI.KUnit: success for Handle Firmware reported Hardware Errors (rev4)
  2025-07-09 11:20 [PATCH v4 0/9] Handle Firmware reported Hardware Errors Riana Tauro
                   ` (9 preceding siblings ...)
  2025-07-09 12:28 ` ✗ CI.checkpatch: warning for Handle Firmware reported Hardware Errors (rev4) Patchwork
@ 2025-07-09 12:30 ` Patchwork
  2025-07-09 12:44 ` ✗ CI.checksparse: warning " Patchwork
                   ` (2 subsequent siblings)
  13 siblings, 0 replies; 48+ messages in thread
From: Patchwork @ 2025-07-09 12:30 UTC (permalink / raw)
  To: Riana Tauro; +Cc: intel-xe

== Series Details ==

Series: Handle Firmware reported Hardware Errors (rev4)
URL   : https://patchwork.freedesktop.org/series/149756/
State : success

== Summary ==

+ trap cleanup EXIT
+ /kernel/tools/testing/kunit/kunit.py run --kunitconfig /kernel/drivers/gpu/drm/xe/.kunitconfig
[12:28:59] Configuring KUnit Kernel ...
Generating .config ...
Populating config with:
$ make ARCH=um O=.kunit olddefconfig
[12:29:03] Building KUnit Kernel ...
Populating config with:
$ make ARCH=um O=.kunit olddefconfig
Building with:
$ make all compile_commands.json scripts_gdb ARCH=um O=.kunit --jobs=48
[12:29:29] Starting KUnit Kernel (1/1)...
[12:29:29] ============================================================
Running tests with:
$ .kunit/linux kunit.enable=1 mem=1G console=tty kunit_shutdown=halt
[12:29:30] ================== guc_buf (11 subtests) ===================
[12:29:30] [PASSED] test_smallest
[12:29:30] [PASSED] test_largest
[12:29:30] [PASSED] test_granular
[12:29:30] [PASSED] test_unique
[12:29:30] [PASSED] test_overlap
[12:29:30] [PASSED] test_reusable
[12:29:30] [PASSED] test_too_big
[12:29:30] [PASSED] test_flush
[12:29:30] [PASSED] test_lookup
[12:29:30] [PASSED] test_data
[12:29:30] [PASSED] test_class
[12:29:30] ===================== [PASSED] guc_buf =====================
[12:29:30] =================== guc_dbm (7 subtests) ===================
[12:29:30] [PASSED] test_empty
[12:29:30] [PASSED] test_default
[12:29:30] ======================== test_size  ========================
[12:29:30] [PASSED] 4
[12:29:30] [PASSED] 8
[12:29:30] [PASSED] 32
[12:29:30] [PASSED] 256
[12:29:30] ==================== [PASSED] test_size ====================
[12:29:30] ======================= test_reuse  ========================
[12:29:30] [PASSED] 4
[12:29:30] [PASSED] 8
[12:29:30] [PASSED] 32
[12:29:30] [PASSED] 256
[12:29:30] =================== [PASSED] test_reuse ====================
[12:29:30] =================== test_range_overlap  ====================
[12:29:30] [PASSED] 4
[12:29:30] [PASSED] 8
[12:29:30] [PASSED] 32
[12:29:30] [PASSED] 256
[12:29:30] =============== [PASSED] test_range_overlap ================
[12:29:30] =================== test_range_compact  ====================
[12:29:30] [PASSED] 4
[12:29:30] [PASSED] 8
[12:29:30] [PASSED] 32
[12:29:30] [PASSED] 256
[12:29:30] =============== [PASSED] test_range_compact ================
[12:29:30] ==================== test_range_spare  =====================
[12:29:30] [PASSED] 4
[12:29:30] [PASSED] 8
[12:29:30] [PASSED] 32
[12:29:30] [PASSED] 256
[12:29:30] ================ [PASSED] test_range_spare =================
[12:29:30] ===================== [PASSED] guc_dbm =====================
[12:29:30] =================== guc_idm (6 subtests) ===================
[12:29:30] [PASSED] bad_init
[12:29:30] [PASSED] no_init
[12:29:30] [PASSED] init_fini
[12:29:30] [PASSED] check_used
[12:29:30] [PASSED] check_quota
[12:29:30] [PASSED] check_all
[12:29:30] ===================== [PASSED] guc_idm =====================
[12:29:30] ================== no_relay (3 subtests) ===================
[12:29:30] [PASSED] xe_drops_guc2pf_if_not_ready
[12:29:30] [PASSED] xe_drops_guc2vf_if_not_ready
[12:29:30] [PASSED] xe_rejects_send_if_not_ready
[12:29:30] ==================== [PASSED] no_relay =====================
[12:29:30] ================== pf_relay (14 subtests) ==================
[12:29:30] [PASSED] pf_rejects_guc2pf_too_short
[12:29:30] [PASSED] pf_rejects_guc2pf_too_long
[12:29:30] [PASSED] pf_rejects_guc2pf_no_payload
[12:29:30] [PASSED] pf_fails_no_payload
[12:29:30] [PASSED] pf_fails_bad_origin
[12:29:30] [PASSED] pf_fails_bad_type
[12:29:30] [PASSED] pf_txn_reports_error
[12:29:30] [PASSED] pf_txn_sends_pf2guc
[12:29:30] [PASSED] pf_sends_pf2guc
[12:29:30] [SKIPPED] pf_loopback_nop
[12:29:30] [SKIPPED] pf_loopback_echo
[12:29:30] [SKIPPED] pf_loopback_fail
[12:29:30] [SKIPPED] pf_loopback_busy
[12:29:30] [SKIPPED] pf_loopback_retry
[12:29:30] ==================== [PASSED] pf_relay =====================
[12:29:30] ================== vf_relay (3 subtests) ===================
[12:29:30] [PASSED] vf_rejects_guc2vf_too_short
[12:29:30] [PASSED] vf_rejects_guc2vf_too_long
[12:29:30] [PASSED] vf_rejects_guc2vf_no_payload
[12:29:30] ==================== [PASSED] vf_relay =====================
[12:29:30] ================= pf_service (11 subtests) =================
[12:29:30] [PASSED] pf_negotiate_any
[12:29:30] [PASSED] pf_negotiate_base_match
[12:29:30] [PASSED] pf_negotiate_base_newer
[12:29:30] [PASSED] pf_negotiate_base_next
[12:29:30] [SKIPPED] pf_negotiate_base_older
[12:29:30] [PASSED] pf_negotiate_base_prev
[12:29:30] [PASSED] pf_negotiate_latest_match
[12:29:30] [PASSED] pf_negotiate_latest_newer
[12:29:30] [PASSED] pf_negotiate_latest_next
[12:29:30] [SKIPPED] pf_negotiate_latest_older
[12:29:30] [SKIPPED] pf_negotiate_latest_prev
[12:29:30] =================== [PASSED] pf_service ====================
[12:29:30] ===================== lmtt (1 subtest) =====================
[12:29:30] ======================== test_ops  =========================
[12:29:30] [PASSED] 2-level
[12:29:30] [PASSED] multi-level
[12:29:30] ==================== [PASSED] test_ops =====================
[12:29:30] ====================== [PASSED] lmtt =======================
[12:29:30] =================== xe_mocs (2 subtests) ===================
[12:29:30] ================ xe_live_mocs_kernel_kunit  ================
[12:29:30] =========== [SKIPPED] xe_live_mocs_kernel_kunit ============
[12:29:30] ================ xe_live_mocs_reset_kunit  =================
[12:29:30] ============ [SKIPPED] xe_live_mocs_reset_kunit ============
[12:29:30] ==================== [SKIPPED] xe_mocs =====================
[12:29:30] ================= xe_migrate (2 subtests) ==================
[12:29:30] ================= xe_migrate_sanity_kunit  =================
[12:29:30] ============ [SKIPPED] xe_migrate_sanity_kunit =============
[12:29:30] ================== xe_validate_ccs_kunit  ==================
[12:29:30] ============= [SKIPPED] xe_validate_ccs_kunit ==============
[12:29:30] =================== [SKIPPED] xe_migrate ===================
[12:29:30] ================== xe_dma_buf (1 subtest) ==================
[12:29:30] ==================== xe_dma_buf_kunit  =====================
[12:29:30] ================ [SKIPPED] xe_dma_buf_kunit ================
[12:29:30] =================== [SKIPPED] xe_dma_buf ===================
[12:29:30] ================= xe_bo_shrink (1 subtest) =================
[12:29:30] =================== xe_bo_shrink_kunit  ====================
[12:29:30] =============== [SKIPPED] xe_bo_shrink_kunit ===============
[12:29:30] ================== [SKIPPED] xe_bo_shrink ==================
[12:29:30] ==================== xe_bo (2 subtests) ====================
[12:29:30] ================== xe_ccs_migrate_kunit  ===================
[12:29:30] ============== [SKIPPED] xe_ccs_migrate_kunit ==============
[12:29:30] ==================== xe_bo_evict_kunit  ====================
[12:29:30] =============== [SKIPPED] xe_bo_evict_kunit ================
[12:29:30] ===================== [SKIPPED] xe_bo ======================
[12:29:30] ==================== args (11 subtests) ====================
[12:29:30] [PASSED] count_args_test
[12:29:30] [PASSED] call_args_example
[12:29:30] [PASSED] call_args_test
[12:29:30] [PASSED] drop_first_arg_example
[12:29:30] [PASSED] drop_first_arg_test
[12:29:30] [PASSED] first_arg_example
[12:29:30] [PASSED] first_arg_test
[12:29:30] [PASSED] last_arg_example
[12:29:30] [PASSED] last_arg_test
[12:29:30] [PASSED] pick_arg_example
[12:29:30] [PASSED] sep_comma_example
[12:29:30] ====================== [PASSED] args =======================
[12:29:30] =================== xe_pci (3 subtests) ====================
[12:29:30] ==================== check_graphics_ip  ====================
[12:29:30] [PASSED] 12.70 Xe_LPG
[12:29:30] [PASSED] 12.71 Xe_LPG
[12:29:30] [PASSED] 12.74 Xe_LPG+
[12:29:30] [PASSED] 20.01 Xe2_HPG
[12:29:30] [PASSED] 20.02 Xe2_HPG
[12:29:30] [PASSED] 20.04 Xe2_LPG
[12:29:30] [PASSED] 30.00 Xe3_LPG
[12:29:30] [PASSED] 30.01 Xe3_LPG
[12:29:30] [PASSED] 30.03 Xe3_LPG
[12:29:30] ================ [PASSED] check_graphics_ip ================
[12:29:30] ===================== check_media_ip  ======================
[12:29:30] [PASSED] 13.00 Xe_LPM+
[12:29:30] [PASSED] 13.01 Xe2_HPM
[12:29:30] [PASSED] 20.00 Xe2_LPM
[12:29:30] [PASSED] 30.00 Xe3_LPM
[12:29:30] [PASSED] 30.02 Xe3_LPM
[12:29:30] ================= [PASSED] check_media_ip ==================
[12:29:30] ================= check_platform_gt_count  =================
[12:29:30] [PASSED] 0x9A60 (TIGERLAKE)
[12:29:30] [PASSED] 0x9A68 (TIGERLAKE)
[12:29:30] [PASSED] 0x9A70 (TIGERLAKE)
[12:29:30] [PASSED] 0x9A40 (TIGERLAKE)
[12:29:30] [PASSED] 0x9A49 (TIGERLAKE)
[12:29:30] [PASSED] 0x9A59 (TIGERLAKE)
[12:29:30] [PASSED] 0x9A78 (TIGERLAKE)
[12:29:30] [PASSED] 0x9AC0 (TIGERLAKE)
[12:29:30] [PASSED] 0x9AC9 (TIGERLAKE)
[12:29:30] [PASSED] 0x9AD9 (TIGERLAKE)
[12:29:30] [PASSED] 0x9AF8 (TIGERLAKE)
[12:29:30] [PASSED] 0x4C80 (ROCKETLAKE)
[12:29:30] [PASSED] 0x4C8A (ROCKETLAKE)
[12:29:30] [PASSED] 0x4C8B (ROCKETLAKE)
[12:29:30] [PASSED] 0x4C8C (ROCKETLAKE)
[12:29:30] [PASSED] 0x4C90 (ROCKETLAKE)
[12:29:30] [PASSED] 0x4C9A (ROCKETLAKE)
[12:29:30] [PASSED] 0x4680 (ALDERLAKE_S)
[12:29:30] [PASSED] 0x4682 (ALDERLAKE_S)
[12:29:30] [PASSED] 0x4688 (ALDERLAKE_S)
[12:29:30] [PASSED] 0x468A (ALDERLAKE_S)
[12:29:30] [PASSED] 0x468B (ALDERLAKE_S)
[12:29:30] [PASSED] 0x4690 (ALDERLAKE_S)
[12:29:30] [PASSED] 0x4692 (ALDERLAKE_S)
[12:29:30] [PASSED] 0x4693 (ALDERLAKE_S)
[12:29:30] [PASSED] 0x46A0 (ALDERLAKE_P)
[12:29:30] [PASSED] 0x46A1 (ALDERLAKE_P)
[12:29:30] [PASSED] 0x46A2 (ALDERLAKE_P)
[12:29:30] [PASSED] 0x46A3 (ALDERLAKE_P)
[12:29:30] [PASSED] 0x46A6 (ALDERLAKE_P)
[12:29:30] [PASSED] 0x46A8 (ALDERLAKE_P)
[12:29:30] [PASSED] 0x46AA (ALDERLAKE_P)
[12:29:30] [PASSED] 0x462A (ALDERLAKE_P)
[12:29:30] [PASSED] 0x4626 (ALDERLAKE_P)
[12:29:30] [PASSED] 0x4628 (ALDERLAKE_P)
[12:29:30] [PASSED] 0x46B0 (ALDERLAKE_P)
[12:29:30] [PASSED] 0x46B1 (ALDERLAKE_P)
[12:29:30] [PASSED] 0x46B2 (ALDERLAKE_P)
[12:29:30] [PASSED] 0x46B3 (ALDERLAKE_P)
[12:29:30] [PASSED] 0x46C0 (ALDERLAKE_P)
[12:29:30] [PASSED] 0x46C1 (ALDERLAKE_P)
[12:29:30] [PASSED] 0x46C2 (ALDERLAKE_P)
[12:29:30] [PASSED] 0x46C3 (ALDERLAKE_P)
[12:29:30] [PASSED] 0x46D0 (ALDERLAKE_N)
[12:29:30] [PASSED] 0x46D1 (ALDERLAKE_N)
[12:29:30] [PASSED] 0x46D2 (ALDERLAKE_N)
[12:29:30] [PASSED] 0x46D3 (ALDERLAKE_N)
[12:29:30] [PASSED] 0x46D4 (ALDERLAKE_N)
[12:29:30] [PASSED] 0xA721 (ALDERLAKE_P)
[12:29:30] [PASSED] 0xA7A1 (ALDERLAKE_P)
[12:29:30] [PASSED] 0xA7A9 (ALDERLAKE_P)
[12:29:30] [PASSED] 0xA7AC (ALDERLAKE_P)
[12:29:30] [PASSED] 0xA7AD (ALDERLAKE_P)
[12:29:30] [PASSED] 0xA720 (ALDERLAKE_P)
[12:29:30] [PASSED] 0xA7A0 (ALDERLAKE_P)
[12:29:30] [PASSED] 0xA7A8 (ALDERLAKE_P)
[12:29:30] [PASSED] 0xA7AA (ALDERLAKE_P)
[12:29:30] [PASSED] 0xA7AB (ALDERLAKE_P)
[12:29:30] [PASSED] 0xA780 (ALDERLAKE_S)
[12:29:30] [PASSED] 0xA781 (ALDERLAKE_S)
[12:29:30] [PASSED] 0xA782 (ALDERLAKE_S)
[12:29:30] [PASSED] 0xA783 (ALDERLAKE_S)
[12:29:30] [PASSED] 0xA788 (ALDERLAKE_S)
[12:29:30] [PASSED] 0xA789 (ALDERLAKE_S)
[12:29:30] [PASSED] 0xA78A (ALDERLAKE_S)
[12:29:30] [PASSED] 0xA78B (ALDERLAKE_S)
[12:29:30] [PASSED] 0x4905 (DG1)
[12:29:30] [PASSED] 0x4906 (DG1)
[12:29:30] [PASSED] 0x4907 (DG1)
[12:29:30] [PASSED] 0x4908 (DG1)
[12:29:30] [PASSED] 0x4909 (DG1)
[12:29:30] [PASSED] 0x56C0 (DG2)
[12:29:30] [PASSED] 0x56C2 (DG2)
[12:29:30] [PASSED] 0x56C1 (DG2)
[12:29:30] [PASSED] 0x7D51 (METEORLAKE)
[12:29:30] [PASSED] 0x7DD1 (METEORLAKE)
[12:29:30] [PASSED] 0x7D41 (METEORLAKE)
[12:29:30] [PASSED] 0x7D67 (METEORLAKE)
[12:29:30] [PASSED] 0xB640 (METEORLAKE)
[12:29:30] [PASSED] 0x56A0 (DG2)
[12:29:30] [PASSED] 0x56A1 (DG2)
[12:29:30] [PASSED] 0x56A2 (DG2)
[12:29:30] [PASSED] 0x56BE (DG2)
[12:29:30] [PASSED] 0x56BF (DG2)
[12:29:30] [PASSED] 0x5690 (DG2)
[12:29:30] [PASSED] 0x5691 (DG2)
[12:29:30] [PASSED] 0x5692 (DG2)
[12:29:30] [PASSED] 0x56A5 (DG2)
[12:29:30] [PASSED] 0x56A6 (DG2)
[12:29:30] [PASSED] 0x56B0 (DG2)
[12:29:30] [PASSED] 0x56B1 (DG2)
[12:29:30] [PASSED] 0x56BA (DG2)
[12:29:30] [PASSED] 0x56BB (DG2)
[12:29:30] [PASSED] 0x56BC (DG2)
[12:29:30] [PASSED] 0x56BD (DG2)
[12:29:30] [PASSED] 0x5693 (DG2)
[12:29:30] [PASSED] 0x5694 (DG2)
[12:29:30] [PASSED] 0x5695 (DG2)
[12:29:30] [PASSED] 0x56A3 (DG2)
[12:29:30] [PASSED] 0x56A4 (DG2)
[12:29:30] [PASSED] 0x56B2 (DG2)
[12:29:30] [PASSED] 0x56B3 (DG2)
[12:29:30] [PASSED] 0x5696 (DG2)
[12:29:30] [PASSED] 0x5697 (DG2)
[12:29:30] [PASSED] 0xB69 (PVC)
[12:29:30] [PASSED] 0xB6E (PVC)
[12:29:30] [PASSED] 0xBD4 (PVC)
[12:29:30] [PASSED] 0xBD5 (PVC)
[12:29:30] [PASSED] 0xBD6 (PVC)
[12:29:30] [PASSED] 0xBD7 (PVC)
[12:29:30] [PASSED] 0xBD8 (PVC)
[12:29:30] [PASSED] 0xBD9 (PVC)
[12:29:30] [PASSED] 0xBDA (PVC)
[12:29:30] [PASSED] 0xBDB (PVC)
[12:29:30] [PASSED] 0xBE0 (PVC)
[12:29:30] [PASSED] 0xBE1 (PVC)
[12:29:30] [PASSED] 0xBE5 (PVC)
[12:29:30] [PASSED] 0x7D40 (METEORLAKE)
[12:29:30] [PASSED] 0x7D45 (METEORLAKE)
[12:29:30] [PASSED] 0x7D55 (METEORLAKE)
[12:29:30] [PASSED] 0x7D60 (METEORLAKE)
[12:29:30] [PASSED] 0x7DD5 (METEORLAKE)
[12:29:30] [PASSED] 0x6420 (LUNARLAKE)
[12:29:30] [PASSED] 0x64A0 (LUNARLAKE)
[12:29:30] [PASSED] 0x64B0 (LUNARLAKE)
[12:29:30] [PASSED] 0xE202 (BATTLEMAGE)
[12:29:30] [PASSED] 0xE209 (BATTLEMAGE)
[12:29:30] [PASSED] 0xE20B (BATTLEMAGE)
[12:29:30] [PASSED] 0xE20C (BATTLEMAGE)
[12:29:30] [PASSED] 0xE20D (BATTLEMAGE)
[12:29:30] [PASSED] 0xE210 (BATTLEMAGE)
[12:29:30] [PASSED] 0xE211 (BATTLEMAGE)
[12:29:30] [PASSED] 0xE212 (BATTLEMAGE)
[12:29:30] [PASSED] 0xE216 (BATTLEMAGE)
[12:29:30] [PASSED] 0xE220 (BATTLEMAGE)
[12:29:30] [PASSED] 0xE221 (BATTLEMAGE)
[12:29:30] [PASSED] 0xE222 (BATTLEMAGE)
[12:29:30] [PASSED] 0xE223 (BATTLEMAGE)
[12:29:30] [PASSED] 0xB080 (PANTHERLAKE)
[12:29:30] [PASSED] 0xB081 (PANTHERLAKE)
[12:29:30] [PASSED] 0xB082 (PANTHERLAKE)
[12:29:30] [PASSED] 0xB083 (PANTHERLAKE)
[12:29:30] [PASSED] 0xB084 (PANTHERLAKE)
[12:29:30] [PASSED] 0xB085 (PANTHERLAKE)
[12:29:30] [PASSED] 0xB086 (PANTHERLAKE)
[12:29:30] [PASSED] 0xB087 (PANTHERLAKE)
[12:29:30] [PASSED] 0xB08F (PANTHERLAKE)
[12:29:30] [PASSED] 0xB090 (PANTHERLAKE)
[12:29:30] [PASSED] 0xB0A0 (PANTHERLAKE)
[12:29:30] [PASSED] 0xB0B0 (PANTHERLAKE)
[12:29:30] [PASSED] 0xFD80 (PANTHERLAKE)
[12:29:30] [PASSED] 0xFD81 (PANTHERLAKE)
[12:29:30] ============= [PASSED] check_platform_gt_count =============
[12:29:30] ===================== [PASSED] xe_pci ======================
[12:29:30] =================== xe_rtp (2 subtests) ====================
[12:29:30] =============== xe_rtp_process_to_sr_tests  ================
[12:29:30] [PASSED] coalesce-same-reg
[12:29:30] [PASSED] no-match-no-add
[12:29:30] [PASSED] match-or
[12:29:30] [PASSED] match-or-xfail
[12:29:30] [PASSED] no-match-no-add-multiple-rules
[12:29:30] [PASSED] two-regs-two-entries
[12:29:30] [PASSED] clr-one-set-other
[12:29:30] [PASSED] set-field
[12:29:30] [PASSED] conflict-duplicate
[12:29:30] [PASSED] conflict-not-disjoint
[12:29:30] [PASSED] conflict-reg-type
[12:29:30] =========== [PASSED] xe_rtp_process_to_sr_tests ============
[12:29:30] ================== xe_rtp_process_tests  ===================
[12:29:30] [PASSED] active1
[12:29:30] [PASSED] active2
[12:29:30] [PASSED] active-inactive
[12:29:30] [PASSED] inactive-active
[12:29:30] [PASSED] inactive-1st_or_active-inactive
[12:29:30] [PASSED] inactive-2nd_or_active-inactive
[12:29:30] [PASSED] inactive-last_or_active-inactive
[12:29:30] [PASSED] inactive-no_or_active-inactive
[12:29:30] ============== [PASSED] xe_rtp_process_tests ===============
[12:29:30] ===================== [PASSED] xe_rtp ======================
[12:29:30] ==================== xe_wa (1 subtest) =====================
[12:29:30] ======================== xe_wa_gt  =========================
[12:29:30] [PASSED] TIGERLAKE (B0)
[12:29:30] [PASSED] DG1 (A0)
[12:29:30] [PASSED] DG1 (B0)
[12:29:30] [PASSED] ALDERLAKE_S (A0)
[12:29:30] [PASSED] ALDERLAKE_S (B0)
[12:29:30] [PASSED] ALDERLAKE_S (C0)
[12:29:30] [PASSED] ALDERLAKE_S (D0)
[12:29:30] [PASSED] ALDERLAKE_P (A0)
[12:29:30] [PASSED] ALDERLAKE_P (B0)
[12:29:30] [PASSED] ALDERLAKE_P (C0)
[12:29:30] [PASSED] ALDERLAKE_S_RPLS (D0)
[12:29:30] [PASSED] ALDERLAKE_P_RPLU (E0)
[12:29:30] [PASSED] DG2_G10 (C0)
[12:29:30] [PASSED] DG2_G11 (B1)
[12:29:30] [PASSED] DG2_G12 (A1)
[12:29:30] [PASSED] METEORLAKE (g:A0, m:A0)
[12:29:30] [PASSED] METEORLAKE (g:A0, m:A0)
[12:29:30] [PASSED] METEORLAKE (g:A0, m:A0)
[12:29:30] [PASSED] LUNARLAKE (g:A0, m:A0)
[12:29:30] [PASSED] LUNARLAKE (g:B0, m:A0)
[12:29:30] [PASSED] BATTLEMAGE (g:A0, m:A1)
[12:29:30] ==================== [PASSED] xe_wa_gt =====================
[12:29:30] ====================== [PASSED] xe_wa ======================
[12:29:30] ============================================================
[12:29:30] Testing complete. Ran 297 tests: passed: 281, skipped: 16
[12:29:30] Elapsed time: 31.247s total, 4.166s configuring, 26.714s building, 0.314s running

+ /kernel/tools/testing/kunit/kunit.py run --kunitconfig /kernel/drivers/gpu/drm/tests/.kunitconfig
[12:29:30] Configuring KUnit Kernel ...
Regenerating .config ...
Populating config with:
$ make ARCH=um O=.kunit olddefconfig
[12:29:32] Building KUnit Kernel ...
Populating config with:
$ make ARCH=um O=.kunit olddefconfig
Building with:
$ make all compile_commands.json scripts_gdb ARCH=um O=.kunit --jobs=48
[12:29:53] Starting KUnit Kernel (1/1)...
[12:29:53] ============================================================
Running tests with:
$ .kunit/linux kunit.enable=1 mem=1G console=tty kunit_shutdown=halt
[12:29:53] == drm_test_atomic_get_connector_for_encoder (1 subtest) ===
[12:29:53] [PASSED] drm_test_drm_atomic_get_connector_for_encoder
[12:29:53] ==== [PASSED] drm_test_atomic_get_connector_for_encoder ====
[12:29:53] =========== drm_validate_clone_mode (2 subtests) ===========
[12:29:53] ============== drm_test_check_in_clone_mode  ===============
[12:29:53] [PASSED] in_clone_mode
[12:29:53] [PASSED] not_in_clone_mode
[12:29:53] ========== [PASSED] drm_test_check_in_clone_mode ===========
[12:29:53] =============== drm_test_check_valid_clones  ===============
[12:29:53] [PASSED] not_in_clone_mode
[12:29:53] [PASSED] valid_clone
[12:29:53] [PASSED] invalid_clone
[12:29:53] =========== [PASSED] drm_test_check_valid_clones ===========
[12:29:53] ============= [PASSED] drm_validate_clone_mode =============
[12:29:53] ============= drm_validate_modeset (1 subtest) =============
[12:29:53] [PASSED] drm_test_check_connector_changed_modeset
[12:29:53] ============== [PASSED] drm_validate_modeset ===============
[12:29:53] ====== drm_test_bridge_get_current_state (2 subtests) ======
[12:29:53] [PASSED] drm_test_drm_bridge_get_current_state_atomic
[12:29:53] [PASSED] drm_test_drm_bridge_get_current_state_legacy
[12:29:53] ======== [PASSED] drm_test_bridge_get_current_state ========
[12:29:53] ====== drm_test_bridge_helper_reset_crtc (3 subtests) ======
[12:29:53] [PASSED] drm_test_drm_bridge_helper_reset_crtc_atomic
[12:29:53] [PASSED] drm_test_drm_bridge_helper_reset_crtc_atomic_disabled
[12:29:53] [PASSED] drm_test_drm_bridge_helper_reset_crtc_legacy
[12:29:53] ======== [PASSED] drm_test_bridge_helper_reset_crtc ========
[12:29:53] ============== drm_bridge_alloc (2 subtests) ===============
[12:29:53] [PASSED] drm_test_drm_bridge_alloc_basic
[12:29:53] [PASSED] drm_test_drm_bridge_alloc_get_put
[12:29:53] ================ [PASSED] drm_bridge_alloc =================
[12:29:53] ================== drm_buddy (7 subtests) ==================
[12:29:53] [PASSED] drm_test_buddy_alloc_limit
[12:29:53] [PASSED] drm_test_buddy_alloc_optimistic
[12:29:53] [PASSED] drm_test_buddy_alloc_pessimistic
[12:29:53] [PASSED] drm_test_buddy_alloc_pathological
[12:29:53] [PASSED] drm_test_buddy_alloc_contiguous
[12:29:53] [PASSED] drm_test_buddy_alloc_clear
[12:29:53] [PASSED] drm_test_buddy_alloc_range_bias
[12:29:53] ==================== [PASSED] drm_buddy ====================
[12:29:53] ============= drm_cmdline_parser (40 subtests) =============
[12:29:53] [PASSED] drm_test_cmdline_force_d_only
[12:29:53] [PASSED] drm_test_cmdline_force_D_only_dvi
[12:29:53] [PASSED] drm_test_cmdline_force_D_only_hdmi
[12:29:53] [PASSED] drm_test_cmdline_force_D_only_not_digital
[12:29:53] [PASSED] drm_test_cmdline_force_e_only
[12:29:53] [PASSED] drm_test_cmdline_res
[12:29:53] [PASSED] drm_test_cmdline_res_vesa
[12:29:53] [PASSED] drm_test_cmdline_res_vesa_rblank
[12:29:53] [PASSED] drm_test_cmdline_res_rblank
[12:29:53] [PASSED] drm_test_cmdline_res_bpp
[12:29:53] [PASSED] drm_test_cmdline_res_refresh
[12:29:53] [PASSED] drm_test_cmdline_res_bpp_refresh
[12:29:53] [PASSED] drm_test_cmdline_res_bpp_refresh_interlaced
[12:29:53] [PASSED] drm_test_cmdline_res_bpp_refresh_margins
[12:29:53] [PASSED] drm_test_cmdline_res_bpp_refresh_force_off
[12:29:53] [PASSED] drm_test_cmdline_res_bpp_refresh_force_on
[12:29:53] [PASSED] drm_test_cmdline_res_bpp_refresh_force_on_analog
[12:29:53] [PASSED] drm_test_cmdline_res_bpp_refresh_force_on_digital
[12:29:53] [PASSED] drm_test_cmdline_res_bpp_refresh_interlaced_margins_force_on
[12:29:53] [PASSED] drm_test_cmdline_res_margins_force_on
[12:29:53] [PASSED] drm_test_cmdline_res_vesa_margins
[12:29:53] [PASSED] drm_test_cmdline_name
[12:29:53] [PASSED] drm_test_cmdline_name_bpp
[12:29:53] [PASSED] drm_test_cmdline_name_option
[12:29:53] [PASSED] drm_test_cmdline_name_bpp_option
[12:29:53] [PASSED] drm_test_cmdline_rotate_0
[12:29:53] [PASSED] drm_test_cmdline_rotate_90
[12:29:53] [PASSED] drm_test_cmdline_rotate_180
[12:29:53] [PASSED] drm_test_cmdline_rotate_270
[12:29:53] [PASSED] drm_test_cmdline_hmirror
[12:29:53] [PASSED] drm_test_cmdline_vmirror
[12:29:53] [PASSED] drm_test_cmdline_margin_options
[12:29:53] [PASSED] drm_test_cmdline_multiple_options
[12:29:53] [PASSED] drm_test_cmdline_bpp_extra_and_option
[12:29:53] [PASSED] drm_test_cmdline_extra_and_option
[12:29:53] [PASSED] drm_test_cmdline_freestanding_options
[12:29:53] [PASSED] drm_test_cmdline_freestanding_force_e_and_options
[12:29:53] [PASSED] drm_test_cmdline_panel_orientation
[12:29:53] ================ drm_test_cmdline_invalid  =================
[12:29:53] [PASSED] margin_only
[12:29:53] [PASSED] interlace_only
[12:29:53] [PASSED] res_missing_x
[12:29:53] [PASSED] res_missing_y
[12:29:53] [PASSED] res_bad_y
[12:29:53] [PASSED] res_missing_y_bpp
[12:29:53] [PASSED] res_bad_bpp
[12:29:53] [PASSED] res_bad_refresh
[12:29:53] [PASSED] res_bpp_refresh_force_on_off
[12:29:53] [PASSED] res_invalid_mode
[12:29:53] [PASSED] res_bpp_wrong_place_mode
[12:29:53] [PASSED] name_bpp_refresh
[12:29:53] [PASSED] name_refresh
[12:29:53] [PASSED] name_refresh_wrong_mode
[12:29:53] [PASSED] name_refresh_invalid_mode
[12:29:53] [PASSED] rotate_multiple
[12:29:53] [PASSED] rotate_invalid_val
[12:29:53] [PASSED] rotate_truncated
[12:29:53] [PASSED] invalid_option
[12:29:53] [PASSED] invalid_tv_option
[12:29:53] [PASSED] truncated_tv_option
[12:29:53] ============ [PASSED] drm_test_cmdline_invalid =============
[12:29:53] =============== drm_test_cmdline_tv_options  ===============
[12:29:53] [PASSED] NTSC
[12:29:53] [PASSED] NTSC_443
[12:29:53] [PASSED] NTSC_J
[12:29:53] [PASSED] PAL
[12:29:53] [PASSED] PAL_M
[12:29:53] [PASSED] PAL_N
[12:29:53] [PASSED] SECAM
[12:29:53] [PASSED] MONO_525
[12:29:53] [PASSED] MONO_625
[12:29:53] =========== [PASSED] drm_test_cmdline_tv_options ===========
[12:29:53] =============== [PASSED] drm_cmdline_parser ================
[12:29:53] ========== drmm_connector_hdmi_init (20 subtests) ==========
[12:29:53] [PASSED] drm_test_connector_hdmi_init_valid
[12:29:53] [PASSED] drm_test_connector_hdmi_init_bpc_8
[12:29:53] [PASSED] drm_test_connector_hdmi_init_bpc_10
[12:29:53] [PASSED] drm_test_connector_hdmi_init_bpc_12
[12:29:53] [PASSED] drm_test_connector_hdmi_init_bpc_invalid
[12:29:53] [PASSED] drm_test_connector_hdmi_init_bpc_null
[12:29:53] [PASSED] drm_test_connector_hdmi_init_formats_empty
[12:29:53] [PASSED] drm_test_connector_hdmi_init_formats_no_rgb
[12:29:53] === drm_test_connector_hdmi_init_formats_yuv420_allowed  ===
[12:29:53] [PASSED] supported_formats=0x9 yuv420_allowed=1
[12:29:53] [PASSED] supported_formats=0x9 yuv420_allowed=0
[12:29:53] [PASSED] supported_formats=0x3 yuv420_allowed=1
[12:29:53] [PASSED] supported_formats=0x3 yuv420_allowed=0
[12:29:53] === [PASSED] drm_test_connector_hdmi_init_formats_yuv420_allowed ===
[12:29:53] [PASSED] drm_test_connector_hdmi_init_null_ddc
[12:29:53] [PASSED] drm_test_connector_hdmi_init_null_product
[12:29:53] [PASSED] drm_test_connector_hdmi_init_null_vendor
[12:29:53] [PASSED] drm_test_connector_hdmi_init_product_length_exact
[12:29:53] [PASSED] drm_test_connector_hdmi_init_product_length_too_long
[12:29:53] [PASSED] drm_test_connector_hdmi_init_product_valid
[12:29:53] [PASSED] drm_test_connector_hdmi_init_vendor_length_exact
[12:29:53] [PASSED] drm_test_connector_hdmi_init_vendor_length_too_long
[12:29:53] [PASSED] drm_test_connector_hdmi_init_vendor_valid
[12:29:53] ========= drm_test_connector_hdmi_init_type_valid  =========
[12:29:53] [PASSED] HDMI-A
[12:29:53] [PASSED] HDMI-B
[12:29:53] ===== [PASSED] drm_test_connector_hdmi_init_type_valid =====
[12:29:53] ======== drm_test_connector_hdmi_init_type_invalid  ========
[12:29:53] [PASSED] Unknown
[12:29:53] [PASSED] VGA
[12:29:53] [PASSED] DVI-I
[12:29:53] [PASSED] DVI-D
[12:29:53] [PASSED] DVI-A
[12:29:53] [PASSED] Composite
[12:29:53] [PASSED] SVIDEO
[12:29:53] [PASSED] LVDS
[12:29:53] [PASSED] Component
[12:29:53] [PASSED] DIN
[12:29:53] [PASSED] DP
[12:29:53] [PASSED] TV
[12:29:53] [PASSED] eDP
[12:29:53] [PASSED] Virtual
[12:29:53] [PASSED] DSI
[12:29:53] [PASSED] DPI
[12:29:53] [PASSED] Writeback
[12:29:53] [PASSED] SPI
[12:29:53] [PASSED] USB
[12:29:53] ==== [PASSED] drm_test_connector_hdmi_init_type_invalid ====
[12:29:53] ============ [PASSED] drmm_connector_hdmi_init =============
[12:29:53] ============= drmm_connector_init (3 subtests) =============
[12:29:53] [PASSED] drm_test_drmm_connector_init
[12:29:53] [PASSED] drm_test_drmm_connector_init_null_ddc
[12:29:53] ========= drm_test_drmm_connector_init_type_valid  =========
[12:29:53] [PASSED] Unknown
[12:29:53] [PASSED] VGA
[12:29:53] [PASSED] DVI-I
[12:29:53] [PASSED] DVI-D
[12:29:53] [PASSED] DVI-A
[12:29:53] [PASSED] Composite
[12:29:53] [PASSED] SVIDEO
[12:29:53] [PASSED] LVDS
[12:29:53] [PASSED] Component
[12:29:53] [PASSED] DIN
[12:29:53] [PASSED] DP
[12:29:53] [PASSED] HDMI-A
[12:29:53] [PASSED] HDMI-B
[12:29:53] [PASSED] TV
[12:29:53] [PASSED] eDP
[12:29:53] [PASSED] Virtual
[12:29:53] [PASSED] DSI
[12:29:53] [PASSED] DPI
[12:29:53] [PASSED] Writeback
[12:29:53] [PASSED] SPI
[12:29:53] [PASSED] USB
[12:29:53] ===== [PASSED] drm_test_drmm_connector_init_type_valid =====
[12:29:53] =============== [PASSED] drmm_connector_init ===============
[12:29:53] ========= drm_connector_dynamic_init (6 subtests) ==========
[12:29:53] [PASSED] drm_test_drm_connector_dynamic_init
[12:29:53] [PASSED] drm_test_drm_connector_dynamic_init_null_ddc
[12:29:53] [PASSED] drm_test_drm_connector_dynamic_init_not_added
[12:29:53] [PASSED] drm_test_drm_connector_dynamic_init_properties
[12:29:53] ===== drm_test_drm_connector_dynamic_init_type_valid  ======
[12:29:53] [PASSED] Unknown
[12:29:53] [PASSED] VGA
[12:29:53] [PASSED] DVI-I
[12:29:53] [PASSED] DVI-D
[12:29:53] [PASSED] DVI-A
[12:29:53] [PASSED] Composite
[12:29:53] [PASSED] SVIDEO
[12:29:53] [PASSED] LVDS
[12:29:53] [PASSED] Component
[12:29:53] [PASSED] DIN
[12:29:53] [PASSED] DP
[12:29:53] [PASSED] HDMI-A
[12:29:53] [PASSED] HDMI-B
[12:29:53] [PASSED] TV
[12:29:53] [PASSED] eDP
[12:29:53] [PASSED] Virtual
[12:29:53] [PASSED] DSI
[12:29:53] [PASSED] DPI
[12:29:53] [PASSED] Writeback
[12:29:53] [PASSED] SPI
[12:29:53] [PASSED] USB
[12:29:53] = [PASSED] drm_test_drm_connector_dynamic_init_type_valid ==
[12:29:53] ======== drm_test_drm_connector_dynamic_init_name  =========
[12:29:53] [PASSED] Unknown
[12:29:53] [PASSED] VGA
[12:29:53] [PASSED] DVI-I
[12:29:53] [PASSED] DVI-D
[12:29:53] [PASSED] DVI-A
[12:29:53] [PASSED] Composite
[12:29:53] [PASSED] SVIDEO
[12:29:53] [PASSED] LVDS
[12:29:53] [PASSED] Component
[12:29:53] [PASSED] DIN
[12:29:53] [PASSED] DP
[12:29:53] [PASSED] HDMI-A
[12:29:53] [PASSED] HDMI-B
[12:29:53] [PASSED] TV
[12:29:53] [PASSED] eDP
[12:29:53] [PASSED] Virtual
[12:29:53] [PASSED] DSI
[12:29:53] [PASSED] DPI
[12:29:53] [PASSED] Writeback
[12:29:53] [PASSED] SPI
[12:29:53] [PASSED] USB
[12:29:53] ==== [PASSED] drm_test_drm_connector_dynamic_init_name =====
[12:29:53] =========== [PASSED] drm_connector_dynamic_init ============
[12:29:53] ==== drm_connector_dynamic_register_early (4 subtests) =====
[12:29:53] [PASSED] drm_test_drm_connector_dynamic_register_early_on_list
[12:29:53] [PASSED] drm_test_drm_connector_dynamic_register_early_defer
[12:29:53] [PASSED] drm_test_drm_connector_dynamic_register_early_no_init
[12:29:53] [PASSED] drm_test_drm_connector_dynamic_register_early_no_mode_object
[12:29:53] ====== [PASSED] drm_connector_dynamic_register_early =======
[12:29:53] ======= drm_connector_dynamic_register (7 subtests) ========
[12:29:53] [PASSED] drm_test_drm_connector_dynamic_register_on_list
[12:29:53] [PASSED] drm_test_drm_connector_dynamic_register_no_defer
[12:29:53] [PASSED] drm_test_drm_connector_dynamic_register_no_init
[12:29:53] [PASSED] drm_test_drm_connector_dynamic_register_mode_object
[12:29:53] [PASSED] drm_test_drm_connector_dynamic_register_sysfs
[12:29:53] [PASSED] drm_test_drm_connector_dynamic_register_sysfs_name
[12:29:53] [PASSED] drm_test_drm_connector_dynamic_register_debugfs
[12:29:53] ========= [PASSED] drm_connector_dynamic_register ==========
[12:29:53] = drm_connector_attach_broadcast_rgb_property (2 subtests) =
[12:29:53] [PASSED] drm_test_drm_connector_attach_broadcast_rgb_property
[12:29:53] [PASSED] drm_test_drm_connector_attach_broadcast_rgb_property_hdmi_connector
[12:29:53] === [PASSED] drm_connector_attach_broadcast_rgb_property ===
[12:29:53] ========== drm_get_tv_mode_from_name (2 subtests) ==========
[12:29:53] ========== drm_test_get_tv_mode_from_name_valid  ===========
[12:29:53] [PASSED] NTSC
[12:29:53] [PASSED] NTSC-443
[12:29:53] [PASSED] NTSC-J
[12:29:53] [PASSED] PAL
[12:29:53] [PASSED] PAL-M
[12:29:53] [PASSED] PAL-N
[12:29:53] [PASSED] SECAM
[12:29:53] [PASSED] Mono
[12:29:53] ====== [PASSED] drm_test_get_tv_mode_from_name_valid =======
[12:29:53] [PASSED] drm_test_get_tv_mode_from_name_truncated
[12:29:53] ============ [PASSED] drm_get_tv_mode_from_name ============
[12:29:53] = drm_test_connector_hdmi_compute_mode_clock (12 subtests) =
[12:29:53] [PASSED] drm_test_drm_hdmi_compute_mode_clock_rgb
[12:29:53] [PASSED] drm_test_drm_hdmi_compute_mode_clock_rgb_10bpc
[12:29:53] [PASSED] drm_test_drm_hdmi_compute_mode_clock_rgb_10bpc_vic_1
[12:29:53] [PASSED] drm_test_drm_hdmi_compute_mode_clock_rgb_12bpc
[12:29:53] [PASSED] drm_test_drm_hdmi_compute_mode_clock_rgb_12bpc_vic_1
[12:29:53] [PASSED] drm_test_drm_hdmi_compute_mode_clock_rgb_double
[12:29:53] = drm_test_connector_hdmi_compute_mode_clock_yuv420_valid  =
[12:29:53] [PASSED] VIC 96
[12:29:53] [PASSED] VIC 97
[12:29:53] [PASSED] VIC 101
[12:29:53] [PASSED] VIC 102
[12:29:53] [PASSED] VIC 106
[12:29:53] [PASSED] VIC 107
[12:29:53] === [PASSED] drm_test_connector_hdmi_compute_mode_clock_yuv420_valid ===
[12:29:53] [PASSED] drm_test_connector_hdmi_compute_mode_clock_yuv420_10_bpc
[12:29:53] [PASSED] drm_test_connector_hdmi_compute_mode_clock_yuv420_12_bpc
[12:29:53] [PASSED] drm_test_connector_hdmi_compute_mode_clock_yuv422_8_bpc
[12:29:53] [PASSED] drm_test_connector_hdmi_compute_mode_clock_yuv422_10_bpc
[12:29:53] [PASSED] drm_test_connector_hdmi_compute_mode_clock_yuv422_12_bpc
[12:29:53] === [PASSED] drm_test_connector_hdmi_compute_mode_clock ====
[12:29:53] == drm_hdmi_connector_get_broadcast_rgb_name (2 subtests) ==
[12:29:53] === drm_test_drm_hdmi_connector_get_broadcast_rgb_name  ====
[12:29:53] [PASSED] Automatic
[12:29:53] [PASSED] Full
[12:29:53] [PASSED] Limited 16:235
[12:29:53] === [PASSED] drm_test_drm_hdmi_connector_get_broadcast_rgb_name ===
[12:29:53] [PASSED] drm_test_drm_hdmi_connector_get_broadcast_rgb_name_invalid
[12:29:53] ==== [PASSED] drm_hdmi_connector_get_broadcast_rgb_name ====
[12:29:53] == drm_hdmi_connector_get_output_format_name (2 subtests) ==
[12:29:53] === drm_test_drm_hdmi_connector_get_output_format_name  ====
[12:29:53] [PASSED] RGB
[12:29:53] [PASSED] YUV 4:2:0
[12:29:53] [PASSED] YUV 4:2:2
[12:29:53] [PASSED] YUV 4:4:4
[12:29:53] === [PASSED] drm_test_drm_hdmi_connector_get_output_format_name ===
[12:29:53] [PASSED] drm_test_drm_hdmi_connector_get_output_format_name_invalid
[12:29:53] ==== [PASSED] drm_hdmi_connector_get_output_format_name ====
[12:29:53] ============= drm_damage_helper (21 subtests) ==============
[12:29:53] [PASSED] drm_test_damage_iter_no_damage
[12:29:53] [PASSED] drm_test_damage_iter_no_damage_fractional_src
[12:29:53] [PASSED] drm_test_damage_iter_no_damage_src_moved
[12:29:53] [PASSED] drm_test_damage_iter_no_damage_fractional_src_moved
[12:29:53] [PASSED] drm_test_damage_iter_no_damage_not_visible
[12:29:53] [PASSED] drm_test_damage_iter_no_damage_no_crtc
[12:29:53] [PASSED] drm_test_damage_iter_no_damage_no_fb
[12:29:53] [PASSED] drm_test_damage_iter_simple_damage
[12:29:53] [PASSED] drm_test_damage_iter_single_damage
[12:29:53] [PASSED] drm_test_damage_iter_single_damage_intersect_src
[12:29:53] [PASSED] drm_test_damage_iter_single_damage_outside_src
[12:29:53] [PASSED] drm_test_damage_iter_single_damage_fractional_src
[12:29:53] [PASSED] drm_test_damage_iter_single_damage_intersect_fractional_src
[12:29:53] [PASSED] drm_test_damage_iter_single_damage_outside_fractional_src
[12:29:53] [PASSED] drm_test_damage_iter_single_damage_src_moved
[12:29:53] [PASSED] drm_test_damage_iter_single_damage_fractional_src_moved
[12:29:53] [PASSED] drm_test_damage_iter_damage
[12:29:53] [PASSED] drm_test_damage_iter_damage_one_intersect
[12:29:53] [PASSED] drm_test_damage_iter_damage_one_outside
[12:29:53] [PASSED] drm_test_damage_iter_damage_src_moved
[12:29:53] [PASSED] drm_test_damage_iter_damage_not_visible
[12:29:53] ================ [PASSED] drm_damage_helper ================
[12:29:53] ============== drm_dp_mst_helper (3 subtests) ==============
[12:29:53] ============== drm_test_dp_mst_calc_pbn_mode  ==============
[12:29:53] [PASSED] Clock 154000 BPP 30 DSC disabled
[12:29:53] [PASSED] Clock 234000 BPP 30 DSC disabled
[12:29:53] [PASSED] Clock 297000 BPP 24 DSC disabled
[12:29:53] [PASSED] Clock 332880 BPP 24 DSC enabled
[12:29:53] [PASSED] Clock 324540 BPP 24 DSC enabled
[12:29:53] ========== [PASSED] drm_test_dp_mst_calc_pbn_mode ==========
[12:29:53] ============== drm_test_dp_mst_calc_pbn_div  ===============
[12:29:53] [PASSED] Link rate 2000000 lane count 4
[12:29:53] [PASSED] Link rate 2000000 lane count 2
[12:29:53] [PASSED] Link rate 2000000 lane count 1
[12:29:53] [PASSED] Link rate 1350000 lane count 4
[12:29:53] [PASSED] Link rate 1350000 lane count 2
[12:29:53] [PASSED] Link rate 1350000 lane count 1
[12:29:53] [PASSED] Link rate 1000000 lane count 4
[12:29:53] [PASSED] Link rate 1000000 lane count 2
[12:29:53] [PASSED] Link rate 1000000 lane count 1
[12:29:53] [PASSED] Link rate 810000 lane count 4
[12:29:53] [PASSED] Link rate 810000 lane count 2
[12:29:53] [PASSED] Link rate 810000 lane count 1
[12:29:53] [PASSED] Link rate 540000 lane count 4
[12:29:53] [PASSED] Link rate 540000 lane count 2
[12:29:53] [PASSED] Link rate 540000 lane count 1
[12:29:53] [PASSED] Link rate 270000 lane count 4
[12:29:53] [PASSED] Link rate 270000 lane count 2
[12:29:53] [PASSED] Link rate 270000 lane count 1
[12:29:53] [PASSED] Link rate 162000 lane count 4
[12:29:53] [PASSED] Link rate 162000 lane count 2
[12:29:53] [PASSED] Link rate 162000 lane count 1
[12:29:53] ========== [PASSED] drm_test_dp_mst_calc_pbn_div ===========
[12:29:53] ========= drm_test_dp_mst_sideband_msg_req_decode  =========
[12:29:53] [PASSED] DP_ENUM_PATH_RESOURCES with port number
[12:29:53] [PASSED] DP_POWER_UP_PHY with port number
[12:29:53] [PASSED] DP_POWER_DOWN_PHY with port number
[12:29:53] [PASSED] DP_ALLOCATE_PAYLOAD with SDP stream sinks
[12:29:53] [PASSED] DP_ALLOCATE_PAYLOAD with port number
[12:29:53] [PASSED] DP_ALLOCATE_PAYLOAD with VCPI
[12:29:53] [PASSED] DP_ALLOCATE_PAYLOAD with PBN
[12:29:53] [PASSED] DP_QUERY_PAYLOAD with port number
[12:29:53] [PASSED] DP_QUERY_PAYLOAD with VCPI
[12:29:53] [PASSED] DP_REMOTE_DPCD_READ with port number
[12:29:53] [PASSED] DP_REMOTE_DPCD_READ with DPCD address
[12:29:53] [PASSED] DP_REMOTE_DPCD_READ with max number of bytes
[12:29:53] [PASSED] DP_REMOTE_DPCD_WRITE with port number
[12:29:53] [PASSED] DP_REMOTE_DPCD_WRITE with DPCD address
[12:29:53] [PASSED] DP_REMOTE_DPCD_WRITE with data array
[12:29:53] [PASSED] DP_REMOTE_I2C_READ with port number
[12:29:53] [PASSED] DP_REMOTE_I2C_READ with I2C device ID
[12:29:53] [PASSED] DP_REMOTE_I2C_READ with transactions array
[12:29:53] [PASSED] DP_REMOTE_I2C_WRITE with port number
[12:29:53] [PASSED] DP_REMOTE_I2C_WRITE with I2C device ID
[12:29:53] [PASSED] DP_REMOTE_I2C_WRITE with data array
[12:29:53] [PASSED] DP_QUERY_STREAM_ENC_STATUS with stream ID
[12:29:53] [PASSED] DP_QUERY_STREAM_ENC_STATUS with client ID
[12:29:53] [PASSED] DP_QUERY_STREAM_ENC_STATUS with stream event
[12:29:53] [PASSED] DP_QUERY_STREAM_ENC_STATUS with valid stream event
[12:29:53] [PASSED] DP_QUERY_STREAM_ENC_STATUS with stream behavior
[12:29:53] [PASSED] DP_QUERY_STREAM_ENC_STATUS with a valid stream behavior
[12:29:53] ===== [PASSED] drm_test_dp_mst_sideband_msg_req_decode =====
[12:29:53] ================ [PASSED] drm_dp_mst_helper ================
[12:29:53] ================== drm_exec (7 subtests) ===================
[12:29:53] [PASSED] sanitycheck
[12:29:53] [PASSED] test_lock
[12:29:53] [PASSED] test_lock_unlock
[12:29:53] [PASSED] test_duplicates
[12:29:53] [PASSED] test_prepare
[12:29:53] [PASSED] test_prepare_array
[12:29:53] [PASSED] test_multiple_loops
[12:29:53] ==================== [PASSED] drm_exec =====================
[12:29:53] =========== drm_format_helper_test (17 subtests) ===========
[12:29:53] ============== drm_test_fb_xrgb8888_to_gray8  ==============
[12:29:53] [PASSED] single_pixel_source_buffer
[12:29:53] [PASSED] single_pixel_clip_rectangle
[12:29:53] [PASSED] well_known_colors
[12:29:53] [PASSED] destination_pitch
[12:29:53] ========== [PASSED] drm_test_fb_xrgb8888_to_gray8 ==========
[12:29:53] ============= drm_test_fb_xrgb8888_to_rgb332  ==============
[12:29:53] [PASSED] single_pixel_source_buffer
[12:29:53] [PASSED] single_pixel_clip_rectangle
[12:29:53] [PASSED] well_known_colors
[12:29:53] [PASSED] destination_pitch
[12:29:53] ========= [PASSED] drm_test_fb_xrgb8888_to_rgb332 ==========
[12:29:53] ============= drm_test_fb_xrgb8888_to_rgb565  ==============
[12:29:53] [PASSED] single_pixel_source_buffer
[12:29:53] [PASSED] single_pixel_clip_rectangle
[12:29:53] [PASSED] well_known_colors
[12:29:53] [PASSED] destination_pitch
[12:29:53] ========= [PASSED] drm_test_fb_xrgb8888_to_rgb565 ==========
[12:29:53] ============ drm_test_fb_xrgb8888_to_xrgb1555  =============
[12:29:53] [PASSED] single_pixel_source_buffer
[12:29:53] [PASSED] single_pixel_clip_rectangle
[12:29:53] [PASSED] well_known_colors
[12:29:53] [PASSED] destination_pitch
[12:29:53] ======== [PASSED] drm_test_fb_xrgb8888_to_xrgb1555 =========
[12:29:53] ============ drm_test_fb_xrgb8888_to_argb1555  =============
[12:29:53] [PASSED] single_pixel_source_buffer
[12:29:53] [PASSED] single_pixel_clip_rectangle
[12:29:53] [PASSED] well_known_colors
[12:29:53] [PASSED] destination_pitch
[12:29:53] ======== [PASSED] drm_test_fb_xrgb8888_to_argb1555 =========
[12:29:53] ============ drm_test_fb_xrgb8888_to_rgba5551  =============
[12:29:53] [PASSED] single_pixel_source_buffer
[12:29:53] [PASSED] single_pixel_clip_rectangle
[12:29:53] [PASSED] well_known_colors
[12:29:53] [PASSED] destination_pitch
[12:29:53] ======== [PASSED] drm_test_fb_xrgb8888_to_rgba5551 =========
[12:29:53] ============= drm_test_fb_xrgb8888_to_rgb888  ==============
[12:29:53] [PASSED] single_pixel_source_buffer
[12:29:53] [PASSED] single_pixel_clip_rectangle
[12:29:53] [PASSED] well_known_colors
[12:29:53] [PASSED] destination_pitch
[12:29:53] ========= [PASSED] drm_test_fb_xrgb8888_to_rgb888 ==========
[12:29:53] ============= drm_test_fb_xrgb8888_to_bgr888  ==============
[12:29:53] [PASSED] single_pixel_source_buffer
[12:29:53] [PASSED] single_pixel_clip_rectangle
[12:29:53] [PASSED] well_known_colors
[12:29:53] [PASSED] destination_pitch
[12:29:53] ========= [PASSED] drm_test_fb_xrgb8888_to_bgr888 ==========
[12:29:53] ============ drm_test_fb_xrgb8888_to_argb8888  =============
[12:29:53] [PASSED] single_pixel_source_buffer
[12:29:53] [PASSED] single_pixel_clip_rectangle
[12:29:53] [PASSED] well_known_colors
[12:29:53] [PASSED] destination_pitch
[12:29:53] ======== [PASSED] drm_test_fb_xrgb8888_to_argb8888 =========
[12:29:53] =========== drm_test_fb_xrgb8888_to_xrgb2101010  ===========
[12:29:53] [PASSED] single_pixel_source_buffer
[12:29:53] [PASSED] single_pixel_clip_rectangle
[12:29:53] [PASSED] well_known_colors
[12:29:53] [PASSED] destination_pitch
[12:29:53] ======= [PASSED] drm_test_fb_xrgb8888_to_xrgb2101010 =======
[12:29:53] =========== drm_test_fb_xrgb8888_to_argb2101010  ===========
[12:29:53] [PASSED] single_pixel_source_buffer
[12:29:53] [PASSED] single_pixel_clip_rectangle
[12:29:53] [PASSED] well_known_colors
[12:29:53] [PASSED] destination_pitch
[12:29:53] ======= [PASSED] drm_test_fb_xrgb8888_to_argb2101010 =======
[12:29:53] ============== drm_test_fb_xrgb8888_to_mono  ===============
[12:29:53] [PASSED] single_pixel_source_buffer
[12:29:53] [PASSED] single_pixel_clip_rectangle
[12:29:53] [PASSED] well_known_colors
[12:29:53] [PASSED] destination_pitch
[12:29:53] ========== [PASSED] drm_test_fb_xrgb8888_to_mono ===========
[12:29:53] ==================== drm_test_fb_swab  =====================
[12:29:53] [PASSED] single_pixel_source_buffer
[12:29:53] [PASSED] single_pixel_clip_rectangle
[12:29:53] [PASSED] well_known_colors
[12:29:53] [PASSED] destination_pitch
[12:29:53] ================ [PASSED] drm_test_fb_swab =================
[12:29:53] ============ drm_test_fb_xrgb8888_to_xbgr8888  =============
[12:29:53] [PASSED] single_pixel_source_buffer
[12:29:53] [PASSED] single_pixel_clip_rectangle
[12:29:53] [PASSED] well_known_colors
[12:29:53] [PASSED] destination_pitch
[12:29:53] ======== [PASSED] drm_test_fb_xrgb8888_to_xbgr8888 =========
[12:29:53] ============ drm_test_fb_xrgb8888_to_abgr8888  =============
[12:29:53] [PASSED] single_pixel_source_buffer
[12:29:53] [PASSED] single_pixel_clip_rectangle
[12:29:53] [PASSED] well_known_colors
[12:29:53] [PASSED] destination_pitch
[12:29:53] ======== [PASSED] drm_test_fb_xrgb8888_to_abgr8888 =========
[12:29:53] ================= drm_test_fb_clip_offset  =================
[12:29:53] [PASSED] pass through
[12:29:53] [PASSED] horizontal offset
[12:29:53] [PASSED] vertical offset
[12:29:53] [PASSED] horizontal and vertical offset
[12:29:53] [PASSED] horizontal offset (custom pitch)
[12:29:53] [PASSED] vertical offset (custom pitch)
[12:29:53] [PASSED] horizontal and vertical offset (custom pitch)
[12:29:53] ============= [PASSED] drm_test_fb_clip_offset =============
[12:29:53] =================== drm_test_fb_memcpy  ====================
[12:29:53] [PASSED] single_pixel_source_buffer: XR24 little-endian (0x34325258)
[12:29:53] [PASSED] single_pixel_source_buffer: XRA8 little-endian (0x38415258)
[12:29:53] [PASSED] single_pixel_source_buffer: YU24 little-endian (0x34325559)
[12:29:53] [PASSED] single_pixel_clip_rectangle: XB24 little-endian (0x34324258)
[12:29:53] [PASSED] single_pixel_clip_rectangle: XRA8 little-endian (0x38415258)
[12:29:53] [PASSED] single_pixel_clip_rectangle: YU24 little-endian (0x34325559)
[12:29:53] [PASSED] well_known_colors: XB24 little-endian (0x34324258)
[12:29:53] [PASSED] well_known_colors: XRA8 little-endian (0x38415258)
[12:29:53] [PASSED] well_known_colors: YU24 little-endian (0x34325559)
[12:29:53] [PASSED] destination_pitch: XB24 little-endian (0x34324258)
[12:29:53] [PASSED] destination_pitch: XRA8 little-endian (0x38415258)
[12:29:53] [PASSED] destination_pitch: YU24 little-endian (0x34325559)
[12:29:53] =============== [PASSED] drm_test_fb_memcpy ================
[12:29:53] ============= [PASSED] drm_format_helper_test ==============
[12:29:53] ================= drm_format (18 subtests) =================
[12:29:53] [PASSED] drm_test_format_block_width_invalid
[12:29:53] [PASSED] drm_test_format_block_width_one_plane
[12:29:53] [PASSED] drm_test_format_block_width_two_plane
[12:29:53] [PASSED] drm_test_format_block_width_three_plane
[12:29:53] [PASSED] drm_test_format_block_width_tiled
[12:29:53] [PASSED] drm_test_format_block_height_invalid
[12:29:53] [PASSED] drm_test_format_block_height_one_plane
[12:29:53] [PASSED] drm_test_format_block_height_two_plane
[12:29:53] [PASSED] drm_test_format_block_height_three_plane
[12:29:53] [PASSED] drm_test_format_block_height_tiled
[12:29:53] [PASSED] drm_test_format_min_pitch_invalid
[12:29:53] [PASSED] drm_test_format_min_pitch_one_plane_8bpp
[12:29:53] [PASSED] drm_test_format_min_pitch_one_plane_16bpp
[12:29:53] [PASSED] drm_test_format_min_pitch_one_plane_24bpp
[12:29:53] [PASSED] drm_test_format_min_pitch_one_plane_32bpp
[12:29:53] [PASSED] drm_test_format_min_pitch_two_plane
[12:29:53] [PASSED] drm_test_format_min_pitch_three_plane_8bpp
[12:29:53] [PASSED] drm_test_format_min_pitch_tiled
[12:29:53] =================== [PASSED] drm_format ====================
[12:29:53] ============== drm_framebuffer (10 subtests) ===============
[12:29:53] ========== drm_test_framebuffer_check_src_coords  ==========
[12:29:53] [PASSED] Success: source fits into fb
[12:29:53] [PASSED] Fail: overflowing fb with x-axis coordinate
[12:29:53] [PASSED] Fail: overflowing fb with y-axis coordinate
[12:29:53] [PASSED] Fail: overflowing fb with source width
[12:29:53] [PASSED] Fail: overflowing fb with source height
[12:29:53] ====== [PASSED] drm_test_framebuffer_check_src_coords ======
[12:29:53] [PASSED] drm_test_framebuffer_cleanup
[12:29:53] =============== drm_test_framebuffer_create  ===============
[12:29:53] [PASSED] ABGR8888 normal sizes
[12:29:53] [PASSED] ABGR8888 max sizes
[12:29:53] [PASSED] ABGR8888 pitch greater than min required
[12:29:53] [PASSED] ABGR8888 pitch less than min required
[12:29:53] [PASSED] ABGR8888 Invalid width
[12:29:53] [PASSED] ABGR8888 Invalid buffer handle
[12:29:53] [PASSED] No pixel format
[12:29:53] [PASSED] ABGR8888 Width 0
[12:29:53] [PASSED] ABGR8888 Height 0
[12:29:53] [PASSED] ABGR8888 Out of bound height * pitch combination
[12:29:53] [PASSED] ABGR8888 Large buffer offset
[12:29:53] [PASSED] ABGR8888 Buffer offset for inexistent plane
[12:29:53] [PASSED] ABGR8888 Invalid flag
[12:29:53] [PASSED] ABGR8888 Set DRM_MODE_FB_MODIFIERS without modifiers
[12:29:53] [PASSED] ABGR8888 Valid buffer modifier
[12:29:53] [PASSED] ABGR8888 Invalid buffer modifier(DRM_FORMAT_MOD_SAMSUNG_64_32_TILE)
[12:29:53] [PASSED] ABGR8888 Extra pitches without DRM_MODE_FB_MODIFIERS
[12:29:53] [PASSED] ABGR8888 Extra pitches with DRM_MODE_FB_MODIFIERS
[12:29:53] [PASSED] NV12 Normal sizes
[12:29:53] [PASSED] NV12 Max sizes
[12:29:53] [PASSED] NV12 Invalid pitch
[12:29:53] [PASSED] NV12 Invalid modifier/missing DRM_MODE_FB_MODIFIERS flag
[12:29:53] [PASSED] NV12 different  modifier per-plane
[12:29:53] [PASSED] NV12 with DRM_FORMAT_MOD_SAMSUNG_64_32_TILE
[12:29:53] [PASSED] NV12 Valid modifiers without DRM_MODE_FB_MODIFIERS
[12:29:53] [PASSED] NV12 Modifier for inexistent plane
[12:29:53] [PASSED] NV12 Handle for inexistent plane
[12:29:53] [PASSED] NV12 Handle for inexistent plane without DRM_MODE_FB_MODIFIERS
[12:29:53] [PASSED] YVU420 DRM_MODE_FB_MODIFIERS set without modifier
[12:29:53] [PASSED] YVU420 Normal sizes
[12:29:53] [PASSED] YVU420 Max sizes
[12:29:53] [PASSED] YVU420 Invalid pitch
[12:29:53] [PASSED] YVU420 Different pitches
[12:29:53] [PASSED] YVU420 Different buffer offsets/pitches
[12:29:53] [PASSED] YVU420 Modifier set just for plane 0, without DRM_MODE_FB_MODIFIERS
[12:29:53] [PASSED] YVU420 Modifier set just for planes 0, 1, without DRM_MODE_FB_MODIFIERS
[12:29:53] [PASSED] YVU420 Modifier set just for plane 0, 1, with DRM_MODE_FB_MODIFIERS
[12:29:53] [PASSED] YVU420 Valid modifier
[12:29:53] [PASSED] YVU420 Different modifiers per plane
[12:29:53] [PASSED] YVU420 Modifier for inexistent plane
[12:29:53] [PASSED] YUV420_10BIT Invalid modifier(DRM_FORMAT_MOD_LINEAR)
[12:29:53] [PASSED] X0L2 Normal sizes
[12:29:53] [PASSED] X0L2 Max sizes
[12:29:53] [PASSED] X0L2 Invalid pitch
[12:29:53] [PASSED] X0L2 Pitch greater than minimum required
[12:29:53] [PASSED] X0L2 Handle for inexistent plane
[12:29:53] [PASSED] X0L2 Offset for inexistent plane, without DRM_MODE_FB_MODIFIERS set
[12:29:53] [PASSED] X0L2 Modifier without DRM_MODE_FB_MODIFIERS set
[12:29:53] [PASSED] X0L2 Valid modifier
[12:29:53] [PASSED] X0L2 Modifier for inexistent plane
[12:29:53] =========== [PASSED] drm_test_framebuffer_create ===========
[12:29:53] [PASSED] drm_test_framebuffer_free
[12:29:53] [PASSED] drm_test_framebuffer_init
[12:29:53] [PASSED] drm_test_framebuffer_init_bad_format
[12:29:53] [PASSED] drm_test_framebuffer_init_dev_mismatch
[12:29:53] [PASSED] drm_test_framebuffer_lookup
[12:29:53] [PASSED] drm_test_framebuffer_lookup_inexistent
[12:29:53] [PASSED] drm_test_framebuffer_modifiers_not_supported
[12:29:53] ================= [PASSED] drm_framebuffer =================
[12:29:53] ================ drm_gem_shmem (8 subtests) ================
[12:29:53] [PASSED] drm_gem_shmem_test_obj_create
[12:29:53] [PASSED] drm_gem_shmem_test_obj_create_private
[12:29:53] [PASSED] drm_gem_shmem_test_pin_pages
[12:29:53] [PASSED] drm_gem_shmem_test_vmap
[12:29:53] [PASSED] drm_gem_shmem_test_get_pages_sgt
[12:29:53] [PASSED] drm_gem_shmem_test_get_sg_table
[12:29:53] [PASSED] drm_gem_shmem_test_madvise
[12:29:53] [PASSED] drm_gem_shmem_test_purge
[12:29:53] ================== [PASSED] drm_gem_shmem ==================
[12:29:53] === drm_atomic_helper_connector_hdmi_check (27 subtests) ===
[12:29:53] [PASSED] drm_test_check_broadcast_rgb_auto_cea_mode
[12:29:53] [PASSED] drm_test_check_broadcast_rgb_auto_cea_mode_vic_1
[12:29:53] [PASSED] drm_test_check_broadcast_rgb_full_cea_mode
[12:29:53] [PASSED] drm_test_check_broadcast_rgb_full_cea_mode_vic_1
[12:29:53] [PASSED] drm_test_check_broadcast_rgb_limited_cea_mode
[12:29:53] [PASSED] drm_test_check_broadcast_rgb_limited_cea_mode_vic_1
[12:29:53] ====== drm_test_check_broadcast_rgb_cea_mode_yuv420  =======
[12:29:53] [PASSED] Automatic
[12:29:53] [PASSED] Full
[12:29:53] [PASSED] Limited 16:235
[12:29:53] == [PASSED] drm_test_check_broadcast_rgb_cea_mode_yuv420 ===
[12:29:53] [PASSED] drm_test_check_broadcast_rgb_crtc_mode_changed
[12:29:53] [PASSED] drm_test_check_broadcast_rgb_crtc_mode_not_changed
[12:29:53] [PASSED] drm_test_check_disable_connector
[12:29:53] [PASSED] drm_test_check_hdmi_funcs_reject_rate
[12:29:53] [PASSED] drm_test_check_max_tmds_rate_bpc_fallback_rgb
[12:29:53] [PASSED] drm_test_check_max_tmds_rate_bpc_fallback_yuv420
[12:29:53] [PASSED] drm_test_check_max_tmds_rate_bpc_fallback_ignore_yuv422
[12:29:53] [PASSED] drm_test_check_max_tmds_rate_bpc_fallback_ignore_yuv420
[12:29:53] [PASSED] drm_test_check_driver_unsupported_fallback_yuv420
[12:29:53] [PASSED] drm_test_check_output_bpc_crtc_mode_changed
[12:29:53] [PASSED] drm_test_check_output_bpc_crtc_mode_not_changed
[12:29:53] [PASSED] drm_test_check_output_bpc_dvi
[12:29:53] [PASSED] drm_test_check_output_bpc_format_vic_1
[12:29:53] [PASSED] drm_test_check_output_bpc_format_display_8bpc_only
[12:29:53] [PASSED] drm_test_check_output_bpc_format_display_rgb_only
[12:29:53] [PASSED] drm_test_check_output_bpc_format_driver_8bpc_only
[12:29:53] [PASSED] drm_test_check_output_bpc_format_driver_rgb_only
[12:29:53] [PASSED] drm_test_check_tmds_char_rate_rgb_8bpc
[12:29:53] [PASSED] drm_test_check_tmds_char_rate_rgb_10bpc
[12:29:53] [PASSED] drm_test_check_tmds_char_rate_rgb_12bpc
[12:29:53] ===== [PASSED] drm_atomic_helper_connector_hdmi_check ======
[12:29:53] === drm_atomic_helper_connector_hdmi_reset (6 subtests) ====
[12:29:53] [PASSED] drm_test_check_broadcast_rgb_value
[12:29:53] [PASSED] drm_test_check_bpc_8_value
[12:29:53] [PASSED] drm_test_check_bpc_10_value
[12:29:53] [PASSED] drm_test_check_bpc_12_value
[12:29:53] [PASSED] drm_test_check_format_value
[12:29:53] [PASSED] drm_test_check_tmds_char_value
[12:29:53] ===== [PASSED] drm_atomic_helper_connector_hdmi_reset ======
[12:29:53] = drm_atomic_helper_connector_hdmi_mode_valid (4 subtests) =
[12:29:53] [PASSED] drm_test_check_mode_valid
[12:29:53] [PASSED] drm_test_check_mode_valid_reject
[12:29:53] [PASSED] drm_test_check_mode_valid_reject_rate
[12:29:53] [PASSED] drm_test_check_mode_valid_reject_max_clock
[12:29:53] === [PASSED] drm_atomic_helper_connector_hdmi_mode_valid ===
[12:29:53] ================= drm_managed (2 subtests) =================
[12:29:53] [PASSED] drm_test_managed_release_action
[12:29:53] [PASSED] drm_test_managed_run_action
[12:29:53] =================== [PASSED] drm_managed ===================
[12:29:53] =================== drm_mm (6 subtests) ====================
[12:29:53] [PASSED] drm_test_mm_init
[12:29:53] [PASSED] drm_test_mm_debug
[12:29:53] [PASSED] drm_test_mm_align32
[12:29:53] [PASSED] drm_test_mm_align64
[12:29:53] [PASSED] drm_test_mm_lowest
[12:29:53] [PASSED] drm_test_mm_highest
[12:29:53] ===================== [PASSED] drm_mm ======================
[12:29:53] ============= drm_modes_analog_tv (5 subtests) =============
[12:29:53] [PASSED] drm_test_modes_analog_tv_mono_576i
[12:29:53] [PASSED] drm_test_modes_analog_tv_ntsc_480i
[12:29:53] [PASSED] drm_test_modes_analog_tv_ntsc_480i_inlined
[12:29:53] [PASSED] drm_test_modes_analog_tv_pal_576i
[12:29:53] [PASSED] drm_test_modes_analog_tv_pal_576i_inlined
[12:29:53] =============== [PASSED] drm_modes_analog_tv ===============
[12:29:53] ============== drm_plane_helper (2 subtests) ===============
[12:29:53] =============== drm_test_check_plane_state  ================
[12:29:53] [PASSED] clipping_simple
[12:29:53] [PASSED] clipping_rotate_reflect
[12:29:53] [PASSED] positioning_simple
[12:29:53] [PASSED] upscaling
[12:29:53] [PASSED] downscaling
[12:29:53] [PASSED] rounding1
[12:29:53] [PASSED] rounding2
[12:29:53] [PASSED] rounding3
[12:29:53] [PASSED] rounding4
[12:29:53] =========== [PASSED] drm_test_check_plane_state ============
[12:29:53] =========== drm_test_check_invalid_plane_state  ============
[12:29:53] [PASSED] positioning_invalid
[12:29:53] [PASSED] upscaling_invalid
[12:29:53] [PASSED] downscaling_invalid
[12:29:53] ======= [PASSED] drm_test_check_invalid_plane_state ========
[12:29:53] ================ [PASSED] drm_plane_helper =================
[12:29:53] ====== drm_connector_helper_tv_get_modes (1 subtest) =======
[12:29:53] ====== drm_test_connector_helper_tv_get_modes_check  =======
[12:29:53] [PASSED] None
[12:29:53] [PASSED] PAL
[12:29:53] [PASSED] NTSC
[12:29:53] [PASSED] Both, NTSC Default
[12:29:53] [PASSED] Both, PAL Default
[12:29:53] [PASSED] Both, NTSC Default, with PAL on command-line
[12:29:53] [PASSED] Both, PAL Default, with NTSC on command-line
[12:29:53] == [PASSED] drm_test_connector_helper_tv_get_modes_check ===
[12:29:53] ======== [PASSED] drm_connector_helper_tv_get_modes ========
[12:29:53] ================== drm_rect (9 subtests) ===================
[12:29:53] [PASSED] drm_test_rect_clip_scaled_div_by_zero
[12:29:53] [PASSED] drm_test_rect_clip_scaled_not_clipped
[12:29:53] [PASSED] drm_test_rect_clip_scaled_clipped
[12:29:53] [PASSED] drm_test_rect_clip_scaled_signed_vs_unsigned
[12:29:53] ================= drm_test_rect_intersect  =================
[12:29:53] [PASSED] top-left x bottom-right: 2x2+1+1 x 2x2+0+0
[12:29:53] [PASSED] top-right x bottom-left: 2x2+0+0 x 2x2+1-1
[12:29:53] [PASSED] bottom-left x top-right: 2x2+1-1 x 2x2+0+0
[12:29:53] [PASSED] bottom-right x top-left: 2x2+0+0 x 2x2+1+1
[12:29:53] [PASSED] right x left: 2x1+0+0 x 3x1+1+0
[12:29:53] [PASSED] left x right: 3x1+1+0 x 2x1+0+0
[12:29:53] [PASSED] up x bottom: 1x2+0+0 x 1x3+0-1
[12:29:53] [PASSED] bottom x up: 1x3+0-1 x 1x2+0+0
[12:29:53] [PASSED] touching corner: 1x1+0+0 x 2x2+1+1
[12:29:53] [PASSED] touching side: 1x1+0+0 x 1x1+1+0
[12:29:53] [PASSED] equal rects: 2x2+0+0 x 2x2+0+0
[12:29:53] [PASSED] inside another: 2x2+0+0 x 1x1+1+1
[12:29:53] [PASSED] far away: 1x1+0+0 x 1x1+3+6
[12:29:53] [PASSED] points intersecting: 0x0+5+10 x 0x0+5+10
[12:29:53] [PASSED] points not intersecting: 0x0+0+0 x 0x0+5+10
[12:29:53] ============= [PASSED] drm_test_rect_intersect =============
[12:29:53] ================ drm_test_rect_calc_hscale  ================
[12:29:53] [PASSED] normal use
[12:29:53] [PASSED] out of max range
[12:29:53] [PASSED] out of min range
[12:29:53] [PASSED] zero dst
[12:29:53] [PASSED] negative src
[12:29:53] [PASSED] negative dst
[12:29:53] ============ [PASSED] drm_test_rect_calc_hscale ============
[12:29:53] ================ drm_test_rect_calc_vscale  ================
[12:29:53] [PASSED] normal use
[12:29:53] [PASSED] out of max range
[12:29:53] [PASSED] out of min range
[12:29:53] [PASSED] zero dst
[12:29:53] [PASSED] negative src
[12:29:53] [PASSED] negative dst
[12:29:53] ============ [PASSED] drm_test_rect_calc_vscale ============
[12:29:53] ================== drm_test_rect_rotate  ===================
[12:29:53] [PASSED] reflect-x
[12:29:53] [PASSED] reflect-y
[12:29:53] [PASSED] rotate-0
[12:29:53] [PASSED] rotate-90
[12:29:53] [PASSED] rotate-180
[12:29:53] [PASSED] rotate-270
[12:29:53] ============== [PASSED] drm_test_rect_rotate ===============
[12:29:53] ================ drm_test_rect_rotate_inv  =================
[12:29:53] [PASSED] reflect-x
[12:29:53] [PASSED] reflect-y
[12:29:53] [PASSED] rotate-0
[12:29:53] [PASSED] rotate-90
[12:29:53] [PASSED] rotate-180
[12:29:53] [PASSED] rotate-270
[12:29:53] ============ [PASSED] drm_test_rect_rotate_inv =============
[12:29:53] ==================== [PASSED] drm_rect =====================
[12:29:53] ============ drm_sysfb_modeset_test (1 subtest) ============
[12:29:53] ============ drm_test_sysfb_build_fourcc_list  =============
[12:29:53] [PASSED] no native formats
[12:29:53] [PASSED] XRGB8888 as native format
[12:29:53] [PASSED] remove duplicates
[12:29:53] [PASSED] convert alpha formats
[12:29:53] [PASSED] random formats
[12:29:53] ======== [PASSED] drm_test_sysfb_build_fourcc_list =========
[12:29:53] ============= [PASSED] drm_sysfb_modeset_test ==============
[12:29:53] ============================================================
[12:29:53] Testing complete. Ran 616 tests: passed: 616
[12:29:53] Elapsed time: 23.350s total, 1.671s configuring, 21.510s building, 0.138s running

+ /kernel/tools/testing/kunit/kunit.py run --kunitconfig /kernel/drivers/gpu/drm/ttm/tests/.kunitconfig
[12:29:53] Configuring KUnit Kernel ...
Regenerating .config ...
Populating config with:
$ make ARCH=um O=.kunit olddefconfig
[12:29:55] Building KUnit Kernel ...
Populating config with:
$ make ARCH=um O=.kunit olddefconfig
Building with:
$ make all compile_commands.json scripts_gdb ARCH=um O=.kunit --jobs=48
[12:30:03] Starting KUnit Kernel (1/1)...
[12:30:03] ============================================================
Running tests with:
$ .kunit/linux kunit.enable=1 mem=1G console=tty kunit_shutdown=halt
[12:30:03] ================= ttm_device (5 subtests) ==================
[12:30:03] [PASSED] ttm_device_init_basic
[12:30:03] [PASSED] ttm_device_init_multiple
[12:30:03] [PASSED] ttm_device_fini_basic
[12:30:03] [PASSED] ttm_device_init_no_vma_man
[12:30:03] ================== ttm_device_init_pools  ==================
[12:30:03] [PASSED] No DMA allocations, no DMA32 required
[12:30:03] [PASSED] DMA allocations, DMA32 required
[12:30:03] [PASSED] No DMA allocations, DMA32 required
[12:30:03] [PASSED] DMA allocations, no DMA32 required
[12:30:03] ============== [PASSED] ttm_device_init_pools ==============
[12:30:03] =================== [PASSED] ttm_device ====================
[12:30:03] ================== ttm_pool (8 subtests) ===================
[12:30:03] ================== ttm_pool_alloc_basic  ===================
[12:30:03] [PASSED] One page
[12:30:03] [PASSED] More than one page
[12:30:03] [PASSED] Above the allocation limit
[12:30:03] [PASSED] One page, with coherent DMA mappings enabled
[12:30:03] [PASSED] Above the allocation limit, with coherent DMA mappings enabled
[12:30:03] ============== [PASSED] ttm_pool_alloc_basic ===============
[12:30:03] ============== ttm_pool_alloc_basic_dma_addr  ==============
[12:30:03] [PASSED] One page
[12:30:03] [PASSED] More than one page
[12:30:03] [PASSED] Above the allocation limit
[12:30:03] [PASSED] One page, with coherent DMA mappings enabled
[12:30:03] [PASSED] Above the allocation limit, with coherent DMA mappings enabled
[12:30:03] ========== [PASSED] ttm_pool_alloc_basic_dma_addr ==========
[12:30:03] [PASSED] ttm_pool_alloc_order_caching_match
[12:30:03] [PASSED] ttm_pool_alloc_caching_mismatch
[12:30:03] [PASSED] ttm_pool_alloc_order_mismatch
[12:30:03] [PASSED] ttm_pool_free_dma_alloc
[12:30:03] [PASSED] ttm_pool_free_no_dma_alloc
[12:30:03] [PASSED] ttm_pool_fini_basic
[12:30:03] ==================== [PASSED] ttm_pool =====================
[12:30:03] ================ ttm_resource (8 subtests) =================
[12:30:03] ================= ttm_resource_init_basic  =================
[12:30:03] [PASSED] Init resource in TTM_PL_SYSTEM
[12:30:03] [PASSED] Init resource in TTM_PL_VRAM
[12:30:03] [PASSED] Init resource in a private placement
[12:30:03] [PASSED] Init resource in TTM_PL_SYSTEM, set placement flags
[12:30:03] ============= [PASSED] ttm_resource_init_basic =============
[12:30:03] [PASSED] ttm_resource_init_pinned
[12:30:03] [PASSED] ttm_resource_fini_basic
[12:30:03] [PASSED] ttm_resource_manager_init_basic
[12:30:03] [PASSED] ttm_resource_manager_usage_basic
[12:30:03] [PASSED] ttm_resource_manager_set_used_basic
[12:30:03] [PASSED] ttm_sys_man_alloc_basic
[12:30:03] [PASSED] ttm_sys_man_free_basic
[12:30:03] ================== [PASSED] ttm_resource ===================
[12:30:03] =================== ttm_tt (15 subtests) ===================
[12:30:03] ==================== ttm_tt_init_basic  ====================
[12:30:03] [PASSED] Page-aligned size
[12:30:03] [PASSED] Extra pages requested
[12:30:03] ================ [PASSED] ttm_tt_init_basic ================
[12:30:03] [PASSED] ttm_tt_init_misaligned
[12:30:03] [PASSED] ttm_tt_fini_basic
[12:30:03] [PASSED] ttm_tt_fini_sg
[12:30:03] [PASSED] ttm_tt_fini_shmem
[12:30:03] [PASSED] ttm_tt_create_basic
[12:30:03] [PASSED] ttm_tt_create_invalid_bo_type
[12:30:03] [PASSED] ttm_tt_create_ttm_exists
[12:30:03] [PASSED] ttm_tt_create_failed
[12:30:03] [PASSED] ttm_tt_destroy_basic
[12:30:03] [PASSED] ttm_tt_populate_null_ttm
[12:30:03] [PASSED] ttm_tt_populate_populated_ttm
[12:30:03] [PASSED] ttm_tt_unpopulate_basic
[12:30:03] [PASSED] ttm_tt_unpopulate_empty_ttm
[12:30:03] [PASSED] ttm_tt_swapin_basic
[12:30:03] ===================== [PASSED] ttm_tt ======================
[12:30:03] =================== ttm_bo (14 subtests) ===================
[12:30:03] =========== ttm_bo_reserve_optimistic_no_ticket  ===========
[12:30:03] [PASSED] Cannot be interrupted and sleeps
[12:30:03] [PASSED] Cannot be interrupted, locks straight away
[12:30:03] [PASSED] Can be interrupted, sleeps
[12:30:03] ======= [PASSED] ttm_bo_reserve_optimistic_no_ticket =======
[12:30:03] [PASSED] ttm_bo_reserve_locked_no_sleep
[12:30:03] [PASSED] ttm_bo_reserve_no_wait_ticket
[12:30:03] [PASSED] ttm_bo_reserve_double_resv
[12:30:03] [PASSED] ttm_bo_reserve_interrupted
[12:30:03] [PASSED] ttm_bo_reserve_deadlock
[12:30:03] [PASSED] ttm_bo_unreserve_basic
[12:30:03] [PASSED] ttm_bo_unreserve_pinned
[12:30:03] [PASSED] ttm_bo_unreserve_bulk
[12:30:03] [PASSED] ttm_bo_put_basic
[12:30:03] [PASSED] ttm_bo_put_shared_resv
[12:30:03] [PASSED] ttm_bo_pin_basic
[12:30:03] [PASSED] ttm_bo_pin_unpin_resource
[12:30:03] [PASSED] ttm_bo_multiple_pin_one_unpin
[12:30:03] ===================== [PASSED] ttm_bo ======================
[12:30:03] ============== ttm_bo_validate (22 subtests) ===============
[12:30:03] ============== ttm_bo_init_reserved_sys_man  ===============
[12:30:03] [PASSED] Buffer object for userspace
[12:30:03] [PASSED] Kernel buffer object
[12:30:03] [PASSED] Shared buffer object
[12:30:03] ========== [PASSED] ttm_bo_init_reserved_sys_man ===========
[12:30:03] ============== ttm_bo_init_reserved_mock_man  ==============
[12:30:03] [PASSED] Buffer object for userspace
[12:30:03] [PASSED] Kernel buffer object
[12:30:03] [PASSED] Shared buffer object
[12:30:03] ========== [PASSED] ttm_bo_init_reserved_mock_man ==========
[12:30:03] [PASSED] ttm_bo_init_reserved_resv
[12:30:03] ================== ttm_bo_validate_basic  ==================
[12:30:03] [PASSED] Buffer object for userspace
[12:30:03] [PASSED] Kernel buffer object
[12:30:03] [PASSED] Shared buffer object
[12:30:03] ============== [PASSED] ttm_bo_validate_basic ==============
[12:30:03] [PASSED] ttm_bo_validate_invalid_placement
[12:30:03] ============= ttm_bo_validate_same_placement  ==============
[12:30:03] [PASSED] System manager
[12:30:03] [PASSED] VRAM manager
[12:30:03] ========= [PASSED] ttm_bo_validate_same_placement ==========
[12:30:03] [PASSED] ttm_bo_validate_failed_alloc
[12:30:03] [PASSED] ttm_bo_validate_pinned
[12:30:03] [PASSED] ttm_bo_validate_busy_placement
[12:30:03] ================ ttm_bo_validate_multihop  =================
[12:30:03] [PASSED] Buffer object for userspace
[12:30:03] [PASSED] Kernel buffer object
[12:30:03] [PASSED] Shared buffer object
[12:30:03] ============ [PASSED] ttm_bo_validate_multihop =============
[12:30:03] ========== ttm_bo_validate_no_placement_signaled  ==========
[12:30:03] [PASSED] Buffer object in system domain, no page vector
[12:30:03] [PASSED] Buffer object in system domain with an existing page vector
[12:30:03] ====== [PASSED] ttm_bo_validate_no_placement_signaled ======
[12:30:03] ======== ttm_bo_validate_no_placement_not_signaled  ========
[12:30:03] [PASSED] Buffer object for userspace
[12:30:03] [PASSED] Kernel buffer object
[12:30:03] [PASSED] Shared buffer object
[12:30:03] ==== [PASSED] ttm_bo_validate_no_placement_not_signaled ====
[12:30:03] [PASSED] ttm_bo_validate_move_fence_signaled
[12:30:03] ========= ttm_bo_validate_move_fence_not_signaled  =========
[12:30:03] [PASSED] Waits for GPU
[12:30:03] [PASSED] Tries to lock straight away
[12:30:03] ===== [PASSED] ttm_bo_validate_move_fence_not_signaled =====
[12:30:03] [PASSED] ttm_bo_validate_swapout
[12:30:03] [PASSED] ttm_bo_validate_happy_evict
[12:30:03] [PASSED] ttm_bo_validate_all_pinned_evict
[12:30:03] [PASSED] ttm_bo_validate_allowed_only_evict
[12:30:03] [PASSED] ttm_bo_validate_deleted_evict
[12:30:03] [PASSED] ttm_bo_validate_busy_domain_evict
[12:30:03] [PASSED] ttm_bo_validate_evict_gutting
[12:30:03] [PASSED] ttm_bo_validate_recrusive_evict
[12:30:03] ================= [PASSED] ttm_bo_validate =================
[12:30:03] ============================================================
[12:30:03] Testing complete. Ran 102 tests: passed: 102
[12:30:03] Elapsed time: 9.968s total, 1.633s configuring, 7.718s building, 0.527s running

+ cleanup
++ stat -c %u:%g /kernel
+ chown -R 1003:1003 /kernel

^ permalink raw reply	[flat|nested] 48+ messages in thread

* ✗ CI.checksparse: warning for Handle Firmware reported Hardware Errors (rev4)
  2025-07-09 11:20 [PATCH v4 0/9] Handle Firmware reported Hardware Errors Riana Tauro
                   ` (10 preceding siblings ...)
  2025-07-09 12:30 ` ✓ CI.KUnit: success " Patchwork
@ 2025-07-09 12:44 ` Patchwork
  2025-07-09 13:06 ` ✓ Xe.CI.BAT: success " Patchwork
  2025-07-09 15:02 ` ✗ Xe.CI.Full: failure " Patchwork
  13 siblings, 0 replies; 48+ messages in thread
From: Patchwork @ 2025-07-09 12:44 UTC (permalink / raw)
  To: Riana Tauro; +Cc: intel-xe

== Series Details ==

Series: Handle Firmware reported Hardware Errors (rev4)
URL   : https://patchwork.freedesktop.org/series/149756/
State : warning

== Summary ==

+ trap cleanup EXIT
+ KERNEL=/kernel
+ MT=/root/linux/maintainer-tools
+ git clone https://gitlab.freedesktop.org/drm/maintainer-tools /root/linux/maintainer-tools
Cloning into '/root/linux/maintainer-tools'...
warning: redirecting to https://gitlab.freedesktop.org/drm/maintainer-tools.git/
+ make -C /root/linux/maintainer-tools
make: Entering directory '/root/linux/maintainer-tools'
cc -O2 -g -Wextra -o remap-log remap-log.c
make: Leaving directory '/root/linux/maintainer-tools'
+ cd /kernel
+ git config --global --add safe.directory /kernel
+ /root/linux/maintainer-tools/dim sparse --fast 20adfb60af27bc0e490b2d20609c3158ae2fbd26
Sparse version: 0.6.4 (Ubuntu: 0.6.4-4ubuntu3)
Fast mode used, each commit won't be checked separately.
-
+drivers/gpu/drm/drm_drv.c:449:6: warning: context imbalance in 'drm_dev_enter' - different lock contexts for basic block
+drivers/gpu/drm/drm_drv.c: note: in included file (through include/linux/notifier.h, arch/x86/include/asm/uprobes.h, include/linux/uprobes.h, include/linux/mm_types.h, include/linux/mmzone.h, include/linux/gfp.h, ...):
+drivers/gpu/drm/drm_plane.c:213:24: warning: Using plain integer as NULL pointer
+drivers/gpu/drm/i915/display/intel_alpm.c: note: in included file:
+drivers/gpu/drm/i915/display/intel_cdclk.c: note: in included file:
+drivers/gpu/drm/i915/display/intel_ddi.c: note: in included file:
+drivers/gpu/drm/i915/display/intel_display_types.h:2019:24: warning: unreplaced symbol '<noident>'
+drivers/gpu/drm/i915/display/intel_display_types.h:2019:24: warning: unreplaced symbol '<noident>'
+drivers/gpu/drm/i915/display/intel_display_types.h:2019:24: warning: unreplaced symbol '<noident>'
+drivers/gpu/drm/i915/display/intel_display_types.h:2019:24: warning: unreplaced symbol '<noident>'
+drivers/gpu/drm/i915/display/intel_display_types.h:2019:24: warning: unreplaced symbol '<noident>'
+drivers/gpu/drm/i915/display/intel_display_types.h:2019:24: warning: unreplaced symbol '<noident>'
+drivers/gpu/drm/i915/display/intel_display_types.h:2019:24: warning: unreplaced symbol '<noident>'
+drivers/gpu/drm/i915/display/intel_display_types.h:2019:24: warning: unreplaced symbol '<noident>'
+drivers/gpu/drm/i915/display/intel_display_types.h:2019:24: warning: unreplaced symbol '<noident>'
+drivers/gpu/drm/i915/display/intel_display_types.h:2019:24: warning: unreplaced symbol '<noident>'
+drivers/gpu/drm/i915/display/intel_display_types.h:2019:24: warning: unreplaced symbol '<noident>'
+drivers/gpu/drm/i915/display/intel_display_types.h:2019:24: warning: unreplaced symbol '<noident>'
+drivers/gpu/drm/i915/display/intel_display_types.h:2019:24: warning: unreplaced symbol '<noident>'
+drivers/gpu/drm/i915/display/intel_display_types.h:2019:24: warning: unreplaced symbol '<noident>'
+drivers/gpu/drm/i915/display/intel_display_types.h:2019:24: warning: unreplaced symbol '<noident>'
+drivers/gpu/drm/i915/display/intel_display_types.h:2019:24: warning: unreplaced symbol '<noident>'
+drivers/gpu/drm/i915/display/intel_display_types.h:2032:24: warning: unreplaced symbol '<noident>'
+drivers/gpu/drm/i915/display/intel_display_types.h:2032:24: warning: unreplaced symbol '<noident>'
+drivers/gpu/drm/i915/display/intel_display_types.h:2032:24: warning: unreplaced symbol '<noident>'
+drivers/gpu/drm/i915/display/intel_hdcp.c: note: in included file:
+drivers/gpu/drm/i915/display/intel_hotplug.c: note: in included file:
+drivers/gpu/drm/i915/display/intel_pps.c: note: in included file:
+drivers/gpu/drm/i915/display/intel_psr.c: note: in included file:
+drivers/gpu/drm/i915/gt/intel_reset.c:1572:12: warning: context imbalance in '_intel_gt_reset_lock' - different lock contexts for basic block
+drivers/gpu/drm/i915/gt/intel_sseu.c:598:17: error: too long token expansion
+drivers/gpu/drm/i915/i915_active.c:1063:16: warning: context imbalance in '__i915_active_fence_set' - different lock contexts for basic block
+drivers/gpu/drm/i915/i915_drm_client.c:92:9: error: incompatible types in comparison expression (different address spaces):
+drivers/gpu/drm/i915/i915_drm_client.c:92:9: error: incompatible types in comparison expression (different address spaces):
+drivers/gpu/drm/i915/i915_drm_client.c:92:9:    expected struct list_head const *list
+drivers/gpu/drm/i915/i915_drm_client.c:92:9:    got struct list_head [noderef] __rcu *pos
+drivers/gpu/drm/i915/i915_drm_client.c:92:9:    struct list_head *
+drivers/gpu/drm/i915/i915_drm_client.c:92:9:    struct list_head *
+drivers/gpu/drm/i915/i915_drm_client.c:92:9:    struct list_head [noderef] __rcu *
+drivers/gpu/drm/i915/i915_drm_client.c:92:9:    struct list_head [noderef] __rcu *
+drivers/gpu/drm/i915/i915_drm_client.c:92:9: warning: incorrect type in argument 1 (different address spaces)
+drivers/gpu/drm/i915/i915_irq.c:492:9: warning: unreplaced symbol '<noident>'
+drivers/gpu/drm/i915/i915_irq.c:492:9: warning: unreplaced symbol '<noident>'
+drivers/gpu/drm/i915/i915_irq.c:500:16: warning: unreplaced symbol '<noident>'
+drivers/gpu/drm/i915/i915_irq.c:500:16: warning: unreplaced symbol '<noident>'
+drivers/gpu/drm/i915/i915_irq.c:505:9: warning: unreplaced symbol '<noident>'
+drivers/gpu/drm/i915/i915_irq.c:505:9: warning: unreplaced symbol '<noident>'
+drivers/gpu/drm/i915/i915_irq.c:505:9: warning: unreplaced symbol '<noident>'
+drivers/gpu/drm/i915/i915_irq.c:543:9: warning: unreplaced symbol '<noident>'
+drivers/gpu/drm/i915/i915_irq.c:543:9: warning: unreplaced symbol '<noident>'
+drivers/gpu/drm/i915/i915_irq.c:551:16: warning: unreplaced symbol '<noident>'
+drivers/gpu/drm/i915/i915_irq.c:551:16: warning: unreplaced symbol '<noident>'
+drivers/gpu/drm/i915/i915_irq.c:556:9: warning: unreplaced symbol '<noident>'
+drivers/gpu/drm/i915/i915_irq.c:556:9: warning: unreplaced symbol '<noident>'
+drivers/gpu/drm/i915/i915_irq.c:556:9: warning: unreplaced symbol '<noident>'
+drivers/gpu/drm/i915/i915_irq.c:600:9: warning: unreplaced symbol '<noident>'
+drivers/gpu/drm/i915/i915_irq.c:600:9: warning: unreplaced symbol '<noident>'
+drivers/gpu/drm/i915/i915_irq.c:603:15: warning: unreplaced symbol '<noident>'
+drivers/gpu/drm/i915/i915_irq.c:603:15: warning: unreplaced symbol '<noident>'
+drivers/gpu/drm/i915/i915_irq.c:607:9: warning: unreplaced symbol '<noident>'
+drivers/gpu/drm/i915/i915_irq.c:607:9: warning: unreplaced symbol '<noident>'
+drivers/gpu/drm/i915/i915_irq.c:614:9: warning: unreplaced symbol '<noident>'
+drivers/gpu/drm/i915/i915_irq.c:614:9: warning: unreplaced symbol '<noident>'
+drivers/gpu/drm/i915/i915_irq.c:614:9: warning: unreplaced symbol '<noident>'
+drivers/gpu/drm/i915/i915_irq.c:614:9: warning: unreplaced symbol '<noident>'
+drivers/gpu/drm/i915/intel_uncore.c:1927:1: warning: context imbalance in 'fwtable_read8' - unexpected unlock
+drivers/gpu/drm/i915/intel_uncore.c:1928:1: warning: context imbalance in 'fwtable_read16' - unexpected unlock
+drivers/gpu/drm/i915/intel_uncore.c:1929:1: warning: context imbalance in 'fwtable_read32' - unexpected unlock
+drivers/gpu/drm/i915/intel_uncore.c:1930:1: warning: context imbalance in 'fwtable_read64' - unexpected unlock
+drivers/gpu/drm/i915/intel_uncore.c:1995:1: warning: context imbalance in 'gen6_write8' - unexpected unlock
+drivers/gpu/drm/i915/intel_uncore.c:1996:1: warning: context imbalance in 'gen6_write16' - unexpected unlock
+drivers/gpu/drm/i915/intel_uncore.c:1997:1: warning: context imbalance in 'gen6_write32' - unexpected unlock
+drivers/gpu/drm/i915/intel_uncore.c:2017:1: warning: context imbalance in 'fwtable_write8' - unexpected unlock
+drivers/gpu/drm/i915/intel_uncore.c:2018:1: warning: context imbalance in 'fwtable_write16' - unexpected unlock
+drivers/gpu/drm/i915/intel_uncore.c:2019:1: warning: context imbalance in 'fwtable_write32' - unexpected unlock
+drivers/gpu/drm/i915/intel_wakeref.c:145:19: warning: context imbalance in 'wakeref_auto_timeout' - unexpected unlock
+drivers/gpu/drm/ttm/ttm_bo.c:1199:31: warning: symbol 'ttm_swap_ops' was not declared. Should it be static?
+drivers/gpu/drm/ttm/ttm_bo_util.c:329:38:    expected void *virtual
+drivers/gpu/drm/ttm/ttm_bo_util.c:329:38:    got void [noderef] __iomem *
+drivers/gpu/drm/ttm/ttm_bo_util.c:329:38: warning: incorrect type in assignment (different address spaces)
+drivers/gpu/drm/ttm/ttm_bo_util.c:332:38:    expected void *virtual
+drivers/gpu/drm/ttm/ttm_bo_util.c:332:38:    got void [noderef] __iomem *
+drivers/gpu/drm/ttm/ttm_bo_util.c:332:38: warning: incorrect type in assignment (different address spaces)
+drivers/gpu/drm/ttm/ttm_bo_util.c:335:38:    expected void *virtual
+drivers/gpu/drm/ttm/ttm_bo_util.c:335:38:    got void [noderef] __iomem *
+drivers/gpu/drm/ttm/ttm_bo_util.c:335:38: warning: incorrect type in assignment (different address spaces)
+drivers/gpu/drm/ttm/ttm_bo_util.c:467:28:    expected void volatile [noderef] __iomem *addr
+drivers/gpu/drm/ttm/ttm_bo_util.c:467:28:    got void *virtual
+drivers/gpu/drm/ttm/ttm_bo_util.c:467:28: warning: incorrect type in argument 1 (different address spaces)
+./include/linux/srcu.h:400:9: warning: context imbalance in 'drm_dev_exit' - unexpected unlock

+ cleanup
++ stat -c %u:%g /kernel
+ chown -R 1003:1003 /kernel




* ✓ Xe.CI.BAT: success for Handle Firmware reported Hardware Errors (rev4)
  2025-07-09 11:20 [PATCH v4 0/9] Handle Firmware reported Hardware Errors Riana Tauro
                   ` (11 preceding siblings ...)
  2025-07-09 12:44 ` ✗ CI.checksparse: warning " Patchwork
@ 2025-07-09 13:06 ` Patchwork
  2025-07-09 15:02 ` ✗ Xe.CI.Full: failure " Patchwork
  13 siblings, 0 replies; 48+ messages in thread
From: Patchwork @ 2025-07-09 13:06 UTC (permalink / raw)
  To: Riana Tauro; +Cc: intel-xe

[-- Attachment #1: Type: text/plain, Size: 6557 bytes --]

== Series Details ==

Series: Handle Firmware reported Hardware Errors (rev4)
URL   : https://patchwork.freedesktop.org/series/149756/
State : success

== Summary ==

CI Bug Log - changes from xe-3381-20adfb60af27bc0e490b2d20609c3158ae2fbd26_BAT -> xe-pw-149756v4_BAT
====================================================

Summary
-------

  **SUCCESS**

  No regressions found.

  

Participating hosts (9 -> 8)
------------------------------

  Missing    (1): bat-adlp-vm 

Known issues
------------

  Here are the changes found in xe-pw-149756v4_BAT that come from known issues:

### IGT changes ###

#### Issues hit ####

  * igt@kms_addfb_basic@addfb25-y-tiled-small-legacy:
    - bat-lnl-2:          NOTRUN -> [SKIP][1] ([Intel XE#1466] / [Intel XE#2235])
   [1]: https://intel-gfx-ci.01.org/tree/intel-xe/xe-pw-149756v4/bat-lnl-2/igt@kms_addfb_basic@addfb25-y-tiled-small-legacy.html

  * igt@kms_flip@basic-flip-vs-dpms:
    - bat-lnl-2:          NOTRUN -> [SKIP][2] ([Intel XE#2235] / [Intel XE#2482]) +3 other tests skip
   [2]: https://intel-gfx-ci.01.org/tree/intel-xe/xe-pw-149756v4/bat-lnl-2/igt@kms_flip@basic-flip-vs-dpms.html

  * igt@kms_force_connector_basic@force-connector-state:
    - bat-lnl-2:          NOTRUN -> [SKIP][3] ([Intel XE#2235] / [Intel XE#352]) +2 other tests skip
   [3]: https://intel-gfx-ci.01.org/tree/intel-xe/xe-pw-149756v4/bat-lnl-2/igt@kms_force_connector_basic@force-connector-state.html

  * igt@kms_frontbuffer_tracking@basic:
    - bat-lnl-2:          NOTRUN -> [SKIP][4] ([Intel XE#2235] / [Intel XE#2548])
   [4]: https://intel-gfx-ci.01.org/tree/intel-xe/xe-pw-149756v4/bat-lnl-2/igt@kms_frontbuffer_tracking@basic.html

  * igt@kms_hdmi_inject@inject-audio:
    - bat-lnl-2:          NOTRUN -> [SKIP][5] ([Intel XE#1470] / [Intel XE#2235] / [Intel XE#2853])
   [5]: https://intel-gfx-ci.01.org/tree/intel-xe/xe-pw-149756v4/bat-lnl-2/igt@kms_hdmi_inject@inject-audio.html

  * igt@kms_pipe_crc_basic@compare-crc-sanitycheck-xr24:
    - bat-lnl-2:          NOTRUN -> [SKIP][6] ([Intel XE#2235]) +13 other tests skip
   [6]: https://intel-gfx-ci.01.org/tree/intel-xe/xe-pw-149756v4/bat-lnl-2/igt@kms_pipe_crc_basic@compare-crc-sanitycheck-xr24.html

  * igt@kms_psr@psr-cursor-plane-move:
    - bat-lnl-2:          NOTRUN -> [SKIP][7] ([Intel XE#2850] / [Intel XE#929]) +2 other tests skip
   [7]: https://intel-gfx-ci.01.org/tree/intel-xe/xe-pw-149756v4/bat-lnl-2/igt@kms_psr@psr-cursor-plane-move.html

  * igt@sriov_basic@enable-vfs-autoprobe-off:
    - bat-lnl-2:          NOTRUN -> [SKIP][8] ([Intel XE#1091] / [Intel XE#2849]) +1 other test skip
   [8]: https://intel-gfx-ci.01.org/tree/intel-xe/xe-pw-149756v4/bat-lnl-2/igt@sriov_basic@enable-vfs-autoprobe-off.html

  * igt@xe_evict@evict-beng-small:
    - bat-lnl-2:          NOTRUN -> [SKIP][9] ([Intel XE#688]) +11 other tests skip
   [9]: https://intel-gfx-ci.01.org/tree/intel-xe/xe-pw-149756v4/bat-lnl-2/igt@xe_evict@evict-beng-small.html

  * igt@xe_live_ktest@xe_bo@xe_bo_evict_kunit:
    - bat-lnl-2:          NOTRUN -> [SKIP][10] ([Intel XE#2229]) +2 other tests skip
   [10]: https://intel-gfx-ci.01.org/tree/intel-xe/xe-pw-149756v4/bat-lnl-2/igt@xe_live_ktest@xe_bo@xe_bo_evict_kunit.html

  * igt@xe_mmap@vram:
    - bat-lnl-2:          NOTRUN -> [SKIP][11] ([Intel XE#1416])
   [11]: https://intel-gfx-ci.01.org/tree/intel-xe/xe-pw-149756v4/bat-lnl-2/igt@xe_mmap@vram.html

  * igt@xe_pat@pat-index-xehpc:
    - bat-lnl-2:          NOTRUN -> [SKIP][12] ([Intel XE#1420] / [Intel XE#2838])
   [12]: https://intel-gfx-ci.01.org/tree/intel-xe/xe-pw-149756v4/bat-lnl-2/igt@xe_pat@pat-index-xehpc.html

  * igt@xe_pat@pat-index-xelp:
    - bat-lnl-2:          NOTRUN -> [SKIP][13] ([Intel XE#977])
   [13]: https://intel-gfx-ci.01.org/tree/intel-xe/xe-pw-149756v4/bat-lnl-2/igt@xe_pat@pat-index-xelp.html

  * igt@xe_pat@pat-index-xelpg:
    - bat-lnl-2:          NOTRUN -> [SKIP][14] ([Intel XE#2236] / [Intel XE#979])
   [14]: https://intel-gfx-ci.01.org/tree/intel-xe/xe-pw-149756v4/bat-lnl-2/igt@xe_pat@pat-index-xelpg.html

  * igt@xe_sriov_flr@flr-vf1-clear:
    - bat-lnl-2:          NOTRUN -> [SKIP][15] ([Intel XE#3342])
   [15]: https://intel-gfx-ci.01.org/tree/intel-xe/xe-pw-149756v4/bat-lnl-2/igt@xe_sriov_flr@flr-vf1-clear.html

  
#### Possible fixes ####

  * igt@intel_sysfs_debugfs@xe-sysfs-read-all-entries:
    - bat-lnl-2:          [ABORT][16] -> [PASS][17]
   [16]: https://intel-gfx-ci.01.org/tree/intel-xe/xe-3381-20adfb60af27bc0e490b2d20609c3158ae2fbd26/bat-lnl-2/igt@intel_sysfs_debugfs@xe-sysfs-read-all-entries.html
   [17]: https://intel-gfx-ci.01.org/tree/intel-xe/xe-pw-149756v4/bat-lnl-2/igt@intel_sysfs_debugfs@xe-sysfs-read-all-entries.html

  
  [Intel XE#1091]: https://gitlab.freedesktop.org/drm/xe/kernel/issues/1091
  [Intel XE#1416]: https://gitlab.freedesktop.org/drm/xe/kernel/issues/1416
  [Intel XE#1420]: https://gitlab.freedesktop.org/drm/xe/kernel/issues/1420
  [Intel XE#1466]: https://gitlab.freedesktop.org/drm/xe/kernel/issues/1466
  [Intel XE#1470]: https://gitlab.freedesktop.org/drm/xe/kernel/issues/1470
  [Intel XE#2229]: https://gitlab.freedesktop.org/drm/xe/kernel/issues/2229
  [Intel XE#2235]: https://gitlab.freedesktop.org/drm/xe/kernel/issues/2235
  [Intel XE#2236]: https://gitlab.freedesktop.org/drm/xe/kernel/issues/2236
  [Intel XE#2482]: https://gitlab.freedesktop.org/drm/xe/kernel/issues/2482
  [Intel XE#2548]: https://gitlab.freedesktop.org/drm/xe/kernel/issues/2548
  [Intel XE#2838]: https://gitlab.freedesktop.org/drm/xe/kernel/issues/2838
  [Intel XE#2849]: https://gitlab.freedesktop.org/drm/xe/kernel/issues/2849
  [Intel XE#2850]: https://gitlab.freedesktop.org/drm/xe/kernel/issues/2850
  [Intel XE#2853]: https://gitlab.freedesktop.org/drm/xe/kernel/issues/2853
  [Intel XE#3342]: https://gitlab.freedesktop.org/drm/xe/kernel/issues/3342
  [Intel XE#352]: https://gitlab.freedesktop.org/drm/xe/kernel/issues/352
  [Intel XE#688]: https://gitlab.freedesktop.org/drm/xe/kernel/issues/688
  [Intel XE#929]: https://gitlab.freedesktop.org/drm/xe/kernel/issues/929
  [Intel XE#977]: https://gitlab.freedesktop.org/drm/xe/kernel/issues/977
  [Intel XE#979]: https://gitlab.freedesktop.org/drm/xe/kernel/issues/979


Build changes
-------------

  * Linux: xe-3381-20adfb60af27bc0e490b2d20609c3158ae2fbd26 -> xe-pw-149756v4

  IGT_8447: 8447
  xe-3381-20adfb60af27bc0e490b2d20609c3158ae2fbd26: 20adfb60af27bc0e490b2d20609c3158ae2fbd26
  xe-pw-149756v4: 149756v4

== Logs ==

For more details see: https://intel-gfx-ci.01.org/tree/intel-xe/xe-pw-149756v4/index.html



* Re: [PATCH v4 1/9] drm: Add a vendor-specific recovery method to device wedged uevent
  2025-07-09 11:20 ` [PATCH v4 1/9] drm: Add a vendor-specific recovery method to device wedged uevent Riana Tauro
@ 2025-07-09 13:41   ` Simona Vetter
  2025-07-09 14:09     ` Christian König
  2025-07-09 14:46     ` Riana Tauro
  0 siblings, 2 replies; 48+ messages in thread
From: Simona Vetter @ 2025-07-09 13:41 UTC (permalink / raw)
  To: Riana Tauro
  Cc: intel-xe, anshuman.gupta, rodrigo.vivi, lucas.demarchi,
	aravind.iddamsetty, raag.jadav, umesh.nerlige.ramappa,
	frank.scarbrough, sk.anirban, André Almeida,
	Christian König, David Airlie, dri-devel

On Wed, Jul 09, 2025 at 04:50:13PM +0530, Riana Tauro wrote:
> Certain errors can cause the device to be wedged and may
> require a vendor specific recovery method to restore normal
> operation.
> 
> Add a recovery method 'WEDGED=vendor-specific' for such errors. Vendors
> must provide additional recovery documentation if this method
> is used.
> 
> v2: fix documentation (Raag)
> 
> Cc: André Almeida <andrealmeid@igalia.com>
> Cc: Christian König <christian.koenig@amd.com>
> Cc: David Airlie <airlied@gmail.com>
> Cc: <dri-devel@lists.freedesktop.org>
> Suggested-by: Raag Jadav <raag.jadav@intel.com>
> Signed-off-by: Riana Tauro <riana.tauro@intel.com>

I'm not really understanding what this is useful for, maybe concrete
example in the form of driver code that uses this, and some tool or
documentation steps that should be taken for recovery?

The issue I'm seeing here is that eventually we'll get different
vendor-specific recovery steps, and maybe even on the same device, and
that leads us to an enumeration issue. Since it's just a string and an
enum I think it'd be better to just allocate a new one every time there's
a new strange recovery method instead of this opaque approach.

Cheers, Sima

> ---
>  Documentation/gpu/drm-uapi.rst | 9 +++++----
>  drivers/gpu/drm/drm_drv.c      | 2 ++
>  include/drm/drm_device.h       | 4 ++++
>  3 files changed, 11 insertions(+), 4 deletions(-)
> 
> diff --git a/Documentation/gpu/drm-uapi.rst b/Documentation/gpu/drm-uapi.rst
> index 263e5a97c080..c33070bdb347 100644
> --- a/Documentation/gpu/drm-uapi.rst
> +++ b/Documentation/gpu/drm-uapi.rst
> @@ -421,10 +421,10 @@ Recovery
>  Current implementation defines three recovery methods, out of which, drivers
>  can use any one, multiple or none. Method(s) of choice will be sent in the
>  uevent environment as ``WEDGED=<method1>[,..,<methodN>]`` in order of less to
> -more side-effects. If driver is unsure about recovery or method is unknown
> -(like soft/hard system reboot, firmware flashing, physical device replacement
> -or any other procedure which can't be attempted on the fly), ``WEDGED=unknown``
> -will be sent instead.
> +more side-effects. If recovery method is specific to vendor
> +``WEDGED=vendor-specific`` will be sent and userspace should refer to vendor
> +specific documentation for further recovery steps. If driver is unsure about
> +recovery or method is unknown, ``WEDGED=unknown`` will be sent instead
>  
>  Userspace consumers can parse this event and attempt recovery as per the
>  following expectations.
> @@ -435,6 +435,7 @@ following expectations.
>      none            optional telemetry collection
>      rebind          unbind + bind driver
>      bus-reset       unbind + bus reset/re-enumeration + bind
> +    vendor-specific vendor specific recovery method
>      unknown         consumer policy
>      =============== ========================================
>  
> diff --git a/drivers/gpu/drm/drm_drv.c b/drivers/gpu/drm/drm_drv.c
> index cdd591b11488..0ac723a46a91 100644
> --- a/drivers/gpu/drm/drm_drv.c
> +++ b/drivers/gpu/drm/drm_drv.c
> @@ -532,6 +532,8 @@ static const char *drm_get_wedge_recovery(unsigned int opt)
>  		return "rebind";
>  	case DRM_WEDGE_RECOVERY_BUS_RESET:
>  		return "bus-reset";
> +	case DRM_WEDGE_RECOVERY_VENDOR:
> +		return "vendor-specific";
>  	default:
>  		return NULL;
>  	}
> diff --git a/include/drm/drm_device.h b/include/drm/drm_device.h
> index 08b3b2467c4c..08a087f149ff 100644
> --- a/include/drm/drm_device.h
> +++ b/include/drm/drm_device.h
> @@ -26,10 +26,14 @@ struct pci_controller;
>   * Recovery methods for wedged device in order of less to more side-effects.
>   * To be used with drm_dev_wedged_event() as recovery @method. Callers can
>   * use any one, multiple (or'd) or none depending on their needs.
> + *
> + * Refer to "Device Wedging" chapter in Documentation/gpu/drm-uapi.rst for more
> + * details.
>   */
>  #define DRM_WEDGE_RECOVERY_NONE		BIT(0)	/* optional telemetry collection */
>  #define DRM_WEDGE_RECOVERY_REBIND	BIT(1)	/* unbind + bind driver */
>  #define DRM_WEDGE_RECOVERY_BUS_RESET	BIT(2)	/* unbind + reset bus device + bind */
> +#define DRM_WEDGE_RECOVERY_VENDOR	BIT(3)	/* vendor specific recovery method */
>  
>  /**
>   * struct drm_wedge_task_info - information about the guilty task of a wedge dev
> -- 
> 2.47.1
> 

-- 
Simona Vetter
Software Engineer, Intel Corporation
http://blog.ffwll.ch


* Re: [PATCH v4 1/9] drm: Add a vendor-specific recovery method to device wedged uevent
  2025-07-09 13:41   ` Simona Vetter
@ 2025-07-09 14:09     ` Christian König
  2025-07-09 14:18       ` Raag Jadav
  2025-07-09 14:46     ` Riana Tauro
  1 sibling, 1 reply; 48+ messages in thread
From: Christian König @ 2025-07-09 14:09 UTC (permalink / raw)
  To: Simona Vetter, Riana Tauro
  Cc: intel-xe, anshuman.gupta, rodrigo.vivi, lucas.demarchi,
	aravind.iddamsetty, raag.jadav, umesh.nerlige.ramappa,
	frank.scarbrough, sk.anirban, André Almeida, David Airlie,
	dri-devel

On 09.07.25 15:41, Simona Vetter wrote:
> On Wed, Jul 09, 2025 at 04:50:13PM +0530, Riana Tauro wrote:
>> Certain errors can cause the device to be wedged and may
>> require a vendor specific recovery method to restore normal
>> operation.
>>
>> Add a recovery method 'WEDGED=vendor-specific' for such errors. Vendors
>> must provide additional recovery documentation if this method
>> is used.
>>
>> v2: fix documentation (Raag)
>>
>> Cc: André Almeida <andrealmeid@igalia.com>
>> Cc: Christian König <christian.koenig@amd.com>
>> Cc: David Airlie <airlied@gmail.com>
>> Cc: <dri-devel@lists.freedesktop.org>
>> Suggested-by: Raag Jadav <raag.jadav@intel.com>
>> Signed-off-by: Riana Tauro <riana.tauro@intel.com>
> 
> I'm not really understanding what this is useful for, maybe concrete
> example in the form of driver code that uses this, and some tool or
> documentation steps that should be taken for recovery?

The recovery method for this particular case is to flash in a new firmware.

> The issue I'm seeing here is that eventually we'll get different
> vendor-specific recovery steps, and maybe even on the same device, and
> that leads us to an enumeration issue. Since it's just a string and an
> enum I think it'd be better to just allocate a new one every time there's
> a new strange recovery method instead of this opaque approach.

That is exactly the opposite of what we discussed so far.

The original idea was to add a firmware-flash recovery method, which looked a bit vague since it didn't give any information on what exactly to do.

That's why I suggested adding a more generic vendor-specific event which refers to the documentation and system log to see what actually needs to be done.

Otherwise we would end up with events like firmware-flash, update FW image A, update FW image B, FW version mismatch etc....
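
For illustration only, here is a minimal userspace sketch of how a recovery bitmask could turn into the ``WEDGED=`` string described in the patch. The bit values and method names are taken from the quoted diff; `wedge_recovery_name()` and `format_wedged()` are hypothetical stand-ins, not the DRM core functions, and the fallback to `unknown` when no known bit is set is an assumption based on the documentation text.

```c
#include <assert.h>
#include <stddef.h>
#include <stdio.h>

/* Userspace stand-ins for the DRM_WEDGE_RECOVERY_* bits in the patch. */
#define WEDGE_RECOVERY_NONE      (1u << 0)
#define WEDGE_RECOVERY_REBIND    (1u << 1)
#define WEDGE_RECOVERY_BUS_RESET (1u << 2)
#define WEDGE_RECOVERY_VENDOR    (1u << 3)

/* Same bit-to-name mapping as the switch in the quoted patch. */
static const char *wedge_recovery_name(unsigned int opt)
{
	switch (opt) {
	case WEDGE_RECOVERY_NONE:
		return "none";
	case WEDGE_RECOVERY_REBIND:
		return "rebind";
	case WEDGE_RECOVERY_BUS_RESET:
		return "bus-reset";
	case WEDGE_RECOVERY_VENDOR:
		return "vendor-specific";
	default:
		return NULL;
	}
}

/*
 * Hypothetical helper: write "WEDGED=<m1>[,<m2>...]" into buf, lowest bit
 * first (less to more side-effects). Assumption: with no known bit set the
 * value is "unknown". Returns 0 on success, -1 if buf is too small.
 */
static int format_wedged(unsigned int method, char *buf, size_t len)
{
	size_t used;
	int found = 0;
	unsigned int bit;
	int n;

	assert(buf && len > 0);

	n = snprintf(buf, len, "WEDGED=");
	if (n < 0 || (size_t)n >= len)
		return -1;
	used = (size_t)n;

	for (bit = 0; bit < 32; bit++) {
		const char *name;

		if (!(method & (1u << bit)))
			continue;
		name = wedge_recovery_name(1u << bit);
		if (!name)
			continue; /* skip bits this sketch does not know */
		n = snprintf(buf + used, len - used, "%s%s",
			     found ? "," : "", name);
		if (n < 0 || (size_t)n >= len - used)
			return -1;
		used += (size_t)n;
		found = 1;
	}

	if (!found) {
		n = snprintf(buf + used, len - used, "unknown");
		if (n < 0 || (size_t)n >= len - used)
			return -1;
	}
	return 0;
}
```

With a single generic bit, the core string table stays small, and anything more specific (which firmware image, which version) would have to come from the driver's own documentation or logs, as discussed above.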

Regards,
Christian.

> 
> Cheers, Sima
> 
>> ---
>>  Documentation/gpu/drm-uapi.rst | 9 +++++----
>>  drivers/gpu/drm/drm_drv.c      | 2 ++
>>  include/drm/drm_device.h       | 4 ++++
>>  3 files changed, 11 insertions(+), 4 deletions(-)
>>
>> diff --git a/Documentation/gpu/drm-uapi.rst b/Documentation/gpu/drm-uapi.rst
>> index 263e5a97c080..c33070bdb347 100644
>> --- a/Documentation/gpu/drm-uapi.rst
>> +++ b/Documentation/gpu/drm-uapi.rst
>> @@ -421,10 +421,10 @@ Recovery
>>  Current implementation defines three recovery methods, out of which, drivers
>>  can use any one, multiple or none. Method(s) of choice will be sent in the
>>  uevent environment as ``WEDGED=<method1>[,..,<methodN>]`` in order of less to
>> -more side-effects. If driver is unsure about recovery or method is unknown
>> -(like soft/hard system reboot, firmware flashing, physical device replacement
>> -or any other procedure which can't be attempted on the fly), ``WEDGED=unknown``
>> -will be sent instead.
>> +more side-effects. If recovery method is specific to vendor
>> +``WEDGED=vendor-specific`` will be sent and userspace should refer to vendor
>> +specific documentation for further recovery steps. If driver is unsure about
>> +recovery or method is unknown, ``WEDGED=unknown`` will be sent instead
>>  
>>  Userspace consumers can parse this event and attempt recovery as per the
>>  following expectations.
>> @@ -435,6 +435,7 @@ following expectations.
>>      none            optional telemetry collection
>>      rebind          unbind + bind driver
>>      bus-reset       unbind + bus reset/re-enumeration + bind
>> +    vendor-specific vendor specific recovery method
>>      unknown         consumer policy
>>      =============== ========================================
>>  
>> diff --git a/drivers/gpu/drm/drm_drv.c b/drivers/gpu/drm/drm_drv.c
>> index cdd591b11488..0ac723a46a91 100644
>> --- a/drivers/gpu/drm/drm_drv.c
>> +++ b/drivers/gpu/drm/drm_drv.c
>> @@ -532,6 +532,8 @@ static const char *drm_get_wedge_recovery(unsigned int opt)
>>  		return "rebind";
>>  	case DRM_WEDGE_RECOVERY_BUS_RESET:
>>  		return "bus-reset";
>> +	case DRM_WEDGE_RECOVERY_VENDOR:
>> +		return "vendor-specific";
>>  	default:
>>  		return NULL;
>>  	}
>> diff --git a/include/drm/drm_device.h b/include/drm/drm_device.h
>> index 08b3b2467c4c..08a087f149ff 100644
>> --- a/include/drm/drm_device.h
>> +++ b/include/drm/drm_device.h
>> @@ -26,10 +26,14 @@ struct pci_controller;
>>   * Recovery methods for wedged device in order of less to more side-effects.
>>   * To be used with drm_dev_wedged_event() as recovery @method. Callers can
>>   * use any one, multiple (or'd) or none depending on their needs.
>> + *
>> + * Refer to "Device Wedging" chapter in Documentation/gpu/drm-uapi.rst for more
>> + * details.
>>   */
>>  #define DRM_WEDGE_RECOVERY_NONE		BIT(0)	/* optional telemetry collection */
>>  #define DRM_WEDGE_RECOVERY_REBIND	BIT(1)	/* unbind + bind driver */
>>  #define DRM_WEDGE_RECOVERY_BUS_RESET	BIT(2)	/* unbind + reset bus device + bind */
>> +#define DRM_WEDGE_RECOVERY_VENDOR	BIT(3)	/* vendor specific recovery method */
>>  
>>  /**
>>   * struct drm_wedge_task_info - information about the guilty task of a wedge dev
>> -- 
>> 2.47.1
>>
> 



* Re: [PATCH v4 1/9] drm: Add a vendor-specific recovery method to device wedged uevent
  2025-07-09 14:09     ` Christian König
@ 2025-07-09 14:18       ` Raag Jadav
  2025-07-09 16:52         ` Rodrigo Vivi
  0 siblings, 1 reply; 48+ messages in thread
From: Raag Jadav @ 2025-07-09 14:18 UTC (permalink / raw)
  To: Christian König
  Cc: Simona Vetter, Riana Tauro, intel-xe, anshuman.gupta,
	rodrigo.vivi, lucas.demarchi, aravind.iddamsetty,
	umesh.nerlige.ramappa, frank.scarbrough, sk.anirban,
	André Almeida, David Airlie, dri-devel

On Wed, Jul 09, 2025 at 04:09:20PM +0200, Christian König wrote:
> On 09.07.25 15:41, Simona Vetter wrote:
> > On Wed, Jul 09, 2025 at 04:50:13PM +0530, Riana Tauro wrote:
> >> Certain errors can cause the device to be wedged and may
> >> require a vendor specific recovery method to restore normal
> >> operation.
> >>
> >> Add a recovery method 'WEDGED=vendor-specific' for such errors. Vendors
> >> must provide additional recovery documentation if this method
> >> is used.
> >>
> >> v2: fix documentation (Raag)
> >>
> >> Cc: André Almeida <andrealmeid@igalia.com>
> >> Cc: Christian König <christian.koenig@amd.com>
> >> Cc: David Airlie <airlied@gmail.com>
> >> Cc: <dri-devel@lists.freedesktop.org>
> >> Suggested-by: Raag Jadav <raag.jadav@intel.com>
> >> Signed-off-by: Riana Tauro <riana.tauro@intel.com>
> > 
> > I'm not really understanding what this is useful for, maybe concrete
> > example in the form of driver code that uses this, and some tool or
> > documentation steps that should be taken for recovery?
> 
> The recovery method for this particular case is to flash in a new firmware.
> 
> > The issue I'm seeing here is that eventually we'll get different
> > vendor-specific recovery steps, and maybe even on the same device, and
> > that leads us to an enumeration issue. Since it's just a string and an
> > enum I think it'd be better to just allocate a new one every time there's
> > a new strange recovery method instead of this opaque approach.
> 
> That is exactly the opposite of what we discussed so far.
> 
> The original idea was to add a firmware-flash recovery method, which looked a bit vague since it didn't give any information on what exactly to do.
> 
> That's why I suggested adding a more generic vendor-specific event which refers to the documentation and system log to see what actually needs to be done.
> 
> Otherwise we would end up with events like firmware-flash, update FW image A, update FW image B, FW version mismatch etc....

Agree. Any newly allocated method that is specific to a vendor is going to
be opaque anyway, since it can't be generic for all drivers. This just helps
reduce the noise in DRM core.

And yes, there could be different vendor-specific cases for the same driver
and the driver should be able to provide the means to distinguish between
them.
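
As a rough sketch of the consumer side (an assumption about how a userspace tool might be written, not code from this series), the ``WEDGED=`` value from the uevent could be parsed into a mask, with any token the tool does not recognize treated like ``unknown``. The `METHOD_*` names and `parse_wedged()` are hypothetical; only the method strings come from the documentation table in the patch.

```c
#include <string.h>

/* Hypothetical consumer-side flags; METHOD_UNKNOWN also covers unrecognized tokens. */
enum wedged_method {
	METHOD_NONE      = 1 << 0,
	METHOD_REBIND    = 1 << 1,
	METHOD_BUS_RESET = 1 << 2,
	METHOD_VENDOR    = 1 << 3,
	METHOD_UNKNOWN   = 1 << 4,
};

/* Method strings as listed in the drm-uapi.rst table in the patch. */
static const struct {
	const char *name;
	unsigned int bit;
} wedged_methods[] = {
	{ "none",            METHOD_NONE },
	{ "rebind",          METHOD_REBIND },
	{ "bus-reset",       METHOD_BUS_RESET },
	{ "vendor-specific", METHOD_VENDOR },
	{ "unknown",         METHOD_UNKNOWN },
};

/*
 * Parse one uevent environment entry such as "WEDGED=rebind,bus-reset".
 * Returns a mask of METHOD_* bits, or 0 if the entry is not a WEDGED= line.
 */
static unsigned int parse_wedged(const char *env)
{
	static const char prefix[] = "WEDGED=";
	const size_t plen = sizeof(prefix) - 1;
	unsigned int mask = 0;
	const char *p;

	if (!env || strncmp(env, prefix, plen) != 0)
		return 0;

	for (p = env + plen; *p; ) {
		size_t n = strcspn(p, ","); /* length of the current token */
		unsigned int bit = METHOD_UNKNOWN;
		size_t i;

		for (i = 0; i < sizeof(wedged_methods) / sizeof(wedged_methods[0]); i++) {
			if (strlen(wedged_methods[i].name) == n &&
			    strncmp(p, wedged_methods[i].name, n) == 0) {
				bit = wedged_methods[i].bit;
				break;
			}
		}
		mask |= bit;
		p += n;
		if (*p == ',')
			p++;
	}
	return mask;
}
```

A tool that sees `METHOD_VENDOR` in the result cannot act on the uevent alone; it would need whatever driver-specific means of distinguishing the cases is eventually provided, which is the point made above.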

Raag

> >> ---
> >>  Documentation/gpu/drm-uapi.rst | 9 +++++----
> >>  drivers/gpu/drm/drm_drv.c      | 2 ++
> >>  include/drm/drm_device.h       | 4 ++++
> >>  3 files changed, 11 insertions(+), 4 deletions(-)
> >>
> >> diff --git a/Documentation/gpu/drm-uapi.rst b/Documentation/gpu/drm-uapi.rst
> >> index 263e5a97c080..c33070bdb347 100644
> >> --- a/Documentation/gpu/drm-uapi.rst
> >> +++ b/Documentation/gpu/drm-uapi.rst
> >> @@ -421,10 +421,10 @@ Recovery
> >>  Current implementation defines three recovery methods, out of which, drivers
> >>  can use any one, multiple or none. Method(s) of choice will be sent in the
> >>  uevent environment as ``WEDGED=<method1>[,..,<methodN>]`` in order of less to
> >> -more side-effects. If driver is unsure about recovery or method is unknown
> >> -(like soft/hard system reboot, firmware flashing, physical device replacement
> >> -or any other procedure which can't be attempted on the fly), ``WEDGED=unknown``
> >> -will be sent instead.
> >> +more side-effects. If recovery method is specific to vendor
> >> +``WEDGED=vendor-specific`` will be sent and userspace should refer to vendor
> >> +specific documentation for further recovery steps. If driver is unsure about
> >> +recovery or method is unknown, ``WEDGED=unknown`` will be sent instead
> >>  
> >>  Userspace consumers can parse this event and attempt recovery as per the
> >>  following expectations.
> >> @@ -435,6 +435,7 @@ following expectations.
> >>      none            optional telemetry collection
> >>      rebind          unbind + bind driver
> >>      bus-reset       unbind + bus reset/re-enumeration + bind
> >> +    vendor-specific vendor specific recovery method
> >>      unknown         consumer policy
> >>      =============== ========================================
> >>  
> >> diff --git a/drivers/gpu/drm/drm_drv.c b/drivers/gpu/drm/drm_drv.c
> >> index cdd591b11488..0ac723a46a91 100644
> >> --- a/drivers/gpu/drm/drm_drv.c
> >> +++ b/drivers/gpu/drm/drm_drv.c
> >> @@ -532,6 +532,8 @@ static const char *drm_get_wedge_recovery(unsigned int opt)
> >>  		return "rebind";
> >>  	case DRM_WEDGE_RECOVERY_BUS_RESET:
> >>  		return "bus-reset";
> >> +	case DRM_WEDGE_RECOVERY_VENDOR:
> >> +		return "vendor-specific";
> >>  	default:
> >>  		return NULL;
> >>  	}
> >> diff --git a/include/drm/drm_device.h b/include/drm/drm_device.h
> >> index 08b3b2467c4c..08a087f149ff 100644
> >> --- a/include/drm/drm_device.h
> >> +++ b/include/drm/drm_device.h
> >> @@ -26,10 +26,14 @@ struct pci_controller;
> >>   * Recovery methods for wedged device in order of less to more side-effects.
> >>   * To be used with drm_dev_wedged_event() as recovery @method. Callers can
> >>   * use any one, multiple (or'd) or none depending on their needs.
> >> + *
> >> + * Refer to "Device Wedging" chapter in Documentation/gpu/drm-uapi.rst for more
> >> + * details.
> >>   */
> >>  #define DRM_WEDGE_RECOVERY_NONE		BIT(0)	/* optional telemetry collection */
> >>  #define DRM_WEDGE_RECOVERY_REBIND	BIT(1)	/* unbind + bind driver */
> >>  #define DRM_WEDGE_RECOVERY_BUS_RESET	BIT(2)	/* unbind + reset bus device + bind */
> >> +#define DRM_WEDGE_RECOVERY_VENDOR	BIT(3)	/* vendor specific recovery method */
> >>  
> >>  /**
> >>   * struct drm_wedge_task_info - information about the guilty task of a wedge dev
> >> -- 
> >> 2.47.1
> >>
> > 
> 
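
For readers following the quoted drm_drv.c hunk, here is a minimal
userspace sketch of how a recovery bitmask could be turned into the
``WEDGED=<method1>[,..,<methodN>]`` string described in the quoted
documentation, ordered from fewer to more side-effects. It is not the
kernel's implementation: the WEDGE_* macros, wedge_names[] and
wedge_format() are hypothetical names chosen for illustration, and the
fallback to "unknown" when no known bit is set is an assumption drawn
from the documentation text.

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

/* Hypothetical mirror of the DRM_WEDGE_RECOVERY_* bits in the patch. */
#define WEDGE_NONE		(1u << 0)
#define WEDGE_REBIND		(1u << 1)
#define WEDGE_BUS_RESET		(1u << 2)
#define WEDGE_VENDOR		(1u << 3)

/* Index == bit position, so iteration order is "less to more side-effects". */
static const char *const wedge_names[] = {
	"none", "rebind", "bus-reset", "vendor-specific",
};

/*
 * Write "WEDGED=<m1>[,<m2>...]" into buf. Bits without a name are ignored;
 * if no known bit is set, "WEDGED=unknown" is written instead.
 * Returns 0 on success, -1 if buf is too small.
 */
int wedge_format(unsigned int methods, char *buf, size_t size)
{
	size_t len, room;
	unsigned int bit;
	int n, first = 1;

	assert(buf != NULL && size > 0);

	n = snprintf(buf, size, "WEDGED=");
	if (n < 0 || (size_t)n >= size)
		return -1;
	len = (size_t)n;

	for (bit = 0; bit < sizeof(wedge_names) / sizeof(wedge_names[0]); bit++) {
		if (!(methods & (1u << bit)))
			continue;
		room = size - len;
		n = snprintf(buf + len, room, "%s%s",
			     first ? "" : ",", wedge_names[bit]);
		if (n < 0 || (size_t)n >= room)
			return -1;
		len += (size_t)n;
		first = 0;
	}

	if (first) {
		room = size - len;
		n = snprintf(buf + len, room, "unknown");
		if (n < 0 || (size_t)n >= room)
			return -1;
	}
	return 0;
}
```

With all four methods set, the resulting string is 44 characters long
plus the terminator, so a caller of this sketch needs a buffer of at
least 45 bytes.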


* Re: [PATCH v4 1/9] drm: Add a vendor-specific recovery method to device wedged uevent
  2025-07-09 13:41   ` Simona Vetter
  2025-07-09 14:09     ` Christian König
@ 2025-07-09 14:46     ` Riana Tauro
  1 sibling, 0 replies; 48+ messages in thread
From: Riana Tauro @ 2025-07-09 14:46 UTC (permalink / raw)
  To: Simona Vetter
  Cc: intel-xe, anshuman.gupta, rodrigo.vivi, lucas.demarchi,
	aravind.iddamsetty, raag.jadav, umesh.nerlige.ramappa,
	frank.scarbrough, sk.anirban, André Almeida,
	Christian König, David Airlie, dri-devel

Hi Sima

On 7/9/2025 7:11 PM, Simona Vetter wrote:
> On Wed, Jul 09, 2025 at 04:50:13PM +0530, Riana Tauro wrote:
>> Certain errors can cause the device to be wedged and may
>> require a vendor specific recovery method to restore normal
>> operation.
>>
>> Add a recovery method 'WEDGED=vendor-specific' for such errors. Vendors
>> must provide additional recovery documentation if this method
>> is used.
>>
>> v2: fix documentation (Raag)
>>
>> Cc: André Almeida <andrealmeid@igalia.com>
>> Cc: Christian König <christian.koenig@amd.com>
>> Cc: David Airlie <airlied@gmail.com>
>> Cc: <dri-devel@lists.freedesktop.org>
>> Suggested-by: Raag Jadav <raag.jadav@intel.com>
>> Signed-off-by: Riana Tauro <riana.tauro@intel.com>
> 
> I'm not really understanding what this is useful for, maybe concrete
> example in the form of driver code that uses this, and some tool or
> documentation steps that should be taken for recovery?

The example and the documentation for vendor-specific recovery are part
of the same series.
Patchwork link: https://patchwork.freedesktop.org/series/149756/

The fwupd tool will use this. The PR below was the initial one raised;
it is yet to be updated to use 'vendor-specific'.

PR: https://github.com/fwupd/fwupd/pull/8922

> 
> The issue I'm seeing here is that eventually we'll get different
> vendor-specific recovery steps, and maybe even on the same device, and
> that leads us to an enumeration issue. Since it's just a string and an
> enum I think it'd be better to just allocate a new one every time there's
> a new strange recovery method instead of this opaque approach.

It started as a specific macro and string, but based on review comments
it was changed to a generic macro.

Thanks
Riana

> 
> Cheers, Sima
> 
>> ---
>>   Documentation/gpu/drm-uapi.rst | 9 +++++----
>>   drivers/gpu/drm/drm_drv.c      | 2 ++
>>   include/drm/drm_device.h       | 4 ++++
>>   3 files changed, 11 insertions(+), 4 deletions(-)
>>
>> diff --git a/Documentation/gpu/drm-uapi.rst b/Documentation/gpu/drm-uapi.rst
>> index 263e5a97c080..c33070bdb347 100644
>> --- a/Documentation/gpu/drm-uapi.rst
>> +++ b/Documentation/gpu/drm-uapi.rst
>> @@ -421,10 +421,10 @@ Recovery
>>   Current implementation defines three recovery methods, out of which, drivers
>>   can use any one, multiple or none. Method(s) of choice will be sent in the
>>   uevent environment as ``WEDGED=<method1>[,..,<methodN>]`` in order of less to
>> -more side-effects. If driver is unsure about recovery or method is unknown
>> -(like soft/hard system reboot, firmware flashing, physical device replacement
>> -or any other procedure which can't be attempted on the fly), ``WEDGED=unknown``
>> -will be sent instead.
>> +more side-effects. If recovery method is specific to vendor
>> +``WEDGED=vendor-specific`` will be sent and userspace should refer to vendor
>> +specific documentation for further recovery steps. If driver is unsure about
>> +recovery or method is unknown, ``WEDGED=unknown`` will be sent instead
>>   
>>   Userspace consumers can parse this event and attempt recovery as per the
>>   following expectations.
>> @@ -435,6 +435,7 @@ following expectations.
>>       none            optional telemetry collection
>>       rebind          unbind + bind driver
>>       bus-reset       unbind + bus reset/re-enumeration + bind
>> +    vendor-specific vendor specific recovery method
>>       unknown         consumer policy
>>       =============== ========================================
>>   
>> diff --git a/drivers/gpu/drm/drm_drv.c b/drivers/gpu/drm/drm_drv.c
>> index cdd591b11488..0ac723a46a91 100644
>> --- a/drivers/gpu/drm/drm_drv.c
>> +++ b/drivers/gpu/drm/drm_drv.c
>> @@ -532,6 +532,8 @@ static const char *drm_get_wedge_recovery(unsigned int opt)
>>   		return "rebind";
>>   	case DRM_WEDGE_RECOVERY_BUS_RESET:
>>   		return "bus-reset";
>> +	case DRM_WEDGE_RECOVERY_VENDOR:
>> +		return "vendor-specific";
>>   	default:
>>   		return NULL;
>>   	}
>> diff --git a/include/drm/drm_device.h b/include/drm/drm_device.h
>> index 08b3b2467c4c..08a087f149ff 100644
>> --- a/include/drm/drm_device.h
>> +++ b/include/drm/drm_device.h
>> @@ -26,10 +26,14 @@ struct pci_controller;
>>    * Recovery methods for wedged device in order of less to more side-effects.
>>    * To be used with drm_dev_wedged_event() as recovery @method. Callers can
>>    * use any one, multiple (or'd) or none depending on their needs.
>> + *
>> + * Refer to "Device Wedging" chapter in Documentation/gpu/drm-uapi.rst for more
>> + * details.
>>    */
>>   #define DRM_WEDGE_RECOVERY_NONE		BIT(0)	/* optional telemetry collection */
>>   #define DRM_WEDGE_RECOVERY_REBIND	BIT(1)	/* unbind + bind driver */
>>   #define DRM_WEDGE_RECOVERY_BUS_RESET	BIT(2)	/* unbind + reset bus device + bind */
>> +#define DRM_WEDGE_RECOVERY_VENDOR	BIT(3)	/* vendor specific recovery method */
>>   
>>   /**
>>    * struct drm_wedge_task_info - information about the guilty task of a wedge dev
>> -- 
>> 2.47.1
>>
> 
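
As a rough illustration of the consumer side discussed above, the
following sketch parses the value of the ``WEDGED=`` uevent property
(a comma-separated list, per the quoted documentation) into a bitmask.
It is not taken from fwupd or any existing tool: the wedged_method enum,
wedged_names[] and wedged_parse() are hypothetical names, and treating
unrecognised tokens as "unknown" is an assumption about a reasonable
consumer policy.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical consumer-side flags; not a uAPI header. */
enum wedged_method {
	WEDGED_NONE	 = 1u << 0,
	WEDGED_REBIND	 = 1u << 1,
	WEDGED_BUS_RESET = 1u << 2,
	WEDGED_VENDOR	 = 1u << 3,
	WEDGED_UNKNOWN	 = 1u << 4,
};

static const struct {
	const char *name;
	unsigned int flag;
} wedged_names[] = {
	{ "none",		WEDGED_NONE },
	{ "rebind",		WEDGED_REBIND },
	{ "bus-reset",		WEDGED_BUS_RESET },
	{ "vendor-specific",	WEDGED_VENDOR },
	{ "unknown",		WEDGED_UNKNOWN },
};

/*
 * Parse the text after "WEDGED=" into a bitmask of wedged_method flags.
 * Tokens that match no known name (including empty tokens between two
 * commas) are reported as WEDGED_UNKNOWN. An empty string yields 0.
 */
unsigned int wedged_parse(const char *value)
{
	unsigned int mask = 0;

	assert(value != NULL);

	while (*value) {
		size_t len = strcspn(value, ",");
		unsigned int flag = WEDGED_UNKNOWN;
		size_t i;

		for (i = 0; i < sizeof(wedged_names) / sizeof(wedged_names[0]); i++) {
			if (strlen(wedged_names[i].name) == len &&
			    strncmp(value, wedged_names[i].name, len) == 0) {
				flag = wedged_names[i].flag;
				break;
			}
		}
		mask |= flag;

		value += len;
		if (*value == ',')
			value++;	/* skip the separator */
	}
	return mask;
}
```

A consumer could then test the WEDGED_VENDOR bit and, if it is set, hand
off to whatever procedure the vendor documentation describes instead of
attempting a rebind or bus reset.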




* ✗ Xe.CI.Full: failure for Handle Firmware reported Hardware Errors (rev4)
  2025-07-09 11:20 [PATCH v4 0/9] Handle Firmware reported Hardware Errors Riana Tauro
                   ` (12 preceding siblings ...)
  2025-07-09 13:06 ` ✓ Xe.CI.BAT: success " Patchwork
@ 2025-07-09 15:02 ` Patchwork
  13 siblings, 0 replies; 48+ messages in thread
From: Patchwork @ 2025-07-09 15:02 UTC (permalink / raw)
  To: Riana Tauro; +Cc: intel-xe


== Series Details ==

Series: Handle Firmware reported Hardware Errors (rev4)
URL   : https://patchwork.freedesktop.org/series/149756/
State : failure

== Summary ==

CI Bug Log - changes from xe-3381-20adfb60af27bc0e490b2d20609c3158ae2fbd26_FULL -> xe-pw-149756v4_FULL
====================================================

Summary
-------

  **FAILURE**

  Serious unknown changes coming with xe-pw-149756v4_FULL absolutely need to be
  verified manually.
  
  If you think the reported changes have nothing to do with the changes
  introduced in xe-pw-149756v4_FULL, please notify your bug team (I915-ci-infra@lists.freedesktop.org) to allow them
  to document this new failure mode, which will reduce false positives in CI.

  

Participating hosts (4 -> 4)
------------------------------

  No changes in participating hosts

Possible new issues
-------------------

  Here are the unknown changes that may have been introduced in xe-pw-149756v4_FULL:

### IGT changes ###

#### Possible regressions ####

  * igt@xe_module_load@unload:
    - shard-dg2-set2:     [PASS][1] -> [INCOMPLETE][2]
   [1]: https://intel-gfx-ci.01.org/tree/intel-xe/xe-3381-20adfb60af27bc0e490b2d20609c3158ae2fbd26/shard-dg2-435/igt@xe_module_load@unload.html
   [2]: https://intel-gfx-ci.01.org/tree/intel-xe/xe-pw-149756v4/shard-dg2-436/igt@xe_module_load@unload.html

  * igt@xe_pm@s4-basic-exec:
    - shard-lnl:          [PASS][3] -> [DMESG-WARN][4]
   [3]: https://intel-gfx-ci.01.org/tree/intel-xe/xe-3381-20adfb60af27bc0e490b2d20609c3158ae2fbd26/shard-lnl-7/igt@xe_pm@s4-basic-exec.html
   [4]: https://intel-gfx-ci.01.org/tree/intel-xe/xe-pw-149756v4/shard-lnl-1/igt@xe_pm@s4-basic-exec.html

  
New tests
---------

  New tests have been introduced between xe-3381-20adfb60af27bc0e490b2d20609c3158ae2fbd26_FULL and xe-pw-149756v4_FULL:

### New IGT tests (1) ###

  * igt@kms_flip@modeset-vs-vblank-race-interruptible@d-hdmi-a2:
    - Statuses : 1 pass(s)
    - Exec time: [1.79] s

  

Known issues
------------

  Here are the changes found in xe-pw-149756v4_FULL that come from known issues:

### IGT changes ###

#### Issues hit ####

  * igt@fbdev@nullptr:
    - shard-dg2-set2:     [PASS][5] -> [SKIP][6] ([Intel XE#2134])
   [5]: https://intel-gfx-ci.01.org/tree/intel-xe/xe-3381-20adfb60af27bc0e490b2d20609c3158ae2fbd26/shard-dg2-433/igt@fbdev@nullptr.html
   [6]: https://intel-gfx-ci.01.org/tree/intel-xe/xe-pw-149756v4/shard-dg2-464/igt@fbdev@nullptr.html

  * igt@kms_big_fb@4-tiled-8bpp-rotate-270:
    - shard-bmg:          NOTRUN -> [SKIP][7] ([Intel XE#2327])
   [7]: https://intel-gfx-ci.01.org/tree/intel-xe/xe-pw-149756v4/shard-bmg-4/igt@kms_big_fb@4-tiled-8bpp-rotate-270.html

  * igt@kms_big_fb@y-tiled-16bpp-rotate-180:
    - shard-bmg:          NOTRUN -> [SKIP][8] ([Intel XE#1124]) +4 other tests skip
   [8]: https://intel-gfx-ci.01.org/tree/intel-xe/xe-pw-149756v4/shard-bmg-4/igt@kms_big_fb@y-tiled-16bpp-rotate-180.html

  * igt@kms_ccs@ccs-on-another-bo-y-tiled-gen12-mc-ccs:
    - shard-bmg:          NOTRUN -> [SKIP][9] ([Intel XE#2887]) +5 other tests skip
   [9]: https://intel-gfx-ci.01.org/tree/intel-xe/xe-pw-149756v4/shard-bmg-5/igt@kms_ccs@ccs-on-another-bo-y-tiled-gen12-mc-ccs.html

  * igt@kms_ccs@crc-primary-basic-4-tiled-mtl-rc-ccs@pipe-b-hdmi-a-6:
    - shard-dg2-set2:     NOTRUN -> [SKIP][10] ([Intel XE#787]) +174 other tests skip
   [10]: https://intel-gfx-ci.01.org/tree/intel-xe/xe-pw-149756v4/shard-dg2-436/igt@kms_ccs@crc-primary-basic-4-tiled-mtl-rc-ccs@pipe-b-hdmi-a-6.html

  * igt@kms_ccs@crc-primary-basic-yf-tiled-ccs@pipe-d-dp-2:
    - shard-dg2-set2:     NOTRUN -> [SKIP][11] ([Intel XE#455] / [Intel XE#787]) +24 other tests skip
   [11]: https://intel-gfx-ci.01.org/tree/intel-xe/xe-pw-149756v4/shard-dg2-432/igt@kms_ccs@crc-primary-basic-yf-tiled-ccs@pipe-d-dp-2.html

  * igt@kms_ccs@crc-primary-suspend-4-tiled-mtl-rc-ccs-cc:
    - shard-bmg:          NOTRUN -> [SKIP][12] ([Intel XE#3432]) +1 other test skip
   [12]: https://intel-gfx-ci.01.org/tree/intel-xe/xe-pw-149756v4/shard-bmg-5/igt@kms_ccs@crc-primary-suspend-4-tiled-mtl-rc-ccs-cc.html

  * igt@kms_ccs@random-ccs-data-4-tiled-dg2-mc-ccs@pipe-c-dp-4:
    - shard-dg2-set2:     NOTRUN -> [DMESG-WARN][13] ([Intel XE#1727] / [Intel XE#3113])
   [13]: https://intel-gfx-ci.01.org/tree/intel-xe/xe-pw-149756v4/shard-dg2-433/igt@kms_ccs@random-ccs-data-4-tiled-dg2-mc-ccs@pipe-c-dp-4.html

  * igt@kms_ccs@random-ccs-data-4-tiled-dg2-mc-ccs@pipe-d-hdmi-a-6:
    - shard-dg2-set2:     NOTRUN -> [INCOMPLETE][14] ([Intel XE#3124])
   [14]: https://intel-gfx-ci.01.org/tree/intel-xe/xe-pw-149756v4/shard-dg2-433/igt@kms_ccs@random-ccs-data-4-tiled-dg2-mc-ccs@pipe-d-hdmi-a-6.html

  * igt@kms_ccs@random-ccs-data-4-tiled-dg2-rc-ccs@pipe-b-dp-4:
    - shard-dg2-set2:     NOTRUN -> [INCOMPLETE][15] ([Intel XE#1727] / [Intel XE#3113] / [Intel XE#4212] / [Intel XE#4522])
   [15]: https://intel-gfx-ci.01.org/tree/intel-xe/xe-pw-149756v4/shard-dg2-435/igt@kms_ccs@random-ccs-data-4-tiled-dg2-rc-ccs@pipe-b-dp-4.html

  * igt@kms_cdclk@plane-scaling@pipe-b-dp-4:
    - shard-dg2-set2:     NOTRUN -> [SKIP][16] ([Intel XE#4416]) +3 other tests skip
   [16]: https://intel-gfx-ci.01.org/tree/intel-xe/xe-pw-149756v4/shard-dg2-466/igt@kms_cdclk@plane-scaling@pipe-b-dp-4.html

  * igt@kms_chamelium_color@ctm-0-25:
    - shard-bmg:          NOTRUN -> [SKIP][17] ([Intel XE#2325])
   [17]: https://intel-gfx-ci.01.org/tree/intel-xe/xe-pw-149756v4/shard-bmg-4/igt@kms_chamelium_color@ctm-0-25.html

  * igt@kms_chamelium_edid@dp-edid-stress-resolution-4k:
    - shard-bmg:          NOTRUN -> [SKIP][18] ([Intel XE#2252]) +2 other tests skip
   [18]: https://intel-gfx-ci.01.org/tree/intel-xe/xe-pw-149756v4/shard-bmg-4/igt@kms_chamelium_edid@dp-edid-stress-resolution-4k.html

  * igt@kms_content_protection@content-type-change:
    - shard-bmg:          NOTRUN -> [SKIP][19] ([Intel XE#2341])
   [19]: https://intel-gfx-ci.01.org/tree/intel-xe/xe-pw-149756v4/shard-bmg-4/igt@kms_content_protection@content-type-change.html

  * igt@kms_content_protection@legacy@pipe-a-dp-4:
    - shard-dg2-set2:     NOTRUN -> [FAIL][20] ([Intel XE#1178])
   [20]: https://intel-gfx-ci.01.org/tree/intel-xe/xe-pw-149756v4/shard-dg2-463/igt@kms_content_protection@legacy@pipe-a-dp-4.html

  * igt@kms_content_protection@uevent@pipe-a-dp-2:
    - shard-bmg:          NOTRUN -> [FAIL][21] ([Intel XE#1188])
   [21]: https://intel-gfx-ci.01.org/tree/intel-xe/xe-pw-149756v4/shard-bmg-7/igt@kms_content_protection@uevent@pipe-a-dp-2.html

  * igt@kms_cursor_crc@cursor-offscreen-128x42:
    - shard-adlp:         [PASS][22] -> [DMESG-WARN][23] ([Intel XE#2953] / [Intel XE#4173]) +16 other tests dmesg-warn
   [22]: https://intel-gfx-ci.01.org/tree/intel-xe/xe-3381-20adfb60af27bc0e490b2d20609c3158ae2fbd26/shard-adlp-6/igt@kms_cursor_crc@cursor-offscreen-128x42.html
   [23]: https://intel-gfx-ci.01.org/tree/intel-xe/xe-pw-149756v4/shard-adlp-1/igt@kms_cursor_crc@cursor-offscreen-128x42.html

  * igt@kms_cursor_crc@cursor-onscreen-512x512:
    - shard-bmg:          NOTRUN -> [SKIP][24] ([Intel XE#2321])
   [24]: https://intel-gfx-ci.01.org/tree/intel-xe/xe-pw-149756v4/shard-bmg-4/igt@kms_cursor_crc@cursor-onscreen-512x512.html

  * igt@kms_cursor_crc@cursor-rapid-movement-max-size:
    - shard-bmg:          NOTRUN -> [SKIP][25] ([Intel XE#2320]) +2 other tests skip
   [25]: https://intel-gfx-ci.01.org/tree/intel-xe/xe-pw-149756v4/shard-bmg-4/igt@kms_cursor_crc@cursor-rapid-movement-max-size.html

  * igt@kms_cursor_legacy@2x-flip-vs-cursor-legacy:
    - shard-bmg:          NOTRUN -> [SKIP][26] ([Intel XE#2291])
   [26]: https://intel-gfx-ci.01.org/tree/intel-xe/xe-pw-149756v4/shard-bmg-5/igt@kms_cursor_legacy@2x-flip-vs-cursor-legacy.html

  * igt@kms_cursor_legacy@2x-nonblocking-modeset-vs-cursor-atomic:
    - shard-bmg:          [PASS][27] -> [SKIP][28] ([Intel XE#2291])
   [27]: https://intel-gfx-ci.01.org/tree/intel-xe/xe-3381-20adfb60af27bc0e490b2d20609c3158ae2fbd26/shard-bmg-1/igt@kms_cursor_legacy@2x-nonblocking-modeset-vs-cursor-atomic.html
   [28]: https://intel-gfx-ci.01.org/tree/intel-xe/xe-pw-149756v4/shard-bmg-6/igt@kms_cursor_legacy@2x-nonblocking-modeset-vs-cursor-atomic.html

  * igt@kms_dp_linktrain_fallback@dp-fallback:
    - shard-bmg:          [PASS][29] -> [SKIP][30] ([Intel XE#4294])
   [29]: https://intel-gfx-ci.01.org/tree/intel-xe/xe-3381-20adfb60af27bc0e490b2d20609c3158ae2fbd26/shard-bmg-2/igt@kms_dp_linktrain_fallback@dp-fallback.html
   [30]: https://intel-gfx-ci.01.org/tree/intel-xe/xe-pw-149756v4/shard-bmg-5/igt@kms_dp_linktrain_fallback@dp-fallback.html

  * igt@kms_draw_crc@fill-fb:
    - shard-dg2-set2:     [PASS][31] -> [SKIP][32] ([Intel XE#2351] / [Intel XE#4208]) +5 other tests skip
   [31]: https://intel-gfx-ci.01.org/tree/intel-xe/xe-3381-20adfb60af27bc0e490b2d20609c3158ae2fbd26/shard-dg2-433/igt@kms_draw_crc@fill-fb.html
   [32]: https://intel-gfx-ci.01.org/tree/intel-xe/xe-pw-149756v4/shard-dg2-464/igt@kms_draw_crc@fill-fb.html

  * igt@kms_fbc_dirty_rect@fbc-dirty-rectangle-different-formats:
    - shard-bmg:          NOTRUN -> [SKIP][33] ([Intel XE#4422])
   [33]: https://intel-gfx-ci.01.org/tree/intel-xe/xe-pw-149756v4/shard-bmg-4/igt@kms_fbc_dirty_rect@fbc-dirty-rectangle-different-formats.html

  * igt@kms_feature_discovery@psr1:
    - shard-bmg:          NOTRUN -> [SKIP][34] ([Intel XE#2374])
   [34]: https://intel-gfx-ci.01.org/tree/intel-xe/xe-pw-149756v4/shard-bmg-4/igt@kms_feature_discovery@psr1.html

  * igt@kms_flip@2x-flip-vs-dpms-on-nop:
    - shard-bmg:          [PASS][35] -> [SKIP][36] ([Intel XE#2316]) +6 other tests skip
   [35]: https://intel-gfx-ci.01.org/tree/intel-xe/xe-3381-20adfb60af27bc0e490b2d20609c3158ae2fbd26/shard-bmg-2/igt@kms_flip@2x-flip-vs-dpms-on-nop.html
   [36]: https://intel-gfx-ci.01.org/tree/intel-xe/xe-pw-149756v4/shard-bmg-5/igt@kms_flip@2x-flip-vs-dpms-on-nop.html

  * igt@kms_flip@2x-flip-vs-wf_vblank:
    - shard-bmg:          NOTRUN -> [SKIP][37] ([Intel XE#2316]) +1 other test skip
   [37]: https://intel-gfx-ci.01.org/tree/intel-xe/xe-pw-149756v4/shard-bmg-5/igt@kms_flip@2x-flip-vs-wf_vblank.html

  * igt@kms_flip@basic-plain-flip@b-hdmi-a1:
    - shard-adlp:         [PASS][38] -> [DMESG-WARN][39] ([Intel XE#4543]) +4 other tests dmesg-warn
   [38]: https://intel-gfx-ci.01.org/tree/intel-xe/xe-3381-20adfb60af27bc0e490b2d20609c3158ae2fbd26/shard-adlp-6/igt@kms_flip@basic-plain-flip@b-hdmi-a1.html
   [39]: https://intel-gfx-ci.01.org/tree/intel-xe/xe-pw-149756v4/shard-adlp-1/igt@kms_flip@basic-plain-flip@b-hdmi-a1.html

  * igt@kms_flip@busy-flip:
    - shard-dg2-set2:     [PASS][40] -> [SKIP][41] ([Intel XE#4208] / [i915#2575]) +65 other tests skip
   [40]: https://intel-gfx-ci.01.org/tree/intel-xe/xe-3381-20adfb60af27bc0e490b2d20609c3158ae2fbd26/shard-dg2-433/igt@kms_flip@busy-flip.html
   [41]: https://intel-gfx-ci.01.org/tree/intel-xe/xe-pw-149756v4/shard-dg2-464/igt@kms_flip@busy-flip.html

  * igt@kms_flip@flip-vs-expired-vblank-interruptible@b-edp1:
    - shard-lnl:          [PASS][42] -> [FAIL][43] ([Intel XE#301]) +3 other tests fail
   [42]: https://intel-gfx-ci.01.org/tree/intel-xe/xe-3381-20adfb60af27bc0e490b2d20609c3158ae2fbd26/shard-lnl-7/igt@kms_flip@flip-vs-expired-vblank-interruptible@b-edp1.html
   [43]: https://intel-gfx-ci.01.org/tree/intel-xe/xe-pw-149756v4/shard-lnl-6/igt@kms_flip@flip-vs-expired-vblank-interruptible@b-edp1.html

  * igt@kms_flip@flip-vs-suspend:
    - shard-bmg:          [PASS][44] -> [INCOMPLETE][45] ([Intel XE#2049] / [Intel XE#2597]) +1 other test incomplete
   [44]: https://intel-gfx-ci.01.org/tree/intel-xe/xe-3381-20adfb60af27bc0e490b2d20609c3158ae2fbd26/shard-bmg-3/igt@kms_flip@flip-vs-suspend.html
   [45]: https://intel-gfx-ci.01.org/tree/intel-xe/xe-pw-149756v4/shard-bmg-1/igt@kms_flip@flip-vs-suspend.html

  * igt@kms_flip_scaled_crc@flip-64bpp-ytile-to-16bpp-ytile-downscaling:
    - shard-bmg:          NOTRUN -> [SKIP][46] ([Intel XE#2293] / [Intel XE#2380]) +1 other test skip
   [46]: https://intel-gfx-ci.01.org/tree/intel-xe/xe-pw-149756v4/shard-bmg-4/igt@kms_flip_scaled_crc@flip-64bpp-ytile-to-16bpp-ytile-downscaling.html

  * igt@kms_flip_scaled_crc@flip-64bpp-ytile-to-16bpp-ytile-downscaling@pipe-a-valid-mode:
    - shard-bmg:          NOTRUN -> [SKIP][47] ([Intel XE#2293]) +1 other test skip
   [47]: https://intel-gfx-ci.01.org/tree/intel-xe/xe-pw-149756v4/shard-bmg-4/igt@kms_flip_scaled_crc@flip-64bpp-ytile-to-16bpp-ytile-downscaling@pipe-a-valid-mode.html
    - shard-dg2-set2:     NOTRUN -> [SKIP][48] ([Intel XE#455]) +2 other tests skip
   [48]: https://intel-gfx-ci.01.org/tree/intel-xe/xe-pw-149756v4/shard-dg2-436/igt@kms_flip_scaled_crc@flip-64bpp-ytile-to-16bpp-ytile-downscaling@pipe-a-valid-mode.html

  * igt@kms_frontbuffer_tracking@drrs-1p-primscrn-spr-indfb-fullscreen:
    - shard-bmg:          NOTRUN -> [SKIP][49] ([Intel XE#2311]) +7 other tests skip
   [49]: https://intel-gfx-ci.01.org/tree/intel-xe/xe-pw-149756v4/shard-bmg-4/igt@kms_frontbuffer_tracking@drrs-1p-primscrn-spr-indfb-fullscreen.html

  * igt@kms_frontbuffer_tracking@drrs-2p-scndscrn-pri-indfb-draw-mmap-wc:
    - shard-bmg:          NOTRUN -> [SKIP][50] ([Intel XE#2312]) +5 other tests skip
   [50]: https://intel-gfx-ci.01.org/tree/intel-xe/xe-pw-149756v4/shard-bmg-5/igt@kms_frontbuffer_tracking@drrs-2p-scndscrn-pri-indfb-draw-mmap-wc.html

  * igt@kms_frontbuffer_tracking@fbc-1p-primscrn-indfb-pgflip-blt:
    - shard-bmg:          NOTRUN -> [SKIP][51] ([Intel XE#5390]) +5 other tests skip
   [51]: https://intel-gfx-ci.01.org/tree/intel-xe/xe-pw-149756v4/shard-bmg-5/igt@kms_frontbuffer_tracking@fbc-1p-primscrn-indfb-pgflip-blt.html

  * igt@kms_frontbuffer_tracking@fbc-tiling-y:
    - shard-bmg:          NOTRUN -> [SKIP][52] ([Intel XE#2352])
   [52]: https://intel-gfx-ci.01.org/tree/intel-xe/xe-pw-149756v4/shard-bmg-4/igt@kms_frontbuffer_tracking@fbc-tiling-y.html

  * igt@kms_frontbuffer_tracking@fbcpsr-rgb565-draw-mmap-wc:
    - shard-bmg:          NOTRUN -> [SKIP][53] ([Intel XE#2313]) +6 other tests skip
   [53]: https://intel-gfx-ci.01.org/tree/intel-xe/xe-pw-149756v4/shard-bmg-4/igt@kms_frontbuffer_tracking@fbcpsr-rgb565-draw-mmap-wc.html

  * igt@kms_joiner@basic-force-big-joiner:
    - shard-bmg:          [PASS][54] -> [SKIP][55] ([Intel XE#3012])
   [54]: https://intel-gfx-ci.01.org/tree/intel-xe/xe-3381-20adfb60af27bc0e490b2d20609c3158ae2fbd26/shard-bmg-2/igt@kms_joiner@basic-force-big-joiner.html
   [55]: https://intel-gfx-ci.01.org/tree/intel-xe/xe-pw-149756v4/shard-bmg-5/igt@kms_joiner@basic-force-big-joiner.html

  * igt@kms_pm_dc@dc3co-vpb-simulation:
    - shard-bmg:          NOTRUN -> [SKIP][56] ([Intel XE#2391])
   [56]: https://intel-gfx-ci.01.org/tree/intel-xe/xe-pw-149756v4/shard-bmg-5/igt@kms_pm_dc@dc3co-vpb-simulation.html

  * igt@kms_psr2_sf@fbc-pr-cursor-plane-move-continuous-sf:
    - shard-bmg:          NOTRUN -> [SKIP][57] ([Intel XE#1489]) +3 other tests skip
   [57]: https://intel-gfx-ci.01.org/tree/intel-xe/xe-pw-149756v4/shard-bmg-5/igt@kms_psr2_sf@fbc-pr-cursor-plane-move-continuous-sf.html

  * igt@kms_psr@pr-sprite-render:
    - shard-bmg:          NOTRUN -> [SKIP][58] ([Intel XE#2234] / [Intel XE#2850]) +7 other tests skip
   [58]: https://intel-gfx-ci.01.org/tree/intel-xe/xe-pw-149756v4/shard-bmg-5/igt@kms_psr@pr-sprite-render.html

  * igt@kms_rotation_crc@primary-rotation-270:
    - shard-bmg:          NOTRUN -> [SKIP][59] ([Intel XE#3414] / [Intel XE#3904])
   [59]: https://intel-gfx-ci.01.org/tree/intel-xe/xe-pw-149756v4/shard-bmg-4/igt@kms_rotation_crc@primary-rotation-270.html

  * igt@kms_rotation_crc@primary-rotation-90:
    - shard-dg2-set2:     NOTRUN -> [SKIP][60] ([Intel XE#4208] / [i915#2575]) +2 other tests skip
   [60]: https://intel-gfx-ci.01.org/tree/intel-xe/xe-pw-149756v4/shard-dg2-464/igt@kms_rotation_crc@primary-rotation-90.html

  * igt@kms_vblank@ts-continuation-dpms-suspend:
    - shard-bmg:          [PASS][61] -> [INCOMPLETE][62] ([Intel XE#4488]) +1 other test incomplete
   [61]: https://intel-gfx-ci.01.org/tree/intel-xe/xe-3381-20adfb60af27bc0e490b2d20609c3158ae2fbd26/shard-bmg-5/igt@kms_vblank@ts-continuation-dpms-suspend.html
   [62]: https://intel-gfx-ci.01.org/tree/intel-xe/xe-pw-149756v4/shard-bmg-6/igt@kms_vblank@ts-continuation-dpms-suspend.html

  * igt@kms_vrr@seamless-rr-switch-drrs:
    - shard-bmg:          NOTRUN -> [SKIP][63] ([Intel XE#1499]) +1 other test skip
   [63]: https://intel-gfx-ci.01.org/tree/intel-xe/xe-pw-149756v4/shard-bmg-4/igt@kms_vrr@seamless-rr-switch-drrs.html

  * igt@xe_compute_preempt@compute-threadgroup-preempt@engine-drm_xe_engine_class_compute:
    - shard-dg2-set2:     NOTRUN -> [SKIP][64] ([Intel XE#1280] / [Intel XE#455])
   [64]: https://intel-gfx-ci.01.org/tree/intel-xe/xe-pw-149756v4/shard-dg2-436/igt@xe_compute_preempt@compute-threadgroup-preempt@engine-drm_xe_engine_class_compute.html

  * igt@xe_eudebug@basic-vm-bind-metadata-discovery:
    - shard-bmg:          NOTRUN -> [SKIP][65] ([Intel XE#4837]) +4 other tests skip
   [65]: https://intel-gfx-ci.01.org/tree/intel-xe/xe-pw-149756v4/shard-bmg-5/igt@xe_eudebug@basic-vm-bind-metadata-discovery.html

  * igt@xe_evict_ccs@evict-overcommit-standalone-instantfree-reopen:
    - shard-dg2-set2:     [PASS][66] -> [SKIP][67] ([Intel XE#4208]) +133 other tests skip
   [66]: https://intel-gfx-ci.01.org/tree/intel-xe/xe-3381-20adfb60af27bc0e490b2d20609c3158ae2fbd26/shard-dg2-433/igt@xe_evict_ccs@evict-overcommit-standalone-instantfree-reopen.html
   [67]: https://intel-gfx-ci.01.org/tree/intel-xe/xe-pw-149756v4/shard-dg2-464/igt@xe_evict_ccs@evict-overcommit-standalone-instantfree-reopen.html

  * igt@xe_exec_basic@multigpu-no-exec-bindexecqueue-userptr-rebind:
    - shard-dg2-set2:     [PASS][68] -> [SKIP][69] ([Intel XE#1392]) +4 other tests skip
   [68]: https://intel-gfx-ci.01.org/tree/intel-xe/xe-3381-20adfb60af27bc0e490b2d20609c3158ae2fbd26/shard-dg2-433/igt@xe_exec_basic@multigpu-no-exec-bindexecqueue-userptr-rebind.html
   [69]: https://intel-gfx-ci.01.org/tree/intel-xe/xe-pw-149756v4/shard-dg2-432/igt@xe_exec_basic@multigpu-no-exec-bindexecqueue-userptr-rebind.html

  * igt@xe_exec_basic@multigpu-once-bindexecqueue-userptr-invalidate:
    - shard-bmg:          NOTRUN -> [SKIP][70] ([Intel XE#2322]) +1 other test skip
   [70]: https://intel-gfx-ci.01.org/tree/intel-xe/xe-pw-149756v4/shard-bmg-4/igt@xe_exec_basic@multigpu-once-bindexecqueue-userptr-invalidate.html

  * igt@xe_exec_system_allocator@threads-many-mmap-new-huge-nomemset:
    - shard-bmg:          NOTRUN -> [SKIP][71] ([Intel XE#4943]) +9 other tests skip
   [71]: https://intel-gfx-ci.01.org/tree/intel-xe/xe-pw-149756v4/shard-bmg-4/igt@xe_exec_system_allocator@threads-many-mmap-new-huge-nomemset.html

  * igt@xe_exec_system_allocator@threads-shared-vm-many-large-new-bo-map-nomemset:
    - shard-lnl:          [PASS][72] -> [FAIL][73] ([Intel XE#5018])
   [72]: https://intel-gfx-ci.01.org/tree/intel-xe/xe-3381-20adfb60af27bc0e490b2d20609c3158ae2fbd26/shard-lnl-3/igt@xe_exec_system_allocator@threads-shared-vm-many-large-new-bo-map-nomemset.html
   [73]: https://intel-gfx-ci.01.org/tree/intel-xe/xe-pw-149756v4/shard-lnl-5/igt@xe_exec_system_allocator@threads-shared-vm-many-large-new-bo-map-nomemset.html

  * igt@xe_exec_system_allocator@threads-shared-vm-many-mmap-shared-nomemset:
    - shard-dg2-set2:     NOTRUN -> [SKIP][74] ([Intel XE#4208]) +19 other tests skip
   [74]: https://intel-gfx-ci.01.org/tree/intel-xe/xe-pw-149756v4/shard-dg2-464/igt@xe_exec_system_allocator@threads-shared-vm-many-mmap-shared-nomemset.html

  * igt@xe_module_load@reload-no-display:
    - shard-dg2-set2:     [PASS][75] -> [FAIL][76] ([Intel XE#4208])
   [75]: https://intel-gfx-ci.01.org/tree/intel-xe/xe-3381-20adfb60af27bc0e490b2d20609c3158ae2fbd26/shard-dg2-466/igt@xe_module_load@reload-no-display.html
   [76]: https://intel-gfx-ci.01.org/tree/intel-xe/xe-pw-149756v4/shard-dg2-464/igt@xe_module_load@reload-no-display.html

  * igt@xe_oa@oa-exponents@ccs-0:
    - shard-lnl:          [PASS][77] -> [TIMEOUT][78] ([Intel XE#5339]) +1 other test timeout
   [77]: https://intel-gfx-ci.01.org/tree/intel-xe/xe-3381-20adfb60af27bc0e490b2d20609c3158ae2fbd26/shard-lnl-6/igt@xe_oa@oa-exponents@ccs-0.html
   [78]: https://intel-gfx-ci.01.org/tree/intel-xe/xe-pw-149756v4/shard-lnl-7/igt@xe_oa@oa-exponents@ccs-0.html

  * igt@xe_oa@oa-tlb-invalidate:
    - shard-bmg:          NOTRUN -> [SKIP][79] ([Intel XE#2248])
   [79]: https://intel-gfx-ci.01.org/tree/intel-xe/xe-pw-149756v4/shard-bmg-5/igt@xe_oa@oa-tlb-invalidate.html

  * igt@xe_pat@pat-index-xehpc:
    - shard-bmg:          NOTRUN -> [SKIP][80] ([Intel XE#1420])
   [80]: https://intel-gfx-ci.01.org/tree/intel-xe/xe-pw-149756v4/shard-bmg-4/igt@xe_pat@pat-index-xehpc.html

  * igt@xe_pat@pat-index-xelpg:
    - shard-bmg:          NOTRUN -> [SKIP][81] ([Intel XE#2236])
   [81]: https://intel-gfx-ci.01.org/tree/intel-xe/xe-pw-149756v4/shard-bmg-5/igt@xe_pat@pat-index-xelpg.html

  * igt@xe_pm@s3-vm-bind-userptr:
    - shard-adlp:         [PASS][82] -> [DMESG-WARN][83] ([Intel XE#2953] / [Intel XE#4173] / [Intel XE#569])
   [82]: https://intel-gfx-ci.01.org/tree/intel-xe/xe-3381-20adfb60af27bc0e490b2d20609c3158ae2fbd26/shard-adlp-6/igt@xe_pm@s3-vm-bind-userptr.html
   [83]: https://intel-gfx-ci.01.org/tree/intel-xe/xe-pw-149756v4/shard-adlp-1/igt@xe_pm@s3-vm-bind-userptr.html

  * igt@xe_pmu@gt-frequency:
    - shard-dg2-set2:     [PASS][84] -> [FAIL][85] ([Intel XE#5166]) +1 other test fail
   [84]: https://intel-gfx-ci.01.org/tree/intel-xe/xe-3381-20adfb60af27bc0e490b2d20609c3158ae2fbd26/shard-dg2-433/igt@xe_pmu@gt-frequency.html
   [85]: https://intel-gfx-ci.01.org/tree/intel-xe/xe-pw-149756v4/shard-dg2-432/igt@xe_pmu@gt-frequency.html

  * igt@xe_pxp@pxp-termination-key-update-post-suspend:
    - shard-bmg:          NOTRUN -> [SKIP][86] ([Intel XE#4733]) +1 other test skip
   [86]: https://intel-gfx-ci.01.org/tree/intel-xe/xe-pw-149756v4/shard-bmg-5/igt@xe_pxp@pxp-termination-key-update-post-suspend.html

  * igt@xe_sriov_auto_provisioning@selfconfig-reprovision-reduce-numvfs:
    - shard-bmg:          NOTRUN -> [SKIP][87] ([Intel XE#4130]) +1 other test skip
   [87]: https://intel-gfx-ci.01.org/tree/intel-xe/xe-pw-149756v4/shard-bmg-4/igt@xe_sriov_auto_provisioning@selfconfig-reprovision-reduce-numvfs.html

  * igt@xe_vm@bind-array-enobufs:
    - shard-dg2-set2:     [PASS][88] -> [DMESG-FAIL][89] ([Intel XE#3876])
   [88]: https://intel-gfx-ci.01.org/tree/intel-xe/xe-3381-20adfb60af27bc0e490b2d20609c3158ae2fbd26/shard-dg2-463/igt@xe_vm@bind-array-enobufs.html
   [89]: https://intel-gfx-ci.01.org/tree/intel-xe/xe-pw-149756v4/shard-dg2-433/igt@xe_vm@bind-array-enobufs.html

  
#### Possible fixes ####

  * igt@fbdev@info:
    - shard-dg2-set2:     [SKIP][90] ([Intel XE#2134]) -> [PASS][91]
   [90]: https://intel-gfx-ci.01.org/tree/intel-xe/xe-3381-20adfb60af27bc0e490b2d20609c3158ae2fbd26/shard-dg2-464/igt@fbdev@info.html
   [91]: https://intel-gfx-ci.01.org/tree/intel-xe/xe-pw-149756v4/shard-dg2-436/igt@fbdev@info.html

  * igt@intel_sysfs_debugfs@xe-debugfs-read-all-entries-display-on:
    - shard-dg2-set2:     [SKIP][92] ([Intel XE#4208] / [Intel XE#4618]) -> [PASS][93] +1 other test pass
   [92]: https://intel-gfx-ci.01.org/tree/intel-xe/xe-3381-20adfb60af27bc0e490b2d20609c3158ae2fbd26/shard-dg2-464/igt@intel_sysfs_debugfs@xe-debugfs-read-all-entries-display-on.html
   [93]: https://intel-gfx-ci.01.org/tree/intel-xe/xe-pw-149756v4/shard-dg2-432/igt@intel_sysfs_debugfs@xe-debugfs-read-all-entries-display-on.html

  * igt@kms_async_flips@alternate-sync-async-flip-atomic@pipe-a-dp-2:
    - shard-bmg:          [FAIL][94] ([Intel XE#3718] / [Intel XE#827]) -> [PASS][95] +1 other test pass
   [94]: https://intel-gfx-ci.01.org/tree/intel-xe/xe-3381-20adfb60af27bc0e490b2d20609c3158ae2fbd26/shard-bmg-8/igt@kms_async_flips@alternate-sync-async-flip-atomic@pipe-a-dp-2.html
   [95]: https://intel-gfx-ci.01.org/tree/intel-xe/xe-pw-149756v4/shard-bmg-3/igt@kms_async_flips@alternate-sync-async-flip-atomic@pipe-a-dp-2.html

  * igt@kms_async_flips@async-flip-suspend-resume:
    - shard-bmg:          [INCOMPLETE][96] ([Intel XE#4912]) -> [PASS][97] +1 other test pass
   [96]: https://intel-gfx-ci.01.org/tree/intel-xe/xe-3381-20adfb60af27bc0e490b2d20609c3158ae2fbd26/shard-bmg-6/igt@kms_async_flips@async-flip-suspend-resume.html
   [97]: https://intel-gfx-ci.01.org/tree/intel-xe/xe-pw-149756v4/shard-bmg-4/igt@kms_async_flips@async-flip-suspend-resume.html

  * igt@kms_big_fb@y-tiled-max-hw-stride-64bpp-rotate-180-hflip-async-flip:
    - shard-adlp:         [DMESG-FAIL][98] ([Intel XE#4543]) -> [PASS][99]
   [98]: https://intel-gfx-ci.01.org/tree/intel-xe/xe-3381-20adfb60af27bc0e490b2d20609c3158ae2fbd26/shard-adlp-1/igt@kms_big_fb@y-tiled-max-hw-stride-64bpp-rotate-180-hflip-async-flip.html
   [99]: https://intel-gfx-ci.01.org/tree/intel-xe/xe-pw-149756v4/shard-adlp-9/igt@kms_big_fb@y-tiled-max-hw-stride-64bpp-rotate-180-hflip-async-flip.html

  * igt@kms_ccs@crc-primary-basic-4-tiled-dg2-rc-ccs-cc:
    - shard-dg2-set2:     [SKIP][100] ([Intel XE#2351] / [Intel XE#4208]) -> [PASS][101] +3 other tests pass
   [100]: https://intel-gfx-ci.01.org/tree/intel-xe/xe-3381-20adfb60af27bc0e490b2d20609c3158ae2fbd26/shard-dg2-464/igt@kms_ccs@crc-primary-basic-4-tiled-dg2-rc-ccs-cc.html
   [101]: https://intel-gfx-ci.01.org/tree/intel-xe/xe-pw-149756v4/shard-dg2-432/igt@kms_ccs@crc-primary-basic-4-tiled-dg2-rc-ccs-cc.html

  * igt@kms_ccs@random-ccs-data-4-tiled-dg2-mc-ccs@pipe-b-dp-4:
    - shard-dg2-set2:     [INCOMPLETE][102] ([Intel XE#3124] / [Intel XE#4345]) -> [PASS][103]
   [102]: https://intel-gfx-ci.01.org/tree/intel-xe/xe-3381-20adfb60af27bc0e490b2d20609c3158ae2fbd26/shard-dg2-463/igt@kms_ccs@random-ccs-data-4-tiled-dg2-mc-ccs@pipe-b-dp-4.html
   [103]: https://intel-gfx-ci.01.org/tree/intel-xe/xe-pw-149756v4/shard-dg2-433/igt@kms_ccs@random-ccs-data-4-tiled-dg2-mc-ccs@pipe-b-dp-4.html

  * igt@kms_ccs@random-ccs-data-4-tiled-dg2-mc-ccs@pipe-b-hdmi-a-6:
    - shard-dg2-set2:     [DMESG-WARN][104] ([Intel XE#1727] / [Intel XE#3113]) -> [PASS][105] +1 other test pass
   [104]: https://intel-gfx-ci.01.org/tree/intel-xe/xe-3381-20adfb60af27bc0e490b2d20609c3158ae2fbd26/shard-dg2-463/igt@kms_ccs@random-ccs-data-4-tiled-dg2-mc-ccs@pipe-b-hdmi-a-6.html
   [105]: https://intel-gfx-ci.01.org/tree/intel-xe/xe-pw-149756v4/shard-dg2-433/igt@kms_ccs@random-ccs-data-4-tiled-dg2-mc-ccs@pipe-b-hdmi-a-6.html

  * igt@kms_ccs@random-ccs-data-4-tiled-dg2-rc-ccs@pipe-a-dp-4:
    - shard-dg2-set2:     [INCOMPLETE][106] ([Intel XE#3124]) -> [PASS][107]
   [106]: https://intel-gfx-ci.01.org/tree/intel-xe/xe-3381-20adfb60af27bc0e490b2d20609c3158ae2fbd26/shard-dg2-466/igt@kms_ccs@random-ccs-data-4-tiled-dg2-rc-ccs@pipe-a-dp-4.html
   [107]: https://intel-gfx-ci.01.org/tree/intel-xe/xe-pw-149756v4/shard-dg2-435/igt@kms_ccs@random-ccs-data-4-tiled-dg2-rc-ccs@pipe-a-dp-4.html

  * igt@kms_cursor_legacy@cursora-vs-flipb-atomic-transitions-varying-size:
    - shard-bmg:          [SKIP][108] ([Intel XE#2291]) -> [PASS][109] +2 other tests pass
   [108]: https://intel-gfx-ci.01.org/tree/intel-xe/xe-3381-20adfb60af27bc0e490b2d20609c3158ae2fbd26/shard-bmg-5/igt@kms_cursor_legacy@cursora-vs-flipb-atomic-transitions-varying-size.html
   [109]: https://intel-gfx-ci.01.org/tree/intel-xe/xe-pw-149756v4/shard-bmg-7/igt@kms_cursor_legacy@cursora-vs-flipb-atomic-transitions-varying-size.html

  * igt@kms_dither@fb-8bpc-vs-panel-6bpc:
    - shard-bmg:          [SKIP][110] ([Intel XE#1340]) -> [PASS][111]
   [110]: https://intel-gfx-ci.01.org/tree/intel-xe/xe-3381-20adfb60af27bc0e490b2d20609c3158ae2fbd26/shard-bmg-5/igt@kms_dither@fb-8bpc-vs-panel-6bpc.html
   [111]: https://intel-gfx-ci.01.org/tree/intel-xe/xe-pw-149756v4/shard-bmg-7/igt@kms_dither@fb-8bpc-vs-panel-6bpc.html

  * igt@kms_flip@2x-flip-vs-dpms-on-nop-interruptible:
    - shard-bmg:          [SKIP][112] ([Intel XE#2316]) -> [PASS][113] +1 other test pass
   [112]: https://intel-gfx-ci.01.org/tree/intel-xe/xe-3381-20adfb60af27bc0e490b2d20609c3158ae2fbd26/shard-bmg-6/igt@kms_flip@2x-flip-vs-dpms-on-nop-interruptible.html
   [113]: https://intel-gfx-ci.01.org/tree/intel-xe/xe-pw-149756v4/shard-bmg-3/igt@kms_flip@2x-flip-vs-dpms-on-nop-interruptible.html

  * igt@kms_flip@flip-vs-absolute-wf_vblank:
    - shard-dg2-set2:     [SKIP][114] ([Intel XE#4208] / [i915#2575]) -> [PASS][115] +55 other tests pass
   [114]: https://intel-gfx-ci.01.org/tree/intel-xe/xe-3381-20adfb60af27bc0e490b2d20609c3158ae2fbd26/shard-dg2-464/igt@kms_flip@flip-vs-absolute-wf_vblank.html
   [115]: https://intel-gfx-ci.01.org/tree/intel-xe/xe-pw-149756v4/shard-dg2-466/igt@kms_flip@flip-vs-absolute-wf_vblank.html

  * igt@kms_flip@flip-vs-rmfb-interruptible:
    - shard-adlp:         [DMESG-WARN][116] ([Intel XE#4543] / [Intel XE#5208]) -> [PASS][117]
   [116]: https://intel-gfx-ci.01.org/tree/intel-xe/xe-3381-20adfb60af27bc0e490b2d20609c3158ae2fbd26/shard-adlp-4/igt@kms_flip@flip-vs-rmfb-interruptible.html
   [117]: https://intel-gfx-ci.01.org/tree/intel-xe/xe-pw-149756v4/shard-adlp-2/igt@kms_flip@flip-vs-rmfb-interruptible.html

  * igt@kms_flip@flip-vs-rmfb-interruptible@b-hdmi-a1:
    - shard-adlp:         [DMESG-WARN][118] ([Intel XE#4543]) -> [PASS][119] +1 other test pass
   [118]: https://intel-gfx-ci.01.org/tree/intel-xe/xe-3381-20adfb60af27bc0e490b2d20609c3158ae2fbd26/shard-adlp-4/igt@kms_flip@flip-vs-rmfb-interruptible@b-hdmi-a1.html
   [119]: https://intel-gfx-ci.01.org/tree/intel-xe/xe-pw-149756v4/shard-adlp-2/igt@kms_flip@flip-vs-rmfb-interruptible@b-hdmi-a1.html

  * igt@kms_flip_scaled_crc@flip-32bpp-4tile-to-64bpp-4tile-downscaling:
    - shard-dg2-set2:     [SKIP][120] ([Intel XE#4208]) -> [PASS][121] +136 other tests pass
   [120]: https://intel-gfx-ci.01.org/tree/intel-xe/xe-3381-20adfb60af27bc0e490b2d20609c3158ae2fbd26/shard-dg2-464/igt@kms_flip_scaled_crc@flip-32bpp-4tile-to-64bpp-4tile-downscaling.html
   [121]: https://intel-gfx-ci.01.org/tree/intel-xe/xe-pw-149756v4/shard-dg2-466/igt@kms_flip_scaled_crc@flip-32bpp-4tile-to-64bpp-4tile-downscaling.html

  * igt@kms_hdr@static-toggle-dpms:
    - shard-bmg:          [SKIP][122] ([Intel XE#1503]) -> [PASS][123]
   [122]: https://intel-gfx-ci.01.org/tree/intel-xe/xe-3381-20adfb60af27bc0e490b2d20609c3158ae2fbd26/shard-bmg-6/igt@kms_hdr@static-toggle-dpms.html
   [123]: https://intel-gfx-ci.01.org/tree/intel-xe/xe-pw-149756v4/shard-bmg-3/igt@kms_hdr@static-toggle-dpms.html

  * igt@kms_vrr@negative-basic:
    - shard-bmg:          [SKIP][124] ([Intel XE#1499]) -> [PASS][125]
   [124]: https://intel-gfx-ci.01.org/tree/intel-xe/xe-3381-20adfb60af27bc0e490b2d20609c3158ae2fbd26/shard-bmg-6/igt@kms_vrr@negative-basic.html
   [125]: https://intel-gfx-ci.01.org/tree/intel-xe/xe-pw-149756v4/shard-bmg-3/igt@kms_vrr@negative-basic.html

  * igt@xe_exec_basic@multigpu-no-exec-basic-defer-bind:
    - shard-dg2-set2:     [SKIP][126] ([Intel XE#1392]) -> [PASS][127] +3 other tests pass
   [126]: https://intel-gfx-ci.01.org/tree/intel-xe/xe-3381-20adfb60af27bc0e490b2d20609c3158ae2fbd26/shard-dg2-432/igt@xe_exec_basic@multigpu-no-exec-basic-defer-bind.html
   [127]: https://intel-gfx-ci.01.org/tree/intel-xe/xe-pw-149756v4/shard-dg2-463/igt@xe_exec_basic@multigpu-no-exec-basic-defer-bind.html

  * igt@xe_pm@s2idle-basic-exec:
    - shard-adlp:         [DMESG-WARN][128] ([Intel XE#2953] / [Intel XE#4173]) -> [PASS][129] +7 other tests pass
   [128]: https://intel-gfx-ci.01.org/tree/intel-xe/xe-3381-20adfb60af27bc0e490b2d20609c3158ae2fbd26/shard-adlp-3/igt@xe_pm@s2idle-basic-exec.html
   [129]: https://intel-gfx-ci.01.org/tree/intel-xe/xe-pw-149756v4/shard-adlp-2/igt@xe_pm@s2idle-basic-exec.html

#### Warnings ####

  * igt@kms_addfb_basic@addfb25-y-tiled-small-legacy:
    - shard-dg2-set2:     [SKIP][130] ([Intel XE#623]) -> [SKIP][131] ([Intel XE#4208] / [i915#2575])
   [130]: https://intel-gfx-ci.01.org/tree/intel-xe/xe-3381-20adfb60af27bc0e490b2d20609c3158ae2fbd26/shard-dg2-466/igt@kms_addfb_basic@addfb25-y-tiled-small-legacy.html
   [131]: https://intel-gfx-ci.01.org/tree/intel-xe/xe-pw-149756v4/shard-dg2-464/igt@kms_addfb_basic@addfb25-y-tiled-small-legacy.html

  * igt@kms_big_fb@4-tiled-16bpp-rotate-270:
    - shard-dg2-set2:     [SKIP][132] ([Intel XE#2351] / [Intel XE#4208]) -> [SKIP][133] ([Intel XE#316])
   [132]: https://intel-gfx-ci.01.org/tree/intel-xe/xe-3381-20adfb60af27bc0e490b2d20609c3158ae2fbd26/shard-dg2-464/igt@kms_big_fb@4-tiled-16bpp-rotate-270.html
   [133]: https://intel-gfx-ci.01.org/tree/intel-xe/xe-pw-149756v4/shard-dg2-466/igt@kms_big_fb@4-tiled-16bpp-rotate-270.html

  * igt@kms_big_fb@4-tiled-8bpp-rotate-270:
    - shard-dg2-set2:     [SKIP][134] ([Intel XE#4208]) -> [SKIP][135] ([Intel XE#316]) +3 other tests skip
   [134]: https://intel-gfx-ci.01.org/tree/intel-xe/xe-3381-20adfb60af27bc0e490b2d20609c3158ae2fbd26/shard-dg2-464/igt@kms_big_fb@4-tiled-8bpp-rotate-270.html
   [135]: https://intel-gfx-ci.01.org/tree/intel-xe/xe-pw-149756v4/shard-dg2-436/igt@kms_big_fb@4-tiled-8bpp-rotate-270.html

  * igt@kms_big_fb@linear-8bpp-rotate-270:
    - shard-dg2-set2:     [SKIP][136] ([Intel XE#316]) -> [SKIP][137] ([Intel XE#4208])
   [136]: https://intel-gfx-ci.01.org/tree/intel-xe/xe-3381-20adfb60af27bc0e490b2d20609c3158ae2fbd26/shard-dg2-432/igt@kms_big_fb@linear-8bpp-rotate-270.html
   [137]: https://intel-gfx-ci.01.org/tree/intel-xe/xe-pw-149756v4/shard-dg2-464/igt@kms_big_fb@linear-8bpp-rotate-270.html

  * igt@kms_big_fb@y-tiled-64bpp-rotate-270:
    - shard-dg2-set2:     [SKIP][138] ([Intel XE#2351] / [Intel XE#4208]) -> [SKIP][139] ([Intel XE#1124]) +2 other tests skip
   [138]: https://intel-gfx-ci.01.org/tree/intel-xe/xe-3381-20adfb60af27bc0e490b2d20609c3158ae2fbd26/shard-dg2-464/igt@kms_big_fb@y-tiled-64bpp-rotate-270.html
   [139]: https://intel-gfx-ci.01.org/tree/intel-xe/xe-pw-149756v4/shard-dg2-466/igt@kms_big_fb@y-tiled-64bpp-rotate-270.html

  * igt@kms_big_fb@y-tiled-addfb:
    - shard-dg2-set2:     [SKIP][140] ([Intel XE#619]) -> [SKIP][141] ([Intel XE#2351] / [Intel XE#4208])
   [140]: https://intel-gfx-ci.01.org/tree/intel-xe/xe-3381-20adfb60af27bc0e490b2d20609c3158ae2fbd26/shard-dg2-432/igt@kms_big_fb@y-tiled-addfb.html
   [141]: https://intel-gfx-ci.01.org/tree/intel-xe/xe-pw-149756v4/shard-dg2-464/igt@kms_big_fb@y-tiled-addfb.html

  * igt@kms_big_fb@y-tiled-addfb-size-overflow:
    - shard-dg2-set2:     [SKIP][142] ([Intel XE#610]) -> [SKIP][143] ([Intel XE#4208])
   [142]: https://intel-gfx-ci.01.org/tree/intel-xe/xe-3381-20adfb60af27bc0e490b2d20609c3158ae2fbd26/shard-dg2-466/igt@kms_big_fb@y-tiled-addfb-size-overflow.html
   [143]: https://intel-gfx-ci.01.org/tree/intel-xe/xe-pw-149756v4/shard-dg2-464/igt@kms_big_fb@y-tiled-addfb-size-overflow.html

  * igt@kms_big_fb@y-tiled-max-hw-stride-32bpp-rotate-0-hflip:
    - shard-dg2-set2:     [SKIP][144] ([Intel XE#1124]) -> [SKIP][145] ([Intel XE#2351] / [Intel XE#4208]) +1 other test skip
   [144]: https://intel-gfx-ci.01.org/tree/intel-xe/xe-3381-20adfb60af27bc0e490b2d20609c3158ae2fbd26/shard-dg2-466/igt@kms_big_fb@y-tiled-max-hw-stride-32bpp-rotate-0-hflip.html
   [145]: https://intel-gfx-ci.01.org/tree/intel-xe/xe-pw-149756v4/shard-dg2-464/igt@kms_big_fb@y-tiled-max-hw-stride-32bpp-rotate-0-hflip.html

  * igt@kms_big_fb@yf-tiled-16bpp-rotate-0:
    - shard-dg2-set2:     [SKIP][146] ([Intel XE#4208]) -> [SKIP][147] ([Intel XE#1124]) +6 other tests skip
   [146]: https://intel-gfx-ci.01.org/tree/intel-xe/xe-3381-20adfb60af27bc0e490b2d20609c3158ae2fbd26/shard-dg2-464/igt@kms_big_fb@yf-tiled-16bpp-rotate-0.html
   [147]: https://intel-gfx-ci.01.org/tree/intel-xe/xe-pw-149756v4/shard-dg2-432/igt@kms_big_fb@yf-tiled-16bpp-rotate-0.html

  * igt@kms_big_fb@yf-tiled-32bpp-rotate-180:
    - shard-dg2-set2:     [SKIP][148] ([Intel XE#1124]) -> [SKIP][149] ([Intel XE#4208]) +2 other tests skip
   [148]: https://intel-gfx-ci.01.org/tree/intel-xe/xe-3381-20adfb60af27bc0e490b2d20609c3158ae2fbd26/shard-dg2-466/igt@kms_big_fb@yf-tiled-32bpp-rotate-180.html
   [149]: https://intel-gfx-ci.01.org/tree/intel-xe/xe-pw-149756v4/shard-dg2-464/igt@kms_big_fb@yf-tiled-32bpp-rotate-180.html

  * igt@kms_big_fb@yf-tiled-addfb-size-offset-overflow:
    - shard-dg2-set2:     [SKIP][150] ([Intel XE#607]) -> [SKIP][151] ([Intel XE#4208])
   [150]: https://intel-gfx-ci.01.org/tree/intel-xe/xe-3381-20adfb60af27bc0e490b2d20609c3158ae2fbd26/shard-dg2-433/igt@kms_big_fb@yf-tiled-addfb-size-offset-overflow.html
   [151]: https://intel-gfx-ci.01.org/tree/intel-xe/xe-pw-149756v4/shard-dg2-464/igt@kms_big_fb@yf-tiled-addfb-size-offset-overflow.html

  * igt@kms_bw@connected-linear-tiling-3-displays-3840x2160p:
    - shard-dg2-set2:     [SKIP][152] ([Intel XE#4208] / [i915#2575]) -> [SKIP][153] ([Intel XE#2191])
   [152]: https://intel-gfx-ci.01.org/tree/intel-xe/xe-3381-20adfb60af27bc0e490b2d20609c3158ae2fbd26/shard-dg2-464/igt@kms_bw@connected-linear-tiling-3-displays-3840x2160p.html
   [153]: https://intel-gfx-ci.01.org/tree/intel-xe/xe-pw-149756v4/shard-dg2-466/igt@kms_bw@connected-linear-tiling-3-displays-3840x2160p.html

  * igt@kms_bw@connected-linear-tiling-4-displays-3840x2160p:
    - shard-dg2-set2:     [SKIP][154] ([Intel XE#2191]) -> [SKIP][155] ([Intel XE#4208] / [i915#2575])
   [154]: https://intel-gfx-ci.01.org/tree/intel-xe/xe-3381-20adfb60af27bc0e490b2d20609c3158ae2fbd26/shard-dg2-432/igt@kms_bw@connected-linear-tiling-4-displays-3840x2160p.html
   [155]: https://intel-gfx-ci.01.org/tree/intel-xe/xe-pw-149756v4/shard-dg2-464/igt@kms_bw@connected-linear-tiling-4-displays-3840x2160p.html

  * igt@kms_bw@linear-tiling-1-displays-2560x1440p:
    - shard-dg2-set2:     [SKIP][156] ([Intel XE#4208] / [i915#2575]) -> [SKIP][157] ([Intel XE#367])
   [156]: https://intel-gfx-ci.01.org/tree/intel-xe/xe-3381-20adfb60af27bc0e490b2d20609c3158ae2fbd26/shard-dg2-464/igt@kms_bw@linear-tiling-1-displays-2560x1440p.html
   [157]: https://intel-gfx-ci.01.org/tree/intel-xe/xe-pw-149756v4/shard-dg2-432/igt@kms_bw@linear-tiling-1-displays-2560x1440p.html

  * igt@kms_bw@linear-tiling-3-displays-2560x1440p:
    - shard-dg2-set2:     [SKIP][158] ([Intel XE#367]) -> [SKIP][159] ([Intel XE#4208] / [i915#2575]) +1 other test skip
   [158]: https://intel-gfx-ci.01.org/tree/intel-xe/xe-3381-20adfb60af27bc0e490b2d20609c3158ae2fbd26/shard-dg2-466/igt@kms_bw@linear-tiling-3-displays-2560x1440p.html
   [159]: https://intel-gfx-ci.01.org/tree/intel-xe/xe-pw-149756v4/shard-dg2-464/igt@kms_bw@linear-tiling-3-displays-2560x1440p.html

  * igt@kms_ccs@bad-pixel-format-4-tiled-mtl-rc-ccs-cc:
    - shard-dg2-set2:     [SKIP][160] ([Intel XE#2351] / [Intel XE#4208]) -> [SKIP][161] ([Intel XE#455] / [Intel XE#787]) +2 other tests skip
   [160]: https://intel-gfx-ci.01.org/tree/intel-xe/xe-3381-20adfb60af27bc0e490b2d20609c3158ae2fbd26/shard-dg2-464/igt@kms_ccs@bad-pixel-format-4-tiled-mtl-rc-ccs-cc.html
   [161]: https://intel-gfx-ci.01.org/tree/intel-xe/xe-pw-149756v4/shard-dg2-432/igt@kms_ccs@bad-pixel-format-4-tiled-mtl-rc-ccs-cc.html

  * igt@kms_ccs@bad-pixel-format-y-tiled-gen12-rc-ccs:
    - shard-dg2-set2:     [SKIP][162] ([Intel XE#455] / [Intel XE#787]) -> [SKIP][163] ([Intel XE#2351] / [Intel XE#4208]) +1 other test skip
   [162]: https://intel-gfx-ci.01.org/tree/intel-xe/xe-3381-20adfb60af27bc0e490b2d20609c3158ae2fbd26/shard-dg2-432/igt@kms_ccs@bad-pixel-format-y-tiled-gen12-rc-ccs.html
   [163]: https://intel-gfx-ci.01.org/tree/intel-xe/xe-pw-149756v4/shard-dg2-464/igt@kms_ccs@bad-pixel-format-y-tiled-gen12-rc-ccs.html

  * igt@kms_ccs@ccs-on-another-bo-y-tiled-gen12-rc-ccs:
    - shard-dg2-set2:     [SKIP][164] ([Intel XE#4208]) -> [SKIP][165] ([Intel XE#455] / [Intel XE#787]) +6 other tests skip
   [164]: https://intel-gfx-ci.01.org/tree/intel-xe/xe-3381-20adfb60af27bc0e490b2d20609c3158ae2fbd26/shard-dg2-464/igt@kms_ccs@ccs-on-another-bo-y-tiled-gen12-rc-ccs.html
   [165]: https://intel-gfx-ci.01.org/tree/intel-xe/xe-pw-149756v4/shard-dg2-466/igt@kms_ccs@ccs-on-another-bo-y-tiled-gen12-rc-ccs.html

  * igt@kms_ccs@crc-primary-rotation-180-y-tiled-gen12-mc-ccs:
    - shard-dg2-set2:     [SKIP][166] ([Intel XE#455] / [Intel XE#787]) -> [SKIP][167] ([Intel XE#4208]) +7 other tests skip
   [166]: https://intel-gfx-ci.01.org/tree/intel-xe/xe-3381-20adfb60af27bc0e490b2d20609c3158ae2fbd26/shard-dg2-466/igt@kms_ccs@crc-primary-rotation-180-y-tiled-gen12-mc-ccs.html
   [167]: https://intel-gfx-ci.01.org/tree/intel-xe/xe-pw-149756v4/shard-dg2-464/igt@kms_ccs@crc-primary-rotation-180-y-tiled-gen12-mc-ccs.html

  * igt@kms_ccs@random-ccs-data-4-tiled-dg2-rc-ccs:
    - shard-dg2-set2:     [INCOMPLETE][168] ([Intel XE#1727] / [Intel XE#3113] / [Intel XE#3124] / [Intel XE#4345]) -> [INCOMPLETE][169] ([Intel XE#1727] / [Intel XE#3113] / [Intel XE#4212] / [Intel XE#4345] / [Intel XE#4522])
   [168]: https://intel-gfx-ci.01.org/tree/intel-xe/xe-3381-20adfb60af27bc0e490b2d20609c3158ae2fbd26/shard-dg2-466/igt@kms_ccs@random-ccs-data-4-tiled-dg2-rc-ccs.html
   [169]: https://intel-gfx-ci.01.org/tree/intel-xe/xe-pw-149756v4/shard-dg2-435/igt@kms_ccs@random-ccs-data-4-tiled-dg2-rc-ccs.html

  * igt@kms_ccs@random-ccs-data-4-tiled-dg2-rc-ccs-cc:
    - shard-dg2-set2:     [INCOMPLETE][170] ([Intel XE#1727] / [Intel XE#3113] / [Intel XE#3124]) -> [SKIP][171] ([Intel XE#4208])
   [170]: https://intel-gfx-ci.01.org/tree/intel-xe/xe-3381-20adfb60af27bc0e490b2d20609c3158ae2fbd26/shard-dg2-433/igt@kms_ccs@random-ccs-data-4-tiled-dg2-rc-ccs-cc.html
   [171]: https://intel-gfx-ci.01.org/tree/intel-xe/xe-pw-149756v4/shard-dg2-464/igt@kms_ccs@random-ccs-data-4-tiled-dg2-rc-ccs-cc.html

  * igt@kms_ccs@random-ccs-data-4-tiled-lnl-ccs:
    - shard-dg2-set2:     [SKIP][172] ([Intel XE#2907]) -> [SKIP][173] ([Intel XE#4208]) +1 other test skip
   [172]: https://intel-gfx-ci.01.org/tree/intel-xe/xe-3381-20adfb60af27bc0e490b2d20609c3158ae2fbd26/shard-dg2-432/igt@kms_ccs@random-ccs-data-4-tiled-lnl-ccs.html
   [173]: https://intel-gfx-ci.01.org/tree/intel-xe/xe-pw-149756v4/shard-dg2-464/igt@kms_ccs@random-ccs-data-4-tiled-lnl-ccs.html

  * igt@kms_chamelium_color@ctm-limited-range:
    - shard-dg2-set2:     [SKIP][174] ([Intel XE#4208] / [i915#2575]) -> [SKIP][175] ([Intel XE#306]) +1 other test skip
   [174]: https://intel-gfx-ci.01.org/tree/intel-xe/xe-3381-20adfb60af27bc0e490b2d20609c3158ae2fbd26/shard-dg2-464/igt@kms_chamelium_color@ctm-limited-range.html
   [175]: https://intel-gfx-ci.01.org/tree/intel-xe/xe-pw-149756v4/shard-dg2-432/igt@kms_chamelium_color@ctm-limited-range.html

  * igt@kms_chamelium_hpd@hdmi-hpd:
    - shard-dg2-set2:     [SKIP][176] ([Intel XE#373]) -> [SKIP][177] ([Intel XE#4208] / [i915#2575]) +8 other tests skip
   [176]: https://intel-gfx-ci.01.org/tree/intel-xe/xe-3381-20adfb60af27bc0e490b2d20609c3158ae2fbd26/shard-dg2-433/igt@kms_chamelium_hpd@hdmi-hpd.html
   [177]: https://intel-gfx-ci.01.org/tree/intel-xe/xe-pw-149756v4/shard-dg2-464/igt@kms_chamelium_hpd@hdmi-hpd.html

  * igt@kms_chamelium_hpd@vga-hpd:
    - shard-dg2-set2:     [SKIP][178] ([Intel XE#4208] / [i915#2575]) -> [SKIP][179] ([Intel XE#373]) +7 other tests skip
   [178]: https://intel-gfx-ci.01.org/tree/intel-xe/xe-3381-20adfb60af27bc0e490b2d20609c3158ae2fbd26/shard-dg2-464/igt@kms_chamelium_hpd@vga-hpd.html
   [179]: https://intel-gfx-ci.01.org/tree/intel-xe/xe-pw-149756v4/shard-dg2-436/igt@kms_chamelium_hpd@vga-hpd.html

  * igt@kms_content_protection@atomic:
    - shard-dg2-set2:     [FAIL][180] ([Intel XE#1178]) -> [SKIP][181] ([Intel XE#4208] / [i915#2575])
   [180]: https://intel-gfx-ci.01.org/tree/intel-xe/xe-3381-20adfb60af27bc0e490b2d20609c3158ae2fbd26/shard-dg2-432/igt@kms_content_protection@atomic.html
   [181]: https://intel-gfx-ci.01.org/tree/intel-xe/xe-pw-149756v4/shard-dg2-464/igt@kms_content_protection@atomic.html

  * igt@kms_content_protection@dp-mst-lic-type-0:
    - shard-dg2-set2:     [SKIP][182] ([Intel XE#307]) -> [SKIP][183] ([Intel XE#4208] / [i915#2575])
   [182]: https://intel-gfx-ci.01.org/tree/intel-xe/xe-3381-20adfb60af27bc0e490b2d20609c3158ae2fbd26/shard-dg2-433/igt@kms_content_protection@dp-mst-lic-type-0.html
   [183]: https://intel-gfx-ci.01.org/tree/intel-xe/xe-pw-149756v4/shard-dg2-464/igt@kms_content_protection@dp-mst-lic-type-0.html

  * igt@kms_content_protection@dp-mst-lic-type-1:
    - shard-dg2-set2:     [SKIP][184] ([Intel XE#4208] / [i915#2575]) -> [SKIP][185] ([Intel XE#307])
   [184]: https://intel-gfx-ci.01.org/tree/intel-xe/xe-3381-20adfb60af27bc0e490b2d20609c3158ae2fbd26/shard-dg2-464/igt@kms_content_protection@dp-mst-lic-type-1.html
   [185]: https://intel-gfx-ci.01.org/tree/intel-xe/xe-pw-149756v4/shard-dg2-466/igt@kms_content_protection@dp-mst-lic-type-1.html

  * igt@kms_content_protection@lic-type-0:
    - shard-bmg:          [FAIL][186] ([Intel XE#1178]) -> [SKIP][187] ([Intel XE#2341]) +1 other test skip
   [186]: https://intel-gfx-ci.01.org/tree/intel-xe/xe-3381-20adfb60af27bc0e490b2d20609c3158ae2fbd26/shard-bmg-2/igt@kms_content_protection@lic-type-0.html
   [187]: https://intel-gfx-ci.01.org/tree/intel-xe/xe-pw-149756v4/shard-bmg-5/igt@kms_content_protection@lic-type-0.html

  * igt@kms_content_protection@uevent:
    - shard-bmg:          [SKIP][188] ([Intel XE#2341]) -> [FAIL][189] ([Intel XE#1188])
   [188]: https://intel-gfx-ci.01.org/tree/intel-xe/xe-3381-20adfb60af27bc0e490b2d20609c3158ae2fbd26/shard-bmg-5/igt@kms_content_protection@uevent.html
   [189]: https://intel-gfx-ci.01.org/tree/intel-xe/xe-pw-149756v4/shard-bmg-7/igt@kms_content_protection@uevent.html

  * igt@kms_cursor_crc@cursor-onscreen-512x512:
    - shard-dg2-set2:     [SKIP][190] ([Intel XE#4208] / [i915#2575]) -> [SKIP][191] ([Intel XE#308]) +2 other tests skip
   [190]: https://intel-gfx-ci.01.org/tree/intel-xe/xe-3381-20adfb60af27bc0e490b2d20609c3158ae2fbd26/shard-dg2-464/igt@kms_cursor_crc@cursor-onscreen-512x512.html
   [191]: https://intel-gfx-ci.01.org/tree/intel-xe/xe-pw-149756v4/shard-dg2-436/igt@kms_cursor_crc@cursor-onscreen-512x512.html

  * igt@kms_dsc@dsc-with-output-formats-with-bpc:
    - shard-dg2-set2:     [SKIP][192] ([Intel XE#4208]) -> [SKIP][193] ([Intel XE#455]) +2 other tests skip
   [192]: https://intel-gfx-ci.01.org/tree/intel-xe/xe-3381-20adfb60af27bc0e490b2d20609c3158ae2fbd26/shard-dg2-464/igt@kms_dsc@dsc-with-output-formats-with-bpc.html
   [193]: https://intel-gfx-ci.01.org/tree/intel-xe/xe-pw-149756v4/shard-dg2-466/igt@kms_dsc@dsc-with-output-formats-with-bpc.html

  * igt@kms_fbc_dirty_rect@fbc-dirty-rectangle-different-formats:
    - shard-dg2-set2:     [SKIP][194] ([Intel XE#4208]) -> [SKIP][195] ([Intel XE#4422])
   [194]: https://intel-gfx-ci.01.org/tree/intel-xe/xe-3381-20adfb60af27bc0e490b2d20609c3158ae2fbd26/shard-dg2-464/igt@kms_fbc_dirty_rect@fbc-dirty-rectangle-different-formats.html
   [195]: https://intel-gfx-ci.01.org/tree/intel-xe/xe-pw-149756v4/shard-dg2-436/igt@kms_fbc_dirty_rect@fbc-dirty-rectangle-different-formats.html

  * igt@kms_feature_discovery@psr1:
    - shard-dg2-set2:     [SKIP][196] ([Intel XE#4208] / [i915#2575]) -> [SKIP][197] ([Intel XE#1135])
   [196]: https://intel-gfx-ci.01.org/tree/intel-xe/xe-3381-20adfb60af27bc0e490b2d20609c3158ae2fbd26/shard-dg2-464/igt@kms_feature_discovery@psr1.html
   [197]: https://intel-gfx-ci.01.org/tree/intel-xe/xe-pw-149756v4/shard-dg2-436/igt@kms_feature_discovery@psr1.html

  * igt@kms_flip_scaled_crc@flip-32bpp-yftile-to-32bpp-yftileccs-downscaling:
    - shard-dg2-set2:     [SKIP][198] ([Intel XE#2351] / [Intel XE#4208]) -> [SKIP][199] ([Intel XE#455]) +1 other test skip
   [198]: https://intel-gfx-ci.01.org/tree/intel-xe/xe-3381-20adfb60af27bc0e490b2d20609c3158ae2fbd26/shard-dg2-464/igt@kms_flip_scaled_crc@flip-32bpp-yftile-to-32bpp-yftileccs-downscaling.html
   [199]: https://intel-gfx-ci.01.org/tree/intel-xe/xe-pw-149756v4/shard-dg2-436/igt@kms_flip_scaled_crc@flip-32bpp-yftile-to-32bpp-yftileccs-downscaling.html

  * igt@kms_flip_scaled_crc@flip-32bpp-ytile-to-32bpp-ytileccs-downscaling:
    - shard-dg2-set2:     [SKIP][200] ([Intel XE#455]) -> [SKIP][201] ([Intel XE#2351] / [Intel XE#4208]) +2 other tests skip
   [200]: https://intel-gfx-ci.01.org/tree/intel-xe/xe-3381-20adfb60af27bc0e490b2d20609c3158ae2fbd26/shard-dg2-433/igt@kms_flip_scaled_crc@flip-32bpp-ytile-to-32bpp-ytileccs-downscaling.html
   [201]: https://intel-gfx-ci.01.org/tree/intel-xe/xe-pw-149756v4/shard-dg2-464/igt@kms_flip_scaled_crc@flip-32bpp-ytile-to-32bpp-ytileccs-downscaling.html

  * igt@kms_frontbuffer_tracking@drrs-2p-pri-indfb-multidraw:
    - shard-bmg:          [SKIP][202] ([Intel XE#2312]) -> [SKIP][203] ([Intel XE#2311]) +5 other tests skip
   [202]: https://intel-gfx-ci.01.org/tree/intel-xe/xe-3381-20adfb60af27bc0e490b2d20609c3158ae2fbd26/shard-bmg-6/igt@kms_frontbuffer_tracking@drrs-2p-pri-indfb-multidraw.html
   [203]: https://intel-gfx-ci.01.org/tree/intel-xe/xe-pw-149756v4/shard-bmg-3/igt@kms_frontbuffer_tracking@drrs-2p-pri-indfb-multidraw.html

  * igt@kms_frontbuffer_tracking@drrs-suspend:
    - shard-dg2-set2:     [SKIP][204] ([Intel XE#2351] / [Intel XE#4208]) -> [SKIP][205] ([Intel XE#651]) +10 other tests skip
   [204]: https://intel-gfx-ci.01.org/tree/intel-xe/xe-3381-20adfb60af27bc0e490b2d20609c3158ae2fbd26/shard-dg2-464/igt@kms_frontbuffer_tracking@drrs-suspend.html
   [205]: https://intel-gfx-ci.01.org/tree/intel-xe/xe-pw-149756v4/shard-dg2-466/igt@kms_frontbuffer_tracking@drrs-suspend.html

  * igt@kms_frontbuffer_tracking@fbc-2p-primscrn-pri-indfb-draw-mmap-wc:
    - shard-bmg:          [SKIP][206] ([Intel XE#2312]) -> [SKIP][207] ([Intel XE#5390]) +3 other tests skip
   [206]: https://intel-gfx-ci.01.org/tree/intel-xe/xe-3381-20adfb60af27bc0e490b2d20609c3158ae2fbd26/shard-bmg-5/igt@kms_frontbuffer_tracking@fbc-2p-primscrn-pri-indfb-draw-mmap-wc.html
   [207]: https://intel-gfx-ci.01.org/tree/intel-xe/xe-pw-149756v4/shard-bmg-7/igt@kms_frontbuffer_tracking@fbc-2p-primscrn-pri-indfb-draw-mmap-wc.html

  * igt@kms_frontbuffer_tracking@fbc-2p-scndscrn-shrfb-pgflip-blt:
    - shard-bmg:          [SKIP][208] ([Intel XE#5390]) -> [SKIP][209] ([Intel XE#2312]) +5 other tests skip
   [208]: https://intel-gfx-ci.01.org/tree/intel-xe/xe-3381-20adfb60af27bc0e490b2d20609c3158ae2fbd26/shard-bmg-2/igt@kms_frontbuffer_tracking@fbc-2p-scndscrn-shrfb-pgflip-blt.html
   [209]: https://intel-gfx-ci.01.org/tree/intel-xe/xe-pw-149756v4/shard-bmg-5/igt@kms_frontbuffer_tracking@fbc-2p-scndscrn-shrfb-pgflip-blt.html

  * igt@kms_frontbuffer_tracking@fbc-tiling-y:
    - shard-dg2-set2:     [SKIP][210] ([Intel XE#2351] / [Intel XE#4208]) -> [SKIP][211] ([Intel XE#658])
   [210]: https://intel-gfx-ci.01.org/tree/intel-xe/xe-3381-20adfb60af27bc0e490b2d20609c3158ae2fbd26/shard-dg2-464/igt@kms_frontbuffer_tracking@fbc-tiling-y.html
   [211]: https://intel-gfx-ci.01.org/tree/intel-xe/xe-pw-149756v4/shard-dg2-436/igt@kms_frontbuffer_tracking@fbc-tiling-y.html

  * igt@kms_frontbuffer_tracking@fbcdrrs-1p-primscrn-spr-indfb-onoff:
    - shard-dg2-set2:     [SKIP][212] ([Intel XE#651]) -> [SKIP][213] ([Intel XE#2351] / [Intel XE#4208]) +8 other tests skip
   [212]: https://intel-gfx-ci.01.org/tree/intel-xe/xe-3381-20adfb60af27bc0e490b2d20609c3158ae2fbd26/shard-dg2-433/igt@kms_frontbuffer_tracking@fbcdrrs-1p-primscrn-spr-indfb-onoff.html
   [213]: https://intel-gfx-ci.01.org/tree/intel-xe/xe-pw-149756v4/shard-dg2-464/igt@kms_frontbuffer_tracking@fbcdrrs-1p-primscrn-spr-indfb-onoff.html

  * igt@kms_frontbuffer_tracking@fbcdrrs-2p-scndscrn-cur-indfb-draw-mmap-wc:
    - shard-bmg:          [SKIP][214] ([Intel XE#2311]) -> [SKIP][215] ([Intel XE#2312]) +5 other tests skip
   [214]: https://intel-gfx-ci.01.org/tree/intel-xe/xe-3381-20adfb60af27bc0e490b2d20609c3158ae2fbd26/shard-bmg-2/igt@kms_frontbuffer_tracking@fbcdrrs-2p-scndscrn-cur-indfb-draw-mmap-wc.html
   [215]: https://intel-gfx-ci.01.org/tree/intel-xe/xe-pw-149756v4/shard-bmg-5/igt@kms_frontbuffer_tracking@fbcdrrs-2p-scndscrn-cur-indfb-draw-mmap-wc.html

  * igt@kms_frontbuffer_tracking@fbcdrrs-2p-scndscrn-pri-shrfb-draw-mmap-wc:
    - shard-dg2-set2:     [SKIP][216] ([Intel XE#4208]) -> [SKIP][217] ([Intel XE#651]) +10 other tests skip
   [216]: https://intel-gfx-ci.01.org/tree/intel-xe/xe-3381-20adfb60af27bc0e490b2d20609c3158ae2fbd26/shard-dg2-464/igt@kms_frontbuffer_tracking@fbcdrrs-2p-scndscrn-pri-shrfb-draw-mmap-wc.html
   [217]: https://intel-gfx-ci.01.org/tree/intel-xe/xe-pw-149756v4/shard-dg2-436/igt@kms_frontbuffer_tracking@fbcdrrs-2p-scndscrn-pri-shrfb-draw-mmap-wc.html

  * igt@kms_frontbuffer_tracking@fbcdrrs-indfb-scaledprimary:
    - shard-dg2-set2:     [SKIP][218] ([Intel XE#651]) -> [SKIP][219] ([Intel XE#4208]) +12 other tests skip
   [218]: https://intel-gfx-ci.01.org/tree/intel-xe/xe-3381-20adfb60af27bc0e490b2d20609c3158ae2fbd26/shard-dg2-432/igt@kms_frontbuffer_tracking@fbcdrrs-indfb-scaledprimary.html
   [219]: https://intel-gfx-ci.01.org/tree/intel-xe/xe-pw-149756v4/shard-dg2-464/igt@kms_frontbuffer_tracking@fbcdrrs-indfb-scaledprimary.html

  * igt@kms_frontbuffer_tracking@fbcpsr-2p-primscrn-spr-indfb-onoff:
    - shard-bmg:          [SKIP][220] ([Intel XE#2313]) -> [SKIP][221] ([Intel XE#2312]) +9 other tests skip
   [220]: https://intel-gfx-ci.01.org/tree/intel-xe/xe-3381-20adfb60af27bc0e490b2d20609c3158ae2fbd26/shard-bmg-2/igt@kms_frontbuffer_tracking@fbcpsr-2p-primscrn-spr-indfb-onoff.html
   [221]: https://intel-gfx-ci.01.org/tree/intel-xe/xe-pw-149756v4/shard-bmg-5/igt@kms_frontbuffer_tracking@fbcpsr-2p-primscrn-spr-indfb-onoff.html

  * igt@kms_frontbuffer_tracking@fbcpsr-2p-scndscrn-cur-indfb-draw-blt:
    - shard-bmg:          [SKIP][222] ([Intel XE#2312]) -> [SKIP][223] ([Intel XE#2313]) +7 other tests skip
   [222]: https://intel-gfx-ci.01.org/tree/intel-xe/xe-3381-20adfb60af27bc0e490b2d20609c3158ae2fbd26/shard-bmg-6/igt@kms_frontbuffer_tracking@fbcpsr-2p-scndscrn-cur-indfb-draw-blt.html
   [223]: https://intel-gfx-ci.01.org/tree/intel-xe/xe-pw-149756v4/shard-bmg-3/igt@kms_frontbuffer_tracking@fbcpsr-2p-scndscrn-cur-indfb-draw-blt.html

  * igt@kms_frontbuffer_tracking@plane-fbc-rte:
    - shard-dg2-set2:     [SKIP][224] ([Intel XE#4208]) -> [SKIP][225] ([Intel XE#1158])
   [224]: https://intel-gfx-ci.01.org/tree/intel-xe/xe-3381-20adfb60af27bc0e490b2d20609c3158ae2fbd26/shard-dg2-464/igt@kms_frontbuffer_tracking@plane-fbc-rte.html
   [225]: https://intel-gfx-ci.01.org/tree/intel-xe/xe-pw-149756v4/shard-dg2-432/igt@kms_frontbuffer_tracking@plane-fbc-rte.html

  * igt@kms_frontbuffer_tracking@psr-1p-primscrn-pri-shrfb-draw-blt:
    - shard-dg2-set2:     [SKIP][226] ([Intel XE#4208]) -> [SKIP][227] ([Intel XE#653]) +12 other tests skip
   [226]: https://intel-gfx-ci.01.org/tree/intel-xe/xe-3381-20adfb60af27bc0e490b2d20609c3158ae2fbd26/shard-dg2-464/igt@kms_frontbuffer_tracking@psr-1p-primscrn-pri-shrfb-draw-blt.html
   [227]: https://intel-gfx-ci.01.org/tree/intel-xe/xe-pw-149756v4/shard-dg2-466/igt@kms_frontbuffer_tracking@psr-1p-primscrn-pri-shrfb-draw-blt.html

  * igt@kms_frontbuffer_tracking@psr-2p-primscrn-pri-shrfb-draw-mmap-wc:
    - shard-dg2-set2:     [SKIP][228] ([Intel XE#2351] / [Intel XE#4208]) -> [SKIP][229] ([Intel XE#653]) +7 other tests skip
   [228]: https://intel-gfx-ci.01.org/tree/intel-xe/xe-3381-20adfb60af27bc0e490b2d20609c3158ae2fbd26/shard-dg2-464/igt@kms_frontbuffer_tracking@psr-2p-primscrn-pri-shrfb-draw-mmap-wc.html
   [229]: https://intel-gfx-ci.01.org/tree/intel-xe/xe-pw-149756v4/shard-dg2-436/igt@kms_frontbuffer_tracking@psr-2p-primscrn-pri-shrfb-draw-mmap-wc.html

  * igt@kms_frontbuffer_tracking@psr-2p-primscrn-shrfb-msflip-blt:
    - shard-dg2-set2:     [SKIP][230] ([Intel XE#653]) -> [SKIP][231] ([Intel XE#2351] / [Intel XE#4208]) +5 other tests skip
   [230]: https://intel-gfx-ci.01.org/tree/intel-xe/xe-3381-20adfb60af27bc0e490b2d20609c3158ae2fbd26/shard-dg2-466/igt@kms_frontbuffer_tracking@psr-2p-primscrn-shrfb-msflip-blt.html
   [231]: https://intel-gfx-ci.01.org/tree/intel-xe/xe-pw-149756v4/shard-dg2-464/igt@kms_frontbuffer_tracking@psr-2p-primscrn-shrfb-msflip-blt.html

  * igt@kms_frontbuffer_tracking@psr-slowdraw:
    - shard-dg2-set2:     [SKIP][232] ([Intel XE#653]) -> [SKIP][233] ([Intel XE#4208]) +12 other tests skip
   [232]: https://intel-gfx-ci.01.org/tree/intel-xe/xe-3381-20adfb60af27bc0e490b2d20609c3158ae2fbd26/shard-dg2-466/igt@kms_frontbuffer_tracking@psr-slowdraw.html
   [233]: https://intel-gfx-ci.01.org/tree/intel-xe/xe-pw-149756v4/shard-dg2-464/igt@kms_frontbuffer_tracking@psr-slowdraw.html

  * igt@kms_hdr@brightness-with-hdr:
    - shard-bmg:          [SKIP][234] ([Intel XE#3374] / [Intel XE#3544]) -> [SKIP][235] ([Intel XE#3544])
   [234]: https://intel-gfx-ci.01.org/tree/intel-xe/xe-3381-20adfb60af27bc0e490b2d20609c3158ae2fbd26/shard-bmg-6/igt@kms_hdr@brightness-with-hdr.html
   [235]: https://intel-gfx-ci.01.org/tree/intel-xe/xe-pw-149756v4/shard-bmg-5/igt@kms_hdr@brightness-with-hdr.html

  * igt@kms_hdr@invalid-hdr:
    - shard-dg2-set2:     [SKIP][236] ([Intel XE#455]) -> [SKIP][237] ([Intel XE#4208] / [i915#2575]) +7 other tests skip
   [236]: https://intel-gfx-ci.01.org/tree/intel-xe/xe-3381-20adfb60af27bc0e490b2d20609c3158ae2fbd26/shard-dg2-433/igt@kms_hdr@invalid-hdr.html
   [237]: https://intel-gfx-ci.01.org/tree/intel-xe/xe-pw-149756v4/shard-dg2-464/igt@kms_hdr@invalid-hdr.html

  * igt@kms_joiner@basic-big-joiner:
    - shard-dg2-set2:     [SKIP][238] ([Intel XE#4208]) -> [SKIP][239] ([Intel XE#346])
   [238]: https://intel-gfx-ci.01.org/tree/intel-xe/xe-3381-20adfb60af27bc0e490b2d20609c3158ae2fbd26/shard-dg2-464/igt@kms_joiner@basic-big-joiner.html
   [239]: https://intel-gfx-ci.01.org/tree/intel-xe/xe-pw-149756v4/shard-dg2-432/igt@kms_joiner@basic-big-joiner.html

  * igt@kms_joiner@basic-ultra-joiner:
    - shard-dg2-set2:     [SKIP][240] ([Intel XE#2927]) -> [SKIP][241] ([Intel XE#4208])
   [240]: https://intel-gfx-ci.01.org/tree/intel-xe/xe-3381-20adfb60af27bc0e490b2d20609c3158ae2fbd26/shard-dg2-432/igt@kms_joiner@basic-ultra-joiner.html
   [241]: https://intel-gfx-ci.01.org/tree/intel-xe/xe-pw-149756v4/shard-dg2-464/igt@kms_joiner@basic-ultra-joiner.html

  * igt@kms_joiner@invalid-modeset-force-ultra-joiner:
    - shard-dg2-set2:     [SKIP][242] ([Intel XE#4208]) -> [SKIP][243] ([Intel XE#2925])
   [242]: https://intel-gfx-ci.01.org/tree/intel-xe/xe-3381-20adfb60af27bc0e490b2d20609c3158ae2fbd26/shard-dg2-464/igt@kms_joiner@invalid-modeset-force-ultra-joiner.html
   [243]: https://intel-gfx-ci.01.org/tree/intel-xe/xe-pw-149756v4/shard-dg2-466/igt@kms_joiner@invalid-modeset-force-ultra-joiner.html

  * igt@kms_joiner@switch-modeset-ultra-joiner-big-joiner:
    - shard-dg2-set2:     [SKIP][244] ([Intel XE#2925]) -> [SKIP][245] ([Intel XE#4208])
   [244]: https://intel-gfx-ci.01.org/tree/intel-xe/xe-3381-20adfb60af27bc0e490b2d20609c3158ae2fbd26/shard-dg2-433/igt@kms_joiner@switch-modeset-ultra-joiner-big-joiner.html
   [245]: https://intel-gfx-ci.01.org/tree/intel-xe/xe-pw-149756v4/shard-dg2-464/igt@kms_joiner@switch-modeset-ultra-joiner-big-joiner.html

  * igt@kms_pipe_stress@stress-xrgb8888-ytiled:
    - shard-dg2-set2:     [SKIP][246] ([Intel XE#4208]) -> [SKIP][247] ([Intel XE#4359])
   [246]: https://intel-gfx-ci.01.org/tree/intel-xe/xe-3381-20adfb60af27bc0e490b2d20609c3158ae2fbd26/shard-dg2-464/igt@kms_pipe_stress@stress-xrgb8888-ytiled.html
   [247]: https://intel-gfx-ci.01.org/tree/intel-xe/xe-pw-149756v4/shard-dg2-466/igt@kms_pipe_stress@stress-xrgb8888-ytiled.html

  * igt@kms_plane_cursor@viewport:
    - shard-dg2-set2:     [FAIL][248] ([Intel XE#616]) -> [SKIP][249] ([Intel XE#4208] / [i915#2575])
   [248]: https://intel-gfx-ci.01.org/tree/intel-xe/xe-3381-20adfb60af27bc0e490b2d20609c3158ae2fbd26/shard-dg2-466/igt@kms_plane_cursor@viewport.html
   [249]: https://intel-gfx-ci.01.org/tree/intel-xe/xe-pw-149756v4/shard-dg2-464/igt@kms_plane_cursor@viewport.html

  * igt@kms_plane_multiple@2x-tiling-y:
    - shard-bmg:          [SKIP][250] ([Intel XE#5021]) -> [SKIP][251] ([Intel XE#4596])
   [250]: https://intel-gfx-ci.01.org/tree/intel-xe/xe-3381-20adfb60af27bc0e490b2d20609c3158ae2fbd26/shard-bmg-1/igt@kms_plane_multiple@2x-tiling-y.html
   [251]: https://intel-gfx-ci.01.org/tree/intel-xe/xe-pw-149756v4/shard-bmg-6/igt@kms_plane_multiple@2x-tiling-y.html
    - shard-dg2-set2:     [SKIP][252] ([Intel XE#4208] / [i915#2575]) -> [SKIP][253] ([Intel XE#5021])
   [252]: https://intel-gfx-ci.01.org/tree/intel-xe/xe-3381-20adfb60af27bc0e490b2d20609c3158ae2fbd26/shard-dg2-464/igt@kms_plane_multiple@2x-tiling-y.html
   [253]: https://intel-gfx-ci.01.org/tree/intel-xe/xe-pw-149756v4/shard-dg2-432/igt@kms_plane_multiple@2x-tiling-y.html

  * igt@kms_pm_backlight@fade-with-suspend:
    - shard-dg2-set2:     [SKIP][254] ([Intel XE#870]) -> [SKIP][255] ([Intel XE#4208])
   [254]: https://intel-gfx-ci.01.org/tree/intel-xe/xe-3381-20adfb60af27bc0e490b2d20609c3158ae2fbd26/shard-dg2-433/igt@kms_pm_backlight@fade-with-suspend.html
   [255]: https://intel-gfx-ci.01.org/tree/intel-xe/xe-pw-149756v4/shard-dg2-464/igt@kms_pm_backlight@fade-with-suspend.html

  * igt@kms_pm_dc@dc5-psr:
    - shard-dg2-set2:     [SKIP][256] ([Intel XE#1129]) -> [SKIP][257] ([Intel XE#4208])
   [256]: https://intel-gfx-ci.01.org/tree/intel-xe/xe-3381-20adfb60af27bc0e490b2d20609c3158ae2fbd26/shard-dg2-432/igt@kms_pm_dc@dc5-psr.html
   [257]: https://intel-gfx-ci.01.org/tree/intel-xe/xe-pw-149756v4/shard-dg2-464/igt@kms_pm_dc@dc5-psr.html

  * igt@kms_pm_dc@dc6-dpms:
    - shard-dg2-set2:     [SKIP][258] ([Intel XE#2351] / [Intel XE#4208]) -> [SKIP][259] ([Intel XE#908])
   [258]: https://intel-gfx-ci.01.org/tree/intel-xe/xe-3381-20adfb60af27bc0e490b2d20609c3158ae2fbd26/shard-dg2-464/igt@kms_pm_dc@dc6-dpms.html
   [259]: https://intel-gfx-ci.01.org/tree/intel-xe/xe-pw-149756v4/shard-dg2-432/igt@kms_pm_dc@dc6-dpms.html

  * igt@kms_psr2_sf@fbc-pr-overlay-plane-update-sf-dmg-area:
    - shard-dg2-set2:     [SKIP][260] ([Intel XE#4208]) -> [SKIP][261] ([Intel XE#1489]) +4 other tests skip
   [260]: https://intel-gfx-ci.01.org/tree/intel-xe/xe-3381-20adfb60af27bc0e490b2d20609c3158ae2fbd26/shard-dg2-464/igt@kms_psr2_sf@fbc-pr-overlay-plane-update-sf-dmg-area.html
   [261]: https://intel-gfx-ci.01.org/tree/intel-xe/xe-pw-149756v4/shard-dg2-436/igt@kms_psr2_sf@fbc-pr-overlay-plane-update-sf-dmg-area.html

  * igt@kms_psr2_sf@pr-overlay-plane-move-continuous-exceed-fully-sf:
    - shard-dg2-set2:     [SKIP][262] ([Intel XE#1489]) -> [SKIP][263] ([Intel XE#4208]) +4 other tests skip
   [262]: https://intel-gfx-ci.01.org/tree/intel-xe/xe-3381-20adfb60af27bc0e490b2d20609c3158ae2fbd26/shard-dg2-466/igt@kms_psr2_sf@pr-overlay-plane-move-continuous-exceed-fully-sf.html
   [263]: https://intel-gfx-ci.01.org/tree/intel-xe/xe-pw-149756v4/shard-dg2-464/igt@kms_psr2_sf@pr-overlay-plane-move-continuous-exceed-fully-sf.html

  * igt@kms_psr@fbc-psr-sprite-plane-move:
    - shard-dg2-set2:     [SKIP][264] ([Intel XE#2351] / [Intel XE#4208]) -> [SKIP][265] ([Intel XE#2850] / [Intel XE#929]) +2 other tests skip
   [264]: https://intel-gfx-ci.01.org/tree/intel-xe/xe-3381-20adfb60af27bc0e490b2d20609c3158ae2fbd26/shard-dg2-464/igt@kms_psr@fbc-psr-sprite-plane-move.html
   [265]: https://intel-gfx-ci.01.org/tree/intel-xe/xe-pw-149756v4/shard-dg2-432/igt@kms_psr@fbc-psr-sprite-plane-move.html

  * igt@kms_psr@fbc-psr2-basic:
    - shard-dg2-set2:     [SKIP][266] ([Intel XE#4208]) -> [SKIP][267] ([Intel XE#2850] / [Intel XE#929]) +7 other tests skip
   [266]: https://intel-gfx-ci.01.org/tree/intel-xe/xe-3381-20adfb60af27bc0e490b2d20609c3158ae2fbd26/shard-dg2-464/igt@kms_psr@fbc-psr2-basic.html
   [267]: https://intel-gfx-ci.01.org/tree/intel-xe/xe-pw-149756v4/shard-dg2-466/igt@kms_psr@fbc-psr2-basic.html

  * igt@kms_psr@fbc-psr2-sprite-plane-move:
    - shard-dg2-set2:     [SKIP][268] ([Intel XE#2850] / [Intel XE#929]) -> [SKIP][269] ([Intel XE#4208]) +9 other tests skip
   [268]: https://intel-gfx-ci.01.org/tree/intel-xe/xe-3381-20adfb60af27bc0e490b2d20609c3158ae2fbd26/shard-dg2-432/igt@kms_psr@fbc-psr2-sprite-plane-move.html
   [269]: https://intel-gfx-ci.01.org/tree/intel-xe/xe-pw-149756v4/shard-dg2-464/igt@kms_psr@fbc-psr2-sprite-plane-move.html

  * igt@kms_psr@psr-sprite-plane-move:
    - shard-dg2-set2:     [SKIP][270] ([Intel XE#2850] / [Intel XE#929]) -> [SKIP][271] ([Intel XE#2351] / [Intel XE#4208])
   [270]: https://intel-gfx-ci.01.org/tree/intel-xe/xe-3381-20adfb60af27bc0e490b2d20609c3158ae2fbd26/shard-dg2-466/igt@kms_psr@psr-sprite-plane-move.html
   [271]: https://intel-gfx-ci.01.org/tree/intel-xe/xe-pw-149756v4/shard-dg2-464/igt@kms_psr@psr-sprite-plane-move.html

  * igt@kms_psr_stress_test@flip-primary-invalidate-overlay:
    - shard-dg2-set2:     [SKIP][272] ([Intel XE#2351] / [Intel XE#4208]) -> [SKIP][273] ([Intel XE#2939])
   [272]: https://intel-gfx-ci.01.org/tree/intel-xe/xe-3381-20adfb60af27bc0e490b2d20609c3158ae2fbd26/shard-dg2-464/igt@kms_psr_stress_test@flip-primary-invalidate-overlay.html
   [273]: https://intel-gfx-ci.01.org/tree/intel-xe/xe-pw-149756v4/shard-dg2-432/igt@kms_psr_stress_test@flip-primary-invalidate-overlay.html

  * igt@kms_rotation_crc@primary-y-tiled-reflect-x-90:
    - shard-dg2-set2:     [SKIP][274] ([Intel XE#4208] / [i915#2575]) -> [SKIP][275] ([Intel XE#3414]) +1 other test skip
   [274]: https://intel-gfx-ci.01.org/tree/intel-xe/xe-3381-20adfb60af27bc0e490b2d20609c3158ae2fbd26/shard-dg2-464/igt@kms_rotation_crc@primary-y-tiled-reflect-x-90.html
   [275]: https://intel-gfx-ci.01.org/tree/intel-xe/xe-pw-149756v4/shard-dg2-466/igt@kms_rotation_crc@primary-y-tiled-reflect-x-90.html

  * igt@kms_rotation_crc@primary-yf-tiled-reflect-x-180:
    - shard-dg2-set2:     [SKIP][276] ([Intel XE#1127]) -> [SKIP][277] ([Intel XE#4208] / [i915#2575])
   [276]: https://intel-gfx-ci.01.org/tree/intel-xe/xe-3381-20adfb60af27bc0e490b2d20609c3158ae2fbd26/shard-dg2-433/igt@kms_rotation_crc@primary-yf-tiled-reflect-x-180.html
   [277]: https://intel-gfx-ci.01.org/tree/intel-xe/xe-pw-149756v4/shard-dg2-464/igt@kms_rotation_crc@primary-yf-tiled-reflect-x-180.html

  * igt@kms_tiled_display@basic-test-pattern-with-chamelium:
    - shard-dg2-set2:     [SKIP][278] ([Intel XE#362]) -> [SKIP][279] ([Intel XE#1500])
   [278]: https://intel-gfx-ci.01.org/tree/intel-xe/xe-3381-20adfb60af27bc0e490b2d20609c3158ae2fbd26/shard-dg2-463/igt@kms_tiled_display@basic-test-pattern-with-chamelium.html
   [279]: https://intel-gfx-ci.01.org/tree/intel-xe/xe-pw-149756v4/shard-dg2-434/igt@kms_tiled_display@basic-test-pattern-with-chamelium.html

  * igt@kms_vrr@cmrr:
    - shard-dg2-set2:     [SKIP][280] ([Intel XE#2168]) -> [SKIP][281] ([Intel XE#4208] / [i915#2575])
   [280]: https://intel-gfx-ci.01.org/tree/intel-xe/xe-3381-20adfb60af27bc0e490b2d20609c3158ae2fbd26/shard-dg2-432/igt@kms_vrr@cmrr.html
   [281]: https://intel-gfx-ci.01.org/tree/intel-xe/xe-pw-149756v4/shard-dg2-464/igt@kms_vrr@cmrr.html

  * igt@kms_vrr@flip-basic-fastset:
    - shard-dg2-set2:     [SKIP][282] ([Intel XE#4208] / [i915#2575]) -> [SKIP][283] ([Intel XE#455]) +6 other tests skip
   [282]: https://intel-gfx-ci.01.org/tree/intel-xe/xe-3381-20adfb60af27bc0e490b2d20609c3158ae2fbd26/shard-dg2-464/igt@kms_vrr@flip-basic-fastset.html
   [283]: https://intel-gfx-ci.01.org/tree/intel-xe/xe-pw-149756v4/shard-dg2-432/igt@kms_vrr@flip-basic-fastset.html

  * igt@kms_vrr@lobf:
    - shard-dg2-set2:     [SKIP][284] ([Intel XE#4208] / [i915#2575]) -> [SKIP][285] ([Intel XE#2168])
   [284]: https://intel-gfx-ci.01.org/tree/intel-xe/xe-3381-20adfb60af27bc0e490b2d20609c3158ae2fbd26/shard-dg2-464/igt@kms_vrr@lobf.html
   [285]: https://intel-gfx-ci.01.org/tree/intel-xe/xe-pw-149756v4/shard-dg2-466/igt@kms_vrr@lobf.html

  * igt@sriov_basic@enable-vfs-autoprobe-off:
    - shard-dg2-set2:     [SKIP][286] ([Intel XE#4208] / [i915#2575]) -> [SKIP][287] ([Intel XE#1091] / [Intel XE#2849])
   [286]: https://intel-gfx-ci.01.org/tree/intel-xe/xe-3381-20adfb60af27bc0e490b2d20609c3158ae2fbd26/shard-dg2-464/igt@sriov_basic@enable-vfs-autoprobe-off.html
   [287]: https://intel-gfx-ci.01.org/tree/intel-xe/xe-pw-149756v4/shard-dg2-432/igt@sriov_basic@enable-vfs-autoprobe-off.html

  * igt@sriov_basic@enable-vfs-bind-unbind-each-numvfs-all:
    - shard-dg2-set2:     [SKIP][288] ([Intel XE#1091] / [Intel XE#2849]) -> [SKIP][289] ([Intel XE#4208] / [i915#2575])
   [288]: https://intel-gfx-ci.01.org/tree/intel-xe/xe-3381-20adfb60af27bc0e490b2d20609c3158ae2fbd26/shard-dg2-432/igt@sriov_basic@enable-vfs-bind-unbind-each-numvfs-all.html
   [289]: https://intel-gfx-ci.01.org/tree/intel-xe/xe-pw-149756v4/shard-dg2-464/igt@sriov_basic@enable-vfs-bind-unbind-each-numvfs-all.html

  * igt@xe_compute_preempt@compute-preempt-many-all-ram:
    - shard-dg2-set2:     [SKIP][290] ([Intel XE#455]) -> [SKIP][291] ([Intel XE#4208]) +1 other test skip
   [290]: https://intel-gfx-ci.01.org/tree/intel-xe/xe-3381-20adfb60af27bc0e490b2d20609c3158ae2fbd26/shard-dg2-433/igt@xe_compute_preempt@compute-preempt-many-all-ram.html
   [291]: https://intel-gfx-ci.01.org/tree/intel-xe/xe-pw-149756v4/shard-dg2-464/igt@xe_compute_preempt@compute-preempt-many-all-ram.html

  * igt@xe_compute_preempt@compute-threadgroup-preempt:
    - shard-dg2-set2:     [SKIP][292] ([Intel XE#4208]) -> [SKIP][293] ([Intel XE#1280] / [Intel XE#455])
   [292]: https://intel-gfx-ci.01.org/tree/intel-xe/xe-3381-20adfb60af27bc0e490b2d20609c3158ae2fbd26/shard-dg2-464/igt@xe_compute_preempt@compute-threadgroup-preempt.html
   [293]: https://intel-gfx-ci.01.org/tree/intel-xe/xe-pw-149756v4/shard-dg2-436/igt@xe_compute_preempt@compute-threadgroup-preempt.html

  * igt@xe_copy_basic@mem-copy-linear-0xfffe:
    - shard-dg2-set2:     [SKIP][294] ([Intel XE#1123]) -> [SKIP][295] ([Intel XE#4208])
   [294]: https://intel-gfx-ci.01.org/tree/intel-xe/xe-3381-20adfb60af27bc0e490b2d20609c3158ae2fbd26/shard-dg2-432/igt@xe_copy_basic@mem-copy-linear-0xfffe.html
   [295]: https://intel-gfx-ci.01.org/tree/intel-xe/xe-pw-149756v4/shard-dg2-464/igt@xe_copy_basic@mem-copy-linear-0xfffe.html

  * igt@xe_copy_basic@mem-set-linear-0xfffe:
    - shard-dg2-set2:     [SKIP][296] ([Intel XE#4208]) -> [SKIP][297] ([Intel XE#1126])
   [296]: https://intel-gfx-ci.01.org/tree/intel-xe/xe-3381-20adfb60af27bc0e490b2d20609c3158ae2fbd26/shard-dg2-464/igt@xe_copy_basic@mem-set-linear-0xfffe.html
   [297]: https://intel-gfx-ci.01.org/tree/intel-xe/xe-pw-149756v4/shard-dg2-432/igt@xe_copy_basic@mem-set-linear-0xfffe.html

  * igt@xe_eu_stall@blocking-read:
    - shard-dg2-set2:     [SKIP][298] ([Intel XE#4208]) -> [SKIP][299] ([Intel XE#5419])
   [298]: https://intel-gfx-ci.01.org/tree/intel-xe/xe-3381-20adfb60af27bc0e490b2d20609c3158ae2fbd26/shard-dg2-464/igt@xe_eu_stall@blocking-read.html
   [299]: https://intel-gfx-ci.01.org/tree/intel-xe/xe-pw-149756v4/shard-dg2-436/igt@xe_eu_stall@blocking-read.html

  * igt@xe_eu_stall@invalid-sampling-rate:
    - shard-dg2-set2:     [SKIP][300] ([Intel XE#5419]) -> [SKIP][301] ([Intel XE#4208])
   [300]: https://intel-gfx-ci.01.org/tree/intel-xe/xe-3381-20adfb60af27bc0e490b2d20609c3158ae2fbd26/shard-dg2-466/igt@xe_eu_stall@invalid-sampling-rate.html
   [301]: https://intel-gfx-ci.01.org/tree/intel-xe/xe-pw-149756v4/shard-dg2-464/igt@xe_eu_stall@invalid-sampling-rate.html

  * igt@xe_eudebug@basic-vm-access-parameters:
    - shard-dg2-set2:     [SKIP][302] ([Intel XE#4837]) -> [SKIP][303] ([Intel XE#4208]) +7 other tests skip
   [302]: https://intel-gfx-ci.01.org/tree/intel-xe/xe-3381-20adfb60af27bc0e490b2d20609c3158ae2fbd26/shard-dg2-466/igt@xe_eudebug@basic-vm-access-parameters.html
   [303]: https://intel-gfx-ci.01.org/tree/intel-xe/xe-pw-149756v4/shard-dg2-464/igt@xe_eudebug@basic-vm-access-parameters.html

  * igt@xe_eudebug@vm-bind-clear-faultable:
    - shard-dg2-set2:     [SKIP][304] ([Intel XE#4208]) -> [SKIP][305] ([Intel XE#4837]) +8 other tests skip
   [304]: https://intel-gfx-ci.01.org/tree/intel-xe/xe-3381-20adfb60af27bc0e490b2d20609c3158ae2fbd26/shard-dg2-464/igt@xe_eudebug@vm-bind-clear-faultable.html
   [305]: https://intel-gfx-ci.01.org/tree/intel-xe/xe-pw-149756v4/shard-dg2-466/igt@xe_eudebug@vm-bind-clear-faultable.html

  * igt@xe_eudebug_sriov@deny-eudebug:
    - shard-dg2-set2:     [SKIP][306] ([Intel XE#4518]) -> [SKIP][307] ([Intel XE#4208])
   [306]: https://intel-gfx-ci.01.org/tree/intel-xe/xe-3381-20adfb60af27bc0e490b2d20609c3158ae2fbd26/shard-dg2-433/igt@xe_eudebug_sriov@deny-eudebug.html
   [307]: https://intel-gfx-ci.01.org/tree/intel-xe/xe-pw-149756v4/shard-dg2-464/igt@xe_eudebug_sriov@deny-eudebug.html

  * igt@xe_exec_basic@multigpu-once-basic:
    - shard-dg2-set2:     [SKIP][308] ([Intel XE#1392]) -> [SKIP][309] ([Intel XE#4208])
   [308]: https://intel-gfx-ci.01.org/tree/intel-xe/xe-3381-20adfb60af27bc0e490b2d20609c3158ae2fbd26/shard-dg2-432/igt@xe_exec_basic@multigpu-once-basic.html
   [309]: https://intel-gfx-ci.01.org/tree/intel-xe/xe-pw-149756v4/shard-dg2-464/igt@xe_exec_basic@multigpu-once-basic.html

  * igt@xe_exec_basic@multigpu-once-bindexecqueue-rebind:
    - shard-dg2-set2:     [SKIP][310] ([Intel XE#4208]) -> [SKIP][311] ([Intel XE#1392]) +2 other tests skip
   [310]: https://intel-gfx-ci.01.org/tree/intel-xe/xe-3381-20adfb60af27bc0e490b2d20609c3158ae2fbd26/shard-dg2-464/igt@xe_exec_basic@multigpu-once-bindexecqueue-rebind.html
   [311]: https://intel-gfx-ci.01.org/tree/intel-xe/xe-pw-149756v4/shard-dg2-432/igt@xe_exec_basic@multigpu-once-bindexecqueue-rebind.html

  * igt@xe_exec_fault_mode@once-rebind-imm:
    - shard-dg2-set2:     [SKIP][312] ([Intel XE#288]) -> [SKIP][313] ([Intel XE#4208]) +16 other tests skip
   [312]: https://intel-gfx-ci.01.org/tree/intel-xe/xe-3381-20adfb60af27bc0e490b2d20609c3158ae2fbd26/shard-dg2-433/igt@xe_exec_fault_mode@once-rebind-imm.html
   [313]: https://intel-gfx-ci.01.org/tree/intel-xe/xe-pw-149756v4/shard-dg2-464/igt@xe_exec_fault_mode@once-rebind-imm.html

  * igt@xe_exec_fault_mode@once-userptr-rebind:
    - shard-dg2-set2:     [SKIP][314] ([Intel XE#4208]) -> [SKIP][315] ([Intel XE#288]) +18 other tests skip
   [314]: https://intel-gfx-ci.01.org/tree/intel-xe/xe-3381-20adfb60af27bc0e490b2d20609c3158ae2fbd26/shard-dg2-464/igt@xe_exec_fault_mode@once-userptr-rebind.html
   [315]: https://intel-gfx-ci.01.org/tree/intel-xe/xe-pw-149756v4/shard-dg2-466/igt@xe_exec_fault_mode@once-userptr-rebind.html

  * igt@xe_exec_mix_modes@exec-spinner-interrupted-dma-fence:
    - shard-dg2-set2:     [SKIP][316] ([Intel XE#4208]) -> [SKIP][317] ([Intel XE#2360])
   [316]: https://intel-gfx-ci.01.org/tree/intel-xe/xe-3381-20adfb60af27bc0e490b2d20609c3158ae2fbd26/shard-dg2-464/igt@xe_exec_mix_modes@exec-spinner-interrupted-dma-fence.html
   [317]: https://intel-gfx-ci.01.org/tree/intel-xe/xe-pw-149756v4/shard-dg2-432/igt@xe_exec_mix_modes@exec-spinner-interrupted-dma-fence.html

  * igt@xe_exec_mix_modes@exec-spinner-interrupted-lr:
    - shard-dg2-set2:     [SKIP][318] ([Intel XE#2360]) -> [SKIP][319] ([Intel XE#4208])
   [318]: https://intel-gfx-ci.01.org/tree/intel-xe/xe-3381-20adfb60af27bc0e490b2d20609c3158ae2fbd26/shard-dg2-432/igt@xe_exec_mix_modes@exec-spinner-interrupted-lr.html
   [319]: https://intel-gfx-ci.01.org/tree/intel-xe/xe-pw-149756v4/shard-dg2-464/igt@xe_exec_mix_modes@exec-spinner-interrupted-lr.html

  * igt@xe_exec_system_allocator@threads-many-large-mmap-shared-remap-dontunmap-eocheck:
    - shard-dg2-set2:     [SKIP][320] ([Intel XE#4208]) -> [SKIP][321] ([Intel XE#4915]) +183 other tests skip
   [320]: https://intel-gfx-ci.01.org/tree/intel-xe/xe-3381-20adfb60af27bc0e490b2d20609c3158ae2fbd26/shard-dg2-464/igt@xe_exec_system_allocator@threads-many-large-mmap-shared-remap-dontunmap-eocheck.html
   [321]: https://intel-gfx-ci.01.org/tree/intel-xe/xe-pw-149756v4/shard-dg2-436/igt@xe_exec_system_allocator@threads-many-large-mmap-shared-remap-dontunmap-eocheck.html

  * igt@xe_exec_system_allocator@threads-shared-vm-many-large-execqueues-new-bo-map-nomemset:
    - shard-lnl:          [FAIL][322] ([Intel XE#5018]) -> [FAIL][323] ([Intel XE#4937])
   [322]: https://intel-gfx-ci.01.org/tree/intel-xe/xe-3381-20adfb60af27bc0e490b2d20609c3158ae2fbd26/shard-lnl-7/igt@xe_exec_system_allocator@threads-shared-vm-many-large-execqueues-new-bo-map-nomemset.html
   [323]: https://intel-gfx-ci.01.org/tree/intel-xe/xe-pw-149756v4/shard-lnl-1/igt@xe_exec_system_allocator@threads-shared-vm-many-large-execqueues-new-bo-map-nomemset.html

  * igt@xe_exec_system_allocator@threads-shared-vm-many-large-malloc:
    - shard-dg2-set2:     [SKIP][324] ([Intel XE#4915]) -> [SKIP][325] ([Intel XE#4208]) +172 other tests skip
   [324]: https://intel-gfx-ci.01.org/tree/intel-xe/xe-3381-20adfb60af27bc0e490b2d20609c3158ae2fbd26/shard-dg2-466/igt@xe_exec_system_allocator@threads-shared-vm-many-large-malloc.html
   [325]: https://intel-gfx-ci.01.org/tree/intel-xe/xe-pw-149756v4/shard-dg2-464/igt@xe_exec_system_allocator@threads-shared-vm-many-large-malloc.html

  * igt@xe_huc_copy@huc_copy:
    - shard-dg2-set2:     [SKIP][326] ([Intel XE#4208]) -> [SKIP][327] ([Intel XE#255])
   [326]: https://intel-gfx-ci.01.org/tree/intel-xe/xe-3381-20adfb60af27bc0e490b2d20609c3158ae2fbd26/shard-dg2-464/igt@xe_huc_copy@huc_copy.html
   [327]: https://intel-gfx-ci.01.org/tree/intel-xe/xe-pw-149756v4/shard-dg2-436/igt@xe_huc_copy@huc_copy.html

  * igt@xe_media_fill@media-fill:
    - shard-dg2-set2:     [SKIP][328] ([Intel XE#560]) -> [SKIP][329] ([Intel XE#4208])
   [328]: https://intel-gfx-ci.01.org/tree/intel-xe/xe-3381-20adfb60af27bc0e490b2d20609c3158ae2fbd26/shard-dg2-433/igt@xe_media_fill@media-fill.html
   [329]: https://intel-gfx-ci.01.org/tree/intel-xe/xe-pw-149756v4/shard-dg2-464/igt@xe_media_fill@media-fill.html

  * igt@xe_oa@create-destroy-userspace-config:
    - shard-dg2-set2:     [SKIP][330] ([Intel XE#4208]) -> [SKIP][331] ([Intel XE#2541] / [Intel XE#3573]) +2 other tests skip
   [330]: https://intel-gfx-ci.01.org/tree/intel-xe/xe-3381-20adfb60af27bc0e490b2d20609c3158ae2fbd26/shard-dg2-464/igt@xe_oa@create-destroy-userspace-config.html
   [331]: https://intel-gfx-ci.01.org/tree/intel-xe/xe-pw-149756v4/shard-dg2-436/igt@xe_oa@create-destroy-userspace-config.html

  * igt@xe_oa@mmio-triggered-reports-read:
    - shard-dg2-set2:     [SKIP][332] ([Intel XE#4208]) -> [SKIP][333] ([Intel XE#5103])
   [332]: https://intel-gfx-ci.01.org/tree/intel-xe/xe-3381-20adfb60af27bc0e490b2d20609c3158ae2fbd26/shard-dg2-464/igt@xe_oa@mmio-triggered-reports-read.html
   [333]: https://intel-gfx-ci.01.org/tree/intel-xe/xe-pw-149756v4/shard-dg2-466/igt@xe_oa@mmio-triggered-reports-read.html

  * igt@xe_oa@non-privileged-access-vaddr:
    - shard-dg2-set2:     [SKIP][334] ([Intel XE#2541] / [Intel XE#3573]) -> [SKIP][335] ([Intel XE#4208]) +4 other tests skip
   [334]: https://intel-gfx-ci.01.org/tree/intel-xe/xe-3381-20adfb60af27bc0e490b2d20609c3158ae2fbd26/shard-dg2-433/igt@xe_oa@non-privileged-access-vaddr.html
   [335]: https://intel-gfx-ci.01.org/tree/intel-xe/xe-pw-149756v4/shard-dg2-464/igt@xe_oa@non-privileged-access-vaddr.html

  * igt@xe_oa@syncs-syncobj-wait:
    - shard-dg2-set2:     [SKIP][336] ([Intel XE#4208]) -> [SKIP][337] ([Intel XE#2541] / [Intel XE#3573] / [Intel XE#4501])
   [336]: https://intel-gfx-ci.01.org/tree/intel-xe/xe-3381-20adfb60af27bc0e490b2d20609c3158ae2fbd26/shard-dg2-464/igt@xe_oa@syncs-syncobj-wait.html
   [337]: https://intel-gfx-ci.01.org/tree/intel-xe/xe-pw-149756v4/shard-dg2-466/igt@xe_oa@syncs-syncobj-wait.html

  * igt@xe_oa@syncs-userptr-wait-cfg:
    - shard-dg2-set2:     [SKIP][338] ([Intel XE#2541] / [Intel XE#3573] / [Intel XE#4501]) -> [SKIP][339] ([Intel XE#4208])
   [338]: https://intel-gfx-ci.01.org/tree/intel-xe/xe-3381-20adfb60af27bc0e490b2d20609c3158ae2fbd26/shard-dg2-466/igt@xe_oa@syncs-userptr-wait-cfg.html
   [339]: https://intel-gfx-ci.01.org/tree/intel-xe/xe-pw-149756v4/shard-dg2-464/igt@xe_oa@syncs-userptr-wait-cfg.html

  * igt@xe_pat@pat-index-xehpc:
    - shard-dg2-set2:     [SKIP][340] ([Intel XE#4208]) -> [SKIP][341] ([Intel XE#2838] / [Intel XE#979])
   [340]: https://intel-gfx-ci.01.org/tree/intel-xe/xe-3381-20adfb60af27bc0e490b2d20609c3158ae2fbd26/shard-dg2-464/igt@xe_pat@pat-index-xehpc.html
   [341]: https://intel-gfx-ci.01.org/tree/intel-xe/xe-pw-149756v4/shard-dg2-436/igt@xe_pat@pat-index-xehpc.html

  * igt@xe_pm@d3cold-mmap-system:
    - shard-dg2-set2:     [SKIP][342] ([Intel XE#4208]) -> [SKIP][343] ([Intel XE#2284] / [Intel XE#366]) +2 other tests skip
   [342]: https://intel-gfx-ci.01.org/tree/intel-xe/xe-3381-20adfb60af27bc0e490b2d20609c3158ae2fbd26/shard-dg2-464/igt@xe_pm@d3cold-mmap-system.html
   [343]: https://intel-gfx-ci.01.org/tree/intel-xe/xe-pw-149756v4/shard-dg2-432/igt@xe_pm@d3cold-mmap-system.html

  * igt@xe_pmu@all-fn-engine-activity-load:
    - shard-dg2-set2:     [SKIP][344] ([Intel XE#4650]) -> [SKIP][345] ([Intel XE#4208])
   [344]: https://intel-gfx-ci.01.org/tree/intel-xe/xe-3381-20adfb60af27bc0e490b2d20609c3158ae2fbd26/shard-dg2-433/igt@xe_pmu@all-fn-engine-activity-load.html
   [345]: https://intel-gfx-ci.01.org/tree/intel-xe/xe-pw-149756v4/shard-dg2-464/igt@xe_pmu@all-fn-engine-activity-load.html

  * igt@xe_pxp@pxp-stale-bo-bind-post-suspend:
    - shard-dg2-set2:     [SKIP][346] ([Intel XE#4208]) -> [SKIP][347] ([Intel XE#4733]) +2 other tests skip
   [346]: https://intel-gfx-ci.01.org/tree/intel-xe/xe-3381-20adfb60af27bc0e490b2d20609c3158ae2fbd26/shard-dg2-464/igt@xe_pxp@pxp-stale-bo-bind-post-suspend.html
   [347]: https://intel-gfx-ci.01.org/tree/intel-xe/xe-pw-149756v4/shard-dg2-466/igt@xe_pxp@pxp-stale-bo-bind-post-suspend.html

  * igt@xe_query@multigpu-query-cs-cycles:
    - shard-dg2-set2:     [SKIP][348] ([Intel XE#4208]) -> [SKIP][349] ([Intel XE#944])
   [348]: https://intel-gfx-ci.01.org/tree/intel-xe/xe-3381-20adfb60af27bc0e490b2d20609c3158ae2fbd26/shard-dg2-464/igt@xe_query@multigpu-query-cs-cycles.html
   [349]: https://intel-gfx-ci.01.org/tree/intel-xe/xe-pw-149756v4/shard-dg2-466/igt@xe_query@multigpu-query-cs-cycles.html

  * igt@xe_query@multigpu-query-engines:
    - shard-dg2-set2:     [SKIP][350] ([Intel XE#944]) -> [SKIP][351] ([Intel XE#4208]) +2 other tests skip
   [350]: https://intel-gfx-ci.01.org/tree/intel-xe/xe-3381-20adfb60af27bc0e490b2d20609c3158ae2fbd26/shard-dg2-433/igt@xe_query@multigpu-query-engines.html
   [351]: https://intel-gfx-ci.01.org/tree/intel-xe/xe-pw-149756v4/shard-dg2-464/igt@xe_query@multigpu-query-engines.html

  * igt@xe_render_copy@render-stress-1-copies:
    - shard-dg2-set2:     [SKIP][352] ([Intel XE#4814]) -> [SKIP][353] ([Intel XE#4208])
   [352]: https://intel-gfx-ci.01.org/tree/intel-xe/xe-3381-20adfb60af27bc0e490b2d20609c3158ae2fbd26/shard-dg2-466/igt@xe_render_copy@render-stress-1-copies.html
   [353]: https://intel-gfx-ci.01.org/tree/intel-xe/xe-pw-149756v4/shard-dg2-464/igt@xe_render_copy@render-stress-1-copies.html

  * igt@xe_render_copy@render-stress-2-copies:
    - shard-dg2-set2:     [SKIP][354] ([Intel XE#4208]) -> [SKIP][355] ([Intel XE#4814])
   [354]: https://intel-gfx-ci.01.org/tree/intel-xe/xe-3381-20adfb60af27bc0e490b2d20609c3158ae2fbd26/shard-dg2-464/igt@xe_render_copy@render-stress-2-copies.html
   [355]: https://intel-gfx-ci.01.org/tree/intel-xe/xe-pw-149756v4/shard-dg2-436/igt@xe_render_copy@render-stress-2-copies.html

  * igt@xe_sriov_auto_provisioning@selfconfig-reprovision-reduce-numvfs:
    - shard-dg2-set2:     [SKIP][356] ([Intel XE#4208]) -> [SKIP][357] ([Intel XE#4130])
   [356]: https://intel-gfx-ci.01.org/tree/intel-xe/xe-3381-20adfb60af27bc0e490b2d20609c3158ae2fbd26/shard-dg2-464/igt@xe_sriov_auto_provisioning@selfconfig-reprovision-reduce-numvfs.html
   [357]: https://intel-gfx-ci.01.org/tree/intel-xe/xe-pw-149756v4/shard-dg2-436/igt@xe_sriov_auto_provisioning@selfconfig-reprovision-reduce-numvfs.html

  * igt@xe_sriov_flr@flr-vf1-clear:
    - shard-dg2-set2:     [SKIP][358] ([Intel XE#4208]) -> [SKIP][359] ([Intel XE#3342])
   [358]: https://intel-gfx-ci.01.org/tree/intel-xe/xe-3381-20adfb60af27bc0e490b2d20609c3158ae2fbd26/shard-dg2-464/igt@xe_sriov_flr@flr-vf1-clear.html
   [359]: https://intel-gfx-ci.01.org/tree/intel-xe/xe-pw-149756v4/shard-dg2-466/igt@xe_sriov_flr@flr-vf1-clear.html

  * igt@xe_sriov_flr@flr-vfs-parallel:
    - shard-dg2-set2:     [SKIP][360] ([Intel XE#4273]) -> [SKIP][361] ([Intel XE#4208])
   [360]: https://intel-gfx-ci.01.org/tree/intel-xe/xe-3381-20adfb60af27bc0e490b2d20609c3158ae2fbd26/shard-dg2-432/igt@xe_sriov_flr@flr-vfs-parallel.html
   [361]: https://intel-gfx-ci.01.org/tree/intel-xe/xe-pw-149756v4/shard-dg2-464/igt@xe_sriov_flr@flr-vfs-parallel.html

  {name}: This element is suppressed. This means it is ignored when computing
          the status of the difference (SUCCESS, WARNING, or FAILURE).

  [Intel XE#1091]: https://gitlab.freedesktop.org/drm/xe/kernel/issues/1091
  [Intel XE#1123]: https://gitlab.freedesktop.org/drm/xe/kernel/issues/1123
  [Intel XE#1124]: https://gitlab.freedesktop.org/drm/xe/kernel/issues/1124
  [Intel XE#1126]: https://gitlab.freedesktop.org/drm/xe/kernel/issues/1126
  [Intel XE#1127]: https://gitlab.freedesktop.org/drm/xe/kernel/issues/1127
  [Intel XE#1129]: https://gitlab.freedesktop.org/drm/xe/kernel/issues/1129
  [Intel XE#1135]: https://gitlab.freedesktop.org/drm/xe/kernel/issues/1135
  [Intel XE#1158]: https://gitlab.freedesktop.org/drm/xe/kernel/issues/1158
  [Intel XE#1178]: https://gitlab.freedesktop.org/drm/xe/kernel/issues/1178
  [Intel XE#1188]: https://gitlab.freedesktop.org/drm/xe/kernel/issues/1188
  [Intel XE#1280]: https://gitlab.freedesktop.org/drm/xe/kernel/issues/1280
  [Intel XE#1340]: https://gitlab.freedesktop.org/drm/xe/kernel/issues/1340
  [Intel XE#1392]: https://gitlab.freedesktop.org/drm/xe/kernel/issues/1392
  [Intel XE#1420]: https://gitlab.freedesktop.org/drm/xe/kernel/issues/1420
  [Intel XE#1489]: https://gitlab.freedesktop.org/drm/xe/kernel/issues/1489
  [Intel XE#1499]: https://gitlab.freedesktop.org/drm/xe/kernel/issues/1499
  [Intel XE#1500]: https://gitlab.freedesktop.org/drm/xe/kernel/issues/1500
  [Intel XE#1503]: https://gitlab.freedesktop.org/drm/xe/kernel/issues/1503
  [Intel XE#1727]: https://gitlab.freedesktop.org/drm/xe/kernel/issues/1727
  [Intel XE#2049]: https://gitlab.freedesktop.org/drm/xe/kernel/issues/2049
  [Intel XE#2134]: https://gitlab.freedesktop.org/drm/xe/kernel/issues/2134
  [Intel XE#2168]: https://gitlab.freedesktop.org/drm/xe/kernel/issues/2168
  [Intel XE#2191]: https://gitlab.freedesktop.org/drm/xe/kernel/issues/2191
  [Intel XE#2234]: https://gitlab.freedesktop.org/drm/xe/kernel/issues/2234
  [Intel XE#2236]: https://gitlab.freedesktop.org/drm/xe/kernel/issues/2236
  [Intel XE#2248]: https://gitlab.freedesktop.org/drm/xe/kernel/issues/2248
  [Intel XE#2252]: https://gitlab.freedesktop.org/drm/xe/kernel/issues/2252
  [Intel XE#2284]: https://gitlab.freedesktop.org/drm/xe/kernel/issues/2284
  [Intel XE#2291]: https://gitlab.freedesktop.org/drm/xe/kernel/issues/2291
  [Intel XE#2293]: https://gitlab.freedesktop.org/drm/xe/kernel/issues/2293
  [Intel XE#2311]: https://gitlab.freedesktop.org/drm/xe/kernel/issues/2311
  [Intel XE#2312]: https://gitlab.freedesktop.org/drm/xe/kernel/issues/2312
  [Intel XE#2313]: https://gitlab.freedesktop.org/drm/xe/kernel/issues/2313
  [Intel XE#2316]: https://gitlab.freedesktop.org/drm/xe/kernel/issues/2316
  [Intel XE#2320]: https://gitlab.freedesktop.org/drm/xe/kernel/issues/2320
  [Intel XE#2321]: https://gitlab.freedesktop.org/drm/xe/kernel/issues/2321
  [Intel XE#2322]: https://gitlab.freedesktop.org/drm/xe/kernel/issues/2322
  [Intel XE#2325]: https://gitlab.freedesktop.org/drm/xe/kernel/issues/2325
  [Intel XE#2327]: https://gitlab.freedesktop.org/drm/xe/kernel/issues/2327
  [Intel XE#2341]: https://gitlab.freedesktop.org/drm/xe/kernel/issues/2341
  [Intel XE#2351]: https://gitlab.freedesktop.org/drm/xe/kernel/issues/2351
  [Intel XE#2352]: https://gitlab.freedesktop.org/drm/xe/kernel/issues/2352
  [Intel XE#2360]: https://gitlab.freedesktop.org/drm/xe/kernel/issues/2360
  [Intel XE#2374]: https://gitlab.freedesktop.org/drm/xe/kernel/issues/2374
  [Intel XE#2380]: https://gitlab.freedesktop.org/drm/xe/kernel/issues/2380
  [Intel XE#2391]: https://gitlab.freedesktop.org/drm/xe/kernel/issues/2391
  [Intel XE#2541]: https://gitlab.freedesktop.org/drm/xe/kernel/issues/2541
  [Intel XE#255]: https://gitlab.freedesktop.org/drm/xe/kernel/issues/255
  [Intel XE#2597]: https://gitlab.freedesktop.org/drm/xe/kernel/issues/2597
  [Intel XE#2838]: https://gitlab.freedesktop.org/drm/xe/kernel/issues/2838
  [Intel XE#2849]: https://gitlab.freedesktop.org/drm/xe/kernel/issues/2849
  [Intel XE#2850]: https://gitlab.freedesktop.org/drm/xe/kernel/issues/2850
  [Intel XE#288]: https://gitlab.freedesktop.org/drm/xe/kernel/issues/288
  [Intel XE#2887]: https://gitlab.freedesktop.org/drm/xe/kernel/issues/2887
  [Intel XE#2907]: https://gitlab.freedesktop.org/drm/xe/kernel/issues/2907
  [Intel XE#2925]: https://gitlab.freedesktop.org/drm/xe/kernel/issues/2925
  [Intel XE#2927]: https://gitlab.freedesktop.org/drm/xe/kernel/issues/2927
  [Intel XE#2939]: https://gitlab.freedesktop.org/drm/xe/kernel/issues/2939
  [Intel XE#2953]: https://gitlab.freedesktop.org/drm/xe/kernel/issues/2953
  [Intel XE#301]: https://gitlab.freedesktop.org/drm/xe/kernel/issues/301
  [Intel XE#3012]: https://gitlab.freedesktop.org/drm/xe/kernel/issues/3012
  [Intel XE#306]: https://gitlab.freedesktop.org/drm/xe/kernel/issues/306
  [Intel XE#307]: https://gitlab.freedesktop.org/drm/xe/kernel/issues/307
  [Intel XE#308]: https://gitlab.freedesktop.org/drm/xe/kernel/issues/308
  [Intel XE#3113]: https://gitlab.freedesktop.org/drm/xe/kernel/issues/3113
  [Intel XE#3124]: https://gitlab.freedesktop.org/drm/xe/kernel/issues/3124
  [Intel XE#316]: https://gitlab.freedesktop.org/drm/xe/kernel/issues/316
  [Intel XE#3342]: https://gitlab.freedesktop.org/drm/xe/kernel/issues/3342
  [Intel XE#3374]: https://gitlab.freedesktop.org/drm/xe/kernel/issues/3374
  [Intel XE#3414]: https://gitlab.freedesktop.org/drm/xe/kernel/issues/3414
  [Intel XE#3432]: https://gitlab.freedesktop.org/drm/xe/kernel/issues/3432
  [Intel XE#346]: https://gitlab.freedesktop.org/drm/xe/kernel/issues/346
  [Intel XE#3544]: https://gitlab.freedesktop.org/drm/xe/kernel/issues/3544
  [Intel XE#3573]: https://gitlab.freedesktop.org/drm/xe/kernel/issues/3573
  [Intel XE#362]: https://gitlab.freedesktop.org/drm/xe/kernel/issues/362
  [Intel XE#366]: https://gitlab.freedesktop.org/drm/xe/kernel/issues/366
  [Intel XE#367]: https://gitlab.freedesktop.org/drm/xe/kernel/issues/367
  [Intel XE#3718]: https://gitlab.freedesktop.org/drm/xe/kernel/issues/3718
  [Intel XE#373]: https://gitlab.freedesktop.org/drm/xe/kernel/issues/373
  [Intel XE#3876]: https://gitlab.freedesktop.org/drm/xe/kernel/issues/3876
  [Intel XE#3904]: https://gitlab.freedesktop.org/drm/xe/kernel/issues/3904
  [Intel XE#4130]: https://gitlab.freedesktop.org/drm/xe/kernel/issues/4130
  [Intel XE#4173]: https://gitlab.freedesktop.org/drm/xe/kernel/issues/4173
  [Intel XE#4208]: https://gitlab.freedesktop.org/drm/xe/kernel/issues/4208
  [Intel XE#4212]: https://gitlab.freedesktop.org/drm/xe/kernel/issues/4212
  [Intel XE#4273]: https://gitlab.freedesktop.org/drm/xe/kernel/issues/4273
  [Intel XE#4294]: https://gitlab.freedesktop.org/drm/xe/kernel/issues/4294
  [Intel XE#4345]: https://gitlab.freedesktop.org/drm/xe/kernel/issues/4345
  [Intel XE#4359]: https://gitlab.freedesktop.org/drm/xe/kernel/issues/4359
  [Intel XE#4416]: https://gitlab.freedesktop.org/drm/xe/kernel/issues/4416
  [Intel XE#4422]: https://gitlab.freedesktop.org/drm/xe/kernel/issues/4422
  [Intel XE#4488]: https://gitlab.freedesktop.org/drm/xe/kernel/issues/4488
  [Intel XE#4501]: https://gitlab.freedesktop.org/drm/xe/kernel/issues/4501
  [Intel XE#4518]: https://gitlab.freedesktop.org/drm/xe/kernel/issues/4518
  [Intel XE#4522]: https://gitlab.freedesktop.org/drm/xe/kernel/issues/4522
  [Intel XE#4543]: https://gitlab.freedesktop.org/drm/xe/kernel/issues/4543
  [Intel XE#455]: https://gitlab.freedesktop.org/drm/xe/kernel/issues/455
  [Intel XE#4596]: https://gitlab.freedesktop.org/drm/xe/kernel/issues/4596
  [Intel XE#4618]: https://gitlab.freedesktop.org/drm/xe/kernel/issues/4618
  [Intel XE#4650]: https://gitlab.freedesktop.org/drm/xe/kernel/issues/4650
  [Intel XE#4733]: https://gitlab.freedesktop.org/drm/xe/kernel/issues/4733
  [Intel XE#4814]: https://gitlab.freedesktop.org/drm/xe/kernel/issues/4814
  [Intel XE#4837]: https://gitlab.freedesktop.org/drm/xe/kernel/issues/4837
  [Intel XE#4912]: https://gitlab.freedesktop.org/drm/xe/kernel/issues/4912
  [Intel XE#4915]: https://gitlab.freedesktop.org/drm/xe/kernel/issues/4915
  [Intel XE#4937]: https://gitlab.freedesktop.org/drm/xe/kernel/issues/4937
  [Intel XE#4943]: https://gitlab.freedesktop.org/drm/xe/kernel/issues/4943
  [Intel XE#5018]: https://gitlab.freedesktop.org/drm/xe/kernel/issues/5018
  [Intel XE#5021]: https://gitlab.freedesktop.org/drm/xe/kernel/issues/5021
  [Intel XE#5103]: https://gitlab.freedesktop.org/drm/xe/kernel/issues/5103
  [Intel XE#5166]: https://gitlab.freedesktop.org/drm/xe/kernel/issues/5166
  [Intel XE#5208]: https://gitlab.freedesktop.org/drm/xe/kernel/issues/5208
  [Intel XE#5300]: https://gitlab.freedesktop.org/drm/xe/kernel/issues/5300
  [Intel XE#5339]: https://gitlab.freedesktop.org/drm/xe/kernel/issues/5339
  [Intel XE#5390]: https://gitlab.freedesktop.org/drm/xe/kernel/issues/5390
  [Intel XE#5419]: https://gitlab.freedesktop.org/drm/xe/kernel/issues/5419
  [Intel XE#560]: https://gitlab.freedesktop.org/drm/xe/kernel/issues/560
  [Intel XE#569]: https://gitlab.freedesktop.org/drm/xe/kernel/issues/569
  [Intel XE#607]: https://gitlab.freedesktop.org/drm/xe/kernel/issues/607
  [Intel XE#610]: https://gitlab.freedesktop.org/drm/xe/kernel/issues/610
  [Intel XE#616]: https://gitlab.freedesktop.org/drm/xe/kernel/issues/616
  [Intel XE#619]: https://gitlab.freedesktop.org/drm/xe/kernel/issues/619
  [Intel XE#623]: https://gitlab.freedesktop.org/drm/xe/kernel/issues/623
  [Intel XE#651]: https://gitlab.freedesktop.org/drm/xe/kernel/issues/651
  [Intel XE#653]: https://gitlab.freedesktop.org/drm/xe/kernel/issues/653
  [Intel XE#658]: https://gitlab.freedesktop.org/drm/xe/kernel/issues/658
  [Intel XE#787]: https://gitlab.freedesktop.org/drm/xe/kernel/issues/787
  [Intel XE#827]: https://gitlab.freedesktop.org/drm/xe/kernel/issues/827
  [Intel XE#870]: https://gitlab.freedesktop.org/drm/xe/kernel/issues/870
  [Intel XE#908]: https://gitlab.freedesktop.org/drm/xe/kernel/issues/908
  [Intel XE#929]: https://gitlab.freedesktop.org/drm/xe/kernel/issues/929
  [Intel XE#944]: https://gitlab.freedesktop.org/drm/xe/kernel/issues/944
  [Intel XE#979]: https://gitlab.freedesktop.org/drm/xe/kernel/issues/979
  [i915#2575]: https://gitlab.freedesktop.org/drm/i915/kernel/-/issues/2575


Build changes
-------------

  * Linux: xe-3381-20adfb60af27bc0e490b2d20609c3158ae2fbd26 -> xe-pw-149756v4

  IGT_8447: 8447
  xe-3381-20adfb60af27bc0e490b2d20609c3158ae2fbd26: 20adfb60af27bc0e490b2d20609c3158ae2fbd26
  xe-pw-149756v4: 149756v4

== Logs ==

For more details see: https://intel-gfx-ci.01.org/tree/intel-xe/xe-pw-149756v4/index.html



* Re: [PATCH v4 1/9] drm: Add a vendor-specific recovery method to device wedged uevent
  2025-07-09 14:18       ` Raag Jadav
@ 2025-07-09 16:52         ` Rodrigo Vivi
  2025-07-10  9:01           ` Simona Vetter
  0 siblings, 1 reply; 48+ messages in thread
From: Rodrigo Vivi @ 2025-07-09 16:52 UTC (permalink / raw)
  To: Raag Jadav
  Cc: Christian König, Simona Vetter, Riana Tauro, intel-xe,
	anshuman.gupta, lucas.demarchi, aravind.iddamsetty,
	umesh.nerlige.ramappa, frank.scarbrough, sk.anirban,
	André Almeida, David Airlie, dri-devel

On Wed, Jul 09, 2025 at 05:18:54PM +0300, Raag Jadav wrote:
> On Wed, Jul 09, 2025 at 04:09:20PM +0200, Christian König wrote:
> > On 09.07.25 15:41, Simona Vetter wrote:
> > > On Wed, Jul 09, 2025 at 04:50:13PM +0530, Riana Tauro wrote:
> > >> Certain errors can cause the device to be wedged and may
> > >> require a vendor specific recovery method to restore normal
> > >> operation.
> > >>
> > >> Add a recovery method 'WEDGED=vendor-specific' for such errors. Vendors
> > >> must provide additional recovery documentation if this method
> > >> is used.
> > >>
> > >> v2: fix documentation (Raag)
> > >>
> > >> Cc: André Almeida <andrealmeid@igalia.com>
> > >> Cc: Christian König <christian.koenig@amd.com>
> > >> Cc: David Airlie <airlied@gmail.com>
> > >> Cc: <dri-devel@lists.freedesktop.org>
> > >> Suggested-by: Raag Jadav <raag.jadav@intel.com>
> > >> Signed-off-by: Riana Tauro <riana.tauro@intel.com>
> > > 
> > > I'm not really understanding what this is useful for, maybe concrete
> > > example in the form of driver code that uses this, and some tool or
> > > documentation steps that should be taken for recovery?

The case here is when the underlying FW detects that something is badly
corrupted on the FW side, decides that only a firmware flash can fix it, and
raises an interrupt to the driver. At that point we want to wedge, and
immediately hint the recommended action to the admin.

> > 
> > The recovery method for this particular case is to flash in a new firmware.
> > 
> > > The issues I'm seeing here is that eventually we'll get different
> > > vendor-specific recovery steps, and maybe even on the same device, and
> > > that leads us to an enumeration issue. Since it's just a string and an
> > > enum I think it'd be better to just allocate a new one every time there's
> > > a new strange recovery method instead of this opaque approach.
> > 
> > That is exactly the opposite of what we discussed so far.
> > 
> > The original idea was to add a firmware-flash recovery method which looked a bit vague since it didn't give any information on what to do exactly.
> > 
> > That's why I suggested adding a more generic vendor-specific event which refers to the documentation and system log to see what actually needs to be done.
> > 
> > Otherwise we would end up with events like firmware-flash, update FW image A, update FW image B, FW version mismatch etc....
> 
> Agree. Any newly allocated method that is specific to a vendor is going to
> be opaque anyway, since it can't be generic for all drivers. This just helps
> reduce the noise in DRM core.
> 
> And yes, there could be different vendor-specific cases for the same driver
> and the driver should be able to provide the means to distinguish between
> them.

Sim, what's your take on this then?

Should we get back to the original idea of firmware-flash?

> 
> Raag
> 
> > >> ---
> > >>  Documentation/gpu/drm-uapi.rst | 9 +++++----
> > >>  drivers/gpu/drm/drm_drv.c      | 2 ++
> > >>  include/drm/drm_device.h       | 4 ++++
> > >>  3 files changed, 11 insertions(+), 4 deletions(-)
> > >>
> > >> diff --git a/Documentation/gpu/drm-uapi.rst b/Documentation/gpu/drm-uapi.rst
> > >> index 263e5a97c080..c33070bdb347 100644
> > >> --- a/Documentation/gpu/drm-uapi.rst
> > >> +++ b/Documentation/gpu/drm-uapi.rst
> > >> @@ -421,10 +421,10 @@ Recovery
> > >>  Current implementation defines three recovery methods, out of which, drivers
> > >>  can use any one, multiple or none. Method(s) of choice will be sent in the
> > >>  uevent environment as ``WEDGED=<method1>[,..,<methodN>]`` in order of less to
> > >> -more side-effects. If driver is unsure about recovery or method is unknown
> > >> -(like soft/hard system reboot, firmware flashing, physical device replacement
> > >> -or any other procedure which can't be attempted on the fly), ``WEDGED=unknown``
> > >> -will be sent instead.
> > >> +more side-effects. If recovery method is specific to vendor
> > >> +``WEDGED=vendor-specific`` will be sent and userspace should refer to vendor
> > >> +specific documentation for further recovery steps. If driver is unsure about
> > >> +recovery or method is unknown, ``WEDGED=unknown`` will be sent instead
> > >>  
> > >>  Userspace consumers can parse this event and attempt recovery as per the
> > >>  following expectations.
> > >> @@ -435,6 +435,7 @@ following expectations.
> > >>      none            optional telemetry collection
> > >>      rebind          unbind + bind driver
> > >>      bus-reset       unbind + bus reset/re-enumeration + bind
> > >> +    vendor-specific vendor specific recovery method
> > >>      unknown         consumer policy
> > >>      =============== ========================================
> > >>  
> > >> diff --git a/drivers/gpu/drm/drm_drv.c b/drivers/gpu/drm/drm_drv.c
> > >> index cdd591b11488..0ac723a46a91 100644
> > >> --- a/drivers/gpu/drm/drm_drv.c
> > >> +++ b/drivers/gpu/drm/drm_drv.c
> > >> @@ -532,6 +532,8 @@ static const char *drm_get_wedge_recovery(unsigned int opt)
> > >>  		return "rebind";
> > >>  	case DRM_WEDGE_RECOVERY_BUS_RESET:
> > >>  		return "bus-reset";
> > >> +	case DRM_WEDGE_RECOVERY_VENDOR:
> > >> +		return "vendor-specific";
> > >>  	default:
> > >>  		return NULL;
> > >>  	}
> > >> diff --git a/include/drm/drm_device.h b/include/drm/drm_device.h
> > >> index 08b3b2467c4c..08a087f149ff 100644
> > >> --- a/include/drm/drm_device.h
> > >> +++ b/include/drm/drm_device.h
> > >> @@ -26,10 +26,14 @@ struct pci_controller;
> > >>   * Recovery methods for wedged device in order of less to more side-effects.
> > >>   * To be used with drm_dev_wedged_event() as recovery @method. Callers can
> > >>   * use any one, multiple (or'd) or none depending on their needs.
> > >> + *
> > >> + * Refer to "Device Wedging" chapter in Documentation/gpu/drm-uapi.rst for more
> > >> + * details.
> > >>   */
> > >>  #define DRM_WEDGE_RECOVERY_NONE		BIT(0)	/* optional telemetry collection */
> > >>  #define DRM_WEDGE_RECOVERY_REBIND	BIT(1)	/* unbind + bind driver */
> > >>  #define DRM_WEDGE_RECOVERY_BUS_RESET	BIT(2)	/* unbind + reset bus device + bind */
> > >> +#define DRM_WEDGE_RECOVERY_VENDOR	BIT(3)	/* vendor specific recovery method */
> > >>  
> > >>  /**
> > >>   * struct drm_wedge_task_info - information about the guilty task of a wedge dev
> > >> -- 
> > >> 2.47.1
> > >>
> > > 
> > 
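
For readers following the documentation hunk quoted above, here is a minimal
userspace sketch of how or'd recovery bits could turn into the
``WEDGED=<method1>[,..,<methodN>]`` string, walking the bits from low to high
so that methods appear in order of less to more side-effects. The bit values
and method strings are taken from the quoted patch; the helper names
wedge_method_name() and format_wedged() are hypothetical, and this is not the
DRM core implementation. The "unknown" fallback follows the documentation text
only.

```c
#include <stdio.h>
#include <stddef.h>

/* Bit values mirror the DRM_WEDGE_RECOVERY_* defines quoted in the patch. */
#define WEDGE_RECOVERY_NONE      (1u << 0)
#define WEDGE_RECOVERY_REBIND    (1u << 1)
#define WEDGE_RECOVERY_BUS_RESET (1u << 2)
#define WEDGE_RECOVERY_VENDOR    (1u << 3)

/* Hypothetical helper: map one recovery bit to its uevent string. */
static const char *wedge_method_name(unsigned int bit)
{
	switch (bit) {
	case WEDGE_RECOVERY_NONE:      return "none";
	case WEDGE_RECOVERY_REBIND:    return "rebind";
	case WEDGE_RECOVERY_BUS_RESET: return "bus-reset";
	case WEDGE_RECOVERY_VENDOR:    return "vendor-specific";
	default:                       return NULL;
	}
}

/*
 * Hypothetical helper: write "WEDGED=<m1>[,<m2>...]" into buf, lowest bit
 * first. Writes "WEDGED=unknown" when no known bit is set.
 * Returns 0 on success, -1 if buf is too small.
 */
static int format_wedged(unsigned int methods, char *buf, size_t len)
{
	unsigned int bit;
	size_t used;
	int first = 1;
	int n;

	n = snprintf(buf, len, "WEDGED=");
	if (n < 0 || (size_t)n >= len)
		return -1;
	used = (size_t)n;

	for (bit = 1u; bit <= WEDGE_RECOVERY_VENDOR; bit <<= 1) {
		if (!(methods & bit))
			continue;

		n = snprintf(buf + used, len - used, "%s%s",
			     first ? "" : ",", wedge_method_name(bit));
		if (n < 0 || (size_t)n >= len - used)
			return -1;
		used += (size_t)n;
		first = 0;
	}

	if (first) {
		n = snprintf(buf + used, len - used, "unknown");
		if (n < 0 || (size_t)n >= len - used)
			return -1;
	}

	return 0;
}
```

Calling format_wedged() with the REBIND and BUS_RESET bits set yields
"WEDGED=rebind,bus-reset"; with only the VENDOR bit set it yields
"WEDGED=vendor-specific", which matches the udevadm output in the cover letter.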


* Re: [PATCH v4 2/9] drm/xe: Set GT as wedged before sending wedged uevent
  2025-07-09 11:20 ` [PATCH v4 2/9] drm/xe: Set GT as wedged before sending " Riana Tauro
@ 2025-07-09 17:26   ` Matthew Brost
  0 siblings, 0 replies; 48+ messages in thread
From: Matthew Brost @ 2025-07-09 17:26 UTC (permalink / raw)
  To: Riana Tauro
  Cc: intel-xe, anshuman.gupta, rodrigo.vivi, lucas.demarchi,
	aravind.iddamsetty, raag.jadav, umesh.nerlige.ramappa,
	frank.scarbrough, sk.anirban

On Wed, Jul 09, 2025 at 04:50:14PM +0530, Riana Tauro wrote:
> Userspace should be notified after setting the device as wedged.
> Re-order function calls to set gt wedged before sending uevent.
> 
> Cc: Matthew Brost <matthew.brost@intel.com>

Reviewed-by: Matthew Brost <matthew.brost@intel.com>

> Suggested-by: Raag Jadav <raag.jadav@intel.com>
> Signed-off-by: Riana Tauro <riana.tauro@intel.com>
> ---
>  drivers/gpu/drm/xe/xe_device.c | 12 ++++++++----
>  1 file changed, 8 insertions(+), 4 deletions(-)
> 
> diff --git a/drivers/gpu/drm/xe/xe_device.c b/drivers/gpu/drm/xe/xe_device.c
> index 0b73cb72bad1..8a5bb7b6d09b 100644
> --- a/drivers/gpu/drm/xe/xe_device.c
> +++ b/drivers/gpu/drm/xe/xe_device.c
> @@ -1123,8 +1123,10 @@ static void xe_device_wedged_fini(struct drm_device *drm, void *arg)
>   * xe_device_declare_wedged - Declare device wedged
>   * @xe: xe device instance
>   *
> - * This is a final state that can only be cleared with a module
> + * This is a final state that can only be cleared with the recovery method
> + * specified in the drm wedged uevent. The default recovery method is
>   * re-probe (unbind + bind).
> + *
>   * In this state every IOCTL will be blocked so the GT cannot be used.
>   * In general it will be called upon any critical error such as gt reset
>   * failure or guc loading failure. Userspace will be notified of this state
> @@ -1158,13 +1160,15 @@ void xe_device_declare_wedged(struct xe_device *xe)
>  			"IOCTLs and executions are blocked. Only a rebind may clear the failure\n"
>  			"Please file a _new_ bug report at https://gitlab.freedesktop.org/drm/xe/kernel/issues/new\n",
>  			dev_name(xe->drm.dev));
> +	}
> +
> +	for_each_gt(gt, xe, id)
> +		xe_gt_declare_wedged(gt);
>  
> +	if (xe_device_wedged(xe)) {
>  		/* Notify userspace of wedged device */
>  		drm_dev_wedged_event(&xe->drm,
>  				     DRM_WEDGE_RECOVERY_REBIND | DRM_WEDGE_RECOVERY_BUS_RESET,
>  				     NULL);
>  	}
> -
> -	for_each_gt(gt, xe, id)
> -		xe_gt_declare_wedged(gt);
>  }
> -- 
> 2.47.1
> 
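
The reordering in this patch can be illustrated with a small standalone model:
a listener that inspects GT state at the moment it is notified only sees a
consistent picture if every GT was marked wedged before the notification went
out. Everything below (struct fake_device, notify_userspace(), and so on) is a
hypothetical stand-in for illustration, not xe driver code.

```c
#include <stdbool.h>

#define FAKE_MAX_GT 2

/* Hypothetical stand-ins for the device and its GTs. */
struct fake_gt {
	bool wedged;
};

struct fake_device {
	bool wedged;
	struct fake_gt gt[FAKE_MAX_GT];
	/* What a listener observed at the time of notification. */
	bool listener_saw_all_gt_wedged;
};

static bool all_gt_wedged(const struct fake_device *dev)
{
	int i;

	for (i = 0; i < FAKE_MAX_GT; i++)
		if (!dev->gt[i].wedged)
			return false;
	return true;
}

/* Stands in for the uevent: record the state visible to userspace. */
static void notify_userspace(struct fake_device *dev)
{
	dev->listener_saw_all_gt_wedged = all_gt_wedged(dev);
}

/* Same ordering as the patch: device state, then GTs, then notification. */
static void declare_wedged(struct fake_device *dev)
{
	int i;

	dev->wedged = true;

	for (i = 0; i < FAKE_MAX_GT; i++)
		dev->gt[i].wedged = true;

	if (dev->wedged)
		notify_userspace(dev);
}
```

If notify_userspace() were called before the loop, as in the pre-patch
ordering, listener_saw_all_gt_wedged would end up false in this model.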


* Re: [PATCH v4 5/9] drm/xe/xe_survivability: Add support for Runtime survivability mode
  2025-07-09 11:20 ` [PATCH v4 5/9] drm/xe/xe_survivability: Add support for Runtime " Riana Tauro
@ 2025-07-09 23:44   ` Umesh Nerlige Ramappa
  2025-07-10  5:59     ` Riana Tauro
  0 siblings, 1 reply; 48+ messages in thread
From: Umesh Nerlige Ramappa @ 2025-07-09 23:44 UTC (permalink / raw)
  To: Riana Tauro
  Cc: intel-xe, anshuman.gupta, rodrigo.vivi, lucas.demarchi,
	aravind.iddamsetty, raag.jadav, frank.scarbrough, sk.anirban

On Wed, Jul 09, 2025 at 04:50:17PM +0530, Riana Tauro wrote:
>Certain runtime firmware errors can cause the device to be in an unusable
>state requiring a firmware flash to restore normal operation.
>Runtime Survivability Mode indicates firmware flash is necessary by
>wedging the device and exposing survivability mode sysfs.
>
>The below sysfs is an indication that device is in survivability mode
>
>/sys/bus/pci/devices/<device>/survivability_mode
>
>Signed-off-by: Riana Tauro <riana.tauro@intel.com>
>---
> drivers/gpu/drm/xe/xe_survivability_mode.c    | 42 ++++++++++++++++++-
> drivers/gpu/drm/xe/xe_survivability_mode.h    |  1 +
> .../gpu/drm/xe/xe_survivability_mode_types.h  |  1 +
> 3 files changed, 43 insertions(+), 1 deletion(-)
>
>diff --git a/drivers/gpu/drm/xe/xe_survivability_mode.c b/drivers/gpu/drm/xe/xe_survivability_mode.c
>index fefb027b1c84..ca1cfa13525a 100644
>--- a/drivers/gpu/drm/xe/xe_survivability_mode.c
>+++ b/drivers/gpu/drm/xe/xe_survivability_mode.c
>@@ -137,7 +137,8 @@ static ssize_t survivability_mode_show(struct device *dev,
> 	struct xe_survivability_info *info = survivability->info;
> 	int index = 0, count = 0;
>
>-	count += sysfs_emit_at(buff, count, "Survivability mode type: Boot\n");
>+	count += sysfs_emit_at(buff, count, "Survivability mode type: %s\n",
>+			       survivability->type ? "Runtime" : "Boot");
>
> 	if (!check_boot_failure(xe))
> 		return count;
>@@ -288,6 +289,45 @@ bool xe_survivability_mode_is_requested(struct xe_device *xe)
> 	return check_boot_failure(xe);
> }
>
>+/**
>+ * xe_survivability_mode_runtime_enable - Initialize and enable runtime survivability mode
>+ * @xe: xe device instance
>+ *
>+ * Initialize survivability information and enable runtime survivability mode.
>+ * Runtime survivability mode is enabled when certain errors cause the device to be
>+ * in non-recoverable state. The device is declared wedged with the appropriate
>+ * recovery method and survivability mode sysfs exposed to userspace
>+ *
>+ * Return: 0 if runtime survivability mode is enabled or not requested, negative error

Is the "not requested" part still applicable here?


>+ * code otherwise.
>+ */
>+int xe_survivability_mode_runtime_enable(struct xe_device *xe)
>+{
>+	struct xe_survivability *survivability = &xe->survivability;
>+	struct pci_dev *pdev = to_pci_dev(xe->drm.dev);
>+	int ret;
>+
>+	if (!IS_DGFX(xe) || IS_SRIOV_VF(xe) || xe->info.platform < XE_BATTLEMAGE) {

Do you think this condition can be better handled with a 
has_runtime_survivability for platforms that support it?

>+		dev_err(&pdev->dev, "Runtime Survivability Mode not supported\n");
>+		return -EINVAL;
>+	}
>+
>+	ret = init_survivability_mode(xe);
>+	if (ret)
>+		return ret;
>+
>+	ret = create_survivability_sysfs(pdev);
>+	if (ret)
>+		dev_err(&pdev->dev, "Failed to create survivability mode sysfs\n");

You do not return ret in the above if condition. Is that intentional?

Regards,
Umesh

>+
>+	survivability->type = XE_SURVIVABILITY_TYPE_RUNTIME;
>+	xe_device_set_wedged_method(xe, DRM_WEDGE_RECOVERY_VENDOR);
>+	xe_device_declare_wedged(xe);
>+
>+	dev_err(&pdev->dev, "Runtime Survivability mode enabled\n");
>+	return 0;
>+}
>+
> /**
>  * xe_survivability_mode_boot_enable - Initialize and enable boot survivability mode
>  * @xe: xe device instance
>diff --git a/drivers/gpu/drm/xe/xe_survivability_mode.h b/drivers/gpu/drm/xe/xe_survivability_mode.h
>index f6ee283ea5e8..1cc94226aa82 100644
>--- a/drivers/gpu/drm/xe/xe_survivability_mode.h
>+++ b/drivers/gpu/drm/xe/xe_survivability_mode.h
>@@ -11,6 +11,7 @@
> struct xe_device;
>
> int xe_survivability_mode_boot_enable(struct xe_device *xe);
>+int xe_survivability_mode_runtime_enable(struct xe_device *xe);
> bool xe_survivability_mode_is_boot_enabled(struct xe_device *xe);
> bool xe_survivability_mode_is_requested(struct xe_device *xe);
>
>diff --git a/drivers/gpu/drm/xe/xe_survivability_mode_types.h b/drivers/gpu/drm/xe/xe_survivability_mode_types.h
>index 5dce393498da..cd65a5d167c9 100644
>--- a/drivers/gpu/drm/xe/xe_survivability_mode_types.h
>+++ b/drivers/gpu/drm/xe/xe_survivability_mode_types.h
>@@ -11,6 +11,7 @@
>
> enum xe_survivability_type {
> 	XE_SURVIVABILITY_TYPE_BOOT,
>+	XE_SURVIVABILITY_TYPE_RUNTIME,
> };
>
> struct xe_survivability_info {
>-- 
>2.47.1
>


* Re: [PATCH v4 5/9] drm/xe/xe_survivability: Add support for Runtime survivability mode
  2025-07-09 23:44   ` Umesh Nerlige Ramappa
@ 2025-07-10  5:59     ` Riana Tauro
  2025-07-10 17:12       ` Umesh Nerlige Ramappa
  0 siblings, 1 reply; 48+ messages in thread
From: Riana Tauro @ 2025-07-10  5:59 UTC (permalink / raw)
  To: Umesh Nerlige Ramappa
  Cc: intel-xe, anshuman.gupta, rodrigo.vivi, lucas.demarchi,
	aravind.iddamsetty, raag.jadav, frank.scarbrough, sk.anirban

Hi Umesh

On 7/10/2025 5:14 AM, Umesh Nerlige Ramappa wrote:
> On Wed, Jul 09, 2025 at 04:50:17PM +0530, Riana Tauro wrote:
>> Certain runtime firmware errors can cause the device to be in an unusable
>> state requiring a firmware flash to restore normal operation.
>> Runtime Survivability Mode indicates firmware flash is necessary by
>> wedging the device and exposing survivability mode sysfs.
>>
>> The below sysfs is an indication that device is in survivability mode
>>
>> /sys/bus/pci/devices/<device>/survivability_mode
>>
>> Signed-off-by: Riana Tauro <riana.tauro@intel.com>
>> ---
>> drivers/gpu/drm/xe/xe_survivability_mode.c    | 42 ++++++++++++++++++-
>> drivers/gpu/drm/xe/xe_survivability_mode.h    |  1 +
>> .../gpu/drm/xe/xe_survivability_mode_types.h  |  1 +
>> 3 files changed, 43 insertions(+), 1 deletion(-)
>>
>> diff --git a/drivers/gpu/drm/xe/xe_survivability_mode.c b/drivers/gpu/drm/xe/xe_survivability_mode.c
>> index fefb027b1c84..ca1cfa13525a 100644
>> --- a/drivers/gpu/drm/xe/xe_survivability_mode.c
>> +++ b/drivers/gpu/drm/xe/xe_survivability_mode.c
>> @@ -137,7 +137,8 @@ static ssize_t survivability_mode_show(struct device *dev,
>>     struct xe_survivability_info *info = survivability->info;
>>     int index = 0, count = 0;
>>
>> -    count += sysfs_emit_at(buff, count, "Survivability mode type: Boot\n");
>> +    count += sysfs_emit_at(buff, count, "Survivability mode type: %s\n",
>> +                   survivability->type ? "Runtime" : "Boot");
>>
>>     if (!check_boot_failure(xe))
>>         return count;
>> @@ -288,6 +289,45 @@ bool xe_survivability_mode_is_requested(struct xe_device *xe)
>>     return check_boot_failure(xe);
>> }
>>
>> +/**
>> + * xe_survivability_mode_runtime_enable - Initialize and enable runtime survivability mode
>> + * @xe: xe device instance
>> + *
>> + * Initialize survivability information and enable runtime survivability mode.
>> + * Runtime survivability mode is enabled when certain errors cause the device to be
>> + * in non-recoverable state. The device is declared wedged with the appropriate
>> + * recovery method and survivability mode sysfs exposed to userspace
>> + *
>> + * Return: 0 if runtime survivability mode is enabled or not requested, negative error
> 
> is the "not requested" still applicable here?

Copied it from boot survivability. It is not applicable here, will remove it.

> 
> 
>> + * code otherwise.
>> + */
>> +int xe_survivability_mode_runtime_enable(struct xe_device *xe)
>> +{
>> +    struct xe_survivability *survivability = &xe->survivability;
>> +    struct pci_dev *pdev = to_pci_dev(xe->drm.dev);
>> +    int ret;
>> +
>> +    if (!IS_DGFX(xe) || IS_SRIOV_VF(xe) || xe->info.platform < XE_BATTLEMAGE) {
> 
> Do you think this condition can be better handled with a 
> has_runtime_survivability for platforms that support it?

It was used only once, so I added it inline here. It can be split out into a
separate function.

> 
>> +        dev_err(&pdev->dev, "Runtime Survivability Mode not supported\n");
>> +        return -EINVAL;
>> +    }
>> +
>> +    ret = init_survivability_mode(xe);
>> +    if (ret)
>> +        return ret;
>> +
>> +    ret = create_survivability_sysfs(pdev);
>> +    if (ret)
>> +        dev_err(&pdev->dev, "Failed to create survivability mode sysfs\n");
> 
> You do not return ret in the above if condition. Is that intentional?

Yes, this is intentional. The device has to be wedged even if sysfs
creation fails, since it is not usable after such errors.

Thanks
Riana

> 
> Regards,
> Umesh
> 
>> +
>> +    survivability->type = XE_SURVIVABILITY_TYPE_RUNTIME;
>> +    xe_device_set_wedged_method(xe, DRM_WEDGE_RECOVERY_VENDOR);
>> +    xe_device_declare_wedged(xe);
>> +
>> +    dev_err(&pdev->dev, "Runtime Survivability mode enabled\n");
>> +    return 0;
>> +}
>> +
>> /**
>>  * xe_survivability_mode_boot_enable - Initialize and enable boot survivability mode
>>  * @xe: xe device instance
>> diff --git a/drivers/gpu/drm/xe/xe_survivability_mode.h b/drivers/gpu/drm/xe/xe_survivability_mode.h
>> index f6ee283ea5e8..1cc94226aa82 100644
>> --- a/drivers/gpu/drm/xe/xe_survivability_mode.h
>> +++ b/drivers/gpu/drm/xe/xe_survivability_mode.h
>> @@ -11,6 +11,7 @@
>> struct xe_device;
>>
>> int xe_survivability_mode_boot_enable(struct xe_device *xe);
>> +int xe_survivability_mode_runtime_enable(struct xe_device *xe);
>> bool xe_survivability_mode_is_boot_enabled(struct xe_device *xe);
>> bool xe_survivability_mode_is_requested(struct xe_device *xe);
>>
>> diff --git a/drivers/gpu/drm/xe/xe_survivability_mode_types.h b/drivers/gpu/drm/xe/xe_survivability_mode_types.h
>> index 5dce393498da..cd65a5d167c9 100644
>> --- a/drivers/gpu/drm/xe/xe_survivability_mode_types.h
>> +++ b/drivers/gpu/drm/xe/xe_survivability_mode_types.h
>> @@ -11,6 +11,7 @@
>>
>> enum xe_survivability_type {
>>     XE_SURVIVABILITY_TYPE_BOOT,
>> +    XE_SURVIVABILITY_TYPE_RUNTIME,
>> };
>>
>> struct xe_survivability_info {
>> -- 
>> 2.47.1
>>
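
The two review points discussed above (a dedicated support check, and wedging
even when sysfs creation fails) can be sketched in a standalone model. All
types and names here (struct fake_xe, has_runtime_survivability(),
fake_create_sysfs(), runtime_enable()) are hypothetical stand-ins; the actual
helper name and shape are up to the next revision of the patch.

```c
#include <errno.h>
#include <stdbool.h>
#include <stdio.h>

/* Hypothetical platform ordering; only the comparison matters here. */
enum fake_platform {
	FAKE_OLDER_PLATFORM,
	FAKE_BATTLEMAGE,
	FAKE_NEWER_PLATFORM,
};

struct fake_xe {
	bool is_dgfx;
	bool is_sriov_vf;
	enum fake_platform platform;
	bool sysfs_fails;	/* test knob: force sysfs creation to fail */
	bool wedged;
};

/* The support condition from the patch, split out into one predicate. */
static bool has_runtime_survivability(const struct fake_xe *xe)
{
	return xe->is_dgfx && !xe->is_sriov_vf &&
	       xe->platform >= FAKE_BATTLEMAGE;
}

/* Stand-in for sysfs creation; fails only when the test knob is set. */
static int fake_create_sysfs(const struct fake_xe *xe)
{
	return xe->sysfs_fails ? -ENOMEM : 0;
}

static int runtime_enable(struct fake_xe *xe)
{
	if (!has_runtime_survivability(xe))
		return -EINVAL;

	/*
	 * Best effort: a sysfs failure is logged but does not abort, because
	 * the device is unusable after such an error and must be wedged
	 * regardless.
	 */
	if (fake_create_sysfs(xe))
		fprintf(stderr, "Failed to create survivability mode sysfs\n");

	xe->wedged = true;
	return 0;
}
```

In this model a supported discrete device ends up wedged and runtime_enable()
returns 0 even with sysfs_fails set, while an unsupported device gets -EINVAL
and is left untouched.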




* Re: [PATCH v4 1/9] drm: Add a vendor-specific recovery method to device wedged uevent
  2025-07-09 16:52         ` Rodrigo Vivi
@ 2025-07-10  9:01           ` Simona Vetter
  2025-07-10  9:37             ` Christian König
  0 siblings, 1 reply; 48+ messages in thread
From: Simona Vetter @ 2025-07-10  9:01 UTC (permalink / raw)
  To: Rodrigo Vivi
  Cc: Raag Jadav, Christian König, Simona Vetter, Riana Tauro,
	intel-xe, anshuman.gupta, lucas.demarchi, aravind.iddamsetty,
	umesh.nerlige.ramappa, frank.scarbrough, sk.anirban,
	André Almeida, David Airlie, dri-devel

On Wed, Jul 09, 2025 at 12:52:05PM -0400, Rodrigo Vivi wrote:
> On Wed, Jul 09, 2025 at 05:18:54PM +0300, Raag Jadav wrote:
> > On Wed, Jul 09, 2025 at 04:09:20PM +0200, Christian König wrote:
> > > On 09.07.25 15:41, Simona Vetter wrote:
> > > > On Wed, Jul 09, 2025 at 04:50:13PM +0530, Riana Tauro wrote:
> > > >> Certain errors can cause the device to be wedged and may
> > > >> require a vendor specific recovery method to restore normal
> > > >> operation.
> > > >>
> > > >> Add a recovery method 'WEDGED=vendor-specific' for such errors. Vendors
> > > >> must provide additional recovery documentation if this method
> > > >> is used.
> > > >>
> > > >> v2: fix documentation (Raag)
> > > >>
> > > >> Cc: André Almeida <andrealmeid@igalia.com>
> > > >> Cc: Christian König <christian.koenig@amd.com>
> > > >> Cc: David Airlie <airlied@gmail.com>
> > > >> Cc: <dri-devel@lists.freedesktop.org>
> > > >> Suggested-by: Raag Jadav <raag.jadav@intel.com>
> > > >> Signed-off-by: Riana Tauro <riana.tauro@intel.com>
> > > > 
> > > > I'm not really understanding what this is useful for, maybe concrete
> > > > example in the form of driver code that uses this, and some tool or
> > > > documentation steps that should be taken for recovery?
> 
> The case here is when the underlying FW detects that something is badly
> corrupted on the FW side, decides that only a firmware flash can fix it, and
> raises an interrupt to the driver. At that point we want to wedge, and
> immediately hint the recommended action to the admin.
> 
> > > 
> > > The recovery method for this particular case is to flash in a new firmware.
> > > 
> > > > The issues I'm seeing here is that eventually we'll get different
> > > > vendor-specific recovery steps, and maybe even on the same device, and
> > > > that leads us to an enumeration issue. Since it's just a string and an
> > > > enum I think it'd be better to just allocate a new one every time there's
> > > > a new strange recovery method instead of this opaque approach.
> > > 
> > > That is exactly the opposite of what we discussed so far.

Sorry, I missed that context.

> > > The original idea was to add a firmware-flash recovery method which
> > > looked a bit vague since it didn't give any information on what to do
> > > exactly.
> > > 
> > > That's why I suggested adding a more generic vendor-specific event
> > > which refers to the documentation and system log to see what actually
> > > needs to be done.
> > > 
> > > Otherwise we would end up with events like firmware-flash, update FW
> > > image A, update FW image B, FW version mismatch etc....

Yeah, that's kinda what I expect to happen, and we have enough numbers for
this all to not be an issue.

> > Agree. Any newly allocated method that is specific to a vendor is going to
> > be opaque anyway, since it can't be generic for all drivers. This just helps
> > reduce the noise in DRM core.
> > 
> > And yes, there could be different vendor-specific cases for the same driver
> > and the driver should be able to provide the means to distinguish between
> > them.
> 
> Sim, what's your take on this then?
> 
> Should we get back to the original idea of firmware-flash?

Maybe intel-firmware-flash or something, meaning prefix with the vendor?

The reason I think it should be specific is because I'm assuming you want
to script this. And if you have a big fleet with different vendors, then
"vendor-specific" doesn't tell you enough. But if it's something like
$vendor-$magic_step then it does become scriptable, and we do have a
place to put some documentation on what you should do instead.

If the point of this interface isn't that it's scriptable, then I'm not
sure why it needs to be an uevent?

I guess if you all want to stick with vendor-specific then I think that's
ok with me too, but the docs should at least explain, with a small example,
how to figure out from the uevent which vendor you're on. What I'm
worried about is that if we have this on multiple drivers, userspace will
otherwise make a complete mess and might run the wrong recovery
steps.

I think ideally, no matter what, we'd have a concrete driver patch which
then also comes with the documentation for what exactly you're supposed to
do as something you can script. And not just this stand-alone patch here.

Cheers, Sima
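
To make the scripting concern concrete, here is a minimal sketch of how a
userspace consumer might turn the WEDGED property into a set of flags before
deciding what to run. The method strings come from the documentation table in
the patch; the enum, the known_methods table and parse_wedged() are
hypothetical names. Anything the consumer does not recognise, including a
vendor-prefixed value such as the "intel-firmware-flash" idea floated above, is
treated as unknown. The sketch deliberately does not try to identify the
vendor, since how to do that from the uevent is exactly the open question.

```c
#include <stddef.h>
#include <string.h>

/* Hypothetical consumer-side flags, one per documented method. */
enum {
	METHOD_NONE      = 1 << 0,
	METHOD_REBIND    = 1 << 1,
	METHOD_BUS_RESET = 1 << 2,
	METHOD_VENDOR    = 1 << 3,
	METHOD_UNKNOWN   = 1 << 4,
};

struct method_name {
	const char *name;
	unsigned int bit;
};

static const struct method_name known_methods[] = {
	{ "none",            METHOD_NONE },
	{ "rebind",          METHOD_REBIND },
	{ "bus-reset",       METHOD_BUS_RESET },
	{ "vendor-specific", METHOD_VENDOR },
	{ "unknown",         METHOD_UNKNOWN },
};

/*
 * Parse one uevent property line. Returns a mask of METHOD_* flags for a
 * "WEDGED=..." line and 0 for any other property. Tokens that are not in
 * known_methods are reported as METHOD_UNKNOWN.
 */
static unsigned int parse_wedged(const char *prop)
{
	static const char key[] = "WEDGED=";
	unsigned int mask = 0;
	const char *p;

	if (strncmp(prop, key, sizeof(key) - 1) != 0)
		return 0;
	p = prop + sizeof(key) - 1;

	while (*p) {
		size_t n = strcspn(p, ",");	/* length of this token */
		unsigned int bit = METHOD_UNKNOWN;
		size_t i;

		for (i = 0; i < sizeof(known_methods) / sizeof(known_methods[0]); i++) {
			if (strlen(known_methods[i].name) == n &&
			    strncmp(p, known_methods[i].name, n) == 0) {
				bit = known_methods[i].bit;
				break;
			}
		}

		mask |= bit;
		p += n;
		if (*p == ',')
			p++;
	}

	return mask;
}
```

A consumer could feed each property line from the uevent to parse_wedged() and
only attempt an automated rebind or bus reset when METHOD_VENDOR and
METHOD_UNKNOWN are both absent from the mask.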
> 
> > 
> > Raag
> > 
> > > >> ---
> > > >>  Documentation/gpu/drm-uapi.rst | 9 +++++----
> > > >>  drivers/gpu/drm/drm_drv.c      | 2 ++
> > > >>  include/drm/drm_device.h       | 4 ++++
> > > >>  3 files changed, 11 insertions(+), 4 deletions(-)
> > > >>
> > > >> diff --git a/Documentation/gpu/drm-uapi.rst b/Documentation/gpu/drm-uapi.rst
> > > >> index 263e5a97c080..c33070bdb347 100644
> > > >> --- a/Documentation/gpu/drm-uapi.rst
> > > >> +++ b/Documentation/gpu/drm-uapi.rst
> > > >> @@ -421,10 +421,10 @@ Recovery
> > > >>  Current implementation defines three recovery methods, out of which, drivers
> > > >>  can use any one, multiple or none. Method(s) of choice will be sent in the
> > > >>  uevent environment as ``WEDGED=<method1>[,..,<methodN>]`` in order of less to
> > > >> -more side-effects. If driver is unsure about recovery or method is unknown
> > > >> -(like soft/hard system reboot, firmware flashing, physical device replacement
> > > >> -or any other procedure which can't be attempted on the fly), ``WEDGED=unknown``
> > > >> -will be sent instead.
> > > >> +more side-effects. If recovery method is specific to vendor
> > > >> +``WEDGED=vendor-specific`` will be sent and userspace should refer to vendor
> > > >> +specific documentation for further recovery steps. If driver is unsure about
> > > >> +recovery or method is unknown, ``WEDGED=unknown`` will be sent instead
> > > >>  
> > > >>  Userspace consumers can parse this event and attempt recovery as per the
> > > >>  following expectations.
> > > >> @@ -435,6 +435,7 @@ following expectations.
> > > >>      none            optional telemetry collection
> > > >>      rebind          unbind + bind driver
> > > >>      bus-reset       unbind + bus reset/re-enumeration + bind
> > > >> +    vendor-specific vendor specific recovery method
> > > >>      unknown         consumer policy
> > > >>      =============== ========================================
> > > >>  
> > > >> diff --git a/drivers/gpu/drm/drm_drv.c b/drivers/gpu/drm/drm_drv.c
> > > >> index cdd591b11488..0ac723a46a91 100644
> > > >> --- a/drivers/gpu/drm/drm_drv.c
> > > >> +++ b/drivers/gpu/drm/drm_drv.c
> > > >> @@ -532,6 +532,8 @@ static const char *drm_get_wedge_recovery(unsigned int opt)
> > > >>  		return "rebind";
> > > >>  	case DRM_WEDGE_RECOVERY_BUS_RESET:
> > > >>  		return "bus-reset";
> > > >> +	case DRM_WEDGE_RECOVERY_VENDOR:
> > > >> +		return "vendor-specific";
> > > >>  	default:
> > > >>  		return NULL;
> > > >>  	}
> > > >> diff --git a/include/drm/drm_device.h b/include/drm/drm_device.h
> > > >> index 08b3b2467c4c..08a087f149ff 100644
> > > >> --- a/include/drm/drm_device.h
> > > >> +++ b/include/drm/drm_device.h
> > > >> @@ -26,10 +26,14 @@ struct pci_controller;
> > > >>   * Recovery methods for wedged device in order of less to more side-effects.
> > > >>   * To be used with drm_dev_wedged_event() as recovery @method. Callers can
> > > >>   * use any one, multiple (or'd) or none depending on their needs.
> > > >> + *
> > > >> + * Refer to "Device Wedging" chapter in Documentation/gpu/drm-uapi.rst for more
> > > >> + * details.
> > > >>   */
> > > >>  #define DRM_WEDGE_RECOVERY_NONE		BIT(0)	/* optional telemetry collection */
> > > >>  #define DRM_WEDGE_RECOVERY_REBIND	BIT(1)	/* unbind + bind driver */
> > > >>  #define DRM_WEDGE_RECOVERY_BUS_RESET	BIT(2)	/* unbind + reset bus device + bind */
> > > >> +#define DRM_WEDGE_RECOVERY_VENDOR	BIT(3)	/* vendor specific recovery method */
> > > >>  
> > > >>  /**
> > > >>   * struct drm_wedge_task_info - information about the guilty task of a wedge dev
> > > >> -- 
> > > >> 2.47.1
> > > >>
> > > > 
> > > 

-- 
Simona Vetter
Software Engineer, Intel Corporation
http://blog.ffwll.ch

^ permalink raw reply	[flat|nested] 48+ messages in thread

* Re: [PATCH v4 1/9] drm: Add a vendor-specific recovery method to device wedged uevent
  2025-07-10  9:01           ` Simona Vetter
@ 2025-07-10  9:37             ` Christian König
  2025-07-10 10:24               ` Raag Jadav
  2025-07-11  8:59               ` Simona Vetter
  0 siblings, 2 replies; 48+ messages in thread
From: Christian König @ 2025-07-10  9:37 UTC (permalink / raw)
  To: Simona Vetter, Rodrigo Vivi
  Cc: Raag Jadav, Riana Tauro, intel-xe, anshuman.gupta, lucas.demarchi,
	aravind.iddamsetty, umesh.nerlige.ramappa, frank.scarbrough,
	sk.anirban, André Almeida, David Airlie, dri-devel

On 10.07.25 11:01, Simona Vetter wrote:
> On Wed, Jul 09, 2025 at 12:52:05PM -0400, Rodrigo Vivi wrote:
>> On Wed, Jul 09, 2025 at 05:18:54PM +0300, Raag Jadav wrote:
>>> On Wed, Jul 09, 2025 at 04:09:20PM +0200, Christian König wrote:
>>>> On 09.07.25 15:41, Simona Vetter wrote:
>>>>> On Wed, Jul 09, 2025 at 04:50:13PM +0530, Riana Tauro wrote:
>>>>>> Certain errors can cause the device to be wedged and may
>>>>>> require a vendor specific recovery method to restore normal
>>>>>> operation.
>>>>>>
>>>>>> Add a recovery method 'WEDGED=vendor-specific' for such errors. Vendors
>>>>>> must provide additional recovery documentation if this method
>>>>>> is used.
>>>>>>
>>>>>> v2: fix documentation (Raag)
>>>>>>
>>>>>> Cc: André Almeida <andrealmeid@igalia.com>
>>>>>> Cc: Christian König <christian.koenig@amd.com>
>>>>>> Cc: David Airlie <airlied@gmail.com>
>>>>>> Cc: <dri-devel@lists.freedesktop.org>
>>>>>> Suggested-by: Raag Jadav <raag.jadav@intel.com>
>>>>>> Signed-off-by: Riana Tauro <riana.tauro@intel.com>
>>>>>
>>>>> I'm not really understanding what this is useful for, maybe concrete
>>>>> example in the form of driver code that uses this, and some tool or
>>>>> documentation steps that should be taken for recovery?
>>
>> The case here is when FW underneath identified something badly corrupted on
>> FW land and decided that only a firmware-flashing could solve the day and
>> raise interrupt to the driver. At that point we want to wedge, but immediately
>> hint the admin the recommended action.
>>
>>>>
>>>> The recovery method for this particular case is to flash in a new firmware.
>>>>
>>>>> The issues I'm seeing here is that eventually we'll get different
>>>>> vendor-specific recovery steps, and maybe even on the same device, and
>>>>> that leads us to an enumeration issue. Since it's just a string and an
>>>>> enum I think it'd be better to just allocate a new one every time there's
>>>>> a new strange recovery method instead of this opaque approach.
>>>>
>>>> That is exactly the opposite of what we discussed so far.
> 
> Sorry, I missed that context.
> 
>>>> The original idea was to add a firmware-flash recovery method which
>>>> looked a bit vague since it didn't give any information on what to do
>>>> exactly.
>>>>
>>>> That's why I suggested adding a more generic vendor-specific event
>>>> which refers to the documentation and system log to see what actually
>>>> needs to be done.
>>>>
>>>> Otherwise we would end up with events like firmware-flash, update FW
>>>> image A, update FW image B, FW version mismatch etc....
> 
> Yeah, that's kinda what I expect to happen, and we have enough numbers for
> this all to not be an issue.
> 
>>> Agree. Any newly allocated method that is specific to a vendor is going to
>>> be opaque anyway, since it can't be generic for all drivers. This just helps
>>> reduce the noise in DRM core.
>>>
>>> And yes, there could be different vendor-specific cases for the same driver
>>> and the driver should be able to provide the means to distinguish between
>>> them.
>>
>> Sim, what's your take on this then?
>>
>> Should we get back to the original idea of firmware-flash?
> 
> Maybe intel-firmware-flash or something, meaning prefix with the vendor?
> 
> The reason I think it should be specific is because I'm assuming you want
> to script this. And if you have a big fleet with different vendors, then
> "vendor-specific" doesn't tell you enough. But if it's something like
> $vendor-$magic_step then it does become scriptable, and we do have a
> place to put some documentation on what you should do instead.
> 
> If the point of this interface isn't that it's scriptable, then I'm not
> sure why it needs to be an uevent?

You should probably read up on the previous discussion, because that is exactly what I asked as well :)

And no, it should *not* be scripted. That would be a bit brave for a firmware update where you should absolutely not power down the system for example.

In my understanding the new value "vendor-specific" basically means it is a known issue with a documented solution, while "unknown" means the driver has no idea how to solve it.
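
To illustrate that distinction from the consumer side, here is a minimal sketch of how a userspace listener might map the ``WEDGED=`` value to a policy. It is not part of the series. The method strings come from the drm-uapi.rst table quoted below; the enum and function names are hypothetical, and treating "vendor-specific" as "notify the admin and point at the documentation" is an assumption about consumer policy, not something the kernel enforces.

```c
#include <stddef.h>
#include <string.h>

/* Hypothetical consumer-side policy; these names are not kernel API. */
enum wedge_action {
	WEDGE_ACT_NONE,        /* no recognised method in the value */
	WEDGE_ACT_AUTOMATIC,   /* none/rebind/bus-reset: can be attempted on the fly */
	WEDGE_ACT_NOTIFY_DOCS, /* vendor-specific: known issue, documented fix */
	WEDGE_ACT_NOTIFY_ONLY, /* unknown: left to consumer policy */
};

static enum wedge_action classify_method(const char *m)
{
	if (!strcmp(m, "none") || !strcmp(m, "rebind") || !strcmp(m, "bus-reset"))
		return WEDGE_ACT_AUTOMATIC;
	if (!strcmp(m, "vendor-specific"))
		return WEDGE_ACT_NOTIFY_DOCS;
	if (!strcmp(m, "unknown"))
		return WEDGE_ACT_NOTIFY_ONLY;
	return WEDGE_ACT_NONE;
}

/*
 * Walk the comma-separated WEDGED value. Methods are listed from less to
 * more side-effects, so the first recognised entry is the least invasive.
 */
static enum wedge_action wedged_action(const char *value)
{
	enum wedge_action act;
	char tok[32];
	size_t len;

	while (*value) {
		len = strcspn(value, ",");
		if (len < sizeof(tok)) {
			memcpy(tok, value, len);
			tok[len] = '\0';
			act = classify_method(tok);
			if (act != WEDGE_ACT_NONE)
				return act;
		}
		value += len;
		if (*value == ',')
			value++;
	}
	return WEDGE_ACT_NONE;
}
```

With this split, a firmware flash is never started automatically; the listener only surfaces the event so that an admin can follow the vendor documentation.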

Regards,
Christian.

> I guess if you all want to stick with vendor-specific then I think that's
> ok with me too, but the docs should at least explain how to figure out
> from the uevent which vendor you're on with a small example. What I'm
> worried about is that if we have this on multiple drivers, userspace will
> otherwise make a complete mess and might want to run the wrong recovery
> steps.
> 
> I think ideally, no matter what, we'd have a concrete driver patch which
> then also comes with the documentation for what exactly you're supposed to
> do as something you can script. And not just this stand-alone patch here.
> 
> Cheers, Sima
>>
>>>
>>> Raag
>>>
>>>>>> ---
>>>>>>  Documentation/gpu/drm-uapi.rst | 9 +++++----
>>>>>>  drivers/gpu/drm/drm_drv.c      | 2 ++
>>>>>>  include/drm/drm_device.h       | 4 ++++
>>>>>>  3 files changed, 11 insertions(+), 4 deletions(-)
>>>>>>
>>>>>> diff --git a/Documentation/gpu/drm-uapi.rst b/Documentation/gpu/drm-uapi.rst
>>>>>> index 263e5a97c080..c33070bdb347 100644
>>>>>> --- a/Documentation/gpu/drm-uapi.rst
>>>>>> +++ b/Documentation/gpu/drm-uapi.rst
>>>>>> @@ -421,10 +421,10 @@ Recovery
>>>>>>  Current implementation defines three recovery methods, out of which, drivers
>>>>>>  can use any one, multiple or none. Method(s) of choice will be sent in the
>>>>>>  uevent environment as ``WEDGED=<method1>[,..,<methodN>]`` in order of less to
>>>>>> -more side-effects. If driver is unsure about recovery or method is unknown
>>>>>> -(like soft/hard system reboot, firmware flashing, physical device replacement
>>>>>> -or any other procedure which can't be attempted on the fly), ``WEDGED=unknown``
>>>>>> -will be sent instead.
>>>>>> +more side-effects. If recovery method is specific to vendor
>>>>>> +``WEDGED=vendor-specific`` will be sent and userspace should refer to vendor
>>>>>> +specific documentation for further recovery steps. If driver is unsure about
>>>>>> +recovery or method is unknown, ``WEDGED=unknown`` will be sent instead
>>>>>>  
>>>>>>  Userspace consumers can parse this event and attempt recovery as per the
>>>>>>  following expectations.
>>>>>> @@ -435,6 +435,7 @@ following expectations.
>>>>>>      none            optional telemetry collection
>>>>>>      rebind          unbind + bind driver
>>>>>>      bus-reset       unbind + bus reset/re-enumeration + bind
>>>>>> +    vendor-specific vendor specific recovery method
>>>>>>      unknown         consumer policy
>>>>>>      =============== ========================================
>>>>>>  
>>>>>> diff --git a/drivers/gpu/drm/drm_drv.c b/drivers/gpu/drm/drm_drv.c
>>>>>> index cdd591b11488..0ac723a46a91 100644
>>>>>> --- a/drivers/gpu/drm/drm_drv.c
>>>>>> +++ b/drivers/gpu/drm/drm_drv.c
>>>>>> @@ -532,6 +532,8 @@ static const char *drm_get_wedge_recovery(unsigned int opt)
>>>>>>  		return "rebind";
>>>>>>  	case DRM_WEDGE_RECOVERY_BUS_RESET:
>>>>>>  		return "bus-reset";
>>>>>> +	case DRM_WEDGE_RECOVERY_VENDOR:
>>>>>> +		return "vendor-specific";
>>>>>>  	default:
>>>>>>  		return NULL;
>>>>>>  	}
>>>>>> diff --git a/include/drm/drm_device.h b/include/drm/drm_device.h
>>>>>> index 08b3b2467c4c..08a087f149ff 100644
>>>>>> --- a/include/drm/drm_device.h
>>>>>> +++ b/include/drm/drm_device.h
>>>>>> @@ -26,10 +26,14 @@ struct pci_controller;
>>>>>>   * Recovery methods for wedged device in order of less to more side-effects.
>>>>>>   * To be used with drm_dev_wedged_event() as recovery @method. Callers can
>>>>>>   * use any one, multiple (or'd) or none depending on their needs.
>>>>>> + *
>>>>>> + * Refer to "Device Wedging" chapter in Documentation/gpu/drm-uapi.rst for more
>>>>>> + * details.
>>>>>>   */
>>>>>>  #define DRM_WEDGE_RECOVERY_NONE		BIT(0)	/* optional telemetry collection */
>>>>>>  #define DRM_WEDGE_RECOVERY_REBIND	BIT(1)	/* unbind + bind driver */
>>>>>>  #define DRM_WEDGE_RECOVERY_BUS_RESET	BIT(2)	/* unbind + reset bus device + bind */
>>>>>> +#define DRM_WEDGE_RECOVERY_VENDOR	BIT(3)	/* vendor specific recovery method */
>>>>>>  
>>>>>>  /**
>>>>>>   * struct drm_wedge_task_info - information about the guilty task of a wedge dev
>>>>>> -- 
>>>>>> 2.47.1
>>>>>>
>>>>>
>>>>
> 


^ permalink raw reply	[flat|nested] 48+ messages in thread

* Re: [PATCH v4 1/9] drm: Add a vendor-specific recovery method to device wedged uevent
  2025-07-10  9:37             ` Christian König
@ 2025-07-10 10:24               ` Raag Jadav
  2025-07-10 19:00                 ` Rodrigo Vivi
  2025-07-11  8:59               ` Simona Vetter
  1 sibling, 1 reply; 48+ messages in thread
From: Raag Jadav @ 2025-07-10 10:24 UTC (permalink / raw)
  To: Christian König
  Cc: Simona Vetter, Rodrigo Vivi, Riana Tauro, intel-xe,
	anshuman.gupta, lucas.demarchi, aravind.iddamsetty,
	umesh.nerlige.ramappa, frank.scarbrough, sk.anirban,
	André Almeida, David Airlie, dri-devel

On Thu, Jul 10, 2025 at 11:37:14AM +0200, Christian König wrote:
> On 10.07.25 11:01, Simona Vetter wrote:
> > On Wed, Jul 09, 2025 at 12:52:05PM -0400, Rodrigo Vivi wrote:
> >> On Wed, Jul 09, 2025 at 05:18:54PM +0300, Raag Jadav wrote:
> >>> On Wed, Jul 09, 2025 at 04:09:20PM +0200, Christian König wrote:
> >>>> On 09.07.25 15:41, Simona Vetter wrote:
> >>>>> On Wed, Jul 09, 2025 at 04:50:13PM +0530, Riana Tauro wrote:
> >>>>>> Certain errors can cause the device to be wedged and may
> >>>>>> require a vendor specific recovery method to restore normal
> >>>>>> operation.
> >>>>>>
> >>>>>> Add a recovery method 'WEDGED=vendor-specific' for such errors. Vendors
> >>>>>> must provide additional recovery documentation if this method
> >>>>>> is used.
> >>>>>>
> >>>>>> v2: fix documentation (Raag)
> >>>>>>
> >>>>>> Cc: André Almeida <andrealmeid@igalia.com>
> >>>>>> Cc: Christian König <christian.koenig@amd.com>
> >>>>>> Cc: David Airlie <airlied@gmail.com>
> >>>>>> Cc: <dri-devel@lists.freedesktop.org>
> >>>>>> Suggested-by: Raag Jadav <raag.jadav@intel.com>
> >>>>>> Signed-off-by: Riana Tauro <riana.tauro@intel.com>
> >>>>>
> >>>>> I'm not really understanding what this is useful for, maybe concrete
> >>>>> example in the form of driver code that uses this, and some tool or
> >>>>> documentation steps that should be taken for recovery?
> >>
> >> The case here is when FW underneath identified something badly corrupted on
> >> FW land and decided that only a firmware-flashing could solve the day and
> >> raise interrupt to the driver. At that point we want to wedge, but immediately
> >> hint the admin the recommended action.
> >>
> >>>>
> >>>> The recovery method for this particular case is to flash in a new firmware.
> >>>>
> >>>>> The issues I'm seeing here is that eventually we'll get different
> >>>>> vendor-specific recovery steps, and maybe even on the same device, and
> >>>>> that leads us to an enumeration issue. Since it's just a string and an
> >>>>> enum I think it'd be better to just allocate a new one every time there's
> >>>>> a new strange recovery method instead of this opaque approach.
> >>>>
> >>>> That is exactly the opposite of what we discussed so far.
> > 
> > Sorry, I missed that context.
> > 
> >>>> The original idea was to add a firmware-flash recovery method which
> >>>> looked a bit vague since it didn't give any information on what to do
> >>>> exactly.
> >>>>
> >>>> That's why I suggested adding a more generic vendor-specific event
> >>>> which refers to the documentation and system log to see what actually
> >>>> needs to be done.
> >>>>
> >>>> Otherwise we would end up with events like firmware-flash, update FW
> >>>> image A, update FW image B, FW version mismatch etc....
> > 
> > Yeah, that's kinda what I expect to happen, and we have enough numbers for
> > this all to not be an issue.
> > 
> >>> Agree. Any newly allocated method that is specific to a vendor is going to
> >>> be opaque anyway, since it can't be generic for all drivers. This just helps
> >>> reduce the noise in DRM core.
> >>>
> >>> And yes, there could be different vendor-specific cases for the same driver
> >>> and the driver should be able to provide the means to distinguish between
> >>> them.
> >>
> >> Sim, what's your take on this then?
> >>
> >> Should we get back to the original idea of firmware-flash?
> > 
> > Maybe intel-firmware-flash or something, meaning prefix with the vendor?
> > 
> > The reason I think it should be specific is because I'm assuming you want
> > to script this. And if you have a big fleet with different vendors, then
> > "vendor-specific" doesn't tell you enough. But if it's something like
> > $vendor-$magic_step then it does become scriptable, and we do have a
> > place to put some documentation on what you should do instead.
> > 
> > If the point of this interface isn't that it's scriptable, then I'm not
> > sure why it needs to be an uevent?
> 
> You should probably read up on the previous discussion, cause that is exactly what I asked as well :)
> 
> And no, it should *not* be scripted. That would be a bit brave for a firmware update where you should absolutely not power down the system for example.
> 
> In my understanding the new value "vendor-specific" basically means it is a known issue with a documented solution, while "unknown" means the driver has no idea how to solve it.

Yes, and since the recovery procedure is defined and known to the consumer,
it can potentially be automated (at least for non-firmware cases).

> > I guess if you all want to stick with vendor-specific then I think that's
> > ok with me too, but the docs should at least explain how to figure out
> > from the uevent which vendor you're on with a small example. What I'm
> > worried about is that if we have this on multiple drivers, userspace will
> > otherwise make a complete mess and might want to run the wrong recovery
> > steps.

The device id along with driver can be identified from uevent (probably
available inside DEVPATH somewhere) to distinguish the vendor. So the consumer
already knows if the device fits the criteria for recovery.
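
As a rough sketch of that lookup, assuming a PCI device: the DRM minor's sysfs directory has a ``device`` link to the parent PCI device, which exposes a ``vendor`` attribute (for example ``0x8086``). The helper name and the ``sysfs_root`` parameter below are made up for illustration; the root would normally be "/sys" and is passed in only to keep the function easy to exercise.

```c
#include <stddef.h>
#include <stdio.h>

/*
 * Read the PCI vendor ID of the device behind a DRM wedged uevent.
 * devpath is the DEVPATH value from the uevent, e.g. ".../drm/card0".
 * Returns the vendor ID, or -1 if it cannot be read.
 * Hypothetical helper, not an existing library call.
 */
static long drm_uevent_pci_vendor(const char *sysfs_root, const char *devpath)
{
	char path[512];
	unsigned int vendor;
	FILE *f;
	int n;

	n = snprintf(path, sizeof(path), "%s%s/device/vendor", sysfs_root, devpath);
	if (n < 0 || (size_t)n >= sizeof(path))
		return -1;

	f = fopen(path, "r");
	if (!f)
		return -1;

	/* The attribute is a hex string with a 0x prefix, which %x accepts. */
	n = fscanf(f, "%x", &vendor);
	fclose(f);

	return n == 1 ? (long)vendor : -1;
}
```

A consumer could compare the result against the vendor IDs it has documented recovery steps for before acting on ``WEDGED=vendor-specific``. The bound driver name can be resolved in a similar way from the ``device/driver`` link.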

> > I think ideally, no matter what, we'd have a concrete driver patch which
> > then also comes with the documentation for what exactly you're supposed to
> > do as something you can script. And not just this stand-alone patch here.

Perhaps the rest of the series didn't make it to dri-devel, which will answer
most of the above.

Raag

> >>>>>> ---
> >>>>>>  Documentation/gpu/drm-uapi.rst | 9 +++++----
> >>>>>>  drivers/gpu/drm/drm_drv.c      | 2 ++
> >>>>>>  include/drm/drm_device.h       | 4 ++++
> >>>>>>  3 files changed, 11 insertions(+), 4 deletions(-)
> >>>>>>
> >>>>>> diff --git a/Documentation/gpu/drm-uapi.rst b/Documentation/gpu/drm-uapi.rst
> >>>>>> index 263e5a97c080..c33070bdb347 100644
> >>>>>> --- a/Documentation/gpu/drm-uapi.rst
> >>>>>> +++ b/Documentation/gpu/drm-uapi.rst
> >>>>>> @@ -421,10 +421,10 @@ Recovery
> >>>>>>  Current implementation defines three recovery methods, out of which, drivers
> >>>>>>  can use any one, multiple or none. Method(s) of choice will be sent in the
> >>>>>>  uevent environment as ``WEDGED=<method1>[,..,<methodN>]`` in order of less to
> >>>>>> -more side-effects. If driver is unsure about recovery or method is unknown
> >>>>>> -(like soft/hard system reboot, firmware flashing, physical device replacement
> >>>>>> -or any other procedure which can't be attempted on the fly), ``WEDGED=unknown``
> >>>>>> -will be sent instead.
> >>>>>> +more side-effects. If recovery method is specific to vendor
> >>>>>> +``WEDGED=vendor-specific`` will be sent and userspace should refer to vendor
> >>>>>> +specific documentation for further recovery steps. If driver is unsure about
> >>>>>> +recovery or method is unknown, ``WEDGED=unknown`` will be sent instead
> >>>>>>  
> >>>>>>  Userspace consumers can parse this event and attempt recovery as per the
> >>>>>>  following expectations.
> >>>>>> @@ -435,6 +435,7 @@ following expectations.
> >>>>>>      none            optional telemetry collection
> >>>>>>      rebind          unbind + bind driver
> >>>>>>      bus-reset       unbind + bus reset/re-enumeration + bind
> >>>>>> +    vendor-specific vendor specific recovery method
> >>>>>>      unknown         consumer policy
> >>>>>>      =============== ========================================
> >>>>>>  
> >>>>>> diff --git a/drivers/gpu/drm/drm_drv.c b/drivers/gpu/drm/drm_drv.c
> >>>>>> index cdd591b11488..0ac723a46a91 100644
> >>>>>> --- a/drivers/gpu/drm/drm_drv.c
> >>>>>> +++ b/drivers/gpu/drm/drm_drv.c
> >>>>>> @@ -532,6 +532,8 @@ static const char *drm_get_wedge_recovery(unsigned int opt)
> >>>>>>  		return "rebind";
> >>>>>>  	case DRM_WEDGE_RECOVERY_BUS_RESET:
> >>>>>>  		return "bus-reset";
> >>>>>> +	case DRM_WEDGE_RECOVERY_VENDOR:
> >>>>>> +		return "vendor-specific";
> >>>>>>  	default:
> >>>>>>  		return NULL;
> >>>>>>  	}
> >>>>>> diff --git a/include/drm/drm_device.h b/include/drm/drm_device.h
> >>>>>> index 08b3b2467c4c..08a087f149ff 100644
> >>>>>> --- a/include/drm/drm_device.h
> >>>>>> +++ b/include/drm/drm_device.h
> >>>>>> @@ -26,10 +26,14 @@ struct pci_controller;
> >>>>>>   * Recovery methods for wedged device in order of less to more side-effects.
> >>>>>>   * To be used with drm_dev_wedged_event() as recovery @method. Callers can
> >>>>>>   * use any one, multiple (or'd) or none depending on their needs.
> >>>>>> + *
> >>>>>> + * Refer to "Device Wedging" chapter in Documentation/gpu/drm-uapi.rst for more
> >>>>>> + * details.
> >>>>>>   */
> >>>>>>  #define DRM_WEDGE_RECOVERY_NONE		BIT(0)	/* optional telemetry collection */
> >>>>>>  #define DRM_WEDGE_RECOVERY_REBIND	BIT(1)	/* unbind + bind driver */
> >>>>>>  #define DRM_WEDGE_RECOVERY_BUS_RESET	BIT(2)	/* unbind + reset bus device + bind */
> >>>>>> +#define DRM_WEDGE_RECOVERY_VENDOR	BIT(3)	/* vendor specific recovery method */
> >>>>>>  
> >>>>>>  /**
> >>>>>>   * struct drm_wedge_task_info - information about the guilty task of a wedge dev
> >>>>>> -- 
> >>>>>> 2.47.1
> >>>>>>
> >>>>>
> >>>>
> > 
> 

^ permalink raw reply	[flat|nested] 48+ messages in thread

* Re: [PATCH v4 5/9] drm/xe/xe_survivability: Add support for Runtime survivability mode
  2025-07-10  5:59     ` Riana Tauro
@ 2025-07-10 17:12       ` Umesh Nerlige Ramappa
  2025-07-11  5:23         ` Riana Tauro
  0 siblings, 1 reply; 48+ messages in thread
From: Umesh Nerlige Ramappa @ 2025-07-10 17:12 UTC (permalink / raw)
  To: Riana Tauro
  Cc: intel-xe, anshuman.gupta, rodrigo.vivi, lucas.demarchi,
	aravind.iddamsetty, raag.jadav, frank.scarbrough, sk.anirban

On Thu, Jul 10, 2025 at 11:29:44AM +0530, Riana Tauro wrote:
>Hi Umesh
>
>On 7/10/2025 5:14 AM, Umesh Nerlige Ramappa wrote:
>>On Wed, Jul 09, 2025 at 04:50:17PM +0530, Riana Tauro wrote:
>>>Certain runtime firmware errors can cause the device to be in a unusable
>>>state requiring a firmware flash to restore normal operation.
>>>Runtime Survivability Mode indicates firmware flash is necessary by
>>>wedging the device and exposing survivability mode sysfs.
>>>
>>>The below sysfs is an indication that device is in survivability mode
>>>
>>>/sys/bus/pci/devices/<device>/survivability_mode
>>>
>>>Signed-off-by: Riana Tauro <riana.tauro@intel.com>
>>>---
>>>drivers/gpu/drm/xe/xe_survivability_mode.c    | 42 ++++++++++++++++++-
>>>drivers/gpu/drm/xe/xe_survivability_mode.h    |  1 +
>>>.../gpu/drm/xe/xe_survivability_mode_types.h  |  1 +
>>>3 files changed, 43 insertions(+), 1 deletion(-)
>>>
>>>diff --git a/drivers/gpu/drm/xe/xe_survivability_mode.c 
>>>b/drivers/gpu/ drm/xe/xe_survivability_mode.c
>>>index fefb027b1c84..ca1cfa13525a 100644
>>>--- a/drivers/gpu/drm/xe/xe_survivability_mode.c
>>>+++ b/drivers/gpu/drm/xe/xe_survivability_mode.c
>>>@@ -137,7 +137,8 @@ static ssize_t survivability_mode_show(struct 
>>>device *dev,
>>>    struct xe_survivability_info *info = survivability->info;
>>>    int index = 0, count = 0;
>>>
>>>-    count += sysfs_emit_at(buff, count, "Survivability mode type: 
>>>Boot\n");
>>>+    count += sysfs_emit_at(buff, count, "Survivability mode type: %s\n",
>>>+                   survivability->type ? "Runtime" : "Boot");
>>>
>>>    if (!check_boot_failure(xe))
>>>        return count;
>>>@@ -288,6 +289,45 @@ bool 
>>>xe_survivability_mode_is_requested(struct xe_device *xe)
>>>    return check_boot_failure(xe);
>>>}
>>>
>>>+/**
>>>+ * xe_survivability_mode_runtime_enable - Initialize and enable 
>>>runtime survivability mode
>>>+ * @xe: xe device instance
>>>+ *
>>>+ * Initialize survivability information and enable runtime 
>>>survivability mode.
>>>+ * Runtime survivability mode is enabled when certain errors 
>>>cause the device to be
>>>+ * in non-recoverable state. The device is declared wedged with 
>>>the appropriate
>>>+ * recovery method and survivability mode sysfs exposed to userspace
>>>+ *
>>>+ * Return: 0 if runtime survivability mode is enabled or not 
>>>requested, negative error
>>
>>is the "not requested" still applicable here?
>
>Copied it from boot survivability. Not applicable, will remove this
>
>>
>>
>>>+ * code otherwise.
>>>+ */
>>>+int xe_survivability_mode_runtime_enable(struct xe_device *xe)
>>>+{
>>>+    struct xe_survivability *survivability = &xe->survivability;
>>>+    struct pci_dev *pdev = to_pci_dev(xe->drm.dev);
>>>+    int ret;
>>>+
>>>+    if (!IS_DGFX(xe) || IS_SRIOV_VF(xe) || xe->info.platform < 
>>>XE_BATTLEMAGE) {
>>
>>Do you think this condition can be better handled with a 
>>has_runtime_survivability for platforms that support it?
>
>Was used once so added it here. Can be split out to a different function

Oh, not a different function. I mean a has_* property. More like entries 
defined in xe_pci_types.h under struct xe_graphics_desc.
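
Something along these lines, as a self-contained sketch only. The structs below are simplified stand-ins rather than the real xe types, and the flag name has_runtime_survivability is just the suggestion from above, not an existing field.

```c
#include <errno.h>
#include <stdbool.h>

/* Simplified stand-in for a per-platform descriptor. */
struct example_desc {
	const char *name;
	bool has_runtime_survivability; /* hypothetical capability flag */
};

/* Simplified stand-in for the device instance. */
struct example_device {
	const struct example_desc *desc;
	bool is_sriov_vf;
};

static const struct example_desc desc_supported = {
	.name = "supported-platform",
	.has_runtime_survivability = true,
};

static const struct example_desc desc_unsupported = {
	.name = "unsupported-platform",
};

/*
 * The capability check replaces an open-coded platform comparison:
 * new platforms opt in by setting the flag in their descriptor.
 */
static int example_runtime_enable(const struct example_device *dev)
{
	if (dev->is_sriov_vf || !dev->desc->has_runtime_survivability)
		return -EINVAL;

	return 0;
}
```

That keeps the platform knowledge in the descriptor table instead of in the survivability code.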

Regards,
Umesh

>>
>>>+        dev_err(&pdev->dev, "Runtime Survivability Mode not 
>>>supported\n");
>>>+        return -EINVAL;
>>>+    }
>>>+
>>>+    ret = init_survivability_mode(xe);
>>>+    if (ret)
>>>+        return ret;
>>>+
>>>+    ret = create_survivability_sysfs(pdev);
>>>+    if (ret)
>>>+        dev_err(&pdev->dev, "Failed to create survivability mode 
>>>sysfs\n");
>>
>>You do not return ret in the above if condition. Is that intentional?
>
>yeah this is intentional. The device has to be wedged since it is not 
>usable on such errors even without the sysfs.
>
>Thanks
>Riana
>
>>
>>Regards,
>>Umesh
>>
>>>+
>>>+    survivability->type = XE_SURVIVABILITY_TYPE_RUNTIME;
>>>+    xe_device_set_wedged_method(xe, DRM_WEDGE_RECOVERY_VENDOR);
>>>+    xe_device_declare_wedged(xe);
>>>+
>>>+    dev_err(&pdev->dev, "Runtime Survivability mode enabled\n");
>>>+    return 0;
>>>+}
>>>+
>>>/**
>>> * xe_survivability_mode_boot_enable - Initialize and enable boot 
>>>survivability mode
>>> * @xe: xe device instance
>>>diff --git a/drivers/gpu/drm/xe/xe_survivability_mode.h 
>>>b/drivers/gpu/ drm/xe/xe_survivability_mode.h
>>>index f6ee283ea5e8..1cc94226aa82 100644
>>>--- a/drivers/gpu/drm/xe/xe_survivability_mode.h
>>>+++ b/drivers/gpu/drm/xe/xe_survivability_mode.h
>>>@@ -11,6 +11,7 @@
>>>struct xe_device;
>>>
>>>int xe_survivability_mode_boot_enable(struct xe_device *xe);
>>>+int xe_survivability_mode_runtime_enable(struct xe_device *xe);
>>>bool xe_survivability_mode_is_boot_enabled(struct xe_device *xe);
>>>bool xe_survivability_mode_is_requested(struct xe_device *xe);
>>>
>>>diff --git a/drivers/gpu/drm/xe/xe_survivability_mode_types.h b/ 
>>>drivers/gpu/drm/xe/xe_survivability_mode_types.h
>>>index 5dce393498da..cd65a5d167c9 100644
>>>--- a/drivers/gpu/drm/xe/xe_survivability_mode_types.h
>>>+++ b/drivers/gpu/drm/xe/xe_survivability_mode_types.h
>>>@@ -11,6 +11,7 @@
>>>
>>>enum xe_survivability_type {
>>>    XE_SURVIVABILITY_TYPE_BOOT,
>>>+    XE_SURVIVABILITY_TYPE_RUNTIME,
>>>};
>>>
>>>struct xe_survivability_info {
>>>-- 
>>>2.47.1
>>>
>
>

^ permalink raw reply	[flat|nested] 48+ messages in thread

* Re: [PATCH v4 1/9] drm: Add a vendor-specific recovery method to device wedged uevent
  2025-07-10 10:24               ` Raag Jadav
@ 2025-07-10 19:00                 ` Rodrigo Vivi
  2025-07-10 21:46                   ` Raag Jadav
  2025-07-11  8:56                   ` Simona Vetter
  0 siblings, 2 replies; 48+ messages in thread
From: Rodrigo Vivi @ 2025-07-10 19:00 UTC (permalink / raw)
  To: Raag Jadav
  Cc: Christian König, Simona Vetter, Riana Tauro, intel-xe,
	anshuman.gupta, lucas.demarchi, aravind.iddamsetty,
	umesh.nerlige.ramappa, frank.scarbrough, sk.anirban,
	André Almeida, David Airlie, dri-devel

On Thu, Jul 10, 2025 at 01:24:52PM +0300, Raag Jadav wrote:
> On Thu, Jul 10, 2025 at 11:37:14AM +0200, Christian König wrote:
> > On 10.07.25 11:01, Simona Vetter wrote:
> > > On Wed, Jul 09, 2025 at 12:52:05PM -0400, Rodrigo Vivi wrote:
> > >> On Wed, Jul 09, 2025 at 05:18:54PM +0300, Raag Jadav wrote:
> > >>> On Wed, Jul 09, 2025 at 04:09:20PM +0200, Christian König wrote:
> > >>>> On 09.07.25 15:41, Simona Vetter wrote:
> > >>>>> On Wed, Jul 09, 2025 at 04:50:13PM +0530, Riana Tauro wrote:
> > >>>>>> Certain errors can cause the device to be wedged and may
> > >>>>>> require a vendor specific recovery method to restore normal
> > >>>>>> operation.
> > >>>>>>
> > >>>>>> Add a recovery method 'WEDGED=vendor-specific' for such errors. Vendors
> > >>>>>> must provide additional recovery documentation if this method
> > >>>>>> is used.
> > >>>>>>
> > >>>>>> v2: fix documentation (Raag)
> > >>>>>>
> > >>>>>> Cc: André Almeida <andrealmeid@igalia.com>
> > >>>>>> Cc: Christian König <christian.koenig@amd.com>
> > >>>>>> Cc: David Airlie <airlied@gmail.com>
> > >>>>>> Cc: <dri-devel@lists.freedesktop.org>
> > >>>>>> Suggested-by: Raag Jadav <raag.jadav@intel.com>
> > >>>>>> Signed-off-by: Riana Tauro <riana.tauro@intel.com>
> > >>>>>
> > >>>>> I'm not really understanding what this is useful for, maybe concrete
> > >>>>> example in the form of driver code that uses this, and some tool or
> > >>>>> documentation steps that should be taken for recovery?
> > >>
> > >> The case here is when FW underneath identified something badly corrupted on
> > >> FW land and decided that only a firmware-flashing could solve the day and
> > >> raise interrupt to the driver. At that point we want to wedge, but immediately
> > >> hint the admin the recommended action.
> > >>
> > >>>>
> > >>>> The recovery method for this particular case is to flash in a new firmware.
> > >>>>
> > >>>>> The issues I'm seeing here is that eventually we'll get different
> > >>>>> vendor-specific recovery steps, and maybe even on the same device, and
> > >>>>> that leads us to an enumeration issue. Since it's just a string and an
> > >>>>> enum I think it'd be better to just allocate a new one every time there's
> > >>>>> a new strange recovery method instead of this opaque approach.
> > >>>>
> > >>>> That is exactly the opposite of what we discussed so far.
> > > 
> > > Sorry, I missed that context.
> > > 
> > >>>> The original idea was to add a firmware-flash recovery method which
> > >>>> looked a bit vague since it didn't give any information on what to do
> > >>>> exactly.
> > >>>>
> > >>>> That's why I suggested adding a more generic vendor-specific event
> > >>>> which refers to the documentation and system log to see what actually
> > >>>> needs to be done.
> > >>>>
> > >>>> Otherwise we would end up with events like firmware-flash, update FW
> > >>>> image A, update FW image B, FW version mismatch etc....
> > > 
> > > Yeah, that's kinda what I expect to happen, and we have enough numbers for
> > > this all to not be an issue.
> > > 
> > >>> Agree. Any newly allocated method that is specific to a vendor is going to
> > >>> be opaque anyway, since it can't be generic for all drivers. This just helps
> > >>> reduce the noise in DRM core.
> > >>>
> > >>> And yes, there could be different vendor-specific cases for the same driver
> > >>> and the driver should be able to provide the means to distinguish between
> > >>> them.
> > >>
> > >> Sim, what's your take on this then?
> > >>
> > >> Should we get back to the original idea of firmware-flash?
> > > 
> > > Maybe intel-firmware-flash or something, meaning prefix with the vendor?
> > > 
> > > The reason I think it should be specific is because I'm assuming you want
> > > to script this. And if you have a big fleet with different vendors, then
> > > "vendor-specific" doesn't tell you enough. But if it's something like
> > > $vendor-$magic_step then it does become scriptable, and we do have a
> > > place to put some documentation on what you should do instead.
> > > 
> > > If the point of this interface isn't that it's scriptable, then I'm not
> > > sure why it needs to be an uevent?
> > 
> > You should probably read up on the previous discussion, cause that is exactly what I asked as well :)
> > 
> > And no, it should *not* be scripted. That would be a bit brave for a firmware update where you should absolutely not power down the system for example.

I also don't like the idea, or even the thought, of scripting something like
a firmware-flash. The point is only to fail with a more precise pointer, so a
notification can make admins' lives easier.

> > 
> > In my understanding the new value "vendor-specific" basically means it is a known issue with a documented solution, while "unknown" means the driver has no idea how to solve it.

Exactly, the hardware and firmware are giving the indication of what should be
done. It is not 'unknown'.

> 
> Yes, and since the recovery procedure is defined and known to the consumer,
> it can potentially be automated (at least for non-firmware cases).
> 
> > > I guess if you all want to stick with vendor-specific then I think that's

Well, I would honestly prefer a direct firmware-flash, but if that is not
usable by other vendors and there's a push back on that, let's go with
the vendor-specific then.

> > > ok with me too, but the docs should at least explain how to figure out
> > > from the uevent which vendor you're on with a small example. What I'm
> > > worried about is that if we have this on multiple drivers userspace will
> > > otherwise make a complete mess and might want to run the wrong recovery
> > > steps.
> 
> The device id along with driver can be identified from uevent (probably
> available inside DEVPATH somewhere) to distinguish the vendor. So the consumer
> already knows if the device fits the criteria for recovery.
> 
> > > I think ideally, no matter what, we'd have a concrete driver patch which
> > > then also comes with the documentation for what exactly you're supposed to
> > > do as something you can script. And not just this stand-alone patch here.
> 
> Perhaps the rest of the series didn't make it to dri-devel, which will answer
> most of the above.

Riana, could you please try to provide a bit more documentation like Sima
asked and re-send the entire series to dri-devel?

Thanks,
Rodrigo.

> 
> Raag
> 
> > >>>>>> ---
> > >>>>>>  Documentation/gpu/drm-uapi.rst | 9 +++++----
> > >>>>>>  drivers/gpu/drm/drm_drv.c      | 2 ++
> > >>>>>>  include/drm/drm_device.h       | 4 ++++
> > >>>>>>  3 files changed, 11 insertions(+), 4 deletions(-)
> > >>>>>>
> > >>>>>> diff --git a/Documentation/gpu/drm-uapi.rst b/Documentation/gpu/drm-uapi.rst
> > >>>>>> index 263e5a97c080..c33070bdb347 100644
> > >>>>>> --- a/Documentation/gpu/drm-uapi.rst
> > >>>>>> +++ b/Documentation/gpu/drm-uapi.rst
> > >>>>>> @@ -421,10 +421,10 @@ Recovery
> > >>>>>>  Current implementation defines three recovery methods, out of which, drivers
> > >>>>>>  can use any one, multiple or none. Method(s) of choice will be sent in the
> > >>>>>>  uevent environment as ``WEDGED=<method1>[,..,<methodN>]`` in order of less to
> > >>>>>> -more side-effects. If driver is unsure about recovery or method is unknown
> > >>>>>> -(like soft/hard system reboot, firmware flashing, physical device replacement
> > >>>>>> -or any other procedure which can't be attempted on the fly), ``WEDGED=unknown``
> > >>>>>> -will be sent instead.
> > >>>>>> +more side-effects. If recovery method is specific to vendor
> > >>>>>> +``WEDGED=vendor-specific`` will be sent and userspace should refer to vendor
> > >>>>>> +specific documentation for further recovery steps. If driver is unsure about
> > >>>>>> +recovery or method is unknown, ``WEDGED=unknown`` will be sent instead
> > >>>>>>  
> > >>>>>>  Userspace consumers can parse this event and attempt recovery as per the
> > >>>>>>  following expectations.
> > >>>>>> @@ -435,6 +435,7 @@ following expectations.
> > >>>>>>      none            optional telemetry collection
> > >>>>>>      rebind          unbind + bind driver
> > >>>>>>      bus-reset       unbind + bus reset/re-enumeration + bind
> > >>>>>> +    vendor-specific vendor specific recovery method
> > >>>>>>      unknown         consumer policy
> > >>>>>>      =============== ========================================
> > >>>>>>  
> > >>>>>> diff --git a/drivers/gpu/drm/drm_drv.c b/drivers/gpu/drm/drm_drv.c
> > >>>>>> index cdd591b11488..0ac723a46a91 100644
> > >>>>>> --- a/drivers/gpu/drm/drm_drv.c
> > >>>>>> +++ b/drivers/gpu/drm/drm_drv.c
> > >>>>>> @@ -532,6 +532,8 @@ static const char *drm_get_wedge_recovery(unsigned int opt)
> > >>>>>>  		return "rebind";
> > >>>>>>  	case DRM_WEDGE_RECOVERY_BUS_RESET:
> > >>>>>>  		return "bus-reset";
> > >>>>>> +	case DRM_WEDGE_RECOVERY_VENDOR:
> > >>>>>> +		return "vendor-specific";
> > >>>>>>  	default:
> > >>>>>>  		return NULL;
> > >>>>>>  	}
> > >>>>>> diff --git a/include/drm/drm_device.h b/include/drm/drm_device.h
> > >>>>>> index 08b3b2467c4c..08a087f149ff 100644
> > >>>>>> --- a/include/drm/drm_device.h
> > >>>>>> +++ b/include/drm/drm_device.h
> > >>>>>> @@ -26,10 +26,14 @@ struct pci_controller;
> > >>>>>>   * Recovery methods for wedged device in order of less to more side-effects.
> > >>>>>>   * To be used with drm_dev_wedged_event() as recovery @method. Callers can
> > >>>>>>   * use any one, multiple (or'd) or none depending on their needs.
> > >>>>>> + *
> > >>>>>> + * Refer to "Device Wedging" chapter in Documentation/gpu/drm-uapi.rst for more
> > >>>>>> + * details.
> > >>>>>>   */
> > >>>>>>  #define DRM_WEDGE_RECOVERY_NONE		BIT(0)	/* optional telemetry collection */
> > >>>>>>  #define DRM_WEDGE_RECOVERY_REBIND	BIT(1)	/* unbind + bind driver */
> > >>>>>>  #define DRM_WEDGE_RECOVERY_BUS_RESET	BIT(2)	/* unbind + reset bus device + bind */
> > >>>>>> +#define DRM_WEDGE_RECOVERY_VENDOR	BIT(3)	/* vendor specific recovery method */
> > >>>>>>  
> > >>>>>>  /**
> > >>>>>>   * struct drm_wedge_task_info - information about the guilty task of a wedge dev
> > >>>>>> -- 
> > >>>>>> 2.47.1
> > >>>>>>
> > >>>>>
> > >>>>
> > > 
> > 


* Re: [PATCH v4 7/9] drm/xe: Add support to handle hardware errors
  2025-07-09 11:20 ` [PATCH v4 7/9] drm/xe: Add support to handle hardware errors Riana Tauro
@ 2025-07-10 21:09   ` Umesh Nerlige Ramappa
  2025-07-11  5:35     ` Riana Tauro
  0 siblings, 1 reply; 48+ messages in thread
From: Umesh Nerlige Ramappa @ 2025-07-10 21:09 UTC (permalink / raw)
  To: Riana Tauro
  Cc: intel-xe, anshuman.gupta, rodrigo.vivi, lucas.demarchi,
	aravind.iddamsetty, raag.jadav, frank.scarbrough, sk.anirban,
	Himal Prasad Ghimiray

Resending since it got lost earlier...

On Wed, Jul 09, 2025 at 04:50:19PM +0530, Riana Tauro wrote:
>Gfx device reports two classes of errors: uncorrectable and
>correctable. Depending on the severity, uncorrectable errors are
>further classified as non-fatal and fatal.
>
>Correctable and non-fatal errors are reported as MSIs and bits in
>the Master Interrupt Register indicate the class of the error.
>The source of the error is then read from the Device Error Source
>Register.

nit: Since Fatal is a separate category, maybe a split here into a 
separate paragraph and some formatting would be good.

>Fatal errors are reported as PCIe errors.
>When a PCIe error is asserted, the OS will perform a device warm reset
>which causes the driver to reload. The error registers are sticky
>and the values are maintained through a warm reset.
>
>Add basic support to handle these errors.
>
>Bspec: 50875, 53073, 53074, 53075, 53076
>
>Co-developed-by: Himal Prasad Ghimiray <himal.prasad.ghimiray@intel.com>
>Signed-off-by: Himal Prasad Ghimiray <himal.prasad.ghimiray@intel.com>
>Signed-off-by: Riana Tauro <riana.tauro@intel.com>
>---
> drivers/gpu/drm/xe/Makefile                |   1 +
> drivers/gpu/drm/xe/regs/xe_hw_error_regs.h |  15 +++
> drivers/gpu/drm/xe/regs/xe_irq_regs.h      |   1 +
> drivers/gpu/drm/xe/xe_hw_error.c           | 108 +++++++++++++++++++++
> drivers/gpu/drm/xe/xe_hw_error.h           |  15 +++
> drivers/gpu/drm/xe/xe_irq.c                |   4 +
> 6 files changed, 144 insertions(+)
> create mode 100644 drivers/gpu/drm/xe/regs/xe_hw_error_regs.h
> create mode 100644 drivers/gpu/drm/xe/xe_hw_error.c
> create mode 100644 drivers/gpu/drm/xe/xe_hw_error.h
>
>diff --git a/drivers/gpu/drm/xe/Makefile b/drivers/gpu/drm/xe/Makefile
>index 1d97e5b63f4e..fea8ee3b0785 100644
>--- a/drivers/gpu/drm/xe/Makefile
>+++ b/drivers/gpu/drm/xe/Makefile
>@@ -73,6 +73,7 @@ xe-y += xe_bb.o \
> 	xe_hw_engine.o \
> 	xe_hw_engine_class_sysfs.o \
> 	xe_hw_engine_group.o \
>+	xe_hw_error.o \
> 	xe_hw_fence.o \
> 	xe_irq.o \
> 	xe_lrc.o \
>diff --git a/drivers/gpu/drm/xe/regs/xe_hw_error_regs.h b/drivers/gpu/drm/xe/regs/xe_hw_error_regs.h
>new file mode 100644
>index 000000000000..ed9b81fb28a0
>--- /dev/null
>+++ b/drivers/gpu/drm/xe/regs/xe_hw_error_regs.h
>@@ -0,0 +1,15 @@
>+/* SPDX-License-Identifier: MIT */
>+/*
>+ * Copyright © 2025 Intel Corporation
>+ */
>+
>+#ifndef _XE_HW_ERROR_REGS_H_
>+#define _XE_HW_ERROR_REGS_H_
>+
>+#define DEV_ERR_STAT_NONFATAL			0x100178
>+#define DEV_ERR_STAT_CORRECTABLE		0x10017c
>+#define DEV_ERR_STAT_REG(x)			XE_REG(_PICK_EVEN((x), \
>+								  DEV_ERR_STAT_CORRECTABLE, \
>+								  DEV_ERR_STAT_NONFATAL))

For x = 1 and x = 2, I don't see the above resulting in the correct values.
Can you please double check?

What about DEV_ERR_STAT_FATAL?
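
For reference, here is a standalone sketch of the arithmetic in question. It
assumes _PICK_EVEN(index, a, b) expands to a + index * (b - a); PICK_EVEN and
dev_err_stat_offset() are hypothetical stand-ins written for this mail, not
the driver macros. Under that assumption x = 0, 1, 2 select 0x10017c,
0x100178 and 0x100174, so x = 2 is only right if a FATAL status register
really sits at 0x100174, which the patch does not define.

```c
#include <assert.h>
#include <stdint.h>

/* Assumed to mirror the kernel's _PICK_EVEN(): evenly spaced registers. */
#define PICK_EVEN(index, a, b)	((a) + (index) * ((b) - (a)))

/* Offsets copied from the patch under review. */
#define DEV_ERR_STAT_NONFATAL		0x100178
#define DEV_ERR_STAT_CORRECTABLE	0x10017c

/*
 * Hypothetical helper: the offset DEV_ERR_STAT_REG(x) would select for
 * x = 0 (correctable), 1 (non-fatal), 2 (fatal).
 */
static uint32_t dev_err_stat_offset(int x)
{
	assert(x >= 0 && x <= 2);
	return (uint32_t)PICK_EVEN(x, DEV_ERR_STAT_CORRECTABLE,
				   DEV_ERR_STAT_NONFATAL);
}
```

The stride here is negative (-4), so each step of x moves one register down
from the CORRECTABLE offset.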

Rest looks good,

Umesh

>+
>+#endif
>diff --git a/drivers/gpu/drm/xe/regs/xe_irq_regs.h b/drivers/gpu/drm/xe/regs/xe_irq_regs.h
>index f0ecfcac4003..2758b64cec9e 100644
>--- a/drivers/gpu/drm/xe/regs/xe_irq_regs.h
>+++ b/drivers/gpu/drm/xe/regs/xe_irq_regs.h
>@@ -18,6 +18,7 @@
> #define GFX_MSTR_IRQ				XE_REG(0x190010, XE_REG_OPTION_VF)
> #define   MASTER_IRQ				REG_BIT(31)
> #define   GU_MISC_IRQ				REG_BIT(29)
>+#define   ERROR_IRQ(x)				REG_BIT(26 + (x))
> #define   DISPLAY_IRQ				REG_BIT(16)
> #define   GT_DW_IRQ(x)				REG_BIT(x)
>
>diff --git a/drivers/gpu/drm/xe/xe_hw_error.c b/drivers/gpu/drm/xe/xe_hw_error.c
>new file mode 100644
>index 000000000000..0f2590839900
>--- /dev/null
>+++ b/drivers/gpu/drm/xe/xe_hw_error.c
>@@ -0,0 +1,108 @@
>+// SPDX-License-Identifier: MIT
>+/*
>+ * Copyright © 2025 Intel Corporation
>+ */
>+
>+#include "regs/xe_hw_error_regs.h"
>+#include "regs/xe_irq_regs.h"
>+
>+#include "xe_device.h"
>+#include "xe_hw_error.h"
>+#include "xe_mmio.h"
>+
>+/* Error categories reported by hardware */
>+enum hardware_error {
>+	HARDWARE_ERROR_CORRECTABLE = 0,
>+	HARDWARE_ERROR_NONFATAL = 1,
>+	HARDWARE_ERROR_FATAL = 2,
>+	HARDWARE_ERROR_MAX,
>+};
>+
>+static const char *hw_error_to_str(const enum hardware_error hw_err)
>+{
>+	switch (hw_err) {
>+	case HARDWARE_ERROR_CORRECTABLE:
>+		return "CORRECTABLE";
>+	case HARDWARE_ERROR_NONFATAL:
>+		return "NONFATAL";
>+	case HARDWARE_ERROR_FATAL:
>+		return "FATAL";
>+	default:
>+		return "UNKNOWN";
>+	}
>+}
>+
>+static void hw_error_source_handler(struct xe_tile *tile, const enum hardware_error hw_err)
>+{
>+	const char *hw_err_str = hw_error_to_str(hw_err);
>+	struct xe_device *xe = tile_to_xe(tile);
>+	unsigned long flags;
>+	u32 err_src;
>+
>+	if (xe->info.platform != XE_BATTLEMAGE)
>+		return;
>+
>+	spin_lock_irqsave(&xe->irq.lock, flags);
>+	err_src = xe_mmio_read32(&tile->mmio, DEV_ERR_STAT_REG(hw_err));
>+	if (!err_src) {
>+		drm_err_ratelimited(&xe->drm, HW_ERR "Tile%d reported DEV_ERR_STAT_%s blank!\n",
>+				    tile->id, hw_err_str);
>+		goto unlock;
>+	}
>+
>+	/* TODO: Process errrors per source */
>+
>+	xe_mmio_write32(&tile->mmio, DEV_ERR_STAT_REG(hw_err), err_src);
>+
>+unlock:
>+	spin_unlock_irqrestore(&xe->irq.lock, flags);
>+}
>+
>+/**
>+ * xe_hw_error_irq_handler - irq handling for hw errors
>+ * @tile: tile instance
>+ * @master_ctl: value read from master interrupt register
>+ *
>+ * Xe platforms add three error bits to the master interrupt register to support error handling.
>+ * These three bits are used to convey the class of error FATAL, NONFATAL, or CORRECTABLE.
>+ * To process the interrupt, determine the source of error by reading the Device Error Source
>+ * Register that corresponds to the class of error being serviced.
>+ */
>+void xe_hw_error_irq_handler(struct xe_tile *tile, const u32 master_ctl)
>+{
>+	enum hardware_error hw_err;
>+
>+	for (hw_err = 0; hw_err < HARDWARE_ERROR_MAX; hw_err++)
>+		if (master_ctl & ERROR_IRQ(hw_err))
>+			hw_error_source_handler(tile, hw_err);
>+}
>+
>+/*
>+ * Process hardware errors during boot
>+ */
>+static void process_hw_errors(struct xe_device *xe)
>+{
>+	struct xe_tile *tile;
>+	u32 master_ctl;
>+	u8 id;
>+
>+	for_each_tile(tile, xe, id) {
>+		master_ctl = xe_mmio_read32(&tile->mmio, GFX_MSTR_IRQ);
>+		xe_hw_error_irq_handler(tile, master_ctl);
>+		xe_mmio_write32(&tile->mmio, GFX_MSTR_IRQ, master_ctl);
>+	}
>+}
>+
>+/**
>+ * xe_hw_error_init - Initialize hw errors
>+ * @xe: xe device instance
>+ *
>+ * Initialize and process hw errors
>+ */
>+void xe_hw_error_init(struct xe_device *xe)
>+{
>+	if (!IS_DGFX(xe) || IS_SRIOV_VF(xe))
>+		return;
>+
>+	process_hw_errors(xe);
>+}
>diff --git a/drivers/gpu/drm/xe/xe_hw_error.h b/drivers/gpu/drm/xe/xe_hw_error.h
>new file mode 100644
>index 000000000000..d86e28c5180c
>--- /dev/null
>+++ b/drivers/gpu/drm/xe/xe_hw_error.h
>@@ -0,0 +1,15 @@
>+/* SPDX-License-Identifier: MIT */
>+/*
>+ * Copyright © 2025 Intel Corporation
>+ */
>+#ifndef XE_HW_ERROR_H_
>+#define XE_HW_ERROR_H_
>+
>+#include <linux/types.h>
>+
>+struct xe_tile;
>+struct xe_device;
>+
>+void xe_hw_error_irq_handler(struct xe_tile *tile, const u32 master_ctl);
>+void xe_hw_error_init(struct xe_device *xe);
>+#endif
>diff --git a/drivers/gpu/drm/xe/xe_irq.c b/drivers/gpu/drm/xe/xe_irq.c
>index 5362d3174b06..24ccf3bec52c 100644
>--- a/drivers/gpu/drm/xe/xe_irq.c
>+++ b/drivers/gpu/drm/xe/xe_irq.c
>@@ -18,6 +18,7 @@
> #include "xe_gt.h"
> #include "xe_guc.h"
> #include "xe_hw_engine.h"
>+#include "xe_hw_error.h"
> #include "xe_memirq.h"
> #include "xe_mmio.h"
> #include "xe_pxp.h"
>@@ -466,6 +467,7 @@ static irqreturn_t dg1_irq_handler(int irq, void *arg)
> 		xe_mmio_write32(mmio, GFX_MSTR_IRQ, master_ctl);
>
> 		gt_irq_handler(tile, master_ctl, intr_dw, identity);
>+		xe_hw_error_irq_handler(tile, master_ctl);
>
> 		/*
> 		 * Display interrupts (including display backlight operations
>@@ -753,6 +755,8 @@ int xe_irq_install(struct xe_device *xe)
> 	int nvec = 1;
> 	int err;
>
>+	xe_hw_error_init(xe);
>+
> 	xe_irq_reset(xe);
>
> 	if (xe_device_has_msix(xe)) {
>-- 
>2.47.1
>


* Re: [PATCH v4 1/9] drm: Add a vendor-specific recovery method to device wedged uevent
  2025-07-10 19:00                 ` Rodrigo Vivi
@ 2025-07-10 21:46                   ` Raag Jadav
  2025-07-11  5:17                     ` Riana Tauro
  2025-07-11  8:56                   ` Simona Vetter
  1 sibling, 1 reply; 48+ messages in thread
From: Raag Jadav @ 2025-07-10 21:46 UTC (permalink / raw)
  To: Rodrigo Vivi
  Cc: Christian König, Simona Vetter, Riana Tauro, intel-xe,
	anshuman.gupta, lucas.demarchi, aravind.iddamsetty,
	umesh.nerlige.ramappa, frank.scarbrough, sk.anirban,
	André Almeida, David Airlie, dri-devel

On Thu, Jul 10, 2025 at 03:00:06PM -0400, Rodrigo Vivi wrote:
> On Thu, Jul 10, 2025 at 01:24:52PM +0300, Raag Jadav wrote:
> > On Thu, Jul 10, 2025 at 11:37:14AM +0200, Christian König wrote:
> > > On 10.07.25 11:01, Simona Vetter wrote:
> > > > On Wed, Jul 09, 2025 at 12:52:05PM -0400, Rodrigo Vivi wrote:
> > > >> On Wed, Jul 09, 2025 at 05:18:54PM +0300, Raag Jadav wrote:
> > > >>> On Wed, Jul 09, 2025 at 04:09:20PM +0200, Christian König wrote:
> > > >>>> On 09.07.25 15:41, Simona Vetter wrote:
> > > >>>>> On Wed, Jul 09, 2025 at 04:50:13PM +0530, Riana Tauro wrote:
> > > >>>>>> Certain errors can cause the device to be wedged and may
> > > >>>>>> require a vendor specific recovery method to restore normal
> > > >>>>>> operation.
> > > >>>>>>
> > > >>>>>> Add a recovery method 'WEDGED=vendor-specific' for such errors. Vendors
> > > >>>>>> must provide additional recovery documentation if this method
> > > >>>>>> is used.
> > > >>>>>>
> > > >>>>>> v2: fix documentation (Raag)
> > > >>>>>>
> > > >>>>>> Cc: André Almeida <andrealmeid@igalia.com>
> > > >>>>>> Cc: Christian König <christian.koenig@amd.com>
> > > >>>>>> Cc: David Airlie <airlied@gmail.com>
> > > >>>>>> Cc: <dri-devel@lists.freedesktop.org>
> > > >>>>>> Suggested-by: Raag Jadav <raag.jadav@intel.com>
> > > >>>>>> Signed-off-by: Riana Tauro <riana.tauro@intel.com>
> > > >>>>>
> > > >>>>> I'm not really understanding what this is useful for, maybe concrete
> > > >>>>> example in the form of driver code that uses this, and some tool or
> > > >>>>> documentation steps that should be taken for recovery?
> > > >>
> > > >> The case here is when the FW underneath has identified something badly
> > > >> corrupted in FW land, decided that only a firmware flash could save the
> > > >> day, and raised an interrupt to the driver. At that point we want to wedge,
> > > >> but immediately hint the recommended action to the admin.
> > > >>
> > > >>>>
> > > >>>> The recovery method for this particular case is to flash in a new firmware.
> > > >>>>
> > > >>>>> The issues I'm seeing here is that eventually we'll get different
> > > >>>>> vendor-specific recovery steps, and maybe even on the same device, and
> > > >>>>> that leads us to an enumeration issue. Since it's just a string and an
> > > >>>>> enum I think it'd be better to just allocate a new one every time there's
> > > >>>>> a new strange recovery method instead of this opaque approach.
> > > >>>>
> > > >>>> That is exactly the opposite of what we discussed so far.
> > > > 
> > > > Sorry, I missed that context.
> > > > 
> > > >>>> The original idea was to add a firmware-flash recovery method which
> > > >>>> looked a bit vague since it didn't give any information on what to do
> > > >>>> exactly.
> > > >>>>
> > > >>>> That's why I suggested adding a more generic vendor-specific event
> > > >>>> which refers to the documentation and system log to see what actually
> > > >>>> needs to be done.
> > > >>>>
> > > >>>> Otherwise we would end up with events like firmware-flash, update FW
> > > >>>> image A, update FW image B, FW version mismatch etc....
> > > > 
> > > > Yeah, that's kinda what I expect to happen, and we have enough numbers for
> > > > this all to not be an issue.
> > > > 
> > > >>> Agree. Any newly allocated method that is specific to a vendor is going to
> > > >>> be opaque anyway, since it can't be generic for all drivers. This just helps
> > > >>> reduce the noise in DRM core.
> > > >>>
> > > >>> And yes, there could be different vendor-specific cases for the same driver
> > > >>> and the driver should be able to provide the means to distinguish between
> > > >>> them.
> > > >>
> > > >> Sim, what's your take on this then?
> > > >>
> > > >> Should we get back to the original idea of firmware-flash?
> > > > 
> > > > Maybe intel-firmware-flash or something, meaning prefix with the vendor?
> > > > 
> > > > The reason I think it should be specific is because I'm assuming you want
> > > > to script this. And if you have a big fleet with different vendors, then
> > > > "vendor-specific" doesn't tell you enough. But if it's something like
> > > > $vendor-$magic_step then it does become scriptable, and we do have a
> > > > place to put some documentation on what you should do instead.
> > > > 
> > > > If the point of this interface isn't that it's scriptable, then I'm not
> > > > sure why it needs to be an uevent?
> > > 
> > > You should probably read up on the previous discussion, cause that is exactly what I asked as well :)
> > > 
> > > And no, it should *not* be scripted. That would be a bit brave for a firmware update where you should absolutely not power down the system for example.
> 
> I also don't like the idea, or even the thought, of scripting something like
> a firmware-flash. The point is only to fail with a more precise pointer, so a
> notification can make admins' lives easier.
> 
> > > 
> > > In my understanding the new value "vendor-specific" basically means it is a known issue with a documented solution, while "unknown" means the driver has no idea how to solve it.
> 
> Exactly, the hardware and firmware are giving the indication of what should be
> done. It is not 'unknown'.
> 
> > 
> > Yes, and since the recovery procedure is defined and known to the consumer,
> > it can potentially be automated (at least for non-firmware cases).
> > 
> > > > I guess if you all want to stick with vendor-specific then I think that's
> 
> Well, I would honestly prefer a direct firmware-flash, but if that is not
> usable by other vendors and there's a push back on that, let's go with
> the vendor-specific then.

I think the procedure for firmware-flash is vendor specific, so the wedged event
alone is not sufficient either way. The consumer will need more guidance from
vendor documentation.

With the vendor-specific method, the driver has the opportunity to cover as many
cases as it wants without having to create a new method every time and face the
same dilemma of being vendor agnostic.
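
To illustrate the consumer side, a minimal sketch of checking the WEDGED=
value for a given method follows. wedged_has_method() is a hypothetical
helper, not an existing libudev or DRM API, and it assumes the value is the
comma-separated list described in drm-uapi.rst. Once "vendor-specific" is
seen, the consumer still has to identify the device and follow the vendor's
documentation.

```c
#include <assert.h>
#include <stdbool.h>
#include <string.h>

/*
 * Hypothetical consumer-side helper: returns true if `method` is one of the
 * entries in the comma-separated value of a WEDGED= uevent property,
 * e.g. "rebind,bus-reset". Entries are compared whole, not as substrings.
 */
static bool wedged_has_method(const char *value, const char *method)
{
	size_t mlen;
	const char *p;

	assert(value && method);
	mlen = strlen(method);
	p = value;

	while (*p) {
		/* Length of the current entry, up to the next comma or end. */
		size_t len = strcspn(p, ",");

		if (len == mlen && strncmp(p, method, len) == 0)
			return true;
		p += len;
		if (*p == ',')
			p++;
	}
	return false;
}
```

A consumer could call wedged_has_method(value, "vendor-specific") and, on a
match, only notify the admin rather than attempt any automated recovery.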

> > > > ok with me too, but the docs should at least explain how to figure out
> > > > from the uevent which vendor you're on with a small example. What I'm
> > > > worried about is that if we have this on multiple drivers userspace will
> > > > otherwise make a complete mess and might want to run the wrong recovery
> > > > steps.
> > 
> > The device id along with driver can be identified from uevent (probably
> > available inside DEVPATH somewhere) to distinguish the vendor. So the consumer
> > already knows if the device fits the criteria for recovery.
> > 
> > > > I think ideally, no matter what, we'd have a concrete driver patch which
> > > > then also comes with the documentation for what exactly you're supposed to
> > > > do as something you can script. And not just this stand-alone patch here.
> > 
> > Perhaps the rest of the series didn't make it to dri-devel, which will answer
> > most of the above.
> 
> Riana, could you please try to provide a bit more documentation like Sima
> asked and re-send the entire series to dri-devel?

With the ideas in this thread also documented so that we don't end up repeating
the same discussion.

Raag

> > > >>>>>> ---
> > > >>>>>>  Documentation/gpu/drm-uapi.rst | 9 +++++----
> > > >>>>>>  drivers/gpu/drm/drm_drv.c      | 2 ++
> > > >>>>>>  include/drm/drm_device.h       | 4 ++++
> > > >>>>>>  3 files changed, 11 insertions(+), 4 deletions(-)
> > > >>>>>>
> > > >>>>>> diff --git a/Documentation/gpu/drm-uapi.rst b/Documentation/gpu/drm-uapi.rst
> > > >>>>>> index 263e5a97c080..c33070bdb347 100644
> > > >>>>>> --- a/Documentation/gpu/drm-uapi.rst
> > > >>>>>> +++ b/Documentation/gpu/drm-uapi.rst
> > > >>>>>> @@ -421,10 +421,10 @@ Recovery
> > > >>>>>>  Current implementation defines three recovery methods, out of which, drivers
> > > >>>>>>  can use any one, multiple or none. Method(s) of choice will be sent in the
> > > >>>>>>  uevent environment as ``WEDGED=<method1>[,..,<methodN>]`` in order of less to
> > > >>>>>> -more side-effects. If driver is unsure about recovery or method is unknown
> > > >>>>>> -(like soft/hard system reboot, firmware flashing, physical device replacement
> > > >>>>>> -or any other procedure which can't be attempted on the fly), ``WEDGED=unknown``
> > > >>>>>> -will be sent instead.
> > > >>>>>> +more side-effects. If recovery method is specific to vendor
> > > >>>>>> +``WEDGED=vendor-specific`` will be sent and userspace should refer to vendor
> > > >>>>>> +specific documentation for further recovery steps. If driver is unsure about
> > > >>>>>> +recovery or method is unknown, ``WEDGED=unknown`` will be sent instead
> > > >>>>>>  
> > > >>>>>>  Userspace consumers can parse this event and attempt recovery as per the
> > > >>>>>>  following expectations.
> > > >>>>>> @@ -435,6 +435,7 @@ following expectations.
> > > >>>>>>      none            optional telemetry collection
> > > >>>>>>      rebind          unbind + bind driver
> > > >>>>>>      bus-reset       unbind + bus reset/re-enumeration + bind
> > > >>>>>> +    vendor-specific vendor specific recovery method
> > > >>>>>>      unknown         consumer policy
> > > >>>>>>      =============== ========================================
> > > >>>>>>  
> > > >>>>>> diff --git a/drivers/gpu/drm/drm_drv.c b/drivers/gpu/drm/drm_drv.c
> > > >>>>>> index cdd591b11488..0ac723a46a91 100644
> > > >>>>>> --- a/drivers/gpu/drm/drm_drv.c
> > > >>>>>> +++ b/drivers/gpu/drm/drm_drv.c
> > > >>>>>> @@ -532,6 +532,8 @@ static const char *drm_get_wedge_recovery(unsigned int opt)
> > > >>>>>>  		return "rebind";
> > > >>>>>>  	case DRM_WEDGE_RECOVERY_BUS_RESET:
> > > >>>>>>  		return "bus-reset";
> > > >>>>>> +	case DRM_WEDGE_RECOVERY_VENDOR:
> > > >>>>>> +		return "vendor-specific";
> > > >>>>>>  	default:
> > > >>>>>>  		return NULL;
> > > >>>>>>  	}
> > > >>>>>> diff --git a/include/drm/drm_device.h b/include/drm/drm_device.h
> > > >>>>>> index 08b3b2467c4c..08a087f149ff 100644
> > > >>>>>> --- a/include/drm/drm_device.h
> > > >>>>>> +++ b/include/drm/drm_device.h
> > > >>>>>> @@ -26,10 +26,14 @@ struct pci_controller;
> > > >>>>>>   * Recovery methods for wedged device in order of less to more side-effects.
> > > >>>>>>   * To be used with drm_dev_wedged_event() as recovery @method. Callers can
> > > >>>>>>   * use any one, multiple (or'd) or none depending on their needs.
> > > >>>>>> + *
> > > >>>>>> + * Refer to "Device Wedging" chapter in Documentation/gpu/drm-uapi.rst for more
> > > >>>>>> + * details.
> > > >>>>>>   */
> > > >>>>>>  #define DRM_WEDGE_RECOVERY_NONE		BIT(0)	/* optional telemetry collection */
> > > >>>>>>  #define DRM_WEDGE_RECOVERY_REBIND	BIT(1)	/* unbind + bind driver */
> > > >>>>>>  #define DRM_WEDGE_RECOVERY_BUS_RESET	BIT(2)	/* unbind + reset bus device + bind */
> > > >>>>>> +#define DRM_WEDGE_RECOVERY_VENDOR	BIT(3)	/* vendor specific recovery method */
> > > >>>>>>  
> > > >>>>>>  /**
> > > >>>>>>   * struct drm_wedge_task_info - information about the guilty task of a wedge dev
> > > >>>>>> -- 
> > > >>>>>> 2.47.1
> > > >>>>>>
> > > >>>>>
> > > >>>>
> > > > 
> > > 


* Re: [PATCH v4 8/9] drm/xe/xe_hw_error: Handle CSC Firmware reported Hardware errors
  2025-07-09 11:20 ` [PATCH v4 8/9] drm/xe/xe_hw_error: Handle CSC Firmware reported Hardware errors Riana Tauro
@ 2025-07-11  0:36   ` Umesh Nerlige Ramappa
  2025-07-11  5:46     ` Riana Tauro
  0 siblings, 1 reply; 48+ messages in thread
From: Umesh Nerlige Ramappa @ 2025-07-11  0:36 UTC (permalink / raw)
  To: Riana Tauro
  Cc: intel-xe, anshuman.gupta, rodrigo.vivi, lucas.demarchi,
	aravind.iddamsetty, raag.jadav, frank.scarbrough, sk.anirban

On Wed, Jul 09, 2025 at 04:50:20PM +0530, Riana Tauro wrote:
>Add support to handle CSC firmware reported errors. When CSC firmware
>errors are encountered, an error interrupt is received by the GFX device as
>an MSI interrupt.
>
>The Device Source control registers indicate the source of the error as CSC.
>The HEC error status register indicates that the error is firmware reported.
>Depending on the type of error, the error cause is written to the HEC
>Firmware error register.
>
>On encountering such CSC firmware errors, the graphics device is
>non-recoverable from driver context. The only way to recover from these
>errors is firmware flash. The device is then wedged and userspace is
>notified with a drm uevent.
>
>v2: use vendor recovery method with
>    runtime survivability (Christian, Rodrigo, Raag)
>
>v3: move declare wedged to runtime survivability mode (Rodrigo)
>
>Signed-off-by: Riana Tauro <riana.tauro@intel.com>
>---
> drivers/gpu/drm/xe/regs/xe_gsc_regs.h      |  2 +
> drivers/gpu/drm/xe/regs/xe_hw_error_regs.h |  7 ++-
> drivers/gpu/drm/xe/xe_device_types.h       |  3 +
> drivers/gpu/drm/xe/xe_hw_error.c           | 68 +++++++++++++++++++++-
> 4 files changed, 78 insertions(+), 2 deletions(-)
>
>diff --git a/drivers/gpu/drm/xe/regs/xe_gsc_regs.h b/drivers/gpu/drm/xe/regs/xe_gsc_regs.h
>index 9b66cc972a63..180be82672ab 100644
>--- a/drivers/gpu/drm/xe/regs/xe_gsc_regs.h
>+++ b/drivers/gpu/drm/xe/regs/xe_gsc_regs.h
>@@ -13,6 +13,8 @@
>
> /* Definitions of GSC H/W registers, bits, etc */
>
>+#define BMG_GSC_HECI1_BASE	0x373000
>+
> #define MTL_GSC_HECI1_BASE	0x00116000
> #define MTL_GSC_HECI2_BASE	0x00117000
>
>diff --git a/drivers/gpu/drm/xe/regs/xe_hw_error_regs.h b/drivers/gpu/drm/xe/regs/xe_hw_error_regs.h
>index ed9b81fb28a0..c146b9ef44eb 100644
>--- a/drivers/gpu/drm/xe/regs/xe_hw_error_regs.h
>+++ b/drivers/gpu/drm/xe/regs/xe_hw_error_regs.h
>@@ -6,10 +6,15 @@
> #ifndef _XE_HW_ERROR_REGS_H_
> #define _XE_HW_ERROR_REGS_H_
>
>+#define HEC_UNCORR_ERR_STATUS(base)                    XE_REG((base) + 0x118)
>+#define    UNCORR_FW_REPORTED_ERR                      BIT(6)
>+
>+#define HEC_UNCORR_FW_ERR_DW0(base)                    XE_REG((base) + 0x124)
>+
> #define DEV_ERR_STAT_NONFATAL			0x100178
> #define DEV_ERR_STAT_CORRECTABLE		0x10017c
> #define DEV_ERR_STAT_REG(x)			XE_REG(_PICK_EVEN((x), \
> 								  DEV_ERR_STAT_CORRECTABLE, \
> 								  DEV_ERR_STAT_NONFATAL))
>-
>+#define   XE_CSC_ERROR				BIT(17)
> #endif
>diff --git a/drivers/gpu/drm/xe/xe_device_types.h b/drivers/gpu/drm/xe/xe_device_types.h
>index ca300338e8c2..283d5c88758e 100644
>--- a/drivers/gpu/drm/xe/xe_device_types.h
>+++ b/drivers/gpu/drm/xe/xe_device_types.h
>@@ -241,6 +241,9 @@ struct xe_tile {
> 	/** @memirq: Memory Based Interrupts. */
> 	struct xe_memirq memirq;
>
>+	/** @csc_hw_error_work: worker to report CSC HW errors */
>+	struct work_struct csc_hw_error_work;
>+
> 	/** @pcode: tile's PCODE */
> 	struct {
> 		/** @pcode.lock: protecting tile's PCODE mailbox data */
>diff --git a/drivers/gpu/drm/xe/xe_hw_error.c b/drivers/gpu/drm/xe/xe_hw_error.c
>index 0f2590839900..7cc9b8a7fa1a 100644
>--- a/drivers/gpu/drm/xe/xe_hw_error.c
>+++ b/drivers/gpu/drm/xe/xe_hw_error.c
>@@ -3,12 +3,16 @@
>  * Copyright © 2025 Intel Corporation
>  */
>
>+#include "regs/xe_gsc_regs.h"
> #include "regs/xe_hw_error_regs.h"
> #include "regs/xe_irq_regs.h"
>
> #include "xe_device.h"
> #include "xe_hw_error.h"
> #include "xe_mmio.h"
>+#include "xe_survivability_mode.h"
>+
>+#define  HEC_UNCORR_FW_ERR_BITS 4
>
> /* Error categories reported by hardware */
> enum hardware_error {
>@@ -18,6 +22,13 @@ enum hardware_error {
> 	HARDWARE_ERROR_MAX,
> };
>
>+static const char * const hec_uncorrected_fw_errors[] = {
>+	"Fatal",
>+	"CSE Disabled",
>+	"FD Corruption",
>+	"Data Corruption"
>+};
>+
> static const char *hw_error_to_str(const enum hardware_error hw_err)
> {
> 	switch (hw_err) {
>@@ -32,6 +43,56 @@ static const char *hw_error_to_str(const enum hardware_error hw_err)
> 	}
> }
>
>+static void csc_hw_error_work(struct work_struct *work)
>+{
>+	struct xe_tile *tile = container_of(work, typeof(*tile), csc_hw_error_work);
>+	struct xe_device *xe = tile_to_xe(tile);
>+	int ret;
>+
>+	ret = xe_survivability_mode_runtime_enable(xe);

xe_survivability_mode_runtime_enable() returns early if the device is not
BMG, not dgfx, etc., so would it make sense to not even queue the work when
those conditions are not met?

>+	if (ret)
>+		drm_err(&xe->drm, "Failed to enable runtime survivability mode\n");
>+}
>+
>+static void csc_hw_error_handler(struct xe_tile *tile, const enum hardware_error hw_err)
>+{
>+	const char *hw_err_str = hw_error_to_str(hw_err);
>+	struct xe_device *xe = tile_to_xe(tile);
>+	struct xe_mmio *mmio = &tile->mmio;
>+	u32 base, err_bit, err_src;
>+	unsigned long fw_err;
>+
>+	if (xe->info.platform != XE_BATTLEMAGE)
>+		return;
>+
>+	/* Not supported in BMG */
>+	if (hw_err == HARDWARE_ERROR_CORRECTABLE)
>+		return;
>+
>+	base = BMG_GSC_HECI1_BASE;
>+	lockdep_assert_held(&xe->irq.lock);
>+	err_src = xe_mmio_read32(mmio, HEC_UNCORR_ERR_STATUS(base));
>+	if (!err_src) {
>+		drm_err_ratelimited(&xe->drm, HW_ERR "Tile%d reported HEC_ERR_STATUS_%s blank\n",
>+				    tile->id, hw_err_str);
>+		return;
>+	}
>+
>+	if (err_src & UNCORR_FW_REPORTED_ERR) {
>+		fw_err = xe_mmio_read32(mmio, HEC_UNCORR_FW_ERR_DW0(base));
>+		for_each_set_bit(err_bit, &fw_err, HEC_UNCORR_FW_ERR_BITS) {
>+			drm_err_ratelimited(&xe->drm, HW_ERR
>+					    "%s: HEC Uncorrected FW %s error reported, bit[%d] is set\n",
>+					     hw_err_str, hec_uncorrected_fw_errors[err_bit],
>+					     err_bit);
>+
>+			schedule_work(&tile->csc_hw_error_work);
>+		}
>+	}
>+
>+	xe_mmio_write32(mmio, HEC_UNCORR_ERR_STATUS(base), err_src);
>+}
>+
> static void hw_error_source_handler(struct xe_tile *tile, const enum hardware_error hw_err)
> {
> 	const char *hw_err_str = hw_error_to_str(hw_err);
>@@ -50,7 +111,8 @@ static void hw_error_source_handler(struct xe_tile *tile, const enum hardware_er
> 		goto unlock;
> 	}
>
>-	/* TODO: Process errrors per source */
>+	if (err_src & XE_CSC_ERROR)
>+		csc_hw_error_handler(tile, hw_err);
>
> 	xe_mmio_write32(&tile->mmio, DEV_ERR_STAT_REG(hw_err), err_src);
>
>@@ -101,8 +163,12 @@ static void process_hw_errors(struct xe_device *xe)
>  */
> void xe_hw_error_init(struct xe_device *xe)
> {
>+	struct xe_tile *tile = xe_device_get_root_tile(xe);
>+
> 	if (!IS_DGFX(xe) || IS_SRIOV_VF(xe))
> 		return;
>
>+	INIT_WORK(&tile->csc_hw_error_work, csc_hw_error_work);

Same here, why have a worker if it's not BMG?

Also, reiterating a previous comment in another patch - if the feature 
can be defined as a has_ struct member in the pci/gt info that could 
streamline the checks.

Thanks,
Umesh

>+
> 	process_hw_errors(xe);
> }
>-- 
>2.47.1
>

^ permalink raw reply	[flat|nested] 48+ messages in thread

* Re: [PATCH v4 1/9] drm: Add a vendor-specific recovery method to device wedged uevent
  2025-07-10 21:46                   ` Raag Jadav
@ 2025-07-11  5:17                     ` Riana Tauro
  2025-07-11  6:08                       ` Raag Jadav
  0 siblings, 1 reply; 48+ messages in thread
From: Riana Tauro @ 2025-07-11  5:17 UTC (permalink / raw)
  To: Raag Jadav, Rodrigo Vivi
  Cc: Christian König, Simona Vetter, intel-xe, anshuman.gupta,
	lucas.demarchi, aravind.iddamsetty, umesh.nerlige.ramappa,
	frank.scarbrough, sk.anirban, André Almeida, David Airlie,
	dri-devel



On 7/11/2025 3:16 AM, Raag Jadav wrote:
> On Thu, Jul 10, 2025 at 03:00:06PM -0400, Rodrigo Vivi wrote:
>> On Thu, Jul 10, 2025 at 01:24:52PM +0300, Raag Jadav wrote:
>>> On Thu, Jul 10, 2025 at 11:37:14AM +0200, Christian König wrote:
>>>> On 10.07.25 11:01, Simona Vetter wrote:
>>>>> On Wed, Jul 09, 2025 at 12:52:05PM -0400, Rodrigo Vivi wrote:
>>>>>> On Wed, Jul 09, 2025 at 05:18:54PM +0300, Raag Jadav wrote:
>>>>>>> On Wed, Jul 09, 2025 at 04:09:20PM +0200, Christian König wrote:
>>>>>>>> On 09.07.25 15:41, Simona Vetter wrote:
>>>>>>>>> On Wed, Jul 09, 2025 at 04:50:13PM +0530, Riana Tauro wrote:
>>>>>>>>>> Certain errors can cause the device to be wedged and may
>>>>>>>>>> require a vendor specific recovery method to restore normal
>>>>>>>>>> operation.
>>>>>>>>>>
>>>>>>>>>> Add a recovery method 'WEDGED=vendor-specific' for such errors. Vendors
>>>>>>>>>> must provide additional recovery documentation if this method
>>>>>>>>>> is used.
>>>>>>>>>>
>>>>>>>>>> v2: fix documentation (Raag)
>>>>>>>>>>
>>>>>>>>>> Cc: André Almeida <andrealmeid@igalia.com>
>>>>>>>>>> Cc: Christian König <christian.koenig@amd.com>
>>>>>>>>>> Cc: David Airlie <airlied@gmail.com>
>>>>>>>>>> Cc: <dri-devel@lists.freedesktop.org>
>>>>>>>>>> Suggested-by: Raag Jadav <raag.jadav@intel.com>
>>>>>>>>>> Signed-off-by: Riana Tauro <riana.tauro@intel.com>
>>>>>>>>>
>>>>>>>>> I'm not really understanding what this is useful for, maybe concrete
>>>>>>>>> example in the form of driver code that uses this, and some tool or
>>>>>>>>> documentation steps that should be taken for recovery?
>>>>>>
>>>>>> The case here is when FW underneath identified something badly corrupted on
>>>>>> FW land and decided that only a firmware-flashing could solve the day and
>>>>>> raise interrupt to the driver. At that point we want to wedge, but immediately
>>>>>> hint the admin the recommended action.
>>>>>>
>>>>>>>>
>>>>>>>> The recovery method for this particular case is to flash in a new firmware.
>>>>>>>>
>>>>>>>>> The issues I'm seeing here is that eventually we'll get different
>>>>>>>>> vendor-specific recovery steps, and maybe even on the same device, and
>>>>>>>>> that leads us to an enumeration issue. Since it's just a string and an
>>>>>>>>> enum I think it'd be better to just allocate a new one every time there's
>>>>>>>>> a new strange recovery method instead of this opaque approach.
>>>>>>>>
>>>>>>>> That is exactly the opposite of what we discussed so far.
>>>>>
>>>>> Sorry, I missed that context.
>>>>>
>>>>>>>> The original idea was to add a firmware-flash recovery method which
>>>>>>>> looked a bit vague since it didn't give any information on what to do
>>>>>>>> exactly.
>>>>>>>>
>>>>>>>> That's why I suggested to add a more generic vendor-specific event
>>>>>>>> which refers to the documentation and system log to see what actually
>>>>>>>> needs to be done.
>>>>>>>>
>>>>>>>> Otherwise we would end up with events like firmware-flash, update FW
>>>>>>>> image A, update FW image B, FW version mismatch etc....
>>>>>
>>>>> Yeah, that's kinda what I expect to happen, and we have enough numbers for
>>>>> this all to not be an issue.
>>>>>
>>>>>>> Agree. Any newly allocated method that is specific to a vendor is going to
>>>>>>> be opaque anyway, since it can't be generic for all drivers. This just helps
>>>>>>> reduce the noise in DRM core.
>>>>>>>
>>>>>>> And yes, there could be different vendor-specific cases for the same driver
>>>>>>> and the driver should be able to provide the means to distinguish between
>>>>>>> them.
>>>>>>
>>>>>> Sim, what's your take on this then?
>>>>>>
>>>>>> Should we get back to the original idea of firmware-flash?
>>>>>
>>>>> Maybe intel-firmware-flash or something, meaning prefix with the vendor?
>>>>>
>>>>> The reason I think it should be specific is because I'm assuming you want
>>>>> to script this. And if you have a big fleet with different vendors, then
>>>>> "vendor-specific" doesn't tell you enough. But if it's something like
>>>>> $vendor-$magic_step then it does become scriptable, and we do have a
>>>>> place to put some documentation on what you should do instead.
>>>>>
>>>>> If the point of this interface isn't that it's scriptable, then I'm not
>>>>> sure why it needs to be an uevent?
>>>>
>>>> You should probably read up on the previous discussion, cause that is exactly what I asked as well :)
>>>>
>>>> And no, it should *not* be scripted. That would be a bit brave for a firmware update where you should absolutely not power down the system for example.
>>
>> I also don't like the idea or even the thought of scripting something like
>> a firmware-flash. But only to fail with a better pin point to make admin
>> lives easier with a notification.
>>
>>>>
>>>> In my understanding the new value "vendor-specific" basically means it is a known issue with a documented solution, while "unknown" means the driver has no idea how to solve it.
>>
>> Exactly, the hardware and firmware are giving the indication of what should be
>> done. It is not 'unknown'.
>>
>>>
>>> Yes, and since the recovery procedure is defined and known to the consumer,
>>> it can potentially be automated (at least for non-firmware cases).
>>>
>>>>> I guess if you all want to stick with vendor-specific then I think that's
>>
>> Well, I would honestly prefer a direct firmware-flash, but if that is not
>> usable by other vendors and there's a push back on that, let's go with
>> the vendor-specific then.
> 
> I think the procedure for firmware-flash is vendor specific, so the wedged event
> alone is not sufficient either way. The consumer will need more guidance from
> vendor documentation.

The procedure for a firmware flash is vendor-specific, but the term
'firmware-flash' itself is still generic. The patch doesn't mention any
vendor-specific firmware or procedure. The pushback was about the number of
macros that could end up being added for other operations.
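
For reference, a minimal consumer-side sketch of how the WEDGED value from the
uevent could be split into recovery methods. This is only an illustration under
assumptions: the WEDGE_* flags and parse_wedged() are hypothetical names local
to this sketch (the bit positions simply mirror the DRM_WEDGE_RECOVERY_* defines
in the patch), and any string the consumer does not recognise is treated the
same as "unknown".

```c
#include <stddef.h>
#include <string.h>

/*
 * Hypothetical consumer-side flags. The bit positions mirror the
 * DRM_WEDGE_RECOVERY_* defines from the patch, but these names exist
 * only in this sketch.
 */
#define WEDGE_NONE	(1u << 0)
#define WEDGE_REBIND	(1u << 1)
#define WEDGE_BUS_RESET	(1u << 2)
#define WEDGE_VENDOR	(1u << 3)
#define WEDGE_UNKNOWN	(1u << 31)	/* "unknown" or any unrecognised token */

/* Compare a non NUL-terminated token of length len against name. */
static int token_is(const char *tok, size_t len, const char *name)
{
	return strlen(name) == len && strncmp(tok, name, len) == 0;
}

/* Parse the value part of WEDGED=<method1>[,..,<methodN>] into a bitmask. */
unsigned int parse_wedged(const char *value)
{
	unsigned int mask = 0;

	while (*value) {
		size_t len = strcspn(value, ",");

		if (token_is(value, len, "none"))
			mask |= WEDGE_NONE;
		else if (token_is(value, len, "rebind"))
			mask |= WEDGE_REBIND;
		else if (token_is(value, len, "bus-reset"))
			mask |= WEDGE_BUS_RESET;
		else if (token_is(value, len, "vendor-specific"))
			mask |= WEDGE_VENDOR;
		else
			mask |= WEDGE_UNKNOWN;

		value += len;
		if (*value == ',')
			value++;
	}

	return mask;
}
```

With this shape, a consumer that sees WEDGE_VENDOR would stop and point the
admin at the vendor documentation rather than attempt anything automatically,
which matches the intent discussed above.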


> 
> With vendor-specific method, the driver has the opportunity to cover as many
> cases as it wants without having to create a new method every time, and face the
> same dilemma of being vendor agnostic.
> 
>>>>> ok with me too, but the docs should at least explain how to figure out
>>>>> from the uevent which vendor you're on with a small example. What I'm
>>>>> worried is that if we have this on multiple drivers userspace will
>>>>> otherwise make a complete mess and might want to run the wrong recovery
>>>>> steps.
>>>
>>> The device id along with driver can be identified from uevent (probably
>>> available inside DEVPATH somewhere) to distinguish the vendor. So the consumer
>>> already knows if the device fits the criteria for recovery.
>>>
>>>>> I think ideally, no matter what, we'd have a concrete driver patch which
>>>>> then also comes with the documentation for what exactly you're supposed to
>>>>> do as something you can script. And not just this stand-alone patch here.
>>>
>>> Perhaps the rest of the series didn't make it to dri-devel, which will answer
>>> most of the above.
>>
>> Riana, could you please try to provide a bit more documentation like Sima
>> asked and re-send the entire series to dri-devel?

Sure, I will send the entire series to dri-devel. The documentation is
present in the series.

> 
> With the ideas in this thread also documented so that we don't end up repeating
> the same discussion.
It is mentioned in the cover letter, but I didn't send it to dri-devel. I will
add more details.

Thanks
Riana

> 
> Raag
> 
>>>>>>>>>> ---
>>>>>>>>>>   Documentation/gpu/drm-uapi.rst | 9 +++++----
>>>>>>>>>>   drivers/gpu/drm/drm_drv.c      | 2 ++
>>>>>>>>>>   include/drm/drm_device.h       | 4 ++++
>>>>>>>>>>   3 files changed, 11 insertions(+), 4 deletions(-)
>>>>>>>>>>
>>>>>>>>>> diff --git a/Documentation/gpu/drm-uapi.rst b/Documentation/gpu/drm-uapi.rst
>>>>>>>>>> index 263e5a97c080..c33070bdb347 100644
>>>>>>>>>> --- a/Documentation/gpu/drm-uapi.rst
>>>>>>>>>> +++ b/Documentation/gpu/drm-uapi.rst
>>>>>>>>>> @@ -421,10 +421,10 @@ Recovery
>>>>>>>>>>   Current implementation defines three recovery methods, out of which, drivers
>>>>>>>>>>   can use any one, multiple or none. Method(s) of choice will be sent in the
>>>>>>>>>>   uevent environment as ``WEDGED=<method1>[,..,<methodN>]`` in order of less to
>>>>>>>>>> -more side-effects. If driver is unsure about recovery or method is unknown
>>>>>>>>>> -(like soft/hard system reboot, firmware flashing, physical device replacement
>>>>>>>>>> -or any other procedure which can't be attempted on the fly), ``WEDGED=unknown``
>>>>>>>>>> -will be sent instead.
>>>>>>>>>> +more side-effects. If recovery method is specific to vendor
>>>>>>>>>> +``WEDGED=vendor-specific`` will be sent and userspace should refer to vendor
>>>>>>>>>> +specific documentation for further recovery steps. If driver is unsure about
>>>>>>>>>> +recovery or method is unknown, ``WEDGED=unknown`` will be sent instead
>>>>>>>>>>   
>>>>>>>>>>   Userspace consumers can parse this event and attempt recovery as per the
>>>>>>>>>>   following expectations.
>>>>>>>>>> @@ -435,6 +435,7 @@ following expectations.
>>>>>>>>>>       none            optional telemetry collection
>>>>>>>>>>       rebind          unbind + bind driver
>>>>>>>>>>       bus-reset       unbind + bus reset/re-enumeration + bind
>>>>>>>>>> +    vendor-specific vendor specific recovery method
>>>>>>>>>>       unknown         consumer policy
>>>>>>>>>>       =============== ========================================
>>>>>>>>>>   
>>>>>>>>>> diff --git a/drivers/gpu/drm/drm_drv.c b/drivers/gpu/drm/drm_drv.c
>>>>>>>>>> index cdd591b11488..0ac723a46a91 100644
>>>>>>>>>> --- a/drivers/gpu/drm/drm_drv.c
>>>>>>>>>> +++ b/drivers/gpu/drm/drm_drv.c
>>>>>>>>>> @@ -532,6 +532,8 @@ static const char *drm_get_wedge_recovery(unsigned int opt)
>>>>>>>>>>   		return "rebind";
>>>>>>>>>>   	case DRM_WEDGE_RECOVERY_BUS_RESET:
>>>>>>>>>>   		return "bus-reset";
>>>>>>>>>> +	case DRM_WEDGE_RECOVERY_VENDOR:
>>>>>>>>>> +		return "vendor-specific";
>>>>>>>>>>   	default:
>>>>>>>>>>   		return NULL;
>>>>>>>>>>   	}
>>>>>>>>>> diff --git a/include/drm/drm_device.h b/include/drm/drm_device.h
>>>>>>>>>> index 08b3b2467c4c..08a087f149ff 100644
>>>>>>>>>> --- a/include/drm/drm_device.h
>>>>>>>>>> +++ b/include/drm/drm_device.h
>>>>>>>>>> @@ -26,10 +26,14 @@ struct pci_controller;
>>>>>>>>>>    * Recovery methods for wedged device in order of less to more side-effects.
>>>>>>>>>>    * To be used with drm_dev_wedged_event() as recovery @method. Callers can
>>>>>>>>>>    * use any one, multiple (or'd) or none depending on their needs.
>>>>>>>>>> + *
>>>>>>>>>> + * Refer to "Device Wedging" chapter in Documentation/gpu/drm-uapi.rst for more
>>>>>>>>>> + * details.
>>>>>>>>>>    */
>>>>>>>>>>   #define DRM_WEDGE_RECOVERY_NONE		BIT(0)	/* optional telemetry collection */
>>>>>>>>>>   #define DRM_WEDGE_RECOVERY_REBIND	BIT(1)	/* unbind + bind driver */
>>>>>>>>>>   #define DRM_WEDGE_RECOVERY_BUS_RESET	BIT(2)	/* unbind + reset bus device + bind */
>>>>>>>>>> +#define DRM_WEDGE_RECOVERY_VENDOR	BIT(3)	/* vendor specific recovery method */
>>>>>>>>>>   
>>>>>>>>>>   /**
>>>>>>>>>>    * struct drm_wedge_task_info - information about the guilty task of a wedge dev
>>>>>>>>>> -- 
>>>>>>>>>> 2.47.1
>>>>>>>>>>
>>>>>>>>>
>>>>>>>>
>>>>>
>>>>



^ permalink raw reply	[flat|nested] 48+ messages in thread

* Re: [PATCH v4 5/9] drm/xe/xe_survivability: Add support for Runtime survivability mode
  2025-07-10 17:12       ` Umesh Nerlige Ramappa
@ 2025-07-11  5:23         ` Riana Tauro
  0 siblings, 0 replies; 48+ messages in thread
From: Riana Tauro @ 2025-07-11  5:23 UTC (permalink / raw)
  To: Umesh Nerlige Ramappa
  Cc: intel-xe, anshuman.gupta, rodrigo.vivi, lucas.demarchi,
	aravind.iddamsetty, raag.jadav, frank.scarbrough, sk.anirban

Hi Umesh

On 7/10/2025 10:42 PM, Umesh Nerlige Ramappa wrote:
> On Thu, Jul 10, 2025 at 11:29:44AM +0530, Riana Tauro wrote:
>> Hi Umesh
>>
>> On 7/10/2025 5:14 AM, Umesh Nerlige Ramappa wrote:
>>> On Wed, Jul 09, 2025 at 04:50:17PM +0530, Riana Tauro wrote:
>>>> Certain runtime firmware errors can cause the device to be in an unusable
>>>> state requiring a firmware flash to restore normal operation.
>>>> Runtime Survivability Mode indicates firmware flash is necessary by
>>>> wedging the device and exposing survivability mode sysfs.
>>>>
>>>> The below sysfs is an indication that device is in survivability mode
>>>>
>>>> /sys/bus/pci/devices/<device>/survivability_mode
>>>>
>>>> Signed-off-by: Riana Tauro <riana.tauro@intel.com>
>>>> ---
>>>> drivers/gpu/drm/xe/xe_survivability_mode.c    | 42 ++++++++++++++++++-
>>>> drivers/gpu/drm/xe/xe_survivability_mode.h    |  1 +
>>>> .../gpu/drm/xe/xe_survivability_mode_types.h  |  1 +
>>>> 3 files changed, 43 insertions(+), 1 deletion(-)
>>>>
>>>> diff --git a/drivers/gpu/drm/xe/xe_survivability_mode.c b/drivers/gpu/drm/xe/xe_survivability_mode.c
>>>> index fefb027b1c84..ca1cfa13525a 100644
>>>> --- a/drivers/gpu/drm/xe/xe_survivability_mode.c
>>>> +++ b/drivers/gpu/drm/xe/xe_survivability_mode.c
>>>> @@ -137,7 +137,8 @@ static ssize_t survivability_mode_show(struct 
>>>> device *dev,
>>>>     struct xe_survivability_info *info = survivability->info;
>>>>     int index = 0, count = 0;
>>>>
>>>> -    count += sysfs_emit_at(buff, count, "Survivability mode type: 
>>>> Boot\n");
>>>> +    count += sysfs_emit_at(buff, count, "Survivability mode type: 
>>>> %s\n",
>>>> +                   survivability->type ? "Runtime" : "Boot");
>>>>
>>>>     if (!check_boot_failure(xe))
>>>>         return count;
>>>> @@ -288,6 +289,45 @@ bool xe_survivability_mode_is_requested(struct 
>>>> xe_device *xe)
>>>>     return check_boot_failure(xe);
>>>> }
>>>>
>>>> +/**
>>>> + * xe_survivability_mode_runtime_enable - Initialize and enable 
>>>> runtime survivability mode
>>>> + * @xe: xe device instance
>>>> + *
>>>> + * Initialize survivability information and enable runtime 
>>>> survivability mode.
>>>> + * Runtime survivability mode is enabled when certain errors cause 
>>>> the device to be
>>>> + * in non-recoverable state. The device is declared wedged with the 
>>>> appropriate
>>>> + * recovery method and survivability mode sysfs exposed to userspace
>>>> + *
>>>> + * Return: 0 if runtime survivability mode is enabled or not 
>>>> requested, negative error
>>>
>>> is the "not requested" still applicable here?
>>
>> Copied it from boot survivability. Not applicable, will remove this
>>
>>>
>>>
>>>> + * code otherwise.
>>>> + */
>>>> +int xe_survivability_mode_runtime_enable(struct xe_device *xe)
>>>> +{
>>>> +    struct xe_survivability *survivability = &xe->survivability;
>>>> +    struct pci_dev *pdev = to_pci_dev(xe->drm.dev);
>>>> +    int ret;
>>>> +
>>>> +    if (!IS_DGFX(xe) || IS_SRIOV_VF(xe) || xe->info.platform < 
>>>> XE_BATTLEMAGE) {
>>>
>>> Do you think this condition can be better handled with a 
>>> has_runtime_survivability for platforms that support it?
>>
>> Was used once so added it here. Can be split out to a different function
> 
> Oh, not a different function. I mean a has_* property. More like entries 
> defined in xe_pci_types.h under struct xe_graphics_desc.

That might be unnecessary for now, since the function already reports that
the mode is not supported on platforms prior to BMG.
If this stops being consistent in the future, we could add it per pci_desc.
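
To make the two options concrete, here is a small standalone sketch comparing
the platform-ordering check with a has_* style flag. All names in it
(sketch_platform, sketch_desc, has_runtime_survivability) are hypothetical and
only stand in for the real xe structures; it is not the driver code.

```c
#include <stdbool.h>

/* Hypothetical stand-ins for the driver's platform enum and descriptor. */
enum sketch_platform {
	SKETCH_PLATFORM_OLDER,
	SKETCH_PLATFORM_BATTLEMAGE,
	SKETCH_PLATFORM_NEWER,
};

struct sketch_desc {
	enum sketch_platform platform;
	bool is_dgfx;
	bool is_sriov_vf;
	bool has_runtime_survivability;	/* hypothetical per-descriptor flag */
};

/* Current approach: rely on platform ordering (BMG and later). */
bool supported_by_platform(const struct sketch_desc *d)
{
	return d->is_dgfx && !d->is_sriov_vf &&
	       d->platform >= SKETCH_PLATFORM_BATTLEMAGE;
}

/* Suggested alternative: an explicit flag set per platform descriptor. */
bool supported_by_flag(const struct sketch_desc *d)
{
	return d->is_dgfx && !d->is_sriov_vf && d->has_runtime_survivability;
}
```

The two checks agree as long as every platform from BMG onward supports the
feature; they diverge only if a later platform drops it, which is the case
where the flag would start to pay off.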

Thanks
Riana

> 
> Regards,
> Umesh
> 
>>>
>>>> +        dev_err(&pdev->dev, "Runtime Survivability Mode not 
>>>> supported\n");
>>>> +        return -EINVAL;
>>>> +    }
>>>> +
>>>> +    ret = init_survivability_mode(xe);
>>>> +    if (ret)
>>>> +        return ret;
>>>> +
>>>> +    ret = create_survivability_sysfs(pdev);
>>>> +    if (ret)
>>>> +        dev_err(&pdev->dev, "Failed to create survivability mode 
>>>> sysfs\n");
>>>
>>> You do not return ret in the above if condition. Is that intentional?
>>
>> yeah this is intentional. The device has to be wedged since it is not 
>> usable on such errors even without the sysfs.
>>
>> Thanks
>> Riana
>>
>>>
>>> Regards,
>>> Umesh
>>>
>>>> +
>>>> +    survivability->type = XE_SURVIVABILITY_TYPE_RUNTIME;
>>>> +    xe_device_set_wedged_method(xe, DRM_WEDGE_RECOVERY_VENDOR);
>>>> +    xe_device_declare_wedged(xe);
>>>> +
>>>> +    dev_err(&pdev->dev, "Runtime Survivability mode enabled\n");
>>>> +    return 0;
>>>> +}
>>>> +
>>>> /**
>>>>  * xe_survivability_mode_boot_enable - Initialize and enable boot 
>>>> survivability mode
>>>>  * @xe: xe device instance
>>>> diff --git a/drivers/gpu/drm/xe/xe_survivability_mode.h b/drivers/gpu/drm/xe/xe_survivability_mode.h
>>>> index f6ee283ea5e8..1cc94226aa82 100644
>>>> --- a/drivers/gpu/drm/xe/xe_survivability_mode.h
>>>> +++ b/drivers/gpu/drm/xe/xe_survivability_mode.h
>>>> @@ -11,6 +11,7 @@
>>>> struct xe_device;
>>>>
>>>> int xe_survivability_mode_boot_enable(struct xe_device *xe);
>>>> +int xe_survivability_mode_runtime_enable(struct xe_device *xe);
>>>> bool xe_survivability_mode_is_boot_enabled(struct xe_device *xe);
>>>> bool xe_survivability_mode_is_requested(struct xe_device *xe);
>>>>
>>>> diff --git a/drivers/gpu/drm/xe/xe_survivability_mode_types.h b/drivers/gpu/drm/xe/xe_survivability_mode_types.h
>>>> index 5dce393498da..cd65a5d167c9 100644
>>>> --- a/drivers/gpu/drm/xe/xe_survivability_mode_types.h
>>>> +++ b/drivers/gpu/drm/xe/xe_survivability_mode_types.h
>>>> @@ -11,6 +11,7 @@
>>>>
>>>> enum xe_survivability_type {
>>>>     XE_SURVIVABILITY_TYPE_BOOT,
>>>> +    XE_SURVIVABILITY_TYPE_RUNTIME,
>>>> };
>>>>
>>>> struct xe_survivability_info {
>>>> -- 
>>>> 2.47.1
>>>>
>>
>>


^ permalink raw reply	[flat|nested] 48+ messages in thread

* Re: [PATCH v4 7/9] drm/xe: Add support to handle hardware errors
  2025-07-10 21:09   ` Umesh Nerlige Ramappa
@ 2025-07-11  5:35     ` Riana Tauro
  2025-07-11 17:34       ` Umesh Nerlige Ramappa
  0 siblings, 1 reply; 48+ messages in thread
From: Riana Tauro @ 2025-07-11  5:35 UTC (permalink / raw)
  To: Umesh Nerlige Ramappa
  Cc: intel-xe, anshuman.gupta, rodrigo.vivi, lucas.demarchi,
	aravind.iddamsetty, raag.jadav, frank.scarbrough, sk.anirban,
	Himal Prasad Ghimiray

Hi Umesh

On 7/11/2025 2:39 AM, Umesh Nerlige Ramappa wrote:
> Resending since it got lost earlier...
> 
> On Wed, Jul 09, 2025 at 04:50:19PM +0530, Riana Tauro wrote:
>> Gfx device reports two classes of errors: uncorrectable and
>> correctable. Depending on the severity uncorrectable errors are
>> further classified as non fatal and fatal
>>
>> Correctable and non-fatal errors are reported as MSI's and bits in
>> the Master Interrupt Register indicate the class of the error.
>> The source of the error is then read from the Device Error Source
>> Register.
> 
> nit: Since Fatal is a separate category, maybe a split here into a 
> separate paragraph and some formatting would be good.
> 
>> Fatal errors are reported as PCIe errors
>> When a PCIe error is asserted, the OS will perform a device warm reset
>> which causes the driver to reload. The error registers are sticky
>> and the values are maintained through a warm reset
>>
>> Add basic support to handle these errors
>>
>> Bspec: 50875, 53073, 53074, 53075, 53076
>>
>> Co-developed-by: Himal Prasad Ghimiray <himal.prasad.ghimiray@intel.com>
>> Signed-off-by: Himal Prasad Ghimiray <himal.prasad.ghimiray@intel.com>
>> Signed-off-by: Riana Tauro <riana.tauro@intel.com>
>> ---
>> drivers/gpu/drm/xe/Makefile                |   1 +
>> drivers/gpu/drm/xe/regs/xe_hw_error_regs.h |  15 +++
>> drivers/gpu/drm/xe/regs/xe_irq_regs.h      |   1 +
>> drivers/gpu/drm/xe/xe_hw_error.c           | 108 +++++++++++++++++++++
>> drivers/gpu/drm/xe/xe_hw_error.h           |  15 +++
>> drivers/gpu/drm/xe/xe_irq.c                |   4 +
>> 6 files changed, 144 insertions(+)
>> create mode 100644 drivers/gpu/drm/xe/regs/xe_hw_error_regs.h
>> create mode 100644 drivers/gpu/drm/xe/xe_hw_error.c
>> create mode 100644 drivers/gpu/drm/xe/xe_hw_error.h
>>
>> diff --git a/drivers/gpu/drm/xe/Makefile b/drivers/gpu/drm/xe/Makefile
>> index 1d97e5b63f4e..fea8ee3b0785 100644
>> --- a/drivers/gpu/drm/xe/Makefile
>> +++ b/drivers/gpu/drm/xe/Makefile
>> @@ -73,6 +73,7 @@ xe-y += xe_bb.o \
>>     xe_hw_engine.o \
>>     xe_hw_engine_class_sysfs.o \
>>     xe_hw_engine_group.o \
>> +    xe_hw_error.o \
>>     xe_hw_fence.o \
>>     xe_irq.o \
>>     xe_lrc.o \
>> diff --git a/drivers/gpu/drm/xe/regs/xe_hw_error_regs.h b/drivers/gpu/drm/xe/regs/xe_hw_error_regs.h
>> new file mode 100644
>> index 000000000000..ed9b81fb28a0
>> --- /dev/null
>> +++ b/drivers/gpu/drm/xe/regs/xe_hw_error_regs.h
>> @@ -0,0 +1,15 @@
>> +/* SPDX-License-Identifier: MIT */
>> +/*
>> + * Copyright © 2025 Intel Corporation
>> + */
>> +
>> +#ifndef _XE_HW_ERROR_REGS_H_
>> +#define _XE_HW_ERROR_REGS_H_
>> +
>> +#define DEV_ERR_STAT_NONFATAL            0x100178
>> +#define DEV_ERR_STAT_CORRECTABLE        0x10017c
>> +#define DEV_ERR_STAT_REG(x)            XE_REG(_PICK_EVEN((x), \
>> +                                  DEV_ERR_STAT_CORRECTABLE, \
>> +                                  DEV_ERR_STAT_NONFATAL))
> 
> For x = 1 and x = 2, I don't see the above result in correct values. Can
> you please double check?

I was confused by the same thing when I took the patch from the other
series. But the second factor of the macro, (__b) - (__a), is negative, so
the resulting register offsets are correct.

Calculations for 1 and 2

#define _PICK_EVEN(__index, __a, __b) ((__a) + (__index) * ((__b) - (__a)))

_PICK_EVEN([HARDWARE_ERROR_NONFATAL = 1])
	= DEV_ERR_STAT_CORRECTABLE + 1 * (DEV_ERR_STAT_NONFATAL - DEV_ERR_STAT_CORRECTABLE)
	= 0x10017c + 1 * (0x100178 - 0x10017c)
	= 0x100178

_PICK_EVEN([HARDWARE_ERROR_FATAL = 2])
	= DEV_ERR_STAT_CORRECTABLE + 2 * (DEV_ERR_STAT_NONFATAL - DEV_ERR_STAT_CORRECTABLE)
	= 0x10017c + 2 * (0x100178 - 0x10017c)
	= 0x100174
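
The same arithmetic can be checked at compile time with a small standalone
sketch. The macro body and the two register offsets are copied from the patch;
PICK_EVEN (without the leading underscore) and dev_err_stat_offset() are names
made up for this userspace example, and the third value is simply what the
macro yields for index 2.

```c
#include <stdint.h>

/*
 * Same arithmetic as _PICK_EVEN() in the patch; renamed here only to avoid
 * a reserved identifier in a userspace build.
 */
#define PICK_EVEN(index, a, b)	((a) + (index) * ((b) - (a)))

#define DEV_ERR_STAT_NONFATAL		0x100178
#define DEV_ERR_STAT_CORRECTABLE	0x10017c

enum hardware_error {
	HARDWARE_ERROR_CORRECTABLE = 0,
	HARDWARE_ERROR_NONFATAL = 1,
	HARDWARE_ERROR_FATAL = 2,
	HARDWARE_ERROR_MAX,
};

/* The step between registers is negative (-4), so offsets descend. */
_Static_assert(PICK_EVEN(HARDWARE_ERROR_CORRECTABLE, DEV_ERR_STAT_CORRECTABLE,
			 DEV_ERR_STAT_NONFATAL) == 0x10017c, "index 0");
_Static_assert(PICK_EVEN(HARDWARE_ERROR_NONFATAL, DEV_ERR_STAT_CORRECTABLE,
			 DEV_ERR_STAT_NONFATAL) == 0x100178, "index 1");
_Static_assert(PICK_EVEN(HARDWARE_ERROR_FATAL, DEV_ERR_STAT_CORRECTABLE,
			 DEV_ERR_STAT_NONFATAL) == 0x100174, "index 2");

/* Hypothetical helper: offset selected for a given error class. */
uint32_t dev_err_stat_offset(enum hardware_error hw_err)
{
	return PICK_EVEN((int)hw_err, DEV_ERR_STAT_CORRECTABLE,
			 DEV_ERR_STAT_NONFATAL);
}
```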

Thanks
Riana
		

> 
> What about DEV_ERR_STAT_FATAL?
> 
> Rest looks good,
> 
> Umesh
> 
>> +
>> +#endif
>> diff --git a/drivers/gpu/drm/xe/regs/xe_irq_regs.h b/drivers/gpu/drm/xe/regs/xe_irq_regs.h
>> index f0ecfcac4003..2758b64cec9e 100644
>> --- a/drivers/gpu/drm/xe/regs/xe_irq_regs.h
>> +++ b/drivers/gpu/drm/xe/regs/xe_irq_regs.h
>> @@ -18,6 +18,7 @@
>> #define GFX_MSTR_IRQ                XE_REG(0x190010, XE_REG_OPTION_VF)
>> #define   MASTER_IRQ                REG_BIT(31)
>> #define   GU_MISC_IRQ                REG_BIT(29)
>> +#define   ERROR_IRQ(x)                REG_BIT(26 + (x))
>> #define   DISPLAY_IRQ                REG_BIT(16)
>> #define   GT_DW_IRQ(x)                REG_BIT(x)
>>
>> diff --git a/drivers/gpu/drm/xe/xe_hw_error.c b/drivers/gpu/drm/xe/xe_hw_error.c
>> new file mode 100644
>> index 000000000000..0f2590839900
>> --- /dev/null
>> +++ b/drivers/gpu/drm/xe/xe_hw_error.c
>> @@ -0,0 +1,108 @@
>> +// SPDX-License-Identifier: MIT
>> +/*
>> + * Copyright © 2025 Intel Corporation
>> + */
>> +
>> +#include "regs/xe_hw_error_regs.h"
>> +#include "regs/xe_irq_regs.h"
>> +
>> +#include "xe_device.h"
>> +#include "xe_hw_error.h"
>> +#include "xe_mmio.h"
>> +
>> +/* Error categories reported by hardware */
>> +enum hardware_error {
>> +    HARDWARE_ERROR_CORRECTABLE = 0,
>> +    HARDWARE_ERROR_NONFATAL = 1,
>> +    HARDWARE_ERROR_FATAL = 2,
>> +    HARDWARE_ERROR_MAX,
>> +};
>> +
>> +static const char *hw_error_to_str(const enum hardware_error hw_err)
>> +{
>> +    switch (hw_err) {
>> +    case HARDWARE_ERROR_CORRECTABLE:
>> +        return "CORRECTABLE";
>> +    case HARDWARE_ERROR_NONFATAL:
>> +        return "NONFATAL";
>> +    case HARDWARE_ERROR_FATAL:
>> +        return "FATAL";
>> +    default:
>> +        return "UNKNOWN";
>> +    }
>> +}
>> +
>> +static void hw_error_source_handler(struct xe_tile *tile, const enum 
>> hardware_error hw_err)
>> +{
>> +    const char *hw_err_str = hw_error_to_str(hw_err);
>> +    struct xe_device *xe = tile_to_xe(tile);
>> +    unsigned long flags;
>> +    u32 err_src;
>> +
>> +    if (xe->info.platform != XE_BATTLEMAGE)
>> +        return;
>> +
>> +    spin_lock_irqsave(&xe->irq.lock, flags);
>> +    err_src = xe_mmio_read32(&tile->mmio, DEV_ERR_STAT_REG(hw_err));
>> +    if (!err_src) {
>> +        drm_err_ratelimited(&xe->drm, HW_ERR "Tile%d reported 
>> DEV_ERR_STAT_%s blank!\n",
>> +                    tile->id, hw_err_str);
>> +        goto unlock;
>> +    }
>> +
>> +    /* TODO: Process errrors per source */
>> +
>> +    xe_mmio_write32(&tile->mmio, DEV_ERR_STAT_REG(hw_err), err_src);
>> +
>> +unlock:
>> +    spin_unlock_irqrestore(&xe->irq.lock, flags);
>> +}
>> +
>> +/**
>> + * xe_hw_error_irq_handler - irq handling for hw errors
>> + * @tile: tile instance
>> + * @master_ctl: value read from master interrupt register
>> + *
>> + * Xe platforms add three error bits to the master interrupt register 
>> to support error handling.
>> + * These three bits are used to convey the class of error FATAL, 
>> NONFATAL, or CORRECTABLE.
>> + * To process the interrupt, determine the source of error by reading 
>> the Device Error Source
>> + * Register that corresponds to the class of error being serviced.
>> + */
>> +void xe_hw_error_irq_handler(struct xe_tile *tile, const u32 master_ctl)
>> +{
>> +    enum hardware_error hw_err;
>> +
>> +    for (hw_err = 0; hw_err < HARDWARE_ERROR_MAX; hw_err++)
>> +        if (master_ctl & ERROR_IRQ(hw_err))
>> +            hw_error_source_handler(tile, hw_err);
>> +}
>> +
>> +/*
>> + * Process hardware errors during boot
>> + */
>> +static void process_hw_errors(struct xe_device *xe)
>> +{
>> +    struct xe_tile *tile;
>> +    u32 master_ctl;
>> +    u8 id;
>> +
>> +    for_each_tile(tile, xe, id) {
>> +        master_ctl = xe_mmio_read32(&tile->mmio, GFX_MSTR_IRQ);
>> +        xe_hw_error_irq_handler(tile, master_ctl);
>> +        xe_mmio_write32(&tile->mmio, GFX_MSTR_IRQ, master_ctl);
>> +    }
>> +}
>> +
>> +/**
>> + * xe_hw_error_init - Initialize hw errors
>> + * @xe: xe device instance
>> + *
>> + * Initialize and process hw errors
>> + */
>> +void xe_hw_error_init(struct xe_device *xe)
>> +{
>> +    if (!IS_DGFX(xe) || IS_SRIOV_VF(xe))
>> +        return;
>> +
>> +    process_hw_errors(xe);
>> +}
>> diff --git a/drivers/gpu/drm/xe/xe_hw_error.h b/drivers/gpu/drm/xe/xe_hw_error.h
>> new file mode 100644
>> index 000000000000..d86e28c5180c
>> --- /dev/null
>> +++ b/drivers/gpu/drm/xe/xe_hw_error.h
>> @@ -0,0 +1,15 @@
>> +/* SPDX-License-Identifier: MIT */
>> +/*
>> + * Copyright © 2025 Intel Corporation
>> + */
>> +#ifndef XE_HW_ERROR_H_
>> +#define XE_HW_ERROR_H_
>> +
>> +#include <linux/types.h>
>> +
>> +struct xe_tile;
>> +struct xe_device;
>> +
>> +void xe_hw_error_irq_handler(struct xe_tile *tile, const u32 
>> master_ctl);
>> +void xe_hw_error_init(struct xe_device *xe);
>> +#endif
>> diff --git a/drivers/gpu/drm/xe/xe_irq.c b/drivers/gpu/drm/xe/xe_irq.c
>> index 5362d3174b06..24ccf3bec52c 100644
>> --- a/drivers/gpu/drm/xe/xe_irq.c
>> +++ b/drivers/gpu/drm/xe/xe_irq.c
>> @@ -18,6 +18,7 @@
>> #include "xe_gt.h"
>> #include "xe_guc.h"
>> #include "xe_hw_engine.h"
>> +#include "xe_hw_error.h"
>> #include "xe_memirq.h"
>> #include "xe_mmio.h"
>> #include "xe_pxp.h"
>> @@ -466,6 +467,7 @@ static irqreturn_t dg1_irq_handler(int irq, void 
>> *arg)
>>         xe_mmio_write32(mmio, GFX_MSTR_IRQ, master_ctl);
>>
>>         gt_irq_handler(tile, master_ctl, intr_dw, identity);
>> +        xe_hw_error_irq_handler(tile, master_ctl);
>>
>>         /*
>>          * Display interrupts (including display backlight operations
>> @@ -753,6 +755,8 @@ int xe_irq_install(struct xe_device *xe)
>>     int nvec = 1;
>>     int err;
>>
>> +    xe_hw_error_init(xe);
>> +
>>     xe_irq_reset(xe);
>>
>>     if (xe_device_has_msix(xe)) {
>> -- 
>> 2.47.1
>>


^ permalink raw reply	[flat|nested] 48+ messages in thread

* Re: [PATCH v4 6/9] drm/xe/doc: Document device wedged and runtime survivability
  2025-07-09 11:20 ` [PATCH v4 6/9] drm/xe/doc: Document device wedged and runtime survivability Riana Tauro
@ 2025-07-11  5:39   ` Raag Jadav
  2025-07-11  6:09     ` Riana Tauro
  0 siblings, 1 reply; 48+ messages in thread
From: Raag Jadav @ 2025-07-11  5:39 UTC (permalink / raw)
  To: Riana Tauro
  Cc: intel-xe, anshuman.gupta, rodrigo.vivi, lucas.demarchi,
	aravind.iddamsetty, umesh.nerlige.ramappa, frank.scarbrough,
	sk.anirban

On Wed, Jul 09, 2025 at 04:50:18PM +0530, Riana Tauro wrote:
> Add documentation for vendor specific device wedged recovery method
> and runtime survivability.

...

> + * Runtime Survivability
> + * =====================
> + *
> + * Certain runtime firmware errors can cause the device to enter a wedged state
> + * (:ref:`xe-device-wedging`) requiring a firmware flash to restore normal operation.
> + * Runtime Survivability Mode indicates that a firmware flash is necessary to recover the device and
> + * is indicated by the presence of survivability mode sysfs::
> + *
> + *	/sys/bus/pci/devices/<device>/survivability_mode
> + *
> + * Survivability mode sysfs provides information about the type of survivability mode.
> + *
> + * When such errors occur, userspace is notified with the drm device wedged uevent and runtime
> + * survivability mode. User can then initiate a firmware flash to restore device to normal
> + * operation.

Do we have a definition of the actual procedure? Can we add a reference to it?
Otherwise it's telling me to do something I have no idea how to do.
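
The detection half, at least, is easy to sketch from userspace. The following
is only an illustration: the sysfs path comes from the documentation quoted
above, while survivability_sysfs_path() and survivability_mode_present() are
hypothetical helper names, and the PCI address is supplied by the caller. It
says nothing about the flashing procedure itself.

```c
#include <stdio.h>
#include <unistd.h>

/*
 * Build the survivability_mode sysfs path for a PCI device address such as
 * "0000:03:00.0". Returns 0 on success, -1 if the buffer is too small.
 */
int survivability_sysfs_path(const char *bdf, char *buf, size_t len)
{
	int n = snprintf(buf, len,
			 "/sys/bus/pci/devices/%s/survivability_mode", bdf);

	return (n < 0 || (size_t)n >= len) ? -1 : 0;
}

/* Returns 1 if the sysfs entry exists for the device, 0 otherwise. */
int survivability_mode_present(const char *bdf)
{
	char path[256];

	if (survivability_sysfs_path(bdf, path, sizeof(path)))
		return 0;

	return access(path, F_OK) == 0;
}
```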

Raag

^ permalink raw reply	[flat|nested] 48+ messages in thread

* Re: [PATCH v4 8/9] drm/xe/xe_hw_error: Handle CSC Firmware reported Hardware errors
  2025-07-11  0:36   ` Umesh Nerlige Ramappa
@ 2025-07-11  5:46     ` Riana Tauro
  2025-07-11 17:38       ` Umesh Nerlige Ramappa
  0 siblings, 1 reply; 48+ messages in thread
From: Riana Tauro @ 2025-07-11  5:46 UTC (permalink / raw)
  To: Umesh Nerlige Ramappa
  Cc: intel-xe, anshuman.gupta, rodrigo.vivi, lucas.demarchi,
	aravind.iddamsetty, raag.jadav, frank.scarbrough, sk.anirban

Hi Umesh

On 7/11/2025 6:06 AM, Umesh Nerlige Ramappa wrote:
> On Wed, Jul 09, 2025 at 04:50:20PM +0530, Riana Tauro wrote:
>> Add support to handle CSC firmware reported errors. When CSC firmware
>> errors are encountered, an error interrupt is received by the GFX device as
>> an MSI interrupt.
>>
>> The Device Source control registers indicate the source of the error as CSC.
>> The HEC error status register indicates that the error is firmware
>> reported.
>> Depending on the type of error, the error cause is written to the HEC
>> Firmware error register.
>>
>> On encountering such CSC firmware errors, the graphics device is
>> non-recoverable from driver context. The only way to recover from these
>> errors is a firmware flash. The device is then wedged and userspace is
>> notified with a drm uevent.
>>
>> v2: use vendor recovery method with
>>    runtime survivability (Christian, Rodrigo, Raag)
>>
>> v3: move declare wedged to runtime survivability mode (Rodrigo)
>>
>> Signed-off-by: Riana Tauro <riana.tauro@intel.com>
>> ---
>> drivers/gpu/drm/xe/regs/xe_gsc_regs.h      |  2 +
>> drivers/gpu/drm/xe/regs/xe_hw_error_regs.h |  7 ++-
>> drivers/gpu/drm/xe/xe_device_types.h       |  3 +
>> drivers/gpu/drm/xe/xe_hw_error.c           | 68 +++++++++++++++++++++-
>> 4 files changed, 78 insertions(+), 2 deletions(-)
>>
>> diff --git a/drivers/gpu/drm/xe/regs/xe_gsc_regs.h b/drivers/gpu/drm/ 
>> xe/regs/xe_gsc_regs.h
>> index 9b66cc972a63..180be82672ab 100644
>> --- a/drivers/gpu/drm/xe/regs/xe_gsc_regs.h
>> +++ b/drivers/gpu/drm/xe/regs/xe_gsc_regs.h
>> @@ -13,6 +13,8 @@
>>
>> /* Definitions of GSC H/W registers, bits, etc */
>>
>> +#define BMG_GSC_HECI1_BASE    0x373000
>> +
>> #define MTL_GSC_HECI1_BASE    0x00116000
>> #define MTL_GSC_HECI2_BASE    0x00117000
>>
>> diff --git a/drivers/gpu/drm/xe/regs/xe_hw_error_regs.h b/drivers/gpu/ 
>> drm/xe/regs/xe_hw_error_regs.h
>> index ed9b81fb28a0..c146b9ef44eb 100644
>> --- a/drivers/gpu/drm/xe/regs/xe_hw_error_regs.h
>> +++ b/drivers/gpu/drm/xe/regs/xe_hw_error_regs.h
>> @@ -6,10 +6,15 @@
>> #ifndef _XE_HW_ERROR_REGS_H_
>> #define _XE_HW_ERROR_REGS_H_
>>
>> +#define HEC_UNCORR_ERR_STATUS(base)                    XE_REG((base) 
>> + 0x118)
>> +#define    UNCORR_FW_REPORTED_ERR                      BIT(6)
>> +
>> +#define HEC_UNCORR_FW_ERR_DW0(base)                    XE_REG((base) 
>> + 0x124)
>> +
>> #define DEV_ERR_STAT_NONFATAL            0x100178
>> #define DEV_ERR_STAT_CORRECTABLE        0x10017c
>> #define DEV_ERR_STAT_REG(x)            XE_REG(_PICK_EVEN((x), \
>>                                   DEV_ERR_STAT_CORRECTABLE, \
>>                                   DEV_ERR_STAT_NONFATAL))
>> -
>> +#define   XE_CSC_ERROR                BIT(17)
>> #endif
>> diff --git a/drivers/gpu/drm/xe/xe_device_types.h b/drivers/gpu/drm/ 
>> xe/xe_device_types.h
>> index ca300338e8c2..283d5c88758e 100644
>> --- a/drivers/gpu/drm/xe/xe_device_types.h
>> +++ b/drivers/gpu/drm/xe/xe_device_types.h
>> @@ -241,6 +241,9 @@ struct xe_tile {
>>     /** @memirq: Memory Based Interrupts. */
>>     struct xe_memirq memirq;
>>
>> +    /** @csc_hw_error_work: worker to report CSC HW errors */
>> +    struct work_struct csc_hw_error_work;
>> +
>>     /** @pcode: tile's PCODE */
>>     struct {
>>         /** @pcode.lock: protecting tile's PCODE mailbox data */
>> diff --git a/drivers/gpu/drm/xe/xe_hw_error.c b/drivers/gpu/drm/xe/ 
>> xe_hw_error.c
>> index 0f2590839900..7cc9b8a7fa1a 100644
>> --- a/drivers/gpu/drm/xe/xe_hw_error.c
>> +++ b/drivers/gpu/drm/xe/xe_hw_error.c
>> @@ -3,12 +3,16 @@
>>  * Copyright © 2025 Intel Corporation
>>  */
>>
>> +#include "regs/xe_gsc_regs.h"
>> #include "regs/xe_hw_error_regs.h"
>> #include "regs/xe_irq_regs.h"
>>
>> #include "xe_device.h"
>> #include "xe_hw_error.h"
>> #include "xe_mmio.h"
>> +#include "xe_survivability_mode.h"
>> +
>> +#define  HEC_UNCORR_FW_ERR_BITS 4
>>
>> /* Error categories reported by hardware */
>> enum hardware_error {
>> @@ -18,6 +22,13 @@ enum hardware_error {
>>     HARDWARE_ERROR_MAX,
>> };
>>
>> +static const char * const hec_uncorrected_fw_errors[] = {
>> +    "Fatal",
>> +    "CSE Disabled",
>> +    "FD Corruption",
>> +    "Data Corruption"
>> +};
>> +
>> static const char *hw_error_to_str(const enum hardware_error hw_err)
>> {
>>     switch (hw_err) {
>> @@ -32,6 +43,56 @@ static const char *hw_error_to_str(const enum 
>> hardware_error hw_err)
>>     }
>> }
>>
>> +static void csc_hw_error_work(struct work_struct *work)
>> +{
>> +    struct xe_tile *tile = container_of(work, typeof(*tile), 
>> csc_hw_error_work);
>> +    struct xe_device *xe = tile_to_xe(tile);
>> +    int ret;
>> +
>> +    ret = xe_survivability_mode_runtime_enable(xe);
> 
> xe_survivability_mode_runtime_enable() returns if it's not BMG, not dgfx 
> etc., so does it make sense to not even queue the work if those 
> conditions are not met?

CSC work is only scheduled for BMG in the handler below.
The bit is not present on prior platforms.
> 
>> +    if (ret)
>> +        drm_err(&xe->drm, "Failed to enable runtime survivability 
>> mode\n");
>> +}
>> +
>> +static void csc_hw_error_handler(struct xe_tile *tile, const enum 
>> hardware_error hw_err)
>> +{
>> +    const char *hw_err_str = hw_error_to_str(hw_err);
>> +    struct xe_device *xe = tile_to_xe(tile);
>> +    struct xe_mmio *mmio = &tile->mmio;
>> +    u32 base, err_bit, err_src;
>> +    unsigned long fw_err;
>> +
>> +    if (xe->info.platform != XE_BATTLEMAGE)
>> +        return;
>> +
>> +    /* Not supported in BMG */
>> +    if (hw_err == HARDWARE_ERROR_CORRECTABLE)
>> +        return;
>> +
>> +    base = BMG_GSC_HECI1_BASE;
>> +    lockdep_assert_held(&xe->irq.lock);
>> +    err_src = xe_mmio_read32(mmio, HEC_UNCORR_ERR_STATUS(base));
>> +    if (!err_src) {
>> +        drm_err_ratelimited(&xe->drm, HW_ERR "Tile%d reported 
>> HEC_ERR_STATUS_%s blank\n",
>> +                    tile->id, hw_err_str);
>> +        return;
>> +    }
>> +
>> +    if (err_src & UNCORR_FW_REPORTED_ERR) {
>> +        fw_err = xe_mmio_read32(mmio, HEC_UNCORR_FW_ERR_DW0(base));
>> +        for_each_set_bit(err_bit, &fw_err, HEC_UNCORR_FW_ERR_BITS) {
>> +            drm_err_ratelimited(&xe->drm, HW_ERR
>> +                        "%s: HEC Uncorrected FW %s error reported, 
>> bit[%d] is set\n",
>> +                         hw_err_str, hec_uncorrected_fw_errors[err_bit],
>> +                         err_bit);
>> +
>> +            schedule_work(&tile->csc_hw_error_work);
>> +        }
>> +    }
>> +
>> +    xe_mmio_write32(mmio, HEC_UNCORR_ERR_STATUS(base), err_src);
>> +}
>> +
>> static void hw_error_source_handler(struct xe_tile *tile, const enum 
>> hardware_error hw_err)
>> {
>>     const char *hw_err_str = hw_error_to_str(hw_err);
>> @@ -50,7 +111,8 @@ static void hw_error_source_handler(struct xe_tile 
>> *tile, const enum hardware_er
>>         goto unlock;
>>     }
>>
>> -    /* TODO: Process errrors per source */
>> +    if (err_src & XE_CSC_ERROR)
>> +        csc_hw_error_handler(tile, hw_err);
>>
>>     xe_mmio_write32(&tile->mmio, DEV_ERR_STAT_REG(hw_err), err_src);
>>
>> @@ -101,8 +163,12 @@ static void process_hw_errors(struct xe_device *xe)
>>  */
>> void xe_hw_error_init(struct xe_device *xe)
>> {
>> +    struct xe_tile *tile = xe_device_get_root_tile(xe);
>> +
>>     if (!IS_DGFX(xe) || IS_SRIOV_VF(xe))
>>         return;
>>
>> +    INIT_WORK(&tile->csc_hw_error_work, csc_hw_error_work);
> 
> Same here, why have a worker if it's not BMG?
> 
> Also, reiterating a previous comment in another patch - if the feature 
> can be defined as a has_ struct member in the pci/gt info that could 
> streamline the checks.

This is only initialization. The queueing is done in the handler.
If it is supported from a particular platform then it seems unnecessary.
Should I add a function instead?

Thanks,
Riana

> 
> Thanks,
> Umesh
> 
>> +
>>     process_hw_errors(xe);
>> }
>> -- 
>> 2.47.1
>>
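
For readers following the handler quoted above, the decode loop can be mirrored in a few lines of userspace Python. This is only an illustrative sketch, not driver code: the error names and the four-bit width are copied from the patch, while `decode_fw_errors` is a hypothetical helper name.

```python
from typing import List

# Copied from the quoted patch: only the low four bits carry a named error.
HEC_UNCORR_FW_ERR_BITS = 4

HEC_UNCORRECTED_FW_ERRORS = (
    "Fatal",
    "CSE Disabled",
    "FD Corruption",
    "Data Corruption",
)


def decode_fw_errors(fw_err: int) -> List[str]:
    """Return the name of every set bit among the low HEC_UNCORR_FW_ERR_BITS bits.

    Hypothetical helper: it mirrors the for_each_set_bit() loop in the quoted
    csc_hw_error_handler(), but runs on a plain integer instead of an MMIO read.
    """
    return [
        HEC_UNCORRECTED_FW_ERRORS[bit]
        for bit in range(HEC_UNCORR_FW_ERR_BITS)
        if fw_err & (1 << bit)
    ]


if __name__ == "__main__":
    print(decode_fw_errors(0b0101))
```

With a register value of 0b0101, bits 0 and 2 are set, so the sketch reports "Fatal" and "FD Corruption". Bits above the fourth are ignored, as in the quoted loop.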



^ permalink raw reply	[flat|nested] 48+ messages in thread

* Re: [PATCH v4 1/9] drm: Add a vendor-specific recovery method to device wedged uevent
  2025-07-11  5:17                     ` Riana Tauro
@ 2025-07-11  6:08                       ` Raag Jadav
  0 siblings, 0 replies; 48+ messages in thread
From: Raag Jadav @ 2025-07-11  6:08 UTC (permalink / raw)
  To: Riana Tauro
  Cc: Rodrigo Vivi, Christian König, Simona Vetter, intel-xe,
	anshuman.gupta, lucas.demarchi, aravind.iddamsetty,
	umesh.nerlige.ramappa, frank.scarbrough, sk.anirban,
	André Almeida, David Airlie, dri-devel

On Fri, Jul 11, 2025 at 10:47:39AM +0530, Riana Tauro wrote:
> On 7/11/2025 3:16 AM, Raag Jadav wrote:
> > On Thu, Jul 10, 2025 at 03:00:06PM -0400, Rodrigo Vivi wrote:
> > > On Thu, Jul 10, 2025 at 01:24:52PM +0300, Raag Jadav wrote:
> > > > On Thu, Jul 10, 2025 at 11:37:14AM +0200, Christian König wrote:
> > > > > On 10.07.25 11:01, Simona Vetter wrote:
> > > > > > On Wed, Jul 09, 2025 at 12:52:05PM -0400, Rodrigo Vivi wrote:
> > > > > > > On Wed, Jul 09, 2025 at 05:18:54PM +0300, Raag Jadav wrote:
> > > > > > > > On Wed, Jul 09, 2025 at 04:09:20PM +0200, Christian König wrote:
> > > > > > > > > On 09.07.25 15:41, Simona Vetter wrote:
> > > > > > > > > > On Wed, Jul 09, 2025 at 04:50:13PM +0530, Riana Tauro wrote:
> > > > > > > > > > > Certain errors can cause the device to be wedged and may
> > > > > > > > > > > require a vendor specific recovery method to restore normal
> > > > > > > > > > > operation.
> > > > > > > > > > > 
> > > > > > > > > > > Add a recovery method 'WEDGED=vendor-specific' for such errors. Vendors
> > > > > > > > > > > must provide additional recovery documentation if this method
> > > > > > > > > > > is used.
> > > > > > > > > > > 
> > > > > > > > > > > v2: fix documentation (Raag)
> > > > > > > > > > > 
> > > > > > > > > > > Cc: André Almeida <andrealmeid@igalia.com>
> > > > > > > > > > > Cc: Christian König <christian.koenig@amd.com>
> > > > > > > > > > > Cc: David Airlie <airlied@gmail.com>
> > > > > > > > > > > Cc: <dri-devel@lists.freedesktop.org>
> > > > > > > > > > > Suggested-by: Raag Jadav <raag.jadav@intel.com>
> > > > > > > > > > > Signed-off-by: Riana Tauro <riana.tauro@intel.com>
> > > > > > > > > > 
> > > > > > > > > > I'm not really understanding what this is useful for, maybe concrete
> > > > > > > > > > example in the form of driver code that uses this, and some tool or
> > > > > > > > > > documentation steps that should be taken for recovery?
> > > > > > > 
> > > > > > > The case here is when FW underneath identified something badly corrupted on
> > > > > > > FW land and decided that only a firmware-flashing could solve the day and
> > > > > > > raise interrupt to the driver. At that point we want to wedge, but immediately
> > > > > > > hint the admin the recommended action.
> > > > > > > 
> > > > > > > > > 
> > > > > > > > > The recovery method for this particular case is to flash in a new firmware.
> > > > > > > > > 
> > > > > > > > > > The issues I'm seeing here is that eventually we'll get different
> > > > > > > > > > vendor-specific recovery steps, and maybe even on the same device, and
> > > > > > > > > > that leads us to an enumeration issue. Since it's just a string and an
> > > > > > > > > > enum I think it'd be better to just allocate a new one every time there's
> > > > > > > > > > a new strange recovery method instead of this opaque approach.
> > > > > > > > > 
> > > > > > > > > That is exactly the opposite of what we discussed so far.
> > > > > > 
> > > > > > Sorry, I missed that context.
> > > > > > 
> > > > > > > > > The original idea was to add a firmware-flash recovery method which
> > > > > > > > > looked a bit vague since it didn't give any information on what to do
> > > > > > > > > exactly.
> > > > > > > > > 
> > > > > > > > > That's why I suggested to add a more generic vendor-specific event
> > > > > > > > > which refers to the documentation and system log to see what actually
> > > > > > > > > needs to be done.
> > > > > > > > > 
> > > > > > > > > Otherwise we would end up with events like firmware-flash, update FW
> > > > > > > > > image A, update FW image B, FW version mismatch etc....
> > > > > > 
> > > > > > Yeah, that's kinda what I expect to happen, and we have enough numbers for
> > > > > > this all to not be an issue.
> > > > > > 
> > > > > > > > Agree. Any newly allocated method that is specific to a vendor is going to
> > > > > > > > be opaque anyway, since it can't be generic for all drivers. This just helps
> > > > > > > > reduce the noise in DRM core.
> > > > > > > > 
> > > > > > > > And yes, there could be different vendor-specific cases for the same driver
> > > > > > > > and the driver should be able to provide the means to distinguish between
> > > > > > > > them.
> > > > > > > 
> > > > > > > Sim, what's your take on this then?
> > > > > > > 
> > > > > > > Should we get back to the original idea of firmware-flash?
> > > > > > 
> > > > > > Maybe intel-firmware-flash or something, meaning prefix with the vendor?
> > > > > > 
> > > > > > The reason I think it should be specific is because I'm assuming you want
> > > > > > to script this. And if you have a big fleet with different vendors, then
> > > > > > "vendor-specific" doesn't tell you enough. But if it's something like
> > > > > > $vendor-$magic_step then it does become scriptable, and we do have a
> > > > > > place to put some documentation on what you should do instead.
> > > > > > 
> > > > > > If the point of this interface isn't that it's scriptable, then I'm not
> > > > > > sure why it needs to be an uevent?
> > > > > 
> > > > > You should probably read up on the previous discussion, cause that is exactly what I asked as well :)
> > > > > 
> > > > > And no, it should *not* be scripted. That would be a bit brave for a firmware update where you should absolutely not power down the system for example.
> > > 
> > > I also don't like the idea or even the thought of scripting something like
> > > a firmware-flash. But only to fail with a better pin point to make admin
> > > lives easier with a notification.
> > > 
> > > > > 
> > > > > In my understanding the new value "vendor-specific" basically means it is a known issue with a documented solution, while "unknown" means the driver has no idea how to solve it.
> > > 
> > > Exactly, the hardware and firmware are giving the indication of what should be
> > > done. It is not 'unknown'.
> > > 
> > > > 
> > > > Yes, and since the recovery procedure is defined and known to the consumer,
> > > > it can potentially be automated (at least for non-firmware cases).
> > > > 
> > > > > > I guess if you all want to stick with vendor-specific then I think that's
> > > 
> > > Well, I would honestly prefer a direct firmware-flash, but if that is not
> > > usable by other vendors and there's a push back on that, let's go with
> > > the vendor-specific then.
> > 
> > I think the procedure for firmware-flash is vendor specific, so the wedged event
> > alone is not sufficient either way. The consumer will need more guidance from
> > vendor documentation.
> 
> Procedure of firmware-flash is vendor specific, but the term
> 'firmware-flash' is still generic. The patch doesn't mention any vendor
> specific firmware or procedure. The push back was for the number of macros
> that can be added for other operations.

A common procedure for the methods is what makes them agnostic and usable
for all drivers. Otherwise it's pretty much chaos for the consumer.

> > With vendor-specific method, the driver has the opportunity to cover as many
> > cases as it wants without having to create a new method every time, and face the
> > same dilemma of being vendor agnostic.
> > 
> > > > > > ok with me too, but the docs should at least explain how to figure out
> > > > > > from the uevent which vendor you're on with a small example. What I'm
> > > > > > worried is that if we have this on multiple drivers userspace will
> > > > > > otherwise make a complete mess and might want to run the wrong recovery
> > > > > > steps.
> > > > 
> > > > The device id along with driver can be identified from uevent (probably
> > > > available inside DEVPATH somewhere) to distinguish the vendor. So the consumer
> > > > already knows if the device fits the criteria for recovery.
> > > > 
> > > > > > I think ideally, no matter what, we'd have a concrete driver patch which
> > > > > > then also comes with the documentation for what exactly you're supposed to
> > > > > > do as something you can script. And not just this stand-alone patch here.
> > > > 
> > > > Perhaps the rest of the series didn't make it to dri-devel, which will answer
> > > > most of the above.
> > > 
> > > Riana, could you please try to provide a bit more documentation like Sima
> > > asked and re-send the entire series to dri-devel?
> 
> Sure will send the entire series to dri-devel. The documentation is present
> in the series.
> 
> > 
> > With the ideas in this thread also documented so that we don't end up repeating
> > the same discussion.
> It is mentioned in the cover letter but I didn't send it to dri-devel. Will add
> more details.

Thank you.

Raag

> > > > > > > > > > > ---
> > > > > > > > > > >   Documentation/gpu/drm-uapi.rst | 9 +++++----
> > > > > > > > > > >   drivers/gpu/drm/drm_drv.c      | 2 ++
> > > > > > > > > > >   include/drm/drm_device.h       | 4 ++++
> > > > > > > > > > >   3 files changed, 11 insertions(+), 4 deletions(-)
> > > > > > > > > > > 
> > > > > > > > > > > diff --git a/Documentation/gpu/drm-uapi.rst b/Documentation/gpu/drm-uapi.rst
> > > > > > > > > > > index 263e5a97c080..c33070bdb347 100644
> > > > > > > > > > > --- a/Documentation/gpu/drm-uapi.rst
> > > > > > > > > > > +++ b/Documentation/gpu/drm-uapi.rst
> > > > > > > > > > > @@ -421,10 +421,10 @@ Recovery
> > > > > > > > > > >   Current implementation defines three recovery methods, out of which, drivers
> > > > > > > > > > >   can use any one, multiple or none. Method(s) of choice will be sent in the
> > > > > > > > > > >   uevent environment as ``WEDGED=<method1>[,..,<methodN>]`` in order of less to
> > > > > > > > > > > -more side-effects. If driver is unsure about recovery or method is unknown
> > > > > > > > > > > -(like soft/hard system reboot, firmware flashing, physical device replacement
> > > > > > > > > > > -or any other procedure which can't be attempted on the fly), ``WEDGED=unknown``
> > > > > > > > > > > -will be sent instead.
> > > > > > > > > > > +more side-effects. If recovery method is specific to vendor
> > > > > > > > > > > +``WEDGED=vendor-specific`` will be sent and userspace should refer to vendor
> > > > > > > > > > > +specific documentation for further recovery steps. If driver is unsure about
> > > > > > > > > > > +recovery or method is unknown, ``WEDGED=unknown`` will be sent instead
> > > > > > > > > > >   Userspace consumers can parse this event and attempt recovery as per the
> > > > > > > > > > >   following expectations.
> > > > > > > > > > > @@ -435,6 +435,7 @@ following expectations.
> > > > > > > > > > >       none            optional telemetry collection
> > > > > > > > > > >       rebind          unbind + bind driver
> > > > > > > > > > >       bus-reset       unbind + bus reset/re-enumeration + bind
> > > > > > > > > > > +    vendor-specific vendor specific recovery method
> > > > > > > > > > >       unknown         consumer policy
> > > > > > > > > > >       =============== ========================================
> > > > > > > > > > > diff --git a/drivers/gpu/drm/drm_drv.c b/drivers/gpu/drm/drm_drv.c
> > > > > > > > > > > index cdd591b11488..0ac723a46a91 100644
> > > > > > > > > > > --- a/drivers/gpu/drm/drm_drv.c
> > > > > > > > > > > +++ b/drivers/gpu/drm/drm_drv.c
> > > > > > > > > > > @@ -532,6 +532,8 @@ static const char *drm_get_wedge_recovery(unsigned int opt)
> > > > > > > > > > >   		return "rebind";
> > > > > > > > > > >   	case DRM_WEDGE_RECOVERY_BUS_RESET:
> > > > > > > > > > >   		return "bus-reset";
> > > > > > > > > > > +	case DRM_WEDGE_RECOVERY_VENDOR:
> > > > > > > > > > > +		return "vendor-specific";
> > > > > > > > > > >   	default:
> > > > > > > > > > >   		return NULL;
> > > > > > > > > > >   	}
> > > > > > > > > > > diff --git a/include/drm/drm_device.h b/include/drm/drm_device.h
> > > > > > > > > > > index 08b3b2467c4c..08a087f149ff 100644
> > > > > > > > > > > --- a/include/drm/drm_device.h
> > > > > > > > > > > +++ b/include/drm/drm_device.h
> > > > > > > > > > > @@ -26,10 +26,14 @@ struct pci_controller;
> > > > > > > > > > >    * Recovery methods for wedged device in order of less to more side-effects.
> > > > > > > > > > >    * To be used with drm_dev_wedged_event() as recovery @method. Callers can
> > > > > > > > > > >    * use any one, multiple (or'd) or none depending on their needs.
> > > > > > > > > > > + *
> > > > > > > > > > > + * Refer to "Device Wedging" chapter in Documentation/gpu/drm-uapi.rst for more
> > > > > > > > > > > + * details.
> > > > > > > > > > >    */
> > > > > > > > > > >   #define DRM_WEDGE_RECOVERY_NONE		BIT(0)	/* optional telemetry collection */
> > > > > > > > > > >   #define DRM_WEDGE_RECOVERY_REBIND	BIT(1)	/* unbind + bind driver */
> > > > > > > > > > >   #define DRM_WEDGE_RECOVERY_BUS_RESET	BIT(2)	/* unbind + reset bus device + bind */
> > > > > > > > > > > +#define DRM_WEDGE_RECOVERY_VENDOR	BIT(3)	/* vendor specific recovery method */
> > > > > > > > > > >   /**
> > > > > > > > > > >    * struct drm_wedge_task_info - information about the guilty task of a wedge dev
> > > > > > > > > > > -- 
> > > > > > > > > > > 2.47.1
> > > > > > > > > > > 
> > > > > > > > > > 
> > > > > > > > > 
> > > > > > 
> > > > > 
> 
> 
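
The uevent format discussed in this thread can be consumed with a small parser. The following is a minimal sketch, assuming the event arrives as KEY=VALUE lines such as those printed by `udevadm monitor --property`; `parse_uevent` and `wedged_methods` are hypothetical names, and the sample text is abridged from the cover letter.

```python
from typing import Dict, List

# Abridged from the uevent shown in the cover letter of this series.
SAMPLE = """ACTION=change
SUBSYSTEM=drm
WEDGED=vendor-specific
DEVNAME=/dev/dri/card0
"""


def parse_uevent(text: str) -> Dict[str, str]:
    """Collect KEY=VALUE lines into a dict, skipping anything else."""
    props: Dict[str, str] = {}
    for line in text.splitlines():
        key, sep, value = line.strip().partition("=")
        # Keep only well-formed properties: non-empty key without spaces.
        if sep and key and " " not in key:
            props[key] = value
    return props


def wedged_methods(props: Dict[str, str]) -> List[str]:
    """Split WEDGED=<method1>[,..,<methodN>] into a list, keeping its order."""
    return [m for m in props.get("WEDGED", "").split(",") if m]


if __name__ == "__main__":
    print(wedged_methods(parse_uevent(SAMPLE)))
```

The list order is preserved because the documentation quoted above states that methods are sent in order of less to more side-effects.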

^ permalink raw reply	[flat|nested] 48+ messages in thread

* Re: [PATCH v4 6/9] drm/xe/doc: Document device wedged and runtime survivability
  2025-07-11  5:39   ` Raag Jadav
@ 2025-07-11  6:09     ` Riana Tauro
  2025-07-12  5:45       ` Raag Jadav
  0 siblings, 1 reply; 48+ messages in thread
From: Riana Tauro @ 2025-07-11  6:09 UTC (permalink / raw)
  To: Raag Jadav
  Cc: intel-xe, anshuman.gupta, rodrigo.vivi, lucas.demarchi,
	aravind.iddamsetty, umesh.nerlige.ramappa, frank.scarbrough,
	sk.anirban



On 7/11/2025 11:09 AM, Raag Jadav wrote:
> On Wed, Jul 09, 2025 at 04:50:18PM +0530, Riana Tauro wrote:
>> Add documentation for vendor specific device wedged recovery method
>> and runtime survivability.
> 
> ...
> 
>> + * Runtime Survivability
>> + * =====================
>> + *
>> + * Certain runtime firmware errors can cause the device to enter a wedged state
>> + * (:ref:`xe-device-wedging`) requiring a firmware flash to restore normal operation.
>> + * Runtime Survivability Mode indicates that a firmware flash is necessary to recover the device and
>> + * is indicated by the presence of survivability mode sysfs::
>> + *
>> + *	/sys/bus/pci/devices/<device>/survivability_mode
>> + *
>> + * Survivability mode sysfs provides information about the type of survivability mode.
>> + *
>> + * When such errors occur, userspace is notified with the drm device wedged uevent and runtime
>> + * survivability mode. User can then initiate a firmware flash to restore device to normal
>> + * operation.
> 
> Do we have definition on actual procedure? Can we add a reference to it?
> Otherwise it's telling me to do something I have no idea about.

That is a userspace tool. I don't see any kernel code referring to
userspace documentation.

Thanks
Riana


> 
> Raag
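
The sysfs check described in the quoted documentation can be sketched from userspace as follows. This assumes only what the documentation states, namely that the file is present when the device is in survivability mode. The helper name `read_survivability_mode` and the `sysfs_root` parameter (added so the function can be exercised against a temporary directory) are hypothetical, and the format of the file contents is not assumed.

```python
import os
from typing import Optional


def read_survivability_mode(device: str, sysfs_root: str = "/sys") -> Optional[str]:
    """Return the contents of survivability_mode for a PCI device, or None.

    `device` is a PCI address such as "0000:03:00.0". None means the
    attribute is absent, i.e. the device is not reporting survivability mode.
    """
    path = os.path.join(
        sysfs_root, "bus", "pci", "devices", device, "survivability_mode"
    )
    try:
        with open(path, "r", encoding="utf-8") as f:
            return f.read().strip()
    except FileNotFoundError:
        return None


if __name__ == "__main__":
    mode = read_survivability_mode("0000:03:00.0")
    if mode is None:
        print("survivability_mode not present")
    else:
        print("survivability_mode:", mode)
```

Such a check tells the admin that a firmware flash is required; the flash procedure itself is outside what this thread documents.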


^ permalink raw reply	[flat|nested] 48+ messages in thread

* Re: [PATCH v4 1/9] drm: Add a vendor-specific recovery method to device wedged uevent
  2025-07-10 19:00                 ` Rodrigo Vivi
  2025-07-10 21:46                   ` Raag Jadav
@ 2025-07-11  8:56                   ` Simona Vetter
  1 sibling, 0 replies; 48+ messages in thread
From: Simona Vetter @ 2025-07-11  8:56 UTC (permalink / raw)
  To: Rodrigo Vivi
  Cc: Raag Jadav, Christian König, Simona Vetter, Riana Tauro,
	intel-xe, anshuman.gupta, lucas.demarchi, aravind.iddamsetty,
	umesh.nerlige.ramappa, frank.scarbrough, sk.anirban,
	André Almeida, David Airlie, dri-devel

On Thu, Jul 10, 2025 at 03:00:06PM -0400, Rodrigo Vivi wrote:
> On Thu, Jul 10, 2025 at 01:24:52PM +0300, Raag Jadav wrote:
> > On Thu, Jul 10, 2025 at 11:37:14AM +0200, Christian König wrote:
> > > On 10.07.25 11:01, Simona Vetter wrote:
> > > > On Wed, Jul 09, 2025 at 12:52:05PM -0400, Rodrigo Vivi wrote:
> > > >> On Wed, Jul 09, 2025 at 05:18:54PM +0300, Raag Jadav wrote:
> > > >>> On Wed, Jul 09, 2025 at 04:09:20PM +0200, Christian König wrote:
> > > >>>> On 09.07.25 15:41, Simona Vetter wrote:
> > > >>>>> On Wed, Jul 09, 2025 at 04:50:13PM +0530, Riana Tauro wrote:
> > > >>>>>> Certain errors can cause the device to be wedged and may
> > > >>>>>> require a vendor specific recovery method to restore normal
> > > >>>>>> operation.
> > > >>>>>>
> > > >>>>>> Add a recovery method 'WEDGED=vendor-specific' for such errors. Vendors
> > > >>>>>> must provide additional recovery documentation if this method
> > > >>>>>> is used.
> > > >>>>>>
> > > >>>>>> v2: fix documentation (Raag)
> > > >>>>>>
> > > >>>>>> Cc: André Almeida <andrealmeid@igalia.com>
> > > >>>>>> Cc: Christian König <christian.koenig@amd.com>
> > > >>>>>> Cc: David Airlie <airlied@gmail.com>
> > > >>>>>> Cc: <dri-devel@lists.freedesktop.org>
> > > >>>>>> Suggested-by: Raag Jadav <raag.jadav@intel.com>
> > > >>>>>> Signed-off-by: Riana Tauro <riana.tauro@intel.com>
> > > >>>>>
> > > >>>>> I'm not really understanding what this is useful for, maybe concrete
> > > >>>>> example in the form of driver code that uses this, and some tool or
> > > >>>>> documentation steps that should be taken for recovery?
> > > >>
> > > >> The case here is when FW underneath identified something badly corrupted on
> > > >> FW land and decided that only a firmware-flashing could solve the day and
> > > >> raise interrupt to the driver. At that point we want to wedge, but immediately
> > > >> hint the admin the recommended action.
> > > >>
> > > >>>>
> > > >>>> The recovery method for this particular case is to flash in a new firmware.
> > > >>>>
> > > >>>>> The issues I'm seeing here is that eventually we'll get different
> > > >>>>> vendor-specific recovery steps, and maybe even on the same device, and
> > > >>>>> that leads us to an enumeration issue. Since it's just a string and an
> > > >>>>> enum I think it'd be better to just allocate a new one every time there's
> > > >>>>> a new strange recovery method instead of this opaque approach.
> > > >>>>
> > > >>>> That is exactly the opposite of what we discussed so far.
> > > > 
> > > > Sorry, I missed that context.
> > > > 
> > > >>>> The original idea was to add a firmware-flash recovery method which
> > > >>>> looked a bit vague since it didn't give any information on what to do
> > > >>>> exactly.
> > > >>>>
> > > >>>> That's why I suggested to add a more generic vendor-specific event
> > > >>>> which refers to the documentation and system log to see what actually
> > > >>>> needs to be done.
> > > >>>>
> > > >>>> Otherwise we would end up with events like firmware-flash, update FW
> > > >>>> image A, update FW image B, FW version mismatch etc....
> > > > 
> > > > Yeah, that's kinda what I expect to happen, and we have enough numbers for
> > > > this all to not be an issue.
> > > > 
> > > >>> Agree. Any newly allocated method that is specific to a vendor is going to
> > > >>> be opaque anyway, since it can't be generic for all drivers. This just helps
> > > >>> reduce the noise in DRM core.
> > > >>>
> > > >>> And yes, there could be different vendor-specific cases for the same driver
> > > >>> and the driver should be able to provide the means to distinguish between
> > > >>> them.
> > > >>
> > > >> Sim, what's your take on this then?
> > > >>
> > > >> Should we get back to the original idea of firmware-flash?
> > > > 
> > > > Maybe intel-firmware-flash or something, meaning prefix with the vendor?
> > > > 
> > > > The reason I think it should be specific is because I'm assuming you want
> > > > to script this. And if you have a big fleet with different vendors, then
> > > > "vendor-specific" doesn't tell you enough. But if it's something like
> > > > $vendor-$magic_step then it does become scriptable, and we do have a
> > > > place to put some documentation on what you should do instead.
> > > > 
> > > > If the point of this interface isn't that it's scriptable, then I'm not
> > > > sure why it needs to be an uevent?
> > > 
> > > You should probably read up on the previous discussion, cause that is exactly what I asked as well :)
> > > 
> > > And no, it should *not* be scripted. That would be a bit brave for a firmware update where you should absolutely not power down the system for example.
> 
> I also don't like the idea or even the thought of scripting something like
> a firmware-flash. But only to fail with a better pin point to make admin
> lives easier with a notification.
> 
> > > 
> > > In my understanding the new value "vendor-specific" basically means it is a known issue with a documented solution, while "unknown" means the driver has no idea how to solve it.
> 
> Exactly, the hardware and firmware are giving the indication of what should be
> done. It is not 'unknown'.
> 
> > 
> > Yes, and since the recovery procedure is defined and known to the consumer,
> > it can potentially be automated (at least for non-firmware cases).
> > 
> > > > I guess if you all want to stick with vendor-specific then I think that's
> 
> Well, I would honestly prefer a direct firmware-flash, but if that is not
> usable by other vendors and there's a push back on that, let's go with
> the vendor-specific then.
> 
> > > > ok with me too, but the docs should at least explain how to figure out
> > > > from the uevent which vendor you're on with a small example. What I'm
> > > > worried is that if we have this on multiple drivers userspace will
> > > > otherwise make a complete mess and might want to run the wrong recovery
> > > > steps.
> > 
> > The device id along with driver can be identified from uevent (probably
> > available inside DEVPATH somewhere) to distinguish the vendor. So the consumer
> > already knows if the device fits the criteria for recovery.
> > 
> > > > I think ideally, no matter what, we'd have a concrete driver patch which
> > > > then also comes with the documentation for what exactly you're supposed to
> > > > do as something you can script. And not just this stand-alone patch here.
> > 
> > Perhaps the rest of the series didn't make it to dri-devel, which will answer
> > most of the above.
> 
> Riana, could you please try to provide a bit more documentation like Sima
> asked and re-send the entire series to dri-devel?

Yeah all the stuff discussed here needs to be included in the commit
message.

Also ideally the patch series would add a user in xe, and the xe patch
would then also include documentation on how that firmware flash should
happen. Also, I guess the xe patch should explain how backwards
compatibility should work if there's ever a need for yet-another recovery
method. Asking userspace to filter on specific pciid sounds brittle, I
think worst case we'll just add vendor-specific-1 or something :-)

Which is why I still think we should just enum them all, but I'm ok with
just properly documenting the reasoning behind this all.

Cheers, Sima
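
One way a consumer could stay robust if further recovery methods are added later is to look each reported method up in the documented table and flag anything it does not recognise. This is a minimal sketch, assuming the table from the quoted drm-uapi.rst patch; `describe_methods` is a hypothetical helper, and returning None for unrecognised values is an assumption, not documented behaviour.

```python
from typing import List, Optional, Tuple

# Recovery table as it reads with the quoted drm-uapi.rst patch applied.
KNOWN_METHODS = {
    "none": "optional telemetry collection",
    "rebind": "unbind + bind driver",
    "bus-reset": "unbind + bus reset/re-enumeration + bind",
    "vendor-specific": "vendor specific recovery method",
    "unknown": "consumer policy",
}


def describe_methods(methods: List[str]) -> List[Tuple[str, Optional[str]]]:
    """Pair each reported method with its documented expectation.

    Methods missing from the table map to None so the caller can decide
    what to do (for example, only notify the admin) instead of guessing.
    """
    return [(m, KNOWN_METHODS.get(m)) for m in methods]


if __name__ == "__main__":
    for method, expectation in describe_methods(["vendor-specific", "vendor-specific-1"]):
        print(method, "->", expectation or "not recognised, leave to the admin")
```

Keeping unrecognised values visible, rather than dropping them, means a later addition such as the "vendor-specific-1" mentioned above would surface to the admin instead of being silently ignored.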

> 
> Thanks,
> Rodrigo.
> 
> > 
> > Raag
> > 
> > > >>>>>> ---
> > > >>>>>>  Documentation/gpu/drm-uapi.rst | 9 +++++----
> > > >>>>>>  drivers/gpu/drm/drm_drv.c      | 2 ++
> > > >>>>>>  include/drm/drm_device.h       | 4 ++++
> > > >>>>>>  3 files changed, 11 insertions(+), 4 deletions(-)
> > > >>>>>>
> > > >>>>>> diff --git a/Documentation/gpu/drm-uapi.rst b/Documentation/gpu/drm-uapi.rst
> > > >>>>>> index 263e5a97c080..c33070bdb347 100644
> > > >>>>>> --- a/Documentation/gpu/drm-uapi.rst
> > > >>>>>> +++ b/Documentation/gpu/drm-uapi.rst
> > > >>>>>> @@ -421,10 +421,10 @@ Recovery
> > > >>>>>>  Current implementation defines three recovery methods, out of which, drivers
> > > >>>>>>  can use any one, multiple or none. Method(s) of choice will be sent in the
> > > >>>>>>  uevent environment as ``WEDGED=<method1>[,..,<methodN>]`` in order of less to
> > > >>>>>> -more side-effects. If driver is unsure about recovery or method is unknown
> > > >>>>>> -(like soft/hard system reboot, firmware flashing, physical device replacement
> > > >>>>>> -or any other procedure which can't be attempted on the fly), ``WEDGED=unknown``
> > > >>>>>> -will be sent instead.
> > > >>>>>> +more side-effects. If recovery method is specific to vendor
> > > >>>>>> +``WEDGED=vendor-specific`` will be sent and userspace should refer to vendor
> > > >>>>>> +specific documentation for further recovery steps. If driver is unsure about
> > > >>>>>> +recovery or method is unknown, ``WEDGED=unknown`` will be sent instead
> > > >>>>>>  
> > > >>>>>>  Userspace consumers can parse this event and attempt recovery as per the
> > > >>>>>>  following expectations.
> > > >>>>>> @@ -435,6 +435,7 @@ following expectations.
> > > >>>>>>      none            optional telemetry collection
> > > >>>>>>      rebind          unbind + bind driver
> > > >>>>>>      bus-reset       unbind + bus reset/re-enumeration + bind
> > > >>>>>> +    vendor-specific vendor specific recovery method
> > > >>>>>>      unknown         consumer policy
> > > >>>>>>      =============== ========================================
> > > >>>>>>  
> > > >>>>>> diff --git a/drivers/gpu/drm/drm_drv.c b/drivers/gpu/drm/drm_drv.c
> > > >>>>>> index cdd591b11488..0ac723a46a91 100644
> > > >>>>>> --- a/drivers/gpu/drm/drm_drv.c
> > > >>>>>> +++ b/drivers/gpu/drm/drm_drv.c
> > > >>>>>> @@ -532,6 +532,8 @@ static const char *drm_get_wedge_recovery(unsigned int opt)
> > > >>>>>>  		return "rebind";
> > > >>>>>>  	case DRM_WEDGE_RECOVERY_BUS_RESET:
> > > >>>>>>  		return "bus-reset";
> > > >>>>>> +	case DRM_WEDGE_RECOVERY_VENDOR:
> > > >>>>>> +		return "vendor-specific";
> > > >>>>>>  	default:
> > > >>>>>>  		return NULL;
> > > >>>>>>  	}
> > > >>>>>> diff --git a/include/drm/drm_device.h b/include/drm/drm_device.h
> > > >>>>>> index 08b3b2467c4c..08a087f149ff 100644
> > > >>>>>> --- a/include/drm/drm_device.h
> > > >>>>>> +++ b/include/drm/drm_device.h
> > > >>>>>> @@ -26,10 +26,14 @@ struct pci_controller;
> > > >>>>>>   * Recovery methods for wedged device in order of less to more side-effects.
> > > >>>>>>   * To be used with drm_dev_wedged_event() as recovery @method. Callers can
> > > >>>>>>   * use any one, multiple (or'd) or none depending on their needs.
> > > >>>>>> + *
> > > >>>>>> + * Refer to "Device Wedging" chapter in Documentation/gpu/drm-uapi.rst for more
> > > >>>>>> + * details.
> > > >>>>>>   */
> > > >>>>>>  #define DRM_WEDGE_RECOVERY_NONE		BIT(0)	/* optional telemetry collection */
> > > >>>>>>  #define DRM_WEDGE_RECOVERY_REBIND	BIT(1)	/* unbind + bind driver */
> > > >>>>>>  #define DRM_WEDGE_RECOVERY_BUS_RESET	BIT(2)	/* unbind + reset bus device + bind */
> > > >>>>>> +#define DRM_WEDGE_RECOVERY_VENDOR	BIT(3)	/* vendor specific recovery method */
> > > >>>>>>  
> > > >>>>>>  /**
> > > >>>>>>   * struct drm_wedge_task_info - information about the guilty task of a wedge dev
> > > >>>>>> -- 
> > > >>>>>> 2.47.1
> > > >>>>>>
> > > >>>>>
> > > >>>>
> > > > 
> > > 

-- 
Simona Vetter
Software Engineer, Intel Corporation
http://blog.ffwll.ch

^ permalink raw reply	[flat|nested] 48+ messages in thread

* Re: [PATCH v4 1/9] drm: Add a vendor-specific recovery method to device wedged uevent
  2025-07-10  9:37             ` Christian König
  2025-07-10 10:24               ` Raag Jadav
@ 2025-07-11  8:59               ` Simona Vetter
  2025-07-14  5:27                 ` Riana Tauro
  1 sibling, 1 reply; 48+ messages in thread
From: Simona Vetter @ 2025-07-11  8:59 UTC (permalink / raw)
  To: Christian König
  Cc: Simona Vetter, Rodrigo Vivi, Raag Jadav, Riana Tauro, intel-xe,
	anshuman.gupta, lucas.demarchi, aravind.iddamsetty,
	umesh.nerlige.ramappa, frank.scarbrough, sk.anirban,
	André Almeida, David Airlie, dri-devel

On Thu, Jul 10, 2025 at 11:37:14AM +0200, Christian König wrote:
> On 10.07.25 11:01, Simona Vetter wrote:
> > On Wed, Jul 09, 2025 at 12:52:05PM -0400, Rodrigo Vivi wrote:
> >> On Wed, Jul 09, 2025 at 05:18:54PM +0300, Raag Jadav wrote:
> >>> On Wed, Jul 09, 2025 at 04:09:20PM +0200, Christian König wrote:
> >>>> On 09.07.25 15:41, Simona Vetter wrote:
> >>>>> On Wed, Jul 09, 2025 at 04:50:13PM +0530, Riana Tauro wrote:
> >>>>>> Certain errors can cause the device to be wedged and may
> >>>>>> require a vendor specific recovery method to restore normal
> >>>>>> operation.
> >>>>>>
> >>>>>> Add a recovery method 'WEDGED=vendor-specific' for such errors. Vendors
> >>>>>> must provide additional recovery documentation if this method
> >>>>>> is used.
> >>>>>>
> >>>>>> v2: fix documentation (Raag)
> >>>>>>
> >>>>>> Cc: André Almeida <andrealmeid@igalia.com>
> >>>>>> Cc: Christian König <christian.koenig@amd.com>
> >>>>>> Cc: David Airlie <airlied@gmail.com>
> >>>>>> Cc: <dri-devel@lists.freedesktop.org>
> >>>>>> Suggested-by: Raag Jadav <raag.jadav@intel.com>
> >>>>>> Signed-off-by: Riana Tauro <riana.tauro@intel.com>
> >>>>>
> >>>>> I'm not really understanding what this is useful for, maybe concrete
> >>>>> example in the form of driver code that uses this, and some tool or
> >>>>> documentation steps that should be taken for recovery?
> >>
> >> The case here is when the FW underneath has identified something badly
> >> corrupted in FW land, decided that only a firmware flash can save the day,
> >> and raised an interrupt to the driver. At that point we want to wedge, but
> >> immediately hint to the admin what the recommended action is.
> >>
> >>>>
> >>>> The recovery method for this particular case is to flash in a new firmware.
> >>>>
> >>>>> The issues I'm seeing here is that eventually we'll get different
> >>>>> vendor-specific recovery steps, and maybe even on the same device, and
> >>>>> that leads us to an enumeration issue. Since it's just a string and an
> >>>>> enum I think it'd be better to just allocate a new one every time there's
> >>>>> a new strange recovery method instead of this opaque approach.
> >>>>
> >>>> That is exactly the opposite of what we discussed so far.
> > 
> > Sorry, I missed that context.
> > 
> >>>> The original idea was to add a firmware-flash recovery method which
> >>>> looked a bit vague since it didn't give any information on what to do
> >>>> exactly.
> >>>>
> >>>> That's why I suggested to add a more generic vendor-specific event
> >>>> which refers to the documentation and system log to see what actually
> >>>> needs to be done.
> >>>>
> >>>> Otherwise we would end up with events like firmware-flash, update FW
> >>>> image A, update FW image B, FW version mismatch etc....
> > 
> > Yeah, that's kinda what I expect to happen, and we have enough numbers for
> > this all to not be an issue.
> > 
> >>> Agree. Any newly allocated method that is specific to a vendor is going to
> >>> be opaque anyway, since it can't be generic for all drivers. This just helps
> >>> reduce the noise in DRM core.
> >>>
> >>> And yes, there could be different vendor-specific cases for the same driver
> >>> and the driver should be able to provide the means to distinguish between
> >>> them.
> >>
> >> Sim, what's your take on this then?
> >>
> >> Should we get back to the original idea of firmware-flash?
> > 
> > Maybe intel-firmware-flash or something, meaning prefix with the vendor?
> > 
> > The reason I think it should be specific is because I'm assuming you want
> > to script this. And if you have a big fleet with different vendors, then
> > "vendor-specific" doesn't tell you enough. But if it's something like
> > $vendor-$magic_step then it does become scriptable, and we do have a
> > place to put some documentation on what you should do instead.
> > 
> > If the point of this interface isn't that it's scriptable, then I'm not
> > sure why it needs to be an uevent?
> 
> You should probably read up on the previous discussion, cause that is
> exactly what I asked as well :)
> 
> And no, it should *not* be scripted. That would be a bit brave for a
> firmware update where you should absolutely not power down the system
> for example.

If we clearly state that this is for manual recovery only, or for
cases where you know exactly what you're doing (fleet-specific scripts
instead of generic distros), I guess this very opaque code makes sense.

But we should clearly document then that doing anything scripted here is
very much "you get to keep the pieces", and definitely don't try to do
something fancy generic.

Which without documentation is just really confusing when some of the
other error codes clearly look like they're meant to facilitate scripted
recovery.
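
To make the distinction concrete, here is a rough consumer-side sketch that parses the WEDGED value and only treats the generic methods as candidates for automation. The method strings come from the drm-uapi.rst table quoted below; the enum, the function names and the policy itself are assumptions for illustration, not anything the kernel or an existing tool defines.

```c
#include <assert.h>
#include <string.h>

/* Hypothetical consumer-side flags; these names are not part of any uAPI. */
enum wedged_method {
	WEDGED_NONE      = 1 << 0,
	WEDGED_REBIND    = 1 << 1,
	WEDGED_BUS_RESET = 1 << 2,
	WEDGED_VENDOR    = 1 << 3,
	WEDGED_UNKNOWN   = 1 << 4,
};

/* Map one token of the WEDGED list to a flag, or 0 if it is not recognised. */
static unsigned int wedged_token(const char *tok, size_t len)
{
	static const struct {
		const char *name;
		unsigned int flag;
	} map[] = {
		{ "none", WEDGED_NONE },
		{ "rebind", WEDGED_REBIND },
		{ "bus-reset", WEDGED_BUS_RESET },
		{ "vendor-specific", WEDGED_VENDOR },
		{ "unknown", WEDGED_UNKNOWN },
	};
	size_t i;

	for (i = 0; i < sizeof(map) / sizeof(map[0]); i++)
		if (strlen(map[i].name) == len && !strncmp(tok, map[i].name, len))
			return map[i].flag;

	return 0;
}

/* Parse "rebind,bus-reset" style values; returns 0 on any unrecognised token. */
static unsigned int wedged_parse(const char *value)
{
	unsigned int mask = 0;

	assert(value);

	while (*value) {
		size_t len = strcspn(value, ",");
		unsigned int flag = wedged_token(value, len);

		if (!flag)
			return 0;
		mask |= flag;
		value += len;
		if (*value == ',')
			value++;
	}

	return mask;
}

/* Assumed policy: vendor-specific and unknown always need a human. */
static int wedged_is_scriptable(unsigned int mask)
{
	return mask && !(mask & (WEDGED_VENDOR | WEDGED_UNKNOWN));
}
```

With this policy a generic distro script would act on "rebind" or "bus-reset" and only log "vendor-specific", leaving the actual firmware flash to an admin or to a fleet-specific tool that knows the device.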

> In my understanding the new value "vendor-specific" basically means it
> is a known issue with a documented solution, while "unknown" means the
> driver has no idea how to solve it.

I think that's another detail which should be documented clearly.
-Sima
> 
> Regards,
> Christian.
> 
> > I guess if you all want to stick with vendor-specific then I think that's
> > ok with me too, but the docs should at least explain how to figure out
> > from the uevent which vendor you're on with a small example. What I'm
> > worried is that if we have this on multiple drivers userspace will
> > otherwise make a complete mess and might want to run the wrong recovery
> > steps.
> > 
> > I think ideally, no matter what, we'd have a concrete driver patch which
> > then also comes with the documentation for what exactly you're supposed to
> > do as something you can script. And not just this stand-alone patch here.
> > 
> > Cheers, Sima
> >>
> >>>
> >>> Raag
> >>>
> >>>>>> ---
> >>>>>>  Documentation/gpu/drm-uapi.rst | 9 +++++----
> >>>>>>  drivers/gpu/drm/drm_drv.c      | 2 ++
> >>>>>>  include/drm/drm_device.h       | 4 ++++
> >>>>>>  3 files changed, 11 insertions(+), 4 deletions(-)
> >>>>>>
> >>>>>> diff --git a/Documentation/gpu/drm-uapi.rst b/Documentation/gpu/drm-uapi.rst
> >>>>>> index 263e5a97c080..c33070bdb347 100644
> >>>>>> --- a/Documentation/gpu/drm-uapi.rst
> >>>>>> +++ b/Documentation/gpu/drm-uapi.rst
> >>>>>> @@ -421,10 +421,10 @@ Recovery
> >>>>>>  Current implementation defines three recovery methods, out of which, drivers
> >>>>>>  can use any one, multiple or none. Method(s) of choice will be sent in the
> >>>>>>  uevent environment as ``WEDGED=<method1>[,..,<methodN>]`` in order of less to
> >>>>>> -more side-effects. If driver is unsure about recovery or method is unknown
> >>>>>> -(like soft/hard system reboot, firmware flashing, physical device replacement
> >>>>>> -or any other procedure which can't be attempted on the fly), ``WEDGED=unknown``
> >>>>>> -will be sent instead.
> >>>>>> +more side-effects. If recovery method is specific to vendor
> >>>>>> +``WEDGED=vendor-specific`` will be sent and userspace should refer to vendor
> >>>>>> +specific documentation for further recovery steps. If driver is unsure about
> >>>>>> +recovery or method is unknown, ``WEDGED=unknown`` will be sent instead
> >>>>>>  
> >>>>>>  Userspace consumers can parse this event and attempt recovery as per the
> >>>>>>  following expectations.
> >>>>>> @@ -435,6 +435,7 @@ following expectations.
> >>>>>>      none            optional telemetry collection
> >>>>>>      rebind          unbind + bind driver
> >>>>>>      bus-reset       unbind + bus reset/re-enumeration + bind
> >>>>>> +    vendor-specific vendor specific recovery method
> >>>>>>      unknown         consumer policy
> >>>>>>      =============== ========================================
> >>>>>>  
> >>>>>> diff --git a/drivers/gpu/drm/drm_drv.c b/drivers/gpu/drm/drm_drv.c
> >>>>>> index cdd591b11488..0ac723a46a91 100644
> >>>>>> --- a/drivers/gpu/drm/drm_drv.c
> >>>>>> +++ b/drivers/gpu/drm/drm_drv.c
> >>>>>> @@ -532,6 +532,8 @@ static const char *drm_get_wedge_recovery(unsigned int opt)
> >>>>>>  		return "rebind";
> >>>>>>  	case DRM_WEDGE_RECOVERY_BUS_RESET:
> >>>>>>  		return "bus-reset";
> >>>>>> +	case DRM_WEDGE_RECOVERY_VENDOR:
> >>>>>> +		return "vendor-specific";
> >>>>>>  	default:
> >>>>>>  		return NULL;
> >>>>>>  	}
> >>>>>> diff --git a/include/drm/drm_device.h b/include/drm/drm_device.h
> >>>>>> index 08b3b2467c4c..08a087f149ff 100644
> >>>>>> --- a/include/drm/drm_device.h
> >>>>>> +++ b/include/drm/drm_device.h
> >>>>>> @@ -26,10 +26,14 @@ struct pci_controller;
> >>>>>>   * Recovery methods for wedged device in order of less to more side-effects.
> >>>>>>   * To be used with drm_dev_wedged_event() as recovery @method. Callers can
> >>>>>>   * use any one, multiple (or'd) or none depending on their needs.
> >>>>>> + *
> >>>>>> + * Refer to "Device Wedging" chapter in Documentation/gpu/drm-uapi.rst for more
> >>>>>> + * details.
> >>>>>>   */
> >>>>>>  #define DRM_WEDGE_RECOVERY_NONE		BIT(0)	/* optional telemetry collection */
> >>>>>>  #define DRM_WEDGE_RECOVERY_REBIND	BIT(1)	/* unbind + bind driver */
> >>>>>>  #define DRM_WEDGE_RECOVERY_BUS_RESET	BIT(2)	/* unbind + reset bus device + bind */
> >>>>>> +#define DRM_WEDGE_RECOVERY_VENDOR	BIT(3)	/* vendor specific recovery method */
> >>>>>>  
> >>>>>>  /**
> >>>>>>   * struct drm_wedge_task_info - information about the guilty task of a wedge dev
> >>>>>> -- 
> >>>>>> 2.47.1
> >>>>>>
> >>>>>
> >>>>
> > 
> 

-- 
Simona Vetter
Software Engineer, Intel Corporation
http://blog.ffwll.ch

^ permalink raw reply	[flat|nested] 48+ messages in thread

* Re: [PATCH v4 7/9] drm/xe: Add support to handle hardware errors
  2025-07-11  5:35     ` Riana Tauro
@ 2025-07-11 17:34       ` Umesh Nerlige Ramappa
  0 siblings, 0 replies; 48+ messages in thread
From: Umesh Nerlige Ramappa @ 2025-07-11 17:34 UTC (permalink / raw)
  To: Riana Tauro
  Cc: intel-xe, anshuman.gupta, rodrigo.vivi, lucas.demarchi,
	aravind.iddamsetty, raag.jadav, frank.scarbrough, sk.anirban,
	Himal Prasad Ghimiray

On Fri, Jul 11, 2025 at 11:05:04AM +0530, Riana Tauro wrote:
>Hi Umesh
>
>On 7/11/2025 2:39 AM, Umesh Nerlige Ramappa wrote:
>>Resending since it got lost earlier...
>>
>>On Wed, Jul 09, 2025 at 04:50:19PM +0530, Riana Tauro wrote:
>>>Gfx device reports two classes of errors: uncorrectable and
>>>correctable. Depending on the severity, uncorrectable errors are
>>>further classified as non-fatal and fatal.
>>>
>>>Correctable and non-fatal errors are reported as MSI's and bits in
>>>the Master Interrupt Register indicate the class of the error.
>>>The source of the error is then read from the Device Error Source
>>>Register.
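
The class decoding described above can be sketched in plain C as follows. The bit positions mirror ERROR_IRQ(x) = REG_BIT(26 + (x)) and the enum from the patch further down; error_irq() and pending_error_classes() are hypothetical standalone helpers, not driver functions.

```c
#include <assert.h>
#include <stdint.h>

/* Error categories, copied from the enum in the patch. */
enum hardware_error {
	HARDWARE_ERROR_CORRECTABLE = 0,
	HARDWARE_ERROR_NONFATAL = 1,
	HARDWARE_ERROR_FATAL = 2,
	HARDWARE_ERROR_MAX,
};

/* Master interrupt bit for one error class: bit 26 + class index. */
static uint32_t error_irq(int hw_err)
{
	assert(hw_err >= 0 && hw_err < HARDWARE_ERROR_MAX);
	return UINT32_C(1) << (26 + hw_err);
}

/* Returns a mask with bit N set when error class N is flagged in master_ctl. */
static unsigned int pending_error_classes(uint32_t master_ctl)
{
	unsigned int pending = 0;
	int hw_err;

	for (hw_err = 0; hw_err < HARDWARE_ERROR_MAX; hw_err++)
		if (master_ctl & error_irq(hw_err))
			pending |= 1u << hw_err;

	return pending;
}
```

Each class found pending would then have its own Device Error Source Register read, which is what hw_error_source_handler() does in the patch.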
>>
>>nit: Since Fatal is a separate category, maybe a split here into a 
>>separate paragraph and some formatting would be good.
>>
>>>Fatal errors are reported as PCIe errors.
>>>When a PCIe error is asserted, the OS will perform a device warm reset
>>>which causes the driver to reload. The error registers are sticky
>>>and the values are maintained through a warm reset.
>>>
>>>Add basic support to handle these errors
>>>
>>>Bspec: 50875, 53073, 53074, 53075, 53076
>>>
>>>Co-developed-by: Himal Prasad Ghimiray <himal.prasad.ghimiray@intel.com>
>>>Signed-off-by: Himal Prasad Ghimiray <himal.prasad.ghimiray@intel.com>
>>>Signed-off-by: Riana Tauro <riana.tauro@intel.com>
>>>---
>>>drivers/gpu/drm/xe/Makefile                |   1 +
>>>drivers/gpu/drm/xe/regs/xe_hw_error_regs.h |  15 +++
>>>drivers/gpu/drm/xe/regs/xe_irq_regs.h      |   1 +
>>>drivers/gpu/drm/xe/xe_hw_error.c           | 108 +++++++++++++++++++++
>>>drivers/gpu/drm/xe/xe_hw_error.h           |  15 +++
>>>drivers/gpu/drm/xe/xe_irq.c                |   4 +
>>>6 files changed, 144 insertions(+)
>>>create mode 100644 drivers/gpu/drm/xe/regs/xe_hw_error_regs.h
>>>create mode 100644 drivers/gpu/drm/xe/xe_hw_error.c
>>>create mode 100644 drivers/gpu/drm/xe/xe_hw_error.h
>>>
>>>diff --git a/drivers/gpu/drm/xe/Makefile b/drivers/gpu/drm/xe/Makefile
>>>index 1d97e5b63f4e..fea8ee3b0785 100644
>>>--- a/drivers/gpu/drm/xe/Makefile
>>>+++ b/drivers/gpu/drm/xe/Makefile
>>>@@ -73,6 +73,7 @@ xe-y += xe_bb.o \
>>>    xe_hw_engine.o \
>>>    xe_hw_engine_class_sysfs.o \
>>>    xe_hw_engine_group.o \
>>>+    xe_hw_error.o \
>>>    xe_hw_fence.o \
>>>    xe_irq.o \
>>>    xe_lrc.o \
>>>diff --git a/drivers/gpu/drm/xe/regs/xe_hw_error_regs.h 
>>>b/drivers/gpu/drm/xe/regs/xe_hw_error_regs.h
>>>new file mode 100644
>>>index 000000000000..ed9b81fb28a0
>>>--- /dev/null
>>>+++ b/drivers/gpu/drm/xe/regs/xe_hw_error_regs.h
>>>@@ -0,0 +1,15 @@
>>>+/* SPDX-License-Identifier: MIT */
>>>+/*
>>>+ * Copyright © 2025 Intel Corporation
>>>+ */
>>>+
>>>+#ifndef _XE_HW_ERROR_REGS_H_
>>>+#define _XE_HW_ERROR_REGS_H_
>>>+
>>>+#define DEV_ERR_STAT_NONFATAL            0x100178
>>>+#define DEV_ERR_STAT_CORRECTABLE        0x10017c
>>>+#define DEV_ERR_STAT_REG(x)            XE_REG(_PICK_EVEN((x), \
>>>+                                  DEV_ERR_STAT_CORRECTABLE, \
>>>+                                  DEV_ERR_STAT_NONFATAL))
>>
>>For x = 1 and x = 2, I don't see the above result in correct values.
>>Can you please double check?
>
>I was confused by the same thing when I took the patch from the other
>series. But the second term of the macro is negative, so the
>registers are correct.
>
>Calculations for 1 and 2
>
>#define _PICK_EVEN(__index, __a, __b) ((__a) + (__index) * ((__b) - (__a)))
>
>_PICK_EVEN([HARDWARE_ERROR_NONFATAL = 1]) = DEV_ERR_STAT_CORRECTABLE + 
>1 * (DEV_ERR_STAT_NONFATAL - DEV_ERR_STAT_CORRECTABLE)
>					    0x10017c + 1 * (0x100178 - 0x10017c)
>   					    0x100178
>
>_PICK_EVEN([HARDWARE_ERROR_FATAL = 2]) = DEV_ERR_STAT_CORRECTABLE + 2 
>* (DEV_ERR_STAT_NONFATAL - DEV_ERR_STAT_CORRECTABLE)
>					    0x10017c + 2 * (0x100178 - 0x10017c)
>   					    0x100174
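
The arithmetic above can be double-checked with a small standalone program. This is only a sketch: PICK_EVEN copies the _PICK_EVEN definition quoted above, the two offsets are the ones from the patch, and dev_err_stat_offset() is a hypothetical helper standing in for DEV_ERR_STAT_REG() without the XE_REG() wrapper.

```c
#include <assert.h>
#include <stdint.h>

/* Same shape as the kernel's _PICK_EVEN(): a + index * (b - a). */
#define PICK_EVEN(index, a, b) ((a) + (index) * ((b) - (a)))

#define DEV_ERR_STAT_NONFATAL		0x100178
#define DEV_ERR_STAT_CORRECTABLE	0x10017c

/* Offset of the error status register for class 0, 1 or 2. */
static uint32_t dev_err_stat_offset(int hw_err)
{
	assert(hw_err >= 0 && hw_err <= 2);

	/* (b - a) is -4 here, so each step walks down by one dword. */
	return (uint32_t)PICK_EVEN(hw_err, DEV_ERR_STAT_CORRECTABLE,
				   DEV_ERR_STAT_NONFATAL);
}
```

Because the stride is negative, index 2 lands on 0x100174 without a separate DEV_ERR_STAT_FATAL define.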

ok, makes sense now,

Reviewed-by: Umesh Nerlige Ramappa <umesh.nerlige.ramappa@intel.com>

Thanks,
Umesh
>
>Thanks
>Riana
>		
>
>>
>>What about DEV_ERR_STAT_FATAL?
>>
>>Rest looks good,
>>
>>Umesh
>>
>>>+
>>>+#endif
>>>diff --git a/drivers/gpu/drm/xe/regs/xe_irq_regs.h 
>>>b/drivers/gpu/drm/xe/regs/xe_irq_regs.h
>>>index f0ecfcac4003..2758b64cec9e 100644
>>>--- a/drivers/gpu/drm/xe/regs/xe_irq_regs.h
>>>+++ b/drivers/gpu/drm/xe/regs/xe_irq_regs.h
>>>@@ -18,6 +18,7 @@
>>>#define GFX_MSTR_IRQ                XE_REG(0x190010, XE_REG_OPTION_VF)
>>>#define   MASTER_IRQ                REG_BIT(31)
>>>#define   GU_MISC_IRQ                REG_BIT(29)
>>>+#define   ERROR_IRQ(x)                REG_BIT(26 + (x))
>>>#define   DISPLAY_IRQ                REG_BIT(16)
>>>#define   GT_DW_IRQ(x)                REG_BIT(x)
>>>
>>>diff --git a/drivers/gpu/drm/xe/xe_hw_error.c 
>>>b/drivers/gpu/drm/xe/xe_hw_error.c
>>>new file mode 100644
>>>index 000000000000..0f2590839900
>>>--- /dev/null
>>>+++ b/drivers/gpu/drm/xe/xe_hw_error.c
>>>@@ -0,0 +1,108 @@
>>>+// SPDX-License-Identifier: MIT
>>>+/*
>>>+ * Copyright © 2025 Intel Corporation
>>>+ */
>>>+
>>>+#include "regs/xe_hw_error_regs.h"
>>>+#include "regs/xe_irq_regs.h"
>>>+
>>>+#include "xe_device.h"
>>>+#include "xe_hw_error.h"
>>>+#include "xe_mmio.h"
>>>+
>>>+/* Error categories reported by hardware */
>>>+enum hardware_error {
>>>+    HARDWARE_ERROR_CORRECTABLE = 0,
>>>+    HARDWARE_ERROR_NONFATAL = 1,
>>>+    HARDWARE_ERROR_FATAL = 2,
>>>+    HARDWARE_ERROR_MAX,
>>>+};
>>>+
>>>+static const char *hw_error_to_str(const enum hardware_error hw_err)
>>>+{
>>>+    switch (hw_err) {
>>>+    case HARDWARE_ERROR_CORRECTABLE:
>>>+        return "CORRECTABLE";
>>>+    case HARDWARE_ERROR_NONFATAL:
>>>+        return "NONFATAL";
>>>+    case HARDWARE_ERROR_FATAL:
>>>+        return "FATAL";
>>>+    default:
>>>+        return "UNKNOWN";
>>>+    }
>>>+}
>>>+
>>>+static void hw_error_source_handler(struct xe_tile *tile, const 
>>>enum hardware_error hw_err)
>>>+{
>>>+    const char *hw_err_str = hw_error_to_str(hw_err);
>>>+    struct xe_device *xe = tile_to_xe(tile);
>>>+    unsigned long flags;
>>>+    u32 err_src;
>>>+
>>>+    if (xe->info.platform != XE_BATTLEMAGE)
>>>+        return;
>>>+
>>>+    spin_lock_irqsave(&xe->irq.lock, flags);
>>>+    err_src = xe_mmio_read32(&tile->mmio, DEV_ERR_STAT_REG(hw_err));
>>>+    if (!err_src) {
>>>+        drm_err_ratelimited(&xe->drm, HW_ERR "Tile%d reported 
>>>DEV_ERR_STAT_%s blank!\n",
>>>+                    tile->id, hw_err_str);
>>>+        goto unlock;
>>>+    }
>>>+
>>>+    /* TODO: Process errrors per source */
>>>+
>>>+    xe_mmio_write32(&tile->mmio, DEV_ERR_STAT_REG(hw_err), err_src);
>>>+
>>>+unlock:
>>>+    spin_unlock_irqrestore(&xe->irq.lock, flags);
>>>+}
>>>+
>>>+/**
>>>+ * xe_hw_error_irq_handler - irq handling for hw errors
>>>+ * @tile: tile instance
>>>+ * @master_ctl: value read from master interrupt register
>>>+ *
>>>+ * Xe platforms add three error bits to the master interrupt 
>>>register to support error handling.
>>>+ * These three bits are used to convey the class of error FATAL, 
>>>NONFATAL, or CORRECTABLE.
>>>+ * To process the interrupt, determine the source of error by 
>>>reading the Device Error Source
>>>+ * Register that corresponds to the class of error being serviced.
>>>+ */
>>>+void xe_hw_error_irq_handler(struct xe_tile *tile, const u32 master_ctl)
>>>+{
>>>+    enum hardware_error hw_err;
>>>+
>>>+    for (hw_err = 0; hw_err < HARDWARE_ERROR_MAX; hw_err++)
>>>+        if (master_ctl & ERROR_IRQ(hw_err))
>>>+            hw_error_source_handler(tile, hw_err);
>>>+}
>>>+
>>>+/*
>>>+ * Process hardware errors during boot
>>>+ */
>>>+static void process_hw_errors(struct xe_device *xe)
>>>+{
>>>+    struct xe_tile *tile;
>>>+    u32 master_ctl;
>>>+    u8 id;
>>>+
>>>+    for_each_tile(tile, xe, id) {
>>>+        master_ctl = xe_mmio_read32(&tile->mmio, GFX_MSTR_IRQ);
>>>+        xe_hw_error_irq_handler(tile, master_ctl);
>>>+        xe_mmio_write32(&tile->mmio, GFX_MSTR_IRQ, master_ctl);
>>>+    }
>>>+}
>>>+
>>>+/**
>>>+ * xe_hw_error_init - Initialize hw errors
>>>+ * @xe: xe device instance
>>>+ *
>>>+ * Initialize and process hw errors
>>>+ */
>>>+void xe_hw_error_init(struct xe_device *xe)
>>>+{
>>>+    if (!IS_DGFX(xe) || IS_SRIOV_VF(xe))
>>>+        return;
>>>+
>>>+    process_hw_errors(xe);
>>>+}
>>>diff --git a/drivers/gpu/drm/xe/xe_hw_error.h 
>>>b/drivers/gpu/drm/xe/xe_hw_error.h
>>>new file mode 100644
>>>index 000000000000..d86e28c5180c
>>>--- /dev/null
>>>+++ b/drivers/gpu/drm/xe/xe_hw_error.h
>>>@@ -0,0 +1,15 @@
>>>+/* SPDX-License-Identifier: MIT */
>>>+/*
>>>+ * Copyright © 2025 Intel Corporation
>>>+ */
>>>+#ifndef XE_HW_ERROR_H_
>>>+#define XE_HW_ERROR_H_
>>>+
>>>+#include <linux/types.h>
>>>+
>>>+struct xe_tile;
>>>+struct xe_device;
>>>+
>>>+void xe_hw_error_irq_handler(struct xe_tile *tile, const u32 
>>>master_ctl);
>>>+void xe_hw_error_init(struct xe_device *xe);
>>>+#endif
>>>diff --git a/drivers/gpu/drm/xe/xe_irq.c b/drivers/gpu/drm/xe/xe_irq.c
>>>index 5362d3174b06..24ccf3bec52c 100644
>>>--- a/drivers/gpu/drm/xe/xe_irq.c
>>>+++ b/drivers/gpu/drm/xe/xe_irq.c
>>>@@ -18,6 +18,7 @@
>>>#include "xe_gt.h"
>>>#include "xe_guc.h"
>>>#include "xe_hw_engine.h"
>>>+#include "xe_hw_error.h"
>>>#include "xe_memirq.h"
>>>#include "xe_mmio.h"
>>>#include "xe_pxp.h"
>>>@@ -466,6 +467,7 @@ static irqreturn_t dg1_irq_handler(int irq, 
>>>void *arg)
>>>        xe_mmio_write32(mmio, GFX_MSTR_IRQ, master_ctl);
>>>
>>>        gt_irq_handler(tile, master_ctl, intr_dw, identity);
>>>+        xe_hw_error_irq_handler(tile, master_ctl);
>>>
>>>        /*
>>>         * Display interrupts (including display backlight operations
>>>@@ -753,6 +755,8 @@ int xe_irq_install(struct xe_device *xe)
>>>    int nvec = 1;
>>>    int err;
>>>
>>>+    xe_hw_error_init(xe);
>>>+
>>>    xe_irq_reset(xe);
>>>
>>>    if (xe_device_has_msix(xe)) {
>>>-- 
>>>2.47.1
>>>
>

^ permalink raw reply	[flat|nested] 48+ messages in thread

* Re: [PATCH v4 8/9] drm/xe/xe_hw_error: Handle CSC Firmware reported Hardware errors
  2025-07-11  5:46     ` Riana Tauro
@ 2025-07-11 17:38       ` Umesh Nerlige Ramappa
  0 siblings, 0 replies; 48+ messages in thread
From: Umesh Nerlige Ramappa @ 2025-07-11 17:38 UTC (permalink / raw)
  To: Riana Tauro
  Cc: intel-xe, anshuman.gupta, rodrigo.vivi, lucas.demarchi,
	aravind.iddamsetty, raag.jadav, frank.scarbrough, sk.anirban

On Fri, Jul 11, 2025 at 11:16:15AM +0530, Riana Tauro wrote:
>Hi Umesh
>
>On 7/11/2025 6:06 AM, Umesh Nerlige Ramappa wrote:
>>On Wed, Jul 09, 2025 at 04:50:20PM +0530, Riana Tauro wrote:
>>>Add support to handle CSC firmware reported errors. When CSC firmware
>>>errors are encountered, an error interrupt is received by the GFX device as
>>>an MSI interrupt.
>>>
>>>Device Source control registers indicate the source of the error as CSC.
>>>The HEC error status register indicates that the error is firmware
>>>reported.
>>>Depending on the type of error, the error cause is written to the HEC
>>>Firmware error register.
>>>
>>>On encountering such CSC firmware errors, the graphics device is
>>>non-recoverable from driver context. The only way to recover from these
>>>errors is firmware flash. The device is then wedged and userspace is
>>>notified with a drm uevent
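
As an illustration of the last decoding step, the sketch below maps the low bits of the HEC firmware error register to the names used in the patch. The table and HEC_UNCORR_FW_ERR_BITS are taken from the diff further down; hec_fw_errors_decode() is a hypothetical userspace-style helper, whereas the driver itself walks the bits with for_each_set_bit().

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define HEC_UNCORR_FW_ERR_BITS 4

/* Bit index to error name, as listed in the patch. */
static const char * const hec_uncorrected_fw_errors[HEC_UNCORR_FW_ERR_BITS] = {
	"Fatal",
	"CSE Disabled",
	"FD Corruption",
	"Data Corruption"
};

/*
 * Collect the names of all errors flagged in the low HEC_UNCORR_FW_ERR_BITS
 * bits of dw0. Higher bits are ignored. Returns the number of names stored.
 */
static size_t hec_fw_errors_decode(uint32_t dw0,
				   const char *names[HEC_UNCORR_FW_ERR_BITS])
{
	size_t count = 0;
	unsigned int bit;

	assert(names);

	for (bit = 0; bit < HEC_UNCORR_FW_ERR_BITS; bit++)
		if (dw0 & (UINT32_C(1) << bit))
			names[count++] = hec_uncorrected_fw_errors[bit];

	return count;
}
```

In the patch any set bit leads to the same outcome (the worker is scheduled and the device enters runtime survivability mode), so the names matter mainly for the log message.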
>>>
>>>v2: use vendor recovery method with
>>>   runtime survivability (Christian, Rodrigo, Raag)
>>>
>>>v3: move declare wedged to runtime survivability mode (Rodrigo)
>>>
>>>Signed-off-by: Riana Tauro <riana.tauro@intel.com>
>>>---
>>>drivers/gpu/drm/xe/regs/xe_gsc_regs.h      |  2 +
>>>drivers/gpu/drm/xe/regs/xe_hw_error_regs.h |  7 ++-
>>>drivers/gpu/drm/xe/xe_device_types.h       |  3 +
>>>drivers/gpu/drm/xe/xe_hw_error.c           | 68 +++++++++++++++++++++-
>>>4 files changed, 78 insertions(+), 2 deletions(-)
>>>
>>>diff --git a/drivers/gpu/drm/xe/regs/xe_gsc_regs.h 
>>>b/drivers/gpu/drm/xe/regs/xe_gsc_regs.h
>>>index 9b66cc972a63..180be82672ab 100644
>>>--- a/drivers/gpu/drm/xe/regs/xe_gsc_regs.h
>>>+++ b/drivers/gpu/drm/xe/regs/xe_gsc_regs.h
>>>@@ -13,6 +13,8 @@
>>>
>>>/* Definitions of GSC H/W registers, bits, etc */
>>>
>>>+#define BMG_GSC_HECI1_BASE    0x373000
>>>+
>>>#define MTL_GSC_HECI1_BASE    0x00116000
>>>#define MTL_GSC_HECI2_BASE    0x00117000
>>>
>>>diff --git a/drivers/gpu/drm/xe/regs/xe_hw_error_regs.h 
>>>b/drivers/gpu/drm/xe/regs/xe_hw_error_regs.h
>>>index ed9b81fb28a0..c146b9ef44eb 100644
>>>--- a/drivers/gpu/drm/xe/regs/xe_hw_error_regs.h
>>>+++ b/drivers/gpu/drm/xe/regs/xe_hw_error_regs.h
>>>@@ -6,10 +6,15 @@
>>>#ifndef _XE_HW_ERROR_REGS_H_
>>>#define _XE_HW_ERROR_REGS_H_
>>>
>>>+#define HEC_UNCORR_ERR_STATUS(base)                    
>>>XE_REG((base) + 0x118)
>>>+#define    UNCORR_FW_REPORTED_ERR                      BIT(6)
>>>+
>>>+#define HEC_UNCORR_FW_ERR_DW0(base)                    
>>>XE_REG((base) + 0x124)
>>>+
>>>#define DEV_ERR_STAT_NONFATAL            0x100178
>>>#define DEV_ERR_STAT_CORRECTABLE        0x10017c
>>>#define DEV_ERR_STAT_REG(x)            XE_REG(_PICK_EVEN((x), \
>>>                                  DEV_ERR_STAT_CORRECTABLE, \
>>>                                  DEV_ERR_STAT_NONFATAL))
>>>-
>>>+#define   XE_CSC_ERROR                BIT(17)
>>>#endif
>>>diff --git a/drivers/gpu/drm/xe/xe_device_types.h 
>>>b/drivers/gpu/drm/xe/xe_device_types.h
>>>index ca300338e8c2..283d5c88758e 100644
>>>--- a/drivers/gpu/drm/xe/xe_device_types.h
>>>+++ b/drivers/gpu/drm/xe/xe_device_types.h
>>>@@ -241,6 +241,9 @@ struct xe_tile {
>>>    /** @memirq: Memory Based Interrupts. */
>>>    struct xe_memirq memirq;
>>>
>>>+    /** @csc_hw_error_work: worker to report CSC HW errors */
>>>+    struct work_struct csc_hw_error_work;
>>>+
>>>    /** @pcode: tile's PCODE */
>>>    struct {
>>>        /** @pcode.lock: protecting tile's PCODE mailbox data */
>>>diff --git a/drivers/gpu/drm/xe/xe_hw_error.c 
>>>b/drivers/gpu/drm/xe/xe_hw_error.c
>>>index 0f2590839900..7cc9b8a7fa1a 100644
>>>--- a/drivers/gpu/drm/xe/xe_hw_error.c
>>>+++ b/drivers/gpu/drm/xe/xe_hw_error.c
>>>@@ -3,12 +3,16 @@
>>> * Copyright © 2025 Intel Corporation
>>> */
>>>
>>>+#include "regs/xe_gsc_regs.h"
>>>#include "regs/xe_hw_error_regs.h"
>>>#include "regs/xe_irq_regs.h"
>>>
>>>#include "xe_device.h"
>>>#include "xe_hw_error.h"
>>>#include "xe_mmio.h"
>>>+#include "xe_survivability_mode.h"
>>>+
>>>+#define  HEC_UNCORR_FW_ERR_BITS 4
>>>
>>>/* Error categories reported by hardware */
>>>enum hardware_error {
>>>@@ -18,6 +22,13 @@ enum hardware_error {
>>>    HARDWARE_ERROR_MAX,
>>>};
>>>
>>>+static const char * const hec_uncorrected_fw_errors[] = {
>>>+    "Fatal",
>>>+    "CSE Disabled",
>>>+    "FD Corruption",
>>>+    "Data Corruption"
>>>+};
>>>+
>>>static const char *hw_error_to_str(const enum hardware_error hw_err)
>>>{
>>>    switch (hw_err) {
>>>@@ -32,6 +43,56 @@ static const char *hw_error_to_str(const enum 
>>>hardware_error hw_err)
>>>    }
>>>}
>>>
>>>+static void csc_hw_error_work(struct work_struct *work)
>>>+{
>>>+    struct xe_tile *tile = container_of(work, typeof(*tile), 
>>>csc_hw_error_work);
>>>+    struct xe_device *xe = tile_to_xe(tile);
>>>+    int ret;
>>>+
>>>+    ret = xe_survivability_mode_runtime_enable(xe);
>>
>>xe_survivability_mode_runtime_enable() returns if it's not BMG, not 
>>dgfx etc., so does it make sense to not even queue the work if those 
>>conditions are not met?
>
>CSC work is only scheduled for BMG in the below handler.
>The bit is not present in prior platforms
>>
>>>+    if (ret)
>>>+        drm_err(&xe->drm, "Failed to enable runtime survivability 
>>>mode\n");
>>>+}
>>>+
>>>+static void csc_hw_error_handler(struct xe_tile *tile, const enum 
>>>hardware_error hw_err)
>>>+{
>>>+    const char *hw_err_str = hw_error_to_str(hw_err);
>>>+    struct xe_device *xe = tile_to_xe(tile);
>>>+    struct xe_mmio *mmio = &tile->mmio;
>>>+    u32 base, err_bit, err_src;
>>>+    unsigned long fw_err;
>>>+
>>>+    if (xe->info.platform != XE_BATTLEMAGE)
>>>+        return;
>>>+
>>>+    /* Not supported in BMG */
>>>+    if (hw_err == HARDWARE_ERROR_CORRECTABLE)
>>>+        return;
>>>+
>>>+    base = BMG_GSC_HECI1_BASE;
>>>+    lockdep_assert_held(&xe->irq.lock);
>>>+    err_src = xe_mmio_read32(mmio, HEC_UNCORR_ERR_STATUS(base));
>>>+    if (!err_src) {
>>>+        drm_err_ratelimited(&xe->drm, HW_ERR "Tile%d reported 
>>>HEC_ERR_STATUS_%s blank\n",
>>>+                    tile->id, hw_err_str);
>>>+        return;
>>>+    }
>>>+
>>>+    if (err_src & UNCORR_FW_REPORTED_ERR) {
>>>+        fw_err = xe_mmio_read32(mmio, HEC_UNCORR_FW_ERR_DW0(base));
>>>+        for_each_set_bit(err_bit, &fw_err, HEC_UNCORR_FW_ERR_BITS) {
>>>+            drm_err_ratelimited(&xe->drm, HW_ERR
>>>+                        "%s: HEC Uncorrected FW %s error 
>>>reported, bit[%d] is set\n",
>>>+                         hw_err_str, hec_uncorrected_fw_errors[err_bit],
>>>+                         err_bit);
>>>+
>>>+            schedule_work(&tile->csc_hw_error_work);
>>>+        }
>>>+    }
>>>+
>>>+    xe_mmio_write32(mmio, HEC_UNCORR_ERR_STATUS(base), err_src);
>>>+}
>>>+
>>>static void hw_error_source_handler(struct xe_tile *tile, const 
>>>enum hardware_error hw_err)
>>>{
>>>    const char *hw_err_str = hw_error_to_str(hw_err);
>>>@@ -50,7 +111,8 @@ static void hw_error_source_handler(struct 
>>>xe_tile *tile, const enum hardware_er
>>>        goto unlock;
>>>    }
>>>
>>>-    /* TODO: Process errrors per source */
>>>+    if (err_src & XE_CSC_ERROR)
>>>+        csc_hw_error_handler(tile, hw_err);
>>>
>>>    xe_mmio_write32(&tile->mmio, DEV_ERR_STAT_REG(hw_err), err_src);
>>>
>>>@@ -101,8 +163,12 @@ static void process_hw_errors(struct xe_device *xe)
>>> */
>>>void xe_hw_error_init(struct xe_device *xe)
>>>{
>>>+    struct xe_tile *tile = xe_device_get_root_tile(xe);
>>>+
>>>    if (!IS_DGFX(xe) || IS_SRIOV_VF(xe))
>>>        return;
>>>
>>>+    INIT_WORK(&tile->csc_hw_error_work, csc_hw_error_work);
>>
>>Same here, why have a worker if it's not BMG?
>>
>>Also, reiterating a previous comment in another patch - if the 
>>feature can be defined as a has_ struct member in the pci/gt info 
>>that could streamline the checks.
>
>This is only initialization. The queueing is done in the handler.
>If it is supported from a particular platform then it seems unnecessary.
>Should I add a function instead?

No, this is good enough as long as the worker is queued only on supported
platforms.

Reviewed-by: Umesh Nerlige Ramappa <umesh.nerlige.ramappa@intel.com>

Thanks,
Umesh
>
>Thanks,
>Riana
>
>>
>>Thanks,
>>Umesh
>>
>>>+
>>>    process_hw_errors(xe);
>>>}
>>>-- 
>>>2.47.1
>>>
>
>
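
As an illustration of the decode loop in csc_hw_error_handler() above, here is
a minimal userspace sketch, not driver code. It walks the low
HEC_UNCORR_FW_ERR_BITS bits of a register value the way for_each_set_bit()
does. The function name decode_fw_err() and the strings in fw_err_names[] are
hypothetical placeholders, since the hec_uncorrected_fw_errors[] table is not
shown in this part of the thread.

```c
#include <stdint.h>
#include <stdio.h>

#define HEC_UNCORR_FW_ERR_BITS 4 /* same width as in the patch */

/* Hypothetical names; the real table lives in xe_hw_error.c. */
static const char *const fw_err_names[HEC_UNCORR_FW_ERR_BITS] = {
	"error 0", "error 1", "error 2", "error 3",
};

/*
 * Report every set bit below HEC_UNCORR_FW_ERR_BITS and return how many
 * were found. Bits at or above the limit are ignored, as with
 * for_each_set_bit().
 */
static int decode_fw_err(uint32_t fw_err, FILE *out)
{
	int bit, count = 0;

	for (bit = 0; bit < HEC_UNCORR_FW_ERR_BITS; bit++) {
		if (!(fw_err & (UINT32_C(1) << bit)))
			continue;
		fprintf(out, "HEC Uncorrected FW %s reported, bit[%d] is set\n",
			fw_err_names[bit], bit);
		count++;
	}
	return count;
}
```

Calling decode_fw_err(0x5, stderr) reports bits 0 and 2 and returns 2. A
caller could trigger recovery once on a non-zero return value instead of once
per bit.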

^ permalink raw reply	[flat|nested] 48+ messages in thread

* Re: [PATCH v4 9/9] drm/xe/xe_hw_error: Add fault injection to trigger csc error handler
  2025-07-09 11:20 ` [PATCH v4 9/9] drm/xe/xe_hw_error: Add fault injection to trigger csc error handler Riana Tauro
@ 2025-07-11 17:41   ` Umesh Nerlige Ramappa
  2025-07-14  7:07     ` Riana Tauro
  0 siblings, 1 reply; 48+ messages in thread
From: Umesh Nerlige Ramappa @ 2025-07-11 17:41 UTC (permalink / raw)
  To: Riana Tauro
  Cc: intel-xe, anshuman.gupta, rodrigo.vivi, lucas.demarchi,
	aravind.iddamsetty, raag.jadav, frank.scarbrough, sk.anirban

On Wed, Jul 09, 2025 at 04:50:21PM +0530, Riana Tauro wrote:
>Add a debugfs fault handler to trigger csc error handler that
>wedges the device and sends drm uevent
>
>Signed-off-by: Riana Tauro <riana.tauro@intel.com>
>---
> drivers/gpu/drm/xe/xe_debugfs.c  |  2 ++
> drivers/gpu/drm/xe/xe_hw_error.c | 11 +++++++++++
> 2 files changed, 13 insertions(+)
>
>diff --git a/drivers/gpu/drm/xe/xe_debugfs.c b/drivers/gpu/drm/xe/xe_debugfs.c
>index d83cd6ed3fa8..134610437aea 100644
>--- a/drivers/gpu/drm/xe/xe_debugfs.c
>+++ b/drivers/gpu/drm/xe/xe_debugfs.c
>@@ -29,6 +29,7 @@
> #endif
>
> DECLARE_FAULT_ATTR(gt_reset_failure);
>+DECLARE_FAULT_ATTR(inject_csc_hw_error);
>
> static struct xe_device *node_to_xe(struct drm_info_node *node)
> {
>@@ -273,4 +274,5 @@ void xe_debugfs_register(struct xe_device *xe)
> 	xe_pxp_debugfs_register(xe->pxp);
>
> 	fault_create_debugfs_attr("fail_gt_reset", root, &gt_reset_failure);
>+	fault_create_debugfs_attr("inject_csc_hw_error", root, &inject_csc_hw_error);

Maybe create this attribute only for BMG, since on other platforms the worker
will bail out with an error anyway when it runs? Or are you expecting to see
the log message that says "runtime survivability not supported"?

The absence of this attribute in debugfs can also be sufficient to 
indicate that it's not supported.

Thanks,
Umesh

> }
>diff --git a/drivers/gpu/drm/xe/xe_hw_error.c b/drivers/gpu/drm/xe/xe_hw_error.c
>index 7cc9b8a7fa1a..2d56a93b3a71 100644
>--- a/drivers/gpu/drm/xe/xe_hw_error.c
>+++ b/drivers/gpu/drm/xe/xe_hw_error.c
>@@ -3,6 +3,8 @@
>  * Copyright © 2025 Intel Corporation
>  */
>
>+#include <linux/fault-inject.h>
>+
> #include "regs/xe_gsc_regs.h"
> #include "regs/xe_hw_error_regs.h"
> #include "regs/xe_irq_regs.h"
>@@ -13,6 +15,7 @@
> #include "xe_survivability_mode.h"
>
> #define  HEC_UNCORR_FW_ERR_BITS 4
>+extern struct fault_attr inject_csc_hw_error;
>
> /* Error categories reported by hardware */
> enum hardware_error {
>@@ -43,6 +46,11 @@ static const char *hw_error_to_str(const enum hardware_error hw_err)
> 	}
> }
>
>+static bool fault_inject_csc_hw_error(void)
>+{
>+	return should_fail(&inject_csc_hw_error, 1);
>+}
>+
> static void csc_hw_error_work(struct work_struct *work)
> {
> 	struct xe_tile *tile = container_of(work, typeof(*tile), csc_hw_error_work);
>@@ -134,6 +142,9 @@ void xe_hw_error_irq_handler(struct xe_tile *tile, const u32 master_ctl)
> {
> 	enum hardware_error hw_err;
>
>+	if (fault_inject_csc_hw_error())
>+		schedule_work(&tile->csc_hw_error_work);
>+
> 	for (hw_err = 0; hw_err < HARDWARE_ERROR_MAX; hw_err++)
> 		if (master_ctl & ERROR_IRQ(hw_err))
> 			hw_error_source_handler(tile, hw_err);
>-- 
>2.47.1
>
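
For reference, a minimal userspace sketch of arming the fault attribute from
this patch. It relies on the standard fault-injection debugfs files
`probability` and `times` that fault_create_debugfs_attr() creates. The
directory /sys/kernel/debug/dri/0/inject_csc_hw_error used in the note below is
an assumption about where `root` ends up, and write_attr() and arm_fault() are
hypothetical helper names.

```c
#include <stdio.h>

/* Write a decimal value into <dir>/<name>; returns 0 on success, -1 on failure. */
static int write_attr(const char *dir, const char *name, long value)
{
	char path[512];
	FILE *f;
	int ok;

	if (snprintf(path, sizeof(path), "%s/%s", dir, name) >= (int)sizeof(path))
		return -1;
	f = fopen(path, "w");
	if (!f)
		return -1;
	ok = fprintf(f, "%ld\n", value) > 0;
	if (fclose(f) != 0)
		ok = 0;
	return ok ? 0 : -1;
}

/*
 * Arm a fault attribute: probability is a percentage, times caps how many
 * failures are injected. The other attributes are left at their defaults.
 */
static int arm_fault(const char *attr_dir, long times)
{
	if (write_attr(attr_dir, "probability", 100))
		return -1;
	return write_attr(attr_dir, "times", times);
}
```

Run as root, arm_fault("/sys/kernel/debug/dri/0/inject_csc_hw_error", 1) would
request a single injected failure. If the attribute is created only for BMG as
suggested above, the directory is absent on other platforms and arm_fault()
returns -1.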

^ permalink raw reply	[flat|nested] 48+ messages in thread

* Re: [PATCH v4 6/9] drm/xe/doc: Document device wedged and runtime survivability
  2025-07-11  6:09     ` Riana Tauro
@ 2025-07-12  5:45       ` Raag Jadav
  2025-07-14  9:04         ` Riana Tauro
  0 siblings, 1 reply; 48+ messages in thread
From: Raag Jadav @ 2025-07-12  5:45 UTC (permalink / raw)
  To: Riana Tauro
  Cc: intel-xe, anshuman.gupta, rodrigo.vivi, lucas.demarchi,
	aravind.iddamsetty, umesh.nerlige.ramappa, frank.scarbrough,
	sk.anirban

On Fri, Jul 11, 2025 at 11:39:22AM +0530, Riana Tauro wrote:
> On 7/11/2025 11:09 AM, Raag Jadav wrote:
> > On Wed, Jul 09, 2025 at 04:50:18PM +0530, Riana Tauro wrote:
> > > Add documentation for vendor specific device wedged recovery method
> > > and runtime survivability.
> > 
> > ...
> > 
> > > + * Runtime Survivability
> > > + * =====================
> > > + *
> > > + * Certain runtime firmware errors can cause the device to enter a wedged state
> > > + * (:ref:`xe-device-wedging`) requiring a firmware flash to restore normal operation.
> > > + * Runtime Survivability Mode indicates that a firmware flash is necessary to recover the device and
> > > + * is indicated by the presence of survivability mode sysfs::
> > > + *
> > > + *	/sys/bus/pci/devices/<device>/survivability_mode
> > > + *
> > > + * Survivability mode sysfs provides information about the type of survivability mode.
> > > + *
> > > + * When such errors occur, userspace is notified with the drm device wedged uevent and runtime
> > > + * survivability mode. User can then initiate a firmware flash to restore device to normal
> > > + * operation.
> > 
> > Do we have definition on actual procedure? Can we add a reference to it?
> > Otherwise it's telling me to do something I have no idea about.
> 
> That is a userspace tool. I don't see any kernel code referring to userspace
> documentation.

How are we expecting users to know about it?

Raag

^ permalink raw reply	[flat|nested] 48+ messages in thread

* Re: [PATCH v4 1/9] drm: Add a vendor-specific recovery method to device wedged uevent
  2025-07-11  8:59               ` Simona Vetter
@ 2025-07-14  5:27                 ` Riana Tauro
  2025-07-14 12:33                   ` Simona Vetter
  0 siblings, 1 reply; 48+ messages in thread
From: Riana Tauro @ 2025-07-14  5:27 UTC (permalink / raw)
  To: Simona Vetter, Christian König, Rodrigo Vivi, Raag Jadav
  Cc: intel-xe, anshuman.gupta, lucas.demarchi, aravind.iddamsetty,
	umesh.nerlige.ramappa, frank.scarbrough, sk.anirban,
	André Almeida, David Airlie, dri-devel



On 7/11/2025 2:29 PM, Simona Vetter wrote:
> On Thu, Jul 10, 2025 at 11:37:14AM +0200, Christian König wrote:
>> On 10.07.25 11:01, Simona Vetter wrote:
>>> On Wed, Jul 09, 2025 at 12:52:05PM -0400, Rodrigo Vivi wrote:
>>>> On Wed, Jul 09, 2025 at 05:18:54PM +0300, Raag Jadav wrote:
>>>>> On Wed, Jul 09, 2025 at 04:09:20PM +0200, Christian König wrote:
>>>>>> On 09.07.25 15:41, Simona Vetter wrote:
>>>>>>> On Wed, Jul 09, 2025 at 04:50:13PM +0530, Riana Tauro wrote:
>>>>>>>> Certain errors can cause the device to be wedged and may
>>>>>>>> require a vendor specific recovery method to restore normal
>>>>>>>> operation.
>>>>>>>>
>>>>>>>> Add a recovery method 'WEDGED=vendor-specific' for such errors. Vendors
>>>>>>>> must provide additional recovery documentation if this method
>>>>>>>> is used.
>>>>>>>>
>>>>>>>> v2: fix documentation (Raag)
>>>>>>>>
>>>>>>>> Cc: André Almeida <andrealmeid@igalia.com>
>>>>>>>> Cc: Christian König <christian.koenig@amd.com>
>>>>>>>> Cc: David Airlie <airlied@gmail.com>
>>>>>>>> Cc: <dri-devel@lists.freedesktop.org>
>>>>>>>> Suggested-by: Raag Jadav <raag.jadav@intel.com>
>>>>>>>> Signed-off-by: Riana Tauro <riana.tauro@intel.com>
>>>>>>>
>>>>>>> I'm not really understanding what this is useful for, maybe concrete
>>>>>>> example in the form of driver code that uses this, and some tool or
>>>>>>> documentation steps that should be taken for recovery?
>>>>
>>>> The case here is when FW underneath identified something badly corrupted on
>>>> FW land and decided that only a firmware-flashing could solve the day and
>>>> raise interrupt to the driver. At that point we want to wedge, but immediately
>>>> hint the admin the recommended action.
>>>>
>>>>>>
>>>>>> The recovery method for this particular case is to flash in a new firmware.
>>>>>>
>>>>>>> The issues I'm seeing here is that eventually we'll get different
>>>>>>> vendor-specific recovery steps, and maybe even on the same device, and
>>>>>>> that leads us to an enumeration issue. Since it's just a string and an
>>>>>>> enum I think it'd be better to just allocate a new one every time there's
>>>>>>> a new strange recovery method instead of this opaque approach.
>>>>>>
>>>>>> That is exactly the opposite of what we discussed so far.
>>>
>>> Sorry, I missed that context.
>>>
>>>>>> The original idea was to add a firmware-flash recovery method which
>>>>>> looked a bit vague since it didn't give any information on what to do
>>>>>> exactly.
>>>>>>
>>>>>> That's why I suggested to add a more generic vendor-specific event
>>>>>> which refers to the documentation and system log to see what actually
>>>>>> needs to be done.
>>>>>>
>>>>>> Otherwise we would end up with events like firmware-flash, update FW
>>>>>> image A, update FW image B, FW version mismatch etc....
>>>
>>> Yeah, that's kinda what I expect to happen, and we have enough numbers for
>>> this all to not be an issue.
>>>
>>>>> Agree. Any newly allocated method that is specific to a vendor is going to
>>>>> be opaque anyway, since it can't be generic for all drivers. This just helps
>>>>> reduce the noise in DRM core.
>>>>>
>>>>> And yes, there could be different vendor-specific cases for the same driver
>>>>> and the driver should be able to provide the means to distinguish between
>>>>> them.
>>>>
>>>> Sim, what's your take on this then?
>>>>
>>>> Should we get back to the original idea of firmware-flash?
>>>
>>> Maybe intel-firmware-flash or something, meaning prefix with the vendor?
>>>
>>> The reason I think it should be specific is because I'm assuming you want
>>> to script this. And if you have a big fleet with different vendors, then
>>> "vendor-specific" doesn't tell you enough. But if it's something like
>>> $vendor-$magic_step then it does become scriptable, and we do have a
>>> place to put some documentation on what you should do instead.
>>>
>>> If the point of this interface isn't that it's scriptable, then I'm not
>>> sure why it needs to be an uevent?
>>
>> You should probably read up on the previous discussion, cause that is
>> exactly what I asked as well :)
>>
>> And no, it should *not* be scripted. That would be a bit brave for a
>> firmware update where you should absolutely not power down the system
>> for example.
> 
> I guess if we clearly state that this is for manual recovery only, or for
> cases where you exactly know what you're doing (fleet-specific scripts
> instead of generic distros), I guess this very opaque code makes sense.
> 
> But we should clearly document then that doing anything scripted here is
> very much "you get to keep the pieces", and definitely don't try to do
> something fancy generic.


The documentation is part of the series but was sent only to the intel-xe
mailing list. I will re-send the entire series to dri-devel.

https://lore.kernel.org/intel-xe/aHH2XGuOvz8bSlax@black.fi.intel.com/T/#m883269cf0b1f6891ecc9c24d3d45325f46d56572


> 
> Which without documentation is just really confusing when some of the
> other error codes clearly look like they're meant to facilitate scripted
> recovery.
> 

To get consensus on the patch: is 'vendor-specific' okay, or is it better
to have 'firmware-flash' with an additional event parameter 'vendor', if the
number of macros is not a concern?

Thanks
Riana
>> In my understanding the new value "vendor-specific" basically means it
>> is a known issue with a documented solution, while "unknown" means the
>> driver has no idea how to solve it.
> 
> I think that's another detail which should be documented clearly.
> -Sima
>>
>> Regards,
>> Christian.
>>
>>> I guess if you all want to stick with vendor-specific then I think that's
>>> ok with me too, but the docs should at least explain how to figure out
>>> from the uevent which vendor you're on with a small example. What I'm
>>> worried is that if we have this on multiple drivers userspace will
>>> otherwise make a complete mess and might want to run the wrong recovery
>>> steps.
>>>
>>> I think ideally, no matter what, we'd have a concrete driver patch which
>>> then also comes with the documentation for what exactly you're supposed to
>>> do as something you can script. And not just this stand-alone patch here.
>>>
>>> Cheers, Sima
>>>>
>>>>>
>>>>> Raag
>>>>>
>>>>>>>> ---
>>>>>>>>   Documentation/gpu/drm-uapi.rst | 9 +++++----
>>>>>>>>   drivers/gpu/drm/drm_drv.c      | 2 ++
>>>>>>>>   include/drm/drm_device.h       | 4 ++++
>>>>>>>>   3 files changed, 11 insertions(+), 4 deletions(-)
>>>>>>>>
>>>>>>>> diff --git a/Documentation/gpu/drm-uapi.rst b/Documentation/gpu/drm-uapi.rst
>>>>>>>> index 263e5a97c080..c33070bdb347 100644
>>>>>>>> --- a/Documentation/gpu/drm-uapi.rst
>>>>>>>> +++ b/Documentation/gpu/drm-uapi.rst
>>>>>>>> @@ -421,10 +421,10 @@ Recovery
>>>>>>>>   Current implementation defines three recovery methods, out of which, drivers
>>>>>>>>   can use any one, multiple or none. Method(s) of choice will be sent in the
>>>>>>>>   uevent environment as ``WEDGED=<method1>[,..,<methodN>]`` in order of less to
>>>>>>>> -more side-effects. If driver is unsure about recovery or method is unknown
>>>>>>>> -(like soft/hard system reboot, firmware flashing, physical device replacement
>>>>>>>> -or any other procedure which can't be attempted on the fly), ``WEDGED=unknown``
>>>>>>>> -will be sent instead.
>>>>>>>> +more side-effects. If recovery method is specific to vendor
>>>>>>>> +``WEDGED=vendor-specific`` will be sent and userspace should refer to vendor
>>>>>>>> +specific documentation for further recovery steps. If driver is unsure about
>>>>>>>> +recovery or method is unknown, ``WEDGED=unknown`` will be sent instead
>>>>>>>>   
>>>>>>>>   Userspace consumers can parse this event and attempt recovery as per the
>>>>>>>>   following expectations.
>>>>>>>> @@ -435,6 +435,7 @@ following expectations.
>>>>>>>>       none            optional telemetry collection
>>>>>>>>       rebind          unbind + bind driver
>>>>>>>>       bus-reset       unbind + bus reset/re-enumeration + bind
>>>>>>>> +    vendor-specific vendor specific recovery method
>>>>>>>>       unknown         consumer policy
>>>>>>>>       =============== ========================================
>>>>>>>>   
>>>>>>>> diff --git a/drivers/gpu/drm/drm_drv.c b/drivers/gpu/drm/drm_drv.c
>>>>>>>> index cdd591b11488..0ac723a46a91 100644
>>>>>>>> --- a/drivers/gpu/drm/drm_drv.c
>>>>>>>> +++ b/drivers/gpu/drm/drm_drv.c
>>>>>>>> @@ -532,6 +532,8 @@ static const char *drm_get_wedge_recovery(unsigned int opt)
>>>>>>>>   		return "rebind";
>>>>>>>>   	case DRM_WEDGE_RECOVERY_BUS_RESET:
>>>>>>>>   		return "bus-reset";
>>>>>>>> +	case DRM_WEDGE_RECOVERY_VENDOR:
>>>>>>>> +		return "vendor-specific";
>>>>>>>>   	default:
>>>>>>>>   		return NULL;
>>>>>>>>   	}
>>>>>>>> diff --git a/include/drm/drm_device.h b/include/drm/drm_device.h
>>>>>>>> index 08b3b2467c4c..08a087f149ff 100644
>>>>>>>> --- a/include/drm/drm_device.h
>>>>>>>> +++ b/include/drm/drm_device.h
>>>>>>>> @@ -26,10 +26,14 @@ struct pci_controller;
>>>>>>>>    * Recovery methods for wedged device in order of less to more side-effects.
>>>>>>>>    * To be used with drm_dev_wedged_event() as recovery @method. Callers can
>>>>>>>>    * use any one, multiple (or'd) or none depending on their needs.
>>>>>>>> + *
>>>>>>>> + * Refer to "Device Wedging" chapter in Documentation/gpu/drm-uapi.rst for more
>>>>>>>> + * details.
>>>>>>>>    */
>>>>>>>>   #define DRM_WEDGE_RECOVERY_NONE		BIT(0)	/* optional telemetry collection */
>>>>>>>>   #define DRM_WEDGE_RECOVERY_REBIND	BIT(1)	/* unbind + bind driver */
>>>>>>>>   #define DRM_WEDGE_RECOVERY_BUS_RESET	BIT(2)	/* unbind + reset bus device + bind */
>>>>>>>> +#define DRM_WEDGE_RECOVERY_VENDOR	BIT(3)	/* vendor specific recovery method */
>>>>>>>>   
>>>>>>>>   /**
>>>>>>>>    * struct drm_wedge_task_info - information about the guilty task of a wedge dev
>>>>>>>> -- 
>>>>>>>> 2.47.1
>>>>>>>>
>>>>>>>
>>>>>>
>>>
>>
> 
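
To make the uevent format discussed above concrete, here is a minimal
userspace sketch that parses a ``WEDGED=<method1>[,..,<methodN>]`` property
into a bitmask. The WEDGE_* bit values mirror the DRM_WEDGE_RECOVERY_* macros
in the quoted patch. WEDGE_UNKNOWN, wedge_method_bit() and parse_wedged() are
hypothetical names introduced for illustration; the patch defines no bit for
"unknown".

```c
#include <string.h>

/* Bit values mirror the DRM_WEDGE_RECOVERY_* macros quoted above. */
#define WEDGE_NONE      (1u << 0)
#define WEDGE_REBIND    (1u << 1)
#define WEDGE_BUS_RESET (1u << 2)
#define WEDGE_VENDOR    (1u << 3)
#define WEDGE_UNKNOWN   (1u << 31) /* hypothetical: no such bit in the patch */

/* Map one method name of the given length to its bit, or 0 if unrecognised. */
static unsigned int wedge_method_bit(const char *name, size_t len)
{
	static const struct { const char *name; unsigned int bit; } map[] = {
		{ "none", WEDGE_NONE },
		{ "rebind", WEDGE_REBIND },
		{ "bus-reset", WEDGE_BUS_RESET },
		{ "vendor-specific", WEDGE_VENDOR },
		{ "unknown", WEDGE_UNKNOWN },
	};
	size_t i;

	for (i = 0; i < sizeof(map) / sizeof(map[0]); i++)
		if (strlen(map[i].name) == len && !strncmp(map[i].name, name, len))
			return map[i].bit;
	return 0;
}

/*
 * Parse one uevent property such as "WEDGED=rebind,bus-reset".
 * Returns a bitmask of recognised methods, or 0 if the property is not
 * WEDGED= or contains an unrecognised method name.
 */
static unsigned int parse_wedged(const char *prop)
{
	static const char key[] = "WEDGED=";
	unsigned int mask = 0;
	const char *p;

	if (strncmp(prop, key, sizeof(key) - 1))
		return 0;
	p = prop + sizeof(key) - 1;
	while (*p) {
		size_t len = strcspn(p, ",");
		unsigned int bit = wedge_method_bit(p, len);

		if (!bit)
			return 0;
		mask |= bit;
		p += len;
		if (*p == ',')
			p++;
	}
	return mask;
}
```

In line with the discussion above, a consumer that sees WEDGE_VENDOR would
notify an administrator and point to the vendor documentation rather than
attempt an automatic recovery.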


^ permalink raw reply	[flat|nested] 48+ messages in thread

* Re: [PATCH v4 9/9] drm/xe/xe_hw_error: Add fault injection to trigger csc error handler
  2025-07-11 17:41   ` Umesh Nerlige Ramappa
@ 2025-07-14  7:07     ` Riana Tauro
  0 siblings, 0 replies; 48+ messages in thread
From: Riana Tauro @ 2025-07-14  7:07 UTC (permalink / raw)
  To: Umesh Nerlige Ramappa
  Cc: intel-xe, anshuman.gupta, rodrigo.vivi, lucas.demarchi,
	aravind.iddamsetty, raag.jadav, frank.scarbrough, sk.anirban



On 7/11/2025 11:11 PM, Umesh Nerlige Ramappa wrote:
> On Wed, Jul 09, 2025 at 04:50:21PM +0530, Riana Tauro wrote:
>> Add a debugfs fault handler to trigger csc error handler that
>> wedges the device and sends drm uevent
>>
>> Signed-off-by: Riana Tauro <riana.tauro@intel.com>
>> ---
>> drivers/gpu/drm/xe/xe_debugfs.c  |  2 ++
>> drivers/gpu/drm/xe/xe_hw_error.c | 11 +++++++++++
>> 2 files changed, 13 insertions(+)
>>
>> diff --git a/drivers/gpu/drm/xe/xe_debugfs.c b/drivers/gpu/drm/xe/xe_debugfs.c
>> index d83cd6ed3fa8..134610437aea 100644
>> --- a/drivers/gpu/drm/xe/xe_debugfs.c
>> +++ b/drivers/gpu/drm/xe/xe_debugfs.c
>> @@ -29,6 +29,7 @@
>> #endif
>>
>> DECLARE_FAULT_ATTR(gt_reset_failure);
>> +DECLARE_FAULT_ATTR(inject_csc_hw_error);
>>
>> static struct xe_device *node_to_xe(struct drm_info_node *node)
>> {
>> @@ -273,4 +274,5 @@ void xe_debugfs_register(struct xe_device *xe)
>>     xe_pxp_debugfs_register(xe->pxp);
>>
>>     fault_create_debugfs_attr("fail_gt_reset", root, &gt_reset_failure);
>> +    fault_create_debugfs_attr("inject_csc_hw_error", root, &inject_csc_hw_error);
> 
> Maybe create this attribute only for BMG since it will bail out anyways 
> with an error when the worker runs? OR are you expecting to see that log 
> message which says "runtime survivability not supported".
> 
> The absence of this attribute in debugfs can also be sufficient to 
> indicate that it's not supported.

Yeah, makes sense. I will add this only for BMG.

Thanks
Riana
> 
> Thanks,
> Umesh
> 
>> }
>> diff --git a/drivers/gpu/drm/xe/xe_hw_error.c b/drivers/gpu/drm/xe/xe_hw_error.c
>> index 7cc9b8a7fa1a..2d56a93b3a71 100644
>> --- a/drivers/gpu/drm/xe/xe_hw_error.c
>> +++ b/drivers/gpu/drm/xe/xe_hw_error.c
>> @@ -3,6 +3,8 @@
>>  * Copyright © 2025 Intel Corporation
>>  */
>>
>> +#include <linux/fault-inject.h>
>> +
>> #include "regs/xe_gsc_regs.h"
>> #include "regs/xe_hw_error_regs.h"
>> #include "regs/xe_irq_regs.h"
>> @@ -13,6 +15,7 @@
>> #include "xe_survivability_mode.h"
>>
>> #define  HEC_UNCORR_FW_ERR_BITS 4
>> +extern struct fault_attr inject_csc_hw_error;
>>
>> /* Error categories reported by hardware */
>> enum hardware_error {
>> @@ -43,6 +46,11 @@ static const char *hw_error_to_str(const enum hardware_error hw_err)
>>     }
>> }
>>
>> +static bool fault_inject_csc_hw_error(void)
>> +{
>> +    return should_fail(&inject_csc_hw_error, 1);
>> +}
>> +
>> static void csc_hw_error_work(struct work_struct *work)
>> {
>>     struct xe_tile *tile = container_of(work, typeof(*tile), csc_hw_error_work);
>> @@ -134,6 +142,9 @@ void xe_hw_error_irq_handler(struct xe_tile *tile, const u32 master_ctl)
>> {
>>     enum hardware_error hw_err;
>>
>> +    if (fault_inject_csc_hw_error())
>> +        schedule_work(&tile->csc_hw_error_work);
>> +
>>     for (hw_err = 0; hw_err < HARDWARE_ERROR_MAX; hw_err++)
>>         if (master_ctl & ERROR_IRQ(hw_err))
>>             hw_error_source_handler(tile, hw_err);
>> -- 
>> 2.47.1
>>


^ permalink raw reply	[flat|nested] 48+ messages in thread

* Re: [PATCH v4 6/9] drm/xe/doc: Document device wedged and runtime survivability
  2025-07-12  5:45       ` Raag Jadav
@ 2025-07-14  9:04         ` Riana Tauro
  0 siblings, 0 replies; 48+ messages in thread
From: Riana Tauro @ 2025-07-14  9:04 UTC (permalink / raw)
  To: Raag Jadav
  Cc: intel-xe, anshuman.gupta, rodrigo.vivi, lucas.demarchi,
	aravind.iddamsetty, umesh.nerlige.ramappa, frank.scarbrough,
	sk.anirban



On 7/12/2025 11:15 AM, Raag Jadav wrote:
> On Fri, Jul 11, 2025 at 11:39:22AM +0530, Riana Tauro wrote:
>> On 7/11/2025 11:09 AM, Raag Jadav wrote:
>>> On Wed, Jul 09, 2025 at 04:50:18PM +0530, Riana Tauro wrote:
>>>> Add documentation for vendor specific device wedged recovery method
>>>> and runtime survivability.
>>>
>>> ...
>>>
>>>> + * Runtime Survivability
>>>> + * =====================
>>>> + *
>>>> + * Certain runtime firmware errors can cause the device to enter a wedged state
>>>> + * (:ref:`xe-device-wedging`) requiring a firmware flash to restore normal operation.
>>>> + * Runtime Survivability Mode indicates that a firmware flash is necessary to recover the device and
>>>> + * is indicated by the presence of survivability mode sysfs::
>>>> + *
>>>> + *	/sys/bus/pci/devices/<device>/survivability_mode
>>>> + *
>>>> + * Survivability mode sysfs provides information about the type of survivability mode.
>>>> + *
>>>> + * When such errors occur, userspace is notified with the drm device wedged uevent and runtime
>>>> + * survivability mode. User can then initiate a firmware flash to restore device to normal
>>>> + * operation.
>>>
>>> Do we have definition on actual procedure? Can we add a reference to it?
>>> Otherwise it's telling me to do something I have no idea about.
>>
>> That is a userspace tool. I don't see any kernel code referring to userspace
>> documentation.
> 
> How are we expecting users to know about it?

There is no documentation in the kernel for the fwupd or xpu-manager
userspace tools. The required procedure should be documented by the
userspace tools themselves.

I'll mention 'firmware update tools like fwupd' so that users can refer to
the respective documentation.

Thanks
Riana

> 
> Raag
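
To illustrate the sysfs check described in the quoted documentation, here is a
minimal userspace sketch that tests for the survivability_mode attribute and
returns its text verbatim. The function name read_survivability_mode() is
hypothetical, and nothing is assumed about the format of the attribute's
contents.

```c
#include <stdio.h>

/*
 * Read <sysfs_pci_dir>/survivability_mode into buf, where sysfs_pci_dir is
 * a path such as /sys/bus/pci/devices/<device>.
 * Returns 1 if the attribute could be opened (device is in survivability
 * mode), 0 if it is absent or unreadable, -1 on a bad argument.
 */
static int read_survivability_mode(const char *sysfs_pci_dir, char *buf, size_t len)
{
	char path[512];
	size_t n;
	FILE *f;

	if (len == 0 ||
	    snprintf(path, sizeof(path), "%s/survivability_mode",
		     sysfs_pci_dir) >= (int)sizeof(path))
		return -1;
	f = fopen(path, "r");
	if (!f)
		return 0;
	n = fread(buf, 1, len - 1, f);
	buf[n] = '\0';
	fclose(f);
	return 1;
}
```

A monitoring tool could call this after receiving the wedged uevent and, on a
return value of 1, direct the user to a firmware update tool such as fwupd.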


^ permalink raw reply	[flat|nested] 48+ messages in thread

* Re: [PATCH v4 1/9] drm: Add a vendor-specific recovery method to device wedged uevent
  2025-07-14  5:27                 ` Riana Tauro
@ 2025-07-14 12:33                   ` Simona Vetter
  0 siblings, 0 replies; 48+ messages in thread
From: Simona Vetter @ 2025-07-14 12:33 UTC (permalink / raw)
  To: Riana Tauro
  Cc: Simona Vetter, Christian König, Rodrigo Vivi, Raag Jadav,
	intel-xe, anshuman.gupta, lucas.demarchi, aravind.iddamsetty,
	umesh.nerlige.ramappa, frank.scarbrough, sk.anirban,
	André Almeida, David Airlie, dri-devel

On Mon, Jul 14, 2025 at 10:57:48AM +0530, Riana Tauro wrote:
> 
> 
> On 7/11/2025 2:29 PM, Simona Vetter wrote:
> > On Thu, Jul 10, 2025 at 11:37:14AM +0200, Christian König wrote:
> > > On 10.07.25 11:01, Simona Vetter wrote:
> > > > On Wed, Jul 09, 2025 at 12:52:05PM -0400, Rodrigo Vivi wrote:
> > > > > On Wed, Jul 09, 2025 at 05:18:54PM +0300, Raag Jadav wrote:
> > > > > > On Wed, Jul 09, 2025 at 04:09:20PM +0200, Christian König wrote:
> > > > > > > On 09.07.25 15:41, Simona Vetter wrote:
> > > > > > > > On Wed, Jul 09, 2025 at 04:50:13PM +0530, Riana Tauro wrote:
> > > > > > > > > Certain errors can cause the device to be wedged and may
> > > > > > > > > require a vendor specific recovery method to restore normal
> > > > > > > > > operation.
> > > > > > > > > 
> > > > > > > > > Add a recovery method 'WEDGED=vendor-specific' for such errors. Vendors
> > > > > > > > > must provide additional recovery documentation if this method
> > > > > > > > > is used.
> > > > > > > > > 
> > > > > > > > > v2: fix documentation (Raag)
> > > > > > > > > 
> > > > > > > > > Cc: André Almeida <andrealmeid@igalia.com>
> > > > > > > > > Cc: Christian König <christian.koenig@amd.com>
> > > > > > > > > Cc: David Airlie <airlied@gmail.com>
> > > > > > > > > Cc: <dri-devel@lists.freedesktop.org>
> > > > > > > > > Suggested-by: Raag Jadav <raag.jadav@intel.com>
> > > > > > > > > Signed-off-by: Riana Tauro <riana.tauro@intel.com>
> > > > > > > > 
> > > > > > > > I'm not really understanding what this is useful for, maybe concrete
> > > > > > > > example in the form of driver code that uses this, and some tool or
> > > > > > > > documentation steps that should be taken for recovery?
> > > > > 
> > > > > The case here is when FW underneath identified something badly corrupted on
> > > > > FW land and decided that only a firmware-flashing could solve the day and
> > > > > raise interrupt to the driver. At that point we want to wedge, but immediately
> > > > > hint the admin the recommended action.
> > > > > 
> > > > > > > 
> > > > > > > The recovery method for this particular case is to flash in a new firmware.
> > > > > > > 
> > > > > > > > The issues I'm seeing here is that eventually we'll get different
> > > > > > > > vendor-specific recovery steps, and maybe even on the same device, and
> > > > > > > > that leads us to an enumeration issue. Since it's just a string and an
> > > > > > > > enum I think it'd be better to just allocate a new one every time there's
> > > > > > > > a new strange recovery method instead of this opaque approach.
> > > > > > > 
> > > > > > > That is exactly the opposite of what we discussed so far.
> > > > 
> > > > Sorry, I missed that context.
> > > > 
> > > > > > > The original idea was to add a firmware-flash recovery method which
> > > > > > > looked a bit vague since it didn't give any information on what to do
> > > > > > > exactly.
> > > > > > > 
> > > > > > > That's why I suggested to add a more generic vendor-specific event
> > > > > > > which refers to the documentation and system log to see what actually
> > > > > > > needs to be done.
> > > > > > > 
> > > > > > > Otherwise we would end up with events like firmware-flash, update FW
> > > > > > > image A, update FW image B, FW version mismatch etc....
> > > > 
> > > > Yeah, that's kinda what I expect to happen, and we have enough numbers for
> > > > this all to not be an issue.
> > > > 
> > > > > > Agree. Any newly allocated method that is specific to a vendor is going to
> > > > > > be opaque anyway, since it can't be generic for all drivers. This just helps
> > > > > > reduce the noise in DRM core.
> > > > > > 
> > > > > > And yes, there could be different vendor-specific cases for the same driver
> > > > > > and the driver should be able to provide the means to distinguish between
> > > > > > them.
> > > > > 
> > > > > Sim, what's your take on this then?
> > > > > 
> > > > > Should we get back to the original idea of firmware-flash?
> > > > 
> > > > Maybe intel-firmware-flash or something, meaning prefix with the vendor?
> > > > 
> > > > The reason I think it should be specific is because I'm assuming you want
> > > > to script this. And if you have a big fleet with different vendors, then
> > > > "vendor-specific" doesn't tell you enough. But if it's something like
> > > > $vendor-$magic_step then it does become scriptable, and we do have a
> > > > place to put some documentation on what you should do instead.
> > > > 
> > > > If the point of this interface isn't that it's scriptable, then I'm not
> > > > sure why it needs to be an uevent?
> > > 
> > > You should probably read up on the previous discussion, cause that is
> > > exactly what I asked as well :)
> > > 
> > > And no, it should *not* be scripted. That would be a bit brave for a
> > > firmware update where you should absolutely not power down the system
> > > for example.
> > 
> > I guess if we clearly state that this is for manual recovery only, or for
> > cases where you exactly know what you're doing (fleet-specific scripts
> > instead of generic distros), I guess this very opaque code makes sense.
> > 
> > But we should clearly document then that doing anything scripted here is
> > very much "you get to keep the pieces", and definitely don't try to do
> > something fancy generic.
> 
> 
> The documentation is part of the series but was sent only to intel-xe
> mailing list. Will re-send the entire series to dri-devel
> 
> https://lore.kernel.org/intel-xe/aHH2XGuOvz8bSlax@black.fi.intel.com/T/#m883269cf0b1f6891ecc9c24d3d45325f46d56572

Duh, missed that, but yes definitely send the entire series to all mailing
lists. Especially when adding new drm features like this one does.

> > Which without documentation is just really confusing when some of the
> > other error codes clearly look like they're meant to facilitate scripted
> > recovery.
> > 
> 
> To get consensus on the patch, is 'vendor-specific' okay or is it better to
> have 'firmware-flash' with additional event parameter 'vendor' if number of
> macros is not a concern?

I'll refrain from a vote.
-Sima
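
On the point raised earlier in the thread about working out the vendor from
the uevent: a minimal userspace sketch, assuming the DRM device sits under a
PCI device so that DEVPATH ends in ".../<pci-device>/drm/cardN" as in the
cover letter's example. It reads the standard PCI `vendor` sysfs attribute of
the parent device. The function name pci_vendor_from_devpath() is
hypothetical.

```c
#include <stdio.h>
#include <string.h>

/*
 * Given the sysfs mount point (normally "/sys") and a uevent DEVPATH of the
 * form ".../<pci-device>/drm/cardN", read the PCI vendor ID of the parent
 * device. Returns the vendor ID (for example 0x8086 for Intel), or -1 on
 * failure.
 */
static long pci_vendor_from_devpath(const char *sysfs_root, const char *devpath)
{
	char path[1024];
	const char *drm = strstr(devpath, "/drm/");
	unsigned int vendor;
	FILE *f;
	int n;

	if (!drm)
		return -1;
	/* Keep everything before "/drm/" and append the vendor attribute. */
	n = snprintf(path, sizeof(path), "%s%.*s/vendor", sysfs_root,
		     (int)(drm - devpath), devpath);
	if (n < 0 || n >= (int)sizeof(path))
		return -1;
	f = fopen(path, "r");
	if (!f)
		return -1;
	n = fscanf(f, "%x", &vendor);
	fclose(f);
	return n == 1 ? (long)vendor : -1;
}
```

A fleet-specific script could use the returned ID to select which vendor's
recovery documentation applies before taking any action on
WEDGED=vendor-specific.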

> 
> Thanks
> Riana
> > > In my understanding the new value "vendor-specific" basically means it
> > > is a known issue with a documented solution, while "unknown" means the
> > > driver has no idea how to solve it.
> > 
> > I think that's another detail which should be documented clearly.
> > -Sima
> > > 
> > > Regards,
> > > Christian.
> > > 
> > > > I guess if you all want to stick with vendor-specific then I think that's
> > > > ok with me too, but the docs should at least explain how to figure out
> > > > from the uevent which vendor you're on with a small example. What I'm
> > > > worried is that if we have this on multiple drivers userspace will
> > > > otherwise make a complete mess and might want to run the wrong recovery
> > > > steps.
> > > > 
> > > > I think ideally, no matter what, we'd have a concrete driver patch which
> > > > then also comes with the documentation for what exactly you're supposed to
> > > > do as something you can script. And not just this stand-alone patch here.
> > > > 
> > > > Cheers, Sima
> > > > > 
> > > > > > 
> > > > > > Raag
> > > > > > 
> > > > > > > > > ---
> > > > > > > > >   Documentation/gpu/drm-uapi.rst | 9 +++++----
> > > > > > > > >   drivers/gpu/drm/drm_drv.c      | 2 ++
> > > > > > > > >   include/drm/drm_device.h       | 4 ++++
> > > > > > > > >   3 files changed, 11 insertions(+), 4 deletions(-)
> > > > > > > > > 
> > > > > > > > > diff --git a/Documentation/gpu/drm-uapi.rst b/Documentation/gpu/drm-uapi.rst
> > > > > > > > > index 263e5a97c080..c33070bdb347 100644
> > > > > > > > > --- a/Documentation/gpu/drm-uapi.rst
> > > > > > > > > +++ b/Documentation/gpu/drm-uapi.rst
> > > > > > > > > @@ -421,10 +421,10 @@ Recovery
> > > > > > > > >   Current implementation defines three recovery methods, out of which, drivers
> > > > > > > > >   can use any one, multiple or none. Method(s) of choice will be sent in the
> > > > > > > > >   uevent environment as ``WEDGED=<method1>[,..,<methodN>]`` in order of less to
> > > > > > > > > -more side-effects. If driver is unsure about recovery or method is unknown
> > > > > > > > > -(like soft/hard system reboot, firmware flashing, physical device replacement
> > > > > > > > > -or any other procedure which can't be attempted on the fly), ``WEDGED=unknown``
> > > > > > > > > -will be sent instead.
> > > > > > > > > +more side-effects. If recovery method is specific to vendor
> > > > > > > > > +``WEDGED=vendor-specific`` will be sent and userspace should refer to vendor
> > > > > > > > > +specific documentation for further recovery steps. If driver is unsure about
> > > > > > > > > +recovery or method is unknown, ``WEDGED=unknown`` will be sent instead
> > > > > > > > >   Userspace consumers can parse this event and attempt recovery as per the
> > > > > > > > >   following expectations.
> > > > > > > > > @@ -435,6 +435,7 @@ following expectations.
> > > > > > > > >       none            optional telemetry collection
> > > > > > > > >       rebind          unbind + bind driver
> > > > > > > > >       bus-reset       unbind + bus reset/re-enumeration + bind
> > > > > > > > > +    vendor-specific vendor specific recovery method
> > > > > > > > >       unknown         consumer policy
> > > > > > > > >       =============== ========================================
> > > > > > > > > diff --git a/drivers/gpu/drm/drm_drv.c b/drivers/gpu/drm/drm_drv.c
> > > > > > > > > index cdd591b11488..0ac723a46a91 100644
> > > > > > > > > --- a/drivers/gpu/drm/drm_drv.c
> > > > > > > > > +++ b/drivers/gpu/drm/drm_drv.c
> > > > > > > > > @@ -532,6 +532,8 @@ static const char *drm_get_wedge_recovery(unsigned int opt)
> > > > > > > > >   		return "rebind";
> > > > > > > > >   	case DRM_WEDGE_RECOVERY_BUS_RESET:
> > > > > > > > >   		return "bus-reset";
> > > > > > > > > +	case DRM_WEDGE_RECOVERY_VENDOR:
> > > > > > > > > +		return "vendor-specific";
> > > > > > > > >   	default:
> > > > > > > > >   		return NULL;
> > > > > > > > >   	}
> > > > > > > > > diff --git a/include/drm/drm_device.h b/include/drm/drm_device.h
> > > > > > > > > index 08b3b2467c4c..08a087f149ff 100644
> > > > > > > > > --- a/include/drm/drm_device.h
> > > > > > > > > +++ b/include/drm/drm_device.h
> > > > > > > > > @@ -26,10 +26,14 @@ struct pci_controller;
> > > > > > > > >    * Recovery methods for wedged device in order of less to more side-effects.
> > > > > > > > >    * To be used with drm_dev_wedged_event() as recovery @method. Callers can
> > > > > > > > >    * use any one, multiple (or'd) or none depending on their needs.
> > > > > > > > > + *
> > > > > > > > > + * Refer to "Device Wedging" chapter in Documentation/gpu/drm-uapi.rst for more
> > > > > > > > > + * details.
> > > > > > > > >    */
> > > > > > > > >   #define DRM_WEDGE_RECOVERY_NONE		BIT(0)	/* optional telemetry collection */
> > > > > > > > >   #define DRM_WEDGE_RECOVERY_REBIND	BIT(1)	/* unbind + bind driver */
> > > > > > > > >   #define DRM_WEDGE_RECOVERY_BUS_RESET	BIT(2)	/* unbind + reset bus device + bind */
> > > > > > > > > +#define DRM_WEDGE_RECOVERY_VENDOR	BIT(3)	/* vendor specific recovery method */
> > > > > > > > >   /**
> > > > > > > > >    * struct drm_wedge_task_info - information about the guilty task of a wedge dev
> > > > > > > > > -- 
> > > > > > > > > 2.47.1
> > > > > > > > > 
> > > > > > > > 
> > > > > > > 
> > > > 
> > > 
> > 
> 
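
The quoted documentation hunk says the uevent carries
``WEDGED=<method1>[,..,<methodN>]``, with ``vendor-specific`` and ``unknown``
as possible values. A userspace consumer could turn that value into a bitmask
roughly as sketched below. This is only an illustration under stated
assumptions: the enum, its values and parse_wedged() are hypothetical names
made up for this example, not part of the patch or of any kernel or udev
header, and tokens that are not recognised are simply treated as "unknown".

```c
#include <stddef.h>
#include <string.h>

/*
 * Hypothetical userspace-side flags (assumption: names and values are
 * illustrative only and are not taken from include/drm/drm_device.h).
 */
enum wedge_method {
	WEDGE_NONE      = 1u << 0,	/* optional telemetry collection */
	WEDGE_REBIND    = 1u << 1,	/* unbind + bind driver */
	WEDGE_BUS_RESET = 1u << 2,	/* unbind + bus reset + bind */
	WEDGE_VENDOR    = 1u << 3,	/* refer to vendor documentation */
	WEDGE_UNKNOWN   = 1u << 4,	/* consumer policy */
};

static const struct {
	const char *name;
	unsigned int flag;
} wedge_map[] = {
	{ "none",            WEDGE_NONE },
	{ "rebind",          WEDGE_REBIND },
	{ "bus-reset",       WEDGE_BUS_RESET },
	{ "vendor-specific", WEDGE_VENDOR },
	{ "unknown",         WEDGE_UNKNOWN },
};

/*
 * Parse the value of WEDGED= (the text after the '=') into a bitmask.
 * Tokens are separated by commas; any unrecognised token sets
 * WEDGE_UNKNOWN so the caller falls back to its own policy.
 */
static unsigned int parse_wedged(const char *value)
{
	const size_t n = sizeof(wedge_map) / sizeof(wedge_map[0]);
	unsigned int mask = 0;

	while (*value) {
		size_t len = strcspn(value, ",");	/* length of this token */
		size_t i;

		for (i = 0; i < n; i++) {
			if (strlen(wedge_map[i].name) == len &&
			    !strncmp(value, wedge_map[i].name, len)) {
				mask |= wedge_map[i].flag;
				break;
			}
		}
		if (i == n)
			mask |= WEDGE_UNKNOWN;

		value += len;
		if (*value == ',')
			value++;	/* skip the separator */
	}

	return mask;
}
```

With the event shown in the cover letter, parse_wedged("vendor-specific")
would return WEDGE_VENDOR, at which point the consumer has to consult the
vendor documentation rather than attempt a rebind or bus reset on its own.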

-- 
Simona Vetter
Software Engineer, Intel Corporation
http://blog.ffwll.ch

end of thread, other threads:[~2025-07-14 12:33 UTC | newest]

Thread overview: 48+ messages
2025-07-09 11:20 [PATCH v4 0/9] Handle Firmware reported Hardware Errors Riana Tauro
2025-07-09 11:20 ` [PATCH v4 1/9] drm: Add a vendor-specific recovery method to device wedged uevent Riana Tauro
2025-07-09 13:41   ` Simona Vetter
2025-07-09 14:09     ` Christian König
2025-07-09 14:18       ` Raag Jadav
2025-07-09 16:52         ` Rodrigo Vivi
2025-07-10  9:01           ` Simona Vetter
2025-07-10  9:37             ` Christian König
2025-07-10 10:24               ` Raag Jadav
2025-07-10 19:00                 ` Rodrigo Vivi
2025-07-10 21:46                   ` Raag Jadav
2025-07-11  5:17                     ` Riana Tauro
2025-07-11  6:08                       ` Raag Jadav
2025-07-11  8:56                   ` Simona Vetter
2025-07-11  8:59               ` Simona Vetter
2025-07-14  5:27                 ` Riana Tauro
2025-07-14 12:33                   ` Simona Vetter
2025-07-09 14:46     ` Riana Tauro
2025-07-09 11:20 ` [PATCH v4 2/9] drm/xe: Set GT as wedged before sending " Riana Tauro
2025-07-09 17:26   ` Matthew Brost
2025-07-09 11:20 ` [PATCH v4 3/9] drm/xe: Add a helper function to set recovery method Riana Tauro
2025-07-09 11:20 ` [PATCH v4 4/9] drm/xe/xe_survivability: Refactor survivability mode Riana Tauro
2025-07-09 11:20 ` [PATCH v4 5/9] drm/xe/xe_survivability: Add support for Runtime " Riana Tauro
2025-07-09 23:44   ` Umesh Nerlige Ramappa
2025-07-10  5:59     ` Riana Tauro
2025-07-10 17:12       ` Umesh Nerlige Ramappa
2025-07-11  5:23         ` Riana Tauro
2025-07-09 11:20 ` [PATCH v4 6/9] drm/xe/doc: Document device wedged and runtime survivability Riana Tauro
2025-07-11  5:39   ` Raag Jadav
2025-07-11  6:09     ` Riana Tauro
2025-07-12  5:45       ` Raag Jadav
2025-07-14  9:04         ` Riana Tauro
2025-07-09 11:20 ` [PATCH v4 7/9] drm/xe: Add support to handle hardware errors Riana Tauro
2025-07-10 21:09   ` Umesh Nerlige Ramappa
2025-07-11  5:35     ` Riana Tauro
2025-07-11 17:34       ` Umesh Nerlige Ramappa
2025-07-09 11:20 ` [PATCH v4 8/9] drm/xe/xe_hw_error: Handle CSC Firmware reported Hardware errors Riana Tauro
2025-07-11  0:36   ` Umesh Nerlige Ramappa
2025-07-11  5:46     ` Riana Tauro
2025-07-11 17:38       ` Umesh Nerlige Ramappa
2025-07-09 11:20 ` [PATCH v4 9/9] drm/xe/xe_hw_error: Add fault injection to trigger csc error handler Riana Tauro
2025-07-11 17:41   ` Umesh Nerlige Ramappa
2025-07-14  7:07     ` Riana Tauro
2025-07-09 12:28 ` ✗ CI.checkpatch: warning for Handle Firmware reported Hardware Errors (rev4) Patchwork
2025-07-09 12:30 ` ✓ CI.KUnit: success " Patchwork
2025-07-09 12:44 ` ✗ CI.checksparse: warning " Patchwork
2025-07-09 13:06 ` ✓ Xe.CI.BAT: success " Patchwork
2025-07-09 15:02 ` ✗ Xe.CI.Full: failure " Patchwork