[PATCH 0/4] Handle Firmware reported Hardware Errors

* [PATCH 0/4]  Handle Firmware reported Hardware Errors
@ 2025-06-03  8:13 Riana Tauro
  2025-06-03  8:13 ` [PATCH 1/4] drm: Add a firmware flash method to device wedged uevent Riana Tauro
                   ` (3 more replies)
  0 siblings, 4 replies; 12+ messages in thread
From: Riana Tauro @ 2025-06-03  8:13 UTC (permalink / raw)
  To: intel-xe, dri-devel
  Cc: riana.tauro, anshuman.gupta, rodrigo.vivi, lucas.demarchi,
	aravind.iddamsetty, raag.jadav, himal.prasad.ghimiray,
	frank.scarbrough

Add support to handle firmware reported errors. When CSC firmware
errors are encountered, an error interrupt is received by the GFX device
as an MSI interrupt.

The Device Source control registers indicate the source of the error as
CSC. The HEC error status register indicates that the error is firmware
reported. Depending on the type of firmware error, the error cause is
written to the HEC Firmware error register.

On encountering such CSC firmware errors, the graphics device is
non-recoverable from driver context. The only way to recover from these
errors is a firmware flash.

Add a firmware flash method to the drm device wedged uevent. Send
this method along with the uevent to notify userspace of the wedged
state and the possible recovery method.

$ udevadm monitor --property --kernel
monitor will print the received events for:
KERNEL - the kernel uevent

KERNEL[754.709341] change   /devices/pci0000:00/0000:00:01.0/0000:01:00.0/0000:02:01.0/0000:03:00.0/drm/card0 (drm)
ACTION=change
DEVPATH=/devices/pci0000:00/0000:00:01.0/0000:01:00.0/0000:02:01.0/0000:03:00.0/drm/card0
SUBSYSTEM=drm
WEDGED=firmware-flash
DEVNAME=/dev/dri/card0
DEVTYPE=drm_minor
SEQNUM=5973
MAJOR=226
MINOR=0
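
A consumer of this uevent has to split the WEDGED value on commas and map
each token to a recovery action. The following is a minimal userspace
sketch of that parsing step, not code from this series. The WEDGE_* macros,
the WEDGE_UNKNOWN marker and parse_wedged_property() are hypothetical names
chosen for illustration; only the token strings come from the uAPI
documentation touched by patch 1.

```c
#include <stddef.h>
#include <string.h>

/* Illustrative bit values; they follow the order of the documented methods. */
#define WEDGE_NONE      (1u << 0)
#define WEDGE_REBIND    (1u << 1)
#define WEDGE_BUS_RESET (1u << 2)
#define WEDGE_FW_FLASH  (1u << 3)
#define WEDGE_UNKNOWN   (1u << 4) /* hypothetical: "unknown" or an unrecognised token */

static const struct {
	const char *name;
	unsigned int bit;
} wedge_names[] = {
	{ "none", WEDGE_NONE },
	{ "rebind", WEDGE_REBIND },
	{ "bus-reset", WEDGE_BUS_RESET },
	{ "firmware-flash", WEDGE_FW_FLASH },
};

/*
 * Parse one uevent property line such as "WEDGED=rebind,bus-reset".
 * Returns a bitmask of methods, or 0 if the line is not a WEDGED property.
 */
unsigned int parse_wedged_property(const char *line)
{
	static const char key[] = "WEDGED=";
	unsigned int mask = 0;
	const char *p;

	if (strncmp(line, key, sizeof(key) - 1) != 0)
		return 0;

	p = line + sizeof(key) - 1;
	while (*p) {
		size_t len = strcspn(p, ","); /* length of the current token */
		unsigned int bit = WEDGE_UNKNOWN;
		size_t i;

		for (i = 0; i < sizeof(wedge_names) / sizeof(wedge_names[0]); i++) {
			if (strlen(wedge_names[i].name) == len &&
			    strncmp(p, wedge_names[i].name, len) == 0) {
				bit = wedge_names[i].bit;
				break;
			}
		}
		mask |= bit;
		p += len;
		if (*p == ',')
			p++; /* skip the separator */
	}
	return mask;
}
```

Treating unrecognised tokens as "unknown" keeps an older consumer safe when
the kernel adds a new method: it falls back to its own policy instead of
guessing.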

Bspec: 50875, 53073, 53074, 53075, 53076

Riana Tauro (4):
  drm: Add a firmware flash method to device wedged uevent
  drm/xe: Add a helper function to set recovery method
  drm/xe: Add support to handle hardware errors
  drm/xe/xe_hw_error: Handle CSC Firmware reported Hardware errors

 Documentation/gpu/drm-uapi.rst             |   6 +-
 drivers/gpu/drm/drm_drv.c                  |   2 +
 drivers/gpu/drm/xe/Makefile                |   1 +
 drivers/gpu/drm/xe/regs/xe_gsc_regs.h      |   2 +
 drivers/gpu/drm/xe/regs/xe_hw_error_regs.h |  20 +++
 drivers/gpu/drm/xe/regs/xe_irq_regs.h      |   1 +
 drivers/gpu/drm/xe/xe_device.c             |  30 +++-
 drivers/gpu/drm/xe/xe_device.h             |   1 +
 drivers/gpu/drm/xe/xe_device_types.h       |   5 +
 drivers/gpu/drm/xe/xe_hw_error.c           | 171 +++++++++++++++++++++
 drivers/gpu/drm/xe/xe_hw_error.h           |  15 ++
 drivers/gpu/drm/xe/xe_irq.c                |   4 +
 include/drm/drm_device.h                   |   1 +
 13 files changed, 249 insertions(+), 10 deletions(-)
 create mode 100644 drivers/gpu/drm/xe/regs/xe_hw_error_regs.h
 create mode 100644 drivers/gpu/drm/xe/xe_hw_error.c
 create mode 100644 drivers/gpu/drm/xe/xe_hw_error.h

-- 
2.47.1



* [PATCH 1/4] drm: Add a firmware flash method to device wedged uevent
  2025-06-03  8:13 [PATCH 0/4] Handle Firmware reported Hardware Errors Riana Tauro
@ 2025-06-03  8:13 ` Riana Tauro
  2025-06-04 10:43   ` Raag Jadav
  2025-06-03  8:13 ` [PATCH 2/4] drm/xe: Add a helper function to set recovery method Riana Tauro
                   ` (2 subsequent siblings)
  3 siblings, 1 reply; 12+ messages in thread
From: Riana Tauro @ 2025-06-03  8:13 UTC (permalink / raw)
  To: intel-xe, dri-devel
  Cc: riana.tauro, anshuman.gupta, rodrigo.vivi, lucas.demarchi,
	aravind.iddamsetty, raag.jadav, himal.prasad.ghimiray,
	frank.scarbrough

A device is declared wedged when it is non-recoverable from
the driver context. Some firmware errors can also cause
the device to enter this state, and the only way to recover
from it is a firmware flash.

Signed-off-by: Riana Tauro <riana.tauro@intel.com>
---
 Documentation/gpu/drm-uapi.rst | 6 +++---
 drivers/gpu/drm/drm_drv.c      | 2 ++
 include/drm/drm_device.h       | 1 +
 3 files changed, 6 insertions(+), 3 deletions(-)

diff --git a/Documentation/gpu/drm-uapi.rst b/Documentation/gpu/drm-uapi.rst
index 4863a4deb0ee..524224afb09f 100644
--- a/Documentation/gpu/drm-uapi.rst
+++ b/Documentation/gpu/drm-uapi.rst
@@ -422,9 +422,8 @@ Current implementation defines three recovery methods, out of which, drivers
 can use any one, multiple or none. Method(s) of choice will be sent in the
 uevent environment as ``WEDGED=<method1>[,..,<methodN>]`` in order of less to
 more side-effects. If driver is unsure about recovery or method is unknown
-(like soft/hard system reboot, firmware flashing, physical device replacement
-or any other procedure which can't be attempted on the fly), ``WEDGED=unknown``
-will be sent instead.
+(like soft/hard system reboot, physical device replacement or any other procedure
+which can't be attempted on the fly), ``WEDGED=unknown`` will be sent instead.
 
 Userspace consumers can parse this event and attempt recovery as per the
 following expectations.
@@ -435,6 +434,7 @@ following expectations.
     none            optional telemetry collection
     rebind          unbind + bind driver
     bus-reset       unbind + bus reset/re-enumeration + bind
+    firmware-flash  unbind + firmware flash + bind
     unknown         consumer policy
     =============== ========================================
 
diff --git a/drivers/gpu/drm/drm_drv.c b/drivers/gpu/drm/drm_drv.c
index 56dd61f8e05a..e287424db753 100644
--- a/drivers/gpu/drm/drm_drv.c
+++ b/drivers/gpu/drm/drm_drv.c
@@ -533,6 +533,8 @@ static const char *drm_get_wedge_recovery(unsigned int opt)
 		return "rebind";
 	case DRM_WEDGE_RECOVERY_BUS_RESET:
 		return "bus-reset";
+	case DRM_WEDGE_RECOVERY_FW_FLASH:
+		return "firmware-flash";
 	default:
 		return NULL;
 	}
diff --git a/include/drm/drm_device.h b/include/drm/drm_device.h
index e2f894f1b90a..e446e40f8306 100644
--- a/include/drm/drm_device.h
+++ b/include/drm/drm_device.h
@@ -29,6 +29,7 @@ struct pci_controller;
 #define DRM_WEDGE_RECOVERY_NONE		BIT(0)	/* optional telemetry collection */
 #define DRM_WEDGE_RECOVERY_REBIND	BIT(1)	/* unbind + bind driver */
 #define DRM_WEDGE_RECOVERY_BUS_RESET	BIT(2)	/* unbind + reset bus device + bind */
+#define DRM_WEDGE_RECOVERY_FW_FLASH	BIT(3)  /* unbind + firmware flash + bind */
 
 /**
  * enum switch_power_state - power state of drm device
-- 
2.47.1
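
The patch above adds one more bit to the recovery bitmask and one more name
to the bit-to-string mapping. As a rough userspace model of how such a
bitmask could be turned into the "WEDGED=<method1>[,..,<methodN>]" string
(ordered from fewer to more side-effects), consider the sketch below. It is
not the kernel implementation: the WEDGE_RECOVERY_* macros, wedge_recovery_name()
and format_wedged() are hypothetical stand-ins, and falling back to "unknown"
for an empty mask is an assumption based on the documentation text.

```c
#include <stddef.h>
#include <stdio.h>

/* Stand-ins for the DRM_WEDGE_RECOVERY_* bits, in the same bit order. */
#define WEDGE_RECOVERY_NONE      (1ul << 0)
#define WEDGE_RECOVERY_REBIND    (1ul << 1)
#define WEDGE_RECOVERY_BUS_RESET (1ul << 2)
#define WEDGE_RECOVERY_FW_FLASH  (1ul << 3)

/* Map a single bit to its uevent name; NULL for zero or an unknown bit. */
static const char *wedge_recovery_name(unsigned long bit)
{
	switch (bit) {
	case WEDGE_RECOVERY_NONE:
		return "none";
	case WEDGE_RECOVERY_REBIND:
		return "rebind";
	case WEDGE_RECOVERY_BUS_RESET:
		return "bus-reset";
	case WEDGE_RECOVERY_FW_FLASH:
		return "firmware-flash";
	default:
		return NULL;
	}
}

/* Build the uevent string in buf. Returns 0 on success, -1 on truncation. */
int format_wedged(unsigned long method, char *buf, size_t size)
{
	const char *sep = "";
	size_t len;
	unsigned int i;
	int n;

	n = snprintf(buf, size, "WEDGED=");
	if (n < 0 || (size_t)n >= size)
		return -1;
	len = (size_t)n;

	/* Walk bits from lowest to highest so the order matches the docs. */
	for (i = 0; i < 4; i++) {
		const char *name = wedge_recovery_name(method & (1ul << i));

		if (!name)
			continue;
		n = snprintf(buf + len, size - len, "%s%s", sep, name);
		if (n < 0 || (size_t)n >= size - len)
			return -1;
		len += (size_t)n;
		sep = ",";
	}

	/* Assumption: no known method means the consumer gets "unknown". */
	if (!*sep) {
		n = snprintf(buf + len, size - len, "unknown");
		if (n < 0 || (size_t)n >= size - len)
			return -1;
	}
	return 0;
}
```

Because each method is a separate bit, a driver can advertise several
methods at once, while firmware-flash is expected to be sent on its own for
the CSC errors handled later in this series.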



* [PATCH 2/4] drm/xe: Add a helper function to set recovery method
  2025-06-03  8:13 [PATCH 0/4] Handle Firmware reported Hardware Errors Riana Tauro
  2025-06-03  8:13 ` [PATCH 1/4] drm: Add a firmware flash method to device wedged uevent Riana Tauro
@ 2025-06-03  8:13 ` Riana Tauro
  2025-06-06 15:12   ` Raag Jadav
  2025-06-03  8:13 ` [PATCH 3/4] drm/xe: Add support to handle hardware errors Riana Tauro
  2025-06-03  8:14 ` [PATCH 4/4] drm/xe/xe_hw_error: Handle CSC Firmware reported Hardware errors Riana Tauro
  3 siblings, 1 reply; 12+ messages in thread
From: Riana Tauro @ 2025-06-03  8:13 UTC (permalink / raw)
  To: intel-xe, dri-devel
  Cc: riana.tauro, anshuman.gupta, rodrigo.vivi, lucas.demarchi,
	aravind.iddamsetty, raag.jadav, himal.prasad.ghimiray,
	frank.scarbrough

Add a helper function to set the recovery method. The recovery
method has to be set before declaring the device wedged and sending the
drm wedged uevent. If no method is set, the default methods
(rebind and bus-reset) are used.

Signed-off-by: Riana Tauro <riana.tauro@intel.com>
---
 drivers/gpu/drm/xe/xe_device.c       | 30 +++++++++++++++++++++-------
 drivers/gpu/drm/xe/xe_device.h       |  1 +
 drivers/gpu/drm/xe/xe_device_types.h |  2 ++
 3 files changed, 26 insertions(+), 7 deletions(-)

diff --git a/drivers/gpu/drm/xe/xe_device.c b/drivers/gpu/drm/xe/xe_device.c
index 660b0c5126dc..3fd604ebdc6e 100644
--- a/drivers/gpu/drm/xe/xe_device.c
+++ b/drivers/gpu/drm/xe/xe_device.c
@@ -1120,16 +1120,28 @@ static void xe_device_wedged_fini(struct drm_device *drm, void *arg)
 	xe_pm_runtime_put(xe);
 }
 
+/**
+ * xe_device_set_wedged_method - Set wedged recovery method
+ * @xe: xe device instance
+ *
+ * Set wedged recovery method to be sent using drm wedged uevent.
+ */
+void xe_device_set_wedged_method(struct xe_device *xe, unsigned long method)
+{
+	xe->wedged.method = method;
+}
+
 /**
  * xe_device_declare_wedged - Declare device wedged
  * @xe: xe device instance
  *
- * This is a final state that can only be cleared with a module
- * re-probe (unbind + bind).
- * In this state every IOCTL will be blocked so the GT cannot be used.
+ * This is a final state that can only be cleared with the method specified
+ * in the drm wedged uevent. The method needs to be set using xe_device_set_wedged_method
+ * before declaring the device as wedged or the default method of reprobe (unbind/re-bind)
+ * will be sent. In this state every IOCTL will be blocked so the GT cannot be used.
  * In general it will be called upon any critical error such as gt reset
- * failure or guc loading failure. Userspace will be notified of this state
- * through device wedged uevent.
+ * failure or guc loading failure or firmware failure.
+ * Userspace will be notified of this state through device wedged uevent.
  * If xe.wedged module parameter is set to 2, this function will be called
  * on every single execution timeout (a.k.a. GPU hang) right after devcoredump
  * snapshot capture. In this mode, GT reset won't be attempted so the state of
@@ -1152,6 +1164,11 @@ void xe_device_declare_wedged(struct xe_device *xe)
 		return;
 	}
 
+	/* If no wedge recovery method is set, use default */
+	if (!xe->wedged.method)
+		xe_device_set_wedged_method(xe, DRM_WEDGE_RECOVERY_REBIND
+					    | DRM_WEDGE_RECOVERY_BUS_RESET);
+
 	if (!atomic_xchg(&xe->wedged.flag, 1)) {
 		xe->needs_flr_on_fini = true;
 		drm_err(&xe->drm,
@@ -1161,8 +1178,7 @@ void xe_device_declare_wedged(struct xe_device *xe)
 			dev_name(xe->drm.dev));
 
 		/* Notify userspace of wedged device */
-		drm_dev_wedged_event(&xe->drm,
-				     DRM_WEDGE_RECOVERY_REBIND | DRM_WEDGE_RECOVERY_BUS_RESET);
+		drm_dev_wedged_event(&xe->drm, xe->wedged.method);
 	}
 
 	for_each_gt(gt, xe, id)
diff --git a/drivers/gpu/drm/xe/xe_device.h b/drivers/gpu/drm/xe/xe_device.h
index 0bc3bc8e6803..06350740aac5 100644
--- a/drivers/gpu/drm/xe/xe_device.h
+++ b/drivers/gpu/drm/xe/xe_device.h
@@ -191,6 +191,7 @@ static inline bool xe_device_wedged(struct xe_device *xe)
 }
 
 void xe_device_declare_wedged(struct xe_device *xe);
+void xe_device_set_wedged_method(struct xe_device *xe, unsigned long method);
 
 struct xe_file *xe_file_get(struct xe_file *xef);
 void xe_file_put(struct xe_file *xef);
diff --git a/drivers/gpu/drm/xe/xe_device_types.h b/drivers/gpu/drm/xe/xe_device_types.h
index b93c04466637..fb3617956d63 100644
--- a/drivers/gpu/drm/xe/xe_device_types.h
+++ b/drivers/gpu/drm/xe/xe_device_types.h
@@ -559,6 +559,8 @@ struct xe_device {
 		atomic_t flag;
 		/** @wedged.mode: Mode controlled by kernel parameter and debugfs */
 		int mode;
+		/** @wedged.method: Recovery method to be sent in the drm device wedged uevent */
+		unsigned long method;
 	} wedged;
 
 	/** @bo_device: Struct to control async free of BOs */
-- 
2.47.1
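
The ordering requirement in this patch (set the method first, then declare
the device wedged) can be modelled in plain C11. The sketch below is a
simplified userspace model, not the driver code: struct wedged_state, the
last_event field that records what would be sent, and the WEDGE_RECOVERY_*
macros are all hypothetical. It assumes, as in the diff above, that the
default is applied only when no method was set and that only the first
declaration emits an event.

```c
#include <stdatomic.h>

#define WEDGE_RECOVERY_REBIND    (1ul << 1)
#define WEDGE_RECOVERY_BUS_RESET (1ul << 2)
#define WEDGE_RECOVERY_FW_FLASH  (1ul << 3)

struct wedged_state {
	atomic_int flag;          /* set once the device is wedged */
	unsigned long method;     /* recovery method bitmask, 0 if unset */
	unsigned long last_event; /* hypothetical: method sent in the uevent */
};

void set_wedged_method(struct wedged_state *w, unsigned long method)
{
	w->method = method;
}

void declare_wedged(struct wedged_state *w)
{
	/* If no recovery method was set beforehand, use the default. */
	if (!w->method)
		set_wedged_method(w, WEDGE_RECOVERY_REBIND |
				     WEDGE_RECOVERY_BUS_RESET);

	/* Only the first caller flips the flag and notifies userspace. */
	if (!atomic_exchange(&w->flag, 1))
		w->last_event = w->method;
}
```

One consequence visible in this model: changing the method after the device
has already been declared wedged has no effect on the event userspace saw,
which is why the caller must set the method first.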



* [PATCH 3/4] drm/xe: Add support to handle hardware errors
  2025-06-03  8:13 [PATCH 0/4] Handle Firmware reported Hardware Errors Riana Tauro
  2025-06-03  8:13 ` [PATCH 1/4] drm: Add a firmware flash method to device wedged uevent Riana Tauro
  2025-06-03  8:13 ` [PATCH 2/4] drm/xe: Add a helper function to set recovery method Riana Tauro
@ 2025-06-03  8:13 ` Riana Tauro
  2025-06-03  8:14 ` [PATCH 4/4] drm/xe/xe_hw_error: Handle CSC Firmware reported Hardware errors Riana Tauro
  3 siblings, 0 replies; 12+ messages in thread
From: Riana Tauro @ 2025-06-03  8:13 UTC (permalink / raw)
  To: intel-xe, dri-devel
  Cc: riana.tauro, anshuman.gupta, rodrigo.vivi, lucas.demarchi,
	aravind.iddamsetty, raag.jadav, himal.prasad.ghimiray,
	frank.scarbrough

The Gfx device reports two classes of errors: uncorrectable and
correctable. Depending on the severity, uncorrectable errors are
further classified as non-fatal and fatal.

Correctable and non-fatal errors are reported as MSIs, and bits in
the Master Interrupt Register indicate the class of the error.
The source of the error is then read from the Device Error Source
Register. Fatal errors are reported as PCIe errors.
When a PCIe error is asserted, the OS performs a device warm reset,
which causes the driver to reload. The error registers are sticky
and their values are maintained through a warm reset.

Add basic support to handle these errors.

Bspec: 50875, 53073, 53074, 53075, 53076

Co-developed-by: Himal Prasad Ghimiray <himal.prasad.ghimiray@intel.com>
Signed-off-by: Himal Prasad Ghimiray <himal.prasad.ghimiray@intel.com>
Signed-off-by: Riana Tauro <riana.tauro@intel.com>
---
 drivers/gpu/drm/xe/Makefile                |   1 +
 drivers/gpu/drm/xe/regs/xe_hw_error_regs.h |  15 +++
 drivers/gpu/drm/xe/regs/xe_irq_regs.h      |   1 +
 drivers/gpu/drm/xe/xe_hw_error.c           | 108 +++++++++++++++++++++
 drivers/gpu/drm/xe/xe_hw_error.h           |  15 +++
 drivers/gpu/drm/xe/xe_irq.c                |   4 +
 6 files changed, 144 insertions(+)
 create mode 100644 drivers/gpu/drm/xe/regs/xe_hw_error_regs.h
 create mode 100644 drivers/gpu/drm/xe/xe_hw_error.c
 create mode 100644 drivers/gpu/drm/xe/xe_hw_error.h

diff --git a/drivers/gpu/drm/xe/Makefile b/drivers/gpu/drm/xe/Makefile
index e4bf484d4121..29f4e64068b7 100644
--- a/drivers/gpu/drm/xe/Makefile
+++ b/drivers/gpu/drm/xe/Makefile
@@ -73,6 +73,7 @@ xe-y += xe_bb.o \
 	xe_hw_engine.o \
 	xe_hw_engine_class_sysfs.o \
 	xe_hw_engine_group.o \
+	xe_hw_error.o \
 	xe_hw_fence.o \
 	xe_irq.o \
 	xe_lrc.o \
diff --git a/drivers/gpu/drm/xe/regs/xe_hw_error_regs.h b/drivers/gpu/drm/xe/regs/xe_hw_error_regs.h
new file mode 100644
index 000000000000..ed9b81fb28a0
--- /dev/null
+++ b/drivers/gpu/drm/xe/regs/xe_hw_error_regs.h
@@ -0,0 +1,15 @@
+/* SPDX-License-Identifier: MIT */
+/*
+ * Copyright © 2025 Intel Corporation
+ */
+
+#ifndef _XE_HW_ERROR_REGS_H_
+#define _XE_HW_ERROR_REGS_H_
+
+#define DEV_ERR_STAT_NONFATAL			0x100178
+#define DEV_ERR_STAT_CORRECTABLE		0x10017c
+#define DEV_ERR_STAT_REG(x)			XE_REG(_PICK_EVEN((x), \
+								  DEV_ERR_STAT_CORRECTABLE, \
+								  DEV_ERR_STAT_NONFATAL))
+
+#endif
diff --git a/drivers/gpu/drm/xe/regs/xe_irq_regs.h b/drivers/gpu/drm/xe/regs/xe_irq_regs.h
index f0ecfcac4003..2758b64cec9e 100644
--- a/drivers/gpu/drm/xe/regs/xe_irq_regs.h
+++ b/drivers/gpu/drm/xe/regs/xe_irq_regs.h
@@ -18,6 +18,7 @@
 #define GFX_MSTR_IRQ				XE_REG(0x190010, XE_REG_OPTION_VF)
 #define   MASTER_IRQ				REG_BIT(31)
 #define   GU_MISC_IRQ				REG_BIT(29)
+#define   ERROR_IRQ(x)				REG_BIT(26 + (x))
 #define   DISPLAY_IRQ				REG_BIT(16)
 #define   GT_DW_IRQ(x)				REG_BIT(x)
 
diff --git a/drivers/gpu/drm/xe/xe_hw_error.c b/drivers/gpu/drm/xe/xe_hw_error.c
new file mode 100644
index 000000000000..0f2590839900
--- /dev/null
+++ b/drivers/gpu/drm/xe/xe_hw_error.c
@@ -0,0 +1,108 @@
+// SPDX-License-Identifier: MIT
+/*
+ * Copyright © 2025 Intel Corporation
+ */
+
+#include "regs/xe_hw_error_regs.h"
+#include "regs/xe_irq_regs.h"
+
+#include "xe_device.h"
+#include "xe_hw_error.h"
+#include "xe_mmio.h"
+
+/* Error categories reported by hardware */
+enum hardware_error {
+	HARDWARE_ERROR_CORRECTABLE = 0,
+	HARDWARE_ERROR_NONFATAL = 1,
+	HARDWARE_ERROR_FATAL = 2,
+	HARDWARE_ERROR_MAX,
+};
+
+static const char *hw_error_to_str(const enum hardware_error hw_err)
+{
+	switch (hw_err) {
+	case HARDWARE_ERROR_CORRECTABLE:
+		return "CORRECTABLE";
+	case HARDWARE_ERROR_NONFATAL:
+		return "NONFATAL";
+	case HARDWARE_ERROR_FATAL:
+		return "FATAL";
+	default:
+		return "UNKNOWN";
+	}
+}
+
+static void hw_error_source_handler(struct xe_tile *tile, const enum hardware_error hw_err)
+{
+	const char *hw_err_str = hw_error_to_str(hw_err);
+	struct xe_device *xe = tile_to_xe(tile);
+	unsigned long flags;
+	u32 err_src;
+
+	if (xe->info.platform != XE_BATTLEMAGE)
+		return;
+
+	spin_lock_irqsave(&xe->irq.lock, flags);
+	err_src = xe_mmio_read32(&tile->mmio, DEV_ERR_STAT_REG(hw_err));
+	if (!err_src) {
+		drm_err_ratelimited(&xe->drm, HW_ERR "Tile%d reported DEV_ERR_STAT_%s blank!\n",
+				    tile->id, hw_err_str);
+		goto unlock;
+	}
+
+	/* TODO: Process errrors per source */
+
+	xe_mmio_write32(&tile->mmio, DEV_ERR_STAT_REG(hw_err), err_src);
+
+unlock:
+	spin_unlock_irqrestore(&xe->irq.lock, flags);
+}
+
+/**
+ * xe_hw_error_irq_handler - irq handling for hw errors
+ * @tile: tile instance
+ * @master_ctl: value read from master interrupt register
+ *
+ * Xe platforms add three error bits to the master interrupt register to support error handling.
+ * These three bits are used to convey the class of error FATAL, NONFATAL, or CORRECTABLE.
+ * To process the interrupt, determine the source of error by reading the Device Error Source
+ * Register that corresponds to the class of error being serviced.
+ */
+void xe_hw_error_irq_handler(struct xe_tile *tile, const u32 master_ctl)
+{
+	enum hardware_error hw_err;
+
+	for (hw_err = 0; hw_err < HARDWARE_ERROR_MAX; hw_err++)
+		if (master_ctl & ERROR_IRQ(hw_err))
+			hw_error_source_handler(tile, hw_err);
+}
+
+/*
+ * Process hardware errors during boot
+ */
+static void process_hw_errors(struct xe_device *xe)
+{
+	struct xe_tile *tile;
+	u32 master_ctl;
+	u8 id;
+
+	for_each_tile(tile, xe, id) {
+		master_ctl = xe_mmio_read32(&tile->mmio, GFX_MSTR_IRQ);
+		xe_hw_error_irq_handler(tile, master_ctl);
+		xe_mmio_write32(&tile->mmio, GFX_MSTR_IRQ, master_ctl);
+	}
+}
+
+/**
+ * xe_hw_error_init - Initialize hw errors
+ * @xe: xe device instance
+ *
+ * Initialize and process hw errors
+ */
+void xe_hw_error_init(struct xe_device *xe)
+{
+	if (!IS_DGFX(xe) || IS_SRIOV_VF(xe))
+		return;
+
+	process_hw_errors(xe);
+}
diff --git a/drivers/gpu/drm/xe/xe_hw_error.h b/drivers/gpu/drm/xe/xe_hw_error.h
new file mode 100644
index 000000000000..d86e28c5180c
--- /dev/null
+++ b/drivers/gpu/drm/xe/xe_hw_error.h
@@ -0,0 +1,15 @@
+/* SPDX-License-Identifier: MIT */
+/*
+ * Copyright © 2025 Intel Corporation
+ */
+#ifndef XE_HW_ERROR_H_
+#define XE_HW_ERROR_H_
+
+#include <linux/types.h>
+
+struct xe_tile;
+struct xe_device;
+
+void xe_hw_error_irq_handler(struct xe_tile *tile, const u32 master_ctl);
+void xe_hw_error_init(struct xe_device *xe);
+#endif
diff --git a/drivers/gpu/drm/xe/xe_irq.c b/drivers/gpu/drm/xe/xe_irq.c
index 5362d3174b06..24ccf3bec52c 100644
--- a/drivers/gpu/drm/xe/xe_irq.c
+++ b/drivers/gpu/drm/xe/xe_irq.c
@@ -18,6 +18,7 @@
 #include "xe_gt.h"
 #include "xe_guc.h"
 #include "xe_hw_engine.h"
+#include "xe_hw_error.h"
 #include "xe_memirq.h"
 #include "xe_mmio.h"
 #include "xe_pxp.h"
@@ -466,6 +467,7 @@ static irqreturn_t dg1_irq_handler(int irq, void *arg)
 		xe_mmio_write32(mmio, GFX_MSTR_IRQ, master_ctl);
 
 		gt_irq_handler(tile, master_ctl, intr_dw, identity);
+		xe_hw_error_irq_handler(tile, master_ctl);
 
 		/*
 		 * Display interrupts (including display backlight operations
@@ -753,6 +755,8 @@ int xe_irq_install(struct xe_device *xe)
 	int nvec = 1;
 	int err;
 
+	xe_hw_error_init(xe);
+
 	xe_irq_reset(xe);
 
 	if (xe_device_has_msix(xe)) {
-- 
2.47.1
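
Two small pieces of arithmetic in the patch above are easy to misread: the
DEV_ERR_STAT_REG(x) offset selection and the ERROR_IRQ(x) bit position. The
userspace sketch below reproduces both, assuming the usual even-stride
definition of _PICK_EVEN, i.e. a + index * (b - a). The helper names
pick_even(), dev_err_stat_offset() and pending_error_classes() are
hypothetical; the constants are the ones from the diff.

```c
#include <stdint.h>

#define DEV_ERR_STAT_NONFATAL    UINT32_C(0x100178)
#define DEV_ERR_STAT_CORRECTABLE UINT32_C(0x10017c)

#define HARDWARE_ERROR_MAX 3
#define ERROR_IRQ(x)       (UINT32_C(1) << (26 + (x)))

/* Even-stride pick: unsigned wraparound handles the case b < a. */
static uint32_t pick_even(uint32_t index, uint32_t a, uint32_t b)
{
	return a + index * (b - a);
}

/* Offset of the error status register for a given error class. */
uint32_t dev_err_stat_offset(unsigned int hw_err)
{
	return pick_even(hw_err, DEV_ERR_STAT_CORRECTABLE,
			 DEV_ERR_STAT_NONFATAL);
}

/* Bitmask of error classes (bit n = class n) flagged in master_ctl. */
unsigned int pending_error_classes(uint32_t master_ctl)
{
	unsigned int classes = 0, hw_err;

	for (hw_err = 0; hw_err < HARDWARE_ERROR_MAX; hw_err++)
		if (master_ctl & ERROR_IRQ(hw_err))
			classes |= 1u << hw_err;
	return classes;
}
```

With these definitions the stride is -4: class 0 (correctable) selects
0x10017c, class 1 (non-fatal) selects 0x100178, and class 2 (fatal) works
out to 0x100174 by the same arithmetic. The classes map to master interrupt
bits 26, 27 and 28.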



* [PATCH 4/4] drm/xe/xe_hw_error: Handle CSC Firmware reported Hardware errors
  2025-06-03  8:13 [PATCH 0/4] Handle Firmware reported Hardware Errors Riana Tauro
                   ` (2 preceding siblings ...)
  2025-06-03  8:13 ` [PATCH 3/4] drm/xe: Add support to handle hardware errors Riana Tauro
@ 2025-06-03  8:14 ` Riana Tauro
  2025-06-03  9:31   ` Ghimiray, Himal Prasad
  3 siblings, 1 reply; 12+ messages in thread
From: Riana Tauro @ 2025-06-03  8:14 UTC (permalink / raw)
  To: intel-xe, dri-devel
  Cc: riana.tauro, anshuman.gupta, rodrigo.vivi, lucas.demarchi,
	aravind.iddamsetty, raag.jadav, himal.prasad.ghimiray,
	frank.scarbrough

Add support to handle CSC firmware reported errors. When CSC firmware
errors are encountered, an error interrupt is received by the GFX device
as an MSI interrupt.

The Device Source control registers indicate the source of the error as
CSC. The HEC error status register indicates that the error is firmware
reported. Depending on the type of error, the error cause is written to
the HEC Firmware error register.

On encountering such CSC firmware errors, the graphics device is
non-recoverable from driver context. The only way to recover from these
errors is a firmware flash. The device is then wedged and userspace is
notified with a drm uevent.

Signed-off-by: Riana Tauro <riana.tauro@intel.com>
---
 drivers/gpu/drm/xe/regs/xe_gsc_regs.h      |  2 +
 drivers/gpu/drm/xe/regs/xe_hw_error_regs.h |  7 ++-
 drivers/gpu/drm/xe/xe_device_types.h       |  3 +
 drivers/gpu/drm/xe/xe_hw_error.c           | 65 +++++++++++++++++++++-
 4 files changed, 75 insertions(+), 2 deletions(-)

diff --git a/drivers/gpu/drm/xe/regs/xe_gsc_regs.h b/drivers/gpu/drm/xe/regs/xe_gsc_regs.h
index 7702364b65f1..fcb6003f3226 100644
--- a/drivers/gpu/drm/xe/regs/xe_gsc_regs.h
+++ b/drivers/gpu/drm/xe/regs/xe_gsc_regs.h
@@ -13,6 +13,8 @@
 
 /* Definitions of GSC H/W registers, bits, etc */
 
+#define BMG_GSC_HECI1_BASE	0x373000
+
 #define MTL_GSC_HECI1_BASE	0x00116000
 #define MTL_GSC_HECI2_BASE	0x00117000
 
diff --git a/drivers/gpu/drm/xe/regs/xe_hw_error_regs.h b/drivers/gpu/drm/xe/regs/xe_hw_error_regs.h
index ed9b81fb28a0..c146b9ef44eb 100644
--- a/drivers/gpu/drm/xe/regs/xe_hw_error_regs.h
+++ b/drivers/gpu/drm/xe/regs/xe_hw_error_regs.h
@@ -6,10 +6,15 @@
 #ifndef _XE_HW_ERROR_REGS_H_
 #define _XE_HW_ERROR_REGS_H_
 
+#define HEC_UNCORR_ERR_STATUS(base)                    XE_REG((base) + 0x118)
+#define    UNCORR_FW_REPORTED_ERR                      BIT(6)
+
+#define HEC_UNCORR_FW_ERR_DW0(base)                    XE_REG((base) + 0x124)
+
 #define DEV_ERR_STAT_NONFATAL			0x100178
 #define DEV_ERR_STAT_CORRECTABLE		0x10017c
 #define DEV_ERR_STAT_REG(x)			XE_REG(_PICK_EVEN((x), \
 								  DEV_ERR_STAT_CORRECTABLE, \
 								  DEV_ERR_STAT_NONFATAL))
-
+#define   XE_CSC_ERROR				BIT(17)
 #endif
diff --git a/drivers/gpu/drm/xe/xe_device_types.h b/drivers/gpu/drm/xe/xe_device_types.h
index fb3617956d63..1325ae917c99 100644
--- a/drivers/gpu/drm/xe/xe_device_types.h
+++ b/drivers/gpu/drm/xe/xe_device_types.h
@@ -239,6 +239,9 @@ struct xe_tile {
 	/** @memirq: Memory Based Interrupts. */
 	struct xe_memirq memirq;
 
+	/** @csc_hw_error_work: worker to report CSC HW errors */
+	struct work_struct csc_hw_error_work;
+
 	/** @pcode: tile's PCODE */
 	struct {
 		/** @pcode.lock: protecting tile's PCODE mailbox data */
diff --git a/drivers/gpu/drm/xe/xe_hw_error.c b/drivers/gpu/drm/xe/xe_hw_error.c
index 0f2590839900..ad1e244ea612 100644
--- a/drivers/gpu/drm/xe/xe_hw_error.c
+++ b/drivers/gpu/drm/xe/xe_hw_error.c
@@ -3,6 +3,7 @@
  * Copyright © 2025 Intel Corporation
  */
 
+#include "regs/xe_gsc_regs.h"
 #include "regs/xe_hw_error_regs.h"
 #include "regs/xe_irq_regs.h"
 
@@ -10,6 +11,8 @@
 #include "xe_hw_error.h"
 #include "xe_mmio.h"
 
+#define  HEC_UNCORR_FW_ERR_BITS 4
+
 /* Error categories reported by hardware */
 enum hardware_error {
 	HARDWARE_ERROR_CORRECTABLE = 0,
@@ -18,6 +21,13 @@ enum hardware_error {
 	HARDWARE_ERROR_MAX,
 };
 
+static const char * const hec_uncorrected_fw_errors[] = {
+	"Fatal",
+	"CSE Disabled",
+	"FD Corruption",
+	"Data Corruption"
+};
+
 static const char *hw_error_to_str(const enum hardware_error hw_err)
 {
 	switch (hw_err) {
@@ -32,6 +42,54 @@ static const char *hw_error_to_str(const enum hardware_error hw_err)
 	}
 }
 
+static void csc_hw_error_work(struct work_struct *work)
+{
+	struct xe_tile *tile = container_of(work, typeof(*tile), csc_hw_error_work);
+	struct xe_device *xe = tile_to_xe(tile);
+
+	xe_device_set_wedged_method(xe, DRM_WEDGE_RECOVERY_FW_FLASH);
+	xe_device_declare_wedged(xe);
+}
+
+static void csc_hw_error_handler(struct xe_tile *tile, const enum hardware_error hw_err)
+{
+	const char *hw_err_str = hw_error_to_str(hw_err);
+	struct xe_device *xe = tile_to_xe(tile);
+	struct xe_mmio *mmio = &tile->mmio;
+	u32 base, err_bit, err_src;
+	unsigned long fw_err;
+
+	if (xe->info.platform != XE_BATTLEMAGE)
+		return;
+
+	/* Not supported in BMG */
+	if (hw_err == HARDWARE_ERROR_CORRECTABLE)
+		return;
+
+	base = BMG_GSC_HECI1_BASE;
+	lockdep_assert_held(&xe->irq.lock);
+	err_src = xe_mmio_read32(mmio, HEC_UNCORR_ERR_STATUS(base));
+	if (!err_src) {
+		drm_err_ratelimited(&xe->drm, HW_ERR "Tile%d reported HEC_ERR_STATUS_%s blank\n",
+				    tile->id, hw_err_str);
+		return;
+	}
+
+	if (err_src & UNCORR_FW_REPORTED_ERR) {
+		fw_err = xe_mmio_read32(mmio, HEC_UNCORR_FW_ERR_DW0(base));
+		for_each_set_bit(err_bit, &fw_err, HEC_UNCORR_FW_ERR_BITS) {
+			drm_err_ratelimited(&xe->drm, HW_ERR
+					    "%s: HEC Uncorrected FW %s error reported, bit[%d] is set\n",
+					     hw_err_str, hec_uncorrected_fw_errors[err_bit],
+					     err_bit);
+
+			schedule_work(&tile->csc_hw_error_work);
+		}
+	}
+
+	xe_mmio_write32(mmio, HEC_UNCORR_ERR_STATUS(base), err_src);
+}
+
 static void hw_error_source_handler(struct xe_tile *tile, const enum hardware_error hw_err)
 {
 	const char *hw_err_str = hw_error_to_str(hw_err);
@@ -50,7 +108,8 @@ static void hw_error_source_handler(struct xe_tile *tile, const enum hardware_er
 		goto unlock;
 	}
 
-	/* TODO: Process errrors per source */
+	if (err_src & XE_CSC_ERROR)
+		csc_hw_error_handler(tile, hw_err);
 
 	xe_mmio_write32(&tile->mmio, DEV_ERR_STAT_REG(hw_err), err_src);
 
@@ -101,8 +160,12 @@ static void process_hw_errors(struct xe_device *xe)
  */
 void xe_hw_error_init(struct xe_device *xe)
 {
+	struct xe_tile *tile = xe_device_get_root_tile(xe);
+
 	if (!IS_DGFX(xe) || IS_SRIOV_VF(xe))
 		return;
 
+	INIT_WORK(&tile->csc_hw_error_work, csc_hw_error_work);
+
 	process_hw_errors(xe);
 }
-- 
2.47.1
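
The decode step in csc_hw_error_handler() walks the low
HEC_UNCORR_FW_ERR_BITS bits of the firmware error register and maps each
set bit to a name. A standalone sketch of that mapping is shown below. The
name table and the bit count are taken from the diff; decode_fw_errors()
and its output-array interface are hypothetical and exist only to make the
logic testable outside the kernel.

```c
#include <stdint.h>

#define HEC_UNCORR_FW_ERR_BITS 4

static const char * const hec_uncorrected_fw_errors[HEC_UNCORR_FW_ERR_BITS] = {
	"Fatal",
	"CSE Disabled",
	"FD Corruption",
	"Data Corruption",
};

/*
 * Collect the names of all firmware errors flagged in fw_err.
 * Only the low HEC_UNCORR_FW_ERR_BITS bits are examined, so the table
 * is never indexed out of bounds. Returns the number of names stored.
 */
unsigned int decode_fw_errors(uint32_t fw_err,
			      const char *out[HEC_UNCORR_FW_ERR_BITS])
{
	unsigned int bit, n = 0;

	for (bit = 0; bit < HEC_UNCORR_FW_ERR_BITS; bit++)
		if (fw_err & (UINT32_C(1) << bit))
			out[n++] = hec_uncorrected_fw_errors[bit];
	return n;
}
```

A non-zero return value here corresponds to the case where the patch
schedules the wedge worker. Bits above the table size are ignored rather
than reported.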



* Re: [PATCH 4/4] drm/xe/xe_hw_error: Handle CSC Firmware reported Hardware errors
  2025-06-03  8:14 ` [PATCH 4/4] drm/xe/xe_hw_error: Handle CSC Firmware reported Hardware errors Riana Tauro
@ 2025-06-03  9:31   ` Ghimiray, Himal Prasad
  2025-06-03  9:53     ` Riana Tauro
  0 siblings, 1 reply; 12+ messages in thread
From: Ghimiray, Himal Prasad @ 2025-06-03  9:31 UTC (permalink / raw)
  To: Riana Tauro, intel-xe, dri-devel
  Cc: anshuman.gupta, rodrigo.vivi, lucas.demarchi, aravind.iddamsetty,
	raag.jadav, frank.scarbrough



On 03-06-2025 13:44, Riana Tauro wrote:
> Add support to handle CSC firmware reported errors. When CSC firmware
> errors are encountered, an error interrupt is received by the GFX device
> as an MSI interrupt.
> 
> The Device Source control registers indicate the source of the error as
> CSC. The HEC error status register indicates that the error is firmware
> reported. Depending on the type of error, the error cause is written to
> the HEC Firmware error register.
> 
> On encountering such CSC firmware errors, the graphics device is
> non-recoverable from driver context. The only way to recover from these
> errors is a firmware flash. The device is then wedged and userspace is
> notified with a drm uevent.
> 
> Signed-off-by: Riana Tauro <riana.tauro@intel.com>
> ---
>   drivers/gpu/drm/xe/regs/xe_gsc_regs.h      |  2 +
>   drivers/gpu/drm/xe/regs/xe_hw_error_regs.h |  7 ++-
>   drivers/gpu/drm/xe/xe_device_types.h       |  3 +
>   drivers/gpu/drm/xe/xe_hw_error.c           | 65 +++++++++++++++++++++-
>   4 files changed, 75 insertions(+), 2 deletions(-)
> 
> diff --git a/drivers/gpu/drm/xe/regs/xe_gsc_regs.h b/drivers/gpu/drm/xe/regs/xe_gsc_regs.h
> index 7702364b65f1..fcb6003f3226 100644
> --- a/drivers/gpu/drm/xe/regs/xe_gsc_regs.h
> +++ b/drivers/gpu/drm/xe/regs/xe_gsc_regs.h
> @@ -13,6 +13,8 @@
>   
>   /* Definitions of GSC H/W registers, bits, etc */
>   
> +#define BMG_GSC_HECI1_BASE	0x373000
> +
>   #define MTL_GSC_HECI1_BASE	0x00116000
>   #define MTL_GSC_HECI2_BASE	0x00117000
>   
> diff --git a/drivers/gpu/drm/xe/regs/xe_hw_error_regs.h b/drivers/gpu/drm/xe/regs/xe_hw_error_regs.h
> index ed9b81fb28a0..c146b9ef44eb 100644
> --- a/drivers/gpu/drm/xe/regs/xe_hw_error_regs.h
> +++ b/drivers/gpu/drm/xe/regs/xe_hw_error_regs.h
> @@ -6,10 +6,15 @@
>   #ifndef _XE_HW_ERROR_REGS_H_
>   #define _XE_HW_ERROR_REGS_H_
>   
> +#define HEC_UNCORR_ERR_STATUS(base)                    XE_REG((base) + 0x118)
> +#define    UNCORR_FW_REPORTED_ERR                      BIT(6)
> +
> +#define HEC_UNCORR_FW_ERR_DW0(base)                    XE_REG((base) + 0x124)
> +
>   #define DEV_ERR_STAT_NONFATAL			0x100178
>   #define DEV_ERR_STAT_CORRECTABLE		0x10017c
>   #define DEV_ERR_STAT_REG(x)			XE_REG(_PICK_EVEN((x), \
>   								  DEV_ERR_STAT_CORRECTABLE, \
>   								  DEV_ERR_STAT_NONFATAL))
> -
> +#define   XE_CSC_ERROR				BIT(17)
>   #endif
> diff --git a/drivers/gpu/drm/xe/xe_device_types.h b/drivers/gpu/drm/xe/xe_device_types.h
> index fb3617956d63..1325ae917c99 100644
> --- a/drivers/gpu/drm/xe/xe_device_types.h
> +++ b/drivers/gpu/drm/xe/xe_device_types.h
> @@ -239,6 +239,9 @@ struct xe_tile {
>   	/** @memirq: Memory Based Interrupts. */
>   	struct xe_memirq memirq;
>   
> +	/** @csc_hw_error_work: worker to report CSC HW errors */
> +	struct work_struct csc_hw_error_work;
> +
>   	/** @pcode: tile's PCODE */
>   	struct {
>   		/** @pcode.lock: protecting tile's PCODE mailbox data */
> diff --git a/drivers/gpu/drm/xe/xe_hw_error.c b/drivers/gpu/drm/xe/xe_hw_error.c
> index 0f2590839900..ad1e244ea612 100644
> --- a/drivers/gpu/drm/xe/xe_hw_error.c
> +++ b/drivers/gpu/drm/xe/xe_hw_error.c
> @@ -3,6 +3,7 @@
>    * Copyright © 2025 Intel Corporation
>    */
>   
> +#include "regs/xe_gsc_regs.h"
>   #include "regs/xe_hw_error_regs.h"
>   #include "regs/xe_irq_regs.h"
>   
> @@ -10,6 +11,8 @@
>   #include "xe_hw_error.h"
>   #include "xe_mmio.h"
>   
> +#define  HEC_UNCORR_FW_ERR_BITS 4
> +
>   /* Error categories reported by hardware */
>   enum hardware_error {
>   	HARDWARE_ERROR_CORRECTABLE = 0,
> @@ -18,6 +21,13 @@ enum hardware_error {
>   	HARDWARE_ERROR_MAX,
>   };
>   
> +static const char * const hec_uncorrected_fw_errors[] = {
> +	"Fatal",
> +	"CSE Disabled",
> +	"FD Corruption",
> +	"Data Corruption"
> +};
> +
>   static const char *hw_error_to_str(const enum hardware_error hw_err)
>   {
>   	switch (hw_err) {
> @@ -32,6 +42,54 @@ static const char *hw_error_to_str(const enum hardware_error hw_err)
>   	}
>   }
>   
> +static void csc_hw_error_work(struct work_struct *work)
> +{
> +	struct xe_tile *tile = container_of(work, typeof(*tile), csc_hw_error_work);
> +	struct xe_device *xe = tile_to_xe(tile);
> +
> +	xe_device_set_wedged_method(xe, DRM_WEDGE_RECOVERY_FW_FLASH);
> +	xe_device_declare_wedged(xe);
> +}

Is there a specific need for a worker to set wedging? Would making it a
synchronous call have any significant impact?


> +
> +static void csc_hw_error_handler(struct xe_tile *tile, const enum hardware_error hw_err)
> +{
> +	const char *hw_err_str = hw_error_to_str(hw_err);
> +	struct xe_device *xe = tile_to_xe(tile);
> +	struct xe_mmio *mmio = &tile->mmio;
> +	u32 base, err_bit, err_src;
> +	unsigned long fw_err;
> +
> +	if (xe->info.platform != XE_BATTLEMAGE)
> +		return;

Why the platform-specific check here? I remember seeing a similar error on
PVC (reported by the root tile).

> +
> +	/* Not supported in BMG */
> +	if (hw_err == HARDWARE_ERROR_CORRECTABLE)
> +		return;
> +
> +	base = BMG_GSC_HECI1_BASE;
> +	lockdep_assert_held(&xe->irq.lock);
> +	err_src = xe_mmio_read32(mmio, HEC_UNCORR_ERR_STATUS(base));
> +	if (!err_src) {
> +		drm_err_ratelimited(&xe->drm, HW_ERR "Tile%d reported HEC_ERR_STATUS_%s blank\n",
> +				    tile->id, hw_err_str);
> +		return;
> +	}
> +
> +	if (err_src & UNCORR_FW_REPORTED_ERR) {
> +		fw_err = xe_mmio_read32(mmio, HEC_UNCORR_FW_ERR_DW0(base));
> +		for_each_set_bit(err_bit, &fw_err, HEC_UNCORR_FW_ERR_BITS) {
> +			drm_err_ratelimited(&xe->drm, HW_ERR
> +					    "%s: HEC Uncorrected FW %s error reported, bit[%d] is set\n",
> +					     hw_err_str, hec_uncorrected_fw_errors[err_bit],
> +					     err_bit);
> +
> +			schedule_work(&tile->csc_hw_error_work);
> +		}
> +	}
> +
> +	xe_mmio_write32(mmio, HEC_UNCORR_ERR_STATUS(base), err_src);
> +}
> +
>   static void hw_error_source_handler(struct xe_tile *tile, const enum hardware_error hw_err)
>   {
>   	const char *hw_err_str = hw_error_to_str(hw_err);
> @@ -50,7 +108,8 @@ static void hw_error_source_handler(struct xe_tile *tile, const enum hardware_er
>   		goto unlock;
>   	}
>   
> -	/* TODO: Process errrors per source */
> +	if (err_src & XE_CSC_ERROR)
> +		csc_hw_error_handler(tile, hw_err);
>   
>   	xe_mmio_write32(&tile->mmio, DEV_ERR_STAT_REG(hw_err), err_src);
>   
> @@ -101,8 +160,12 @@ static void process_hw_errors(struct xe_device *xe)
>    */
>   void xe_hw_error_init(struct xe_device *xe)
>   {
> +	struct xe_tile *tile = xe_device_get_root_tile(xe);
> +
>   	if (!IS_DGFX(xe) || IS_SRIOV_VF(xe))
>   		return;
>   
> +	INIT_WORK(&tile->csc_hw_error_work, csc_hw_error_work);
> +
>   	process_hw_errors(xe);
>   }



* Re: [PATCH 4/4] drm/xe/xe_hw_error: Handle CSC Firmware reported Hardware errors
  2025-06-03  9:31   ` Ghimiray, Himal Prasad
@ 2025-06-03  9:53     ` Riana Tauro
  0 siblings, 0 replies; 12+ messages in thread
From: Riana Tauro @ 2025-06-03  9:53 UTC (permalink / raw)
  To: Ghimiray, Himal Prasad, intel-xe, dri-devel
  Cc: anshuman.gupta, rodrigo.vivi, lucas.demarchi, aravind.iddamsetty,
	raag.jadav, frank.scarbrough

Hi Himal

On 6/3/2025 3:01 PM, Ghimiray, Himal Prasad wrote:
> 
> 
> On 03-06-2025 13:44, Riana Tauro wrote:
>> Add support to handle CSC firmware reported errors. When CSC firmware
>> errors are encountered, an error interrupt is received by the GFX device
>> as an MSI interrupt.
>>
>> The Device Source control registers indicate the source of the error as
>> CSC. The HEC error status register indicates that the error is firmware
>> reported. Depending on the type of error, the error cause is written to
>> the HEC Firmware error register.
>>
>> On encountering such CSC firmware errors, the graphics device is
>> non-recoverable from driver context. The only way to recover from these
>> errors is a firmware flash. The device is then wedged and userspace is
>> notified with a drm uevent.
>>
>> Signed-off-by: Riana Tauro <riana.tauro@intel.com>
>> ---
>>   drivers/gpu/drm/xe/regs/xe_gsc_regs.h      |  2 +
>>   drivers/gpu/drm/xe/regs/xe_hw_error_regs.h |  7 ++-
>>   drivers/gpu/drm/xe/xe_device_types.h       |  3 +
>>   drivers/gpu/drm/xe/xe_hw_error.c           | 65 +++++++++++++++++++++-
>>   4 files changed, 75 insertions(+), 2 deletions(-)
>>
>> diff --git a/drivers/gpu/drm/xe/regs/xe_gsc_regs.h b/drivers/gpu/drm/ 
>> xe/regs/xe_gsc_regs.h
>> index 7702364b65f1..fcb6003f3226 100644
>> --- a/drivers/gpu/drm/xe/regs/xe_gsc_regs.h
>> +++ b/drivers/gpu/drm/xe/regs/xe_gsc_regs.h
>> @@ -13,6 +13,8 @@
>>   /* Definitions of GSC H/W registers, bits, etc */
>> +#define BMG_GSC_HECI1_BASE    0x373000
>> +
>>   #define MTL_GSC_HECI1_BASE    0x00116000
>>   #define MTL_GSC_HECI2_BASE    0x00117000
>> diff --git a/drivers/gpu/drm/xe/regs/xe_hw_error_regs.h b/drivers/gpu/ 
>> drm/xe/regs/xe_hw_error_regs.h
>> index ed9b81fb28a0..c146b9ef44eb 100644
>> --- a/drivers/gpu/drm/xe/regs/xe_hw_error_regs.h
>> +++ b/drivers/gpu/drm/xe/regs/xe_hw_error_regs.h
>> @@ -6,10 +6,15 @@
>>   #ifndef _XE_HW_ERROR_REGS_H_
>>   #define _XE_HW_ERROR_REGS_H_
>> +#define HEC_UNCORR_ERR_STATUS(base)                    XE_REG((base) 
>> + 0x118)
>> +#define    UNCORR_FW_REPORTED_ERR                      BIT(6)
>> +
>> +#define HEC_UNCORR_FW_ERR_DW0(base)                    XE_REG((base) 
>> + 0x124)
>> +
>>   #define DEV_ERR_STAT_NONFATAL            0x100178
>>   #define DEV_ERR_STAT_CORRECTABLE        0x10017c
>>   #define DEV_ERR_STAT_REG(x)            XE_REG(_PICK_EVEN((x), \
>>                                     DEV_ERR_STAT_CORRECTABLE, \
>>                                     DEV_ERR_STAT_NONFATAL))
>> -
>> +#define   XE_CSC_ERROR                BIT(17)
>>   #endif
>> diff --git a/drivers/gpu/drm/xe/xe_device_types.h b/drivers/gpu/drm/ 
>> xe/xe_device_types.h
>> index fb3617956d63..1325ae917c99 100644
>> --- a/drivers/gpu/drm/xe/xe_device_types.h
>> +++ b/drivers/gpu/drm/xe/xe_device_types.h
>> @@ -239,6 +239,9 @@ struct xe_tile {
>>       /** @memirq: Memory Based Interrupts. */
>>       struct xe_memirq memirq;
>> +    /** @csc_hw_error_work: worker to report CSC HW errors */
>> +    struct work_struct csc_hw_error_work;
>> +
>>       /** @pcode: tile's PCODE */
>>       struct {
>>           /** @pcode.lock: protecting tile's PCODE mailbox data */
>> diff --git a/drivers/gpu/drm/xe/xe_hw_error.c b/drivers/gpu/drm/xe/ 
>> xe_hw_error.c
>> index 0f2590839900..ad1e244ea612 100644
>> --- a/drivers/gpu/drm/xe/xe_hw_error.c
>> +++ b/drivers/gpu/drm/xe/xe_hw_error.c
>> @@ -3,6 +3,7 @@
>>    * Copyright © 2025 Intel Corporation
>>    */
>> +#include "regs/xe_gsc_regs.h"
>>   #include "regs/xe_hw_error_regs.h"
>>   #include "regs/xe_irq_regs.h"
>> @@ -10,6 +11,8 @@
>>   #include "xe_hw_error.h"
>>   #include "xe_mmio.h"
>> +#define  HEC_UNCORR_FW_ERR_BITS 4
>> +
>>   /* Error categories reported by hardware */
>>   enum hardware_error {
>>       HARDWARE_ERROR_CORRECTABLE = 0,
>> @@ -18,6 +21,13 @@ enum hardware_error {
>>       HARDWARE_ERROR_MAX,
>>   };
>> +static const char * const hec_uncorrected_fw_errors[] = {
>> +    "Fatal",
>> +    "CSE Disabled",
>> +    "FD Corruption",
>> +    "Data Corruption"
>> +};
>> +
>>   static const char *hw_error_to_str(const enum hardware_error hw_err)
>>   {
>>       switch (hw_err) {
>> @@ -32,6 +42,54 @@ static const char *hw_error_to_str(const enum 
>> hardware_error hw_err)
>>       }
>>   }
>> +static void csc_hw_error_work(struct work_struct *work)
>> +{
>> +    struct xe_tile *tile = container_of(work, typeof(*tile), 
>> csc_hw_error_work);
>> +    struct xe_device *xe = tile_to_xe(tile);
>> +
>> +    xe_device_set_wedged_method(xe, DRM_WEDGE_RECOVERY_FW_FLASH);
>> +    xe_device_declare_wedged(xe);
>> +}
> 
> Any specific need for worker to set wedging any significant impact on 
> making it synchronous call?

I tried making it synchronous, but a sleeping function in that path caused
an error, which is why I moved it to a workqueue.
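
To illustrate the constraint, here is a minimal user-space sketch, not the
xe code. The handler stands in for code running under the IRQ lock, so it
only marks work as pending; a separate worker later performs the step that
may sleep. All names (fake_tile, irq_handler, run_pending_work) are
hypothetical. In the patch the corresponding pieces are INIT_WORK(),
schedule_work() and csc_hw_error_work().

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical stand-in for the per-tile state; not the real struct xe_tile. */
struct fake_tile {
	bool work_pending;	/* set by the handler, cleared by the worker */
	bool wedged;		/* result of the deferred, possibly sleeping, step */
	int wedge_calls;	/* how many times the deferred step actually ran */
};

/*
 * Stand-in for the interrupt-side handler: it must not sleep, so it only
 * records that work is needed. Returns true if this call queued the work,
 * false if it was already pending (the same contract as schedule_work()).
 */
static bool irq_handler(struct fake_tile *tile)
{
	assert(tile != NULL);

	if (tile->work_pending)
		return false;
	tile->work_pending = true;
	return true;
}

/*
 * Stand-in for the worker: runs later, in a context where sleeping is
 * allowed, and performs the deferred step once per queued request.
 */
static void run_pending_work(struct fake_tile *tile)
{
	assert(tile != NULL);

	if (!tile->work_pending)
		return;
	tile->work_pending = false;
	tile->wedged = true;
	tile->wedge_calls++;
}
```

In this sketch, calling irq_handler() twice before the worker runs still
results in a single deferred step.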

> 
> 
>> +
>> +static void csc_hw_error_handler(struct xe_tile *tile, const enum 
>> hardware_error hw_err)
>> +{
>> +    const char *hw_err_str = hw_error_to_str(hw_err);
>> +    struct xe_device *xe = tile_to_xe(tile);
>> +    struct xe_mmio *mmio = &tile->mmio;
>> +    u32 base, err_bit, err_src;
>> +    unsigned long fw_err;
>> +
>> +    if (xe->info.platform != XE_BATTLEMAGE)
>> +        return;
> 
> why platform specific check here ? I remember having similar error on 
> PVC (reported by root tile).

No, PVC had the GSC error bit set, whereas this is only applicable to CSC
errors. On encountering such errors, the device is wedged and a uevent needs
to be sent for a firmware update, which was also not applicable to PVC.

Hence the check
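
For readers following the handler under discussion: after the platform
check, it walks the low four bits of the HEC firmware error register and
logs one name per set bit. A minimal user-space sketch of that decoding
step follows. The helper name decode_fw_err() and the FW_ERR_BITS macro are
hypothetical; the strings and their order are copied from
hec_uncorrected_fw_errors[] in the patch.

```c
#include <assert.h>
#include <stddef.h>

#define FW_ERR_BITS 4	/* mirrors HEC_UNCORR_FW_ERR_BITS in the patch */

/* Same strings and order as hec_uncorrected_fw_errors[] in the patch. */
static const char * const fw_err_names[FW_ERR_BITS] = {
	"Fatal",
	"CSE Disabled",
	"FD Corruption",
	"Data Corruption",
};

/*
 * Collect the names of the set bits among the low FW_ERR_BITS bits of
 * fw_err into out[], lowest bit first. Higher bits are ignored, as
 * for_each_set_bit() does when given a size of 4. Returns the count.
 */
static size_t decode_fw_err(unsigned long fw_err, const char *out[FW_ERR_BITS])
{
	size_t n = 0;

	assert(out != NULL);

	for (unsigned int bit = 0; bit < FW_ERR_BITS; bit++) {
		if (fw_err & (1UL << bit))
			out[n++] = fw_err_names[bit];
	}
	return n;
}
```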

Thanks
Riana

>> +
>> +    /* Not supported in BMG */
>> +    if (hw_err == HARDWARE_ERROR_CORRECTABLE)
>> +        return;
>> +
>> +    base = BMG_GSC_HECI1_BASE;
>> +    lockdep_assert_held(&xe->irq.lock);
>> +    err_src = xe_mmio_read32(mmio, HEC_UNCORR_ERR_STATUS(base));
>> +    if (!err_src) {
>> +        drm_err_ratelimited(&xe->drm, HW_ERR "Tile%d reported 
>> HEC_ERR_STATUS_%s blank\n",
>> +                    tile->id, hw_err_str);
>> +        return;
>> +    }
>> +
>> +    if (err_src & UNCORR_FW_REPORTED_ERR) {
>> +        fw_err = xe_mmio_read32(mmio, HEC_UNCORR_FW_ERR_DW0(base));
>> +        for_each_set_bit(err_bit, &fw_err, HEC_UNCORR_FW_ERR_BITS) {
>> +            drm_err_ratelimited(&xe->drm, HW_ERR
>> +                        "%s: HEC Uncorrected FW %s error reported, 
>> bit[%d] is set\n",
>> +                         hw_err_str, hec_uncorrected_fw_errors[err_bit],
>> +                         err_bit);
>> +
>> +            schedule_work(&tile->csc_hw_error_work);
>> +        }
>> +    }
>> +
>> +    xe_mmio_write32(mmio, HEC_UNCORR_ERR_STATUS(base), err_src);
>> +}
>> +
>>   static void hw_error_source_handler(struct xe_tile *tile, const enum 
>> hardware_error hw_err)
>>   {
>>       const char *hw_err_str = hw_error_to_str(hw_err);
>> @@ -50,7 +108,8 @@ static void hw_error_source_handler(struct xe_tile 
>> *tile, const enum hardware_er
>>           goto unlock;
>>       }
>> -    /* TODO: Process errrors per source */
>> +    if (err_src & XE_CSC_ERROR)
>> +        csc_hw_error_handler(tile, hw_err);
>>       xe_mmio_write32(&tile->mmio, DEV_ERR_STAT_REG(hw_err), err_src);
>> @@ -101,8 +160,12 @@ static void process_hw_errors(struct xe_device *xe)
>>    */
>>   void xe_hw_error_init(struct xe_device *xe)
>>   {
>> +    struct xe_tile *tile = xe_device_get_root_tile(xe);
>> +
>>       if (!IS_DGFX(xe) || IS_SRIOV_VF(xe))
>>           return;
>> +    INIT_WORK(&tile->csc_hw_error_work, csc_hw_error_work);
>> +
>>       process_hw_errors(xe);
>>   }
> 


* Re: [PATCH 1/4] drm: Add a firmware flash method to device wedged uevent
  2025-06-03  8:13 ` [PATCH 1/4] drm: Add a firmware flash method to device wedged uevent Riana Tauro
@ 2025-06-04 10:43   ` Raag Jadav
  2025-06-05 11:24     ` Riana Tauro
  0 siblings, 1 reply; 12+ messages in thread
From: Raag Jadav @ 2025-06-04 10:43 UTC (permalink / raw)
  To: Riana Tauro
  Cc: intel-xe, dri-devel, anshuman.gupta, rodrigo.vivi, lucas.demarchi,
	aravind.iddamsetty, himal.prasad.ghimiray, frank.scarbrough

On Tue, Jun 03, 2025 at 01:43:57PM +0530, Riana Tauro wrote:
> A device is declared wedged when it is non-recoverable from
> the driver context. Some firmware errors can also cause
> the device to enter this state and the only method to recover
> from this would be to do a firmware flash
> 
> Signed-off-by: Riana Tauro <riana.tauro@intel.com>
> ---
>  Documentation/gpu/drm-uapi.rst | 6 +++---
>  drivers/gpu/drm/drm_drv.c      | 2 ++
>  include/drm/drm_device.h       | 1 +
>  3 files changed, 6 insertions(+), 3 deletions(-)
> 
> diff --git a/Documentation/gpu/drm-uapi.rst b/Documentation/gpu/drm-uapi.rst
> index 4863a4deb0ee..524224afb09f 100644
> --- a/Documentation/gpu/drm-uapi.rst
> +++ b/Documentation/gpu/drm-uapi.rst
> @@ -422,9 +422,8 @@ Current implementation defines three recovery methods, out of which, drivers
>  can use any one, multiple or none. Method(s) of choice will be sent in the
>  uevent environment as ``WEDGED=<method1>[,..,<methodN>]`` in order of less to
>  more side-effects. If driver is unsure about recovery or method is unknown
> -(like soft/hard system reboot, firmware flashing, physical device replacement
> -or any other procedure which can't be attempted on the fly), ``WEDGED=unknown``
> -will be sent instead.
> +(like soft/hard system reboot, physical device replacement or any other procedure
> +which can't be attempted on the fly), ``WEDGED=unknown`` will be sent instead.
>  
>  Userspace consumers can parse this event and attempt recovery as per the
>  following expectations.
> @@ -435,6 +434,7 @@ following expectations.
>      none            optional telemetry collection
>      rebind          unbind + bind driver
>      bus-reset       unbind + bus reset/re-enumeration + bind
> +    firmware-flash  unbind + firmware flash + bind

Can you guarantee this to be generic for all drivers?

Raag


* Re: [PATCH 1/4] drm: Add a firmware flash method to device wedged uevent
  2025-06-04 10:43   ` Raag Jadav
@ 2025-06-05 11:24     ` Riana Tauro
  2025-06-16 20:39       ` Rodrigo Vivi
  0 siblings, 1 reply; 12+ messages in thread
From: Riana Tauro @ 2025-06-05 11:24 UTC (permalink / raw)
  To: Raag Jadav
  Cc: intel-xe, dri-devel, anshuman.gupta, rodrigo.vivi, lucas.demarchi,
	aravind.iddamsetty, himal.prasad.ghimiray, frank.scarbrough,
	andrealmeid, christian.koenig


Hi Raag

On 6/4/2025 4:13 PM, Raag Jadav wrote:
> On Tue, Jun 03, 2025 at 01:43:57PM +0530, Riana Tauro wrote:
>> A device is declared wedged when it is non-recoverable from
>> the driver context. Some firmware errors can also cause
>> the device to enter this state and the only method to recover
>> from this would be to do a firmware flash
>>
>> Signed-off-by: Riana Tauro <riana.tauro@intel.com>
>> ---
>>   Documentation/gpu/drm-uapi.rst | 6 +++---
>>   drivers/gpu/drm/drm_drv.c      | 2 ++
>>   include/drm/drm_device.h       | 1 +
>>   3 files changed, 6 insertions(+), 3 deletions(-)
>>
>> diff --git a/Documentation/gpu/drm-uapi.rst b/Documentation/gpu/drm-uapi.rst
>> index 4863a4deb0ee..524224afb09f 100644
>> --- a/Documentation/gpu/drm-uapi.rst
>> +++ b/Documentation/gpu/drm-uapi.rst
>> @@ -422,9 +422,8 @@ Current implementation defines three recovery methods, out of which, drivers
>>   can use any one, multiple or none. Method(s) of choice will be sent in the
>>   uevent environment as ``WEDGED=<method1>[,..,<methodN>]`` in order of less to
>>   more side-effects. If driver is unsure about recovery or method is unknown
>> -(like soft/hard system reboot, firmware flashing, physical device replacement
>> -or any other procedure which can't be attempted on the fly), ``WEDGED=unknown``
>> -will be sent instead.
>> +(like soft/hard system reboot, physical device replacement or any other procedure
>> +which can't be attempted on the fly), ``WEDGED=unknown`` will be sent instead.
>>   
>>   Userspace consumers can parse this event and attempt recovery as per the
>>   following expectations.
>> @@ -435,6 +434,7 @@ following expectations.
>>       none            optional telemetry collection
>>       rebind          unbind + bind driver
>>       bus-reset       unbind + bus reset/re-enumeration + bind
>> +    firmware-flash  unbind + firmware flash + bind
> 
> Can you guarantee this to be generic for all drivers?


Firmware flash was listed in the document as one of the "unknown" methods. So
if an error requires a firmware flash to recover, naming it as the recovery
method should be okay.

I wanted to get some comments on the unbind/bind part. If it is not required,
I will remove it.

Adding reviewers for input.

Thanks
Riana


> 
> Raag



* Re: [PATCH 2/4] drm/xe: Add a helper function to set recovery method
  2025-06-03  8:13 ` [PATCH 2/4] drm/xe: Add a helper function to set recovery method Riana Tauro
@ 2025-06-06 15:12   ` Raag Jadav
  2025-06-19  7:26     ` Riana Tauro
  0 siblings, 1 reply; 12+ messages in thread
From: Raag Jadav @ 2025-06-06 15:12 UTC (permalink / raw)
  To: Riana Tauro
  Cc: intel-xe, dri-devel, anshuman.gupta, rodrigo.vivi, lucas.demarchi,
	aravind.iddamsetty, himal.prasad.ghimiray, frank.scarbrough

On Tue, Jun 03, 2025 at 01:43:58PM +0530, Riana Tauro wrote:
> Add a helper function to set recovery method. The recovery
> method has to be set before declaring the device wedged and sending the
> drm wedged uevent. If no method is set, default unbind/re-bind method
> will be set
> 
> Signed-off-by: Riana Tauro <riana.tauro@intel.com>
> ---
>  drivers/gpu/drm/xe/xe_device.c       | 30 +++++++++++++++++++++-------
>  drivers/gpu/drm/xe/xe_device.h       |  1 +
>  drivers/gpu/drm/xe/xe_device_types.h |  2 ++
>  3 files changed, 26 insertions(+), 7 deletions(-)
> 
> diff --git a/drivers/gpu/drm/xe/xe_device.c b/drivers/gpu/drm/xe/xe_device.c
> index 660b0c5126dc..3fd604ebdc6e 100644
> --- a/drivers/gpu/drm/xe/xe_device.c
> +++ b/drivers/gpu/drm/xe/xe_device.c
> @@ -1120,16 +1120,28 @@ static void xe_device_wedged_fini(struct drm_device *drm, void *arg)
>  	xe_pm_runtime_put(xe);
>  }
>  
> +/**
> + * xe_device_set_wedged_method - Set wedged recovery method
> + * @xe: xe device instance

Missing @method

> + *
> + * Set wedged recovery method to be sent using drm wedged uevent.
> + */
> +void xe_device_set_wedged_method(struct xe_device *xe, unsigned long method)
> +{
> +	xe->wedged.method = method;
> +}
> +
>  /**
>   * xe_device_declare_wedged - Declare device wedged
>   * @xe: xe device instance
>   *
> - * This is a final state that can only be cleared with a module
> - * re-probe (unbind + bind).
> - * In this state every IOCTL will be blocked so the GT cannot be used.
> + * This is a final state that can only be cleared with the method specified
> + * in the drm wedged uevent. The method needs to be set using xe_device_set_wedged_method
> + * before declaring the device as wedged or the default method of reprobe (unbind/re-bind)
> + * will be sent. In this state every IOCTL will be blocked so the GT cannot be used.

The file convention seems like 80 characters for kernel doc, so let's
stick to it.

>   * In general it will be called upon any critical error such as gt reset
> - * failure or guc loading failure. Userspace will be notified of this state
> - * through device wedged uevent.
> + * failure or guc loading failure or firmware failure.
> + * Userspace will be notified of this state through device wedged uevent.
>   * If xe.wedged module parameter is set to 2, this function will be called
>   * on every single execution timeout (a.k.a. GPU hang) right after devcoredump
>   * snapshot capture. In this mode, GT reset won't be attempted so the state of
> @@ -1152,6 +1164,11 @@ void xe_device_declare_wedged(struct xe_device *xe)
>  		return;
>  	}
>  
> +	/* If no wedge recovery method is set, use default */
> +	if (!xe->wedged.method)
> +		xe_device_set_wedged_method(xe, DRM_WEDGE_RECOVERY_REBIND
> +					    | DRM_WEDGE_RECOVERY_BUS_RESET);

Although there are no strict rules about this, we usually don't begin a
new line with a symbol.

> +
>  	if (!atomic_xchg(&xe->wedged.flag, 1)) {
>  		xe->needs_flr_on_fini = true;
>  		drm_err(&xe->drm,
> @@ -1161,8 +1178,7 @@ void xe_device_declare_wedged(struct xe_device *xe)
>  			dev_name(xe->drm.dev));
>  
>  		/* Notify userspace of wedged device */
> -		drm_dev_wedged_event(&xe->drm,
> -				     DRM_WEDGE_RECOVERY_REBIND | DRM_WEDGE_RECOVERY_BUS_RESET);
> +		drm_dev_wedged_event(&xe->drm, xe->wedged.method);

I was a bit late to realize it when I originally added this. The event
call should be after xe_gt_declare_wedged() to comply with wedging rules.
We notify userspace *after* we're done with driver cleanup.
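
One possible shape of that ordering, as a minimal user-space sketch rather
than the actual xe code. The struct, the step names and RECOVERY_DEFAULT
are hypothetical stand-ins, and for simplicity the sketch runs both steps
only on the first call. It also includes the default-method fallback from
this patch.

```c
#include <assert.h>
#include <stddef.h>

enum step { STEP_GT_WEDGE, STEP_UEVENT };

#define RECOVERY_DEFAULT 0x3UL	/* stand-in for REBIND | BUS_RESET */

/* Hypothetical device state for this sketch; not struct xe_device. */
struct fake_dev {
	int wedged;		/* 0 until the first declare call */
	unsigned long method;	/* 0 means "no method chosen yet" */
	enum step log[2];	/* order in which the steps ran */
	size_t nlog;
};

static void declare_wedged(struct fake_dev *dev)
{
	assert(dev != NULL);

	/* Fall back to the default method if none was set beforehand. */
	if (!dev->method)
		dev->method = RECOVERY_DEFAULT;

	if (dev->wedged)
		return;		/* only the first call does the work */
	dev->wedged = 1;

	dev->log[dev->nlog++] = STEP_GT_WEDGE;	/* driver-side cleanup first */
	dev->log[dev->nlog++] = STEP_UEVENT;	/* then notify userspace */
}
```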

Raag

>  	}
>  
>  	for_each_gt(gt, xe, id)
> diff --git a/drivers/gpu/drm/xe/xe_device.h b/drivers/gpu/drm/xe/xe_device.h
> index 0bc3bc8e6803..06350740aac5 100644
> --- a/drivers/gpu/drm/xe/xe_device.h
> +++ b/drivers/gpu/drm/xe/xe_device.h
> @@ -191,6 +191,7 @@ static inline bool xe_device_wedged(struct xe_device *xe)
>  }
>  
>  void xe_device_declare_wedged(struct xe_device *xe);
> +void xe_device_set_wedged_method(struct xe_device *xe, unsigned long method);
>  
>  struct xe_file *xe_file_get(struct xe_file *xef);
>  void xe_file_put(struct xe_file *xef);
> diff --git a/drivers/gpu/drm/xe/xe_device_types.h b/drivers/gpu/drm/xe/xe_device_types.h
> index b93c04466637..fb3617956d63 100644
> --- a/drivers/gpu/drm/xe/xe_device_types.h
> +++ b/drivers/gpu/drm/xe/xe_device_types.h
> @@ -559,6 +559,8 @@ struct xe_device {
>  		atomic_t flag;
>  		/** @wedged.mode: Mode controlled by kernel parameter and debugfs */
>  		int mode;
> +		/** @wedged.method: Recovery method to be sent in the drm device wedged uevent */
> +		unsigned long method;
>  	} wedged;
>  
>  	/** @bo_device: Struct to control async free of BOs */
> -- 
> 2.47.1
> 


* Re: [PATCH 1/4] drm: Add a firmware flash method to device wedged uevent
  2025-06-05 11:24     ` Riana Tauro
@ 2025-06-16 20:39       ` Rodrigo Vivi
  0 siblings, 0 replies; 12+ messages in thread
From: Rodrigo Vivi @ 2025-06-16 20:39 UTC (permalink / raw)
  To: Riana Tauro
  Cc: Raag Jadav, intel-xe, dri-devel, anshuman.gupta, lucas.demarchi,
	aravind.iddamsetty, himal.prasad.ghimiray, frank.scarbrough,
	andrealmeid, christian.koenig

On Thu, Jun 05, 2025 at 04:54:24PM +0530, Riana Tauro wrote:
> 
> Hi Raag
> 
> On 6/4/2025 4:13 PM, Raag Jadav wrote:
> > On Tue, Jun 03, 2025 at 01:43:57PM +0530, Riana Tauro wrote:
> > > A device is declared wedged when it is non-recoverable from
> > > the driver context. Some firmware errors can also cause
> > > the device to enter this state and the only method to recover
> > > from this would be to do a firmware flash
> > > 
> > > Signed-off-by: Riana Tauro <riana.tauro@intel.com>
> > > ---
> > >   Documentation/gpu/drm-uapi.rst | 6 +++---
> > >   drivers/gpu/drm/drm_drv.c      | 2 ++
> > >   include/drm/drm_device.h       | 1 +
> > >   3 files changed, 6 insertions(+), 3 deletions(-)
> > > 
> > > diff --git a/Documentation/gpu/drm-uapi.rst b/Documentation/gpu/drm-uapi.rst
> > > index 4863a4deb0ee..524224afb09f 100644
> > > --- a/Documentation/gpu/drm-uapi.rst
> > > +++ b/Documentation/gpu/drm-uapi.rst
> > > @@ -422,9 +422,8 @@ Current implementation defines three recovery methods, out of which, drivers
> > >   can use any one, multiple or none. Method(s) of choice will be sent in the
> > >   uevent environment as ``WEDGED=<method1>[,..,<methodN>]`` in order of less to
> > >   more side-effects. If driver is unsure about recovery or method is unknown
> > > -(like soft/hard system reboot, firmware flashing, physical device replacement
> > > -or any other procedure which can't be attempted on the fly), ``WEDGED=unknown``
> > > -will be sent instead.
> > > +(like soft/hard system reboot, physical device replacement or any other procedure
> > > +which can't be attempted on the fly), ``WEDGED=unknown`` will be sent instead.
> > >   Userspace consumers can parse this event and attempt recovery as per the
> > >   following expectations.
> > > @@ -435,6 +434,7 @@ following expectations.
> > >       none            optional telemetry collection
> > >       rebind          unbind + bind driver
> > >       bus-reset       unbind + bus reset/re-enumeration + bind
> > > +    firmware-flash  unbind + firmware flash + bind
> > 
> > Can you guarantee this to be generic for all drivers?
> 
> 
> Firmware flash as a method was mentioned as unknown in the document. So if
> there is an error that requires firmware flash to recover, mentioning this
> as recovery method should be okay
> 
> Wanted to get some comments on unbind/bind. If this is not required will
> remove it.

Yep, it is probably better to remove the unbind/bind and keep this generic.
Even in some of our own cases we would need to unbind + config-survivability
+ rebind + flash firmware + unbind + delete configfs + bind.
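
Keeping the method generic means a userspace consumer only learns which
kind of recovery is needed and chooses the exact steps itself. A minimal
sketch of parsing the value of ``WEDGED=<method1>[,..,<methodN>]`` on the
consumer side is below. The METHOD_* flags and parse_wedged() are
hypothetical and are not the kernel's DRM_WEDGE_* values; the method names
are the ones listed in drm-uapi.rst plus the proposed firmware-flash.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical flag values for this sketch only. */
enum {
	METHOD_NONE      = 1 << 0,
	METHOD_REBIND    = 1 << 1,
	METHOD_BUS_RESET = 1 << 2,
	METHOD_FW_FLASH  = 1 << 3,
	METHOD_UNKNOWN   = 1 << 4,
};

static const struct {
	const char *name;
	unsigned int flag;
} methods[] = {
	{ "none",           METHOD_NONE },
	{ "rebind",         METHOD_REBIND },
	{ "bus-reset",      METHOD_BUS_RESET },
	{ "firmware-flash", METHOD_FW_FLASH },
	{ "unknown",        METHOD_UNKNOWN },
};

/*
 * Parse the comma-separated value of the WEDGED key into a bitmask.
 * Tokens this sketch does not recognise are skipped.
 */
static unsigned int parse_wedged(const char *value)
{
	unsigned int mask = 0;

	assert(value != NULL);

	while (*value) {
		size_t len = strcspn(value, ",");

		for (size_t i = 0; i < sizeof(methods) / sizeof(methods[0]); i++) {
			if (strlen(methods[i].name) == len &&
			    strncmp(value, methods[i].name, len) == 0)
				mask |= methods[i].flag;
		}
		value += len;
		if (*value == ',')
			value++;
	}
	return mask;
}
```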

> 
> Adding reviewers for inputs
> 
> Thanks
> Riana
> 
> 
> > 
> > Raag
> 


* Re: [PATCH 2/4] drm/xe: Add a helper function to set recovery method
  2025-06-06 15:12   ` Raag Jadav
@ 2025-06-19  7:26     ` Riana Tauro
  0 siblings, 0 replies; 12+ messages in thread
From: Riana Tauro @ 2025-06-19  7:26 UTC (permalink / raw)
  To: Raag Jadav
  Cc: intel-xe, dri-devel, anshuman.gupta, rodrigo.vivi, lucas.demarchi,
	aravind.iddamsetty, himal.prasad.ghimiray, frank.scarbrough

Hi Raag

Thank you for the review comments

On 6/6/2025 8:42 PM, Raag Jadav wrote:
> On Tue, Jun 03, 2025 at 01:43:58PM +0530, Riana Tauro wrote:
>> Add a helper function to set recovery method. The recovery
>> method has to be set before declaring the device wedged and sending the
>> drm wedged uevent. If no method is set, default unbind/re-bind method
>> will be set
>>
>> Signed-off-by: Riana Tauro <riana.tauro@intel.com>
>> ---
>>   drivers/gpu/drm/xe/xe_device.c       | 30 +++++++++++++++++++++-------
>>   drivers/gpu/drm/xe/xe_device.h       |  1 +
>>   drivers/gpu/drm/xe/xe_device_types.h |  2 ++
>>   3 files changed, 26 insertions(+), 7 deletions(-)
>>
>> diff --git a/drivers/gpu/drm/xe/xe_device.c b/drivers/gpu/drm/xe/xe_device.c
>> index 660b0c5126dc..3fd604ebdc6e 100644
>> --- a/drivers/gpu/drm/xe/xe_device.c
>> +++ b/drivers/gpu/drm/xe/xe_device.c
>> @@ -1120,16 +1120,28 @@ static void xe_device_wedged_fini(struct drm_device *drm, void *arg)
>>   	xe_pm_runtime_put(xe);
>>   }
>>   
>> +/**
>> + * xe_device_set_wedged_method - Set wedged recovery method
>> + * @xe: xe device instance
> 
> Missing @method

Missed this. Will fix it.

>
>> + *
>> + * Set wedged recovery method to be sent using drm wedged uevent.
>> + */
>> +void xe_device_set_wedged_method(struct xe_device *xe, unsigned long method)
>> +{
>> +	xe->wedged.method = method;
>> +}
>> +
>>   /**
>>    * xe_device_declare_wedged - Declare device wedged
>>    * @xe: xe device instance
>>    *
>> - * This is a final state that can only be cleared with a module
>> - * re-probe (unbind + bind).
>> - * In this state every IOCTL will be blocked so the GT cannot be used.
>> + * This is a final state that can only be cleared with the method specified
>> + * in the drm wedged uevent. The method needs to be set using xe_device_set_wedged_method
>> + * before declaring the device as wedged or the default method of reprobe (unbind/re-bind)
>> + * will be sent. In this state every IOCTL will be blocked so the GT cannot be used.
> 
> The file convention seems like 80 characters for kernel doc, so let's
> stick to it.

okay

> 
>>    * In general it will be called upon any critical error such as gt reset
>> - * failure or guc loading failure. Userspace will be notified of this state
>> - * through device wedged uevent.
>> + * failure or guc loading failure or firmware failure.
>> + * Userspace will be notified of this state through device wedged uevent.
>>    * If xe.wedged module parameter is set to 2, this function will be called
>>    * on every single execution timeout (a.k.a. GPU hang) right after devcoredump
>>    * snapshot capture. In this mode, GT reset won't be attempted so the state of
>> @@ -1152,6 +1164,11 @@ void xe_device_declare_wedged(struct xe_device *xe)
>>   		return;
>>   	}
>>   
>> +	/* If no wedge recovery method is set, use default */
>> +	if (!xe->wedged.method)
>> +		xe_device_set_wedged_method(xe, DRM_WEDGE_RECOVERY_REBIND
>> +					    | DRM_WEDGE_RECOVERY_BUS_RESET);
> 
> Although there are no strict rules about this, we usually don't begin a
> new line with a symbol.

Will fix this.

> 
>> +
>>   	if (!atomic_xchg(&xe->wedged.flag, 1)) {
>>   		xe->needs_flr_on_fini = true;
>>   		drm_err(&xe->drm,
>> @@ -1161,8 +1178,7 @@ void xe_device_declare_wedged(struct xe_device *xe)
>>   			dev_name(xe->drm.dev));
>>   
>>   		/* Notify userspace of wedged device */
>> -		drm_dev_wedged_event(&xe->drm,
>> -				     DRM_WEDGE_RECOVERY_REBIND | DRM_WEDGE_RECOVERY_BUS_RESET);
>> +		drm_dev_wedged_event(&xe->drm, xe->wedged.method);
> 
> I was a bit late to realize it when I originally added this. The event
> call should be after xe_gt_declare_wedged() to comply with wedging rules.
> We notify userspace *after* we're done with driver cleanup.

Will move xe_gt_declare_wedged() before the uevent.

Thanks
Riana

> 
> Raag
> 
>>   	}
>>   
>>   	for_each_gt(gt, xe, id)
>> diff --git a/drivers/gpu/drm/xe/xe_device.h b/drivers/gpu/drm/xe/xe_device.h
>> index 0bc3bc8e6803..06350740aac5 100644
>> --- a/drivers/gpu/drm/xe/xe_device.h
>> +++ b/drivers/gpu/drm/xe/xe_device.h
>> @@ -191,6 +191,7 @@ static inline bool xe_device_wedged(struct xe_device *xe)
>>   }
>>   
>>   void xe_device_declare_wedged(struct xe_device *xe);
>> +void xe_device_set_wedged_method(struct xe_device *xe, unsigned long method);
>>   
>>   struct xe_file *xe_file_get(struct xe_file *xef);
>>   void xe_file_put(struct xe_file *xef);
>> diff --git a/drivers/gpu/drm/xe/xe_device_types.h b/drivers/gpu/drm/xe/xe_device_types.h
>> index b93c04466637..fb3617956d63 100644
>> --- a/drivers/gpu/drm/xe/xe_device_types.h
>> +++ b/drivers/gpu/drm/xe/xe_device_types.h
>> @@ -559,6 +559,8 @@ struct xe_device {
>>   		atomic_t flag;
>>   		/** @wedged.mode: Mode controlled by kernel parameter and debugfs */
>>   		int mode;
>> +		/** @wedged.method: Recovery method to be sent in the drm device wedged uevent */
>> +		unsigned long method;
>>   	} wedged;
>>   
>>   	/** @bo_device: Struct to control async free of BOs */
>> -- 
>> 2.47.1
>>



Thread overview: 12+ messages
2025-06-03  8:13 [PATCH 0/4] Handle Firmware reported Hardware Errors Riana Tauro
2025-06-03  8:13 ` [PATCH 1/4] drm: Add a firmware flash method to device wedged uevent Riana Tauro
2025-06-04 10:43   ` Raag Jadav
2025-06-05 11:24     ` Riana Tauro
2025-06-16 20:39       ` Rodrigo Vivi
2025-06-03  8:13 ` [PATCH 2/4] drm/xe: Add a helper function to set recovery method Riana Tauro
2025-06-06 15:12   ` Raag Jadav
2025-06-19  7:26     ` Riana Tauro
2025-06-03  8:13 ` [PATCH 3/4] drm/xe: Add support to handle hardware errors Riana Tauro
2025-06-03  8:14 ` [PATCH 4/4] drm/xe/xe_hw_error: Handle CSC Firmware reported Hardware errors Riana Tauro
2025-06-03  9:31   ` Ghimiray, Himal Prasad
2025-06-03  9:53     ` Riana Tauro