* [PATCH 0/4] Handle Firmware reported Hardware Errors
@ 2025-06-03 8:13 Riana Tauro
2025-06-03 8:13 ` [PATCH 1/4] drm: Add a firmware flash method to device wedged uevent Riana Tauro
` (3 more replies)
0 siblings, 4 replies; 12+ messages in thread
From: Riana Tauro @ 2025-06-03 8:13 UTC (permalink / raw)
To: intel-xe, dri-devel
Cc: riana.tauro, anshuman.gupta, rodrigo.vivi, lucas.demarchi,
aravind.iddamsetty, raag.jadav, himal.prasad.ghimiray,
frank.scarbrough
Add support to handle firmware reported errors. When CSC firmware
errors are encountered, an error interrupt is received by the GFX device as
an MSI interrupt.
The Device Source control registers indicate the source of the error as CSC.
The HEC error status register indicates that the error is firmware reported.
Depending on the type of firmware error, the error cause is written to the HEC
Firmware error register.
On encountering such CSC firmware errors, the graphics device is
non-recoverable from driver context. The only way to recover from these
errors is a firmware flash.
Add a firmware flash method to the drm device wedged uevent. Send
this method along with the uevent to notify userspace of the wedged
state and the possible recovery method.
$ udevadm monitor --property --kernel
monitor will print the received events for:
KERNEL - the kernel uevent
KERNEL[754.709341] change /devices/pci0000:00/0000:00:01.0/0000:01:00.0/0000:02:01.0/0000:03:00.0/drm/card0 (drm)
ACTION=change
DEVPATH=/devices/pci0000:00/0000:00:01.0/0000:01:00.0/0000:02:01.0/0000:03:00.0/drm/card0
SUBSYSTEM=drm
WEDGED=firmware-flash
DEVNAME=/dev/dri/card0
DEVTYPE=drm_minor
SEQNUM=5973
MAJOR=226
MINOR=0
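As a rough illustration of how a userspace consumer might interpret the WEDGED property shown above, here is a minimal C sketch. It only parses one property string; receiving the uevent itself (for example through libudev) is left out. The WEDGE_* macros and both function names are hypothetical and exist only for this example; the method names are the ones listed in the drm-uapi table touched by patch 1.

```c
#include <stddef.h>
#include <string.h>

/* Hypothetical userspace-side flags, defined only for this sketch. */
#define WEDGE_NONE      (1u << 0)
#define WEDGE_REBIND    (1u << 1)
#define WEDGE_BUS_RESET (1u << 2)
#define WEDGE_FW_FLASH  (1u << 3)
#define WEDGE_UNKNOWN   (1u << 4)

/* Map one method name (not NUL-terminated, length len) to a flag. */
static unsigned int wedge_method_from_name(const char *name, size_t len)
{
	static const struct {
		const char *name;
		unsigned int flag;
	} map[] = {
		{ "none", WEDGE_NONE },
		{ "rebind", WEDGE_REBIND },
		{ "bus-reset", WEDGE_BUS_RESET },
		{ "firmware-flash", WEDGE_FW_FLASH },
	};
	size_t i;

	for (i = 0; i < sizeof(map) / sizeof(map[0]); i++)
		if (strlen(map[i].name) == len && !strncmp(name, map[i].name, len))
			return map[i].flag;

	/* "unknown" and anything unrecognised are left to consumer policy. */
	return WEDGE_UNKNOWN;
}

/*
 * Parse one uevent property such as "WEDGED=rebind,bus-reset".
 * Returns a mask of WEDGE_* flags, or 0 if this is not a WEDGED property.
 */
unsigned int parse_wedged_property(const char *prop)
{
	static const char key[] = "WEDGED=";
	unsigned int mask = 0;
	const char *p;

	if (strncmp(prop, key, sizeof(key) - 1) != 0)
		return 0;

	p = prop + sizeof(key) - 1;
	while (*p) {
		size_t len = strcspn(p, ",");

		mask |= wedge_method_from_name(p, len);
		p += len;
		if (*p == ',')
			p++;
	}

	return mask;
}
```

A consumer that sees WEDGE_FW_FLASH in the returned mask would, per the table in patch 1, unbind the driver, flash the firmware and bind again; how the flash itself is done is outside the scope of this series.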
Bspec: 50875, 53073, 53074, 53075, 53076
Riana Tauro (4):
drm: Add a firmware flash method to device wedged uevent
drm/xe: Add a helper function to set recovery method
drm/xe: Add support to handle hardware errors
drm/xe/xe_hw_error: Handle CSC Firmware reported Hardware errors
Documentation/gpu/drm-uapi.rst | 6 +-
drivers/gpu/drm/drm_drv.c | 2 +
drivers/gpu/drm/xe/Makefile | 1 +
drivers/gpu/drm/xe/regs/xe_gsc_regs.h | 2 +
drivers/gpu/drm/xe/regs/xe_hw_error_regs.h | 20 +++
drivers/gpu/drm/xe/regs/xe_irq_regs.h | 1 +
drivers/gpu/drm/xe/xe_device.c | 30 +++-
drivers/gpu/drm/xe/xe_device.h | 1 +
drivers/gpu/drm/xe/xe_device_types.h | 5 +
drivers/gpu/drm/xe/xe_hw_error.c | 171 +++++++++++++++++++++
drivers/gpu/drm/xe/xe_hw_error.h | 15 ++
drivers/gpu/drm/xe/xe_irq.c | 4 +
include/drm/drm_device.h | 1 +
13 files changed, 249 insertions(+), 10 deletions(-)
create mode 100644 drivers/gpu/drm/xe/regs/xe_hw_error_regs.h
create mode 100644 drivers/gpu/drm/xe/xe_hw_error.c
create mode 100644 drivers/gpu/drm/xe/xe_hw_error.h
--
2.47.1
* [PATCH 1/4] drm: Add a firmware flash method to device wedged uevent
2025-06-03 8:13 [PATCH 0/4] Handle Firmware reported Hardware Errors Riana Tauro
@ 2025-06-03 8:13 ` Riana Tauro
2025-06-04 10:43 ` Raag Jadav
2025-06-03 8:13 ` [PATCH 2/4] drm/xe: Add a helper function to set recovery method Riana Tauro
` (2 subsequent siblings)
3 siblings, 1 reply; 12+ messages in thread
From: Riana Tauro @ 2025-06-03 8:13 UTC (permalink / raw)
To: intel-xe, dri-devel
Cc: riana.tauro, anshuman.gupta, rodrigo.vivi, lucas.demarchi,
aravind.iddamsetty, raag.jadav, himal.prasad.ghimiray,
frank.scarbrough
A device is declared wedged when it is non-recoverable from
the driver context. Some firmware errors can also cause
the device to enter this state, and the only way to recover
from them is a firmware flash.
Signed-off-by: Riana Tauro <riana.tauro@intel.com>
---
Documentation/gpu/drm-uapi.rst | 6 +++---
drivers/gpu/drm/drm_drv.c | 2 ++
include/drm/drm_device.h | 1 +
3 files changed, 6 insertions(+), 3 deletions(-)
diff --git a/Documentation/gpu/drm-uapi.rst b/Documentation/gpu/drm-uapi.rst
index 4863a4deb0ee..524224afb09f 100644
--- a/Documentation/gpu/drm-uapi.rst
+++ b/Documentation/gpu/drm-uapi.rst
@@ -422,9 +422,8 @@ Current implementation defines three recovery methods, out of which, drivers
can use any one, multiple or none. Method(s) of choice will be sent in the
uevent environment as ``WEDGED=<method1>[,..,<methodN>]`` in order of less to
more side-effects. If driver is unsure about recovery or method is unknown
-(like soft/hard system reboot, firmware flashing, physical device replacement
-or any other procedure which can't be attempted on the fly), ``WEDGED=unknown``
-will be sent instead.
+(like soft/hard system reboot, physical device replacement or any other procedure
+which can't be attempted on the fly), ``WEDGED=unknown`` will be sent instead.
Userspace consumers can parse this event and attempt recovery as per the
following expectations.
@@ -435,6 +434,7 @@ following expectations.
none optional telemetry collection
rebind unbind + bind driver
bus-reset unbind + bus reset/re-enumeration + bind
+ firmware-flash unbind + firmware flash + bind
unknown consumer policy
=============== ========================================
diff --git a/drivers/gpu/drm/drm_drv.c b/drivers/gpu/drm/drm_drv.c
index 56dd61f8e05a..e287424db753 100644
--- a/drivers/gpu/drm/drm_drv.c
+++ b/drivers/gpu/drm/drm_drv.c
@@ -533,6 +533,8 @@ static const char *drm_get_wedge_recovery(unsigned int opt)
return "rebind";
case DRM_WEDGE_RECOVERY_BUS_RESET:
return "bus-reset";
+ case DRM_WEDGE_RECOVERY_FW_FLASH:
+ return "firmware-flash";
default:
return NULL;
}
diff --git a/include/drm/drm_device.h b/include/drm/drm_device.h
index e2f894f1b90a..e446e40f8306 100644
--- a/include/drm/drm_device.h
+++ b/include/drm/drm_device.h
@@ -29,6 +29,7 @@ struct pci_controller;
#define DRM_WEDGE_RECOVERY_NONE BIT(0) /* optional telemetry collection */
#define DRM_WEDGE_RECOVERY_REBIND BIT(1) /* unbind + bind driver */
#define DRM_WEDGE_RECOVERY_BUS_RESET BIT(2) /* unbind + reset bus device + bind */
+#define DRM_WEDGE_RECOVERY_FW_FLASH BIT(3) /* unbind + firmware flash + bind */
/**
* enum switch_power_state - power state of drm device
--
2.47.1
* [PATCH 2/4] drm/xe: Add a helper function to set recovery method
2025-06-03 8:13 [PATCH 0/4] Handle Firmware reported Hardware Errors Riana Tauro
2025-06-03 8:13 ` [PATCH 1/4] drm: Add a firmware flash method to device wedged uevent Riana Tauro
@ 2025-06-03 8:13 ` Riana Tauro
2025-06-06 15:12 ` Raag Jadav
2025-06-03 8:13 ` [PATCH 3/4] drm/xe: Add support to handle hardware errors Riana Tauro
2025-06-03 8:14 ` [PATCH 4/4] drm/xe/xe_hw_error: Handle CSC Firmware reported Hardware errors Riana Tauro
3 siblings, 1 reply; 12+ messages in thread
From: Riana Tauro @ 2025-06-03 8:13 UTC (permalink / raw)
To: intel-xe, dri-devel
Cc: riana.tauro, anshuman.gupta, rodrigo.vivi, lucas.demarchi,
aravind.iddamsetty, raag.jadav, himal.prasad.ghimiray,
frank.scarbrough
Add a helper function to set the recovery method. The recovery
method has to be set before declaring the device wedged and sending the
drm wedged uevent. If no method is set, the default methods (rebind and
bus-reset) are used.
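The ordering described above can be modelled outside the kernel with a small C sketch. This is not the driver code: struct wedged_state, the WEDGE_RECOVERY_* macros and both functions are stand-ins invented for this example, and a plain int replaces the atomic flag used in the patch.

```c
/* Stand-in flags for this sketch; the bit positions follow drm_device.h. */
#define WEDGE_RECOVERY_REBIND    (1ul << 1)
#define WEDGE_RECOVERY_BUS_RESET (1ul << 2)
#define WEDGE_RECOVERY_FW_FLASH  (1ul << 3)

/* Simplified model of the wedged state kept in struct xe_device. */
struct wedged_state {
	int flag;             /* non-zero once the device is wedged */
	unsigned long method; /* recovery method(s) to report */
};

void set_wedged_method(struct wedged_state *w, unsigned long method)
{
	w->method = method;
}

/*
 * Returns the method mask that would be sent in the uevent, or 0 if the
 * device was already wedged and no new uevent would be sent.
 */
unsigned long declare_wedged(struct wedged_state *w)
{
	/* If no recovery method was set beforehand, fall back to the default. */
	if (!w->method)
		set_wedged_method(w, WEDGE_RECOVERY_REBIND |
				     WEDGE_RECOVERY_BUS_RESET);

	if (w->flag)
		return 0;

	w->flag = 1;
	return w->method;
}
```

In this model, a caller that wants firmware-flash reported has to call set_wedged_method() first; calling it after declare_wedged() changes the stored value but no further event is produced.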
Signed-off-by: Riana Tauro <riana.tauro@intel.com>
---
drivers/gpu/drm/xe/xe_device.c | 30 +++++++++++++++++++++-------
drivers/gpu/drm/xe/xe_device.h | 1 +
drivers/gpu/drm/xe/xe_device_types.h | 2 ++
3 files changed, 26 insertions(+), 7 deletions(-)
diff --git a/drivers/gpu/drm/xe/xe_device.c b/drivers/gpu/drm/xe/xe_device.c
index 660b0c5126dc..3fd604ebdc6e 100644
--- a/drivers/gpu/drm/xe/xe_device.c
+++ b/drivers/gpu/drm/xe/xe_device.c
@@ -1120,16 +1120,28 @@ static void xe_device_wedged_fini(struct drm_device *drm, void *arg)
xe_pm_runtime_put(xe);
}
+/**
+ * xe_device_set_wedged_method - Set wedged recovery method
+ * @xe: xe device instance
+ *
+ * Set wedged recovery method to be sent using drm wedged uevent.
+ */
+void xe_device_set_wedged_method(struct xe_device *xe, unsigned long method)
+{
+ xe->wedged.method = method;
+}
+
/**
* xe_device_declare_wedged - Declare device wedged
* @xe: xe device instance
*
- * This is a final state that can only be cleared with a module
- * re-probe (unbind + bind).
- * In this state every IOCTL will be blocked so the GT cannot be used.
+ * This is a final state that can only be cleared with the method specified
+ * in the drm wedged uevent. The method needs to be set using xe_device_set_wedged_method
+ * before declaring the device as wedged or the default method of reprobe (unbind/re-bind)
+ * will be sent. In this state every IOCTL will be blocked so the GT cannot be used.
* In general it will be called upon any critical error such as gt reset
- * failure or guc loading failure. Userspace will be notified of this state
- * through device wedged uevent.
+ * failure or guc loading failure or firmware failure.
+ * Userspace will be notified of this state through device wedged uevent.
* If xe.wedged module parameter is set to 2, this function will be called
* on every single execution timeout (a.k.a. GPU hang) right after devcoredump
* snapshot capture. In this mode, GT reset won't be attempted so the state of
@@ -1152,6 +1164,11 @@ void xe_device_declare_wedged(struct xe_device *xe)
return;
}
+ /* If no wedge recovery method is set, use default */
+ if (!xe->wedged.method)
+ xe_device_set_wedged_method(xe, DRM_WEDGE_RECOVERY_REBIND
+ | DRM_WEDGE_RECOVERY_BUS_RESET);
+
if (!atomic_xchg(&xe->wedged.flag, 1)) {
xe->needs_flr_on_fini = true;
drm_err(&xe->drm,
@@ -1161,8 +1178,7 @@ void xe_device_declare_wedged(struct xe_device *xe)
dev_name(xe->drm.dev));
/* Notify userspace of wedged device */
- drm_dev_wedged_event(&xe->drm,
- DRM_WEDGE_RECOVERY_REBIND | DRM_WEDGE_RECOVERY_BUS_RESET);
+ drm_dev_wedged_event(&xe->drm, xe->wedged.method);
}
for_each_gt(gt, xe, id)
diff --git a/drivers/gpu/drm/xe/xe_device.h b/drivers/gpu/drm/xe/xe_device.h
index 0bc3bc8e6803..06350740aac5 100644
--- a/drivers/gpu/drm/xe/xe_device.h
+++ b/drivers/gpu/drm/xe/xe_device.h
@@ -191,6 +191,7 @@ static inline bool xe_device_wedged(struct xe_device *xe)
}
void xe_device_declare_wedged(struct xe_device *xe);
+void xe_device_set_wedged_method(struct xe_device *xe, unsigned long method);
struct xe_file *xe_file_get(struct xe_file *xef);
void xe_file_put(struct xe_file *xef);
diff --git a/drivers/gpu/drm/xe/xe_device_types.h b/drivers/gpu/drm/xe/xe_device_types.h
index b93c04466637..fb3617956d63 100644
--- a/drivers/gpu/drm/xe/xe_device_types.h
+++ b/drivers/gpu/drm/xe/xe_device_types.h
@@ -559,6 +559,8 @@ struct xe_device {
atomic_t flag;
/** @wedged.mode: Mode controlled by kernel parameter and debugfs */
int mode;
+ /** @wedged.method: Recovery method to be sent in the drm device wedged uevent */
+ unsigned long method;
} wedged;
/** @bo_device: Struct to control async free of BOs */
--
2.47.1
* [PATCH 3/4] drm/xe: Add support to handle hardware errors
2025-06-03 8:13 [PATCH 0/4] Handle Firmware reported Hardware Errors Riana Tauro
2025-06-03 8:13 ` [PATCH 1/4] drm: Add a firmware flash method to device wedged uevent Riana Tauro
2025-06-03 8:13 ` [PATCH 2/4] drm/xe: Add a helper function to set recovery method Riana Tauro
@ 2025-06-03 8:13 ` Riana Tauro
2025-06-03 8:14 ` [PATCH 4/4] drm/xe/xe_hw_error: Handle CSC Firmware reported Hardware errors Riana Tauro
3 siblings, 0 replies; 12+ messages in thread
From: Riana Tauro @ 2025-06-03 8:13 UTC (permalink / raw)
To: intel-xe, dri-devel
Cc: riana.tauro, anshuman.gupta, rodrigo.vivi, lucas.demarchi,
aravind.iddamsetty, raag.jadav, himal.prasad.ghimiray,
frank.scarbrough
The Gfx device reports two classes of errors: uncorrectable and
correctable. Depending on the severity, uncorrectable errors are
further classified as non-fatal and fatal.
Correctable and non-fatal errors are reported as MSIs, and bits in
the Master Interrupt Register indicate the class of the error.
The source of the error is then read from the Device Error Source
Register. Fatal errors are reported as PCIe errors.
When a PCIe error is asserted, the OS will perform a device warm reset,
which causes the driver to reload. The error registers are sticky
and their values are maintained through a warm reset.
Add basic support to handle these errors.
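To make the bit and register layout concrete, the following standalone C sketch decodes the error-class bits of a master interrupt value and computes the matching Device Error Source register offset. The constants are copied from the diff below; PICK_EVEN() is assumed to behave like the kernel's _PICK_EVEN(), that is a + index * (b - a), and the two function names are hypothetical.

```c
#include <stdint.h>

enum hardware_error {
	HARDWARE_ERROR_CORRECTABLE = 0,
	HARDWARE_ERROR_NONFATAL = 1,
	HARDWARE_ERROR_FATAL = 2,
	HARDWARE_ERROR_MAX,
};

/* One bit per error class, starting at bit 26 of the master IRQ register. */
#define ERROR_IRQ(x)             (UINT32_C(1) << (26 + (x)))

#define DEV_ERR_STAT_NONFATAL    0x100178
#define DEV_ERR_STAT_CORRECTABLE 0x10017c

/* Assumption: same arithmetic as the kernel's _PICK_EVEN() helper. */
#define PICK_EVEN(i, a, b)       ((a) + (i) * ((b) - (a)))

/* Offset of the Device Error Source register for one error class. */
uint32_t dev_err_stat_offset(enum hardware_error hw_err)
{
	return (uint32_t)PICK_EVEN((int)hw_err, DEV_ERR_STAT_CORRECTABLE,
				   DEV_ERR_STAT_NONFATAL);
}

/*
 * Collect the error classes flagged in master_ctl into out[].
 * Returns the number of entries written.
 */
unsigned int pending_error_classes(uint32_t master_ctl,
				   enum hardware_error out[HARDWARE_ERROR_MAX])
{
	unsigned int n = 0;
	int hw_err;

	for (hw_err = 0; hw_err < HARDWARE_ERROR_MAX; hw_err++)
		if (master_ctl & ERROR_IRQ(hw_err))
			out[n++] = (enum hardware_error)hw_err;

	return n;
}
```

Because the correctable register sits at the higher address, the stride is negative: index 0 selects 0x10017c and index 1 selects 0x100178. The offset computed for the fatal class is only what the formula yields; the series does not name that register explicitly.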
Bspec: 50875, 53073, 53074, 53075, 53076
Co-developed-by: Himal Prasad Ghimiray <himal.prasad.ghimiray@intel.com>
Signed-off-by: Himal Prasad Ghimiray <himal.prasad.ghimiray@intel.com>
Signed-off-by: Riana Tauro <riana.tauro@intel.com>
---
drivers/gpu/drm/xe/Makefile | 1 +
drivers/gpu/drm/xe/regs/xe_hw_error_regs.h | 15 +++
drivers/gpu/drm/xe/regs/xe_irq_regs.h | 1 +
drivers/gpu/drm/xe/xe_hw_error.c | 108 +++++++++++++++++++++
drivers/gpu/drm/xe/xe_hw_error.h | 15 +++
drivers/gpu/drm/xe/xe_irq.c | 4 +
6 files changed, 144 insertions(+)
create mode 100644 drivers/gpu/drm/xe/regs/xe_hw_error_regs.h
create mode 100644 drivers/gpu/drm/xe/xe_hw_error.c
create mode 100644 drivers/gpu/drm/xe/xe_hw_error.h
diff --git a/drivers/gpu/drm/xe/Makefile b/drivers/gpu/drm/xe/Makefile
index e4bf484d4121..29f4e64068b7 100644
--- a/drivers/gpu/drm/xe/Makefile
+++ b/drivers/gpu/drm/xe/Makefile
@@ -73,6 +73,7 @@ xe-y += xe_bb.o \
xe_hw_engine.o \
xe_hw_engine_class_sysfs.o \
xe_hw_engine_group.o \
+ xe_hw_error.o \
xe_hw_fence.o \
xe_irq.o \
xe_lrc.o \
diff --git a/drivers/gpu/drm/xe/regs/xe_hw_error_regs.h b/drivers/gpu/drm/xe/regs/xe_hw_error_regs.h
new file mode 100644
index 000000000000..ed9b81fb28a0
--- /dev/null
+++ b/drivers/gpu/drm/xe/regs/xe_hw_error_regs.h
@@ -0,0 +1,15 @@
+/* SPDX-License-Identifier: MIT */
+/*
+ * Copyright © 2025 Intel Corporation
+ */
+
+#ifndef _XE_HW_ERROR_REGS_H_
+#define _XE_HW_ERROR_REGS_H_
+
+#define DEV_ERR_STAT_NONFATAL 0x100178
+#define DEV_ERR_STAT_CORRECTABLE 0x10017c
+#define DEV_ERR_STAT_REG(x) XE_REG(_PICK_EVEN((x), \
+ DEV_ERR_STAT_CORRECTABLE, \
+ DEV_ERR_STAT_NONFATAL))
+
+#endif
diff --git a/drivers/gpu/drm/xe/regs/xe_irq_regs.h b/drivers/gpu/drm/xe/regs/xe_irq_regs.h
index f0ecfcac4003..2758b64cec9e 100644
--- a/drivers/gpu/drm/xe/regs/xe_irq_regs.h
+++ b/drivers/gpu/drm/xe/regs/xe_irq_regs.h
@@ -18,6 +18,7 @@
#define GFX_MSTR_IRQ XE_REG(0x190010, XE_REG_OPTION_VF)
#define MASTER_IRQ REG_BIT(31)
#define GU_MISC_IRQ REG_BIT(29)
+#define ERROR_IRQ(x) REG_BIT(26 + (x))
#define DISPLAY_IRQ REG_BIT(16)
#define GT_DW_IRQ(x) REG_BIT(x)
diff --git a/drivers/gpu/drm/xe/xe_hw_error.c b/drivers/gpu/drm/xe/xe_hw_error.c
new file mode 100644
index 000000000000..0f2590839900
--- /dev/null
+++ b/drivers/gpu/drm/xe/xe_hw_error.c
@@ -0,0 +1,108 @@
+// SPDX-License-Identifier: MIT
+/*
+ * Copyright © 2025 Intel Corporation
+ */
+
+#include "regs/xe_hw_error_regs.h"
+#include "regs/xe_irq_regs.h"
+
+#include "xe_device.h"
+#include "xe_hw_error.h"
+#include "xe_mmio.h"
+
+/* Error categories reported by hardware */
+enum hardware_error {
+ HARDWARE_ERROR_CORRECTABLE = 0,
+ HARDWARE_ERROR_NONFATAL = 1,
+ HARDWARE_ERROR_FATAL = 2,
+ HARDWARE_ERROR_MAX,
+};
+
+static const char *hw_error_to_str(const enum hardware_error hw_err)
+{
+ switch (hw_err) {
+ case HARDWARE_ERROR_CORRECTABLE:
+ return "CORRECTABLE";
+ case HARDWARE_ERROR_NONFATAL:
+ return "NONFATAL";
+ case HARDWARE_ERROR_FATAL:
+ return "FATAL";
+ default:
+ return "UNKNOWN";
+ }
+}
+
+static void hw_error_source_handler(struct xe_tile *tile, const enum hardware_error hw_err)
+{
+ const char *hw_err_str = hw_error_to_str(hw_err);
+ struct xe_device *xe = tile_to_xe(tile);
+ unsigned long flags;
+ u32 err_src;
+
+ if (xe->info.platform != XE_BATTLEMAGE)
+ return;
+
+ spin_lock_irqsave(&xe->irq.lock, flags);
+ err_src = xe_mmio_read32(&tile->mmio, DEV_ERR_STAT_REG(hw_err));
+ if (!err_src) {
+ drm_err_ratelimited(&xe->drm, HW_ERR "Tile%d reported DEV_ERR_STAT_%s blank!\n",
+ tile->id, hw_err_str);
+ goto unlock;
+ }
+
+ /* TODO: Process errrors per source */
+
+ xe_mmio_write32(&tile->mmio, DEV_ERR_STAT_REG(hw_err), err_src);
+
+unlock:
+ spin_unlock_irqrestore(&xe->irq.lock, flags);
+}
+
+/**
+ * xe_hw_error_irq_handler - irq handling for hw errors
+ * @tile: tile instance
+ * @master_ctl: value read from master interrupt register
+ *
+ * Xe platforms add three error bits to the master interrupt register to support error handling.
+ * These three bits are used to convey the class of error FATAL, NONFATAL, or CORRECTABLE.
+ * To process the interrupt, determine the source of error by reading the Device Error Source
+ * Register that corresponds to the class of error being serviced.
+ */
+void xe_hw_error_irq_handler(struct xe_tile *tile, const u32 master_ctl)
+{
+ enum hardware_error hw_err;
+
+ for (hw_err = 0; hw_err < HARDWARE_ERROR_MAX; hw_err++)
+ if (master_ctl & ERROR_IRQ(hw_err))
+ hw_error_source_handler(tile, hw_err);
+}
+
+/*
+ * Process hardware errors during boot
+ */
+static void process_hw_errors(struct xe_device *xe)
+{
+ struct xe_tile *tile;
+ u32 master_ctl;
+ u8 id;
+
+ for_each_tile(tile, xe, id) {
+ master_ctl = xe_mmio_read32(&tile->mmio, GFX_MSTR_IRQ);
+ xe_hw_error_irq_handler(tile, master_ctl);
+ xe_mmio_write32(&tile->mmio, GFX_MSTR_IRQ, master_ctl);
+ }
+}
+
+/**
+ * xe_hw_error_init - Initialize hw errors
+ * @xe: xe device instance
+ *
+ * Initialize and process hw errors
+ */
+void xe_hw_error_init(struct xe_device *xe)
+{
+ if (!IS_DGFX(xe) || IS_SRIOV_VF(xe))
+ return;
+
+ process_hw_errors(xe);
+}
diff --git a/drivers/gpu/drm/xe/xe_hw_error.h b/drivers/gpu/drm/xe/xe_hw_error.h
new file mode 100644
index 000000000000..d86e28c5180c
--- /dev/null
+++ b/drivers/gpu/drm/xe/xe_hw_error.h
@@ -0,0 +1,15 @@
+/* SPDX-License-Identifier: MIT */
+/*
+ * Copyright © 2025 Intel Corporation
+ */
+#ifndef XE_HW_ERROR_H_
+#define XE_HW_ERROR_H_
+
+#include <linux/types.h>
+
+struct xe_tile;
+struct xe_device;
+
+void xe_hw_error_irq_handler(struct xe_tile *tile, const u32 master_ctl);
+void xe_hw_error_init(struct xe_device *xe);
+#endif
diff --git a/drivers/gpu/drm/xe/xe_irq.c b/drivers/gpu/drm/xe/xe_irq.c
index 5362d3174b06..24ccf3bec52c 100644
--- a/drivers/gpu/drm/xe/xe_irq.c
+++ b/drivers/gpu/drm/xe/xe_irq.c
@@ -18,6 +18,7 @@
#include "xe_gt.h"
#include "xe_guc.h"
#include "xe_hw_engine.h"
+#include "xe_hw_error.h"
#include "xe_memirq.h"
#include "xe_mmio.h"
#include "xe_pxp.h"
@@ -466,6 +467,7 @@ static irqreturn_t dg1_irq_handler(int irq, void *arg)
xe_mmio_write32(mmio, GFX_MSTR_IRQ, master_ctl);
gt_irq_handler(tile, master_ctl, intr_dw, identity);
+ xe_hw_error_irq_handler(tile, master_ctl);
/*
* Display interrupts (including display backlight operations
@@ -753,6 +755,8 @@ int xe_irq_install(struct xe_device *xe)
int nvec = 1;
int err;
+ xe_hw_error_init(xe);
+
xe_irq_reset(xe);
if (xe_device_has_msix(xe)) {
--
2.47.1
* [PATCH 4/4] drm/xe/xe_hw_error: Handle CSC Firmware reported Hardware errors
2025-06-03 8:13 [PATCH 0/4] Handle Firmware reported Hardware Errors Riana Tauro
` (2 preceding siblings ...)
2025-06-03 8:13 ` [PATCH 3/4] drm/xe: Add support to handle hardware errors Riana Tauro
@ 2025-06-03 8:14 ` Riana Tauro
2025-06-03 9:31 ` Ghimiray, Himal Prasad
3 siblings, 1 reply; 12+ messages in thread
From: Riana Tauro @ 2025-06-03 8:14 UTC (permalink / raw)
To: intel-xe, dri-devel
Cc: riana.tauro, anshuman.gupta, rodrigo.vivi, lucas.demarchi,
aravind.iddamsetty, raag.jadav, himal.prasad.ghimiray,
frank.scarbrough
Add support to handle CSC firmware reported errors. When CSC firmware
errors are encountered, an error interrupt is received by the GFX device as
an MSI interrupt.
The Device Source control registers indicate the source of the error as CSC.
The HEC error status register indicates that the error is firmware reported.
Depending on the type of error, the error cause is written to the HEC
Firmware error register.
On encountering such CSC firmware errors, the graphics device is
non-recoverable from driver context. The only way to recover from these
errors is a firmware flash. The device is then wedged and userspace is
notified with a drm uevent.
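The last decoding step, turning the HEC Firmware error register value into error names, can be sketched in plain C as follows. The name table and the bit count are copied from the diff below; decode_fw_errors() is a hypothetical helper written for this example and performs no MMIO access.

```c
#include <stddef.h>
#include <stdint.h>

#define HEC_UNCORR_FW_ERR_BITS 4

/* One name per bit in the low bits of the HEC firmware error register. */
static const char * const hec_uncorrected_fw_errors[HEC_UNCORR_FW_ERR_BITS] = {
	"Fatal",
	"CSE Disabled",
	"FD Corruption",
	"Data Corruption"
};

/*
 * Store the name of every error bit set in fw_err into names[].
 * Only the low HEC_UNCORR_FW_ERR_BITS bits are considered.
 * Returns the number of names written.
 */
size_t decode_fw_errors(uint32_t fw_err,
			const char *names[HEC_UNCORR_FW_ERR_BITS])
{
	size_t n = 0;
	unsigned int bit;

	for (bit = 0; bit < HEC_UNCORR_FW_ERR_BITS; bit++)
		if (fw_err & (UINT32_C(1) << bit))
			names[n++] = hec_uncorrected_fw_errors[bit];

	return n;
}
```

Bounding the loop by HEC_UNCORR_FW_ERR_BITS keeps the table lookup in range even if higher bits of the register are set, which is the same guard the for_each_set_bit() call in the patch provides.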
Signed-off-by: Riana Tauro <riana.tauro@intel.com>
---
drivers/gpu/drm/xe/regs/xe_gsc_regs.h | 2 +
drivers/gpu/drm/xe/regs/xe_hw_error_regs.h | 7 ++-
drivers/gpu/drm/xe/xe_device_types.h | 3 +
drivers/gpu/drm/xe/xe_hw_error.c | 65 +++++++++++++++++++++-
4 files changed, 75 insertions(+), 2 deletions(-)
diff --git a/drivers/gpu/drm/xe/regs/xe_gsc_regs.h b/drivers/gpu/drm/xe/regs/xe_gsc_regs.h
index 7702364b65f1..fcb6003f3226 100644
--- a/drivers/gpu/drm/xe/regs/xe_gsc_regs.h
+++ b/drivers/gpu/drm/xe/regs/xe_gsc_regs.h
@@ -13,6 +13,8 @@
/* Definitions of GSC H/W registers, bits, etc */
+#define BMG_GSC_HECI1_BASE 0x373000
+
#define MTL_GSC_HECI1_BASE 0x00116000
#define MTL_GSC_HECI2_BASE 0x00117000
diff --git a/drivers/gpu/drm/xe/regs/xe_hw_error_regs.h b/drivers/gpu/drm/xe/regs/xe_hw_error_regs.h
index ed9b81fb28a0..c146b9ef44eb 100644
--- a/drivers/gpu/drm/xe/regs/xe_hw_error_regs.h
+++ b/drivers/gpu/drm/xe/regs/xe_hw_error_regs.h
@@ -6,10 +6,15 @@
#ifndef _XE_HW_ERROR_REGS_H_
#define _XE_HW_ERROR_REGS_H_
+#define HEC_UNCORR_ERR_STATUS(base) XE_REG((base) + 0x118)
+#define UNCORR_FW_REPORTED_ERR BIT(6)
+
+#define HEC_UNCORR_FW_ERR_DW0(base) XE_REG((base) + 0x124)
+
#define DEV_ERR_STAT_NONFATAL 0x100178
#define DEV_ERR_STAT_CORRECTABLE 0x10017c
#define DEV_ERR_STAT_REG(x) XE_REG(_PICK_EVEN((x), \
DEV_ERR_STAT_CORRECTABLE, \
DEV_ERR_STAT_NONFATAL))
-
+#define XE_CSC_ERROR BIT(17)
#endif
diff --git a/drivers/gpu/drm/xe/xe_device_types.h b/drivers/gpu/drm/xe/xe_device_types.h
index fb3617956d63..1325ae917c99 100644
--- a/drivers/gpu/drm/xe/xe_device_types.h
+++ b/drivers/gpu/drm/xe/xe_device_types.h
@@ -239,6 +239,9 @@ struct xe_tile {
/** @memirq: Memory Based Interrupts. */
struct xe_memirq memirq;
+ /** @csc_hw_error_work: worker to report CSC HW errors */
+ struct work_struct csc_hw_error_work;
+
/** @pcode: tile's PCODE */
struct {
/** @pcode.lock: protecting tile's PCODE mailbox data */
diff --git a/drivers/gpu/drm/xe/xe_hw_error.c b/drivers/gpu/drm/xe/xe_hw_error.c
index 0f2590839900..ad1e244ea612 100644
--- a/drivers/gpu/drm/xe/xe_hw_error.c
+++ b/drivers/gpu/drm/xe/xe_hw_error.c
@@ -3,6 +3,7 @@
* Copyright © 2025 Intel Corporation
*/
+#include "regs/xe_gsc_regs.h"
#include "regs/xe_hw_error_regs.h"
#include "regs/xe_irq_regs.h"
@@ -10,6 +11,8 @@
#include "xe_hw_error.h"
#include "xe_mmio.h"
+#define HEC_UNCORR_FW_ERR_BITS 4
+
/* Error categories reported by hardware */
enum hardware_error {
HARDWARE_ERROR_CORRECTABLE = 0,
@@ -18,6 +21,13 @@ enum hardware_error {
HARDWARE_ERROR_MAX,
};
+static const char * const hec_uncorrected_fw_errors[] = {
+ "Fatal",
+ "CSE Disabled",
+ "FD Corruption",
+ "Data Corruption"
+};
+
static const char *hw_error_to_str(const enum hardware_error hw_err)
{
switch (hw_err) {
@@ -32,6 +42,54 @@ static const char *hw_error_to_str(const enum hardware_error hw_err)
}
}
+static void csc_hw_error_work(struct work_struct *work)
+{
+ struct xe_tile *tile = container_of(work, typeof(*tile), csc_hw_error_work);
+ struct xe_device *xe = tile_to_xe(tile);
+
+ xe_device_set_wedged_method(xe, DRM_WEDGE_RECOVERY_FW_FLASH);
+ xe_device_declare_wedged(xe);
+}
+
+static void csc_hw_error_handler(struct xe_tile *tile, const enum hardware_error hw_err)
+{
+ const char *hw_err_str = hw_error_to_str(hw_err);
+ struct xe_device *xe = tile_to_xe(tile);
+ struct xe_mmio *mmio = &tile->mmio;
+ u32 base, err_bit, err_src;
+ unsigned long fw_err;
+
+ if (xe->info.platform != XE_BATTLEMAGE)
+ return;
+
+ /* Not supported in BMG */
+ if (hw_err == HARDWARE_ERROR_CORRECTABLE)
+ return;
+
+ base = BMG_GSC_HECI1_BASE;
+ lockdep_assert_held(&xe->irq.lock);
+ err_src = xe_mmio_read32(mmio, HEC_UNCORR_ERR_STATUS(base));
+ if (!err_src) {
+ drm_err_ratelimited(&xe->drm, HW_ERR "Tile%d reported HEC_ERR_STATUS_%s blank\n",
+ tile->id, hw_err_str);
+ return;
+ }
+
+ if (err_src & UNCORR_FW_REPORTED_ERR) {
+ fw_err = xe_mmio_read32(mmio, HEC_UNCORR_FW_ERR_DW0(base));
+ for_each_set_bit(err_bit, &fw_err, HEC_UNCORR_FW_ERR_BITS) {
+ drm_err_ratelimited(&xe->drm, HW_ERR
+ "%s: HEC Uncorrected FW %s error reported, bit[%d] is set\n",
+ hw_err_str, hec_uncorrected_fw_errors[err_bit],
+ err_bit);
+
+ schedule_work(&tile->csc_hw_error_work);
+ }
+ }
+
+ xe_mmio_write32(mmio, HEC_UNCORR_ERR_STATUS(base), err_src);
+}
+
static void hw_error_source_handler(struct xe_tile *tile, const enum hardware_error hw_err)
{
const char *hw_err_str = hw_error_to_str(hw_err);
@@ -50,7 +108,8 @@ static void hw_error_source_handler(struct xe_tile *tile, const enum hardware_er
goto unlock;
}
- /* TODO: Process errrors per source */
+ if (err_src & XE_CSC_ERROR)
+ csc_hw_error_handler(tile, hw_err);
xe_mmio_write32(&tile->mmio, DEV_ERR_STAT_REG(hw_err), err_src);
@@ -101,8 +160,12 @@ static void process_hw_errors(struct xe_device *xe)
*/
void xe_hw_error_init(struct xe_device *xe)
{
+ struct xe_tile *tile = xe_device_get_root_tile(xe);
+
if (!IS_DGFX(xe) || IS_SRIOV_VF(xe))
return;
+ INIT_WORK(&tile->csc_hw_error_work, csc_hw_error_work);
+
process_hw_errors(xe);
}
--
2.47.1
* Re: [PATCH 4/4] drm/xe/xe_hw_error: Handle CSC Firmware reported Hardware errors
2025-06-03 8:14 ` [PATCH 4/4] drm/xe/xe_hw_error: Handle CSC Firmware reported Hardware errors Riana Tauro
@ 2025-06-03 9:31 ` Ghimiray, Himal Prasad
2025-06-03 9:53 ` Riana Tauro
0 siblings, 1 reply; 12+ messages in thread
From: Ghimiray, Himal Prasad @ 2025-06-03 9:31 UTC (permalink / raw)
To: Riana Tauro, intel-xe, dri-devel
Cc: anshuman.gupta, rodrigo.vivi, lucas.demarchi, aravind.iddamsetty,
raag.jadav, frank.scarbrough
On 03-06-2025 13:44, Riana Tauro wrote:
> Add support to handle CSC firmware reported errors. When CSC firmware
> errors are encountered, an error interrupt is received by the GFX device as
> an MSI interrupt.
>
> The Device Source control registers indicate the source of the error as CSC.
> The HEC error status register indicates that the error is firmware reported.
> Depending on the type of error, the error cause is written to the HEC
> Firmware error register.
>
> On encountering such CSC firmware errors, the graphics device is
> non-recoverable from driver context. The only way to recover from these
> errors is a firmware flash. The device is then wedged and userspace is
> notified with a drm uevent.
>
> Signed-off-by: Riana Tauro <riana.tauro@intel.com>
> ---
> drivers/gpu/drm/xe/regs/xe_gsc_regs.h | 2 +
> drivers/gpu/drm/xe/regs/xe_hw_error_regs.h | 7 ++-
> drivers/gpu/drm/xe/xe_device_types.h | 3 +
> drivers/gpu/drm/xe/xe_hw_error.c | 65 +++++++++++++++++++++-
> 4 files changed, 75 insertions(+), 2 deletions(-)
>
> diff --git a/drivers/gpu/drm/xe/regs/xe_gsc_regs.h b/drivers/gpu/drm/xe/regs/xe_gsc_regs.h
> index 7702364b65f1..fcb6003f3226 100644
> --- a/drivers/gpu/drm/xe/regs/xe_gsc_regs.h
> +++ b/drivers/gpu/drm/xe/regs/xe_gsc_regs.h
> @@ -13,6 +13,8 @@
>
> /* Definitions of GSC H/W registers, bits, etc */
>
> +#define BMG_GSC_HECI1_BASE 0x373000
> +
> #define MTL_GSC_HECI1_BASE 0x00116000
> #define MTL_GSC_HECI2_BASE 0x00117000
>
> diff --git a/drivers/gpu/drm/xe/regs/xe_hw_error_regs.h b/drivers/gpu/drm/xe/regs/xe_hw_error_regs.h
> index ed9b81fb28a0..c146b9ef44eb 100644
> --- a/drivers/gpu/drm/xe/regs/xe_hw_error_regs.h
> +++ b/drivers/gpu/drm/xe/regs/xe_hw_error_regs.h
> @@ -6,10 +6,15 @@
> #ifndef _XE_HW_ERROR_REGS_H_
> #define _XE_HW_ERROR_REGS_H_
>
> +#define HEC_UNCORR_ERR_STATUS(base) XE_REG((base) + 0x118)
> +#define UNCORR_FW_REPORTED_ERR BIT(6)
> +
> +#define HEC_UNCORR_FW_ERR_DW0(base) XE_REG((base) + 0x124)
> +
> #define DEV_ERR_STAT_NONFATAL 0x100178
> #define DEV_ERR_STAT_CORRECTABLE 0x10017c
> #define DEV_ERR_STAT_REG(x) XE_REG(_PICK_EVEN((x), \
> DEV_ERR_STAT_CORRECTABLE, \
> DEV_ERR_STAT_NONFATAL))
> -
> +#define XE_CSC_ERROR BIT(17)
> #endif
> diff --git a/drivers/gpu/drm/xe/xe_device_types.h b/drivers/gpu/drm/xe/xe_device_types.h
> index fb3617956d63..1325ae917c99 100644
> --- a/drivers/gpu/drm/xe/xe_device_types.h
> +++ b/drivers/gpu/drm/xe/xe_device_types.h
> @@ -239,6 +239,9 @@ struct xe_tile {
> /** @memirq: Memory Based Interrupts. */
> struct xe_memirq memirq;
>
> + /** @csc_hw_error_work: worker to report CSC HW errors */
> + struct work_struct csc_hw_error_work;
> +
> /** @pcode: tile's PCODE */
> struct {
> /** @pcode.lock: protecting tile's PCODE mailbox data */
> diff --git a/drivers/gpu/drm/xe/xe_hw_error.c b/drivers/gpu/drm/xe/xe_hw_error.c
> index 0f2590839900..ad1e244ea612 100644
> --- a/drivers/gpu/drm/xe/xe_hw_error.c
> +++ b/drivers/gpu/drm/xe/xe_hw_error.c
> @@ -3,6 +3,7 @@
> * Copyright © 2025 Intel Corporation
> */
>
> +#include "regs/xe_gsc_regs.h"
> #include "regs/xe_hw_error_regs.h"
> #include "regs/xe_irq_regs.h"
>
> @@ -10,6 +11,8 @@
> #include "xe_hw_error.h"
> #include "xe_mmio.h"
>
> +#define HEC_UNCORR_FW_ERR_BITS 4
> +
> /* Error categories reported by hardware */
> enum hardware_error {
> HARDWARE_ERROR_CORRECTABLE = 0,
> @@ -18,6 +21,13 @@ enum hardware_error {
> HARDWARE_ERROR_MAX,
> };
>
> +static const char * const hec_uncorrected_fw_errors[] = {
> + "Fatal",
> + "CSE Disabled",
> + "FD Corruption",
> + "Data Corruption"
> +};
> +
> static const char *hw_error_to_str(const enum hardware_error hw_err)
> {
> switch (hw_err) {
> @@ -32,6 +42,54 @@ static const char *hw_error_to_str(const enum hardware_error hw_err)
> }
> }
>
> +static void csc_hw_error_work(struct work_struct *work)
> +{
> + struct xe_tile *tile = container_of(work, typeof(*tile), csc_hw_error_work);
> + struct xe_device *xe = tile_to_xe(tile);
> +
> + xe_device_set_wedged_method(xe, DRM_WEDGE_RECOVERY_FW_FLASH);
> + xe_device_declare_wedged(xe);
> +}
Is there a specific need for a worker to set the wedged state? Would
making it a synchronous call have any significant impact?
> +
> +static void csc_hw_error_handler(struct xe_tile *tile, const enum hardware_error hw_err)
> +{
> + const char *hw_err_str = hw_error_to_str(hw_err);
> + struct xe_device *xe = tile_to_xe(tile);
> + struct xe_mmio *mmio = &tile->mmio;
> + u32 base, err_bit, err_src;
> + unsigned long fw_err;
> +
> + if (xe->info.platform != XE_BATTLEMAGE)
> + return;
Why the platform-specific check here? I remember seeing a similar error on
PVC (reported by the root tile).
> +
> + /* Not supported in BMG */
> + if (hw_err == HARDWARE_ERROR_CORRECTABLE)
> + return;
> +
> + base = BMG_GSC_HECI1_BASE;
> + lockdep_assert_held(&xe->irq.lock);
> + err_src = xe_mmio_read32(mmio, HEC_UNCORR_ERR_STATUS(base));
> + if (!err_src) {
> + drm_err_ratelimited(&xe->drm, HW_ERR "Tile%d reported HEC_ERR_STATUS_%s blank\n",
> + tile->id, hw_err_str);
> + return;
> + }
> +
> + if (err_src & UNCORR_FW_REPORTED_ERR) {
> + fw_err = xe_mmio_read32(mmio, HEC_UNCORR_FW_ERR_DW0(base));
> + for_each_set_bit(err_bit, &fw_err, HEC_UNCORR_FW_ERR_BITS) {
> + drm_err_ratelimited(&xe->drm, HW_ERR
> + "%s: HEC Uncorrected FW %s error reported, bit[%d] is set\n",
> + hw_err_str, hec_uncorrected_fw_errors[err_bit],
> + err_bit);
> +
> + schedule_work(&tile->csc_hw_error_work);
> + }
> + }
> +
> + xe_mmio_write32(mmio, HEC_UNCORR_ERR_STATUS(base), err_src);
> +}
> +
> static void hw_error_source_handler(struct xe_tile *tile, const enum hardware_error hw_err)
> {
> const char *hw_err_str = hw_error_to_str(hw_err);
> @@ -50,7 +108,8 @@ static void hw_error_source_handler(struct xe_tile *tile, const enum hardware_er
> goto unlock;
> }
>
> - /* TODO: Process errrors per source */
> + if (err_src & XE_CSC_ERROR)
> + csc_hw_error_handler(tile, hw_err);
>
> xe_mmio_write32(&tile->mmio, DEV_ERR_STAT_REG(hw_err), err_src);
>
> @@ -101,8 +160,12 @@ static void process_hw_errors(struct xe_device *xe)
> */
> void xe_hw_error_init(struct xe_device *xe)
> {
> + struct xe_tile *tile = xe_device_get_root_tile(xe);
> +
> if (!IS_DGFX(xe) || IS_SRIOV_VF(xe))
> return;
>
> + INIT_WORK(&tile->csc_hw_error_work, csc_hw_error_work);
> +
> process_hw_errors(xe);
> }
* Re: [PATCH 4/4] drm/xe/xe_hw_error: Handle CSC Firmware reported Hardware errors
2025-06-03 9:31 ` Ghimiray, Himal Prasad
@ 2025-06-03 9:53 ` Riana Tauro
0 siblings, 0 replies; 12+ messages in thread
From: Riana Tauro @ 2025-06-03 9:53 UTC (permalink / raw)
To: Ghimiray, Himal Prasad, intel-xe, dri-devel
Cc: anshuman.gupta, rodrigo.vivi, lucas.demarchi, aravind.iddamsetty,
raag.jadav, frank.scarbrough
Hi Himal
On 6/3/2025 3:01 PM, Ghimiray, Himal Prasad wrote:
>
>
> On 03-06-2025 13:44, Riana Tauro wrote:
>> Add support to handle CSC firmware reported errors. When CSC firmware
>> errors are encoutered, a error interrupt is received by the GFX device as
>> a MSI interrupt.
>>
>> Device Source control registers indicates the source of the error as CSC
>> The HEC error status register indicates that the error is firmware
>> reported
>> Depending on the type of error, the error cause is written to the HEC
>> Firmware error register.
>>
>> On encountering such CSC firmware errors, the graphics device is
>> non-recoverable from driver context. The only way to recover from these
>> errors is firmware flash. The device is then wedged and userspace is
>> notified with a drm uevent
>>
>> Signed-off-by: Riana Tauro <riana.tauro@intel.com>
>> ---
>> drivers/gpu/drm/xe/regs/xe_gsc_regs.h | 2 +
>> drivers/gpu/drm/xe/regs/xe_hw_error_regs.h | 7 ++-
>> drivers/gpu/drm/xe/xe_device_types.h | 3 +
>> drivers/gpu/drm/xe/xe_hw_error.c | 65 +++++++++++++++++++++-
>> 4 files changed, 75 insertions(+), 2 deletions(-)
>>
>> diff --git a/drivers/gpu/drm/xe/regs/xe_gsc_regs.h b/drivers/gpu/drm/xe/regs/xe_gsc_regs.h
>> index 7702364b65f1..fcb6003f3226 100644
>> --- a/drivers/gpu/drm/xe/regs/xe_gsc_regs.h
>> +++ b/drivers/gpu/drm/xe/regs/xe_gsc_regs.h
>> @@ -13,6 +13,8 @@
>> /* Definitions of GSC H/W registers, bits, etc */
>> +#define BMG_GSC_HECI1_BASE 0x373000
>> +
>> #define MTL_GSC_HECI1_BASE 0x00116000
>> #define MTL_GSC_HECI2_BASE 0x00117000
>> diff --git a/drivers/gpu/drm/xe/regs/xe_hw_error_regs.h b/drivers/gpu/drm/xe/regs/xe_hw_error_regs.h
>> index ed9b81fb28a0..c146b9ef44eb 100644
>> --- a/drivers/gpu/drm/xe/regs/xe_hw_error_regs.h
>> +++ b/drivers/gpu/drm/xe/regs/xe_hw_error_regs.h
>> @@ -6,10 +6,15 @@
>> #ifndef _XE_HW_ERROR_REGS_H_
>> #define _XE_HW_ERROR_REGS_H_
>> +#define HEC_UNCORR_ERR_STATUS(base) XE_REG((base) + 0x118)
>> +#define UNCORR_FW_REPORTED_ERR BIT(6)
>> +
>> +#define HEC_UNCORR_FW_ERR_DW0(base) XE_REG((base) + 0x124)
>> +
>> #define DEV_ERR_STAT_NONFATAL 0x100178
>> #define DEV_ERR_STAT_CORRECTABLE 0x10017c
>> #define DEV_ERR_STAT_REG(x) XE_REG(_PICK_EVEN((x), \
>> DEV_ERR_STAT_CORRECTABLE, \
>> DEV_ERR_STAT_NONFATAL))
>> -
>> +#define XE_CSC_ERROR BIT(17)
>> #endif
>> diff --git a/drivers/gpu/drm/xe/xe_device_types.h b/drivers/gpu/drm/xe/xe_device_types.h
>> index fb3617956d63..1325ae917c99 100644
>> --- a/drivers/gpu/drm/xe/xe_device_types.h
>> +++ b/drivers/gpu/drm/xe/xe_device_types.h
>> @@ -239,6 +239,9 @@ struct xe_tile {
>> /** @memirq: Memory Based Interrupts. */
>> struct xe_memirq memirq;
>> + /** @csc_hw_error_work: worker to report CSC HW errors */
>> + struct work_struct csc_hw_error_work;
>> +
>> /** @pcode: tile's PCODE */
>> struct {
>> /** @pcode.lock: protecting tile's PCODE mailbox data */
>> diff --git a/drivers/gpu/drm/xe/xe_hw_error.c b/drivers/gpu/drm/xe/xe_hw_error.c
>> index 0f2590839900..ad1e244ea612 100644
>> --- a/drivers/gpu/drm/xe/xe_hw_error.c
>> +++ b/drivers/gpu/drm/xe/xe_hw_error.c
>> @@ -3,6 +3,7 @@
>> * Copyright © 2025 Intel Corporation
>> */
>> +#include "regs/xe_gsc_regs.h"
>> #include "regs/xe_hw_error_regs.h"
>> #include "regs/xe_irq_regs.h"
>> @@ -10,6 +11,8 @@
>> #include "xe_hw_error.h"
>> #include "xe_mmio.h"
>> +#define HEC_UNCORR_FW_ERR_BITS 4
>> +
>> /* Error categories reported by hardware */
>> enum hardware_error {
>> HARDWARE_ERROR_CORRECTABLE = 0,
>> @@ -18,6 +21,13 @@ enum hardware_error {
>> HARDWARE_ERROR_MAX,
>> };
>> +static const char * const hec_uncorrected_fw_errors[] = {
>> + "Fatal",
>> + "CSE Disabled",
>> + "FD Corruption",
>> + "Data Corruption"
>> +};
>> +
>> static const char *hw_error_to_str(const enum hardware_error hw_err)
>> {
>> switch (hw_err) {
>> @@ -32,6 +42,54 @@ static const char *hw_error_to_str(const enum hardware_error hw_err)
>> }
>> }
>> +static void csc_hw_error_work(struct work_struct *work)
>> +{
>> + struct xe_tile *tile = container_of(work, typeof(*tile), csc_hw_error_work);
>> + struct xe_device *xe = tile_to_xe(tile);
>> +
>> + xe_device_set_wedged_method(xe, DRM_WEDGE_RECOVERY_FW_FLASH);
>> + xe_device_declare_wedged(xe);
>> +}
>
> Any specific need for worker to set wedging any significant impact on
> making it synchronous call?
I tried making it synchronous, but the wedging path calls a function that
can sleep, which caused an error when called from the error handler. That's
why I moved it to a workqueue.
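The reason given above (a sleeping call cannot run where the handler runs, so it is deferred) can be illustrated with a small userspace simulation. This is only a sketch under stated assumptions: `fake_device`, `irq_handler()`, `run_pending_work()` and `declare_wedged_may_sleep()` are hypothetical stand-ins, not names from the patch, and a boolean flag stands in for `schedule_work()`.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical stand-in for device state; not taken from the patch. */
struct fake_device {
	bool in_atomic;    /* true while the simulated handler is running */
	bool work_pending; /* set by the handler, consumed by the worker */
	bool wedged;       /* result of the deferred, sleepable step */
};

/* Simulates a call that may sleep and so must not run in atomic context. */
static void declare_wedged_may_sleep(struct fake_device *dev)
{
	assert(!dev->in_atomic);
	dev->wedged = true;
}

/* Simulated error handler: it only records that work is needed. */
static void irq_handler(struct fake_device *dev)
{
	dev->in_atomic = true;
	dev->work_pending = true; /* plays the role of schedule_work() */
	dev->in_atomic = false;
}

/* Simulated worker: runs later, outside the handler, where sleeping is fine. */
static void run_pending_work(struct fake_device *dev)
{
	if (!dev->work_pending)
		return;
	dev->work_pending = false;
	declare_wedged_may_sleep(dev);
}
```

Calling `irq_handler()` and then `run_pending_work()` on a zero-initialised `struct fake_device` leaves `wedged` set, while calling the sleepable step directly from inside the handler would trip the assertion.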
>
>
>> +
>> +static void csc_hw_error_handler(struct xe_tile *tile, const enum hardware_error hw_err)
>> +{
>> + const char *hw_err_str = hw_error_to_str(hw_err);
>> + struct xe_device *xe = tile_to_xe(tile);
>> + struct xe_mmio *mmio = &tile->mmio;
>> + u32 base, err_bit, err_src;
>> + unsigned long fw_err;
>> +
>> + if (xe->info.platform != XE_BATTLEMAGE)
>> + return;
>
> why platform specific check here ? I remember having similar error on
> PVC (reported by root tile).
No, PVC had the GSC error bit set, and this handling applies only to CSC
errors. On encountering such errors, the device is wedged and a uevent
needs to be sent for a firmware update, which was also not applicable to
PVC. Hence the check.
Thanks
Riana
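As a rough illustration of the decode step in the quoted csc_hw_error_handler(), the following userspace sketch maps set bits in the firmware error word to the names from the patch's string table. The helper name `decode_fw_errors()` and the macro `FW_ERR_BITS` are assumptions made for this example; the kernel code uses `for_each_set_bit()` instead of the explicit loop shown here.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

#define FW_ERR_BITS 4 /* hypothetical name; mirrors the 4-entry table */

/* Names copied from the string table in the quoted patch. */
static const char * const fw_error_names[FW_ERR_BITS] = {
	"Fatal",
	"CSE Disabled",
	"FD Corruption",
	"Data Corruption",
};

/*
 * Store the names of the bits set in the low FW_ERR_BITS bits of fw_err
 * into out[] and return how many were stored. Higher bits are ignored.
 */
static size_t decode_fw_errors(unsigned long fw_err,
			       const char *out[FW_ERR_BITS])
{
	size_t n = 0;
	unsigned int bit;

	assert(out != NULL);
	for (bit = 0; bit < FW_ERR_BITS; bit++) {
		if (fw_err & (1UL << bit))
			out[n++] = fw_error_names[bit];
	}
	return n;
}
```

Bounding the loop by the table size keeps a set bit outside the named range from indexing past the end of the array.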
>> +
>> + /* Not supported in BMG */
>> + if (hw_err == HARDWARE_ERROR_CORRECTABLE)
>> + return;
>> +
>> + base = BMG_GSC_HECI1_BASE;
>> + lockdep_assert_held(&xe->irq.lock);
>> + err_src = xe_mmio_read32(mmio, HEC_UNCORR_ERR_STATUS(base));
>> + if (!err_src) {
>> + drm_err_ratelimited(&xe->drm, HW_ERR "Tile%d reported HEC_ERR_STATUS_%s blank\n",
>> + tile->id, hw_err_str);
>> + return;
>> + }
>> +
>> + if (err_src & UNCORR_FW_REPORTED_ERR) {
>> + fw_err = xe_mmio_read32(mmio, HEC_UNCORR_FW_ERR_DW0(base));
>> + for_each_set_bit(err_bit, &fw_err, HEC_UNCORR_FW_ERR_BITS) {
>> + drm_err_ratelimited(&xe->drm, HW_ERR
>> + "%s: HEC Uncorrected FW %s error reported, bit[%d] is set\n",
>> + hw_err_str, hec_uncorrected_fw_errors[err_bit],
>> + err_bit);
>> +
>> + schedule_work(&tile->csc_hw_error_work);
>> + }
>> + }
>> +
>> + xe_mmio_write32(mmio, HEC_UNCORR_ERR_STATUS(base), err_src);
>> +}
>> +
>> static void hw_error_source_handler(struct xe_tile *tile, const enum hardware_error hw_err)
>> {
>> const char *hw_err_str = hw_error_to_str(hw_err);
>> @@ -50,7 +108,8 @@ static void hw_error_source_handler(struct xe_tile *tile, const enum hardware_er
>> goto unlock;
>> }
>> - /* TODO: Process errrors per source */
>> + if (err_src & XE_CSC_ERROR)
>> + csc_hw_error_handler(tile, hw_err);
>> xe_mmio_write32(&tile->mmio, DEV_ERR_STAT_REG(hw_err), err_src);
>> @@ -101,8 +160,12 @@ static void process_hw_errors(struct xe_device *xe)
>> */
>> void xe_hw_error_init(struct xe_device *xe)
>> {
>> + struct xe_tile *tile = xe_device_get_root_tile(xe);
>> +
>> if (!IS_DGFX(xe) || IS_SRIOV_VF(xe))
>> return;
>> + INIT_WORK(&tile->csc_hw_error_work, csc_hw_error_work);
>> +
>> process_hw_errors(xe);
>> }
>
* Re: [PATCH 1/4] drm: Add a firmware flash method to device wedged uevent
2025-06-03 8:13 ` [PATCH 1/4] drm: Add a firmware flash method to device wedged uevent Riana Tauro
@ 2025-06-04 10:43 ` Raag Jadav
2025-06-05 11:24 ` Riana Tauro
0 siblings, 1 reply; 12+ messages in thread
From: Raag Jadav @ 2025-06-04 10:43 UTC (permalink / raw)
To: Riana Tauro
Cc: intel-xe, dri-devel, anshuman.gupta, rodrigo.vivi, lucas.demarchi,
aravind.iddamsetty, himal.prasad.ghimiray, frank.scarbrough
On Tue, Jun 03, 2025 at 01:43:57PM +0530, Riana Tauro wrote:
> A device is declared wedged when it is non-recoverable from
> the driver context. Some firmware errors can also cause
> the device to enter this state and the only method to recover
> from this would be to do a firmware flash
>
> Signed-off-by: Riana Tauro <riana.tauro@intel.com>
> ---
> Documentation/gpu/drm-uapi.rst | 6 +++---
> drivers/gpu/drm/drm_drv.c | 2 ++
> include/drm/drm_device.h | 1 +
> 3 files changed, 6 insertions(+), 3 deletions(-)
>
> diff --git a/Documentation/gpu/drm-uapi.rst b/Documentation/gpu/drm-uapi.rst
> index 4863a4deb0ee..524224afb09f 100644
> --- a/Documentation/gpu/drm-uapi.rst
> +++ b/Documentation/gpu/drm-uapi.rst
> @@ -422,9 +422,8 @@ Current implementation defines three recovery methods, out of which, drivers
> can use any one, multiple or none. Method(s) of choice will be sent in the
> uevent environment as ``WEDGED=<method1>[,..,<methodN>]`` in order of less to
> more side-effects. If driver is unsure about recovery or method is unknown
> -(like soft/hard system reboot, firmware flashing, physical device replacement
> -or any other procedure which can't be attempted on the fly), ``WEDGED=unknown``
> -will be sent instead.
> +(like soft/hard system reboot, physical device replacement or any other procedure
> +which can't be attempted on the fly), ``WEDGED=unknown`` will be sent instead.
>
> Userspace consumers can parse this event and attempt recovery as per the
> following expectations.
> @@ -435,6 +434,7 @@ following expectations.
> none optional telemetry collection
> rebind unbind + bind driver
> bus-reset unbind + bus reset/re-enumeration + bind
> + firmware-flash unbind + firmware flash + bind
Can you guarantee this to be generic for all drivers?
Raag
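For context on what a userspace consumer would do with the value, here is a minimal sketch of parsing the ``WEDGED=<method1>[,..,<methodN>]`` string into a bitmask. The method names are the ones in the quoted documentation table plus the proposed firmware-flash; the `METHOD_*` flags and `parse_wedged()` are hypothetical and are not the kernel's DRM_WEDGE_RECOVERY_* definitions. Treating an unrecognised token as unknown is an assumption of this sketch.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical userspace flags; not the kernel's DRM_WEDGE_RECOVERY_* values. */
enum {
	METHOD_NONE           = 1 << 0,
	METHOD_REBIND         = 1 << 1,
	METHOD_BUS_RESET      = 1 << 2,
	METHOD_FIRMWARE_FLASH = 1 << 3,
	METHOD_UNKNOWN        = 1 << 4,
};

/* Parse the value part of WEDGED= (e.g. "rebind,bus-reset") into a bitmask. */
static unsigned int parse_wedged(const char *value)
{
	static const struct {
		const char *name;
		unsigned int flag;
	} table[] = {
		{ "none", METHOD_NONE },
		{ "rebind", METHOD_REBIND },
		{ "bus-reset", METHOD_BUS_RESET },
		{ "firmware-flash", METHOD_FIRMWARE_FLASH },
		{ "unknown", METHOD_UNKNOWN },
	};
	unsigned int mask = 0;

	assert(value != NULL);
	while (*value) {
		size_t len = strcspn(value, ","); /* length of this token */
		unsigned int flag = METHOD_UNKNOWN; /* default for unrecognised tokens */
		size_t i;

		for (i = 0; i < sizeof(table) / sizeof(table[0]); i++) {
			if (strlen(table[i].name) == len &&
			    strncmp(value, table[i].name, len) == 0) {
				flag = table[i].flag;
				break;
			}
		}
		mask |= flag;
		value += len;
		if (*value == ',')
			value++; /* skip the separator */
	}
	return mask;
}
```

A consumer could then check the mask for the least invasive method it supports before attempting recovery.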
* Re: [PATCH 1/4] drm: Add a firmware flash method to device wedged uevent
2025-06-04 10:43 ` Raag Jadav
@ 2025-06-05 11:24 ` Riana Tauro
2025-06-16 20:39 ` Rodrigo Vivi
0 siblings, 1 reply; 12+ messages in thread
From: Riana Tauro @ 2025-06-05 11:24 UTC (permalink / raw)
To: Raag Jadav
Cc: intel-xe, dri-devel, anshuman.gupta, rodrigo.vivi, lucas.demarchi,
aravind.iddamsetty, himal.prasad.ghimiray, frank.scarbrough,
andrealmeid, christian.koenig
Hi Raag
On 6/4/2025 4:13 PM, Raag Jadav wrote:
> On Tue, Jun 03, 2025 at 01:43:57PM +0530, Riana Tauro wrote:
>> A device is declared wedged when it is non-recoverable from
>> the driver context. Some firmware errors can also cause
>> the device to enter this state and the only method to recover
>> from this would be to do a firmware flash
>>
>> Signed-off-by: Riana Tauro <riana.tauro@intel.com>
>> ---
>> Documentation/gpu/drm-uapi.rst | 6 +++---
>> drivers/gpu/drm/drm_drv.c | 2 ++
>> include/drm/drm_device.h | 1 +
>> 3 files changed, 6 insertions(+), 3 deletions(-)
>>
>> diff --git a/Documentation/gpu/drm-uapi.rst b/Documentation/gpu/drm-uapi.rst
>> index 4863a4deb0ee..524224afb09f 100644
>> --- a/Documentation/gpu/drm-uapi.rst
>> +++ b/Documentation/gpu/drm-uapi.rst
>> @@ -422,9 +422,8 @@ Current implementation defines three recovery methods, out of which, drivers
>> can use any one, multiple or none. Method(s) of choice will be sent in the
>> uevent environment as ``WEDGED=<method1>[,..,<methodN>]`` in order of less to
>> more side-effects. If driver is unsure about recovery or method is unknown
>> -(like soft/hard system reboot, firmware flashing, physical device replacement
>> -or any other procedure which can't be attempted on the fly), ``WEDGED=unknown``
>> -will be sent instead.
>> +(like soft/hard system reboot, physical device replacement or any other procedure
>> +which can't be attempted on the fly), ``WEDGED=unknown`` will be sent instead.
>>
>> Userspace consumers can parse this event and attempt recovery as per the
>> following expectations.
>> @@ -435,6 +434,7 @@ following expectations.
>> none optional telemetry collection
>> rebind unbind + bind driver
>> bus-reset unbind + bus reset/re-enumeration + bind
>> + firmware-flash unbind + firmware flash + bind
>
> Can you guarantee this to be generic for all drivers?
Firmware flash was listed in the document as one of the cases reported as
WEDGED=unknown. So if an error requires a firmware flash to recover,
naming it as the recovery method should be okay.
I wanted to get some comments on unbind/bind. If it is not required, I will
remove it.
Adding reviewers for input.
Thanks
Riana
>
> Raag
* Re: [PATCH 2/4] drm/xe: Add a helper function to set recovery method
2025-06-03 8:13 ` [PATCH 2/4] drm/xe: Add a helper function to set recovery method Riana Tauro
@ 2025-06-06 15:12 ` Raag Jadav
2025-06-19 7:26 ` Riana Tauro
0 siblings, 1 reply; 12+ messages in thread
From: Raag Jadav @ 2025-06-06 15:12 UTC (permalink / raw)
To: Riana Tauro
Cc: intel-xe, dri-devel, anshuman.gupta, rodrigo.vivi, lucas.demarchi,
aravind.iddamsetty, himal.prasad.ghimiray, frank.scarbrough
On Tue, Jun 03, 2025 at 01:43:58PM +0530, Riana Tauro wrote:
> Add a helper function to set recovery method. The recovery
> method has to be set before declaring the device wedged and sending the
> drm wedged uevent. If no method is set, default unbind/re-bind method
> will be set
>
> Signed-off-by: Riana Tauro <riana.tauro@intel.com>
> ---
> drivers/gpu/drm/xe/xe_device.c | 30 +++++++++++++++++++++-------
> drivers/gpu/drm/xe/xe_device.h | 1 +
> drivers/gpu/drm/xe/xe_device_types.h | 2 ++
> 3 files changed, 26 insertions(+), 7 deletions(-)
>
> diff --git a/drivers/gpu/drm/xe/xe_device.c b/drivers/gpu/drm/xe/xe_device.c
> index 660b0c5126dc..3fd604ebdc6e 100644
> --- a/drivers/gpu/drm/xe/xe_device.c
> +++ b/drivers/gpu/drm/xe/xe_device.c
> @@ -1120,16 +1120,28 @@ static void xe_device_wedged_fini(struct drm_device *drm, void *arg)
> xe_pm_runtime_put(xe);
> }
>
> +/**
> + * xe_device_set_wedged_method - Set wedged recovery method
> + * @xe: xe device instance
Missing @method
> + *
> + * Set wedged recovery method to be sent using drm wedged uevent.
> + */
> +void xe_device_set_wedged_method(struct xe_device *xe, unsigned long method)
> +{
> + xe->wedged.method = method;
> +}
> +
> /**
> * xe_device_declare_wedged - Declare device wedged
> * @xe: xe device instance
> *
> - * This is a final state that can only be cleared with a module
> - * re-probe (unbind + bind).
> - * In this state every IOCTL will be blocked so the GT cannot be used.
> + * This is a final state that can only be cleared with the method specified
> + * in the drm wedged uevent. The method needs to be set using xe_device_set_wedged_method
> + * before declaring the device as wedged or the default method of reprobe (unbind/re-bind)
> + * will be sent. In this state every IOCTL will be blocked so the GT cannot be used.
The file convention seems like 80 characters for kernel doc, so let's
stick to it.
> * In general it will be called upon any critical error such as gt reset
> - * failure or guc loading failure. Userspace will be notified of this state
> - * through device wedged uevent.
> + * failure or guc loading failure or firmware failure.
> + * Userspace will be notified of this state through device wedged uevent.
> * If xe.wedged module parameter is set to 2, this function will be called
> * on every single execution timeout (a.k.a. GPU hang) right after devcoredump
> * snapshot capture. In this mode, GT reset won't be attempted so the state of
> @@ -1152,6 +1164,11 @@ void xe_device_declare_wedged(struct xe_device *xe)
> return;
> }
>
> + /* If no wedge recovery method is set, use default */
> + if (!xe->wedged.method)
> + xe_device_set_wedged_method(xe, DRM_WEDGE_RECOVERY_REBIND
> + | DRM_WEDGE_RECOVERY_BUS_RESET);
Although there are no strict rules about this, we usually don't begin a
new line with a symbol.
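A small sketch of the suggested layout, with the operator ending the first line instead of starting the continuation. The flag values and the `wedged_state` struct are hypothetical placeholders for illustration; the real flags are defined in the DRM headers.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical flag values for illustration only. */
#define RECOVERY_REBIND    (1UL << 1)
#define RECOVERY_BUS_RESET (1UL << 2)

struct wedged_state {
	unsigned long method; /* 0 means "no method set yet" */
};

/* Fall back to the default methods when the caller has not picked one. */
static void apply_default_method(struct wedged_state *w)
{
	assert(w != NULL);
	if (!w->method)
		w->method = RECOVERY_REBIND |
			    RECOVERY_BUS_RESET; /* operator ends the line */
}
```

A method that was already set is left untouched, which matches the "if no method is set, use default" behaviour described in the patch.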
> +
> if (!atomic_xchg(&xe->wedged.flag, 1)) {
> xe->needs_flr_on_fini = true;
> drm_err(&xe->drm,
> @@ -1161,8 +1178,7 @@ void xe_device_declare_wedged(struct xe_device *xe)
> dev_name(xe->drm.dev));
>
> /* Notify userspace of wedged device */
> - drm_dev_wedged_event(&xe->drm,
> - DRM_WEDGE_RECOVERY_REBIND | DRM_WEDGE_RECOVERY_BUS_RESET);
> + drm_dev_wedged_event(&xe->drm, xe->wedged.method);
I was a bit late to realize it when I originally added this. The event
call should be after xe_gt_declare_wedged() to comply with wedging rules.
We notify userspace *after* we're done with driver cleanup.
Raag
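The ordering rule described above (finish driver cleanup, then notify userspace) can be sketched as a tiny simulation that records the order of steps. Everything here is a hypothetical stand-in: `wedge_gts()` plays the role of the per-GT xe_gt_declare_wedged() calls and `notify_userspace()` the role of the uevent; neither is the driver's actual code.

```c
#include <assert.h>
#include <stdbool.h>
#include <string.h>

#define MAX_STEPS 8

/* Records the order in which the simulated steps ran. */
struct step_log {
	const char *steps[MAX_STEPS];
	int count;
};

static void record(struct step_log *log, const char *step)
{
	assert(log->count < MAX_STEPS);
	log->steps[log->count++] = step;
}

/* Stand-in for wedging every GT (driver-side cleanup). */
static void wedge_gts(struct step_log *log, int num_gts)
{
	int i;

	for (i = 0; i < num_gts; i++)
		record(log, "gt-wedged");
}

/* Stand-in for sending the wedged uevent. */
static void notify_userspace(struct step_log *log)
{
	record(log, "uevent");
}

/* Cleanup first, notification last. */
static void declare_wedged(struct step_log *log, int num_gts)
{
	wedge_gts(log, num_gts);
	notify_userspace(log);
}

/* True if the uevent was the final recorded step. */
static bool uevent_is_last(const struct step_log *log)
{
	return log->count > 0 &&
	       strcmp(log->steps[log->count - 1], "uevent") == 0;
}
```

With this order, a consumer that reacts to the event immediately never observes a device whose GTs are still being wedged.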
> }
>
> for_each_gt(gt, xe, id)
> diff --git a/drivers/gpu/drm/xe/xe_device.h b/drivers/gpu/drm/xe/xe_device.h
> index 0bc3bc8e6803..06350740aac5 100644
> --- a/drivers/gpu/drm/xe/xe_device.h
> +++ b/drivers/gpu/drm/xe/xe_device.h
> @@ -191,6 +191,7 @@ static inline bool xe_device_wedged(struct xe_device *xe)
> }
>
> void xe_device_declare_wedged(struct xe_device *xe);
> +void xe_device_set_wedged_method(struct xe_device *xe, unsigned long method);
>
> struct xe_file *xe_file_get(struct xe_file *xef);
> void xe_file_put(struct xe_file *xef);
> diff --git a/drivers/gpu/drm/xe/xe_device_types.h b/drivers/gpu/drm/xe/xe_device_types.h
> index b93c04466637..fb3617956d63 100644
> --- a/drivers/gpu/drm/xe/xe_device_types.h
> +++ b/drivers/gpu/drm/xe/xe_device_types.h
> @@ -559,6 +559,8 @@ struct xe_device {
> atomic_t flag;
> /** @wedged.mode: Mode controlled by kernel parameter and debugfs */
> int mode;
> + /** @wedged.method: Recovery method to be sent in the drm device wedged uevent */
> + unsigned long method;
> } wedged;
>
> /** @bo_device: Struct to control async free of BOs */
> --
> 2.47.1
>
* Re: [PATCH 1/4] drm: Add a firmware flash method to device wedged uevent
2025-06-05 11:24 ` Riana Tauro
@ 2025-06-16 20:39 ` Rodrigo Vivi
0 siblings, 0 replies; 12+ messages in thread
From: Rodrigo Vivi @ 2025-06-16 20:39 UTC (permalink / raw)
To: Riana Tauro
Cc: Raag Jadav, intel-xe, dri-devel, anshuman.gupta, lucas.demarchi,
aravind.iddamsetty, himal.prasad.ghimiray, frank.scarbrough,
andrealmeid, christian.koenig
On Thu, Jun 05, 2025 at 04:54:24PM +0530, Riana Tauro wrote:
>
> Hi Raag
>
> On 6/4/2025 4:13 PM, Raag Jadav wrote:
> > On Tue, Jun 03, 2025 at 01:43:57PM +0530, Riana Tauro wrote:
> > > A device is declared wedged when it is non-recoverable from
> > > the driver context. Some firmware errors can also cause
> > > the device to enter this state and the only method to recover
> > > from this would be to do a firmware flash
> > >
> > > Signed-off-by: Riana Tauro <riana.tauro@intel.com>
> > > ---
> > > Documentation/gpu/drm-uapi.rst | 6 +++---
> > > drivers/gpu/drm/drm_drv.c | 2 ++
> > > include/drm/drm_device.h | 1 +
> > > 3 files changed, 6 insertions(+), 3 deletions(-)
> > >
> > > diff --git a/Documentation/gpu/drm-uapi.rst b/Documentation/gpu/drm-uapi.rst
> > > index 4863a4deb0ee..524224afb09f 100644
> > > --- a/Documentation/gpu/drm-uapi.rst
> > > +++ b/Documentation/gpu/drm-uapi.rst
> > > @@ -422,9 +422,8 @@ Current implementation defines three recovery methods, out of which, drivers
> > > can use any one, multiple or none. Method(s) of choice will be sent in the
> > > uevent environment as ``WEDGED=<method1>[,..,<methodN>]`` in order of less to
> > > more side-effects. If driver is unsure about recovery or method is unknown
> > > -(like soft/hard system reboot, firmware flashing, physical device replacement
> > > -or any other procedure which can't be attempted on the fly), ``WEDGED=unknown``
> > > -will be sent instead.
> > > +(like soft/hard system reboot, physical device replacement or any other procedure
> > > +which can't be attempted on the fly), ``WEDGED=unknown`` will be sent instead.
> > > Userspace consumers can parse this event and attempt recovery as per the
> > > following expectations.
> > > @@ -435,6 +434,7 @@ following expectations.
> > > none optional telemetry collection
> > > rebind unbind + bind driver
> > > bus-reset unbind + bus reset/re-enumeration + bind
> > > + firmware-flash unbind + firmware flash + bind
> >
> > Can you guarantee this to be generic for all drivers?
>
>
> Firmware flash as a method was mentioned as unknown in the document. So if
> there is an error that requires firmware flash to recover, mentioning this
> as recovery method should be okay
>
> Wanted to get some comments on unbind/bind. If this is not required will
> remove it.
Yep, it is probably better to remove the unbind/bind and keep this generic.
Even in some of our own cases we would need to unbind + config-survivability + rebind + flash firmware + unbind + delete configfs + bind.
>
> Adding reviewers for inputs
>
> Thanks
> Riana
>
>
> >
> > Raag
>
* Re: [PATCH 2/4] drm/xe: Add a helper function to set recovery method
2025-06-06 15:12 ` Raag Jadav
@ 2025-06-19 7:26 ` Riana Tauro
0 siblings, 0 replies; 12+ messages in thread
From: Riana Tauro @ 2025-06-19 7:26 UTC (permalink / raw)
To: Raag Jadav
Cc: intel-xe, dri-devel, anshuman.gupta, rodrigo.vivi, lucas.demarchi,
aravind.iddamsetty, himal.prasad.ghimiray, frank.scarbrough
Hi Raag
Thank you for the review comments
On 6/6/2025 8:42 PM, Raag Jadav wrote:
> On Tue, Jun 03, 2025 at 01:43:58PM +0530, Riana Tauro wrote:
>> Add a helper function to set recovery method. The recovery
>> method has to be set before declaring the device wedged and sending the
>> drm wedged uevent. If no method is set, default unbind/re-bind method
>> will be set
>>
>> Signed-off-by: Riana Tauro <riana.tauro@intel.com>
>> ---
>> drivers/gpu/drm/xe/xe_device.c | 30 +++++++++++++++++++++-------
>> drivers/gpu/drm/xe/xe_device.h | 1 +
>> drivers/gpu/drm/xe/xe_device_types.h | 2 ++
>> 3 files changed, 26 insertions(+), 7 deletions(-)
>>
>> diff --git a/drivers/gpu/drm/xe/xe_device.c b/drivers/gpu/drm/xe/xe_device.c
>> index 660b0c5126dc..3fd604ebdc6e 100644
>> --- a/drivers/gpu/drm/xe/xe_device.c
>> +++ b/drivers/gpu/drm/xe/xe_device.c
>> @@ -1120,16 +1120,28 @@ static void xe_device_wedged_fini(struct drm_device *drm, void *arg)
>> xe_pm_runtime_put(xe);
>> }
>>
>> +/**
>> + * xe_device_set_wedged_method - Set wedged recovery method
>> + * @xe: xe device instance
>
> Missing @method
Missed this. Will fix it.
>
>> + *
>> + * Set wedged recovery method to be sent using drm wedged uevent.
>> + */
>> +void xe_device_set_wedged_method(struct xe_device *xe, unsigned long method)
>> +{
>> + xe->wedged.method = method;
>> +}
>> +
>> /**
>> * xe_device_declare_wedged - Declare device wedged
>> * @xe: xe device instance
>> *
>> - * This is a final state that can only be cleared with a module
>> - * re-probe (unbind + bind).
>> - * In this state every IOCTL will be blocked so the GT cannot be used.
>> + * This is a final state that can only be cleared with the method specified
>> + * in the drm wedged uevent. The method needs to be set using xe_device_set_wedged_method
>> + * before declaring the device as wedged or the default method of reprobe (unbind/re-bind)
>> + * will be sent. In this state every IOCTL will be blocked so the GT cannot be used.
>
> The file convention seems like 80 characters for kernel doc, so let's
> stick to it.
okay
>
>> * In general it will be called upon any critical error such as gt reset
>> - * failure or guc loading failure. Userspace will be notified of this state
>> - * through device wedged uevent.
>> + * failure or guc loading failure or firmware failure.
>> + * Userspace will be notified of this state through device wedged uevent.
>> * If xe.wedged module parameter is set to 2, this function will be called
>> * on every single execution timeout (a.k.a. GPU hang) right after devcoredump
>> * snapshot capture. In this mode, GT reset won't be attempted so the state of
>> @@ -1152,6 +1164,11 @@ void xe_device_declare_wedged(struct xe_device *xe)
>> return;
>> }
>>
>> + /* If no wedge recovery method is set, use default */
>> + if (!xe->wedged.method)
>> + xe_device_set_wedged_method(xe, DRM_WEDGE_RECOVERY_REBIND
>> + | DRM_WEDGE_RECOVERY_BUS_RESET);
>
> Although there are no strict rules about this, we usually don't begin a
> new line with a symbol.
will fix this
>
>> +
>> if (!atomic_xchg(&xe->wedged.flag, 1)) {
>> xe->needs_flr_on_fini = true;
>> drm_err(&xe->drm,
>> @@ -1161,8 +1178,7 @@ void xe_device_declare_wedged(struct xe_device *xe)
>> dev_name(xe->drm.dev));
>>
>> /* Notify userspace of wedged device */
>> - drm_dev_wedged_event(&xe->drm,
>> - DRM_WEDGE_RECOVERY_REBIND | DRM_WEDGE_RECOVERY_BUS_RESET);
>> + drm_dev_wedged_event(&xe->drm, xe->wedged.method);
>
> I was a bit late to realize it when I originally added this. The event
> call should be after xe_gt_declare_wedged() to comply with wedging rules.
> We notify userspace *after* we're done with driver cleanup.
Will move the xe_gt_declare_wedged() calls before the uevent.
Thanks
Riana
>
> Raag
>
>> }
>>
>> for_each_gt(gt, xe, id)
>> diff --git a/drivers/gpu/drm/xe/xe_device.h b/drivers/gpu/drm/xe/xe_device.h
>> index 0bc3bc8e6803..06350740aac5 100644
>> --- a/drivers/gpu/drm/xe/xe_device.h
>> +++ b/drivers/gpu/drm/xe/xe_device.h
>> @@ -191,6 +191,7 @@ static inline bool xe_device_wedged(struct xe_device *xe)
>> }
>>
>> void xe_device_declare_wedged(struct xe_device *xe);
>> +void xe_device_set_wedged_method(struct xe_device *xe, unsigned long method);
>>
>> struct xe_file *xe_file_get(struct xe_file *xef);
>> void xe_file_put(struct xe_file *xef);
>> diff --git a/drivers/gpu/drm/xe/xe_device_types.h b/drivers/gpu/drm/xe/xe_device_types.h
>> index b93c04466637..fb3617956d63 100644
>> --- a/drivers/gpu/drm/xe/xe_device_types.h
>> +++ b/drivers/gpu/drm/xe/xe_device_types.h
>> @@ -559,6 +559,8 @@ struct xe_device {
>> atomic_t flag;
>> /** @wedged.mode: Mode controlled by kernel parameter and debugfs */
>> int mode;
>> + /** @wedged.method: Recovery method to be sent in the drm device wedged uevent */
>> + unsigned long method;
>> } wedged;
>>
>> /** @bo_device: Struct to control async free of BOs */
>> --
>> 2.47.1
>>
end of thread, other threads:[~2025-06-19 7:26 UTC | newest]
Thread overview: 12+ messages
2025-06-03 8:13 [PATCH 0/4] Handle Firmware reported Hardware Errors Riana Tauro
2025-06-03 8:13 ` [PATCH 1/4] drm: Add a firmware flash method to device wedged uevent Riana Tauro
2025-06-04 10:43 ` Raag Jadav
2025-06-05 11:24 ` Riana Tauro
2025-06-16 20:39 ` Rodrigo Vivi
2025-06-03 8:13 ` [PATCH 2/4] drm/xe: Add a helper function to set recovery method Riana Tauro
2025-06-06 15:12 ` Raag Jadav
2025-06-19 7:26 ` Riana Tauro
2025-06-03 8:13 ` [PATCH 3/4] drm/xe: Add support to handle hardware errors Riana Tauro
2025-06-03 8:14 ` [PATCH 4/4] drm/xe/xe_hw_error: Handle CSC Firmware reported Hardware errors Riana Tauro
2025-06-03 9:31 ` Ghimiray, Himal Prasad
2025-06-03 9:53 ` Riana Tauro