[Intel-xe] [PATCH v2 0/4] Supporting RAS on XE

All of lore.kernel.org
 help / color / mirror / Atom feed

* [Intel-xe] [PATCH v2 0/4] Supporting RAS on XE
@ 2023-08-10  5:07 Himal Prasad Ghimiray
  2023-08-10  5:07 ` [Intel-xe] [PATCH v2 1/4] drm/xe: Handle errors from various components Himal Prasad Ghimiray
                   ` (4 more replies)
  0 siblings, 5 replies; 10+ messages in thread
From: Himal Prasad Ghimiray @ 2023-08-10  5:07 UTC (permalink / raw)
  To: intel-xe

Our platforms support Reliability, Availability and Serviceability(RAS).
In case of hardware errors, our hardwares provides the causes via
sending interrupt or pcie errors. The fatal errors are propogated 
as pci errors and non fatal errors as MSI. This series focuses on 
loging and updating counters for these errors, which will be  helpful to avoid, 
detect and repair hardware faults.

This [1] series proposes mechanism to expose this counters to userspace.
[1]: https://patchwork.freedesktop.org/series/118435/

The error counters  exposed by KMD will be used by L0/sysman 
They will be categorized to specific category of error in sysman:
https://spec.oneapi.io/level-zero/latest/sysman/api.html#ras

Current version covers the error propagation at tile level and possible 
causes from GT subsystem. Future patches will cover the possible causes
from other subsystems (for example SOC, GSC/CSC).

We have very limited capabilities for error injection to validate the
code flow.
Output of L3 fabric fatal injection from PVC is:
xe 0000:8c:00.0: [drm] *ERROR* [Hardware Error]: TILE0 detected GT FATAL error bit[0] is set
xe 0000:8c:00.0: [drm] *ERROR* [Hardware Error]: GT0 detected L3 FABRIC FATAL error. ERR_VECT_GT_FATAL_[7]:0x00000087

v2
- Use different headers for error registers. (Nikula)
- Correctable errors shouldn't be considered as dmesg errors (Matt)
- Limit series to HW errors.(Aravind)

Himal Prasad Ghimiray (4):
  drm/xe: Handle errors from various components.
  drm/xe: Log and count the GT hardware errors.
  drm/xe: Support GT hardware error reporting for PVC.
  drm/xe: Process fatal hardware errors.

 drivers/gpu/drm/xe/regs/xe_gt_error_regs.h   |  99 ++++
 drivers/gpu/drm/xe/regs/xe_regs.h            |   7 +
 drivers/gpu/drm/xe/regs/xe_tile_error_regs.h | 108 +++++
 drivers/gpu/drm/xe/xe_device_types.h         |   6 +
 drivers/gpu/drm/xe/xe_gt_types.h             |   6 +
 drivers/gpu/drm/xe/xe_irq.c                  | 460 +++++++++++++++++++
 6 files changed, 686 insertions(+)
 create mode 100644 drivers/gpu/drm/xe/regs/xe_gt_error_regs.h
 create mode 100644 drivers/gpu/drm/xe/regs/xe_tile_error_regs.h

-- 
2.25.1

^ permalink raw reply	[flat|nested] 10+ messages in thread

* [Intel-xe] [PATCH v2 1/4] drm/xe: Handle errors from various components.
  2023-08-10  5:07 [Intel-xe] [PATCH v2 0/4] Supporting RAS on XE Himal Prasad Ghimiray
@ 2023-08-10  5:07 ` Himal Prasad Ghimiray
  2023-08-10  7:54   ` Jani Nikula
  2023-08-10  5:07 ` [Intel-xe] [PATCH v2 2/4] drm/xe: Log and count the GT hardware errors Himal Prasad Ghimiray
                   ` (3 subsequent siblings)
  4 siblings, 1 reply; 10+ messages in thread
From: Himal Prasad Ghimiray @ 2023-08-10  5:07 UTC (permalink / raw)
  To: intel-xe; +Cc: Jani Nikula, Joonas Lahtinen, Rodrigo Vivi, Matt Roper

The GFX device can generate numbers of classes of error under the new
infrastructure: correctable, non-fatal, and fatal errors.

The non-fatal and fatal error classes distinguish between levels of
severity for uncorrectable errors. Driver will only handle logging
of errors and updating counters from various components within the
graphics device. Anything more will be handled at system level.

For errors that will route as interrupts, three bits in the Master
Interrupt Register will be used to convey the class of error.

For each class of error: Determine source of error (IP block) by reading
the Device Error Source Register (RW1C) that
corresponds to the class of error being serviced.

Bspec: 50875, 53073, 53074, 53075

Cc: Rodrigo Vivi <rodrigo.vivi@intel.com>
Cc: Aravind Iddamsetty <aravind.iddamsetty@intel.com>
Cc: Matthew Brost <matthew.brost@intel.com>
Cc: Matt Roper <matthew.d.roper@intel.com>
Cc: Joonas Lahtinen <joonas.lahtinen@linux.intel.com>
Cc: Jani Nikula <jani.nikula@intel.com>
Signed-off-by: Himal Prasad Ghimiray <himal.prasad.ghimiray@intel.com>
---
 drivers/gpu/drm/xe/regs/xe_regs.h            |   7 +
 drivers/gpu/drm/xe/regs/xe_tile_error_regs.h | 108 +++++++++
 drivers/gpu/drm/xe/xe_device_types.h         |   6 +
 drivers/gpu/drm/xe/xe_irq.c                  | 220 +++++++++++++++++++
 4 files changed, 341 insertions(+)
 create mode 100644 drivers/gpu/drm/xe/regs/xe_tile_error_regs.h

diff --git a/drivers/gpu/drm/xe/regs/xe_regs.h b/drivers/gpu/drm/xe/regs/xe_regs.h
index ec45b1ba9db1..9901e55fc89c 100644
--- a/drivers/gpu/drm/xe/regs/xe_regs.h
+++ b/drivers/gpu/drm/xe/regs/xe_regs.h
@@ -87,7 +87,14 @@
 #define   GU_MISC_IRQ				REG_BIT(29)
 #define   DISPLAY_IRQ				REG_BIT(16)
 #define   GT_DW_IRQ(x)				REG_BIT(x)
+#define   XE_ERROR_IRQ(x)			REG_BIT(26 + (x))
 
 #define PVC_RP_STATE_CAP			XE_REG(0x281014)
 
+enum hardware_error {
+	HARDWARE_ERROR_CORRECTABLE = 0,
+	HARDWARE_ERROR_NONFATAL = 1,
+	HARDWARE_ERROR_FATAL = 2,
+	HARDWARE_ERROR_MAX,
+};
 #endif
diff --git a/drivers/gpu/drm/xe/regs/xe_tile_error_regs.h b/drivers/gpu/drm/xe/regs/xe_tile_error_regs.h
new file mode 100644
index 000000000000..fbb794b2f183
--- /dev/null
+++ b/drivers/gpu/drm/xe/regs/xe_tile_error_regs.h
@@ -0,0 +1,108 @@
+/* SPDX-License-Identifier: MIT */
+/*
+ * Copyright © 2023 Intel Corporation
+ */
+#ifndef XE_TILE_ERROR_REGS_H_
+#define XE_TILE_ERROR_REGS_H_
+
+#include <linux/stddef.h>
+
+#define _DEV_ERR_STAT_NONFATAL                         0x100178
+#define _DEV_ERR_STAT_CORRECTABLE                      0x10017c
+#define DEV_ERR_STAT_REG(x)                            XE_REG(_PICK_EVEN((x), \
+								_DEV_ERR_STAT_CORRECTABLE, \
+								_DEV_ERR_STAT_NONFATAL))
+
+#define  DEV_ERR_STAT_MAX_ERROR_BIT         (21)
+
+/* Count of  Correctable and Uncorrectable errors reported on tile */
+enum xe_tile_hw_errors {
+	XE_TILE_HW_ERR_GT_FATAL = 0,
+	XE_TILE_HW_ERR_SGGI_FATAL,
+	XE_TILE_HW_ERR_DISPLAY_FATAL,
+	XE_TILE_HW_ERR_SGDI_FATAL,
+	XE_TILE_HW_ERR_SGLI_FATAL,
+	XE_TILE_HW_ERR_SGUNIT_FATAL,
+	XE_TILE_HW_ERR_SGCI_FATAL,
+	XE_TILE_HW_ERR_GSC_FATAL,
+	XE_TILE_HW_ERR_SOC_FATAL,
+	XE_TILE_HW_ERR_MERT_FATAL,
+	XE_TILE_HW_ERR_SGMI_FATAL,
+	XE_TILE_HW_ERR_UNKNOWN_FATAL,
+	XE_TILE_HW_ERR_SGGI_NONFATAL,
+	XE_TILE_HW_ERR_DISPLAY_NONFATAL,
+	XE_TILE_HW_ERR_SGDI_NONFATAL,
+	XE_TILE_HW_ERR_SGLI_NONFATAL,
+	XE_TILE_HW_ERR_GT_NONFATAL,
+	XE_TILE_HW_ERR_SGUNIT_NONFATAL,
+	XE_TILE_HW_ERR_SGCI_NONFATAL,
+	XE_TILE_HW_ERR_GSC_NONFATAL,
+	XE_TILE_HW_ERR_SOC_NONFATAL,
+	XE_TILE_HW_ERR_MERT_NONFATAL,
+	XE_TILE_HW_ERR_SGMI_NONFATAL,
+	XE_TILE_HW_ERR_UNKNOWN_NONFATAL,
+	XE_TILE_HW_ERR_GT_CORR,
+	XE_TILE_HW_ERR_DISPLAY_CORR,
+	XE_TILE_HW_ERR_SGUNIT_CORR,
+	XE_TILE_HW_ERR_GSC_CORR,
+	XE_TILE_HW_ERR_SOC_CORR,
+	XE_TILE_HW_ERR_UNKNOWN_CORR,
+};
+
+#define XE_TILE_HW_ERROR_MAX (XE_TILE_HW_ERR_UNKNOWN_CORR + 1)
+
+#define PVC_DEV_ERR_STAT_FATAL_MASK \
+		(REG_BIT(0) | \
+		 REG_BIT(1) | \
+		 REG_BIT(8) | \
+		 REG_BIT(9) | \
+		 REG_BIT(13) | \
+		 REG_BIT(16) | \
+		 REG_BIT(20))
+
+#define PVC_DEV_ERR_STAT_NONFATAL_MASK \
+		(REG_BIT(0) | \
+		 REG_BIT(1) | \
+		 REG_BIT(8) | \
+		 REG_BIT(9) | \
+		 REG_BIT(13) | \
+		 REG_BIT(16) | \
+		 REG_BIT(20))
+
+#define PVC_DEV_ERR_STAT_CORRECTABLE_MASK \
+		(REG_BIT(0) | \
+		 REG_BIT(8))
+
+#define DG2_DEV_ERR_STAT_FATAL_MASK \
+		(REG_BIT(0) | \
+		 REG_BIT(4) | \
+		 REG_BIT(8) | \
+		 REG_BIT(12) | \
+		 REG_BIT(16))
+
+#define DG2_DEV_ERR_STAT_NONFATAL_MASK \
+		(REG_BIT(0) | \
+		 REG_BIT(4) | \
+		 REG_BIT(8) | \
+		 REG_BIT(12) | \
+		 REG_BIT(16) | \
+		 REG_BIT(20))
+
+#define DG2_DEV_ERR_STAT_CORRECTABLE_MASK \
+		(REG_BIT(0) | \
+		 REG_BIT(4) | \
+		 REG_BIT(8) | \
+		 REG_BIT(12) | \
+		 REG_BIT(16))
+
+#define REG_SIZE 32
+
+#define xe_tile_log_hw_err(tile, fmt, ...) \
+	drm_err_ratelimited(&tile_to_xe(tile)->drm, HW_ERR "TILE%d detected " fmt, \
+			    tile->id, ##__VA_ARGS__)
+
+#define xe_tile_log_hw_warn(tile, fmt, ...) \
+	drm_warn(&tile_to_xe(tile)->drm, HW_ERR "TILE%d detected " fmt, \
+		 tile->id, ##__VA_ARGS__)
+
+#endif
diff --git a/drivers/gpu/drm/xe/xe_device_types.h b/drivers/gpu/drm/xe/xe_device_types.h
index f84ecb976f5d..1335ba74981a 100644
--- a/drivers/gpu/drm/xe/xe_device_types.h
+++ b/drivers/gpu/drm/xe/xe_device_types.h
@@ -16,6 +16,7 @@
 #include "xe_gt_types.h"
 #include "xe_platform_types.h"
 #include "xe_step_types.h"
+#include "regs/xe_tile_error_regs.h"
 
 #if IS_ENABLED(CONFIG_DRM_XE_DISPLAY)
 #include "ext/intel_device_info.h"
@@ -166,6 +167,11 @@ struct xe_tile {
 
 	/** @sysfs: sysfs' kobj used by xe_tile_sysfs */
 	struct kobject *sysfs;
+
+	/** @tile_hw_errors: hardware errors reported for the tile */
+	struct tile_hw_errors {
+		unsigned long hw[XE_TILE_HW_ERROR_MAX];
+	} errors;
 };
 
 /**
diff --git a/drivers/gpu/drm/xe/xe_irq.c b/drivers/gpu/drm/xe/xe_irq.c
index 2022a5643e01..04a665faea23 100644
--- a/drivers/gpu/drm/xe/xe_irq.c
+++ b/drivers/gpu/drm/xe/xe_irq.c
@@ -362,6 +362,223 @@ static void dg1_intr_enable(struct xe_device *xe, bool stall)
 		xe_mmio_read32(mmio, DG1_MSTR_TILE_INTR);
 }
 
+static const char *
+hardware_error_type_to_str(const enum hardware_error hw_err)
+{
+	switch (hw_err) {
+	case HARDWARE_ERROR_CORRECTABLE:
+		return "CORRECTABLE";
+	case HARDWARE_ERROR_NONFATAL:
+		return "NONFATAL";
+	case HARDWARE_ERROR_FATAL:
+		return "FATAL";
+	default:
+		return "UNKNOWN";
+	}
+}
+
+struct error_msg_counter_pair {
+	const char *errmsg;
+	int errcounter;
+};
+
+struct error_msg_counter_pair dev_err_stat_fatal_reg[] = {
+	{"GT",			XE_TILE_HW_ERR_GT_FATAL			/* Bit Pos 0 */},
+	{"SGGI Cmd Parity",	XE_TILE_HW_ERR_SGGI_FATAL		/* Bit Pos 1 */},
+	{"Undefined",		XE_TILE_HW_ERR_UNKNOWN_FATAL},
+	{"Undefined",		XE_TILE_HW_ERR_UNKNOWN_FATAL},
+	{"DISPLAY",		XE_TILE_HW_ERR_DISPLAY_FATAL		/* Bit Pos 4 */},
+	{"Undefined",		XE_TILE_HW_ERR_UNKNOWN_FATAL},
+	{"Undefined",		XE_TILE_HW_ERR_UNKNOWN_FATAL},
+	{"Undefined",		XE_TILE_HW_ERR_UNKNOWN_FATAL},
+	{"GSC error",		XE_TILE_HW_ERR_GSC_FATAL		/* Bit Pos 8 */},
+	{"SGLI Cmd Parity",	XE_TILE_HW_ERR_SGLI_FATAL		/* Bit Pos 9 */},
+	{"Undefined",		XE_TILE_HW_ERR_UNKNOWN_FATAL},
+	{"Undefined",		XE_TILE_HW_ERR_UNKNOWN_FATAL},
+	{"SGUNIT",		XE_TILE_HW_ERR_SGUNIT_FATAL		/* Bit Pos 12 */},
+	{"SGCI Cmd Parity",	XE_TILE_HW_ERR_SGCI_FATAL		/* Bit Pos 13 */},
+	{"Undefined",		XE_TILE_HW_ERR_UNKNOWN_FATAL},
+	{"Undefined",		XE_TILE_HW_ERR_UNKNOWN_FATAL},
+	{"SOC ERROR",		XE_TILE_HW_ERR_SOC_FATAL		/* Bit Pos 16 */},
+	{"Undefined",		XE_TILE_HW_ERR_UNKNOWN_FATAL},
+	{"Undefined",		XE_TILE_HW_ERR_UNKNOWN_FATAL},
+	{"Undefined",		XE_TILE_HW_ERR_UNKNOWN_FATAL},
+	{"MERT Cmd Parity",	XE_TILE_HW_ERR_MERT_FATAL		/* Bit Pos 20 */},
+};
+
+struct error_msg_counter_pair dev_err_stat_nonfatal_reg[] = {
+	{"GT",			XE_TILE_HW_ERR_GT_NONFATAL		/* Bit Pos 0 */},
+	{"SGGI Data Parity",	XE_TILE_HW_ERR_SGGI_NONFATAL		/* Bit Pos 1 */},
+	{"Undefined",		XE_TILE_HW_ERR_UNKNOWN_NONFATAL},
+	{"Undefined",		XE_TILE_HW_ERR_UNKNOWN_NONFATAL},
+	{"DISPLAY",		XE_TILE_HW_ERR_DISPLAY_NONFATAL		/* Bit Pos 4 */},
+	{"SGDI Data Parity",	XE_TILE_HW_ERR_SGDI_NONFATAL		/* Bit Pos 5 */},
+	{"Undefined",		XE_TILE_HW_ERR_UNKNOWN_NONFATAL},
+	{"Undefined",		XE_TILE_HW_ERR_UNKNOWN_NONFATAL},
+	{"GSC",			XE_TILE_HW_ERR_GSC_NONFATAL		/* Bit Pos 8 */},
+	{"SGLI Data Parity",	XE_TILE_HW_ERR_SGLI_NONFATAL		/* Bit Pos 9 */},
+	{"Undefined",		XE_TILE_HW_ERR_UNKNOWN_NONFATAL},
+	{"Undefined",		XE_TILE_HW_ERR_UNKNOWN_NONFATAL},
+	{"SGUNIT",		XE_TILE_HW_ERR_SGUNIT_NONFATAL		/* Bit Pos 12 */},
+	{"SGCI Data Parity",	XE_TILE_HW_ERR_SGCI_NONFATAL		/* Bit Pos 13 */},
+	{"Undefined",		XE_TILE_HW_ERR_UNKNOWN_NONFATAL},
+	{"Undefined",		XE_TILE_HW_ERR_UNKNOWN_NONFATAL},
+	{"SOC",			XE_TILE_HW_ERR_SOC_NONFATAL		/* Bit Pos 16 */},
+	{"Undefined",		XE_TILE_HW_ERR_UNKNOWN_NONFATAL		/* Bit Pos 17 */},
+	{"Undefined",		XE_TILE_HW_ERR_UNKNOWN_NONFATAL},
+	{"Undefined",		XE_TILE_HW_ERR_UNKNOWN_NONFATAL},
+	{"MERT Data Parity",	XE_TILE_HW_ERR_MERT_NONFATAL		/* Bit Pos 20 */},
+};
+
+struct error_msg_counter_pair dev_err_stat_correctable_reg[] = {
+	{"GT",			XE_TILE_HW_ERR_GT_CORR			/* Bit Pos 0 */},
+	{"Undefined",		XE_TILE_HW_ERR_UNKNOWN_CORR},
+	{"Undefined",		XE_TILE_HW_ERR_UNKNOWN_CORR},
+	{"Undefined",		XE_TILE_HW_ERR_UNKNOWN_CORR},
+	{"DISPLAY",		XE_TILE_HW_ERR_DISPLAY_CORR		/* Bit Pos 4 */},
+	{"Undefined",		XE_TILE_HW_ERR_UNKNOWN_CORR},
+	{"Undefined",		XE_TILE_HW_ERR_UNKNOWN_CORR},
+	{"Undefined",		XE_TILE_HW_ERR_UNKNOWN_CORR},
+	{"GSC",			XE_TILE_HW_ERR_GSC_CORR			/* Bit Pos 8 */},
+	{"Undefined",		XE_TILE_HW_ERR_UNKNOWN_CORR},
+	{"Undefined",		XE_TILE_HW_ERR_UNKNOWN_CORR},
+	{"Undefined",		XE_TILE_HW_ERR_UNKNOWN_CORR},
+	{"SGUNIT",		XE_TILE_HW_ERR_SGUNIT_CORR		/* Bit Pos 12 */},
+	{"Undefined",		XE_TILE_HW_ERR_UNKNOWN_CORR},
+	{"Undefined",		XE_TILE_HW_ERR_UNKNOWN_CORR},
+	{"Undefined",		XE_TILE_HW_ERR_UNKNOWN_CORR},
+	{"SOC",			XE_TILE_HW_ERR_SOC_CORR			/* Bit Pos 16 */},
+	{"Undefined",		XE_TILE_HW_ERR_UNKNOWN_CORR},
+	{"Undefined",		XE_TILE_HW_ERR_UNKNOWN_CORR},
+	{"Undefined",		XE_TILE_HW_ERR_UNKNOWN_CORR},
+	{"Undefined",		XE_TILE_HW_ERR_UNKNOWN_CORR},
+};
+
+static void update_valid_error_regs(struct xe_device *xe)
+{
+	unsigned long mask = 0;
+
+	u32 i;
+
+	if (xe->info.platform == XE_DG2) {
+		mask = ~(0 | DG2_DEV_ERR_STAT_FATAL_MASK);
+		for_each_set_bit(i, &mask, DEV_ERR_STAT_MAX_ERROR_BIT)
+			dev_err_stat_fatal_reg[i] = (struct error_msg_counter_pair)
+			{.errmsg = "Undefined", .errcounter = XE_TILE_HW_ERR_UNKNOWN_FATAL};
+
+		mask = ~(0 | DG2_DEV_ERR_STAT_NONFATAL_MASK);
+		for_each_set_bit(i, &mask, DEV_ERR_STAT_MAX_ERROR_BIT)
+			dev_err_stat_nonfatal_reg[i] = (struct error_msg_counter_pair)
+			{.errmsg = "Undefined", .errcounter = XE_TILE_HW_ERR_UNKNOWN_NONFATAL};
+
+		mask = ~(0 | DG2_DEV_ERR_STAT_CORRECTABLE_MASK);
+		for_each_set_bit(i, &mask, DEV_ERR_STAT_MAX_ERROR_BIT)
+			dev_err_stat_correctable_reg[i] = (struct error_msg_counter_pair)
+			{.errmsg = "Undefined", .errcounter = XE_TILE_HW_ERR_UNKNOWN_CORR};
+	} else if (xe->info.platform == XE_PVC) {
+		mask = ~(0 | PVC_DEV_ERR_STAT_FATAL_MASK);
+		for_each_set_bit(i, &mask, DEV_ERR_STAT_MAX_ERROR_BIT)
+			dev_err_stat_fatal_reg[i] = (struct error_msg_counter_pair)
+			{.errmsg = "Undefined", .errcounter = XE_TILE_HW_ERR_UNKNOWN_FATAL};
+
+		mask = ~(0 | PVC_DEV_ERR_STAT_NONFATAL_MASK);
+		for_each_set_bit(i, &mask, DEV_ERR_STAT_MAX_ERROR_BIT)
+			dev_err_stat_nonfatal_reg[i] = (struct error_msg_counter_pair)
+			{.errmsg = "Undefined", .errcounter = XE_TILE_HW_ERR_UNKNOWN_NONFATAL};
+
+		mask = ~(0 | PVC_DEV_ERR_STAT_CORRECTABLE_MASK);
+		for_each_set_bit(i, &mask, DEV_ERR_STAT_MAX_ERROR_BIT)
+			dev_err_stat_correctable_reg[i] = (struct error_msg_counter_pair)
+			{.errmsg = "Undefined", .errcounter = XE_TILE_HW_ERR_UNKNOWN_CORR};
+	}
+}
+
+static void
+xe_hw_error_source_handler(struct xe_tile *tile, const enum hardware_error hw_err)
+{
+	const char *hw_err_str = hardware_error_type_to_str(hw_err);
+	struct error_msg_counter_pair *errstat;
+	unsigned long errsrc;
+	unsigned long flags;
+	const char *errmsg;
+	struct xe_gt *mmio;
+	u32 counter;
+	u32 errcntr;
+	u32 errbit;
+
+	switch (hw_err) {
+	case HARDWARE_ERROR_FATAL:
+		errstat = (struct error_msg_counter_pair *)dev_err_stat_fatal_reg;
+		counter = XE_TILE_HW_ERR_UNKNOWN_FATAL;
+		break;
+	case HARDWARE_ERROR_NONFATAL:
+		errstat = (struct error_msg_counter_pair *)dev_err_stat_nonfatal_reg;
+		counter = XE_TILE_HW_ERR_UNKNOWN_NONFATAL;
+		break;
+	case HARDWARE_ERROR_CORRECTABLE:
+		errstat = (struct error_msg_counter_pair *)dev_err_stat_correctable_reg;
+		counter = XE_TILE_HW_ERR_UNKNOWN_CORR;
+		break;
+	default:
+		return;
+	}
+
+	spin_lock_irqsave(&tile_to_xe(tile)->irq.lock, flags);
+	mmio = tile->primary_gt;
+	errsrc = xe_mmio_read32(mmio, DEV_ERR_STAT_REG(hw_err));
+
+	if (!errsrc) {
+		xe_tile_log_hw_err(tile, "DEV_ERR_STAT_REG_%s blank!\n", hw_err_str);
+		goto unlock;
+	}
+
+	for_each_set_bit(errbit, &errsrc, REG_SIZE) {
+		if (errbit < DEV_ERR_STAT_MAX_ERROR_BIT) {
+			errmsg = errstat[errbit].errmsg;
+			errcntr = errstat[errbit].errcounter;
+		} else {
+			errmsg = "Undefined";
+			errcntr = counter;
+		}
+
+		if (hw_err == HARDWARE_ERROR_CORRECTABLE)
+			xe_tile_log_hw_warn(tile, "%s %s error bit[%d] is set\n",
+					    errmsg, hw_err_str, errbit);
+		else
+			xe_tile_log_hw_err(tile, "%s %s error bit[%d] is set\n",
+					   errmsg, hw_err_str, errbit);
+
+		tile->errors.hw[errcntr]++;
+	}
+
+	xe_mmio_write32(mmio, DEV_ERR_STAT_REG(hw_err), errsrc);
+unlock:
+	spin_unlock_irqrestore(&tile_to_xe(tile)->irq.lock, flags);
+}
+
+/*
+ * XE Platforms adds three Error bits to the Master Interrupt
+ * Register to support error handling. These three bits are
+ * used to convey the class of error:
+ * FATAL, NONFATAL, or CORRECTABLE.
+ *
+ * To process an interrupt:
+ *       Determine source of error (IP block) by reading
+ *	 the Device Error Source Register (RW1C) that
+ *	 corresponds to the class of error being serviced
+ *	 and log the error.
+ */
+static void
+xe_hw_error_irq_handler(struct xe_tile *tile, const u32 master_ctl)
+{
+	enum hardware_error hw_err;
+
+	for (hw_err = 0; hw_err < HARDWARE_ERROR_MAX; hw_err++) {
+		if (master_ctl & XE_ERROR_IRQ(hw_err))
+			xe_hw_error_source_handler(tile, hw_err);
+	}
+}
+
 /*
  * Top-level interrupt handler for Xe_LP+ and beyond.  These platforms have
  * a "master tile" interrupt register which must be consulted before the
@@ -413,6 +630,7 @@ static irqreturn_t dg1_irq_handler(int irq, void *arg)
 		xe_mmio_write32(mmio, GFX_MSTR_IRQ, master_ctl);
 
 		gt_irq_handler(tile, master_ctl, intr_dw, identity);
+		xe_hw_error_irq_handler(tile, master_ctl);
 
 		/*
 		 * Display interrupts (including display backlight operations
@@ -561,6 +779,8 @@ int xe_irq_install(struct xe_device *xe)
 		return -EINVAL;
 	}
 
+	update_valid_error_regs(xe);
+
 	xe->irq.enabled = true;
 
 	xe_irq_reset(xe);
-- 
2.25.1


^ permalink raw reply related	[flat|nested] 10+ messages in thread

* [Intel-xe] [PATCH v2 2/4] drm/xe: Log and count the GT hardware errors.
  2023-08-10  5:07 [Intel-xe] [PATCH v2 0/4] Supporting RAS on XE Himal Prasad Ghimiray
  2023-08-10  5:07 ` [Intel-xe] [PATCH v2 1/4] drm/xe: Handle errors from various components Himal Prasad Ghimiray
@ 2023-08-10  5:07 ` Himal Prasad Ghimiray
  2023-08-10  5:07 ` [Intel-xe] [PATCH v2 3/4] drm/xe: Support GT hardware error reporting for PVC Himal Prasad Ghimiray
                   ` (2 subsequent siblings)
  4 siblings, 0 replies; 10+ messages in thread
From: Himal Prasad Ghimiray @ 2023-08-10  5:07 UTC (permalink / raw)
  To: intel-xe; +Cc: Jani Nikula, Joonas Lahtinen, Rodrigo Vivi, Matt Roper

For the errors reported by GT unit, read the GT error register.
Log and count these errors and clear the error register.

Bspec: 53088, 53089, 53090

Cc: Rodrigo Vivi <rodrigo.vivi@intel.com>
Cc: Aravind Iddamsetty <aravind.iddamsetty@intel.com>
Cc: Matthew Brost <matthew.brost@intel.com>
Cc: Matt Roper <matthew.d.roper@intel.com>
Cc: Joonas Lahtinen <joonas.lahtinen@linux.intel.com>
Cc: Jani Nikula <jani.nikula@intel.com>
Signed-off-by: Himal Prasad Ghimiray <himal.prasad.ghimiray@intel.com>
---
 drivers/gpu/drm/xe/regs/xe_gt_error_regs.h |  48 ++++++++++
 drivers/gpu/drm/xe/xe_gt_types.h           |   6 ++
 drivers/gpu/drm/xe/xe_irq.c                | 103 +++++++++++++++++++++
 3 files changed, 157 insertions(+)
 create mode 100644 drivers/gpu/drm/xe/regs/xe_gt_error_regs.h

diff --git a/drivers/gpu/drm/xe/regs/xe_gt_error_regs.h b/drivers/gpu/drm/xe/regs/xe_gt_error_regs.h
new file mode 100644
index 000000000000..6d51cfbddb61
--- /dev/null
+++ b/drivers/gpu/drm/xe/regs/xe_gt_error_regs.h
@@ -0,0 +1,48 @@
+/* SPDX-License-Identifier: MIT */
+/*
+ * Copyright © 2023 Intel Corporation
+ */
+#ifndef XE_GT_ERROR_REGS_H_
+#define XE_GT_ERROR_REGS_H_
+
+#define _ERR_STAT_GT_COR                0x100160
+#define _ERR_STAT_GT_NONFATAL           0x100164
+#define ERR_STAT_GT_REG(x)              XE_REG(_PICK_EVEN((x), \
+						_ERR_STAT_GT_COR, \
+						_ERR_STAT_GT_NONFATAL))
+
+#define  ERR_GT_STAT_MAX_ERROR_BIT	(16)
+/* Count of GT Correctable and FATAL HW ERRORS */
+enum xe_gt_hw_errors {
+	XE_GT_HW_ERR_L3_SNG_CORR,
+	XE_GT_HW_ERR_GUC_CORR,
+	XE_GT_HW_ERR_SAMPLER_CORR,
+	XE_GT_HW_ERR_SLM_CORR,
+	XE_GT_HW_ERR_EU_IC_CORR,
+	XE_GT_HW_ERR_EU_GRF_CORR,
+	XE_GT_HW_ERR_UNKNOWN_CORR,
+	XE_GT_HW_ERR_ARR_BIST_FATAL,
+	XE_GT_HW_ERR_FPU_FATAL,
+	XE_GT_HW_ERR_L3_DOUB_FATAL,
+	XE_GT_HW_ERR_L3_ECC_CHK_FATAL,
+	XE_GT_HW_ERR_GUC_FATAL,
+	XE_GT_HW_ERR_IDI_PAR_FATAL,
+	XE_GT_HW_ERR_SQIDI_FATAL,
+	XE_GT_HW_ERR_SAMPLER_FATAL,
+	XE_GT_HW_ERR_SLM_FATAL,
+	XE_GT_HW_ERR_EU_IC_FATAL,
+	XE_GT_HW_ERR_EU_GRF_FATAL,
+	XE_GT_HW_ERR_UNKNOWN_FATAL,
+};
+
+#define XE_GT_HW_ERROR_MAX (XE_GT_HW_ERR_UNKNOWN_FATAL + 1)
+
+#define xe_gt_log_hw_err(gt, fmt, ...) \
+	drm_err_ratelimited(&gt_to_xe(gt)->drm, HW_ERR "GT%d detected " fmt, \
+			    gt->info.id, ##__VA_ARGS__)
+
+#define xe_gt_log_hw_warn(gt, fmt, ...) \
+	drm_warn(&gt_to_xe(gt)->drm, HW_ERR "GT%d detected " fmt, \
+		 gt->info.id, ##__VA_ARGS__)
+
+#endif
diff --git a/drivers/gpu/drm/xe/xe_gt_types.h b/drivers/gpu/drm/xe/xe_gt_types.h
index 35b8c19fa8bf..780e192e4cc8 100644
--- a/drivers/gpu/drm/xe/xe_gt_types.h
+++ b/drivers/gpu/drm/xe/xe_gt_types.h
@@ -13,6 +13,7 @@
 #include "xe_reg_sr_types.h"
 #include "xe_sa_types.h"
 #include "xe_uc_types.h"
+#include "regs/xe_gt_error_regs.h"
 
 struct xe_exec_queue_ops;
 struct xe_migrate;
@@ -346,6 +347,11 @@ struct xe_gt {
 		/** @oob: bitmap with active OOB workaroudns */
 		unsigned long *oob;
 	} wa_active;
+
+	/** @gt_hw_errors: hardware errors reported for the gt */
+	struct gt_hw_errors {
+		 unsigned long hw[XE_GT_HW_ERROR_MAX];
+	} errors;
 };
 
 #endif
diff --git a/drivers/gpu/drm/xe/xe_irq.c b/drivers/gpu/drm/xe/xe_irq.c
index 04a665faea23..ca9c4993be6a 100644
--- a/drivers/gpu/drm/xe/xe_irq.c
+++ b/drivers/gpu/drm/xe/xe_irq.c
@@ -454,6 +454,44 @@ struct error_msg_counter_pair dev_err_stat_correctable_reg[] = {
 	{"Undefined",		XE_TILE_HW_ERR_UNKNOWN_CORR},
 };
 
+struct error_msg_counter_pair err_stat_gt_fatal_reg[] = {
+	{"Undefined",		XE_GT_HW_ERR_UNKNOWN_FATAL},
+	{"Array BIST",		XE_GT_HW_ERR_ARR_BIST_FATAL		/* Bit Pos 1 */},
+	{"Undefined",		XE_GT_HW_ERR_UNKNOWN_FATAL},
+	{"FPU",			XE_GT_HW_ERR_FPU_FATAL			/* Bit Pos 3 */},
+	{"L3 Double",		XE_GT_HW_ERR_L3_DOUB_FATAL		/* Bit Pos 4 */},
+	{"L3 ECC Checker",	XE_GT_HW_ERR_L3_ECC_CHK_FATAL		/* Bit Pos 5 */},
+	{"GUC SRAM",		XE_GT_HW_ERR_GUC_FATAL			/* Bit Pos 6 */},
+	{"Undefined",		XE_GT_HW_ERR_UNKNOWN_FATAL},
+	{"IDI PARITY",		XE_GT_HW_ERR_IDI_PAR_FATAL		/* Bit Pos 8 */},
+	{"SQIDI",		XE_GT_HW_ERR_SQIDI_FATAL		/* Bit Pos 9 */},
+	{"Undefined",		XE_GT_HW_ERR_UNKNOWN_FATAL},
+	{"Undefined",		XE_GT_HW_ERR_UNKNOWN_FATAL},
+	{"SAMPLER",		XE_GT_HW_ERR_SAMPLER_FATAL		/* Bit Pos 12 */},
+	{"SLM",			XE_GT_HW_ERR_SLM_FATAL			/* Bit Pos 13 */},
+	{"EU IC",		XE_GT_HW_ERR_EU_IC_FATAL		/* Bit Pos 14 */},
+	{"EU GRF",		XE_GT_HW_ERR_EU_GRF_FATAL		/* Bit Pos 15 */},
+};
+
+struct error_msg_counter_pair err_stat_gt_correctable_reg[] = {
+	{"L3 SINGLE",		XE_GT_HW_ERR_L3_SNG_CORR		/* Bit Pos 0 */},
+	{"SINGLE BIT GUC SRAM",	XE_GT_HW_ERR_GUC_CORR			/* Bit Pos 1 */},
+	{"Undefined",		XE_GT_HW_ERR_UNKNOWN_CORR},
+	{"Undefined",		XE_GT_HW_ERR_UNKNOWN_CORR},
+	{"Undefined",		XE_GT_HW_ERR_UNKNOWN_CORR},
+	{"Undefined",		XE_GT_HW_ERR_UNKNOWN_CORR},
+	{"Undefined",		XE_GT_HW_ERR_UNKNOWN_CORR},
+	{"Undefined",		XE_GT_HW_ERR_UNKNOWN_CORR},
+	{"Undefined",		XE_GT_HW_ERR_UNKNOWN_CORR},
+	{"Undefined",		XE_GT_HW_ERR_UNKNOWN_CORR},
+	{"Undefined",		XE_GT_HW_ERR_UNKNOWN_CORR},
+	{"Undefined",           XE_GT_HW_ERR_UNKNOWN_CORR},
+	{"SINGLE BIT SAMPLER",	XE_GT_HW_ERR_SAMPLER_CORR		/* Bit Pos 12 */},
+	{"SINGLE BIT SLM",	XE_GT_HW_ERR_SLM_CORR			/* Bit Pos 13 */},
+	{"SINGLE BIT EU IC",	XE_GT_HW_ERR_EU_IC_CORR			/* Bit Pos 14 */},
+	{"SINGLE BIT EU GRF",	XE_GT_HW_ERR_EU_GRF_CORR		/* Bit Pos 15 */},
+};
+
 static void update_valid_error_regs(struct xe_device *xe)
 {
 	unsigned long mask = 0;
@@ -493,6 +531,68 @@ static void update_valid_error_regs(struct xe_device *xe)
 	}
 }
 
+static void
+xe_gt_hw_error_handler(struct xe_gt *gt, const enum hardware_error hw_err)
+{
+	const char *hw_err_str = hardware_error_type_to_str(hw_err);
+	struct error_msg_counter_pair *errstat;
+	unsigned long errsrc;
+	const char *errmsg;
+	u32 counter;
+	u32 errcntr;
+	u32 errbit;
+
+	if (gt_to_xe(gt)->info.platform == XE_PVC)
+		return;
+
+	lockdep_assert_held(&gt_to_xe(gt)->irq.lock);
+	errsrc = xe_mmio_read32(gt, ERR_STAT_GT_REG(hw_err));
+	if (!errsrc) {
+		xe_gt_log_hw_err(gt, "ERR_STAT_GT_REG_%s blank!\n", hw_err_str);
+		return;
+	}
+
+	switch (hw_err) {
+	case HARDWARE_ERROR_FATAL:
+		errstat = (struct error_msg_counter_pair *)err_stat_gt_fatal_reg;
+		counter = XE_GT_HW_ERR_UNKNOWN_FATAL;
+		break;
+	case HARDWARE_ERROR_NONFATAL:
+		/*  The GT Non Fatal Error Status Register has only reserved bits
+		 *  Nothing to service.
+		 */
+		xe_gt_log_hw_err(gt, "%s error\n", hw_err_str);
+		goto clear_reg;
+	case HARDWARE_ERROR_CORRECTABLE:
+		errstat = (struct error_msg_counter_pair *)err_stat_gt_correctable_reg;
+		counter = XE_GT_HW_ERR_UNKNOWN_CORR;
+		break;
+	default:
+		return;
+	}
+
+	for_each_set_bit(errbit, &errsrc, 32) {
+		if (errbit < ERR_GT_STAT_MAX_ERROR_BIT) {
+			errmsg = errstat[errbit].errmsg;
+			errcntr = errstat[errbit].errcounter;
+		} else {
+			errmsg = "Undefined";
+			errcntr = counter;
+		}
+
+		if (hw_err == HARDWARE_ERROR_FATAL)
+			xe_gt_log_hw_err(gt, "%s %s error bit[%d] is set\n",
+					   errmsg, hw_err_str, errbit);
+		else
+			xe_gt_log_hw_warn(gt, "%s %s error bit[%d] is set\n",
+					    errmsg, hw_err_str, errbit);
+
+		gt->errors.hw[counter]++;
+	}
+
+clear_reg: xe_mmio_write32(gt, ERR_STAT_GT_REG(hw_err), errsrc);
+}
+
 static void
 xe_hw_error_source_handler(struct xe_tile *tile, const enum hardware_error hw_err)
 {
@@ -549,6 +649,9 @@ xe_hw_error_source_handler(struct xe_tile *tile, const enum hardware_error hw_er
 					   errmsg, hw_err_str, errbit);
 
 		tile->errors.hw[errcntr]++;
+
+		if (errbit == 0)
+			xe_gt_hw_error_handler(tile->primary_gt, hw_err);
 	}
 
 	xe_mmio_write32(mmio, DEV_ERR_STAT_REG(hw_err), errsrc);
-- 
2.25.1


^ permalink raw reply related	[flat|nested] 10+ messages in thread

* [Intel-xe] [PATCH v2 3/4] drm/xe: Support GT hardware error reporting for PVC.
  2023-08-10  5:07 [Intel-xe] [PATCH v2 0/4] Supporting RAS on XE Himal Prasad Ghimiray
  2023-08-10  5:07 ` [Intel-xe] [PATCH v2 1/4] drm/xe: Handle errors from various components Himal Prasad Ghimiray
  2023-08-10  5:07 ` [Intel-xe] [PATCH v2 2/4] drm/xe: Log and count the GT hardware errors Himal Prasad Ghimiray
@ 2023-08-10  5:07 ` Himal Prasad Ghimiray
  2023-08-10  5:07 ` [Intel-xe] [PATCH v2 4/4] drm/xe: Process fatal hardware errors Himal Prasad Ghimiray
  2023-08-10  5:32 ` [Intel-xe] ✗ CI.Patch_applied: failure for Supporting RAS on XE Patchwork
  4 siblings, 0 replies; 10+ messages in thread
From: Himal Prasad Ghimiray @ 2023-08-10  5:07 UTC (permalink / raw)
  To: intel-xe; +Cc: Joonas Lahtinen, Rodrigo Vivi, Matt Roper

PVC supports GT error reporting via vector registers alongwith
error status register. Add support to report these errors and
update respective counters.
Incase of Subslice error reported by vector register, process the
error status register for applicable bits.

Bspec: 54179, 54177, 53088, 53089


Cc: Rodrigo Vivi <rodrigo.vivi@intel.com>
Cc: Aravind Iddamsetty <aravind.iddamsetty@intel.com>
Cc: Matthew Brost <matthew.brost@intel.com>
Cc: Matt Roper <matthew.d.roper@intel.com>
Cc: Joonas Lahtinen <joonas.lahtinen@linux.intel.com>
Signed-off-by: Himal Prasad Ghimiray <himal.prasad.ghimiray@intel.com>
---
 drivers/gpu/drm/xe/regs/xe_gt_error_regs.h |  48 +++++++++
 drivers/gpu/drm/xe/xe_irq.c                | 112 ++++++++++++++++++++-
 2 files changed, 155 insertions(+), 5 deletions(-)

diff --git a/drivers/gpu/drm/xe/regs/xe_gt_error_regs.h b/drivers/gpu/drm/xe/regs/xe_gt_error_regs.h
index 6d51cfbddb61..27a54b7c278a 100644
--- a/drivers/gpu/drm/xe/regs/xe_gt_error_regs.h
+++ b/drivers/gpu/drm/xe/regs/xe_gt_error_regs.h
@@ -12,8 +12,40 @@
 						_ERR_STAT_GT_NONFATAL))
 
 #define  ERR_GT_STAT_MAX_ERROR_BIT	(16)
+
+enum gt_vctr_registers {
+	ERR_STAT_GT_VCTR0 = 0,
+	ERR_STAT_GT_VCTR1,
+	ERR_STAT_GT_VCTR2,
+	ERR_STAT_GT_VCTR3,
+	ERR_STAT_GT_VCTR4,
+	ERR_STAT_GT_VCTR5,
+	ERR_STAT_GT_VCTR6,
+	ERR_STAT_GT_VCTR7,
+};
+
+#define ERR_STAT_GT_COR_VCTR_LEN        (4)
+#define _ERR_STAT_GT_COR_VCTR_0         0x1002a0
+#define _ERR_STAT_GT_COR_VCTR_1         0x1002a4
+#define ERR_STAT_GT_COR_VCTR_REG(x)     XE_REG(_PICK_EVEN((x), \
+						_ERR_STAT_GT_COR_VCTR_0, \
+						_ERR_STAT_GT_COR_VCTR_1))
+
+#define ERR_STAT_GT_FATAL_VCTR_LEN      (8)
+#define _ERR_STAT_GT_FATAL_VCTR_0       0x100260
+#define _ERR_STAT_GT_FATAL_VCTR_1       0x100264
+#define ERR_STAT_GT_FATAL_VCTR_REG(x)   XE_REG(_PICK_EVEN((x), \
+						_ERR_STAT_GT_FATAL_VCTR_0, \
+						_ERR_STAT_GT_FATAL_VCTR_1))
+
+#define ERR_STAT_GT_VCTR_REG(hw_err, x) (hw_err == HARDWARE_ERROR_CORRECTABLE ? \
+						ERR_STAT_GT_COR_VCTR_REG(x) : \
+						ERR_STAT_GT_FATAL_VCTR_REG(x))
+
 /* Count of GT Correctable and FATAL HW ERRORS */
 enum xe_gt_hw_errors {
+	XE_GT_HW_ERR_SUBSLICE_CORR,
+	XE_GT_HW_ERR_L3BANK_CORR,
 	XE_GT_HW_ERR_L3_SNG_CORR,
 	XE_GT_HW_ERR_GUC_CORR,
 	XE_GT_HW_ERR_SAMPLER_CORR,
@@ -21,6 +53,8 @@ enum xe_gt_hw_errors {
 	XE_GT_HW_ERR_EU_IC_CORR,
 	XE_GT_HW_ERR_EU_GRF_CORR,
 	XE_GT_HW_ERR_UNKNOWN_CORR,
+	XE_GT_HW_ERR_SUBSLICE_FATAL,
+	XE_GT_HW_ERR_L3BANK_FATAL,
 	XE_GT_HW_ERR_ARR_BIST_FATAL,
 	XE_GT_HW_ERR_FPU_FATAL,
 	XE_GT_HW_ERR_L3_DOUB_FATAL,
@@ -32,11 +66,25 @@ enum xe_gt_hw_errors {
 	XE_GT_HW_ERR_SLM_FATAL,
 	XE_GT_HW_ERR_EU_IC_FATAL,
 	XE_GT_HW_ERR_EU_GRF_FATAL,
+	XE_GT_HW_ERR_TLB_FATAL,
+	XE_GT_HW_ERR_L3_FABRIC_FATAL,
 	XE_GT_HW_ERR_UNKNOWN_FATAL,
 };
 
 #define XE_GT_HW_ERROR_MAX (XE_GT_HW_ERR_UNKNOWN_FATAL + 1)
 
+#define PVC_ERR_STAT_GT_FATAL_MASK \
+		(REG_BIT(3) | \
+		 REG_BIT(6) | \
+		 REG_BIT(13) | \
+		 REG_BIT(15))
+
+#define PVC_ERR_STAT_GT_CORRECTABLE_MASK \
+		(REG_BIT(1) | \
+		 REG_BIT(13) | \
+		 REG_BIT(14) | \
+		 REG_BIT(15))
+
 #define xe_gt_log_hw_err(gt, fmt, ...) \
 	drm_err_ratelimited(&gt_to_xe(gt)->drm, HW_ERR "GT%d detected " fmt, \
 			    gt->info.id, ##__VA_ARGS__)
diff --git a/drivers/gpu/drm/xe/xe_irq.c b/drivers/gpu/drm/xe/xe_irq.c
index ca9c4993be6a..ab29ac625d3a 100644
--- a/drivers/gpu/drm/xe/xe_irq.c
+++ b/drivers/gpu/drm/xe/xe_irq.c
@@ -492,6 +492,24 @@ struct error_msg_counter_pair err_stat_gt_correctable_reg[] = {
 	{"SINGLE BIT EU GRF",	XE_GT_HW_ERR_EU_GRF_CORR		/* Bit Pos 15 */},
 };
 
+struct error_msg_counter_pair err_stat_gt_fatal_vectr_reg[] = {
+	{"SUBSLICE",		XE_GT_HW_ERR_SUBSLICE_FATAL		/* vector reg 0 */},
+	{"SUBSLICE",		XE_GT_HW_ERR_SUBSLICE_FATAL		/* vector reg 1 */},
+	{"L3BANK",		XE_GT_HW_ERR_L3BANK_FATAL		/* vector reg 2 */},
+	{"L3BANK",		XE_GT_HW_ERR_L3BANK_FATAL		/* vector reg 3 */},
+	{"Undefined",		XE_GT_HW_ERR_UNKNOWN_FATAL		/* vector reg 4 */},
+	{"Undefined",		XE_GT_HW_ERR_UNKNOWN_FATAL              /* vector reg 5 */},
+	{"TLB",			XE_GT_HW_ERR_TLB_FATAL			/* vector reg 6 */},
+	{"L3 FABRIC",		XE_GT_HW_ERR_L3_FABRIC_FATAL		/* vector reg 7 */},
+};
+
+struct error_msg_counter_pair err_stat_gt_correctable_vectr_reg[] = {
+	{"SUBSLICE",		XE_GT_HW_ERR_SUBSLICE_CORR		/* vector reg 0 */},
+	{"SUBSLICE",		XE_GT_HW_ERR_SUBSLICE_CORR		/* vector reg 1 */},
+	{"L3BANK",		XE_GT_HW_ERR_L3BANK_CORR		/* vector reg 2 */},
+	{"L3BANK",		XE_GT_HW_ERR_L3BANK_CORR		/* vector reg 3 */},
+};
+
 static void update_valid_error_regs(struct xe_device *xe)
 {
 	unsigned long mask = 0;
@@ -528,11 +546,21 @@ static void update_valid_error_regs(struct xe_device *xe)
 		for_each_set_bit(i, &mask, DEV_ERR_STAT_MAX_ERROR_BIT)
 			dev_err_stat_correctable_reg[i] = (struct error_msg_counter_pair)
 			{.errmsg = "Undefined", .errcounter = XE_TILE_HW_ERR_UNKNOWN_CORR};
+
+		mask = ~(0 | PVC_ERR_STAT_GT_FATAL_MASK);
+		for_each_set_bit(i, &mask, ERR_GT_STAT_MAX_ERROR_BIT)
+			err_stat_gt_fatal_reg[i] = (struct error_msg_counter_pair)
+			{.errmsg = "Undefined", .errcounter = XE_GT_HW_ERR_UNKNOWN_FATAL};
+
+		mask = ~(0 | PVC_ERR_STAT_GT_CORRECTABLE_MASK);
+		for_each_set_bit(i, &mask, ERR_GT_STAT_MAX_ERROR_BIT)
+			err_stat_gt_correctable_reg[i] = (struct error_msg_counter_pair)
+			{.errmsg = "Undefined", .errcounter = XE_GT_HW_ERR_UNKNOWN_CORR};
 	}
 }
 
 static void
-xe_gt_hw_error_handler(struct xe_gt *gt, const enum hardware_error hw_err)
+xe_gt_hw_error_status_reg_handler(struct xe_gt *gt, const enum hardware_error hw_err)
 {
 	const char *hw_err_str = hardware_error_type_to_str(hw_err);
 	struct error_msg_counter_pair *errstat;
@@ -542,10 +570,6 @@ xe_gt_hw_error_handler(struct xe_gt *gt, const enum hardware_error hw_err)
 	u32 errcntr;
 	u32 errbit;
 
-	if (gt_to_xe(gt)->info.platform == XE_PVC)
-		return;
-
-	lockdep_assert_held(&gt_to_xe(gt)->irq.lock);
 	errsrc = xe_mmio_read32(gt, ERR_STAT_GT_REG(hw_err));
 	if (!errsrc) {
 		xe_gt_log_hw_err(gt, "ERR_STAT_GT_REG_%s blank!\n", hw_err_str);
@@ -593,6 +617,84 @@ xe_gt_hw_error_handler(struct xe_gt *gt, const enum hardware_error hw_err)
 clear_reg: xe_mmio_write32(gt, ERR_STAT_GT_REG(hw_err), errsrc);
 }
 
+static void
+xe_gt_hw_error_vectr_reg_handler(struct xe_gt *gt, const enum hardware_error hw_err)
+{
+	const char *hw_err_str = hardware_error_type_to_str(hw_err);
+	struct error_msg_counter_pair *errvctr;
+	const char *errmsg;
+	bool errstat_read;
+	u32 num_vctr_reg;
+	u32 counter;
+	u32 vctr;
+	u32 i;
+
+	switch (hw_err) {
+	case HARDWARE_ERROR_FATAL:
+		num_vctr_reg = ERR_STAT_GT_FATAL_VCTR_LEN;
+		errvctr = (struct error_msg_counter_pair *)err_stat_gt_fatal_vectr_reg;
+		counter = XE_GT_HW_ERR_UNKNOWN_FATAL;
+		break;
+	case HARDWARE_ERROR_NONFATAL:
+		/*  The GT Non Fatal Error Status Register has only reserved bits
+		 *  Nothing to service.
+		 */
+		xe_gt_log_hw_err(gt, "%s error\n", hw_err_str);
+		return;
+	case HARDWARE_ERROR_CORRECTABLE:
+		num_vctr_reg = ERR_STAT_GT_COR_VCTR_LEN;
+		errvctr = (struct error_msg_counter_pair *)err_stat_gt_correctable_vectr_reg;
+		counter = XE_GT_HW_ERR_UNKNOWN_CORR;
+		break;
+	default:
+		return;
+	}
+
+	errstat_read = false;
+
+	for (i = 0 ; i < num_vctr_reg; i++) {
+		vctr = xe_mmio_read32(gt, ERR_STAT_GT_VCTR_REG(hw_err, i));
+		if (!vctr)
+			continue;
+
+		errmsg = errvctr[i].errmsg;
+		counter = errvctr[i].errcounter;
+
+		if (hw_err == HARDWARE_ERROR_FATAL)
+			xe_gt_log_hw_err(gt, "%s %s error. ERR_VECT_GT_%s_[%d]:0x%08x\n",
+					 errmsg, hw_err_str, hw_err_str, i, vctr);
+		else
+			xe_gt_log_hw_warn(gt, "%s %s error. ERR_VECT_GT_%s_[%d]:0x%08x\n",
+					  errmsg, hw_err_str, hw_err_str, i, vctr);
+		if (i < ERR_STAT_GT_VCTR4)
+			gt->errors.hw[counter] += hweight32(vctr);
+
+		if (i == ERR_STAT_GT_VCTR6)
+			gt->errors.hw[counter] += hweight16(vctr);
+
+		if (i == ERR_STAT_GT_VCTR7)
+			gt->errors.hw[counter] += hweight8(vctr);
+
+		if (i < ERR_STAT_GT_VCTR2 && !errstat_read) {
+			xe_gt_hw_error_status_reg_handler(gt, hw_err);
+			errstat_read = true;
+		}
+
+		xe_mmio_write32(gt, ERR_STAT_GT_VCTR_REG(hw_err, i), vctr);
+	}
+}
+
+static void
+xe_gt_hw_error_handler(struct xe_gt *gt, const enum hardware_error hw_err)
+{
+	lockdep_assert_held(&gt_to_xe(gt)->irq.lock);
+
+	if (gt_to_xe(gt)->info.platform == XE_PVC)
+		xe_gt_hw_error_vectr_reg_handler(gt, hw_err);
+	else
+		xe_gt_hw_error_status_reg_handler(gt, hw_err);
+}
+
 static void
 xe_hw_error_source_handler(struct xe_tile *tile, const enum hardware_error hw_err)
 {
-- 
2.25.1


^ permalink raw reply related	[flat|nested] 10+ messages in thread

* [Intel-xe] [PATCH v2 4/4] drm/xe: Process fatal hardware errors.
  2023-08-10  5:07 [Intel-xe] [PATCH v2 0/4] Supporting RAS on XE Himal Prasad Ghimiray
                   ` (2 preceding siblings ...)
  2023-08-10  5:07 ` [Intel-xe] [PATCH v2 3/4] drm/xe: Support GT hardware error reporting for PVC Himal Prasad Ghimiray
@ 2023-08-10  5:07 ` Himal Prasad Ghimiray
  2023-08-10  5:32 ` [Intel-xe] ✗ CI.Patch_applied: failure for Supporting RAS on XE Patchwork
  4 siblings, 0 replies; 10+ messages in thread
From: Himal Prasad Ghimiray @ 2023-08-10  5:07 UTC (permalink / raw)
  To: intel-xe; +Cc: Joonas Lahtinen, Rodrigo Vivi, Matt Roper

Fatal errors are reported as PCIe errors. When a PCIe error is asserted,
the OS will perform a device warm reset which causes the driver to reload.
The error registers are sticky and the values are maintained through a
warm reset. We read these registers during the boot flow of the driver and
increment the respective error counters.

Bspec: 53076

Cc: Rodrigo Vivi <rodrigo.vivi@intel.com>
Cc: Aravind Iddamsetty <aravind.iddamsetty@intel.com>
Cc: Matthew Brost <matthew.brost@intel.com>
Cc: Matt Roper <matthew.d.roper@intel.com>
Cc: Joonas Lahtinen <joonas.lahtinen@linux.intel.com>
Signed-off-by: Himal Prasad Ghimiray <himal.prasad.ghimiray@intel.com>
---
 drivers/gpu/drm/xe/regs/xe_gt_error_regs.h |  3 ++
 drivers/gpu/drm/xe/xe_irq.c                | 37 +++++++++++++++++++++-
 2 files changed, 39 insertions(+), 1 deletion(-)

diff --git a/drivers/gpu/drm/xe/regs/xe_gt_error_regs.h b/drivers/gpu/drm/xe/regs/xe_gt_error_regs.h
index 27a54b7c278a..b389ddd140d0 100644
--- a/drivers/gpu/drm/xe/regs/xe_gt_error_regs.h
+++ b/drivers/gpu/drm/xe/regs/xe_gt_error_regs.h
@@ -42,6 +42,9 @@ enum gt_vctr_registers {
 						ERR_STAT_GT_COR_VCTR_REG(x) : \
 						ERR_STAT_GT_FATAL_VCTR_REG(x))
 
+#define DEV_PCIEERR_STATUS              XE_REG(0x100180)
+#define DEV_PCIEERR_IS_FATAL(x)         REG_BIT(x * 4 + 2)
+
 /* Count of GT Correctable and FATAL HW ERRORS */
 enum xe_gt_hw_errors {
 	XE_GT_HW_ERR_SUBSLICE_CORR,
diff --git a/drivers/gpu/drm/xe/xe_irq.c b/drivers/gpu/drm/xe/xe_irq.c
index ab29ac625d3a..7193744c2c45 100644
--- a/drivers/gpu/drm/xe/xe_irq.c
+++ b/drivers/gpu/drm/xe/xe_irq.c
@@ -784,6 +784,41 @@ xe_hw_error_irq_handler(struct xe_tile *tile, const u32 master_ctl)
 	}
 }
 
+/**
+ * process_hw_errors - checks for the occurrence of HW errors
+ *
+ * This checks for the HW Errors including FATAL error that might
+ * have occurred in the previous boot of the driver which will
+ * initiate PCIe FLR reset of the device and cause the
+ * driver to reload.
+ */
+static void process_hw_errors(struct xe_device *xe)
+{
+	struct xe_tile *root_tile = xe_device_get_root_tile(xe);
+	struct xe_gt *root_mmio = root_tile->primary_gt;
+
+	u32 dev_pcieerr_status, master_ctl;
+	struct xe_tile *tile;
+	int i;
+
+	update_valid_error_regs(xe);
+
+	dev_pcieerr_status = xe_mmio_read32(root_mmio, DEV_PCIEERR_STATUS);
+
+	for_each_tile(tile, xe, i) {
+		struct xe_gt *mmio = tile->primary_gt;
+
+		if (dev_pcieerr_status & DEV_PCIEERR_IS_FATAL(i))
+			xe_hw_error_source_handler(tile, HARDWARE_ERROR_FATAL);
+
+		master_ctl = xe_mmio_read32(mmio, GFX_MSTR_IRQ);
+		xe_mmio_write32(mmio, GFX_MSTR_IRQ, master_ctl);
+		xe_hw_error_irq_handler(tile, master_ctl);
+	}
+	if (dev_pcieerr_status)
+		xe_mmio_write32(root_mmio, DEV_PCIEERR_STATUS, dev_pcieerr_status);
+}
+
 /*
  * Top-level interrupt handler for Xe_LP+ and beyond.  These platforms have
  * a "master tile" interrupt register which must be consulted before the
@@ -984,7 +1019,7 @@ int xe_irq_install(struct xe_device *xe)
 		return -EINVAL;
 	}
 
-	update_valid_error_regs(xe);
+	process_hw_errors(xe);
 
 	xe->irq.enabled = true;
 
-- 
2.25.1


^ permalink raw reply related	[flat|nested] 10+ messages in thread

* [Intel-xe] ✗ CI.Patch_applied: failure for Supporting RAS on XE
  2023-08-10  5:07 [Intel-xe] [PATCH v2 0/4] Supporting RAS on XE Himal Prasad Ghimiray
                   ` (3 preceding siblings ...)
  2023-08-10  5:07 ` [Intel-xe] [PATCH v2 4/4] drm/xe: Process fatal hardware errors Himal Prasad Ghimiray
@ 2023-08-10  5:32 ` Patchwork
  4 siblings, 0 replies; 10+ messages in thread
From: Patchwork @ 2023-08-10  5:32 UTC (permalink / raw)
  To: Himal Prasad Ghimiray; +Cc: intel-xe

== Series Details ==

Series: Supporting RAS on XE
URL   : https://patchwork.freedesktop.org/series/122257/
State : failure

== Summary ==

=== Applying kernel patches on branch 'drm-xe-next' with base: ===
Base commit: b921a7374 drm/xe: Add Wa_14015150844 for DG2 and Xe_LPG
=== git am output follows ===
error: patch failed: drivers/gpu/drm/xe/xe_device_types.h:16
error: drivers/gpu/drm/xe/xe_device_types.h: patch does not apply
hint: Use 'git am --show-current-patch' to see the failed patch
Applying: drm/xe: Handle errors from various components.
Patch failed at 0001 drm/xe: Handle errors from various components.
When you have resolved this problem, run "git am --continue".
If you prefer to skip this patch, run "git am --skip" instead.
To restore the original branch and stop patching, run "git am --abort".

^ permalink raw reply	[flat|nested] 10+ messages in thread

* Re: [Intel-xe] [PATCH v2 1/4] drm/xe: Handle errors from various components.
  2023-08-10  5:07 ` [Intel-xe] [PATCH v2 1/4] drm/xe: Handle errors from various components Himal Prasad Ghimiray
@ 2023-08-10  7:54   ` Jani Nikula
       [not found]     ` <ac574fd9-dce4-4f1e-a209-f4a400263273@intel.com>
  0 siblings, 1 reply; 10+ messages in thread
From: Jani Nikula @ 2023-08-10  7:54 UTC (permalink / raw)
  To: Himal Prasad Ghimiray, intel-xe; +Cc: Joonas Lahtinen, Rodrigo Vivi, Matt Roper

On Thu, 10 Aug 2023, Himal Prasad Ghimiray <himal.prasad.ghimiray@intel.com> wrote:
> The GFX device can generate numbers of classes of error under the new
> infrastructure: correctable, non-fatal, and fatal errors.
>
> The non-fatal and fatal error classes distinguish between levels of
> severity for uncorrectable errors. Driver will only handle logging
> of errors and updating counters from various components within the
> graphics device. Anything more will be handled at system level.
>
> For errors that will route as interrupts, three bits in the Master
> Interrupt Register will be used to convey the class of error.
>
> For each class of error: Determine source of error (IP block) by reading
> the Device Error Source Register (RW1C) that
> corresponds to the class of error being serviced.
>
> Bspec: 50875, 53073, 53074, 53075
>
> Cc: Rodrigo Vivi <rodrigo.vivi@intel.com>
> Cc: Aravind Iddamsetty <aravind.iddamsetty@intel.com>
> Cc: Matthew Brost <matthew.brost@intel.com>
> Cc: Matt Roper <matthew.d.roper@intel.com>
> Cc: Joonas Lahtinen <joonas.lahtinen@linux.intel.com>
> Cc: Jani Nikula <jani.nikula@intel.com>
> Signed-off-by: Himal Prasad Ghimiray <himal.prasad.ghimiray@intel.com>
> ---
>  drivers/gpu/drm/xe/regs/xe_regs.h            |   7 +
>  drivers/gpu/drm/xe/regs/xe_tile_error_regs.h | 108 +++++++++
>  drivers/gpu/drm/xe/xe_device_types.h         |   6 +
>  drivers/gpu/drm/xe/xe_irq.c                  | 220 +++++++++++++++++++
>  4 files changed, 341 insertions(+)
>  create mode 100644 drivers/gpu/drm/xe/regs/xe_tile_error_regs.h
>
> diff --git a/drivers/gpu/drm/xe/regs/xe_regs.h b/drivers/gpu/drm/xe/regs/xe_regs.h
> index ec45b1ba9db1..9901e55fc89c 100644
> --- a/drivers/gpu/drm/xe/regs/xe_regs.h
> +++ b/drivers/gpu/drm/xe/regs/xe_regs.h
> @@ -87,7 +87,14 @@
>  #define   GU_MISC_IRQ				REG_BIT(29)
>  #define   DISPLAY_IRQ				REG_BIT(16)
>  #define   GT_DW_IRQ(x)				REG_BIT(x)
> +#define   XE_ERROR_IRQ(x)			REG_BIT(26 + (x))
>  
>  #define PVC_RP_STATE_CAP			XE_REG(0x281014)
>  
> +enum hardware_error {
> +	HARDWARE_ERROR_CORRECTABLE = 0,
> +	HARDWARE_ERROR_NONFATAL = 1,
> +	HARDWARE_ERROR_FATAL = 2,
> +	HARDWARE_ERROR_MAX,
> +};

This file is about registers. IMO enums belong somewhere else. Define
hardware registers using macros.

>  #endif
> diff --git a/drivers/gpu/drm/xe/regs/xe_tile_error_regs.h b/drivers/gpu/drm/xe/regs/xe_tile_error_regs.h
> new file mode 100644
> index 000000000000..fbb794b2f183
> --- /dev/null
> +++ b/drivers/gpu/drm/xe/regs/xe_tile_error_regs.h
> @@ -0,0 +1,108 @@
> +/* SPDX-License-Identifier: MIT */
> +/*
> + * Copyright © 2023 Intel Corporation
> + */
> +#ifndef XE_TILE_ERROR_REGS_H_
> +#define XE_TILE_ERROR_REGS_H_
> +
> +#include <linux/stddef.h>
> +
> +#define _DEV_ERR_STAT_NONFATAL                         0x100178
> +#define _DEV_ERR_STAT_CORRECTABLE                      0x10017c
> +#define DEV_ERR_STAT_REG(x)                            XE_REG(_PICK_EVEN((x), \
> +								_DEV_ERR_STAT_CORRECTABLE, \
> +								_DEV_ERR_STAT_NONFATAL))
> +
> +#define  DEV_ERR_STAT_MAX_ERROR_BIT         (21)
> +
> +/* Count of  Correctable and Uncorrectable errors reported on tile */
> +enum xe_tile_hw_errors {g
> +	XE_TILE_HW_ERR_GT_FATAL = 0,
> +	XE_TILE_HW_ERR_SGGI_FATAL,
> +	XE_TILE_HW_ERR_DISPLAY_FATAL,
> +	XE_TILE_HW_ERR_SGDI_FATAL,
> +	XE_TILE_HW_ERR_SGLI_FATAL,
> +	XE_TILE_HW_ERR_SGUNIT_FATAL,
> +	XE_TILE_HW_ERR_SGCI_FATAL,
> +	XE_TILE_HW_ERR_GSC_FATAL,
> +	XE_TILE_HW_ERR_SOC_FATAL,
> +	XE_TILE_HW_ERR_MERT_FATAL,
> +	XE_TILE_HW_ERR_SGMI_FATAL,
> +	XE_TILE_HW_ERR_UNKNOWN_FATAL,
> +	XE_TILE_HW_ERR_SGGI_NONFATAL,
> +	XE_TILE_HW_ERR_DISPLAY_NONFATAL,
> +	XE_TILE_HW_ERR_SGDI_NONFATAL,
> +	XE_TILE_HW_ERR_SGLI_NONFATAL,
> +	XE_TILE_HW_ERR_GT_NONFATAL,
> +	XE_TILE_HW_ERR_SGUNIT_NONFATAL,
> +	XE_TILE_HW_ERR_SGCI_NONFATAL,
> +	XE_TILE_HW_ERR_GSC_NONFATAL,
> +	XE_TILE_HW_ERR_SOC_NONFATAL,
> +	XE_TILE_HW_ERR_MERT_NONFATAL,
> +	XE_TILE_HW_ERR_SGMI_NONFATAL,
> +	XE_TILE_HW_ERR_UNKNOWN_NONFATAL,
> +	XE_TILE_HW_ERR_GT_CORR,
> +	XE_TILE_HW_ERR_DISPLAY_CORR,
> +	XE_TILE_HW_ERR_SGUNIT_CORR,
> +	XE_TILE_HW_ERR_GSC_CORR,
> +	XE_TILE_HW_ERR_SOC_CORR,
> +	XE_TILE_HW_ERR_UNKNOWN_CORR,
> +};

Ditto about enums and regs.

> +
> +#define XE_TILE_HW_ERROR_MAX (XE_TILE_HW_ERR_UNKNOWN_CORR + 1)

If it's an enum, adding that last in the enum does the trick.

> +
> +#define PVC_DEV_ERR_STAT_FATAL_MASK \
> +		(REG_BIT(0) | \
> +		 REG_BIT(1) | \
> +		 REG_BIT(8) | \
> +		 REG_BIT(9) | \
> +		 REG_BIT(13) | \
> +		 REG_BIT(16) | \
> +		 REG_BIT(20))
> +
> +#define PVC_DEV_ERR_STAT_NONFATAL_MASK \
> +		(REG_BIT(0) | \
> +		 REG_BIT(1) | \
> +		 REG_BIT(8) | \
> +		 REG_BIT(9) | \
> +		 REG_BIT(13) | \
> +		 REG_BIT(16) | \
> +		 REG_BIT(20))
> +
> +#define PVC_DEV_ERR_STAT_CORRECTABLE_MASK \
> +		(REG_BIT(0) | \
> +		 REG_BIT(8))
> +
> +#define DG2_DEV_ERR_STAT_FATAL_MASK \
> +		(REG_BIT(0) | \
> +		 REG_BIT(4) | \
> +		 REG_BIT(8) | \
> +		 REG_BIT(12) | \
> +		 REG_BIT(16))
> +
> +#define DG2_DEV_ERR_STAT_NONFATAL_MASK \
> +		(REG_BIT(0) | \
> +		 REG_BIT(4) | \
> +		 REG_BIT(8) | \
> +		 REG_BIT(12) | \
> +		 REG_BIT(16) | \
> +		 REG_BIT(20))
> +
> +#define DG2_DEV_ERR_STAT_CORRECTABLE_MASK \
> +		(REG_BIT(0) | \
> +		 REG_BIT(4) | \
> +		 REG_BIT(8) | \
> +		 REG_BIT(12) | \
> +		 REG_BIT(16))

Are the above supposed to match what's in xe_tile_hw_errors? Seems
rather unmaintainable.

> +
> +#define REG_SIZE 32
> +
> +#define xe_tile_log_hw_err(tile, fmt, ...) \
> +	drm_err_ratelimited(&tile_to_xe(tile)->drm, HW_ERR "TILE%d detected " fmt, \
> +			    tile->id, ##__VA_ARGS__)
> +
> +#define xe_tile_log_hw_warn(tile, fmt, ...) \
> +	drm_warn(&tile_to_xe(tile)->drm, HW_ERR "TILE%d detected " fmt, \
> +		 tile->id, ##__VA_ARGS__)

Do we really want to keep adding new macros for all possible scenarios
in the driver? This is getting out of hand.

Where's HW_ERR defined?

> +
> +#endif
> diff --git a/drivers/gpu/drm/xe/xe_device_types.h b/drivers/gpu/drm/xe/xe_device_types.h
> index f84ecb976f5d..1335ba74981a 100644
> --- a/drivers/gpu/drm/xe/xe_device_types.h
> +++ b/drivers/gpu/drm/xe/xe_device_types.h
> @@ -16,6 +16,7 @@
>  #include "xe_gt_types.h"
>  #include "xe_platform_types.h"
>  #include "xe_step_types.h"
> +#include "regs/xe_tile_error_regs.h"
>  
>  #if IS_ENABLED(CONFIG_DRM_XE_DISPLAY)
>  #include "ext/intel_device_info.h"
> @@ -166,6 +167,11 @@ struct xe_tile {
>  
>  	/** @sysfs: sysfs' kobj used by xe_tile_sysfs */
>  	struct kobject *sysfs;
> +
> +	/** @tile_hw_errors: hardware errors reported for the tile */
> +	struct tile_hw_errors {
> +		unsigned long hw[XE_TILE_HW_ERROR_MAX];

Even with the documentation comment, I have to look up the source code
to realize this is the *number* of errors for each class.

Maybe "count" is more informative than "hw".

> +	} errors;
>  };
>  
>  /**
> diff --git a/drivers/gpu/drm/xe/xe_irq.c b/drivers/gpu/drm/xe/xe_irq.c
> index 2022a5643e01..04a665faea23 100644
> --- a/drivers/gpu/drm/xe/xe_irq.c
> +++ b/drivers/gpu/drm/xe/xe_irq.c
> @@ -362,6 +362,223 @@ static void dg1_intr_enable(struct xe_device *xe, bool stall)
>  		xe_mmio_read32(mmio, DG1_MSTR_TILE_INTR);
>  }
>  
> +static const char *
> +hardware_error_type_to_str(const enum hardware_error hw_err)
> +{
> +	switch (hw_err) {
> +	case HARDWARE_ERROR_CORRECTABLE:
> +		return "CORRECTABLE";
> +	case HARDWARE_ERROR_NONFATAL:
> +		return "NONFATAL";
> +	case HARDWARE_ERROR_FATAL:
> +		return "FATAL";
> +	default:
> +		return "UNKNOWN";
> +	}
> +}
> +
> +struct error_msg_counter_pair {
> +	const char *errmsg;
> +	int errcounter;

Counter? Or type/class/whatever?

> +};
> +
> +struct error_msg_counter_pair dev_err_stat_fatal_reg[] = {
> +	{"GT",			XE_TILE_HW_ERR_GT_FATAL			/* Bit Pos 0 */},

Does this again tie the enums and the bit positions together, similar to
how the mask macros also do above?

There needs to be a single point of truth for all of this.

I think this needs a redesign.

BR,
Jani.

> +	{"SGGI Cmd Parity",	XE_TILE_HW_ERR_SGGI_FATAL		/* Bit Pos 1 */},
> +	{"Undefined",		XE_TILE_HW_ERR_UNKNOWN_FATAL},
> +	{"Undefined",		XE_TILE_HW_ERR_UNKNOWN_FATAL},
> +	{"DISPLAY",		XE_TILE_HW_ERR_DISPLAY_FATAL		/* Bit Pos 4 */},
> +	{"Undefined",		XE_TILE_HW_ERR_UNKNOWN_FATAL},
> +	{"Undefined",		XE_TILE_HW_ERR_UNKNOWN_FATAL},
> +	{"Undefined",		XE_TILE_HW_ERR_UNKNOWN_FATAL},
> +	{"GSC error",		XE_TILE_HW_ERR_GSC_FATAL		/* Bit Pos 8 */},
> +	{"SGLI Cmd Parity",	XE_TILE_HW_ERR_SGLI_FATAL		/* Bit Pos 9 */},
> +	{"Undefined",		XE_TILE_HW_ERR_UNKNOWN_FATAL},
> +	{"Undefined",		XE_TILE_HW_ERR_UNKNOWN_FATAL},
> +	{"SGUNIT",		XE_TILE_HW_ERR_SGUNIT_FATAL		/* Bit Pos 12 */},
> +	{"SGCI Cmd Parity",	XE_TILE_HW_ERR_SGCI_FATAL		/* Bit Pos 13 */},
> +	{"Undefined",		XE_TILE_HW_ERR_UNKNOWN_FATAL},
> +	{"Undefined",		XE_TILE_HW_ERR_UNKNOWN_FATAL},
> +	{"SOC ERROR",		XE_TILE_HW_ERR_SOC_FATAL		/* Bit Pos 16 */},
> +	{"Undefined",		XE_TILE_HW_ERR_UNKNOWN_FATAL},
> +	{"Undefined",		XE_TILE_HW_ERR_UNKNOWN_FATAL},
> +	{"Undefined",		XE_TILE_HW_ERR_UNKNOWN_FATAL},
> +	{"MERT Cmd Parity",	XE_TILE_HW_ERR_MERT_FATAL		/* Bit Pos 20 */},
> +};
> +
> +struct error_msg_counter_pair dev_err_stat_nonfatal_reg[] = {
> +	{"GT",			XE_TILE_HW_ERR_GT_NONFATAL		/* Bit Pos 0 */},
> +	{"SGGI Data Parity",	XE_TILE_HW_ERR_SGGI_NONFATAL		/* Bit Pos 1 */},
> +	{"Undefined",		XE_TILE_HW_ERR_UNKNOWN_NONFATAL},
> +	{"Undefined",		XE_TILE_HW_ERR_UNKNOWN_NONFATAL},
> +	{"DISPLAY",		XE_TILE_HW_ERR_DISPLAY_NONFATAL		/* Bit Pos 4 */},
> +	{"SGDI Data Parity",	XE_TILE_HW_ERR_SGDI_NONFATAL		/* Bit Pos 5 */},
> +	{"Undefined",		XE_TILE_HW_ERR_UNKNOWN_NONFATAL},
> +	{"Undefined",		XE_TILE_HW_ERR_UNKNOWN_NONFATAL},
> +	{"GSC",			XE_TILE_HW_ERR_GSC_NONFATAL		/* Bit Pos 8 */},
> +	{"SGLI Data Parity",	XE_TILE_HW_ERR_SGLI_NONFATAL		/* Bit Pos 9 */},
> +	{"Undefined",		XE_TILE_HW_ERR_UNKNOWN_NONFATAL},
> +	{"Undefined",		XE_TILE_HW_ERR_UNKNOWN_NONFATAL},
> +	{"SGUNIT",		XE_TILE_HW_ERR_SGUNIT_NONFATAL		/* Bit Pos 12 */},
> +	{"SGCI Data Parity",	XE_TILE_HW_ERR_SGCI_NONFATAL		/* Bit Pos 13 */},
> +	{"Undefined",		XE_TILE_HW_ERR_UNKNOWN_NONFATAL},
> +	{"Undefined",		XE_TILE_HW_ERR_UNKNOWN_NONFATAL},
> +	{"SOC",			XE_TILE_HW_ERR_SOC_NONFATAL		/* Bit Pos 16 */},
> +	{"Undefined",		XE_TILE_HW_ERR_UNKNOWN_NONFATAL		/* Bit Pos 17 */},
> +	{"Undefined",		XE_TILE_HW_ERR_UNKNOWN_NONFATAL},
> +	{"Undefined",		XE_TILE_HW_ERR_UNKNOWN_NONFATAL},
> +	{"MERT Data Parity",	XE_TILE_HW_ERR_MERT_NONFATAL		/* Bit Pos 20 */},
> +};
> +
> +struct error_msg_counter_pair dev_err_stat_correctable_reg[] = {
> +	{"GT",			XE_TILE_HW_ERR_GT_CORR			/* Bit Pos 0 */},
> +	{"Undefined",		XE_TILE_HW_ERR_UNKNOWN_CORR},
> +	{"Undefined",		XE_TILE_HW_ERR_UNKNOWN_CORR},
> +	{"Undefined",		XE_TILE_HW_ERR_UNKNOWN_CORR},
> +	{"DISPLAY",		XE_TILE_HW_ERR_DISPLAY_CORR		/* Bit Pos 4 */},
> +	{"Undefined",		XE_TILE_HW_ERR_UNKNOWN_CORR},
> +	{"Undefined",		XE_TILE_HW_ERR_UNKNOWN_CORR},
> +	{"Undefined",		XE_TILE_HW_ERR_UNKNOWN_CORR},
> +	{"GSC",			XE_TILE_HW_ERR_GSC_CORR			/* Bit Pos 8 */},
> +	{"Undefined",		XE_TILE_HW_ERR_UNKNOWN_CORR},
> +	{"Undefined",		XE_TILE_HW_ERR_UNKNOWN_CORR},
> +	{"Undefined",		XE_TILE_HW_ERR_UNKNOWN_CORR},
> +	{"SGUNIT",		XE_TILE_HW_ERR_SGUNIT_CORR		/* Bit Pos 12 */},
> +	{"Undefined",		XE_TILE_HW_ERR_UNKNOWN_CORR},
> +	{"Undefined",		XE_TILE_HW_ERR_UNKNOWN_CORR},
> +	{"Undefined",		XE_TILE_HW_ERR_UNKNOWN_CORR},
> +	{"SOC",			XE_TILE_HW_ERR_SOC_CORR			/* Bit Pos 16 */},
> +	{"Undefined",		XE_TILE_HW_ERR_UNKNOWN_CORR},
> +	{"Undefined",		XE_TILE_HW_ERR_UNKNOWN_CORR},
> +	{"Undefined",		XE_TILE_HW_ERR_UNKNOWN_CORR},
> +	{"Undefined",		XE_TILE_HW_ERR_UNKNOWN_CORR},
> +};
> +
> +static void update_valid_error_regs(struct xe_device *xe)
> +{
> +	unsigned long mask = 0;
> +
> +	u32 i;
> +
> +	if (xe->info.platform == XE_DG2) {
> +		mask = ~(0 | DG2_DEV_ERR_STAT_FATAL_MASK);
> +		for_each_set_bit(i, &mask, DEV_ERR_STAT_MAX_ERROR_BIT)
> +			dev_err_stat_fatal_reg[i] = (struct error_msg_counter_pair)
> +			{.errmsg = "Undefined", .errcounter = XE_TILE_HW_ERR_UNKNOWN_FATAL};

Nope. For one thing, the arrays really should be static const, placed in
rodata, and not mutable.

For another, if you have a platform with two or more different devices,
whichever gets probed last clobbers the data.

> +
> +		mask = ~(0 | DG2_DEV_ERR_STAT_NONFATAL_MASK);
> +		for_each_set_bit(i, &mask, DEV_ERR_STAT_MAX_ERROR_BIT)
> +			dev_err_stat_nonfatal_reg[i] = (struct error_msg_counter_pair)
> +			{.errmsg = "Undefined", .errcounter = XE_TILE_HW_ERR_UNKNOWN_NONFATAL};
> +
> +		mask = ~(0 | DG2_DEV_ERR_STAT_CORRECTABLE_MASK);
> +		for_each_set_bit(i, &mask, DEV_ERR_STAT_MAX_ERROR_BIT)
> +			dev_err_stat_correctable_reg[i] = (struct error_msg_counter_pair)
> +			{.errmsg = "Undefined", .errcounter = XE_TILE_HW_ERR_UNKNOWN_CORR};
> +	} else if (xe->info.platform == XE_PVC) {
> +		mask = ~(0 | PVC_DEV_ERR_STAT_FATAL_MASK);
> +		for_each_set_bit(i, &mask, DEV_ERR_STAT_MAX_ERROR_BIT)
> +			dev_err_stat_fatal_reg[i] = (struct error_msg_counter_pair)
> +			{.errmsg = "Undefined", .errcounter = XE_TILE_HW_ERR_UNKNOWN_FATAL};
> +
> +		mask = ~(0 | PVC_DEV_ERR_STAT_NONFATAL_MASK);
> +		for_each_set_bit(i, &mask, DEV_ERR_STAT_MAX_ERROR_BIT)
> +			dev_err_stat_nonfatal_reg[i] = (struct error_msg_counter_pair)
> +			{.errmsg = "Undefined", .errcounter = XE_TILE_HW_ERR_UNKNOWN_NONFATAL};
> +
> +		mask = ~(0 | PVC_DEV_ERR_STAT_CORRECTABLE_MASK);
> +		for_each_set_bit(i, &mask, DEV_ERR_STAT_MAX_ERROR_BIT)
> +			dev_err_stat_correctable_reg[i] = (struct error_msg_counter_pair)
> +			{.errmsg = "Undefined", .errcounter = XE_TILE_HW_ERR_UNKNOWN_CORR};
> +	}
> +}
> +
> +static void
> +xe_hw_error_source_handler(struct xe_tile *tile, const enum hardware_error hw_err)
> +{
> +	const char *hw_err_str = hardware_error_type_to_str(hw_err);
> +	struct error_msg_counter_pair *errstat;
> +	unsigned long errsrc;
> +	unsigned long flags;
> +	const char *errmsg;
> +	struct xe_gt *mmio;
> +	u32 counter;
> +	u32 errcntr;
> +	u32 errbit;
> +
> +	switch (hw_err) {
> +	case HARDWARE_ERROR_FATAL:
> +		errstat = (struct error_msg_counter_pair *)dev_err_stat_fatal_reg;

Why the casts?

> +		counter = XE_TILE_HW_ERR_UNKNOWN_FATAL;
> +		break;
> +	case HARDWARE_ERROR_NONFATAL:
> +		errstat = (struct error_msg_counter_pair *)dev_err_stat_nonfatal_reg;
> +		counter = XE_TILE_HW_ERR_UNKNOWN_NONFATAL;
> +		break;
> +	case HARDWARE_ERROR_CORRECTABLE:
> +		errstat = (struct error_msg_counter_pair *)dev_err_stat_correctable_reg;
> +		counter = XE_TILE_HW_ERR_UNKNOWN_CORR;
> +		break;
> +	default:
> +		return;
> +	}
> +
> +	spin_lock_irqsave(&tile_to_xe(tile)->irq.lock, flags);
> +	mmio = tile->primary_gt;
> +	errsrc = xe_mmio_read32(mmio, DEV_ERR_STAT_REG(hw_err));
> +
> +	if (!errsrc) {
> +		xe_tile_log_hw_err(tile, "DEV_ERR_STAT_REG_%s blank!\n", hw_err_str);
> +		goto unlock;
> +	}
> +
> +	for_each_set_bit(errbit, &errsrc, REG_SIZE) {
> +		if (errbit < DEV_ERR_STAT_MAX_ERROR_BIT) {
> +			errmsg = errstat[errbit].errmsg;
> +			errcntr = errstat[errbit].errcounter;
> +		} else {
> +			errmsg = "Undefined";
> +			errcntr = counter;
> +		}
> +
> +		if (hw_err == HARDWARE_ERROR_CORRECTABLE)
> +			xe_tile_log_hw_warn(tile, "%s %s error bit[%d] is set\n",
> +					    errmsg, hw_err_str, errbit);
> +		else
> +			xe_tile_log_hw_err(tile, "%s %s error bit[%d] is set\n",
> +					   errmsg, hw_err_str, errbit);
> +
> +		tile->errors.hw[errcntr]++;
> +	}
> +
> +	xe_mmio_write32(mmio, DEV_ERR_STAT_REG(hw_err), errsrc);
> +unlock:
> +	spin_unlock_irqrestore(&tile_to_xe(tile)->irq.lock, flags);
> +}
> +
> +/*
> + * XE Platforms adds three Error bits to the Master Interrupt
> + * Register to support error handling. These three bits are
> + * used to convey the class of error:
> + * FATAL, NONFATAL, or CORRECTABLE.
> + *
> + * To process an interrupt:
> + *       Determine source of error (IP block) by reading
> + *	 the Device Error Source Register (RW1C) that
> + *	 corresponds to the class of error being serviced
> + *	 and log the error.
> + */
> +static void
> +xe_hw_error_irq_handler(struct xe_tile *tile, const u32 master_ctl)
> +{
> +	enum hardware_error hw_err;
> +
> +	for (hw_err = 0; hw_err < HARDWARE_ERROR_MAX; hw_err++) {
> +		if (master_ctl & XE_ERROR_IRQ(hw_err))
> +			xe_hw_error_source_handler(tile, hw_err);
> +	}
> +}
> +
>  /*
>   * Top-level interrupt handler for Xe_LP+ and beyond.  These platforms have
>   * a "master tile" interrupt register which must be consulted before the
> @@ -413,6 +630,7 @@ static irqreturn_t dg1_irq_handler(int irq, void *arg)
>  		xe_mmio_write32(mmio, GFX_MSTR_IRQ, master_ctl);
>  
>  		gt_irq_handler(tile, master_ctl, intr_dw, identity);
> +		xe_hw_error_irq_handler(tile, master_ctl);
>  
>  		/*
>  		 * Display interrupts (including display backlight operations
> @@ -561,6 +779,8 @@ int xe_irq_install(struct xe_device *xe)
>  		return -EINVAL;
>  	}
>  
> +	update_valid_error_regs(xe);
> +
>  	xe->irq.enabled = true;
>  
>  	xe_irq_reset(xe);

-- 
Jani Nikula, Intel Open Source Graphics Center

^ permalink raw reply	[flat|nested] 10+ messages in thread

* Re: [Intel-xe] [PATCH v2 1/4] drm/xe: Handle errors from various components.
       [not found]     ` <ac574fd9-dce4-4f1e-a209-f4a400263273@intel.com>
@ 2023-08-10  9:38       ` Ghimiray, Himal Prasad
  2023-08-10 10:17         ` Jani Nikula
  0 siblings, 1 reply; 10+ messages in thread
From: Ghimiray, Himal Prasad @ 2023-08-10  9:38 UTC (permalink / raw)
  To: Jani Nikula, intel-xe

[-- Attachment #1: Type: text/plain, Size: 18861 bytes --]


On 10-08-2023 15:01, Ghimiray, Himal Prasad wrote:
>
>
> On 10-08-2023 13:24, Jani Nikula wrote:
>> On Thu, 10 Aug 2023, Himal Prasad Ghimiray<himal.prasad.ghimiray@intel.com>  wrote:
>>> The GFX device can generate numbers of classes of error under the new
>>> infrastructure: correctable, non-fatal, and fatal errors.
>>>
>>> The non-fatal and fatal error classes distinguish between levels of
>>> severity for uncorrectable errors. Driver will only handle logging
>>> of errors and updating counters from various components within the
>>> graphics device. Anything more will be handled at system level.
>>>
>>> For errors that will route as interrupts, three bits in the Master
>>> Interrupt Register will be used to convey the class of error.
>>>
>>> For each class of error: Determine source of error (IP block) by reading
>>> the Device Error Source Register (RW1C) that
>>> corresponds to the class of error being serviced.
>>>
>>> Bspec: 50875, 53073, 53074, 53075
>>>
>>> Cc: Rodrigo Vivi<rodrigo.vivi@intel.com>
>>> Cc: Aravind Iddamsetty<aravind.iddamsetty@intel.com>
>>> Cc: Matthew Brost<matthew.brost@intel.com>
>>> Cc: Matt Roper<matthew.d.roper@intel.com>
>>> Cc: Joonas Lahtinen<joonas.lahtinen@linux.intel.com>
>>> Cc: Jani Nikula<jani.nikula@intel.com>
>>> Signed-off-by: Himal Prasad Ghimiray<himal.prasad.ghimiray@intel.com>
>>> ---
>>>   drivers/gpu/drm/xe/regs/xe_regs.h            |   7 +
>>>   drivers/gpu/drm/xe/regs/xe_tile_error_regs.h | 108 +++++++++
>>>   drivers/gpu/drm/xe/xe_device_types.h         |   6 +
>>>   drivers/gpu/drm/xe/xe_irq.c                  | 220 +++++++++++++++++++
>>>   4 files changed, 341 insertions(+)
>>>   create mode 100644 drivers/gpu/drm/xe/regs/xe_tile_error_regs.h
>>>
>>> diff --git a/drivers/gpu/drm/xe/regs/xe_regs.h b/drivers/gpu/drm/xe/regs/xe_regs.h
>>> index ec45b1ba9db1..9901e55fc89c 100644
>>> --- a/drivers/gpu/drm/xe/regs/xe_regs.h
>>> +++ b/drivers/gpu/drm/xe/regs/xe_regs.h
>>> @@ -87,7 +87,14 @@
>>>   #define   GU_MISC_IRQ				REG_BIT(29)
>>>   #define   DISPLAY_IRQ				REG_BIT(16)
>>>   #define   GT_DW_IRQ(x)				REG_BIT(x)
>>> +#define   XE_ERROR_IRQ(x)			REG_BIT(26 + (x))
>>>   
>>>   #define PVC_RP_STATE_CAP			XE_REG(0x281014)
>>>   
>>> +enum hardware_error {
>>> +	HARDWARE_ERROR_CORRECTABLE = 0,
>>> +	HARDWARE_ERROR_NONFATAL = 1,
>>> +	HARDWARE_ERROR_FATAL = 2,
>>> +	HARDWARE_ERROR_MAX,
>>> +};
>> This file is about registers. IMO enums belong somewhere else. Define
>> hardware registers using macros.
>
> Hmm. Will look for better placeholder for this enum.
>
>>>   #endif
>>> diff --git a/drivers/gpu/drm/xe/regs/xe_tile_error_regs.h b/drivers/gpu/drm/xe/regs/xe_tile_error_regs.h
>>> new file mode 100644
>>> index 000000000000..fbb794b2f183
>>> --- /dev/null
>>> +++ b/drivers/gpu/drm/xe/regs/xe_tile_error_regs.h
>>> @@ -0,0 +1,108 @@
>>> +/* SPDX-License-Identifier: MIT */
>>> +/*
>>> + * Copyright © 2023 Intel Corporation
>>> + */
>>> +#ifndef XE_TILE_ERROR_REGS_H_
>>> +#define XE_TILE_ERROR_REGS_H_
>>> +
>>> +#include <linux/stddef.h>
>>> +
>>> +#define _DEV_ERR_STAT_NONFATAL                         0x100178
>>> +#define _DEV_ERR_STAT_CORRECTABLE                      0x10017c
>>> +#define DEV_ERR_STAT_REG(x)                            XE_REG(_PICK_EVEN((x), \
>>> +								_DEV_ERR_STAT_CORRECTABLE, \
>>> +								_DEV_ERR_STAT_NONFATAL))
>>> +
>>> +#define  DEV_ERR_STAT_MAX_ERROR_BIT         (21)
>>> +
>>> +/* Count of  Correctable and Uncorrectable errors reported on tile */
>>> +enum xe_tile_hw_errors {g
>>> +	XE_TILE_HW_ERR_GT_FATAL = 0,
>>> +	XE_TILE_HW_ERR_SGGI_FATAL,
>>> +	XE_TILE_HW_ERR_DISPLAY_FATAL,
>>> +	XE_TILE_HW_ERR_SGDI_FATAL,
>>> +	XE_TILE_HW_ERR_SGLI_FATAL,
>>> +	XE_TILE_HW_ERR_SGUNIT_FATAL,
>>> +	XE_TILE_HW_ERR_SGCI_FATAL,
>>> +	XE_TILE_HW_ERR_GSC_FATAL,
>>> +	XE_TILE_HW_ERR_SOC_FATAL,
>>> +	XE_TILE_HW_ERR_MERT_FATAL,
>>> +	XE_TILE_HW_ERR_SGMI_FATAL,
>>> +	XE_TILE_HW_ERR_UNKNOWN_FATAL,
>>> +	XE_TILE_HW_ERR_SGGI_NONFATAL,
>>> +	XE_TILE_HW_ERR_DISPLAY_NONFATAL,
>>> +	XE_TILE_HW_ERR_SGDI_NONFATAL,
>>> +	XE_TILE_HW_ERR_SGLI_NONFATAL,
>>> +	XE_TILE_HW_ERR_GT_NONFATAL,
>>> +	XE_TILE_HW_ERR_SGUNIT_NONFATAL,
>>> +	XE_TILE_HW_ERR_SGCI_NONFATAL,
>>> +	XE_TILE_HW_ERR_GSC_NONFATAL,
>>> +	XE_TILE_HW_ERR_SOC_NONFATAL,
>>> +	XE_TILE_HW_ERR_MERT_NONFATAL,
>>> +	XE_TILE_HW_ERR_SGMI_NONFATAL,
>>> +	XE_TILE_HW_ERR_UNKNOWN_NONFATAL,
>>> +	XE_TILE_HW_ERR_GT_CORR,
>>> +	XE_TILE_HW_ERR_DISPLAY_CORR,
>>> +	XE_TILE_HW_ERR_SGUNIT_CORR,
>>> +	XE_TILE_HW_ERR_GSC_CORR,
>>> +	XE_TILE_HW_ERR_SOC_CORR,
>>> +	XE_TILE_HW_ERR_UNKNOWN_CORR,
>>> +};
>> Ditto about enums and regs.
> Will address.
>>> +
>>> +#define XE_TILE_HW_ERROR_MAX (XE_TILE_HW_ERR_UNKNOWN_CORR + 1)
>> If it's an enum, adding that last in the enum does the trick.
> Makes sense.
>>> +
>>> +#define PVC_DEV_ERR_STAT_FATAL_MASK \
>>> +		(REG_BIT(0) | \
>>> +		 REG_BIT(1) | \
>>> +		 REG_BIT(8) | \
>>> +		 REG_BIT(9) | \
>>> +		 REG_BIT(13) | \
>>> +		 REG_BIT(16) | \
>>> +		 REG_BIT(20))
>>> +
>>> +#define PVC_DEV_ERR_STAT_NONFATAL_MASK \
>>> +		(REG_BIT(0) | \
>>> +		 REG_BIT(1) | \
>>> +		 REG_BIT(8) | \
>>> +		 REG_BIT(9) | \
>>> +		 REG_BIT(13) | \
>>> +		 REG_BIT(16) | \
>>> +		 REG_BIT(20))
>>> +
>>> +#define PVC_DEV_ERR_STAT_CORRECTABLE_MASK \
>>> +		(REG_BIT(0) | \
>>> +		 REG_BIT(8))
>>> +
>>> +#define DG2_DEV_ERR_STAT_FATAL_MASK \
>>> +		(REG_BIT(0) | \
>>> +		 REG_BIT(4) | \
>>> +		 REG_BIT(8) | \
>>> +		 REG_BIT(12) | \
>>> +		 REG_BIT(16))
>>> +
>>> +#define DG2_DEV_ERR_STAT_NONFATAL_MASK \
>>> +		(REG_BIT(0) | \
>>> +		 REG_BIT(4) | \
>>> +		 REG_BIT(8) | \
>>> +		 REG_BIT(12) | \
>>> +		 REG_BIT(16) | \
>>> +		 REG_BIT(20))
>>> +
>>> +#define DG2_DEV_ERR_STAT_CORRECTABLE_MASK \
>>> +		(REG_BIT(0) | \
>>> +		 REG_BIT(4) | \
>>> +		 REG_BIT(8) | \
>>> +		 REG_BIT(12) | \
>>> +		 REG_BIT(16))
>> Are the above supposed to match what's in xe_tile_hw_errors? Seems
>> rather unmaintainable.
> xe_tile_hw_errors contains superset of applicable platforms and mask determines
> what are applicable bits for a platform.
>>> +
>>> +#define REG_SIZE 32
>>> +
>>> +#define xe_tile_log_hw_err(tile, fmt, ...) \
>>> +	drm_err_ratelimited(&tile_to_xe(tile)->drm, HW_ERR "TILE%d detected " fmt, \
>>> +			    tile->id, ##__VA_ARGS__)
>>> +
>>> +#define xe_tile_log_hw_warn(tile, fmt, ...) \
>>> +	drm_warn(&tile_to_xe(tile)->drm, HW_ERR "TILE%d detected " fmt, \
>>> +		 tile->id, ##__VA_ARGS__)
>> Do we really want to keep adding new macros for all possible scenarios
>> in the driver? This is getting out of hand.
>>
>> Where's HW_ERR defined?
> include/linux/printk.h defines HW_ERR.
>>> +
>>> +#endif
>>> diff --git a/drivers/gpu/drm/xe/xe_device_types.h b/drivers/gpu/drm/xe/xe_device_types.h
>>> index f84ecb976f5d..1335ba74981a 100644
>>> --- a/drivers/gpu/drm/xe/xe_device_types.h
>>> +++ b/drivers/gpu/drm/xe/xe_device_types.h
>>> @@ -16,6 +16,7 @@
>>>   #include "xe_gt_types.h"
>>>   #include "xe_platform_types.h"
>>>   #include "xe_step_types.h"
>>> +#include "regs/xe_tile_error_regs.h"
>>>   
>>>   #if IS_ENABLED(CONFIG_DRM_XE_DISPLAY)
>>>   #include "ext/intel_device_info.h"
>>> @@ -166,6 +167,11 @@ struct xe_tile {
>>>   
>>>   	/** @sysfs: sysfs' kobj used by xe_tile_sysfs */
>>>   	struct kobject *sysfs;
>>> +
>>> +	/** @tile_hw_errors: hardware errors reported for the tile */
>>> +	struct tile_hw_errors {
>>> +		unsigned long hw[XE_TILE_HW_ERROR_MAX];
>> Even with the documentation comment, I have to look up the source code
>> to realize this is the *number* of errors for each class.
>>
>> Maybe "count" is more informative than "hw".
> Ok.
>>> +	} errors;
>>>   };
>>>   
>>>   /**
>>> diff --git a/drivers/gpu/drm/xe/xe_irq.c b/drivers/gpu/drm/xe/xe_irq.c
>>> index 2022a5643e01..04a665faea23 100644
>>> --- a/drivers/gpu/drm/xe/xe_irq.c
>>> +++ b/drivers/gpu/drm/xe/xe_irq.c
>>> @@ -362,6 +362,223 @@ static void dg1_intr_enable(struct xe_device *xe, bool stall)
>>>   		xe_mmio_read32(mmio, DG1_MSTR_TILE_INTR);
>>>   }
>>>   
>>> +static const char *
>>> +hardware_error_type_to_str(const enum hardware_error hw_err)
>>> +{
>>> +	switch (hw_err) {
>>> +	case HARDWARE_ERROR_CORRECTABLE:
>>> +		return "CORRECTABLE";
>>> +	case HARDWARE_ERROR_NONFATAL:
>>> +		return "NONFATAL";
>>> +	case HARDWARE_ERROR_FATAL:
>>> +		return "FATAL";
>>> +	default:
>>> +		return "UNKNOWN";
>>> +	}
>>> +}
>>> +
>>> +struct error_msg_counter_pair {
>>> +	const char *errmsg;
>>> +	int errcounter;
>> Counter? Or type/class/whatever?
>>
>>> +};
>>> +
>>> +struct error_msg_counter_pair dev_err_stat_fatal_reg[] = {
>>> +	{"GT",			XE_TILE_HW_ERR_GT_FATAL			/* Bit Pos 0 */},
>> Does this again tie the enums and the bit positions together, similar to
>> how the mask macros also do above?
>>
>> There needs to be a single point of truth for all of this.
>>
>> I think this needs a redesign.
>
> As commented above this is array which defines all valid bit positions 
> irrespective of platform. Mask was
>
> to determine platform specific applicability.
>
>> BR,
>> Jani.
>>
>>> +	{"SGGI Cmd Parity",	XE_TILE_HW_ERR_SGGI_FATAL		/* Bit Pos 1 */},
>>> +	{"Undefined",		XE_TILE_HW_ERR_UNKNOWN_FATAL},
>>> +	{"Undefined",		XE_TILE_HW_ERR_UNKNOWN_FATAL},
>>> +	{"DISPLAY",		XE_TILE_HW_ERR_DISPLAY_FATAL		/* Bit Pos 4 */},
>>> +	{"Undefined",		XE_TILE_HW_ERR_UNKNOWN_FATAL},
>>> +	{"Undefined",		XE_TILE_HW_ERR_UNKNOWN_FATAL},
>>> +	{"Undefined",		XE_TILE_HW_ERR_UNKNOWN_FATAL},
>>> +	{"GSC error",		XE_TILE_HW_ERR_GSC_FATAL		/* Bit Pos 8 */},
>>> +	{"SGLI Cmd Parity",	XE_TILE_HW_ERR_SGLI_FATAL		/* Bit Pos 9 */},
>>> +	{"Undefined",		XE_TILE_HW_ERR_UNKNOWN_FATAL},
>>> +	{"Undefined",		XE_TILE_HW_ERR_UNKNOWN_FATAL},
>>> +	{"SGUNIT",		XE_TILE_HW_ERR_SGUNIT_FATAL		/* Bit Pos 12 */},
>>> +	{"SGCI Cmd Parity",	XE_TILE_HW_ERR_SGCI_FATAL		/* Bit Pos 13 */},
>>> +	{"Undefined",		XE_TILE_HW_ERR_UNKNOWN_FATAL},
>>> +	{"Undefined",		XE_TILE_HW_ERR_UNKNOWN_FATAL},
>>> +	{"SOC ERROR",		XE_TILE_HW_ERR_SOC_FATAL		/* Bit Pos 16 */},
>>> +	{"Undefined",		XE_TILE_HW_ERR_UNKNOWN_FATAL},
>>> +	{"Undefined",		XE_TILE_HW_ERR_UNKNOWN_FATAL},
>>> +	{"Undefined",		XE_TILE_HW_ERR_UNKNOWN_FATAL},
>>> +	{"MERT Cmd Parity",	XE_TILE_HW_ERR_MERT_FATAL		/* Bit Pos 20 */},
>>> +};
>>> +
>>> +struct error_msg_counter_pair dev_err_stat_nonfatal_reg[] = {
>>> +	{"GT",			XE_TILE_HW_ERR_GT_NONFATAL		/* Bit Pos 0 */},
>>> +	{"SGGI Data Parity",	XE_TILE_HW_ERR_SGGI_NONFATAL		/* Bit Pos 1 */},
>>> +	{"Undefined",		XE_TILE_HW_ERR_UNKNOWN_NONFATAL},
>>> +	{"Undefined",		XE_TILE_HW_ERR_UNKNOWN_NONFATAL},
>>> +	{"DISPLAY",		XE_TILE_HW_ERR_DISPLAY_NONFATAL		/* Bit Pos 4 */},
>>> +	{"SGDI Data Parity",	XE_TILE_HW_ERR_SGDI_NONFATAL		/* Bit Pos 5 */},
>>> +	{"Undefined",		XE_TILE_HW_ERR_UNKNOWN_NONFATAL},
>>> +	{"Undefined",		XE_TILE_HW_ERR_UNKNOWN_NONFATAL},
>>> +	{"GSC",			XE_TILE_HW_ERR_GSC_NONFATAL		/* Bit Pos 8 */},
>>> +	{"SGLI Data Parity",	XE_TILE_HW_ERR_SGLI_NONFATAL		/* Bit Pos 9 */},
>>> +	{"Undefined",		XE_TILE_HW_ERR_UNKNOWN_NONFATAL},
>>> +	{"Undefined",		XE_TILE_HW_ERR_UNKNOWN_NONFATAL},
>>> +	{"SGUNIT",		XE_TILE_HW_ERR_SGUNIT_NONFATAL		/* Bit Pos 12 */},
>>> +	{"SGCI Data Parity",	XE_TILE_HW_ERR_SGCI_NONFATAL		/* Bit Pos 13 */},
>>> +	{"Undefined",		XE_TILE_HW_ERR_UNKNOWN_NONFATAL},
>>> +	{"Undefined",		XE_TILE_HW_ERR_UNKNOWN_NONFATAL},
>>> +	{"SOC",			XE_TILE_HW_ERR_SOC_NONFATAL		/* Bit Pos 16 */},
>>> +	{"Undefined",		XE_TILE_HW_ERR_UNKNOWN_NONFATAL		/* Bit Pos 17 */},
>>> +	{"Undefined",		XE_TILE_HW_ERR_UNKNOWN_NONFATAL},
>>> +	{"Undefined",		XE_TILE_HW_ERR_UNKNOWN_NONFATAL},
>>> +	{"MERT Data Parity",	XE_TILE_HW_ERR_MERT_NONFATAL		/* Bit Pos 20 */},
>>> +};
>>> +
>>> +struct error_msg_counter_pair dev_err_stat_correctable_reg[] = {
>>> +	{"GT",			XE_TILE_HW_ERR_GT_CORR			/* Bit Pos 0 */},
>>> +	{"Undefined",		XE_TILE_HW_ERR_UNKNOWN_CORR},
>>> +	{"Undefined",		XE_TILE_HW_ERR_UNKNOWN_CORR},
>>> +	{"Undefined",		XE_TILE_HW_ERR_UNKNOWN_CORR},
>>> +	{"DISPLAY",		XE_TILE_HW_ERR_DISPLAY_CORR		/* Bit Pos 4 */},
>>> +	{"Undefined",		XE_TILE_HW_ERR_UNKNOWN_CORR},
>>> +	{"Undefined",		XE_TILE_HW_ERR_UNKNOWN_CORR},
>>> +	{"Undefined",		XE_TILE_HW_ERR_UNKNOWN_CORR},
>>> +	{"GSC",			XE_TILE_HW_ERR_GSC_CORR			/* Bit Pos 8 */},
>>> +	{"Undefined",		XE_TILE_HW_ERR_UNKNOWN_CORR},
>>> +	{"Undefined",		XE_TILE_HW_ERR_UNKNOWN_CORR},
>>> +	{"Undefined",		XE_TILE_HW_ERR_UNKNOWN_CORR},
>>> +	{"SGUNIT",		XE_TILE_HW_ERR_SGUNIT_CORR		/* Bit Pos 12 */},
>>> +	{"Undefined",		XE_TILE_HW_ERR_UNKNOWN_CORR},
>>> +	{"Undefined",		XE_TILE_HW_ERR_UNKNOWN_CORR},
>>> +	{"Undefined",		XE_TILE_HW_ERR_UNKNOWN_CORR},
>>> +	{"SOC",			XE_TILE_HW_ERR_SOC_CORR			/* Bit Pos 16 */},
>>> +	{"Undefined",		XE_TILE_HW_ERR_UNKNOWN_CORR},
>>> +	{"Undefined",		XE_TILE_HW_ERR_UNKNOWN_CORR},
>>> +	{"Undefined",		XE_TILE_HW_ERR_UNKNOWN_CORR},
>>> +	{"Undefined",		XE_TILE_HW_ERR_UNKNOWN_CORR},
>>> +};
>>> +
>>> +static void update_valid_error_regs(struct xe_device *xe)
>>> +{
>>> +	unsigned long mask = 0;
>>> +
>>> +	u32 i;
>>> +
>>> +	if (xe->info.platform == XE_DG2) {
>>> +		mask = ~(0 | DG2_DEV_ERR_STAT_FATAL_MASK);
>>> +		for_each_set_bit(i, &mask, DEV_ERR_STAT_MAX_ERROR_BIT)
>>> +			dev_err_stat_fatal_reg[i] = (struct error_msg_counter_pair)
>>> +			{.errmsg = "Undefined", .errcounter = XE_TILE_HW_ERR_UNKNOWN_FATAL};
>> Nope. For one thing, the arrays really should be static const, placed in
>> rodata, and not mutable.
> My Idea of keeping it mutable is to avoid multiple platform specific 
> arrays.
>> For another, if you have a platform with two or more different devices,
>> whichever gets probed last clobbers the data.
>
> Thanks for pointing it out. I had missed this particular scenario.  I 
> believe defining platform specific array is better
>
> because then we can ensure them to be immutable and we will not be 
> required to have platform specific mask also.
>
> I believe this will help with maintainability too.
>
> BR
>
> Himal Ghimiray
>
>>> +
>>> +		mask = ~(0 | DG2_DEV_ERR_STAT_NONFATAL_MASK);
>>> +		for_each_set_bit(i, &mask, DEV_ERR_STAT_MAX_ERROR_BIT)
>>> +			dev_err_stat_nonfatal_reg[i] = (struct error_msg_counter_pair)
>>> +			{.errmsg = "Undefined", .errcounter = XE_TILE_HW_ERR_UNKNOWN_NONFATAL};
>>> +
>>> +		mask = ~(0 | DG2_DEV_ERR_STAT_CORRECTABLE_MASK);
>>> +		for_each_set_bit(i, &mask, DEV_ERR_STAT_MAX_ERROR_BIT)
>>> +			dev_err_stat_correctable_reg[i] = (struct error_msg_counter_pair)
>>> +			{.errmsg = "Undefined", .errcounter = XE_TILE_HW_ERR_UNKNOWN_CORR};
>>> +	} else if (xe->info.platform == XE_PVC) {
>>> +		mask = ~(0 | PVC_DEV_ERR_STAT_FATAL_MASK);
>>> +		for_each_set_bit(i, &mask, DEV_ERR_STAT_MAX_ERROR_BIT)
>>> +			dev_err_stat_fatal_reg[i] = (struct error_msg_counter_pair)
>>> +			{.errmsg = "Undefined", .errcounter = XE_TILE_HW_ERR_UNKNOWN_FATAL};
>>> +
>>> +		mask = ~(0 | PVC_DEV_ERR_STAT_NONFATAL_MASK);
>>> +		for_each_set_bit(i, &mask, DEV_ERR_STAT_MAX_ERROR_BIT)
>>> +			dev_err_stat_nonfatal_reg[i] = (struct error_msg_counter_pair)
>>> +			{.errmsg = "Undefined", .errcounter = XE_TILE_HW_ERR_UNKNOWN_NONFATAL};
>>> +
>>> +		mask = ~(0 | PVC_DEV_ERR_STAT_CORRECTABLE_MASK);
>>> +		for_each_set_bit(i, &mask, DEV_ERR_STAT_MAX_ERROR_BIT)
>>> +			dev_err_stat_correctable_reg[i] = (struct error_msg_counter_pair)
>>> +			{.errmsg = "Undefined", .errcounter = XE_TILE_HW_ERR_UNKNOWN_CORR};
>>> +	}
>>> +}
>>> +
>>> +static void
>>> +xe_hw_error_source_handler(struct xe_tile *tile, const enum hardware_error hw_err)
>>> +{
>>> +	const char *hw_err_str = hardware_error_type_to_str(hw_err);
>>> +	struct error_msg_counter_pair *errstat;
>>> +	unsigned long errsrc;
>>> +	unsigned long flags;
>>> +	const char *errmsg;
>>> +	struct xe_gt *mmio;
>>> +	u32 counter;
>>> +	u32 errcntr;
>>> +	u32 errbit;
>>> +
>>> +	switch (hw_err) {
>>> +	case HARDWARE_ERROR_FATAL:
>>> +		errstat = (struct error_msg_counter_pair *)dev_err_stat_fatal_reg;
>> Why the casts?
>>
>>> +		counter = XE_TILE_HW_ERR_UNKNOWN_FATAL;
>>> +		break;
>>> +	case HARDWARE_ERROR_NONFATAL:
>>> +		errstat = (struct error_msg_counter_pair *)dev_err_stat_nonfatal_reg;
>>> +		counter = XE_TILE_HW_ERR_UNKNOWN_NONFATAL;
>>> +		break;
>>> +	case HARDWARE_ERROR_CORRECTABLE:
>>> +		errstat = (struct error_msg_counter_pair *)dev_err_stat_correctable_reg;
>>> +		counter = XE_TILE_HW_ERR_UNKNOWN_CORR;
>>> +		break;
>>> +	default:
>>> +		return;
>>> +	}
>>> +
>>> +	spin_lock_irqsave(&tile_to_xe(tile)->irq.lock, flags);
>>> +	mmio = tile->primary_gt;
>>> +	errsrc = xe_mmio_read32(mmio, DEV_ERR_STAT_REG(hw_err));
>>> +
>>> +	if (!errsrc) {
>>> +		xe_tile_log_hw_err(tile, "DEV_ERR_STAT_REG_%s blank!\n", hw_err_str);
>>> +		goto unlock;
>>> +	}
>>> +
>>> +	for_each_set_bit(errbit, &errsrc, REG_SIZE) {
>>> +		if (errbit < DEV_ERR_STAT_MAX_ERROR_BIT) {
>>> +			errmsg = errstat[errbit].errmsg;
>>> +			errcntr = errstat[errbit].errcounter;
>>> +		} else {
>>> +			errmsg = "Undefined";
>>> +			errcntr = counter;
>>> +		}
>>> +
>>> +		if (hw_err == HARDWARE_ERROR_CORRECTABLE)
>>> +			xe_tile_log_hw_warn(tile, "%s %s error bit[%d] is set\n",
>>> +					    errmsg, hw_err_str, errbit);
>>> +		else
>>> +			xe_tile_log_hw_err(tile, "%s %s error bit[%d] is set\n",
>>> +					   errmsg, hw_err_str, errbit);
>>> +
>>> +		tile->errors.hw[errcntr]++;
>>> +	}
>>> +
>>> +	xe_mmio_write32(mmio, DEV_ERR_STAT_REG(hw_err), errsrc);
>>> +unlock:
>>> +	spin_unlock_irqrestore(&tile_to_xe(tile)->irq.lock, flags);
>>> +}
>>> +
>>> +/*
>>> + * XE Platforms adds three Error bits to the Master Interrupt
>>> + * Register to support error handling. These three bits are
>>> + * used to convey the class of error:
>>> + * FATAL, NONFATAL, or CORRECTABLE.
>>> + *
>>> + * To process an interrupt:
>>> + *       Determine source of error (IP block) by reading
>>> + *	 the Device Error Source Register (RW1C) that
>>> + *	 corresponds to the class of error being serviced
>>> + *	 and log the error.
>>> + */
>>> +static void
>>> +xe_hw_error_irq_handler(struct xe_tile *tile, const u32 master_ctl)
>>> +{
>>> +	enum hardware_error hw_err;
>>> +
>>> +	for (hw_err = 0; hw_err < HARDWARE_ERROR_MAX; hw_err++) {
>>> +		if (master_ctl & XE_ERROR_IRQ(hw_err))
>>> +			xe_hw_error_source_handler(tile, hw_err);
>>> +	}
>>> +}
>>> +
>>>   /*
>>>    * Top-level interrupt handler for Xe_LP+ and beyond.  These platforms have
>>>    * a "master tile" interrupt register which must be consulted before the
>>> @@ -413,6 +630,7 @@ static irqreturn_t dg1_irq_handler(int irq, void *arg)
>>>   		xe_mmio_write32(mmio, GFX_MSTR_IRQ, master_ctl);
>>>   
>>>   		gt_irq_handler(tile, master_ctl, intr_dw, identity);
>>> +		xe_hw_error_irq_handler(tile, master_ctl);
>>>   
>>>   		/*
>>>   		 * Display interrupts (including display backlight operations
>>> @@ -561,6 +779,8 @@ int xe_irq_install(struct xe_device *xe)
>>>   		return -EINVAL;
>>>   	}
>>>   
>>> +	update_valid_error_regs(xe);
>>> +
>>>   	xe->irq.enabled = true;
>>>   
>>>   	xe_irq_reset(xe);

[-- Attachment #2: Type: text/html, Size: 22242 bytes --]

^ permalink raw reply	[flat|nested] 10+ messages in thread

* Re: [Intel-xe] [PATCH v2 1/4] drm/xe: Handle errors from various components.
  2023-08-10  9:38       ` Ghimiray, Himal Prasad
@ 2023-08-10 10:17         ` Jani Nikula
  2023-08-10 11:20           ` Ghimiray, Himal Prasad
  0 siblings, 1 reply; 10+ messages in thread
From: Jani Nikula @ 2023-08-10 10:17 UTC (permalink / raw)
  To: Ghimiray, Himal Prasad, intel-xe


Is there any new content here? This is what it looks like to me:

https://lore.kernel.org/r/78b0e2d2-3c4c-484f-bcd4-fa896adef534@intel.com

BR,
Jani

On Thu, 10 Aug 2023, "Ghimiray, Himal Prasad" <himal.prasad.ghimiray@intel.com> wrote:
> On 10-08-2023 15:01, Ghimiray, Himal Prasad wrote:
>>
>>
>> On 10-08-2023 13:24, Jani Nikula wrote:
>>> On Thu, 10 Aug 2023, Himal Prasad Ghimiray<himal.prasad.ghimiray@intel.com>  wrote:
>>>> The GFX device can generate numbers of classes of error under the new
>>>> infrastructure: correctable, non-fatal, and fatal errors.
>>>>
>>>> The non-fatal and fatal error classes distinguish between levels of
>>>> severity for uncorrectable errors. Driver will only handle logging
>>>> of errors and updating counters from various components within the
>>>> graphics device. Anything more will be handled at system level.
>>>>
>>>> For errors that will route as interrupts, three bits in the Master
>>>> Interrupt Register will be used to convey the class of error.
>>>>
>>>> For each class of error: Determine source of error (IP block) by reading
>>>> the Device Error Source Register (RW1C) that
>>>> corresponds to the class of error being serviced.
>>>>
>>>> Bspec: 50875, 53073, 53074, 53075
>>>>
>>>> Cc: Rodrigo Vivi<rodrigo.vivi@intel.com>
>>>> Cc: Aravind Iddamsetty<aravind.iddamsetty@intel.com>
>>>> Cc: Matthew Brost<matthew.brost@intel.com>
>>>> Cc: Matt Roper<matthew.d.roper@intel.com>
>>>> Cc: Joonas Lahtinen<joonas.lahtinen@linux.intel.com>
>>>> Cc: Jani Nikula<jani.nikula@intel.com>
>>>> Signed-off-by: Himal Prasad Ghimiray<himal.prasad.ghimiray@intel.com>
>>>> ---
>>>>   drivers/gpu/drm/xe/regs/xe_regs.h            |   7 +
>>>>   drivers/gpu/drm/xe/regs/xe_tile_error_regs.h | 108 +++++++++
>>>>   drivers/gpu/drm/xe/xe_device_types.h         |   6 +
>>>>   drivers/gpu/drm/xe/xe_irq.c                  | 220 +++++++++++++++++++
>>>>   4 files changed, 341 insertions(+)
>>>>   create mode 100644 drivers/gpu/drm/xe/regs/xe_tile_error_regs.h
>>>>
>>>> diff --git a/drivers/gpu/drm/xe/regs/xe_regs.h b/drivers/gpu/drm/xe/regs/xe_regs.h
>>>> index ec45b1ba9db1..9901e55fc89c 100644
>>>> --- a/drivers/gpu/drm/xe/regs/xe_regs.h
>>>> +++ b/drivers/gpu/drm/xe/regs/xe_regs.h
>>>> @@ -87,7 +87,14 @@
>>>>   #define   GU_MISC_IRQ				REG_BIT(29)
>>>>   #define   DISPLAY_IRQ				REG_BIT(16)
>>>>   #define   GT_DW_IRQ(x)				REG_BIT(x)
>>>> +#define   XE_ERROR_IRQ(x)			REG_BIT(26 + (x))
>>>>   
>>>>   #define PVC_RP_STATE_CAP			XE_REG(0x281014)
>>>>   
>>>> +enum hardware_error {
>>>> +	HARDWARE_ERROR_CORRECTABLE = 0,
>>>> +	HARDWARE_ERROR_NONFATAL = 1,
>>>> +	HARDWARE_ERROR_FATAL = 2,
>>>> +	HARDWARE_ERROR_MAX,
>>>> +};
>>> This file is about registers. IMO enums belong somewhere else. Define
>>> hardware registers using macros.
>>
>> Hmm. Will look for better placeholder for this enum.
>>
>>>>   #endif
>>>> diff --git a/drivers/gpu/drm/xe/regs/xe_tile_error_regs.h b/drivers/gpu/drm/xe/regs/xe_tile_error_regs.h
>>>> new file mode 100644
>>>> index 000000000000..fbb794b2f183
>>>> --- /dev/null
>>>> +++ b/drivers/gpu/drm/xe/regs/xe_tile_error_regs.h
>>>> @@ -0,0 +1,108 @@
>>>> +/* SPDX-License-Identifier: MIT */
>>>> +/*
>>>> + * Copyright © 2023 Intel Corporation
>>>> + */
>>>> +#ifndef XE_TILE_ERROR_REGS_H_
>>>> +#define XE_TILE_ERROR_REGS_H_
>>>> +
>>>> +#include <linux/stddef.h>
>>>> +
>>>> +#define _DEV_ERR_STAT_NONFATAL                         0x100178
>>>> +#define _DEV_ERR_STAT_CORRECTABLE                      0x10017c
>>>> +#define DEV_ERR_STAT_REG(x)                            XE_REG(_PICK_EVEN((x), \
>>>> +								_DEV_ERR_STAT_CORRECTABLE, \
>>>> +								_DEV_ERR_STAT_NONFATAL))
>>>> +
>>>> +#define  DEV_ERR_STAT_MAX_ERROR_BIT         (21)
>>>> +
>>>> +/* Count of  Correctable and Uncorrectable errors reported on tile */
>>>> +enum xe_tile_hw_errors {g
>>>> +	XE_TILE_HW_ERR_GT_FATAL = 0,
>>>> +	XE_TILE_HW_ERR_SGGI_FATAL,
>>>> +	XE_TILE_HW_ERR_DISPLAY_FATAL,
>>>> +	XE_TILE_HW_ERR_SGDI_FATAL,
>>>> +	XE_TILE_HW_ERR_SGLI_FATAL,
>>>> +	XE_TILE_HW_ERR_SGUNIT_FATAL,
>>>> +	XE_TILE_HW_ERR_SGCI_FATAL,
>>>> +	XE_TILE_HW_ERR_GSC_FATAL,
>>>> +	XE_TILE_HW_ERR_SOC_FATAL,
>>>> +	XE_TILE_HW_ERR_MERT_FATAL,
>>>> +	XE_TILE_HW_ERR_SGMI_FATAL,
>>>> +	XE_TILE_HW_ERR_UNKNOWN_FATAL,
>>>> +	XE_TILE_HW_ERR_SGGI_NONFATAL,
>>>> +	XE_TILE_HW_ERR_DISPLAY_NONFATAL,
>>>> +	XE_TILE_HW_ERR_SGDI_NONFATAL,
>>>> +	XE_TILE_HW_ERR_SGLI_NONFATAL,
>>>> +	XE_TILE_HW_ERR_GT_NONFATAL,
>>>> +	XE_TILE_HW_ERR_SGUNIT_NONFATAL,
>>>> +	XE_TILE_HW_ERR_SGCI_NONFATAL,
>>>> +	XE_TILE_HW_ERR_GSC_NONFATAL,
>>>> +	XE_TILE_HW_ERR_SOC_NONFATAL,
>>>> +	XE_TILE_HW_ERR_MERT_NONFATAL,
>>>> +	XE_TILE_HW_ERR_SGMI_NONFATAL,
>>>> +	XE_TILE_HW_ERR_UNKNOWN_NONFATAL,
>>>> +	XE_TILE_HW_ERR_GT_CORR,
>>>> +	XE_TILE_HW_ERR_DISPLAY_CORR,
>>>> +	XE_TILE_HW_ERR_SGUNIT_CORR,
>>>> +	XE_TILE_HW_ERR_GSC_CORR,
>>>> +	XE_TILE_HW_ERR_SOC_CORR,
>>>> +	XE_TILE_HW_ERR_UNKNOWN_CORR,
>>>> +};
>>> Ditto about enums and regs.
>> Will address.
>>>> +
>>>> +#define XE_TILE_HW_ERROR_MAX (XE_TILE_HW_ERR_UNKNOWN_CORR + 1)
>>> If it's an enum, adding that last in the enum does the trick.
>> Makes sense.
>>>> +
>>>> +#define PVC_DEV_ERR_STAT_FATAL_MASK \
>>>> +		(REG_BIT(0) | \
>>>> +		 REG_BIT(1) | \
>>>> +		 REG_BIT(8) | \
>>>> +		 REG_BIT(9) | \
>>>> +		 REG_BIT(13) | \
>>>> +		 REG_BIT(16) | \
>>>> +		 REG_BIT(20))
>>>> +
>>>> +#define PVC_DEV_ERR_STAT_NONFATAL_MASK \
>>>> +		(REG_BIT(0) | \
>>>> +		 REG_BIT(1) | \
>>>> +		 REG_BIT(8) | \
>>>> +		 REG_BIT(9) | \
>>>> +		 REG_BIT(13) | \
>>>> +		 REG_BIT(16) | \
>>>> +		 REG_BIT(20))
>>>> +
>>>> +#define PVC_DEV_ERR_STAT_CORRECTABLE_MASK \
>>>> +		(REG_BIT(0) | \
>>>> +		 REG_BIT(8))
>>>> +
>>>> +#define DG2_DEV_ERR_STAT_FATAL_MASK \
>>>> +		(REG_BIT(0) | \
>>>> +		 REG_BIT(4) | \
>>>> +		 REG_BIT(8) | \
>>>> +		 REG_BIT(12) | \
>>>> +		 REG_BIT(16))
>>>> +
>>>> +#define DG2_DEV_ERR_STAT_NONFATAL_MASK \
>>>> +		(REG_BIT(0) | \
>>>> +		 REG_BIT(4) | \
>>>> +		 REG_BIT(8) | \
>>>> +		 REG_BIT(12) | \
>>>> +		 REG_BIT(16) | \
>>>> +		 REG_BIT(20))
>>>> +
>>>> +#define DG2_DEV_ERR_STAT_CORRECTABLE_MASK \
>>>> +		(REG_BIT(0) | \
>>>> +		 REG_BIT(4) | \
>>>> +		 REG_BIT(8) | \
>>>> +		 REG_BIT(12) | \
>>>> +		 REG_BIT(16))
>>> Are the above supposed to match what's in xe_tile_hw_errors? Seems
>>> rather unmaintainable.
>> xe_tile_hw_errors contains superset of applicable platforms and mask determines
>> what are applicable bits for a platform.
>>>> +
>>>> +#define REG_SIZE 32
>>>> +
>>>> +#define xe_tile_log_hw_err(tile, fmt, ...) \
>>>> +	drm_err_ratelimited(&tile_to_xe(tile)->drm, HW_ERR "TILE%d detected " fmt, \
>>>> +			    tile->id, ##__VA_ARGS__)
>>>> +
>>>> +#define xe_tile_log_hw_warn(tile, fmt, ...) \
>>>> +	drm_warn(&tile_to_xe(tile)->drm, HW_ERR "TILE%d detected " fmt, \
>>>> +		 tile->id, ##__VA_ARGS__)
>>> Do we really want to keep adding new macros for all possible scenarios
>>> in the driver? This is getting out of hand.
>>>
>>> Where's HW_ERR defined?
>> include/linux/printk.h defines HW_ERR.
>>>> +
>>>> +#endif
>>>> diff --git a/drivers/gpu/drm/xe/xe_device_types.h b/drivers/gpu/drm/xe/xe_device_types.h
>>>> index f84ecb976f5d..1335ba74981a 100644
>>>> --- a/drivers/gpu/drm/xe/xe_device_types.h
>>>> +++ b/drivers/gpu/drm/xe/xe_device_types.h
>>>> @@ -16,6 +16,7 @@
>>>>   #include "xe_gt_types.h"
>>>>   #include "xe_platform_types.h"
>>>>   #include "xe_step_types.h"
>>>> +#include "regs/xe_tile_error_regs.h"
>>>>   
>>>>   #if IS_ENABLED(CONFIG_DRM_XE_DISPLAY)
>>>>   #include "ext/intel_device_info.h"
>>>> @@ -166,6 +167,11 @@ struct xe_tile {
>>>>   
>>>>   	/** @sysfs: sysfs' kobj used by xe_tile_sysfs */
>>>>   	struct kobject *sysfs;
>>>> +
>>>> +	/** @tile_hw_errors: hardware errors reported for the tile */
>>>> +	struct tile_hw_errors {
>>>> +		unsigned long hw[XE_TILE_HW_ERROR_MAX];
>>> Even with the documentation comment, I have to look up the source code
>>> to realize this is the *number* of errors for each class.
>>>
>>> Maybe "count" is more informative than "hw".
>> Ok.
>>>> +	} errors;
>>>>   };
>>>>   
>>>>   /**
>>>> diff --git a/drivers/gpu/drm/xe/xe_irq.c b/drivers/gpu/drm/xe/xe_irq.c
>>>> index 2022a5643e01..04a665faea23 100644
>>>> --- a/drivers/gpu/drm/xe/xe_irq.c
>>>> +++ b/drivers/gpu/drm/xe/xe_irq.c
>>>> @@ -362,6 +362,223 @@ static void dg1_intr_enable(struct xe_device *xe, bool stall)
>>>>   		xe_mmio_read32(mmio, DG1_MSTR_TILE_INTR);
>>>>   }
>>>>   
>>>> +static const char *
>>>> +hardware_error_type_to_str(const enum hardware_error hw_err)
>>>> +{
>>>> +	switch (hw_err) {
>>>> +	case HARDWARE_ERROR_CORRECTABLE:
>>>> +		return "CORRECTABLE";
>>>> +	case HARDWARE_ERROR_NONFATAL:
>>>> +		return "NONFATAL";
>>>> +	case HARDWARE_ERROR_FATAL:
>>>> +		return "FATAL";
>>>> +	default:
>>>> +		return "UNKNOWN";
>>>> +	}
>>>> +}
>>>> +
>>>> +struct error_msg_counter_pair {
>>>> +	const char *errmsg;
>>>> +	int errcounter;
>>> Counter? Or type/class/whatever?
>>>
>>>> +};
>>>> +
>>>> +struct error_msg_counter_pair dev_err_stat_fatal_reg[] = {
>>>> +	{"GT",			XE_TILE_HW_ERR_GT_FATAL			/* Bit Pos 0 */},
>>> Does this again tie the enums and the bit positions together, similar to
>>> how the mask macros also do above?
>>>
>>> There needs to be a single point of truth for all of this.
>>>
>>> I think this needs a redesign.
>>
>> As commented above this is array which defines all valid bit positions 
>> irrespective of platform. Mask was
>>
>> to determine platform specific applicability.
>>
>>> BR,
>>> Jani.
>>>
>>>> +	{"SGGI Cmd Parity",	XE_TILE_HW_ERR_SGGI_FATAL		/* Bit Pos 1 */},
>>>> +	{"Undefined",		XE_TILE_HW_ERR_UNKNOWN_FATAL},
>>>> +	{"Undefined",		XE_TILE_HW_ERR_UNKNOWN_FATAL},
>>>> +	{"DISPLAY",		XE_TILE_HW_ERR_DISPLAY_FATAL		/* Bit Pos 4 */},
>>>> +	{"Undefined",		XE_TILE_HW_ERR_UNKNOWN_FATAL},
>>>> +	{"Undefined",		XE_TILE_HW_ERR_UNKNOWN_FATAL},
>>>> +	{"Undefined",		XE_TILE_HW_ERR_UNKNOWN_FATAL},
>>>> +	{"GSC error",		XE_TILE_HW_ERR_GSC_FATAL		/* Bit Pos 8 */},
>>>> +	{"SGLI Cmd Parity",	XE_TILE_HW_ERR_SGLI_FATAL		/* Bit Pos 9 */},
>>>> +	{"Undefined",		XE_TILE_HW_ERR_UNKNOWN_FATAL},
>>>> +	{"Undefined",		XE_TILE_HW_ERR_UNKNOWN_FATAL},
>>>> +	{"SGUNIT",		XE_TILE_HW_ERR_SGUNIT_FATAL		/* Bit Pos 12 */},
>>>> +	{"SGCI Cmd Parity",	XE_TILE_HW_ERR_SGCI_FATAL		/* Bit Pos 13 */},
>>>> +	{"Undefined",		XE_TILE_HW_ERR_UNKNOWN_FATAL},
>>>> +	{"Undefined",		XE_TILE_HW_ERR_UNKNOWN_FATAL},
>>>> +	{"SOC ERROR",		XE_TILE_HW_ERR_SOC_FATAL		/* Bit Pos 16 */},
>>>> +	{"Undefined",		XE_TILE_HW_ERR_UNKNOWN_FATAL},
>>>> +	{"Undefined",		XE_TILE_HW_ERR_UNKNOWN_FATAL},
>>>> +	{"Undefined",		XE_TILE_HW_ERR_UNKNOWN_FATAL},
>>>> +	{"MERT Cmd Parity",	XE_TILE_HW_ERR_MERT_FATAL		/* Bit Pos 20 */},
>>>> +};
>>>> +
>>>> +struct error_msg_counter_pair dev_err_stat_nonfatal_reg[] = {
>>>> +	{"GT",			XE_TILE_HW_ERR_GT_NONFATAL		/* Bit Pos 0 */},
>>>> +	{"SGGI Data Parity",	XE_TILE_HW_ERR_SGGI_NONFATAL		/* Bit Pos 1 */},
>>>> +	{"Undefined",		XE_TILE_HW_ERR_UNKNOWN_NONFATAL},
>>>> +	{"Undefined",		XE_TILE_HW_ERR_UNKNOWN_NONFATAL},
>>>> +	{"DISPLAY",		XE_TILE_HW_ERR_DISPLAY_NONFATAL		/* Bit Pos 4 */},
>>>> +	{"SGDI Data Parity",	XE_TILE_HW_ERR_SGDI_NONFATAL		/* Bit Pos 5 */},
>>>> +	{"Undefined",		XE_TILE_HW_ERR_UNKNOWN_NONFATAL},
>>>> +	{"Undefined",		XE_TILE_HW_ERR_UNKNOWN_NONFATAL},
>>>> +	{"GSC",			XE_TILE_HW_ERR_GSC_NONFATAL		/* Bit Pos 8 */},
>>>> +	{"SGLI Data Parity",	XE_TILE_HW_ERR_SGLI_NONFATAL		/* Bit Pos 9 */},
>>>> +	{"Undefined",		XE_TILE_HW_ERR_UNKNOWN_NONFATAL},
>>>> +	{"Undefined",		XE_TILE_HW_ERR_UNKNOWN_NONFATAL},
>>>> +	{"SGUNIT",		XE_TILE_HW_ERR_SGUNIT_NONFATAL		/* Bit Pos 12 */},
>>>> +	{"SGCI Data Parity",	XE_TILE_HW_ERR_SGCI_NONFATAL		/* Bit Pos 13 */},
>>>> +	{"Undefined",		XE_TILE_HW_ERR_UNKNOWN_NONFATAL},
>>>> +	{"Undefined",		XE_TILE_HW_ERR_UNKNOWN_NONFATAL},
>>>> +	{"SOC",			XE_TILE_HW_ERR_SOC_NONFATAL		/* Bit Pos 16 */},
>>>> +	{"Undefined",		XE_TILE_HW_ERR_UNKNOWN_NONFATAL		/* Bit Pos 17 */},
>>>> +	{"Undefined",		XE_TILE_HW_ERR_UNKNOWN_NONFATAL},
>>>> +	{"Undefined",		XE_TILE_HW_ERR_UNKNOWN_NONFATAL},
>>>> +	{"MERT Data Parity",	XE_TILE_HW_ERR_MERT_NONFATAL		/* Bit Pos 20 */},
>>>> +};
>>>> +
>>>> +struct error_msg_counter_pair dev_err_stat_correctable_reg[] = {
>>>> +	{"GT",			XE_TILE_HW_ERR_GT_CORR			/* Bit Pos 0 */},
>>>> +	{"Undefined",		XE_TILE_HW_ERR_UNKNOWN_CORR},
>>>> +	{"Undefined",		XE_TILE_HW_ERR_UNKNOWN_CORR},
>>>> +	{"Undefined",		XE_TILE_HW_ERR_UNKNOWN_CORR},
>>>> +	{"DISPLAY",		XE_TILE_HW_ERR_DISPLAY_CORR		/* Bit Pos 4 */},
>>>> +	{"Undefined",		XE_TILE_HW_ERR_UNKNOWN_CORR},
>>>> +	{"Undefined",		XE_TILE_HW_ERR_UNKNOWN_CORR},
>>>> +	{"Undefined",		XE_TILE_HW_ERR_UNKNOWN_CORR},
>>>> +	{"GSC",			XE_TILE_HW_ERR_GSC_CORR			/* Bit Pos 8 */},
>>>> +	{"Undefined",		XE_TILE_HW_ERR_UNKNOWN_CORR},
>>>> +	{"Undefined",		XE_TILE_HW_ERR_UNKNOWN_CORR},
>>>> +	{"Undefined",		XE_TILE_HW_ERR_UNKNOWN_CORR},
>>>> +	{"SGUNIT",		XE_TILE_HW_ERR_SGUNIT_CORR		/* Bit Pos 12 */},
>>>> +	{"Undefined",		XE_TILE_HW_ERR_UNKNOWN_CORR},
>>>> +	{"Undefined",		XE_TILE_HW_ERR_UNKNOWN_CORR},
>>>> +	{"Undefined",		XE_TILE_HW_ERR_UNKNOWN_CORR},
>>>> +	{"SOC",			XE_TILE_HW_ERR_SOC_CORR			/* Bit Pos 16 */},
>>>> +	{"Undefined",		XE_TILE_HW_ERR_UNKNOWN_CORR},
>>>> +	{"Undefined",		XE_TILE_HW_ERR_UNKNOWN_CORR},
>>>> +	{"Undefined",		XE_TILE_HW_ERR_UNKNOWN_CORR},
>>>> +	{"Undefined",		XE_TILE_HW_ERR_UNKNOWN_CORR},
>>>> +};
>>>> +
>>>> +static void update_valid_error_regs(struct xe_device *xe)
>>>> +{
>>>> +	unsigned long mask = 0;
>>>> +
>>>> +	u32 i;
>>>> +
>>>> +	if (xe->info.platform == XE_DG2) {
>>>> +		mask = ~(0 | DG2_DEV_ERR_STAT_FATAL_MASK);
>>>> +		for_each_set_bit(i, &mask, DEV_ERR_STAT_MAX_ERROR_BIT)
>>>> +			dev_err_stat_fatal_reg[i] = (struct error_msg_counter_pair)
>>>> +			{.errmsg = "Undefined", .errcounter = XE_TILE_HW_ERR_UNKNOWN_FATAL};
>>> Nope. For one thing, the arrays really should be static const, placed in
>>> rodata, and not mutable.
>> My Idea of keeping it mutable is to avoid multiple platform specific 
>> arrays.
>>> For another, if you have a platform with two or more different devices,
>>> whichever gets probed last clobbers the data.
>>
>> Thanks for pointing it out. I had missed this particular scenario.  I 
>> believe defining platform specific array is better
>>
>> because then we can ensure them to be immutable and we will not be 
>> required to have platform specific mask also.
>>
>> I believe this will help with maintainability too.
>>
>> BR
>>
>> Himal Ghimiray
>>
>>>> +
>>>> +		mask = ~(0 | DG2_DEV_ERR_STAT_NONFATAL_MASK);
>>>> +		for_each_set_bit(i, &mask, DEV_ERR_STAT_MAX_ERROR_BIT)
>>>> +			dev_err_stat_nonfatal_reg[i] = (struct error_msg_counter_pair)
>>>> +			{.errmsg = "Undefined", .errcounter = XE_TILE_HW_ERR_UNKNOWN_NONFATAL};
>>>> +
>>>> +		mask = ~(0 | DG2_DEV_ERR_STAT_CORRECTABLE_MASK);
>>>> +		for_each_set_bit(i, &mask, DEV_ERR_STAT_MAX_ERROR_BIT)
>>>> +			dev_err_stat_correctable_reg[i] = (struct error_msg_counter_pair)
>>>> +			{.errmsg = "Undefined", .errcounter = XE_TILE_HW_ERR_UNKNOWN_CORR};
>>>> +	} else if (xe->info.platform == XE_PVC) {
>>>> +		mask = ~(0 | PVC_DEV_ERR_STAT_FATAL_MASK);
>>>> +		for_each_set_bit(i, &mask, DEV_ERR_STAT_MAX_ERROR_BIT)
>>>> +			dev_err_stat_fatal_reg[i] = (struct error_msg_counter_pair)
>>>> +			{.errmsg = "Undefined", .errcounter = XE_TILE_HW_ERR_UNKNOWN_FATAL};
>>>> +
>>>> +		mask = ~(0 | PVC_DEV_ERR_STAT_NONFATAL_MASK);
>>>> +		for_each_set_bit(i, &mask, DEV_ERR_STAT_MAX_ERROR_BIT)
>>>> +			dev_err_stat_nonfatal_reg[i] = (struct error_msg_counter_pair)
>>>> +			{.errmsg = "Undefined", .errcounter = XE_TILE_HW_ERR_UNKNOWN_NONFATAL};
>>>> +
>>>> +		mask = ~(0 | PVC_DEV_ERR_STAT_CORRECTABLE_MASK);
>>>> +		for_each_set_bit(i, &mask, DEV_ERR_STAT_MAX_ERROR_BIT)
>>>> +			dev_err_stat_correctable_reg[i] = (struct error_msg_counter_pair)
>>>> +			{.errmsg = "Undefined", .errcounter = XE_TILE_HW_ERR_UNKNOWN_CORR};
>>>> +	}
>>>> +}
>>>> +
>>>> +static void
>>>> +xe_hw_error_source_handler(struct xe_tile *tile, const enum hardware_error hw_err)
>>>> +{
>>>> +	const char *hw_err_str = hardware_error_type_to_str(hw_err);
>>>> +	struct error_msg_counter_pair *errstat;
>>>> +	unsigned long errsrc;
>>>> +	unsigned long flags;
>>>> +	const char *errmsg;
>>>> +	struct xe_gt *mmio;
>>>> +	u32 counter;
>>>> +	u32 errcntr;
>>>> +	u32 errbit;
>>>> +
>>>> +	switch (hw_err) {
>>>> +	case HARDWARE_ERROR_FATAL:
>>>> +		errstat = (struct error_msg_counter_pair *)dev_err_stat_fatal_reg;
>>> Why the casts?
>>>
>>>> +		counter = XE_TILE_HW_ERR_UNKNOWN_FATAL;
>>>> +		break;
>>>> +	case HARDWARE_ERROR_NONFATAL:
>>>> +		errstat = (struct error_msg_counter_pair *)dev_err_stat_nonfatal_reg;
>>>> +		counter = XE_TILE_HW_ERR_UNKNOWN_NONFATAL;
>>>> +		break;
>>>> +	case HARDWARE_ERROR_CORRECTABLE:
>>>> +		errstat = (struct error_msg_counter_pair *)dev_err_stat_correctable_reg;
>>>> +		counter = XE_TILE_HW_ERR_UNKNOWN_CORR;
>>>> +		break;
>>>> +	default:
>>>> +		return;
>>>> +	}
>>>> +
>>>> +	spin_lock_irqsave(&tile_to_xe(tile)->irq.lock, flags);
>>>> +	mmio = tile->primary_gt;
>>>> +	errsrc = xe_mmio_read32(mmio, DEV_ERR_STAT_REG(hw_err));
>>>> +
>>>> +	if (!errsrc) {
>>>> +		xe_tile_log_hw_err(tile, "DEV_ERR_STAT_REG_%s blank!\n", hw_err_str);
>>>> +		goto unlock;
>>>> +	}
>>>> +
>>>> +	for_each_set_bit(errbit, &errsrc, REG_SIZE) {
>>>> +		if (errbit < DEV_ERR_STAT_MAX_ERROR_BIT) {
>>>> +			errmsg = errstat[errbit].errmsg;
>>>> +			errcntr = errstat[errbit].errcounter;
>>>> +		} else {
>>>> +			errmsg = "Undefined";
>>>> +			errcntr = counter;
>>>> +		}
>>>> +
>>>> +		if (hw_err == HARDWARE_ERROR_CORRECTABLE)
>>>> +			xe_tile_log_hw_warn(tile, "%s %s error bit[%d] is set\n",
>>>> +					    errmsg, hw_err_str, errbit);
>>>> +		else
>>>> +			xe_tile_log_hw_err(tile, "%s %s error bit[%d] is set\n",
>>>> +					   errmsg, hw_err_str, errbit);
>>>> +
>>>> +		tile->errors.hw[errcntr]++;
>>>> +	}
>>>> +
>>>> +	xe_mmio_write32(mmio, DEV_ERR_STAT_REG(hw_err), errsrc);
>>>> +unlock:
>>>> +	spin_unlock_irqrestore(&tile_to_xe(tile)->irq.lock, flags);
>>>> +}
>>>> +
>>>> +/*
>>>> + * XE Platforms adds three Error bits to the Master Interrupt
>>>> + * Register to support error handling. These three bits are
>>>> + * used to convey the class of error:
>>>> + * FATAL, NONFATAL, or CORRECTABLE.
>>>> + *
>>>> + * To process an interrupt:
>>>> + *       Determine source of error (IP block) by reading
>>>> + *	 the Device Error Source Register (RW1C) that
>>>> + *	 corresponds to the class of error being serviced
>>>> + *	 and log the error.
>>>> + */
>>>> +static void
>>>> +xe_hw_error_irq_handler(struct xe_tile *tile, const u32 master_ctl)
>>>> +{
>>>> +	enum hardware_error hw_err;
>>>> +
>>>> +	for (hw_err = 0; hw_err < HARDWARE_ERROR_MAX; hw_err++) {
>>>> +		if (master_ctl & XE_ERROR_IRQ(hw_err))
>>>> +			xe_hw_error_source_handler(tile, hw_err);
>>>> +	}
>>>> +}
>>>> +
>>>>   /*
>>>>    * Top-level interrupt handler for Xe_LP+ and beyond.  These platforms have
>>>>    * a "master tile" interrupt register which must be consulted before the
>>>> @@ -413,6 +630,7 @@ static irqreturn_t dg1_irq_handler(int irq, void *arg)
>>>>   		xe_mmio_write32(mmio, GFX_MSTR_IRQ, master_ctl);
>>>>   
>>>>   		gt_irq_handler(tile, master_ctl, intr_dw, identity);
>>>> +		xe_hw_error_irq_handler(tile, master_ctl);
>>>>   
>>>>   		/*
>>>>   		 * Display interrupts (including display backlight operations
>>>> @@ -561,6 +779,8 @@ int xe_irq_install(struct xe_device *xe)
>>>>   		return -EINVAL;
>>>>   	}
>>>>   
>>>> +	update_valid_error_regs(xe);
>>>> +
>>>>   	xe->irq.enabled = true;
>>>>   
>>>>   	xe_irq_reset(xe);

-- 
Jani Nikula, Intel Open Source Graphics Center

^ permalink raw reply	[flat|nested] 10+ messages in thread

* Re: [Intel-xe] [PATCH v2 1/4] drm/xe: Handle errors from various components.
  2023-08-10 10:17         ` Jani Nikula
@ 2023-08-10 11:20           ` Ghimiray, Himal Prasad
  0 siblings, 0 replies; 10+ messages in thread
From: Ghimiray, Himal Prasad @ 2023-08-10 11:20 UTC (permalink / raw)
  To: Nikula, Jani, intel-xe@lists.freedesktop.org

[-- Attachment #1: Type: text/plain, Size: 22826 bytes --]

Hi Nikula,

I had missed adding mailing list to reply done to you.

Below mail is  cc addition to mailing list without change in content sent to you as reply.

BR
Himal

________________________________
From: Nikula, Jani <jani.nikula@intel.com>
Sent: Thursday, August 10, 2023 3:47:49 pm
To: Ghimiray, Himal Prasad <himal.prasad.ghimiray@intel.com>; intel-xe@lists.freedesktop.org <intel-xe@lists.freedesktop.org>
Subject: Re: [PATCH v2 1/4] drm/xe: Handle errors from various components.


Is there any new content here? This is what it looks like to me:

https://lore.kernel.org/r/78b0e2d2-3c4c-484f-bcd4-fa896adef534@intel.com

BR,
Jani

On Thu, 10 Aug 2023, "Ghimiray, Himal Prasad" <himal.prasad.ghimiray@intel.com> wrote:
> On 10-08-2023 15:01, Ghimiray, Himal Prasad wrote:
>>
>>
>> On 10-08-2023 13:24, Jani Nikula wrote:
>>> On Thu, 10 Aug 2023, Himal Prasad Ghimiray<himal.prasad.ghimiray@intel.com>  wrote:
>>>> The GFX device can generate numbers of classes of error under the new
>>>> infrastructure: correctable, non-fatal, and fatal errors.
>>>>
>>>> The non-fatal and fatal error classes distinguish between levels of
>>>> severity for uncorrectable errors. Driver will only handle logging
>>>> of errors and updating counters from various components within the
>>>> graphics device. Anything more will be handled at system level.
>>>>
>>>> For errors that will route as interrupts, three bits in the Master
>>>> Interrupt Register will be used to convey the class of error.
>>>>
>>>> For each class of error: Determine source of error (IP block) by reading
>>>> the Device Error Source Register (RW1C) that
>>>> corresponds to the class of error being serviced.
>>>>
>>>> Bspec: 50875, 53073, 53074, 53075
>>>>
>>>> Cc: Rodrigo Vivi<rodrigo.vivi@intel.com>
>>>> Cc: Aravind Iddamsetty<aravind.iddamsetty@intel.com>
>>>> Cc: Matthew Brost<matthew.brost@intel.com>
>>>> Cc: Matt Roper<matthew.d.roper@intel.com>
>>>> Cc: Joonas Lahtinen<joonas.lahtinen@linux.intel.com>
>>>> Cc: Jani Nikula<jani.nikula@intel.com>
>>>> Signed-off-by: Himal Prasad Ghimiray<himal.prasad.ghimiray@intel.com>
>>>> ---
>>>>   drivers/gpu/drm/xe/regs/xe_regs.h            |   7 +
>>>>   drivers/gpu/drm/xe/regs/xe_tile_error_regs.h | 108 +++++++++
>>>>   drivers/gpu/drm/xe/xe_device_types.h         |   6 +
>>>>   drivers/gpu/drm/xe/xe_irq.c                  | 220 +++++++++++++++++++
>>>>   4 files changed, 341 insertions(+)
>>>>   create mode 100644 drivers/gpu/drm/xe/regs/xe_tile_error_regs.h
>>>>
>>>> diff --git a/drivers/gpu/drm/xe/regs/xe_regs.h b/drivers/gpu/drm/xe/regs/xe_regs.h
>>>> index ec45b1ba9db1..9901e55fc89c 100644
>>>> --- a/drivers/gpu/drm/xe/regs/xe_regs.h
>>>> +++ b/drivers/gpu/drm/xe/regs/xe_regs.h
>>>> @@ -87,7 +87,14 @@
>>>>   #define   GU_MISC_IRQ                            REG_BIT(29)
>>>>   #define   DISPLAY_IRQ                            REG_BIT(16)
>>>>   #define   GT_DW_IRQ(x)                           REG_BIT(x)
>>>> +#define   XE_ERROR_IRQ(x)                 REG_BIT(26 + (x))
>>>>
>>>>   #define PVC_RP_STATE_CAP                 XE_REG(0x281014)
>>>>
>>>> +enum hardware_error {
>>>> +  HARDWARE_ERROR_CORRECTABLE = 0,
>>>> +  HARDWARE_ERROR_NONFATAL = 1,
>>>> +  HARDWARE_ERROR_FATAL = 2,
>>>> +  HARDWARE_ERROR_MAX,
>>>> +};
>>> This file is about registers. IMO enums belong somewhere else. Define
>>> hardware registers using macros.
>>
>> Hmm. Will look for better placeholder for this enum.
>>
>>>>   #endif
>>>> diff --git a/drivers/gpu/drm/xe/regs/xe_tile_error_regs.h b/drivers/gpu/drm/xe/regs/xe_tile_error_regs.h
>>>> new file mode 100644
>>>> index 000000000000..fbb794b2f183
>>>> --- /dev/null
>>>> +++ b/drivers/gpu/drm/xe/regs/xe_tile_error_regs.h
>>>> @@ -0,0 +1,108 @@
>>>> +/* SPDX-License-Identifier: MIT */
>>>> +/*
>>>> + * Copyright © 2023 Intel Corporation
>>>> + */
>>>> +#ifndef XE_TILE_ERROR_REGS_H_
>>>> +#define XE_TILE_ERROR_REGS_H_
>>>> +
>>>> +#include <linux/stddef.h>
>>>> +
>>>> +#define _DEV_ERR_STAT_NONFATAL                         0x100178
>>>> +#define _DEV_ERR_STAT_CORRECTABLE                      0x10017c
>>>> +#define DEV_ERR_STAT_REG(x)                            XE_REG(_PICK_EVEN((x), \
>>>> +                                                          _DEV_ERR_STAT_CORRECTABLE, \
>>>> +                                                          _DEV_ERR_STAT_NONFATAL))
>>>> +
>>>> +#define  DEV_ERR_STAT_MAX_ERROR_BIT         (21)
>>>> +
>>>> +/* Count of  Correctable and Uncorrectable errors reported on tile */
>>>> +enum xe_tile_hw_errors {g
>>>> +  XE_TILE_HW_ERR_GT_FATAL = 0,
>>>> +  XE_TILE_HW_ERR_SGGI_FATAL,
>>>> +  XE_TILE_HW_ERR_DISPLAY_FATAL,
>>>> +  XE_TILE_HW_ERR_SGDI_FATAL,
>>>> +  XE_TILE_HW_ERR_SGLI_FATAL,
>>>> +  XE_TILE_HW_ERR_SGUNIT_FATAL,
>>>> +  XE_TILE_HW_ERR_SGCI_FATAL,
>>>> +  XE_TILE_HW_ERR_GSC_FATAL,
>>>> +  XE_TILE_HW_ERR_SOC_FATAL,
>>>> +  XE_TILE_HW_ERR_MERT_FATAL,
>>>> +  XE_TILE_HW_ERR_SGMI_FATAL,
>>>> +  XE_TILE_HW_ERR_UNKNOWN_FATAL,
>>>> +  XE_TILE_HW_ERR_SGGI_NONFATAL,
>>>> +  XE_TILE_HW_ERR_DISPLAY_NONFATAL,
>>>> +  XE_TILE_HW_ERR_SGDI_NONFATAL,
>>>> +  XE_TILE_HW_ERR_SGLI_NONFATAL,
>>>> +  XE_TILE_HW_ERR_GT_NONFATAL,
>>>> +  XE_TILE_HW_ERR_SGUNIT_NONFATAL,
>>>> +  XE_TILE_HW_ERR_SGCI_NONFATAL,
>>>> +  XE_TILE_HW_ERR_GSC_NONFATAL,
>>>> +  XE_TILE_HW_ERR_SOC_NONFATAL,
>>>> +  XE_TILE_HW_ERR_MERT_NONFATAL,
>>>> +  XE_TILE_HW_ERR_SGMI_NONFATAL,
>>>> +  XE_TILE_HW_ERR_UNKNOWN_NONFATAL,
>>>> +  XE_TILE_HW_ERR_GT_CORR,
>>>> +  XE_TILE_HW_ERR_DISPLAY_CORR,
>>>> +  XE_TILE_HW_ERR_SGUNIT_CORR,
>>>> +  XE_TILE_HW_ERR_GSC_CORR,
>>>> +  XE_TILE_HW_ERR_SOC_CORR,
>>>> +  XE_TILE_HW_ERR_UNKNOWN_CORR,
>>>> +};
>>> Ditto about enums and regs.
>> Will address.
>>>> +
>>>> +#define XE_TILE_HW_ERROR_MAX (XE_TILE_HW_ERR_UNKNOWN_CORR + 1)
>>> If it's an enum, adding that last in the enum does the trick.
>> Makes sense.
>>>> +
>>>> +#define PVC_DEV_ERR_STAT_FATAL_MASK \
>>>> +          (REG_BIT(0) | \
>>>> +           REG_BIT(1) | \
>>>> +           REG_BIT(8) | \
>>>> +           REG_BIT(9) | \
>>>> +           REG_BIT(13) | \
>>>> +           REG_BIT(16) | \
>>>> +           REG_BIT(20))
>>>> +
>>>> +#define PVC_DEV_ERR_STAT_NONFATAL_MASK \
>>>> +          (REG_BIT(0) | \
>>>> +           REG_BIT(1) | \
>>>> +           REG_BIT(8) | \
>>>> +           REG_BIT(9) | \
>>>> +           REG_BIT(13) | \
>>>> +           REG_BIT(16) | \
>>>> +           REG_BIT(20))
>>>> +
>>>> +#define PVC_DEV_ERR_STAT_CORRECTABLE_MASK \
>>>> +          (REG_BIT(0) | \
>>>> +           REG_BIT(8))
>>>> +
>>>> +#define DG2_DEV_ERR_STAT_FATAL_MASK \
>>>> +          (REG_BIT(0) | \
>>>> +           REG_BIT(4) | \
>>>> +           REG_BIT(8) | \
>>>> +           REG_BIT(12) | \
>>>> +           REG_BIT(16))
>>>> +
>>>> +#define DG2_DEV_ERR_STAT_NONFATAL_MASK \
>>>> +          (REG_BIT(0) | \
>>>> +           REG_BIT(4) | \
>>>> +           REG_BIT(8) | \
>>>> +           REG_BIT(12) | \
>>>> +           REG_BIT(16) | \
>>>> +           REG_BIT(20))
>>>> +
>>>> +#define DG2_DEV_ERR_STAT_CORRECTABLE_MASK \
>>>> +          (REG_BIT(0) | \
>>>> +           REG_BIT(4) | \
>>>> +           REG_BIT(8) | \
>>>> +           REG_BIT(12) | \
>>>> +           REG_BIT(16))
>>> Are the above supposed to match what's in xe_tile_hw_errors? Seems
>>> rather unmaintainable.
>> xe_tile_hw_errors contains superset of applicable platforms and mask determines
>> what are applicable bits for a platform.
>>>> +
>>>> +#define REG_SIZE 32
>>>> +
>>>> +#define xe_tile_log_hw_err(tile, fmt, ...) \
>>>> +  drm_err_ratelimited(&tile_to_xe(tile)->drm, HW_ERR "TILE%d detected " fmt, \
>>>> +                      tile->id, ##__VA_ARGS__)
>>>> +
>>>> +#define xe_tile_log_hw_warn(tile, fmt, ...) \
>>>> +  drm_warn(&tile_to_xe(tile)->drm, HW_ERR "TILE%d detected " fmt, \
>>>> +           tile->id, ##__VA_ARGS__)
>>> Do we really want to keep adding new macros for all possible scenarios
>>> in the driver? This is getting out of hand.
>>>
>>> Where's HW_ERR defined?
>> include/linux/printk.h defines HW_ERR.
>>>> +
>>>> +#endif
>>>> diff --git a/drivers/gpu/drm/xe/xe_device_types.h b/drivers/gpu/drm/xe/xe_device_types.h
>>>> index f84ecb976f5d..1335ba74981a 100644
>>>> --- a/drivers/gpu/drm/xe/xe_device_types.h
>>>> +++ b/drivers/gpu/drm/xe/xe_device_types.h
>>>> @@ -16,6 +16,7 @@
>>>>   #include "xe_gt_types.h"
>>>>   #include "xe_platform_types.h"
>>>>   #include "xe_step_types.h"
>>>> +#include "regs/xe_tile_error_regs.h"
>>>>
>>>>   #if IS_ENABLED(CONFIG_DRM_XE_DISPLAY)
>>>>   #include "ext/intel_device_info.h"
>>>> @@ -166,6 +167,11 @@ struct xe_tile {
>>>>
>>>>     /** @sysfs: sysfs' kobj used by xe_tile_sysfs */
>>>>     struct kobject *sysfs;
>>>> +
>>>> +  /** @tile_hw_errors: hardware errors reported for the tile */
>>>> +  struct tile_hw_errors {
>>>> +          unsigned long hw[XE_TILE_HW_ERROR_MAX];
>>> Even with the documentation comment, I have to look up the source code
>>> to realize this is the *number* of errors for each class.
>>>
>>> Maybe "count" is more informative than "hw".
>> Ok.
>>>> +  } errors;
>>>>   };
>>>>
>>>>   /**
>>>> diff --git a/drivers/gpu/drm/xe/xe_irq.c b/drivers/gpu/drm/xe/xe_irq.c
>>>> index 2022a5643e01..04a665faea23 100644
>>>> --- a/drivers/gpu/drm/xe/xe_irq.c
>>>> +++ b/drivers/gpu/drm/xe/xe_irq.c
>>>> @@ -362,6 +362,223 @@ static void dg1_intr_enable(struct xe_device *xe, bool stall)
>>>>             xe_mmio_read32(mmio, DG1_MSTR_TILE_INTR);
>>>>   }
>>>>
>>>> +static const char *
>>>> +hardware_error_type_to_str(const enum hardware_error hw_err)
>>>> +{
>>>> +  switch (hw_err) {
>>>> +  case HARDWARE_ERROR_CORRECTABLE:
>>>> +          return "CORRECTABLE";
>>>> +  case HARDWARE_ERROR_NONFATAL:
>>>> +          return "NONFATAL";
>>>> +  case HARDWARE_ERROR_FATAL:
>>>> +          return "FATAL";
>>>> +  default:
>>>> +          return "UNKNOWN";
>>>> +  }
>>>> +}
>>>> +
>>>> +struct error_msg_counter_pair {
>>>> +  const char *errmsg;
>>>> +  int errcounter;
>>> Counter? Or type/class/whatever?
>>>
>>>> +};
>>>> +
>>>> +struct error_msg_counter_pair dev_err_stat_fatal_reg[] = {
>>>> +  {"GT",                  XE_TILE_HW_ERR_GT_FATAL                 /* Bit Pos 0 */},
>>> Does this again tie the enums and the bit positions together, similar to
>>> how the mask macros also do above?
>>>
>>> There needs to be a single point of truth for all of this.
>>>
>>> I think this needs a redesign.
>>
>> As commented above this is array which defines all valid bit positions
>> irrespective of platform. Mask was
>>
>> to determine platform specific applicability.
>>
>>> BR,
>>> Jani.
>>>
>>>> +  {"SGGI Cmd Parity",     XE_TILE_HW_ERR_SGGI_FATAL               /* Bit Pos 1 */},
>>>> +  {"Undefined",           XE_TILE_HW_ERR_UNKNOWN_FATAL},
>>>> +  {"Undefined",           XE_TILE_HW_ERR_UNKNOWN_FATAL},
>>>> +  {"DISPLAY",             XE_TILE_HW_ERR_DISPLAY_FATAL            /* Bit Pos 4 */},
>>>> +  {"Undefined",           XE_TILE_HW_ERR_UNKNOWN_FATAL},
>>>> +  {"Undefined",           XE_TILE_HW_ERR_UNKNOWN_FATAL},
>>>> +  {"Undefined",           XE_TILE_HW_ERR_UNKNOWN_FATAL},
>>>> +  {"GSC error",           XE_TILE_HW_ERR_GSC_FATAL                /* Bit Pos 8 */},
>>>> +  {"SGLI Cmd Parity",     XE_TILE_HW_ERR_SGLI_FATAL               /* Bit Pos 9 */},
>>>> +  {"Undefined",           XE_TILE_HW_ERR_UNKNOWN_FATAL},
>>>> +  {"Undefined",           XE_TILE_HW_ERR_UNKNOWN_FATAL},
>>>> +  {"SGUNIT",              XE_TILE_HW_ERR_SGUNIT_FATAL             /* Bit Pos 12 */},
>>>> +  {"SGCI Cmd Parity",     XE_TILE_HW_ERR_SGCI_FATAL               /* Bit Pos 13 */},
>>>> +  {"Undefined",           XE_TILE_HW_ERR_UNKNOWN_FATAL},
>>>> +  {"Undefined",           XE_TILE_HW_ERR_UNKNOWN_FATAL},
>>>> +  {"SOC ERROR",           XE_TILE_HW_ERR_SOC_FATAL                /* Bit Pos 16 */},
>>>> +  {"Undefined",           XE_TILE_HW_ERR_UNKNOWN_FATAL},
>>>> +  {"Undefined",           XE_TILE_HW_ERR_UNKNOWN_FATAL},
>>>> +  {"Undefined",           XE_TILE_HW_ERR_UNKNOWN_FATAL},
>>>> +  {"MERT Cmd Parity",     XE_TILE_HW_ERR_MERT_FATAL               /* Bit Pos 20 */},
>>>> +};
>>>> +
>>>> +struct error_msg_counter_pair dev_err_stat_nonfatal_reg[] = {
>>>> +  {"GT",                  XE_TILE_HW_ERR_GT_NONFATAL              /* Bit Pos 0 */},
>>>> +  {"SGGI Data Parity",    XE_TILE_HW_ERR_SGGI_NONFATAL            /* Bit Pos 1 */},
>>>> +  {"Undefined",           XE_TILE_HW_ERR_UNKNOWN_NONFATAL},
>>>> +  {"Undefined",           XE_TILE_HW_ERR_UNKNOWN_NONFATAL},
>>>> +  {"DISPLAY",             XE_TILE_HW_ERR_DISPLAY_NONFATAL         /* Bit Pos 4 */},
>>>> +  {"SGDI Data Parity",    XE_TILE_HW_ERR_SGDI_NONFATAL            /* Bit Pos 5 */},
>>>> +  {"Undefined",           XE_TILE_HW_ERR_UNKNOWN_NONFATAL},
>>>> +  {"Undefined",           XE_TILE_HW_ERR_UNKNOWN_NONFATAL},
>>>> +  {"GSC",                 XE_TILE_HW_ERR_GSC_NONFATAL             /* Bit Pos 8 */},
>>>> +  {"SGLI Data Parity",    XE_TILE_HW_ERR_SGLI_NONFATAL            /* Bit Pos 9 */},
>>>> +  {"Undefined",           XE_TILE_HW_ERR_UNKNOWN_NONFATAL},
>>>> +  {"Undefined",           XE_TILE_HW_ERR_UNKNOWN_NONFATAL},
>>>> +  {"SGUNIT",              XE_TILE_HW_ERR_SGUNIT_NONFATAL          /* Bit Pos 12 */},
>>>> +  {"SGCI Data Parity",    XE_TILE_HW_ERR_SGCI_NONFATAL            /* Bit Pos 13 */},
>>>> +  {"Undefined",           XE_TILE_HW_ERR_UNKNOWN_NONFATAL},
>>>> +  {"Undefined",           XE_TILE_HW_ERR_UNKNOWN_NONFATAL},
>>>> +  {"SOC",                 XE_TILE_HW_ERR_SOC_NONFATAL             /* Bit Pos 16 */},
>>>> +  {"Undefined",           XE_TILE_HW_ERR_UNKNOWN_NONFATAL         /* Bit Pos 17 */},
>>>> +  {"Undefined",           XE_TILE_HW_ERR_UNKNOWN_NONFATAL},
>>>> +  {"Undefined",           XE_TILE_HW_ERR_UNKNOWN_NONFATAL},
>>>> +  {"MERT Data Parity",    XE_TILE_HW_ERR_MERT_NONFATAL            /* Bit Pos 20 */},
>>>> +};
>>>> +
>>>> +struct error_msg_counter_pair dev_err_stat_correctable_reg[] = {
>>>> +  {"GT",                  XE_TILE_HW_ERR_GT_CORR                  /* Bit Pos 0 */},
>>>> +  {"Undefined",           XE_TILE_HW_ERR_UNKNOWN_CORR},
>>>> +  {"Undefined",           XE_TILE_HW_ERR_UNKNOWN_CORR},
>>>> +  {"Undefined",           XE_TILE_HW_ERR_UNKNOWN_CORR},
>>>> +  {"DISPLAY",             XE_TILE_HW_ERR_DISPLAY_CORR             /* Bit Pos 4 */},
>>>> +  {"Undefined",           XE_TILE_HW_ERR_UNKNOWN_CORR},
>>>> +  {"Undefined",           XE_TILE_HW_ERR_UNKNOWN_CORR},
>>>> +  {"Undefined",           XE_TILE_HW_ERR_UNKNOWN_CORR},
>>>> +  {"GSC",                 XE_TILE_HW_ERR_GSC_CORR                 /* Bit Pos 8 */},
>>>> +  {"Undefined",           XE_TILE_HW_ERR_UNKNOWN_CORR},
>>>> +  {"Undefined",           XE_TILE_HW_ERR_UNKNOWN_CORR},
>>>> +  {"Undefined",           XE_TILE_HW_ERR_UNKNOWN_CORR},
>>>> +  {"SGUNIT",              XE_TILE_HW_ERR_SGUNIT_CORR              /* Bit Pos 12 */},
>>>> +  {"Undefined",           XE_TILE_HW_ERR_UNKNOWN_CORR},
>>>> +  {"Undefined",           XE_TILE_HW_ERR_UNKNOWN_CORR},
>>>> +  {"Undefined",           XE_TILE_HW_ERR_UNKNOWN_CORR},
>>>> +  {"SOC",                 XE_TILE_HW_ERR_SOC_CORR                 /* Bit Pos 16 */},
>>>> +  {"Undefined",           XE_TILE_HW_ERR_UNKNOWN_CORR},
>>>> +  {"Undefined",           XE_TILE_HW_ERR_UNKNOWN_CORR},
>>>> +  {"Undefined",           XE_TILE_HW_ERR_UNKNOWN_CORR},
>>>> +  {"Undefined",           XE_TILE_HW_ERR_UNKNOWN_CORR},
>>>> +};
>>>> +
>>>> +static void update_valid_error_regs(struct xe_device *xe)
>>>> +{
>>>> +  unsigned long mask = 0;
>>>> +
>>>> +  u32 i;
>>>> +
>>>> +  if (xe->info.platform == XE_DG2) {
>>>> +          mask = ~(0 | DG2_DEV_ERR_STAT_FATAL_MASK);
>>>> +          for_each_set_bit(i, &mask, DEV_ERR_STAT_MAX_ERROR_BIT)
>>>> +                  dev_err_stat_fatal_reg[i] = (struct error_msg_counter_pair)
>>>> +                  {.errmsg = "Undefined", .errcounter = XE_TILE_HW_ERR_UNKNOWN_FATAL};
>>> Nope. For one thing, the arrays really should be static const, placed in
>>> rodata, and not mutable.
>> My Idea of keeping it mutable is to avoid multiple platform specific
>> arrays.
>>> For another, if you have a platform with two or more different devices,
>>> whichever gets probed last clobbers the data.
>>
>> Thanks for pointing it out. I had missed this particular scenario.  I
>> believe defining platform specific array is better
>>
>> because then we can ensure them to be immutable and we will not be
>> required to have platform specific mask also.
>>
>> I believe this will help with maintainability too.
>>
>> BR
>>
>> Himal Ghimiray
>>
>>>> +
>>>> +          mask = ~(0 | DG2_DEV_ERR_STAT_NONFATAL_MASK);
>>>> +          for_each_set_bit(i, &mask, DEV_ERR_STAT_MAX_ERROR_BIT)
>>>> +                  dev_err_stat_nonfatal_reg[i] = (struct error_msg_counter_pair)
>>>> +                  {.errmsg = "Undefined", .errcounter = XE_TILE_HW_ERR_UNKNOWN_NONFATAL};
>>>> +
>>>> +          mask = ~(0 | DG2_DEV_ERR_STAT_CORRECTABLE_MASK);
>>>> +          for_each_set_bit(i, &mask, DEV_ERR_STAT_MAX_ERROR_BIT)
>>>> +                  dev_err_stat_correctable_reg[i] = (struct error_msg_counter_pair)
>>>> +                  {.errmsg = "Undefined", .errcounter = XE_TILE_HW_ERR_UNKNOWN_CORR};
>>>> +  } else if (xe->info.platform == XE_PVC) {
>>>> +          mask = ~(0 | PVC_DEV_ERR_STAT_FATAL_MASK);
>>>> +          for_each_set_bit(i, &mask, DEV_ERR_STAT_MAX_ERROR_BIT)
>>>> +                  dev_err_stat_fatal_reg[i] = (struct error_msg_counter_pair)
>>>> +                  {.errmsg = "Undefined", .errcounter = XE_TILE_HW_ERR_UNKNOWN_FATAL};
>>>> +
>>>> +          mask = ~(0 | PVC_DEV_ERR_STAT_NONFATAL_MASK);
>>>> +          for_each_set_bit(i, &mask, DEV_ERR_STAT_MAX_ERROR_BIT)
>>>> +                  dev_err_stat_nonfatal_reg[i] = (struct error_msg_counter_pair)
>>>> +                  {.errmsg = "Undefined", .errcounter = XE_TILE_HW_ERR_UNKNOWN_NONFATAL};
>>>> +
>>>> +          mask = ~(0 | PVC_DEV_ERR_STAT_CORRECTABLE_MASK);
>>>> +          for_each_set_bit(i, &mask, DEV_ERR_STAT_MAX_ERROR_BIT)
>>>> +                  dev_err_stat_correctable_reg[i] = (struct error_msg_counter_pair)
>>>> +                  {.errmsg = "Undefined", .errcounter = XE_TILE_HW_ERR_UNKNOWN_CORR};
>>>> +  }
>>>> +}
>>>> +
>>>> +static void
>>>> +xe_hw_error_source_handler(struct xe_tile *tile, const enum hardware_error hw_err)
>>>> +{
>>>> +  const char *hw_err_str = hardware_error_type_to_str(hw_err);
>>>> +  struct error_msg_counter_pair *errstat;
>>>> +  unsigned long errsrc;
>>>> +  unsigned long flags;
>>>> +  const char *errmsg;
>>>> +  struct xe_gt *mmio;
>>>> +  u32 counter;
>>>> +  u32 errcntr;
>>>> +  u32 errbit;
>>>> +
>>>> +  switch (hw_err) {
>>>> +  case HARDWARE_ERROR_FATAL:
>>>> +          errstat = (struct error_msg_counter_pair *)dev_err_stat_fatal_reg;
>>> Why the casts?
>>>
>>>> +          counter = XE_TILE_HW_ERR_UNKNOWN_FATAL;
>>>> +          break;
>>>> +  case HARDWARE_ERROR_NONFATAL:
>>>> +          errstat = (struct error_msg_counter_pair *)dev_err_stat_nonfatal_reg;
>>>> +          counter = XE_TILE_HW_ERR_UNKNOWN_NONFATAL;
>>>> +          break;
>>>> +  case HARDWARE_ERROR_CORRECTABLE:
>>>> +          errstat = (struct error_msg_counter_pair *)dev_err_stat_correctable_reg;
>>>> +          counter = XE_TILE_HW_ERR_UNKNOWN_CORR;
>>>> +          break;
>>>> +  default:
>>>> +          return;
>>>> +  }
>>>> +
>>>> +  spin_lock_irqsave(&tile_to_xe(tile)->irq.lock, flags);
>>>> +  mmio = tile->primary_gt;
>>>> +  errsrc = xe_mmio_read32(mmio, DEV_ERR_STAT_REG(hw_err));
>>>> +
>>>> +  if (!errsrc) {
>>>> +          xe_tile_log_hw_err(tile, "DEV_ERR_STAT_REG_%s blank!\n", hw_err_str);
>>>> +          goto unlock;
>>>> +  }
>>>> +
>>>> +  for_each_set_bit(errbit, &errsrc, REG_SIZE) {
>>>> +          if (errbit < DEV_ERR_STAT_MAX_ERROR_BIT) {
>>>> +                  errmsg = errstat[errbit].errmsg;
>>>> +                  errcntr = errstat[errbit].errcounter;
>>>> +          } else {
>>>> +                  errmsg = "Undefined";
>>>> +                  errcntr = counter;
>>>> +          }
>>>> +
>>>> +          if (hw_err == HARDWARE_ERROR_CORRECTABLE)
>>>> +                  xe_tile_log_hw_warn(tile, "%s %s error bit[%d] is set\n",
>>>> +                                      errmsg, hw_err_str, errbit);
>>>> +          else
>>>> +                  xe_tile_log_hw_err(tile, "%s %s error bit[%d] is set\n",
>>>> +                                     errmsg, hw_err_str, errbit);
>>>> +
>>>> +          tile->errors.hw[errcntr]++;
>>>> +  }
>>>> +
>>>> +  xe_mmio_write32(mmio, DEV_ERR_STAT_REG(hw_err), errsrc);
>>>> +unlock:
>>>> +  spin_unlock_irqrestore(&tile_to_xe(tile)->irq.lock, flags);
>>>> +}
>>>> +
>>>> +/*
>>>> + * XE Platforms adds three Error bits to the Master Interrupt
>>>> + * Register to support error handling. These three bits are
>>>> + * used to convey the class of error:
>>>> + * FATAL, NONFATAL, or CORRECTABLE.
>>>> + *
>>>> + * To process an interrupt:
>>>> + *       Determine source of error (IP block) by reading
>>>> + *         the Device Error Source Register (RW1C) that
>>>> + *         corresponds to the class of error being serviced
>>>> + *         and log the error.
>>>> + */
>>>> +static void
>>>> +xe_hw_error_irq_handler(struct xe_tile *tile, const u32 master_ctl)
>>>> +{
>>>> +  enum hardware_error hw_err;
>>>> +
>>>> +  for (hw_err = 0; hw_err < HARDWARE_ERROR_MAX; hw_err++) {
>>>> +          if (master_ctl & XE_ERROR_IRQ(hw_err))
>>>> +                  xe_hw_error_source_handler(tile, hw_err);
>>>> +  }
>>>> +}
>>>> +
>>>>   /*
>>>>    * Top-level interrupt handler for Xe_LP+ and beyond.  These platforms have
>>>>    * a "master tile" interrupt register which must be consulted before the
>>>> @@ -413,6 +630,7 @@ static irqreturn_t dg1_irq_handler(int irq, void *arg)
>>>>             xe_mmio_write32(mmio, GFX_MSTR_IRQ, master_ctl);
>>>>
>>>>             gt_irq_handler(tile, master_ctl, intr_dw, identity);
>>>> +          xe_hw_error_irq_handler(tile, master_ctl);
>>>>
>>>>             /*
>>>>              * Display interrupts (including display backlight operations
>>>> @@ -561,6 +779,8 @@ int xe_irq_install(struct xe_device *xe)
>>>>             return -EINVAL;
>>>>     }
>>>>
>>>> +  update_valid_error_regs(xe);
>>>> +
>>>>     xe->irq.enabled = true;
>>>>
>>>>     xe_irq_reset(xe);

--
Jani Nikula, Intel Open Source Graphics Center


[-- Attachment #2: Type: text/html, Size: 46120 bytes --]

^ permalink raw reply	[flat|nested] 10+ messages in thread

end of thread, other threads:[~2023-08-10 11:21 UTC | newest]

Thread overview: 10+ messages (download: mbox.gz follow: Atom feed
-- links below jump to the message on this page --
2023-08-10  5:07 [Intel-xe] [PATCH v2 0/4] Supporting RAS on XE Himal Prasad Ghimiray
2023-08-10  5:07 ` [Intel-xe] [PATCH v2 1/4] drm/xe: Handle errors from various components Himal Prasad Ghimiray
2023-08-10  7:54   ` Jani Nikula
     [not found]     ` <ac574fd9-dce4-4f1e-a209-f4a400263273@intel.com>
2023-08-10  9:38       ` Ghimiray, Himal Prasad
2023-08-10 10:17         ` Jani Nikula
2023-08-10 11:20           ` Ghimiray, Himal Prasad
2023-08-10  5:07 ` [Intel-xe] [PATCH v2 2/4] drm/xe: Log and count the GT hardware errors Himal Prasad Ghimiray
2023-08-10  5:07 ` [Intel-xe] [PATCH v2 3/4] drm/xe: Support GT hardware error reporting for PVC Himal Prasad Ghimiray
2023-08-10  5:07 ` [Intel-xe] [PATCH v2 4/4] drm/xe: Process fatal hardware errors Himal Prasad Ghimiray
2023-08-10  5:32 ` [Intel-xe] ✗ CI.Patch_applied: failure for Supporting RAS on XE Patchwork

This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.