* [Intel-xe] ✗ CI.Patch_applied: failure for RFC: drm/xe/ras: Supporting RAS on XE.
2023-04-06 9:26 [Intel-xe] [PATCH 0/4] RFC: drm/xe/ras: Supporting RAS on XE Himal Prasad Ghimiray
@ 2023-04-06 9:26 ` Patchwork
2023-04-06 9:26 ` [Intel-xe] [PATCH 1/4] drm/xe: Handle GRF/IC ECC error irq Himal Prasad Ghimiray
` (4 subsequent siblings)
5 siblings, 0 replies; 22+ messages in thread
From: Patchwork @ 2023-04-06 9:26 UTC (permalink / raw)
To: Himal Prasad Ghimiray; +Cc: intel-xe
== Series Details ==
Series: RFC: drm/xe/ras: Supporting RAS on XE.
URL : https://patchwork.freedesktop.org/series/116181/
State : failure
== Summary ==
=== Applying kernel patches on branch 'drm-xe-next' with base: ===
commit 58e5cc5ff399294871892dfbc06ae7b8597ee67a
Author: Matthew Brost <matthew.brost@intel.com>
AuthorDate: Wed Apr 5 16:20:03 2023 -0700
Commit: Matthew Brost <matthew.brost@intel.com>
CommitDate: Wed Apr 5 17:26:51 2023 -0700
drm/xe: Always write GEN12_RCU_MODE.GEN12_RCU_MODE_CCS_ENABLE for CCS engines
If CCS0 was fused we did not write this register thus CCS engine were
not enabled resulting in driver load failures.
Signed-off-by: Matthew Brost <matthew.brost@intel.com>
Reviewed-by: Matt Roper <matthew.d.roper@intel.com>
=== git am output follows ===
error: patch failed: drivers/gpu/drm/xe/regs/xe_regs.h:92
error: drivers/gpu/drm/xe/regs/xe_regs.h: patch does not apply
error: patch failed: drivers/gpu/drm/xe/xe_irq.c:344
error: drivers/gpu/drm/xe/xe_irq.c: patch does not apply
hint: Use 'git am --show-current-patch' to see the failed patch
Applying: drm/xe: Handle GRF/IC ECC error irq
Patch failed at 0001 drm/xe: Handle GRF/IC ECC error irq
When you have resolved this problem, run "git am --continue".
If you prefer to skip this patch, run "git am --skip" instead.
To restore the original branch and stop patching, run "git am --abort".
* [Intel-xe] [PATCH 1/4] drm/xe: Handle GRF/IC ECC error irq
2023-04-06 9:26 [Intel-xe] [PATCH 0/4] RFC: drm/xe/ras: Supporting RAS on XE Himal Prasad Ghimiray
2023-04-06 9:26 ` [Intel-xe] ✗ CI.Patch_applied: failure for " Patchwork
@ 2023-04-06 9:26 ` Himal Prasad Ghimiray
2023-04-24 23:51 ` Matt Roper
2023-04-06 9:26 ` [Intel-xe] [PATCH 2/4] drm/xe/ras: Log the GT hw errors Himal Prasad Ghimiray
` (3 subsequent siblings)
5 siblings, 1 reply; 22+ messages in thread
From: Himal Prasad Ghimiray @ 2023-04-06 9:26 UTC (permalink / raw)
To: intel-xe; +Cc: Himal Prasad Ghimiray, Fernando Pacheco
The error detection and correction capability
for GRF and instruction cache (IC) will utilize
the new interrupt and error handling infrastructure
for dgfx products. The GFX device can generate
a number of classes of error under the new
infrastructure: correctable, non-fatal, and
fatal errors.
The non-fatal and fatal error classes distinguish
between levels of severity for uncorrectable errors.
All ECC uncorrectable errors will be reported as
fatal to produce the desired system response. Fatal
errors are expected to route as PCIe error messages
which should result in OS issuing a GFX device FLR.
But the option exists to route fatal errors as
interrupts.
Driver will only handle logging of errors. Anything
more will be handled at system level.
For errors that will route as interrupts, three
bits in the Master Interrupt Register will be used
to convey the class of error.
For each class of error:
1. Determine source of error (IP block) by reading
the Device Error Source Register (RW1C) that
corresponds to the class of error being serviced.
2. If the generating IP block is GT, read and log the
GT Error Register (RW1C) that corresponds to the
class of error being serviced. Non-GT errors will
be logged in aggregate for now.
Bspec: 50875
Signed-off-by: Fernando Pacheco <fernando.pacheco@intel.com>
Signed-off-by: Himal Prasad Ghimiray <himal.prasad.ghimiray@intel.com>
Original-author: Fernando Pacheco
---
drivers/gpu/drm/xe/regs/xe_regs.h | 29 ++++++++
drivers/gpu/drm/xe/xe_irq.c | 108 ++++++++++++++++++++++++++++++
2 files changed, 137 insertions(+)
diff --git a/drivers/gpu/drm/xe/regs/xe_regs.h b/drivers/gpu/drm/xe/regs/xe_regs.h
index c1c829c23df1..dff74b093d4e 100644
--- a/drivers/gpu/drm/xe/regs/xe_regs.h
+++ b/drivers/gpu/drm/xe/regs/xe_regs.h
@@ -92,6 +92,10 @@
#define GEN11_GU_MISC_IRQ (1 << 29)
#define GEN11_DISPLAY_IRQ (1 << 16)
#define GEN11_GT_DW_IRQ(x) (1 << (x))
+#define XE_FATAL_ERROR_IRQ REG_BIT(28)
+#define XE_NON_FATAL_ERROR_IRQ REG_BIT(27)
+#define XE_CORRECTABLE_ERROR_IRQ REG_BIT(26)
+#define XE_ERROR_IRQ(x) REG_BIT(26 + (x))
#define DG1_MSTR_TILE_INTR _MMIO(0x190008)
#define DG1_MSTR_IRQ REG_BIT(31)
@@ -111,4 +115,29 @@
#define GEN12_DSMBASE _MMIO(0x1080C0)
#define GEN12_BDSM_MASK REG_GENMASK64(63, 20)
+enum hardware_error {
+ HARDWARE_ERROR_CORRECTABLE = 0,
+ HARDWARE_ERROR_NONFATAL = 1,
+ HARDWARE_ERROR_FATAL = 2,
+ HARDWARE_ERROR_MAX,
+};
+
+#define _DEV_ERR_STAT_FATAL 0x100174
+#define _DEV_ERR_STAT_NONFATAL 0x100178
+#define _DEV_ERR_STAT_CORRECTABLE 0x10017c
+#define DEV_ERR_STAT_REG(x) _MMIO(_PICK_EVEN((x), \
+ _DEV_ERR_STAT_CORRECTABLE, \
+ _DEV_ERR_STAT_NONFATAL))
+#define DEV_ERR_STAT_GT_ERROR REG_BIT(0)
+
+#define _ERR_STAT_GT_COR 0x100160
+#define _ERR_STAT_GT_NONFATAL 0x100164
+#define _ERR_STAT_GT_FATAL 0x100168
+#define ERR_STAT_GT_REG(x) _MMIO(_PICK_EVEN((x), \
+ _ERR_STAT_GT_COR, \
+ _ERR_STAT_GT_NONFATAL))
+
+#define EU_GRF_ERROR REG_BIT(15)
+#define EU_IC_ERROR REG_BIT(14)
+
#endif
diff --git a/drivers/gpu/drm/xe/xe_irq.c b/drivers/gpu/drm/xe/xe_irq.c
index 529b42d9c9af..6b922332bff1 100644
--- a/drivers/gpu/drm/xe/xe_irq.c
+++ b/drivers/gpu/drm/xe/xe_irq.c
@@ -344,6 +344,113 @@ static void dg1_irq_postinstall(struct xe_device *xe, struct xe_gt *gt)
dg1_intr_enable(xe, true);
}
+static const char *
+hardware_error_type_to_str(const enum hardware_error hw_err)
+{
+ switch (hw_err) {
+ case HARDWARE_ERROR_CORRECTABLE:
+ return "CORRECTABLE";
+ case HARDWARE_ERROR_NONFATAL:
+ return "NONFATAL";
+ case HARDWARE_ERROR_FATAL:
+ return "FATAL";
+ default:
+ return "UNKNOWN";
+ }
+}
+
+static void
+xe_gt_hw_error_handler(struct xe_gt *gt, const enum hardware_error hw_err)
+{
+ const char *hw_err_str = hardware_error_type_to_str(hw_err);
+ u32 other_errors = ~(EU_GRF_ERROR | EU_IC_ERROR);
+ u32 errstat;
+
+ lockdep_assert_held(&gt_to_xe(gt)->irq.lock);
+
+ errstat = xe_mmio_read32(gt, ERR_STAT_GT_REG(hw_err).reg);
+
+ if (unlikely(!errstat)) {
+ DRM_ERROR("ERR_STAT_GT_REG_%s blank!\n", hw_err_str);
+ return;
+ }
+
+ /*
+ * TODO: The GT Non Fatal Error Status Register
+ * only has reserved bitfields defined.
+ * Remove once there is something to service.
+ */
+ if (hw_err == HARDWARE_ERROR_NONFATAL) {
+ DRM_ERROR("detected Non-Fatal error\n");
+ xe_mmio_write32(gt, ERR_STAT_GT_REG(hw_err).reg, errstat);
+ return;
+ }
+
+ /*
+ * TODO: The remaining GT errors don't have a
+ * need for targeted logging at the moment. We
+ * still want to log detection of these errors, but
+ * let's aggregate them until someone has a need for them.
+ */
+ if (errstat & other_errors)
+ DRM_ERROR("detected hardware error(s) in ERR_STAT_GT_REG_%s: 0x%08x\n",
+ hw_err_str, errstat & other_errors);
+
+ xe_mmio_write32(gt, ERR_STAT_GT_REG(hw_err).reg, errstat);
+}
+
+static void
+xe_hw_error_source_handler(struct xe_gt *gt, const enum hardware_error hw_err)
+{
+ const char *hw_err_str = hardware_error_type_to_str(hw_err);
+ unsigned long flags;
+ u32 errsrc;
+
+ spin_lock_irqsave(&gt_to_xe(gt)->irq.lock, flags);
+ errsrc = xe_mmio_read32(gt, DEV_ERR_STAT_REG(hw_err).reg);
+ if (unlikely(!errsrc)) {
+ DRM_ERROR("DEV_ERR_STAT_REG_%s blank!\n", hw_err_str);
+ goto out_unlock;
+ }
+
+ if (errsrc & DEV_ERR_STAT_GT_ERROR)
+ xe_gt_hw_error_handler(gt, hw_err);
+
+ if (errsrc & ~DEV_ERR_STAT_GT_ERROR)
+ DRM_ERROR("non-GT hardware error(s) in DEV_ERR_STAT_REG_%s: 0x%08x\n",
+ hw_err_str, errsrc & ~DEV_ERR_STAT_GT_ERROR);
+
+ xe_mmio_write32(gt, DEV_ERR_STAT_REG(hw_err).reg, errsrc);
+
+out_unlock:
+ spin_unlock_irqrestore(&gt_to_xe(gt)->irq.lock, flags);
+}
+
+/*
+ * XE Platforms adds three Error bits to the Master Interrupt
+ * Register to support dgfx card error handling.
+ * These three bits are used to convey the class of error:
+ * FATAL, NONFATAL, or CORRECTABLE.
+ *
+ * To process an interrupt:
+ * 1. Determine source of error (IP block) by reading
+ * the Device Error Source Register (RW1C) that
+ * corresponds to the class of error being serviced.
+ * 2. For GT as the generating IP block, read and log
+ * the GT Error Register (RW1C) that corresponds to
+ * the class of error being serviced.
+ */
+static void
+xe_hw_error_irq_handler(struct xe_gt *gt, const u32 master_ctl)
+{
+ enum hardware_error hw_err;
+
+ for (hw_err = 0; hw_err < HARDWARE_ERROR_MAX; hw_err++) {
+ if (master_ctl & XE_ERROR_IRQ(hw_err))
+ xe_hw_error_source_handler(gt, hw_err);
+ }
+}
+
static irqreturn_t dg1_irq_handler(int irq, void *arg)
{
struct xe_device *xe = arg;
@@ -382,6 +489,7 @@ static irqreturn_t dg1_irq_handler(int irq, void *arg)
if (!xe_gt_is_media_type(gt))
xe_mmio_write32(gt, GEN11_GFX_MSTR_IRQ.reg, master_ctl);
gen11_gt_irq_handler(xe, gt, master_ctl, intr_dw, identity);
+ xe_hw_error_irq_handler(gt, master_ctl);
}
xe_display_irq_handler(xe, master_ctl);
--
2.25.1
* Re: [Intel-xe] [PATCH 1/4] drm/xe: Handle GRF/IC ECC error irq
2023-04-06 9:26 ` [Intel-xe] [PATCH 1/4] drm/xe: Handle GRF/IC ECC error irq Himal Prasad Ghimiray
@ 2023-04-24 23:51 ` Matt Roper
2023-04-27 7:18 ` Ghimiray, Himal Prasad
2023-05-02 19:56 ` Rodrigo Vivi
0 siblings, 2 replies; 22+ messages in thread
From: Matt Roper @ 2023-04-24 23:51 UTC (permalink / raw)
To: Himal Prasad Ghimiray; +Cc: intel-xe, Fernando Pacheco
On Thu, Apr 06, 2023 at 02:56:28PM +0530, Himal Prasad Ghimiray wrote:
> The error detection and correction capability
> for GRF and instruction cache (IC) will utilize
> the new interrupt and error handling infrastructure
> for dgfx products. The GFX device can generate
> a number of classes of error under the new
> infrastructure: correctable, non-fatal, and
> fatal errors.
Nitpick: Your commit message wrapping is too aggressive. Generally
commit message bodies are wrapped somewhere from 72-75 characters per
line.
>
> The non-fatal and fatal error classes distinguish
> between levels of severity for uncorrectable errors.
> All ECC uncorrectable errors will be reported as
> fatal to produce the desired system response. Fatal
> errors are expected to route as PCIe error messages
> which should result in OS issuing a GFX device FLR.
> But the option exists to route fatal errors as
> interrupts.
>
> Driver will only handle logging of errors. Anything
> more will be handled at system level.
>
> For errors that will route as interrupts, three
> bits in the Master Interrupt Register will be used
> to convey the class of error.
>
> For each class of error:
> 1. Determine source of error (IP block) by reading
> the Device Error Source Register (RW1C) that
> corresponds to the class of error being serviced.
> 2. If the generating IP block is GT, read and log the
> GT Error Register (RW1C) that corresponds to the
> class of error being serviced. Non-GT errors will
> be logged in aggregate for now.
>
> Bspec: 50875
>
> Signed-off-by: Fernando Pacheco <fernando.pacheco@intel.com>
> Signed-off-by: Himal Prasad Ghimiray <himal.prasad.ghimiray@intel.com>
> Original-author: Fernando Pacheco
Is this true? This may be inspired by preliminary i915 work Fernando
did 4-5 years ago, but I don't think anything in this patch is actually
his code at this point (especially since the Xe driver didn't exist back
then, and even the i915 driver looked completely different at the time).
We shouldn't really forge s-o-b lines when it isn't actually true.
> ---
> drivers/gpu/drm/xe/regs/xe_regs.h | 29 ++++++++
> drivers/gpu/drm/xe/xe_irq.c | 108 ++++++++++++++++++++++++++++++
> 2 files changed, 137 insertions(+)
>
> diff --git a/drivers/gpu/drm/xe/regs/xe_regs.h b/drivers/gpu/drm/xe/regs/xe_regs.h
> index c1c829c23df1..dff74b093d4e 100644
> --- a/drivers/gpu/drm/xe/regs/xe_regs.h
> +++ b/drivers/gpu/drm/xe/regs/xe_regs.h
> @@ -92,6 +92,10 @@
> #define GEN11_GU_MISC_IRQ (1 << 29)
> #define GEN11_DISPLAY_IRQ (1 << 16)
> #define GEN11_GT_DW_IRQ(x) (1 << (x))
> +#define XE_FATAL_ERROR_IRQ REG_BIT(28)
> +#define XE_NON_FATAL_ERROR_IRQ REG_BIT(27)
> +#define XE_CORRECTABLE_ERROR_IRQ REG_BIT(26)
You only use the parameterized macro below; the three here never get
used.
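The overlap can be checked with a small userspace sketch. REG_BIT() is
redefined locally as a stand-in for the kernel macro, and xe_error_irq()
and xe_error_irq_mask() are hypothetical helpers that exist only for this
illustration; the bit positions and enum values are the ones in the patch.

```c
#include <assert.h>
#include <stdint.h>

/* Userspace stand-in for the kernel's REG_BIT(); an assumption of this sketch. */
#define REG_BIT(n) ((uint32_t)1 << (n))

/* Definitions copied from the patch. */
#define XE_FATAL_ERROR_IRQ        REG_BIT(28)
#define XE_NON_FATAL_ERROR_IRQ    REG_BIT(27)
#define XE_CORRECTABLE_ERROR_IRQ  REG_BIT(26)
#define XE_ERROR_IRQ(x)           REG_BIT(26 + (x))

enum hardware_error {
	HARDWARE_ERROR_CORRECTABLE = 0,
	HARDWARE_ERROR_NONFATAL = 1,
	HARDWARE_ERROR_FATAL = 2,
	HARDWARE_ERROR_MAX,
};

/* Hypothetical helper: master-interrupt bit for one error class. */
static uint32_t xe_error_irq(enum hardware_error cls)
{
	assert(cls < HARDWARE_ERROR_MAX);
	return XE_ERROR_IRQ(cls);
}

/* Hypothetical helper: mask covering all three error-class bits. */
static uint32_t xe_error_irq_mask(void)
{
	uint32_t mask = 0;

	for (int i = 0; i < HARDWARE_ERROR_MAX; i++)
		mask |= XE_ERROR_IRQ(i);
	return mask;
}
```

Since XE_ERROR_IRQ(x) already produces each of the three named values,
the named defines could either be dropped or used to build the
parameterized macro.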
> +#define XE_ERROR_IRQ(x) REG_BIT(26 + (x))
>
> #define DG1_MSTR_TILE_INTR _MMIO(0x190008)
> #define DG1_MSTR_IRQ REG_BIT(31)
> @@ -111,4 +115,29 @@
> #define GEN12_DSMBASE _MMIO(0x1080C0)
> #define GEN12_BDSM_MASK REG_GENMASK64(63, 20)
>
> +enum hardware_error {
> + HARDWARE_ERROR_CORRECTABLE = 0,
> + HARDWARE_ERROR_NONFATAL = 1,
> + HARDWARE_ERROR_FATAL = 2,
> + HARDWARE_ERROR_MAX,
> +};
> +
> +#define _DEV_ERR_STAT_FATAL 0x100174
> +#define _DEV_ERR_STAT_NONFATAL 0x100178
> +#define _DEV_ERR_STAT_CORRECTABLE 0x10017c
> +#define DEV_ERR_STAT_REG(x) _MMIO(_PICK_EVEN((x), \
> + _DEV_ERR_STAT_CORRECTABLE, \
> + _DEV_ERR_STAT_NONFATAL))
> +#define DEV_ERR_STAT_GT_ERROR REG_BIT(0)
> +
> +#define _ERR_STAT_GT_COR 0x100160
> +#define _ERR_STAT_GT_NONFATAL 0x100164
> +#define _ERR_STAT_GT_FATAL 0x100168
> +#define ERR_STAT_GT_REG(x) _MMIO(_PICK_EVEN((x), \
> + _ERR_STAT_GT_COR, \
> + _ERR_STAT_GT_NONFATAL))
> +
> +#define EU_GRF_ERROR REG_BIT(15)
> +#define EU_IC_ERROR REG_BIT(14)
> +
> #endif
> diff --git a/drivers/gpu/drm/xe/xe_irq.c b/drivers/gpu/drm/xe/xe_irq.c
> index 529b42d9c9af..6b922332bff1 100644
> --- a/drivers/gpu/drm/xe/xe_irq.c
> +++ b/drivers/gpu/drm/xe/xe_irq.c
> @@ -344,6 +344,113 @@ static void dg1_irq_postinstall(struct xe_device *xe, struct xe_gt *gt)
> dg1_intr_enable(xe, true);
> }
>
> +static const char *
> +hardware_error_type_to_str(const enum hardware_error hw_err)
> +{
> + switch (hw_err) {
> + case HARDWARE_ERROR_CORRECTABLE:
> + return "CORRECTABLE";
> + case HARDWARE_ERROR_NONFATAL:
> + return "NONFATAL";
> + case HARDWARE_ERROR_FATAL:
> + return "FATAL";
> + default:
> + return "UNKNOWN";
> + }
> +}
> +
> +static void
> +xe_gt_hw_error_handler(struct xe_gt *gt, const enum hardware_error hw_err)
> +{
> + const char *hw_err_str = hardware_error_type_to_str(hw_err);
> + u32 other_errors = ~(EU_GRF_ERROR | EU_IC_ERROR);
> + u32 errstat;
> +
> + lockdep_assert_held(&gt_to_xe(gt)->irq.lock);
> +
> + errstat = xe_mmio_read32(gt, ERR_STAT_GT_REG(hw_err).reg);
> +
> + if (unlikely(!errstat)) {
> + DRM_ERROR("ERR_STAT_GT_REG_%s blank!\n", hw_err_str);
We should be using the per-device error macros (e.g., drm_err) rather
than the legacy macros. In fact maybe we should be using the
_ratelimited variants so that if something does go wrong (and keeps
going wrong), we're not spamming so much output into the log from an
interrupt handler that we crash the system.
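For readers unfamiliar with the idea, the following is a minimal
userspace model of interval/burst rate limiting. It is not the kernel's
implementation; in the driver one would simply call the existing
drm_err_ratelimited() macro. The struct and function names below are
hypothetical, and time is passed in explicitly so the behavior is
deterministic.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical model: allow at most `burst` messages per `interval`. */
struct ratelimit {
	uint64_t interval;      /* window length, arbitrary time units */
	unsigned int burst;     /* messages allowed per window */
	uint64_t window_start;  /* start of the current window */
	unsigned int printed;   /* messages emitted in this window */
	unsigned int missed;    /* messages suppressed so far */
};

/* Returns true if a message may be emitted at time `now`. */
static bool ratelimit_allow(struct ratelimit *rl, uint64_t now)
{
	assert(rl->interval > 0);

	/* Open a fresh window once the current one has expired. */
	if (now - rl->window_start >= rl->interval) {
		rl->window_start = now;
		rl->printed = 0;
	}

	if (rl->printed < rl->burst) {
		rl->printed++;
		return true;
	}

	rl->missed++;
	return false;
}
```

With a burst of 2 and an interval of 5, a handler that fires at times
0, 1, 2 and 4 would log only the first two events and count the other
two as suppressed, which is what keeps a stuck interrupt from flooding
the log.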
> + return;
> + }
> +
> + /*
> + * TODO: The GT Non Fatal Error Status Register
> + * only has reserved bitfields defined.
> + * Remove once there is something to service.
> + */
> + if (hw_err == HARDWARE_ERROR_NONFATAL) {
> + DRM_ERROR("detected Non-Fatal error\n");
> + xe_mmio_write32(gt, ERR_STAT_GT_REG(hw_err).reg, errstat);
> + return;
> + }
> +
> + /*
> + * TODO: The remaining GT errors don't have a
> + * need for targeted logging at the moment. We
> + * still want to log detection of these errors, but
> + * let's aggregate them until someone has a need for them.
> + */
> + if (errstat & other_errors)
So is the goal here to just silently ignore GRF and IC errors (unless
they're non-fatal, which is impossible), but print a message about
everything else? That seems to be what the code is doing right now.
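A small userspace sketch of the masking makes this concrete. REG_BIT()
is a local stand-in for the kernel macro, and aggregate_logged_bits()
and grf_ic_example() are hypothetical helpers written only for this
illustration; the two bit definitions are taken from the patch.

```c
#include <assert.h>
#include <stdint.h>

/* Userspace stand-in for the kernel's REG_BIT(); an assumption of this sketch. */
#define REG_BIT(n) ((uint32_t)1 << (n))

/* Bit definitions copied from the patch. */
#define EU_GRF_ERROR REG_BIT(15)
#define EU_IC_ERROR  REG_BIT(14)

/*
 * Hypothetical helper: the bits that the patch's aggregate message
 * would report for a given ERR_STAT_GT value.
 */
static uint32_t aggregate_logged_bits(uint32_t errstat)
{
	const uint32_t other_errors = ~(EU_GRF_ERROR | EU_IC_ERROR);

	return errstat & other_errors;
}

/* Usage: a status value holding only GRF/IC bits leaves nothing to log. */
void grf_ic_example(void)
{
	assert(aggregate_logged_bits(EU_GRF_ERROR | EU_IC_ERROR) == 0);
}
```

In other words, with the mask as written, a status word that contains
only the GRF and IC bits produces no message at all before the register
is cleared.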
> + DRM_ERROR("detected hardware error(s) in ERR_STAT_GT_REG_%s: 0x%08x\n",
> + hw_err_str, errstat & other_errors);
For correctable errors should we even be printing an error? I thought
correctable errors were basically something the hardware could clean up
by itself and there's no impact to the user. It seems like if we raise
DRM errors on correctable errors we're spamming the user about something
they aren't supposed to need to worry about?
> +
> + xe_mmio_write32(gt, ERR_STAT_GT_REG(hw_err).reg, errstat);
> +}
> +
> +static void
> +xe_hw_error_source_handler(struct xe_gt *gt, const enum hardware_error hw_err)
> +{
> + const char *hw_err_str = hardware_error_type_to_str(hw_err);
> + unsigned long flags;
> + u32 errsrc;
> +
> + spin_lock_irqsave(&gt_to_xe(gt)->irq.lock, flags);
> + errsrc = xe_mmio_read32(gt, DEV_ERR_STAT_REG(hw_err).reg);
> + if (unlikely(!errsrc)) {
> + DRM_ERROR("DEV_ERR_STAT_REG_%s blank!\n", hw_err_str);
Is this actually an error? You've wrapped this function in a spinlock,
so if there are two concurrent entries to this function, the one that
winds up waiting is likely to see a register value of zero once the
first one clears the interrupts and releases the lock...
Matt
> + goto out_unlock;
> + }
> +
> + if (errsrc & DEV_ERR_STAT_GT_ERROR)
> + xe_gt_hw_error_handler(gt, hw_err);
> +
> + if (errsrc & ~DEV_ERR_STAT_GT_ERROR)
> + DRM_ERROR("non-GT hardware error(s) in DEV_ERR_STAT_REG_%s: 0x%08x\n",
> + hw_err_str, errsrc & ~DEV_ERR_STAT_GT_ERROR);
> +
> + xe_mmio_write32(gt, DEV_ERR_STAT_REG(hw_err).reg, errsrc);
> +
> +out_unlock:
> + spin_unlock_irqrestore(&gt_to_xe(gt)->irq.lock, flags);
> +}
> +
> +/*
> + * XE Platforms adds three Error bits to the Master Interrupt
> + * Register to support dgfx card error handling.
> + * These three bits are used to convey the class of error:
> + * FATAL, NONFATAL, or CORRECTABLE.
> + *
> + * To process an interrupt:
> + * 1. Determine source of error (IP block) by reading
> + * the Device Error Source Register (RW1C) that
> + * corresponds to the class of error being serviced.
> + * 2. For GT as the generating IP block, read and log
> + * the GT Error Register (RW1C) that corresponds to
> + * the class of error being serviced.
> + */
> +static void
> +xe_hw_error_irq_handler(struct xe_gt *gt, const u32 master_ctl)
> +{
> + enum hardware_error hw_err;
> +
> + for (hw_err = 0; hw_err < HARDWARE_ERROR_MAX; hw_err++) {
> + if (master_ctl & XE_ERROR_IRQ(hw_err))
> + xe_hw_error_source_handler(gt, hw_err);
> + }
> +}
> +
> static irqreturn_t dg1_irq_handler(int irq, void *arg)
> {
> struct xe_device *xe = arg;
> @@ -382,6 +489,7 @@ static irqreturn_t dg1_irq_handler(int irq, void *arg)
> if (!xe_gt_is_media_type(gt))
> xe_mmio_write32(gt, GEN11_GFX_MSTR_IRQ.reg, master_ctl);
> gen11_gt_irq_handler(xe, gt, master_ctl, intr_dw, identity);
> + xe_hw_error_irq_handler(gt, master_ctl);
> }
>
> xe_display_irq_handler(xe, master_ctl);
> --
> 2.25.1
>
--
Matt Roper
Graphics Software Engineer
Linux GPU Platform Enablement
Intel Corporation
* Re: [Intel-xe] [PATCH 1/4] drm/xe: Handle GRF/IC ECC error irq
2023-04-24 23:51 ` Matt Roper
@ 2023-04-27 7:18 ` Ghimiray, Himal Prasad
2023-05-02 19:56 ` Rodrigo Vivi
1 sibling, 0 replies; 22+ messages in thread
From: Ghimiray, Himal Prasad @ 2023-04-27 7:18 UTC (permalink / raw)
To: Roper, Matthew D; +Cc: intel-xe@lists.freedesktop.org, Fernando Pacheco
> -----Original Message-----
> From: Roper, Matthew D <matthew.d.roper@intel.com>
> Sent: 25 April 2023 05:22
> To: Ghimiray, Himal Prasad <himal.prasad.ghimiray@intel.com>
> Cc: intel-xe@lists.freedesktop.org; Fernando Pacheco
> <fernando.pacheco@intel.com>
> Subject: Re: [Intel-xe] [PATCH 1/4] drm/xe: Handle GRF/IC ECC error irq
>
> On Thu, Apr 06, 2023 at 02:56:28PM +0530, Himal Prasad Ghimiray wrote:
> > The error detection and correction capability for GRF and instruction
> > cache (IC) will utilize the new interrupt and error handling
> > infrastructure for dgfx products. The GFX device can generate a number
> > of classes of error under the new
> > infrastructure: correctable, non-fatal, and fatal errors.
>
> Nitpick: Your commit message wrapping is too aggressive. Generally commit
> message bodies are wrapped somewhere from 72-75 characters per line.
Will address this.
>
> >
> > The non-fatal and fatal error classes distinguish between levels of
> > severity for uncorrectable errors.
> > All ECC uncorrectable errors will be reported as fatal to produce the
> > desired system response. Fatal errors are expected to route as PCIe
> > error messages which should result in OS issuing a GFX device FLR.
> > But the option exists to route fatal errors as interrupts.
> >
> > Driver will only handle logging of errors. Anything more will be
> > handled at system level.
> >
> > For errors that will route as interrupts, three bits in the Master
> > Interrupt Register will be used to convey the class of error.
> >
> > For each class of error:
> > 1. Determine source of error (IP block) by reading
> > the Device Error Source Register (RW1C) that
> > corresponds to the class of error being serviced.
> > 2. If the generating IP block is GT, read and log the
> > GT Error Register (RW1C) that corresponds to the
> > class of error being serviced. Non-GT errors will
> > be logged in aggregate for now.
> >
> > Bspec: 50875
> >
> > Signed-off-by: Fernando Pacheco <fernando.pacheco@intel.com>
> > Signed-off-by: Himal Prasad Ghimiray <himal.prasad.ghimiray@intel.com>
> > Original-author: Fernando Pacheco
>
> Is this true? This may be inspired by preliminary i915 work Fernando did 4-5
> years ago, but I don't think anything in this patch is actually his code at this
> point (especially since the Xe driver didn't exist back then, and even the i915
> driver looked completely different at the time).
> We shouldn't really forge s-o-b lines when it isn't actually true.
I agree with you about not using s-o-b. The patch is inspired by Fernando Pacheco's work in i915.
Is adding Co-developed-by the right way to go?
>
> > ---
> > drivers/gpu/drm/xe/regs/xe_regs.h | 29 ++++++++
> > drivers/gpu/drm/xe/xe_irq.c | 108
> ++++++++++++++++++++++++++++++
> > 2 files changed, 137 insertions(+)
> >
> > diff --git a/drivers/gpu/drm/xe/regs/xe_regs.h
> > b/drivers/gpu/drm/xe/regs/xe_regs.h
> > index c1c829c23df1..dff74b093d4e 100644
> > --- a/drivers/gpu/drm/xe/regs/xe_regs.h
> > +++ b/drivers/gpu/drm/xe/regs/xe_regs.h
> > @@ -92,6 +92,10 @@
> > #define GEN11_GU_MISC_IRQ (1 << 29)
> > #define GEN11_DISPLAY_IRQ (1 << 16)
> > #define GEN11_GT_DW_IRQ(x) (1 << (x))
> > +#define XE_FATAL_ERROR_IRQ REG_BIT(28)
> > +#define XE_NON_FATAL_ERROR_IRQ REG_BIT(27)
> > +#define XE_CORRECTABLE_ERROR_IRQ REG_BIT(26)
>
> You only use the parameterized macro below; the three here never get
> used.
Will address this.
>
> > +#define XE_ERROR_IRQ(x) REG_BIT(26 + (x))
> >
> > #define DG1_MSTR_TILE_INTR _MMIO(0x190008)
> > #define DG1_MSTR_IRQ REG_BIT(31)
> > @@ -111,4 +115,29 @@
> > #define GEN12_DSMBASE _MMIO(0x1080C0)
> > #define GEN12_BDSM_MASK REG_GENMASK64(63,
> 20)
> >
> > +enum hardware_error {
> > + HARDWARE_ERROR_CORRECTABLE = 0,
> > + HARDWARE_ERROR_NONFATAL = 1,
> > + HARDWARE_ERROR_FATAL = 2,
> > + HARDWARE_ERROR_MAX,
> > +};
> > +
> > +#define _DEV_ERR_STAT_FATAL 0x100174
> > +#define _DEV_ERR_STAT_NONFATAL 0x100178
> > +#define _DEV_ERR_STAT_CORRECTABLE 0x10017c
> > +#define DEV_ERR_STAT_REG(x) _MMIO(_PICK_EVEN((x), \
> > +
> _DEV_ERR_STAT_CORRECTABLE, \
> > +
> _DEV_ERR_STAT_NONFATAL))
> > +#define DEV_ERR_STAT_GT_ERROR REG_BIT(0)
> > +
> > +#define _ERR_STAT_GT_COR 0x100160
> > +#define _ERR_STAT_GT_NONFATAL 0x100164
> > +#define _ERR_STAT_GT_FATAL 0x100168
> > +#define ERR_STAT_GT_REG(x) _MMIO(_PICK_EVEN((x), \
> > + _ERR_STAT_GT_COR, \
> > + _ERR_STAT_GT_NONFATAL))
> > +
> > +#define EU_GRF_ERROR REG_BIT(15)
> > +#define EU_IC_ERROR REG_BIT(14)
> > +
> > #endif
> > diff --git a/drivers/gpu/drm/xe/xe_irq.c b/drivers/gpu/drm/xe/xe_irq.c
> > index 529b42d9c9af..6b922332bff1 100644
> > --- a/drivers/gpu/drm/xe/xe_irq.c
> > +++ b/drivers/gpu/drm/xe/xe_irq.c
> > @@ -344,6 +344,113 @@ static void dg1_irq_postinstall(struct xe_device
> *xe, struct xe_gt *gt)
> > dg1_intr_enable(xe, true);
> > }
> >
> > +static const char *
> > +hardware_error_type_to_str(const enum hardware_error hw_err) {
> > + switch (hw_err) {
> > + case HARDWARE_ERROR_CORRECTABLE:
> > + return "CORRECTABLE";
> > + case HARDWARE_ERROR_NONFATAL:
> > + return "NONFATAL";
> > + case HARDWARE_ERROR_FATAL:
> > + return "FATAL";
> > + default:
> > + return "UNKNOWN";
> > + }
> > +}
> > +
> > +static void
> > +xe_gt_hw_error_handler(struct xe_gt *gt, const enum hardware_error
> > +hw_err) {
> > + const char *hw_err_str = hardware_error_type_to_str(hw_err);
> > + u32 other_errors = ~(EU_GRF_ERROR | EU_IC_ERROR);
> > + u32 errstat;
> > +
> > + lockdep_assert_held(&gt_to_xe(gt)->irq.lock);
> > +
> > + errstat = xe_mmio_read32(gt, ERR_STAT_GT_REG(hw_err).reg);
> > +
> > + if (unlikely(!errstat)) {
> > + DRM_ERROR("ERR_STAT_GT_REG_%s blank!\n",
> hw_err_str);
>
> We should be using the per-device error macros (e.g., drm_err) rather than
> the legacy macros. In fact maybe we should be using the _ratelimited
> variants so that if something does go wrong (and keeps going wrong), we're
> not spamming so much output into the log from an interrupt handler that we
> crash the system.
Will address this in next patch.
>
> > + return;
> > + }
> > +
> > + /*
> > + * TODO: The GT Non Fatal Error Status Register
> > + * only has reserved bitfields defined.
> > + * Remove once there is something to service.
> > + */
> > + if (hw_err == HARDWARE_ERROR_NONFATAL) {
> > + DRM_ERROR("detected Non-Fatal error\n");
> > + xe_mmio_write32(gt, ERR_STAT_GT_REG(hw_err).reg,
> errstat);
> > + return;
> > + }
> > +
> > + /*
> > + * TODO: The remaining GT errors don't have a
> > + * need for targeted logging at the moment. We
> > + * still want to log detection of these errors, but
> > + * let's aggregate them until someone has a need for them.
> > + */
> > + if (errstat & other_errors)
>
> So is the goal here to just silently ignore GRF and IC errors (unless they're
> non-fatal, which is impossible), but print a message about everything else?
> That seems to be what the code is doing right now.
We should log GRF and IC errors. I will address this in the next patch.
>
> > + DRM_ERROR("detected hardware error(s) in
> ERR_STAT_GT_REG_%s: 0x%08x\n",
> > + hw_err_str, errstat & other_errors);
>
> For correctable errors should we even be printing an error? I thought
> correctable errors were basically something the hardware could clean up by
> itself and there's no impact to the user. It seems like if we raise DRM errors
> on correctable errors we're spamming the user about something they aren't
> supposed to need to worry about?
Hmm. You are right. For correctable errors, should we print a warning message instead
of an error?
>
> > +
> > + xe_mmio_write32(gt, ERR_STAT_GT_REG(hw_err).reg, errstat); }
> > +
> > +static void
> > +xe_hw_error_source_handler(struct xe_gt *gt, const enum
> > +hardware_error hw_err) {
> > + const char *hw_err_str = hardware_error_type_to_str(hw_err);
> > + unsigned long flags;
> > + u32 errsrc;
> > +
> > + spin_lock_irqsave(&gt_to_xe(gt)->irq.lock, flags);
> > + errsrc = xe_mmio_read32(gt, DEV_ERR_STAT_REG(hw_err).reg);
> > + if (unlikely(!errsrc)) {
> > + DRM_ERROR("DEV_ERR_STAT_REG_%s blank!\n",
> hw_err_str);
>
> Is this actually an error? You've wrapped this function in a spinlock, so if
> there are two concurrent entries to this function, the one that winds up
> waiting is likely to see a register value of zero once the first one clears the
> interrupts and releases the lock...
The register bits are RW1C (write 1 to clear), and we write back the value that was read, so we clear only the bits that were serviced.
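The write-1-to-clear behavior, and the zero read that a second caller
can still see, can be modeled in a few lines of userspace C. This is
only a sketch: struct rw1c_reg and the helpers below are hypothetical
and do not touch real hardware or the driver's MMIO accessors.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical model of a write-1-to-clear (RW1C) status register. */
struct rw1c_reg {
	uint32_t value;
};

/* Reading returns the currently latched bits. */
static uint32_t rw1c_read(const struct rw1c_reg *r)
{
	assert(r);
	return r->value;
}

/* Writing clears exactly the bits that are set in `v`. */
static void rw1c_write(struct rw1c_reg *r, uint32_t v)
{
	assert(r);
	r->value &= ~v;
}

/* Models the hardware latching new error bits. */
static void rw1c_hw_raise(struct rw1c_reg *r, uint32_t bits)
{
	assert(r);
	r->value |= bits;
}

/* Read, then write back what was read; returns the bits serviced. */
static uint32_t rw1c_service(struct rw1c_reg *r)
{
	uint32_t seen = rw1c_read(r);

	if (seen)
		rw1c_write(r, seen);
	return seen;
}
```

A bit raised between the read and the write-back survives, because only
the bits that were read get cleared. A second call to rw1c_service()
made after the first one has drained the register returns 0, which is
the "blank" case the review asks about.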
>
>
> Matt
>
> > + goto out_unlock;
> > + }
> > +
> > + if (errsrc & DEV_ERR_STAT_GT_ERROR)
> > + xe_gt_hw_error_handler(gt, hw_err);
> > +
> > + if (errsrc & ~DEV_ERR_STAT_GT_ERROR)
> > + DRM_ERROR("non-GT hardware error(s) in
> DEV_ERR_STAT_REG_%s: 0x%08x\n",
> > + hw_err_str, errsrc & ~DEV_ERR_STAT_GT_ERROR);
> > +
> > + xe_mmio_write32(gt, DEV_ERR_STAT_REG(hw_err).reg, errsrc);
> > +
> > +out_unlock:
> > + spin_unlock_irqrestore(&gt_to_xe(gt)->irq.lock, flags); }
> > +
> > +/*
> > + * XE Platforms adds three Error bits to the Master Interrupt
> > + * Register to support dgfx card error handling.
> > + * These three bits are used to convey the class of error:
> > + * FATAL, NONFATAL, or CORRECTABLE.
> > + *
> > + * To process an interrupt:
> > + * 1. Determine source of error (IP block) by reading
> > + * the Device Error Source Register (RW1C) that
> > + * corresponds to the class of error being serviced.
> > + * 2. For GT as the generating IP block, read and log
> > + * the GT Error Register (RW1C) that corresponds to
> > + * the class of error being serviced.
> > + */
> > +static void
> > +xe_hw_error_irq_handler(struct xe_gt *gt, const u32 master_ctl) {
> > + enum hardware_error hw_err;
> > +
> > + for (hw_err = 0; hw_err < HARDWARE_ERROR_MAX; hw_err++) {
> > + if (master_ctl & XE_ERROR_IRQ(hw_err))
> > + xe_hw_error_source_handler(gt, hw_err);
> > + }
> > +}
> > +
> > static irqreturn_t dg1_irq_handler(int irq, void *arg) {
> > struct xe_device *xe = arg;
> > @@ -382,6 +489,7 @@ static irqreturn_t dg1_irq_handler(int irq, void *arg)
> > if (!xe_gt_is_media_type(gt))
> > xe_mmio_write32(gt, GEN11_GFX_MSTR_IRQ.reg,
> master_ctl);
> > gen11_gt_irq_handler(xe, gt, master_ctl, intr_dw, identity);
> > + xe_hw_error_irq_handler(gt, master_ctl);
> > }
> >
> > xe_display_irq_handler(xe, master_ctl);
> > --
> > 2.25.1
> >
>
> --
> Matt Roper
> Graphics Software Engineer
> Linux GPU Platform Enablement
> Intel Corporation
* Re: [Intel-xe] [PATCH 1/4] drm/xe: Handle GRF/IC ECC error irq
2023-04-24 23:51 ` Matt Roper
2023-04-27 7:18 ` Ghimiray, Himal Prasad
@ 2023-05-02 19:56 ` Rodrigo Vivi
1 sibling, 0 replies; 22+ messages in thread
From: Rodrigo Vivi @ 2023-05-02 19:56 UTC (permalink / raw)
To: Matt Roper; +Cc: Himal Prasad Ghimiray, intel-xe, Fernando Pacheco
On Mon, Apr 24, 2023 at 04:51:48PM -0700, Matt Roper wrote:
> On Thu, Apr 06, 2023 at 02:56:28PM +0530, Himal Prasad Ghimiray wrote:
> > The error detection and correction capability
> > for GRF and instruction cache (IC) will utilize
> > the new interrupt and error handling infrastructure
> > for dgfx products. The GFX device can generate
> > a number of classes of error under the new
> > infrastructure: correctable, non-fatal, and
> > fatal errors.
>
> Nitpick: Your commit message wrapping is too aggressive. Generally
> commit message bodies are wrapped somewhere from 72-75 characters per
> line.
>
> >
> > The non-fatal and fatal error classes distinguish
> > between levels of severity for uncorrectable errors.
> > All ECC uncorrectable errors will be reported as
> > fatal to produce the desired system response. Fatal
> > errors are expected to route as PCIe error messages
> > which should result in OS issuing a GFX device FLR.
> > But the option exists to route fatal errors as
> > interrupts.
> >
> > Driver will only handle logging of errors. Anything
> > more will be handled at system level.
> >
> > For errors that will route as interrupts, three
> > bits in the Master Interrupt Register will be used
> > to convey the class of error.
> >
> > For each class of error:
> > 1. Determine source of error (IP block) by reading
> > the Device Error Source Register (RW1C) that
> > corresponds to the class of error being serviced.
> > 2. If the generating IP block is GT, read and log the
> > GT Error Register (RW1C) that corresponds to the
> > class of error being serviced. Non-GT errors will
> > be logged in aggregate for now.
> >
> > Bspec: 50875
> >
> > Signed-off-by: Fernando Pacheco <fernando.pacheco@intel.com>
> > Signed-off-by: Himal Prasad Ghimiray <himal.prasad.ghimiray@intel.com>
> > Original-author: Fernando Pacheco
>
> Is this true? This may be inspired by preliminary i915 work Fernando
> did 4-5 years ago, but I don't think anything in this patch is actually
> his code at this point (especially since the Xe driver didn't exist back
> then, and even the i915 driver looked completely different at the time).
> We shouldn't really forge s-o-b lines when it isn't actually true.
Exactly. Please do not forge signatures.
Our drivers are MIT and/or GPLv2, and there's no need to fake authorship
in the code to satisfy these licenses.
Giving credit is a good idea. Use Cc and mentions in the commit message, but
do not invent new tags, please. When in doubt, ask the original author whether
some mention is needed. Remember that they might look at the code and
think: "What? I have never touched this driver in my life, why is my name
associated with it?" On the other hand, there might be cases where
they would think: "What? This is my code, they should have at least mentioned
my name in the commit message"... So, when in doubt, ask, but do not forge things.
And I do believe the original authors were aware of the license obligations
and freedoms when they posted their original patch.
I will copy and paste here an answer that I gave in another
similar case, where my signature was forged:
https://lore.kernel.org/all/ZCM3mchRbwxU15L1@intel.com/
First of all, there's a very common misunderstanding of the meaning of the
'Signed-off-by:' (sob).
**hint**: It does *not* mean 'authorship'!
Although we are in an IGT patch, let's use the kernel definition so we
are aligned in some well documented rule:
https://www.kernel.org/doc/html/latest/process/submitting-patches.html?highlight=signed%20off#developer-s-certificate-of-origin-1-1
So, as defined in the official rules above, in this very specific case,
when you created the patch, your 'sob' certified ('b') that:
"The contribution is based upon previous work that, to the best of my knowledge,
is covered under an appropriate open source license and I have the right under
that license to submit that work with modifications"
Any extra SoB would be added along the way as the patch is transported:
"Any further SoBs (Signed-off-by:’s) following the author’s SoB are from people
handling and transporting the patch, but were not involved in its development.
SoB chains should reflect the real route a patch took as it was propagated to
the maintainers and ultimately to Linus, with the first SoB entry signalling
primary authorship of a single author."
This matches 'c' of the certificate of origin: "The contribution was provided directly
to me by some other person who certified (a), (b) or (c) and I have not modified it."
It is very important to highlight these transportation rules because there
are many new devs who think that we maintainers are stealing ownership.
As you can see, this is not the case, but the rule.
Back to your case, since I had never seen this patch in my life before it hit
the mailing list, I couldn't have certified this patch in any possible way,
so the forged sob is the improper approach.
It is very hard to define a written rule on what to do with the code copied
from one driver to the other. In many cases the recognition is important,
but in other cases it is even hard to find who was actually the true author of
that code.
There are many options available. A simple one could be to 'Cc' the person and
write in the commit message that the code was based on that person's work in
the other driver, but maybe there are better options available. So let's say that
when in doubt: ask. Contact the original author and ask what he/she has
to suggest. Maybe just this mention and cc would be enough, maybe even with
an acked-by or with the explicit sob, or maybe with some other tag like
'co-developed-by'.
>
> > ---
> > drivers/gpu/drm/xe/regs/xe_regs.h | 29 ++++++++
> > drivers/gpu/drm/xe/xe_irq.c | 108 ++++++++++++++++++++++++++++++
> > 2 files changed, 137 insertions(+)
> >
> > diff --git a/drivers/gpu/drm/xe/regs/xe_regs.h b/drivers/gpu/drm/xe/regs/xe_regs.h
> > index c1c829c23df1..dff74b093d4e 100644
> > --- a/drivers/gpu/drm/xe/regs/xe_regs.h
> > +++ b/drivers/gpu/drm/xe/regs/xe_regs.h
> > @@ -92,6 +92,10 @@
> > #define GEN11_GU_MISC_IRQ (1 << 29)
> > #define GEN11_DISPLAY_IRQ (1 << 16)
> > #define GEN11_GT_DW_IRQ(x) (1 << (x))
> > +#define XE_FATAL_ERROR_IRQ REG_BIT(28)
> > +#define XE_NON_FATAL_ERROR_IRQ REG_BIT(27)
> > +#define XE_CORRECTABLE_ERROR_IRQ REG_BIT(26)
>
> You only use the parameterized macro below; the three here never get
> used.
>
> > +#define XE_ERROR_IRQ(x) REG_BIT(26 + (x))
> >
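To make the point about the unused defines concrete, here is a minimal
userspace sketch, not driver code. REG_BIT() is re-created locally as a
stand-in for the kernel macro, and xe_error_irq_mask() is a hypothetical
helper that exists only for this illustration. It shows that the three
fixed names are the same bits the parameterized macro already produces,
so they could either be dropped or be defined in terms of it.

```c
#include <stdint.h>

/* Userspace stand-in for the kernel's REG_BIT(); an assumption for this sketch. */
#define REG_BIT(n) ((uint32_t)1 << (n))

/* Same ordering as enum hardware_error in the patch. */
enum hardware_error {
	HARDWARE_ERROR_CORRECTABLE = 0,
	HARDWARE_ERROR_NONFATAL = 1,
	HARDWARE_ERROR_FATAL = 2,
	HARDWARE_ERROR_MAX,
};

/* The parameterized macro from the patch. */
#define XE_ERROR_IRQ(x) REG_BIT(26 + (x))

/* The fixed names expressed through the parameterized macro. */
#define XE_CORRECTABLE_ERROR_IRQ XE_ERROR_IRQ(HARDWARE_ERROR_CORRECTABLE)
#define XE_NON_FATAL_ERROR_IRQ   XE_ERROR_IRQ(HARDWARE_ERROR_NONFATAL)
#define XE_FATAL_ERROR_IRQ       XE_ERROR_IRQ(HARDWARE_ERROR_FATAL)

/* Compile-time check that both spellings name bits 26, 27 and 28. */
_Static_assert(XE_CORRECTABLE_ERROR_IRQ == REG_BIT(26), "correctable is bit 26");
_Static_assert(XE_NON_FATAL_ERROR_IRQ == REG_BIT(27), "non-fatal is bit 27");
_Static_assert(XE_FATAL_ERROR_IRQ == REG_BIT(28), "fatal is bit 28");

/* Hypothetical helper: OR of every error class bit in the master register. */
static uint32_t xe_error_irq_mask(void)
{
	uint32_t mask = 0;
	int hw_err;

	for (hw_err = 0; hw_err < HARDWARE_ERROR_MAX; hw_err++)
		mask |= XE_ERROR_IRQ(hw_err);
	return mask;
}
```
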
> > #define DG1_MSTR_TILE_INTR _MMIO(0x190008)
> > #define DG1_MSTR_IRQ REG_BIT(31)
> > @@ -111,4 +115,29 @@
> > #define GEN12_DSMBASE _MMIO(0x1080C0)
> > #define GEN12_BDSM_MASK REG_GENMASK64(63, 20)
> >
> > +enum hardware_error {
> > + HARDWARE_ERROR_CORRECTABLE = 0,
> > + HARDWARE_ERROR_NONFATAL = 1,
> > + HARDWARE_ERROR_FATAL = 2,
> > + HARDWARE_ERROR_MAX,
> > +};
> > +
> > +#define _DEV_ERR_STAT_FATAL 0x100174
> > +#define _DEV_ERR_STAT_NONFATAL 0x100178
> > +#define _DEV_ERR_STAT_CORRECTABLE 0x10017c
> > +#define DEV_ERR_STAT_REG(x) _MMIO(_PICK_EVEN((x), \
> > + _DEV_ERR_STAT_CORRECTABLE, \
> > + _DEV_ERR_STAT_NONFATAL))
> > +#define DEV_ERR_STAT_GT_ERROR REG_BIT(0)
> > +
> > +#define _ERR_STAT_GT_COR 0x100160
> > +#define _ERR_STAT_GT_NONFATAL 0x100164
> > +#define _ERR_STAT_GT_FATAL 0x100168
> > +#define ERR_STAT_GT_REG(x) _MMIO(_PICK_EVEN((x), \
> > + _ERR_STAT_GT_COR, \
> > + _ERR_STAT_GT_NONFATAL))
> > +
> > +#define EU_GRF_ERROR REG_BIT(15)
> > +#define EU_IC_ERROR REG_BIT(14)
> > +
> > #endif
> > diff --git a/drivers/gpu/drm/xe/xe_irq.c b/drivers/gpu/drm/xe/xe_irq.c
> > index 529b42d9c9af..6b922332bff1 100644
> > --- a/drivers/gpu/drm/xe/xe_irq.c
> > +++ b/drivers/gpu/drm/xe/xe_irq.c
> > @@ -344,6 +344,113 @@ static void dg1_irq_postinstall(struct xe_device *xe, struct xe_gt *gt)
> > dg1_intr_enable(xe, true);
> > }
> >
> > +static const char *
> > +hardware_error_type_to_str(const enum hardware_error hw_err)
> > +{
> > + switch (hw_err) {
> > + case HARDWARE_ERROR_CORRECTABLE:
> > + return "CORRECTABLE";
> > + case HARDWARE_ERROR_NONFATAL:
> > + return "NONFATAL";
> > + case HARDWARE_ERROR_FATAL:
> > + return "FATAL";
> > + default:
> > + return "UNKNOWN";
> > + }
> > +}
> > +
> > +static void
> > +xe_gt_hw_error_handler(struct xe_gt *gt, const enum hardware_error hw_err)
> > +{
> > + const char *hw_err_str = hardware_error_type_to_str(hw_err);
> > + u32 other_errors = ~(EU_GRF_ERROR | EU_IC_ERROR);
> > + u32 errstat;
> > +
> > + lockdep_assert_held(&gt_to_xe(gt)->irq.lock);
> > +
> > + errstat = xe_mmio_read32(gt, ERR_STAT_GT_REG(hw_err).reg);
> > +
> > + if (unlikely(!errstat)) {
> > + DRM_ERROR("ERR_STAT_GT_REG_%s blank!\n", hw_err_str);
>
> We should be using the per-device error macros (e.g., drm_err) rather
> than the legacy macros. In fact maybe we should be using the
> _ratelimited variants so that if something does go wrong (and keeps
> going wrong), we're not spamming so much output into the log from an
> interrupt handler that we crash the system.
>
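For readers unfamiliar with what the _ratelimited variants buy us, a
rough userspace sketch of the idea follows. It is only loosely modeled on
the kernel's interval/burst ratelimit state. struct ratelimit and
ratelimit_ok() are names made up for this illustration and are not the
kernel API. Time is passed in by the caller so the behavior is easy to
reason about.

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical limiter state: at most 'burst' messages per 'interval' ticks. */
struct ratelimit {
	uint64_t interval;    /* window length, in caller-defined ticks */
	unsigned int burst;   /* messages allowed per window */
	uint64_t begin;       /* start of the current window */
	unsigned int printed; /* messages emitted in the current window */
	unsigned int missed;  /* messages suppressed so far */
};

/* Returns true if a message may be printed at time 'now'. */
static bool ratelimit_ok(struct ratelimit *rs, uint64_t now)
{
	/* Open a fresh window once the previous one has expired. */
	if (now - rs->begin >= rs->interval) {
		rs->begin = now;
		rs->printed = 0;
	}

	if (rs->printed < rs->burst) {
		rs->printed++;
		return true;
	}

	/* Over budget: drop the message but remember that we did. */
	rs->missed++;
	return false;
}
```

An interrupt that keeps firing then costs at most 'burst' log lines per
window instead of one line per interrupt.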
> > + return;
> > + }
> > +
> > + /*
> > + * TODO: The GT Non Fatal Error Status Register
> > + * only has reserved bitfields defined.
> > + * Remove once there is something to service.
> > + */
> > + if (hw_err == HARDWARE_ERROR_NONFATAL) {
> > + DRM_ERROR("detected Non-Fatal error\n");
> > + xe_mmio_write32(gt, ERR_STAT_GT_REG(hw_err).reg, errstat);
> > + return;
> > + }
> > +
> > + /*
> > + * TODO: The remaining GT errors don't have a
> > + * need for targeted logging at the moment. We
> > + * still want to log detection of these errors, but
> > + * let's aggregate them until someone has a need for them.
> > + */
> > + if (errstat & other_errors)
>
> So is the goal here to just silently ignore GRF and IC errors (unless
> they're non-fatal, which is impossible), but print a message about
> everything else? That seems to be what the code is doing right now.
>
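The effect Matt describes can be checked with a small userspace sketch.
REG_BIT() is re-created locally and logged_bits() is a hypothetical helper
that only mirrors the mask arithmetic from the hunk above. It is not driver
code.

```c
#include <stdint.h>

/* Userspace stand-in for the kernel's REG_BIT(); an assumption for this sketch. */
#define REG_BIT(n) ((uint32_t)1 << (n))

/* Bit definitions as in this version of the patch. */
#define EU_GRF_ERROR REG_BIT(15)
#define EU_IC_ERROR  REG_BIT(14)

/* Hypothetical helper: the bits that would end up in the log message. */
static uint32_t logged_bits(uint32_t errstat)
{
	const uint32_t other_errors = ~(EU_GRF_ERROR | EU_IC_ERROR);

	/* GRF and IC are masked out, so they never reach the message. */
	return errstat & other_errors;
}
```

With only GRF and/or IC set the result is zero, so nothing is printed even
though those are the two errors the patch title is about.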
> > + DRM_ERROR("detected hardware error(s) in ERR_STAT_GT_REG_%s: 0x%08x\n",
> > + hw_err_str, errstat & other_errors);
>
> For correctable errors should we even be printing an error? I thought
> correctable errors were basically something the hardware could clean up
> by itself and there's no impact to the user. It seems like if we raise
> DRM errors on correctable errors we're spamming the user about something
> they aren't supposed to need to worry about?
>
> > +
> > + xe_mmio_write32(gt, ERR_STAT_GT_REG(hw_err).reg, errstat);
> > +}
> > +
> > +static void
> > +xe_hw_error_source_handler(struct xe_gt *gt, const enum hardware_error hw_err)
> > +{
> > + const char *hw_err_str = hardware_error_type_to_str(hw_err);
> > + unsigned long flags;
> > + u32 errsrc;
> > +
> > + spin_lock_irqsave(&gt_to_xe(gt)->irq.lock, flags);
> > + errsrc = xe_mmio_read32(gt, DEV_ERR_STAT_REG(hw_err).reg);
> > + if (unlikely(!errsrc)) {
> > + DRM_ERROR("DEV_ERR_STAT_REG_%s blank!\n", hw_err_str);
>
> Is this actually an error? You've wrapped this function in a spinlock,
> so if there are two concurrent entries to this function, the one that
> winds up waiting is likely to see a register value of zero once the
> first one clears the interrupts and releases the lock...
>
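A tiny model of a write-1-to-clear (RW1C) register shows why the second
entrant can legitimately read zero. Everything here (struct rw1c_reg, the
accessors, service()) is hypothetical and single-threaded. The two calls
simply stand for two handlers that take the lock one after the other.

```c
#include <stdint.h>

/* Hypothetical model of one RW1C status register. */
struct rw1c_reg {
	uint32_t bits;
};

static uint32_t rw1c_read(const struct rw1c_reg *r)
{
	return r->bits;
}

/* Writing a 1 to a bit position clears it; zeros leave bits untouched. */
static void rw1c_write(struct rw1c_reg *r, uint32_t val)
{
	r->bits &= ~val;
}

/* Returns the bits this caller serviced; 0 means nothing was pending. */
static uint32_t service(struct rw1c_reg *r)
{
	uint32_t errsrc = rw1c_read(r);

	if (!errsrc)
		return 0; /* someone else already acknowledged the error */

	rw1c_write(r, errsrc);
	return errsrc;
}
```

The first service() call returns the pending bits, the second one returns
zero. In this model that is ordinary behavior, not a fault.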
>
> Matt
>
> > + goto out_unlock;
> > + }
> > +
> > + if (errsrc & DEV_ERR_STAT_GT_ERROR)
> > + xe_gt_hw_error_handler(gt, hw_err);
> > +
> > + if (errsrc & ~DEV_ERR_STAT_GT_ERROR)
> > + DRM_ERROR("non-GT hardware error(s) in DEV_ERR_STAT_REG_%s: 0x%08x\n",
> > + hw_err_str, errsrc & ~DEV_ERR_STAT_GT_ERROR);
> > +
> > + xe_mmio_write32(gt, DEV_ERR_STAT_REG(hw_err).reg, errsrc);
> > +
> > +out_unlock:
> > + spin_unlock_irqrestore(&gt_to_xe(gt)->irq.lock, flags);
> > +}
> > +
> > +/*
> > + * XE Platforms adds three Error bits to the Master Interrupt
> > + * Register to support dgfx card error handling.
> > + * These three bits are used to convey the class of error:
> > + * FATAL, NONFATAL, or CORRECTABLE.
> > + *
> > + * To process an interrupt:
> > + * 1. Determine source of error (IP block) by reading
> > + * the Device Error Source Register (RW1C) that
> > + * corresponds to the class of error being serviced.
> > + * 2. For GT as the generating IP block, read and log
> > + * the GT Error Register (RW1C) that corresponds to
> > + * the class of error being serviced.
> > + */
> > +static void
> > +xe_hw_error_irq_handler(struct xe_gt *gt, const u32 master_ctl)
> > +{
> > + enum hardware_error hw_err;
> > +
> > + for (hw_err = 0; hw_err < HARDWARE_ERROR_MAX; hw_err++) {
> > + if (master_ctl & XE_ERROR_IRQ(hw_err))
> > + xe_hw_error_source_handler(gt, hw_err);
> > + }
> > +}
> > +
> > static irqreturn_t dg1_irq_handler(int irq, void *arg)
> > {
> > struct xe_device *xe = arg;
> > @@ -382,6 +489,7 @@ static irqreturn_t dg1_irq_handler(int irq, void *arg)
> > if (!xe_gt_is_media_type(gt))
> > xe_mmio_write32(gt, GEN11_GFX_MSTR_IRQ.reg, master_ctl);
> > gen11_gt_irq_handler(xe, gt, master_ctl, intr_dw, identity);
> > + xe_hw_error_irq_handler(gt, master_ctl);
> > }
> >
> > xe_display_irq_handler(xe, master_ctl);
> > --
> > 2.25.1
> >
>
> --
> Matt Roper
> Graphics Software Engineer
> Linux GPU Platform Enablement
> Intel Corporation
* [Intel-xe] [PATCH 2/4] drm/xe/ras: Log the GT hw errors.
2023-04-06 9:26 [Intel-xe] [PATCH 0/4] RFC: drm/xe/ras: Supporting RAS on XE Himal Prasad Ghimiray
2023-04-06 9:26 ` [Intel-xe] ✗ CI.Patch_applied: failure for " Patchwork
2023-04-06 9:26 ` [Intel-xe] [PATCH 1/4] drm/xe: Handle GRF/IC ECC error irq Himal Prasad Ghimiray
@ 2023-04-06 9:26 ` Himal Prasad Ghimiray
2023-04-24 13:37 ` Dafna Hirschfeld
` (2 more replies)
2023-04-06 9:26 ` [Intel-xe] [PATCH 3/4] drm/xe/ras: Count SOC and SGUNIT errors Himal Prasad Ghimiray
` (2 subsequent siblings)
5 siblings, 3 replies; 22+ messages in thread
From: Himal Prasad Ghimiray @ 2023-04-06 9:26 UTC (permalink / raw)
To: intel-xe; +Cc: Himal Prasad Ghimiray
From: Aravind Iddamsetty <aravind.iddamsetty@intel.com>
Count the CORRECTABLE and FATAL GT hardware errors as
signaled by relevant interrupt and respective registers.
For non relevant interrupts count them as driver interrupt error.
For platform supporting error vector registers count and report
the respective vector errors.
Co-authored-by: Himal Prasad Ghimiray <himal.prasad.ghimiray@intel.com>
Signed-off-by: Aravind Iddamsetty <aravind.iddamsetty@intel.com>
Signed-off-by: Himal Prasad Ghimiray <himal.prasad.ghimiray@intel.com>
---
drivers/gpu/drm/xe/regs/xe_regs.h | 77 ++++++-
drivers/gpu/drm/xe/xe_device_types.h | 2 +
drivers/gpu/drm/xe/xe_gt.c | 29 +++
drivers/gpu/drm/xe/xe_gt_types.h | 43 ++++
drivers/gpu/drm/xe/xe_irq.c | 332 ++++++++++++++++++++++++---
drivers/gpu/drm/xe/xe_pci.c | 3 +
6 files changed, 453 insertions(+), 33 deletions(-)
diff --git a/drivers/gpu/drm/xe/regs/xe_regs.h b/drivers/gpu/drm/xe/regs/xe_regs.h
index dff74b093d4e..b3d35d0c5a77 100644
--- a/drivers/gpu/drm/xe/regs/xe_regs.h
+++ b/drivers/gpu/drm/xe/regs/xe_regs.h
@@ -122,14 +122,50 @@ enum hardware_error {
HARDWARE_ERROR_MAX,
};
+#define DEV_PCIEERR_STATUS _MMIO(0x100180)
+#define DEV_PCIEERR_IS_FATAL(x) (REG_BIT(2) << (x * 4))
#define _DEV_ERR_STAT_FATAL 0x100174
#define _DEV_ERR_STAT_NONFATAL 0x100178
#define _DEV_ERR_STAT_CORRECTABLE 0x10017c
#define DEV_ERR_STAT_REG(x) _MMIO(_PICK_EVEN((x), \
_DEV_ERR_STAT_CORRECTABLE, \
_DEV_ERR_STAT_NONFATAL))
+
#define DEV_ERR_STAT_GT_ERROR REG_BIT(0)
+enum gt_vctr_registers {
+ ERR_STAT_GT_VCTR0 = 0,
+ ERR_STAT_GT_VCTR1,
+ ERR_STAT_GT_VCTR2,
+ ERR_STAT_GT_VCTR3,
+ ERR_STAT_GT_VCTR4,
+ ERR_STAT_GT_VCTR5,
+ ERR_STAT_GT_VCTR6,
+ ERR_STAT_GT_VCTR7,
+};
+
+#define ERR_STAT_GT_COR_VCTR_LEN (4)
+#define _ERR_STAT_GT_COR_VCTR_0 0x1002a0
+#define _ERR_STAT_GT_COR_VCTR_1 0x1002a4
+#define _ERR_STAT_GT_COR_VCTR_2 0x1002a8
+#define _ERR_STAT_GT_COR_VCTR_3 0x1002ac
+#define ERR_STAT_GT_COR_VCTR_REG(x) _MMIO(_PICK_EVEN((x), \
+ _ERR_STAT_GT_COR_VCTR_0, \
+ _ERR_STAT_GT_COR_VCTR_1))
+
+#define ERR_STAT_GT_FATAL_VCTR_LEN (8)
+#define _ERR_STAT_GT_FATAL_VCTR_0 0x100260
+#define _ERR_STAT_GT_FATAL_VCTR_1 0x100264
+#define _ERR_STAT_GT_FATAL_VCTR_2 0x100268
+#define _ERR_STAT_GT_FATAL_VCTR_3 0x10026c
+#define _ERR_STAT_GT_FATAL_VCTR_4 0x100270
+#define _ERR_STAT_GT_FATAL_VCTR_5 0x100274
+#define _ERR_STAT_GT_FATAL_VCTR_6 0x100278
+#define _ERR_STAT_GT_FATAL_VCTR_7 0x10027c
+#define ERR_STAT_GT_FATAL_VCTR_REG(x) _MMIO(_PICK_EVEN((x), \
+ _ERR_STAT_GT_FATAL_VCTR_0, \
+ _ERR_STAT_GT_FATAL_VCTR_1))
+
#define _ERR_STAT_GT_COR 0x100160
#define _ERR_STAT_GT_NONFATAL 0x100164
#define _ERR_STAT_GT_FATAL 0x100168
@@ -137,7 +173,42 @@ enum hardware_error {
_ERR_STAT_GT_COR, \
_ERR_STAT_GT_NONFATAL))
-#define EU_GRF_ERROR REG_BIT(15)
-#define EU_IC_ERROR REG_BIT(14)
-
+#define EU_GRF_COR_ERR (15)
+#define EU_IC_COR_ERR (14)
+#define SLM_COR_ERR (13)
+#define SAMPLER_COR_ERR (12)
+#define GUC_COR_ERR (1)
+#define L3_SNG_COR_ERR (0)
+
+#define PVC_COR_ERR_MASK \
+ (REG_BIT(GUC_COR_ERR) | \
+ REG_BIT(SLM_COR_ERR) | \
+ REG_BIT(EU_IC_COR_ERR) | \
+ REG_BIT(EU_GRF_COR_ERR))
+
+#define EU_GRF_FAT_ERR (15)
+#define EU_IC_FAT_ERR (14)
+#define SLM_FAT_ERR (13)
+#define SAMPLER_FAT_ERR (12)
+#define SQIDI_FAT_ERR (9)
+#define IDI_PAR_FAT_ERR (8)
+#define GUC_FAT_ERR (6)
+#define L3_ECC_CHK_FAT_ERR (5)
+#define L3_DOUBLE_FAT_ERR (4)
+#define FPU_UNCORR_FAT_ERR (3)
+#define ARRAY_BIST_FAT_ERR (1)
+
+#define PVC_FAT_ERR_MASK \
+ (REG_BIT(FPU_UNCORR_FAT_ERR) | \
+ REG_BIT(GUC_FAT_ERR) | \
+ REG_BIT(SLM_FAT_ERR) | \
+ REG_BIT(EU_GRF_FAT_ERR))
+
+#define GT_HW_ERROR_MAX_ERR_BITS 16
+
+#define _SLM_ECC_ERROR_CNT 0xe7f4
+#define _SLM_UNCORR_ECC_ERROR_CNT 0xe7c0
+#define SLM_ECC_ERROR_CNTR(x) _MMIO((x) == HARDWARE_ERROR_CORRECTABLE ? \
+ _SLM_ECC_ERROR_CNT : \
+ _SLM_UNCORR_ECC_ERROR_CNT)
#endif
diff --git a/drivers/gpu/drm/xe/xe_device_types.h b/drivers/gpu/drm/xe/xe_device_types.h
index 88f863edc41c..ecabf4d6690d 100644
--- a/drivers/gpu/drm/xe/xe_device_types.h
+++ b/drivers/gpu/drm/xe/xe_device_types.h
@@ -99,6 +99,8 @@ struct xe_device {
bool has_link_copy_engine;
/** @enable_display: display enabled */
bool enable_display;
+ /** @has_gt_error_vectors: whether platform supports ERROR VECTORS */
+ bool has_gt_error_vectors;
#if IS_ENABLED(CONFIG_DRM_XE_DISPLAY)
struct xe_device_display_info {
diff --git a/drivers/gpu/drm/xe/xe_gt.c b/drivers/gpu/drm/xe/xe_gt.c
index bc821f431c45..ce9ce2748394 100644
--- a/drivers/gpu/drm/xe/xe_gt.c
+++ b/drivers/gpu/drm/xe/xe_gt.c
@@ -44,6 +44,35 @@
#include "xe_wa.h"
#include "xe_wopcm.h"
+static const char * const xe_gt_driver_errors_to_str[] = {
+ [INTEL_GT_DRIVER_ERROR_INTERRUPT] = "INTERRUPT",
+};
+
+void xe_gt_log_driver_error(struct xe_gt *gt,
+ const enum xe_gt_driver_errors error,
+ const char *fmt, ...)
+{
+ struct va_format vaf;
+ va_list args;
+
+ va_start(args, fmt);
+ vaf.fmt = fmt;
+ vaf.va = &args;
+
+ BUILD_BUG_ON(ARRAY_SIZE(xe_gt_driver_errors_to_str) !=
+ INTEL_GT_DRIVER_ERROR_COUNT);
+
+ WARN_ON_ONCE(error >= INTEL_GT_DRIVER_ERROR_COUNT);
+
+ gt->errors.driver[error]++;
+
+ drm_err_ratelimited(&gt_to_xe(gt)->drm, "GT%u [%s] %pV",
+ gt->info.id,
+ xe_gt_driver_errors_to_str[error],
+ &vaf);
+ va_end(args);
+}
+
struct xe_gt *xe_find_full_gt(struct xe_gt *gt)
{
struct xe_gt *search;
diff --git a/drivers/gpu/drm/xe/xe_gt_types.h b/drivers/gpu/drm/xe/xe_gt_types.h
index 8f29aba455e0..9580a40c0142 100644
--- a/drivers/gpu/drm/xe/xe_gt_types.h
+++ b/drivers/gpu/drm/xe/xe_gt_types.h
@@ -33,6 +33,43 @@ enum xe_gt_type {
typedef unsigned long xe_dss_mask_t[BITS_TO_LONGS(32 * XE_MAX_DSS_FUSE_REGS)];
typedef unsigned long xe_eu_mask_t[BITS_TO_LONGS(32 * XE_MAX_EU_FUSE_REGS)];
+/* Count of GT Correctable and FATAL HW ERRORS */
+enum intel_gt_hw_errors {
+ INTEL_GT_HW_ERROR_COR_SUBSLICE = 0,
+ INTEL_GT_HW_ERROR_COR_L3BANK,
+ INTEL_GT_HW_ERROR_COR_L3_SNG,
+ INTEL_GT_HW_ERROR_COR_GUC,
+ INTEL_GT_HW_ERROR_COR_SAMPLER,
+ INTEL_GT_HW_ERROR_COR_SLM,
+ INTEL_GT_HW_ERROR_COR_EU_IC,
+ INTEL_GT_HW_ERROR_COR_EU_GRF,
+ INTEL_GT_HW_ERROR_FAT_SUBSLICE,
+ INTEL_GT_HW_ERROR_FAT_L3BANK,
+ INTEL_GT_HW_ERROR_FAT_ARR_BIST,
+ INTEL_GT_HW_ERROR_FAT_FPU,
+ INTEL_GT_HW_ERROR_FAT_L3_DOUB,
+ INTEL_GT_HW_ERROR_FAT_L3_ECC_CHK,
+ INTEL_GT_HW_ERROR_FAT_GUC,
+ INTEL_GT_HW_ERROR_FAT_IDI_PAR,
+ INTEL_GT_HW_ERROR_FAT_SQIDI,
+ INTEL_GT_HW_ERROR_FAT_SAMPLER,
+ INTEL_GT_HW_ERROR_FAT_SLM,
+ INTEL_GT_HW_ERROR_FAT_EU_IC,
+ INTEL_GT_HW_ERROR_FAT_EU_GRF,
+ INTEL_GT_HW_ERROR_FAT_TLB,
+ INTEL_GT_HW_ERROR_FAT_L3_FABRIC,
+ INTEL_GT_HW_ERROR_COUNT
+};
+
+enum xe_gt_driver_errors {
+ INTEL_GT_DRIVER_ERROR_INTERRUPT = 0,
+ INTEL_GT_DRIVER_ERROR_COUNT
+};
+
+void xe_gt_log_driver_error(struct xe_gt *gt,
+ const enum xe_gt_driver_errors error,
+ const char *fmt, ...);
+
struct xe_mmio_range {
u32 start;
u32 end;
@@ -357,6 +394,12 @@ struct xe_gt {
* of a steered operation
*/
spinlock_t mcr_lock;
+
+ struct intel_hw_errors {
+ unsigned long hw[INTEL_GT_HW_ERROR_COUNT];
+ unsigned long driver[INTEL_GT_DRIVER_ERROR_COUNT];
+ } errors;
+
};
#endif
diff --git a/drivers/gpu/drm/xe/xe_irq.c b/drivers/gpu/drm/xe/xe_irq.c
index 6b922332bff1..4626f7280aaf 100644
--- a/drivers/gpu/drm/xe/xe_irq.c
+++ b/drivers/gpu/drm/xe/xe_irq.c
@@ -19,6 +19,7 @@
#include "xe_hw_engine.h"
#include "xe_mmio.h"
+#define HAS_GT_ERROR_VECTORS(xe) ((xe)->info.has_gt_error_vectors)
static void gen3_assert_iir_is_zero(struct xe_gt *gt, i915_reg_t reg)
{
u32 val = xe_mmio_read32(gt, reg.reg);
@@ -359,44 +360,281 @@ hardware_error_type_to_str(const enum hardware_error hw_err)
}
}
+#define xe_gt_hw_err(gt, fmt, ...) \
+ drm_err_ratelimited(&gt_to_xe(gt)->drm, HW_ERR "GT%d detected " fmt, \
+ (gt)->info.id, ##__VA_ARGS__)
+
static void
-xe_gt_hw_error_handler(struct xe_gt *gt, const enum hardware_error hw_err)
+xe_gt_correctable_hw_error_stats_update(struct xe_gt *gt, unsigned long errstat)
{
- const char *hw_err_str = hardware_error_type_to_str(hw_err);
- u32 other_errors = ~(EU_GRF_ERROR | EU_IC_ERROR);
- u32 errstat;
+ u32 errbit, cnt;
- lockdep_assert_held(&gt_to_xe(gt)->irq.lock);
+ if (!errstat && HAS_GT_ERROR_VECTORS(gt_to_xe(gt)))
+ return;
- errstat = xe_mmio_read32(gt, ERR_STAT_GT_REG(hw_err).reg);
+ for_each_set_bit(errbit, &errstat, GT_HW_ERROR_MAX_ERR_BITS) {
+ if (gt->xe->info.platform == XE_PVC && !(REG_BIT(errbit) & PVC_COR_ERR_MASK)) {
+ xe_gt_log_driver_error(gt, INTEL_GT_DRIVER_ERROR_INTERRUPT,
+ "UNKNOWN CORRECTABLE error\n");
+ continue;
+ }
- if (unlikely(!errstat)) {
- DRM_ERROR("ERR_STAT_GT_REG_%s blank!\n", hw_err_str);
- return;
+ switch (errbit) {
+ case L3_SNG_COR_ERR:
+ gt->errors.hw[INTEL_GT_HW_ERROR_COR_L3_SNG]++;
+ xe_gt_hw_err(gt, "L3 SINGLE CORRECTABLE error\n");
+ break;
+ case GUC_COR_ERR:
+ gt->errors.hw[INTEL_GT_HW_ERROR_COR_GUC]++;
+ xe_gt_hw_err(gt, "SINGLE BIT GUC SRAM CORRECTABLE error\n");
+ break;
+ case SAMPLER_COR_ERR:
+ gt->errors.hw[INTEL_GT_HW_ERROR_COR_SAMPLER]++;
+ xe_gt_hw_err(gt, "SINGLE BIT SAMPLER CORRECTABLE error\n");
+ break;
+ case SLM_COR_ERR:
+ cnt = xe_mmio_read32(gt, SLM_ECC_ERROR_CNTR(HARDWARE_ERROR_CORRECTABLE).reg);
+ gt->errors.hw[INTEL_GT_HW_ERROR_COR_SLM] = cnt;
+ xe_gt_hw_err(gt, "%u SINGLE BIT SLM CORRECTABLE error\n", cnt);
+ break;
+ case EU_IC_COR_ERR:
+ gt->errors.hw[INTEL_GT_HW_ERROR_COR_EU_IC]++;
+ xe_gt_hw_err(gt, "SINGLE BIT EU IC CORRECTABLE error\n");
+ break;
+ case EU_GRF_COR_ERR:
+ gt->errors.hw[INTEL_GT_HW_ERROR_COR_EU_GRF]++;
+ xe_gt_hw_err(gt, "SINGLE BIT EU GRF CORRECTABLE error\n");
+ break;
+ default:
+ xe_gt_log_driver_error(gt, INTEL_GT_DRIVER_ERROR_INTERRUPT, "UNKNOWN CORRECTABLE error\n");
+ break;
+ }
}
+}
- /*
- * TODO: The GT Non Fatal Error Status Register
- * only has reserved bitfields defined.
- * Remove once there is something to service.
- */
- if (hw_err == HARDWARE_ERROR_NONFATAL) {
- DRM_ERROR("detected Non-Fatal error\n");
- xe_mmio_write32(gt, ERR_STAT_GT_REG(hw_err).reg, errstat);
+static void xe_gt_fatal_hw_error_stats_update(struct xe_gt *gt, unsigned long errstat)
+{
+ u32 errbit, cnt;
+
+ if (!errstat && HAS_GT_ERROR_VECTORS(gt_to_xe(gt)))
return;
+
+ for_each_set_bit(errbit, &errstat, GT_HW_ERROR_MAX_ERR_BITS) {
+ if (gt->xe->info.platform == XE_PVC && !(REG_BIT(errbit) & PVC_FAT_ERR_MASK)) {
+ xe_gt_log_driver_error(gt, INTEL_GT_DRIVER_ERROR_INTERRUPT,
+ "UNKNOWN FATAL error\n");
+ continue;
+ }
+
+ switch (errbit) {
+ case ARRAY_BIST_FAT_ERR:
+ gt->errors.hw[INTEL_GT_HW_ERROR_FAT_ARR_BIST]++;
+ xe_gt_hw_err(gt, "Array BIST FATAL error\n");
+ break;
+ case FPU_UNCORR_FAT_ERR:
+ gt->errors.hw[INTEL_GT_HW_ERROR_FAT_FPU]++;
+ xe_gt_hw_err(gt, "FPU FATAL error\n");
+ break;
+ case L3_DOUBLE_FAT_ERR:
+ gt->errors.hw[INTEL_GT_HW_ERROR_FAT_L3_DOUB]++;
+ xe_gt_hw_err(gt, "L3 Double FATAL error\n");
+ break;
+ case L3_ECC_CHK_FAT_ERR:
+ gt->errors.hw[INTEL_GT_HW_ERROR_FAT_L3_ECC_CHK]++;
+ xe_gt_hw_err(gt, "L3 ECC Checker FATAL error\n");
+ break;
+ case GUC_FAT_ERR:
+ gt->errors.hw[INTEL_GT_HW_ERROR_FAT_GUC]++;
+ xe_gt_hw_err(gt, "GUC SRAM FATAL error\n");
+ break;
+ case IDI_PAR_FAT_ERR:
+ gt->errors.hw[INTEL_GT_HW_ERROR_FAT_IDI_PAR]++;
+ xe_gt_hw_err(gt, "IDI PARITY FATAL error\n");
+ break;
+ case SQIDI_FAT_ERR:
+ gt->errors.hw[INTEL_GT_HW_ERROR_FAT_SQIDI]++;
+ xe_gt_hw_err(gt, "SQIDI FATAL error\n");
+ break;
+ case SAMPLER_FAT_ERR:
+ gt->errors.hw[INTEL_GT_HW_ERROR_FAT_SAMPLER]++;
+ xe_gt_hw_err(gt, "SAMPLER FATAL error\n");
+ break;
+ case SLM_FAT_ERR:
+ cnt = xe_mmio_read32(gt, SLM_ECC_ERROR_CNTR(HARDWARE_ERROR_FATAL).reg);
+ gt->errors.hw[INTEL_GT_HW_ERROR_FAT_SLM] = cnt;
+ xe_gt_hw_err(gt, "%u SLM FATAL error\n", cnt);
+ break;
+ case EU_IC_FAT_ERR:
+ gt->errors.hw[INTEL_GT_HW_ERROR_FAT_EU_IC]++;
+ xe_gt_hw_err(gt, "EU IC FATAL error\n");
+ break;
+ case EU_GRF_FAT_ERR:
+ gt->errors.hw[INTEL_GT_HW_ERROR_FAT_EU_GRF]++;
+ xe_gt_hw_err(gt, "EU GRF FATAL error\n");
+ break;
+ default:
+ xe_gt_log_driver_error(gt, INTEL_GT_DRIVER_ERROR_INTERRUPT,
+ "UNKNOWN FATAL error\n");
+ break;
+ }
}
+}
- /*
- * TODO: The remaining GT errors don't have a
- * need for targeted logging at the moment. We
- * still want to log detection of these errors, but
- * let's aggregate them until someone has a need for them.
- */
- if (errstat & other_errors)
- DRM_ERROR("detected hardware error(s) in ERR_STAT_GT_REG_%s: 0x%08x\n",
- hw_err_str, errstat & other_errors);
+static void
+xe_gt_hw_error_handler(struct xe_gt *gt, const enum hardware_error hw_err)
+{
+ const char *hw_err_str = hardware_error_type_to_str(hw_err);
+ unsigned long errstat;
+
+ lockdep_assert_held(&gt_to_xe(gt)->irq.lock);
- xe_mmio_write32(gt, ERR_STAT_GT_REG(hw_err).reg, errstat);
+ if (!HAS_GT_ERROR_VECTORS(gt_to_xe(gt))) {
+ errstat = xe_mmio_read32(gt, ERR_STAT_GT_REG(hw_err).reg);
+ if (unlikely(!errstat)) {
+ xe_gt_log_driver_error(gt, INTEL_GT_DRIVER_ERROR_INTERRUPT,
+ "ERR_STAT_GT_REG_%s blank!\n", hw_err_str);
+ return;
+ }
+ }
+
+ switch (hw_err) {
+ case HARDWARE_ERROR_CORRECTABLE:
+ if (HAS_GT_ERROR_VECTORS(gt_to_xe(gt))) {
+ bool error = false;
+ int i;
+
+ errstat = 0;
+ for (i = 0; i < ERR_STAT_GT_COR_VCTR_LEN; i++) {
+ u32 err_type = ERR_STAT_GT_COR_VCTR_LEN;
+ unsigned long vctr;
+ const char *name;
+
+ vctr = xe_mmio_read32(gt, ERR_STAT_GT_COR_VCTR_REG(i).reg);
+ if (!vctr)
+ continue;
+
+ switch (i) {
+ case ERR_STAT_GT_VCTR0:
+ case ERR_STAT_GT_VCTR1:
+ err_type = INTEL_GT_HW_ERROR_COR_SUBSLICE;
+ gt->errors.hw[err_type] += hweight32(vctr);
+ name = "SUBSLICE";
+
+ /* Avoid second read/write to error status register*/
+ if (errstat)
+ break;
+
+ errstat = xe_mmio_read32(gt, ERR_STAT_GT_REG(hw_err).reg);
+ xe_gt_hw_err(gt, "ERR_STAT_GT_CORRECTABLE:0x%08lx\n",
+ errstat);
+ xe_gt_correctable_hw_error_stats_update(gt, errstat);
+ if (errstat)
+ xe_mmio_write32(gt, ERR_STAT_GT_REG(hw_err).reg,
+ errstat);
+ break;
+
+ case ERR_STAT_GT_VCTR2:
+ case ERR_STAT_GT_VCTR3:
+ err_type = INTEL_GT_HW_ERROR_COR_L3BANK;
+ gt->errors.hw[err_type] += hweight32(vctr);
+ name = "L3 BANK";
+ break;
+ default:
+ name = "UNKNOWN";
+ break;
+ }
+ xe_mmio_write32(gt, ERR_STAT_GT_COR_VCTR_REG(i).reg, vctr);
+ xe_gt_hw_err(gt, "%s CORRECTABLE error, ERR_VECT_GT_CORRECTABLE_%d:0x%08lx\n",
+ name, i, vctr);
+ error = true;
+ }
+
+ if (!error)
+ xe_gt_hw_err(gt, "UNKNOWN CORRECTABLE error\n");
+ } else {
+ xe_gt_correctable_hw_error_stats_update(gt, errstat);
+ xe_gt_hw_err(gt, "ERR_STAT_GT_CORRECTABLE:0x%08lx\n", errstat);
+ }
+ break;
+ case HARDWARE_ERROR_NONFATAL:
+ /*
+ * TODO: The GT Non Fatal Error Status Register
+ * only has reserved bitfields defined.
+ * Remove once there is something to service.
+ */
+ drm_err_ratelimited(&gt_to_xe(gt)->drm, HW_ERR "detected Non-Fatal error\n");
+ break;
+ case HARDWARE_ERROR_FATAL:
+ if (HAS_GT_ERROR_VECTORS(gt_to_xe(gt))) {
+ bool error = false;
+ int i;
+
+ errstat = 0;
+ for (i = 0; i < ERR_STAT_GT_FATAL_VCTR_LEN; i++) {
+ u32 err_type = ERR_STAT_GT_FATAL_VCTR_LEN;
+ unsigned long vctr;
+ const char *name;
+
+ vctr = xe_mmio_read32(gt, ERR_STAT_GT_FATAL_VCTR_REG(i).reg);
+ if (!vctr)
+ continue;
+
+ /* i represents the vector register index */
+ switch (i) {
+ case ERR_STAT_GT_VCTR0:
+ case ERR_STAT_GT_VCTR1:
+ err_type = INTEL_GT_HW_ERROR_FAT_SUBSLICE;
+ gt->errors.hw[err_type] += hweight32(vctr);
+ name = "SUBSLICE";
+
+ /*Avoid second read/write to error status register.*/
+ if (errstat)
+ break;
+
+ errstat = xe_mmio_read32(gt, ERR_STAT_GT_REG(hw_err).reg);
+ xe_gt_hw_err(gt, "ERR_STAT_GT_FATAL:0x%08lx\n", errstat);
+ xe_gt_fatal_hw_error_stats_update(gt, errstat);
+ if (errstat)
+ xe_mmio_write32(gt, ERR_STAT_GT_REG(hw_err).reg,
+ errstat);
+ break;
+
+ case ERR_STAT_GT_VCTR2:
+ case ERR_STAT_GT_VCTR3:
+ err_type = INTEL_GT_HW_ERROR_FAT_L3BANK;
+ gt->errors.hw[err_type] += hweight32(vctr);
+ name = "L3 BANK";
+ break;
+ case ERR_STAT_GT_VCTR6:
+ gt->errors.hw[INTEL_GT_HW_ERROR_FAT_TLB] += hweight16(vctr);
+ name = "TLB";
+ break;
+ case ERR_STAT_GT_VCTR7:
+ gt->errors.hw[INTEL_GT_HW_ERROR_FAT_L3_FABRIC] += hweight8(vctr);
+ name = "L3 FABRIC";
+ break;
+ default:
+ name = "UNKNOWN";
+ break;
+ }
+ xe_mmio_write32(gt, ERR_STAT_GT_FATAL_VCTR_REG(i).reg, vctr);
+ xe_gt_hw_err(gt, "%s FATAL error, ERR_VECT_GT_FATAL_%d:0x%08lx\n",
+ name, i, vctr);
+ error = true;
+ }
+ if (!error)
+ xe_gt_hw_err(gt, "UNKNOWN FATAL error\n");
+ } else {
+ xe_gt_fatal_hw_error_stats_update(gt, errstat);
+ xe_gt_hw_err(gt, "ERR_STAT_GT_FATAL:0x%08lx\n", errstat);
+ }
+ break;
+ default:
+ break;
+ }
+
+ if (!HAS_GT_ERROR_VECTORS(gt_to_xe(gt)))
+ xe_mmio_write32(gt, ERR_STAT_GT_REG(hw_err).reg, errstat);
}
static void
@@ -409,7 +647,8 @@ xe_hw_error_source_handler(struct xe_gt *gt, const enum hardware_error hw_err)
spin_lock_irqsave(&gt_to_xe(gt)->irq.lock, flags);
errsrc = xe_mmio_read32(gt, DEV_ERR_STAT_REG(hw_err).reg);
if (unlikely(!errsrc)) {
- DRM_ERROR("DEV_ERR_STAT_REG_%s blank!\n", hw_err_str);
+ xe_gt_log_driver_error(gt, INTEL_GT_DRIVER_ERROR_INTERRUPT,
+ "DEV_ERR_STAT_REG_%s blank!\n", hw_err_str);
goto out_unlock;
}
@@ -417,8 +656,9 @@ xe_hw_error_source_handler(struct xe_gt *gt, const enum hardware_error hw_err)
xe_gt_hw_error_handler(gt, hw_err);
if (errsrc & ~DEV_ERR_STAT_GT_ERROR)
- DRM_ERROR("non-GT hardware error(s) in DEV_ERR_STAT_REG_%s: 0x%08x\n",
- hw_err_str, errsrc & ~DEV_ERR_STAT_GT_ERROR);
+ xe_gt_log_driver_error(gt, INTEL_GT_DRIVER_ERROR_INTERRUPT,
+ "non-GT hardware error(s) in DEV_ERR_STAT_REG_%s: 0x%08x\n",
+ hw_err_str, errsrc & ~DEV_ERR_STAT_GT_ERROR);
xe_mmio_write32(gt, DEV_ERR_STAT_REG(hw_err).reg, errsrc);
@@ -634,12 +874,44 @@ static void irq_uninstall(struct drm_device *drm, void *arg)
pci_disable_msi(pdev);
}
+/**
+ * process_hw_errors - checks for the occurrence of HW errors
+ *
+ * This checks for the HW Errors including FATAL error that might
+ * have occurred in the previous boot of the driver which will
+ * initiate PCIe FLR reset of the device and cause the
+ * driver to reload.
+ */
+static void process_hw_errors(struct xe_device *xe)
+{
+ struct xe_gt *gt0 = xe_device_get_gt(xe, 0);
+ u32 dev_pcieerr_status, master_ctl;
+ struct xe_gt *gt;
+ int i;
+
+ dev_pcieerr_status = xe_mmio_read32(gt0, DEV_PCIEERR_STATUS.reg);
+
+ for_each_gt(gt, xe, i) {
+ if (dev_pcieerr_status & DEV_PCIEERR_IS_FATAL(i))
+ xe_hw_error_source_handler(gt, HARDWARE_ERROR_FATAL);
+
+ master_ctl = xe_mmio_read32(gt, GEN11_GFX_MSTR_IRQ.reg);
+ xe_mmio_write32(gt, GEN11_GFX_MSTR_IRQ.reg, master_ctl);
+ xe_hw_error_irq_handler(gt, master_ctl);
+ }
+ if (dev_pcieerr_status)
+ xe_mmio_write32(gt, DEV_PCIEERR_STATUS.reg, dev_pcieerr_status);
+}
+
int xe_irq_install(struct xe_device *xe)
{
int irq = to_pci_dev(xe->drm.dev)->irq;
irq_handler_t irq_handler;
int err;
+ if (IS_DGFX(xe))
+ process_hw_errors(xe);
+
irq_handler = xe_irq_handler(xe);
if (!irq_handler) {
drm_err(&xe->drm, "No supported interrupt handler");
diff --git a/drivers/gpu/drm/xe/xe_pci.c b/drivers/gpu/drm/xe/xe_pci.c
index 1844cff8fba8..69098194cef8 100644
--- a/drivers/gpu/drm/xe/xe_pci.c
+++ b/drivers/gpu/drm/xe/xe_pci.c
@@ -73,6 +73,7 @@ struct xe_device_desc {
bool has_range_tlb_invalidation;
bool has_asid;
bool has_link_copy_engine;
+ bool has_gt_error_vectors;
};
__diag_push();
@@ -232,6 +233,7 @@ static const struct xe_device_desc pvc_desc = {
.supports_usm = true,
.has_asid = true,
.has_link_copy_engine = true,
+ .has_gt_error_vectors = true,
};
#define MTL_MEDIA_ENGINES \
@@ -418,6 +420,7 @@ static int xe_pci_probe(struct pci_dev *pdev, const struct pci_device_id *ent)
xe->info.vm_max_level = desc->vm_max_level;
xe->info.supports_usm = desc->supports_usm;
xe->info.has_asid = desc->has_asid;
+ xe->info.has_gt_error_vectors = desc->has_gt_error_vectors;
xe->info.has_flat_ccs = desc->has_flat_ccs;
xe->info.has_4tile = desc->has_4tile;
xe->info.has_range_tlb_invalidation = desc->has_range_tlb_invalidation;
--
2.25.1
* Re: [Intel-xe] [PATCH 2/4] drm/xe/ras: Log the GT hw errors.
2023-04-06 9:26 ` [Intel-xe] [PATCH 2/4] drm/xe/ras: Log the GT hw errors Himal Prasad Ghimiray
@ 2023-04-24 13:37 ` Dafna Hirschfeld
2023-04-25 0:22 ` Matt Roper
2023-04-25 4:26 ` Iddamsetty, Aravind
2 siblings, 0 replies; 22+ messages in thread
From: Dafna Hirschfeld @ 2023-04-24 13:37 UTC (permalink / raw)
To: Himal Prasad Ghimiray; +Cc: intel-xe, tomer.tayar
On 06.04.2023 14:56, Himal Prasad Ghimiray wrote:
>From: Aravind Iddamsetty <aravind.iddamsetty@intel.com>
>
>Count the CORRECTABLE and FATAL GT hardware errors as
>signaled by relevant interrupt and respective registers.
>
>For non relevant interrupts count them as driver interrupt error.
>
>For platform supporting error vector registers count and report
>the respective vector errors.
>
>Co-authored-by: Himal Prasad Ghimiray <himal.prasad.ghimiray@intel.com>
>Signed-off-by: Aravind Iddamsetty <aravind.iddamsetty@intel.com>
>Signed-off-by: Himal Prasad Ghimiray <himal.prasad.ghimiray@intel.com>
>---
> drivers/gpu/drm/xe/regs/xe_regs.h | 77 ++++++-
> drivers/gpu/drm/xe/xe_device_types.h | 2 +
> drivers/gpu/drm/xe/xe_gt.c | 29 +++
> drivers/gpu/drm/xe/xe_gt_types.h | 43 ++++
> drivers/gpu/drm/xe/xe_irq.c | 332 ++++++++++++++++++++++++---
> drivers/gpu/drm/xe/xe_pci.c | 3 +
> 6 files changed, 453 insertions(+), 33 deletions(-)
>
>diff --git a/drivers/gpu/drm/xe/regs/xe_regs.h b/drivers/gpu/drm/xe/regs/xe_regs.h
>index dff74b093d4e..b3d35d0c5a77 100644
>--- a/drivers/gpu/drm/xe/regs/xe_regs.h
>+++ b/drivers/gpu/drm/xe/regs/xe_regs.h
>@@ -122,14 +122,50 @@ enum hardware_error {
> HARDWARE_ERROR_MAX,
> };
>
>+#define DEV_PCIEERR_STATUS _MMIO(0x100180)
>+#define DEV_PCIEERR_IS_FATAL(x) (REG_BIT(2) << (x * 4))
> #define _DEV_ERR_STAT_FATAL 0x100174
> #define _DEV_ERR_STAT_NONFATAL 0x100178
> #define _DEV_ERR_STAT_CORRECTABLE 0x10017c
> #define DEV_ERR_STAT_REG(x) _MMIO(_PICK_EVEN((x), \
> _DEV_ERR_STAT_CORRECTABLE, \
> _DEV_ERR_STAT_NONFATAL))
>+
> #define DEV_ERR_STAT_GT_ERROR REG_BIT(0)
>
>+enum gt_vctr_registers {
>+ ERR_STAT_GT_VCTR0 = 0,
>+ ERR_STAT_GT_VCTR1,
>+ ERR_STAT_GT_VCTR2,
>+ ERR_STAT_GT_VCTR3,
>+ ERR_STAT_GT_VCTR4,
>+ ERR_STAT_GT_VCTR5,
>+ ERR_STAT_GT_VCTR6,
>+ ERR_STAT_GT_VCTR7,
>+};
>+
>+#define ERR_STAT_GT_COR_VCTR_LEN (4)
>+#define _ERR_STAT_GT_COR_VCTR_0 0x1002a0
>+#define _ERR_STAT_GT_COR_VCTR_1 0x1002a4
>+#define _ERR_STAT_GT_COR_VCTR_2 0x1002a8
>+#define _ERR_STAT_GT_COR_VCTR_3 0x1002ac
>+#define ERR_STAT_GT_COR_VCTR_REG(x) _MMIO(_PICK_EVEN((x), \
>+ _ERR_STAT_GT_COR_VCTR_0, \
>+ _ERR_STAT_GT_COR_VCTR_1))
>+
>+#define ERR_STAT_GT_FATAL_VCTR_LEN (8)
>+#define _ERR_STAT_GT_FATAL_VCTR_0 0x100260
>+#define _ERR_STAT_GT_FATAL_VCTR_1 0x100264
>+#define _ERR_STAT_GT_FATAL_VCTR_2 0x100268
>+#define _ERR_STAT_GT_FATAL_VCTR_3 0x10026c
>+#define _ERR_STAT_GT_FATAL_VCTR_4 0x100270
>+#define _ERR_STAT_GT_FATAL_VCTR_5 0x100274
>+#define _ERR_STAT_GT_FATAL_VCTR_6 0x100278
>+#define _ERR_STAT_GT_FATAL_VCTR_7 0x10027c
>+#define ERR_STAT_GT_FATAL_VCTR_REG(x) _MMIO(_PICK_EVEN((x), \
>+ _ERR_STAT_GT_FATAL_VCTR_0, \
>+ _ERR_STAT_GT_FATAL_VCTR_1))
>+
> #define _ERR_STAT_GT_COR 0x100160
> #define _ERR_STAT_GT_NONFATAL 0x100164
> #define _ERR_STAT_GT_FATAL 0x100168
>@@ -137,7 +173,42 @@ enum hardware_error {
> _ERR_STAT_GT_COR, \
> _ERR_STAT_GT_NONFATAL))
>
>-#define EU_GRF_ERROR REG_BIT(15)
>-#define EU_IC_ERROR REG_BIT(14)
>-
>+#define EU_GRF_COR_ERR (15)
>+#define EU_IC_COR_ERR (14)
>+#define SLM_COR_ERR (13)
>+#define SAMPLER_COR_ERR (12)
>+#define GUC_COR_ERR (1)
>+#define L3_SNG_COR_ERR (0)
>+
>+#define PVC_COR_ERR_MASK \
>+ (REG_BIT(GUC_COR_ERR) | \
>+ REG_BIT(SLM_COR_ERR) | \
>+ REG_BIT(EU_IC_COR_ERR) | \
>+ REG_BIT(EU_GRF_COR_ERR))
>+
>+#define EU_GRF_FAT_ERR (15)
>+#define EU_IC_FAT_ERR (14)
>+#define SLM_FAT_ERR (13)
>+#define SAMPLER_FAT_ERR (12)
>+#define SQIDI_FAT_ERR (9)
>+#define IDI_PAR_FAT_ERR (8)
>+#define GUC_FAT_ERR (6)
>+#define L3_ECC_CHK_FAT_ERR (5)
>+#define L3_DOUBLE_FAT_ERR (4)
>+#define FPU_UNCORR_FAT_ERR (3)
>+#define ARRAY_BIST_FAT_ERR (1)
>+
>+#define PVC_FAT_ERR_MASK \
>+ (REG_BIT(FPU_UNCORR_FAT_ERR) | \
>+ REG_BIT(GUC_FAT_ERR) | \
>+ REG_BIT(SLM_FAT_ERR) | \
>+ REG_BIT(EU_GRF_FAT_ERR))
>+
>+#define GT_HW_ERROR_MAX_ERR_BITS 16
>+
>+#define _SLM_ECC_ERROR_CNT 0xe7f4
>+#define _SLM_UNCORR_ECC_ERROR_CNT 0xe7c0
>+#define SLM_ECC_ERROR_CNTR(x) _MMIO((x) == HARDWARE_ERROR_CORRECTABLE ? \
>+ _SLM_ECC_ERROR_CNT : \
>+ _SLM_UNCORR_ECC_ERROR_CNT)
> #endif
>diff --git a/drivers/gpu/drm/xe/xe_device_types.h b/drivers/gpu/drm/xe/xe_device_types.h
>index 88f863edc41c..ecabf4d6690d 100644
>--- a/drivers/gpu/drm/xe/xe_device_types.h
>+++ b/drivers/gpu/drm/xe/xe_device_types.h
>@@ -99,6 +99,8 @@ struct xe_device {
> bool has_link_copy_engine;
> /** @enable_display: display enabled */
> bool enable_display;
>+ /** @has_gt_error_vectors: whether platform supports ERROR VECTORS */
>+ bool has_gt_error_vectors;
>
> #if IS_ENABLED(CONFIG_DRM_XE_DISPLAY)
> struct xe_device_display_info {
>diff --git a/drivers/gpu/drm/xe/xe_gt.c b/drivers/gpu/drm/xe/xe_gt.c
>index bc821f431c45..ce9ce2748394 100644
>--- a/drivers/gpu/drm/xe/xe_gt.c
>+++ b/drivers/gpu/drm/xe/xe_gt.c
>@@ -44,6 +44,35 @@
> #include "xe_wa.h"
> #include "xe_wopcm.h"
>
>+static const char * const xe_gt_driver_errors_to_str[] = {
>+ [INTEL_GT_DRIVER_ERROR_INTERRUPT] = "INTERRUPT",
>+};
>+
>+void xe_gt_log_driver_error(struct xe_gt *gt,
>+ const enum xe_gt_driver_errors error,
>+ const char *fmt, ...)
>+{
>+ struct va_format vaf;
>+ va_list args;
>+
>+ va_start(args, fmt);
>+ vaf.fmt = fmt;
>+ vaf.va = &args;
>+
>+ BUILD_BUG_ON(ARRAY_SIZE(xe_gt_driver_errors_to_str) !=
>+ INTEL_GT_DRIVER_ERROR_COUNT);
>+
>+ WARN_ON_ONCE(error >= INTEL_GT_DRIVER_ERROR_COUNT);
>+
>+ gt->errors.driver[error]++;
>+
>+ drm_err_ratelimited(&gt_to_xe(gt)->drm, "GT%u [%s] %pV",
>+ gt->info.id,
>+ xe_gt_driver_errors_to_str[error],
>+ &vaf);
>+ va_end(args);
>+}
>+
> struct xe_gt *xe_find_full_gt(struct xe_gt *gt)
> {
> struct xe_gt *search;
>diff --git a/drivers/gpu/drm/xe/xe_gt_types.h b/drivers/gpu/drm/xe/xe_gt_types.h
>index 8f29aba455e0..9580a40c0142 100644
>--- a/drivers/gpu/drm/xe/xe_gt_types.h
>+++ b/drivers/gpu/drm/xe/xe_gt_types.h
>@@ -33,6 +33,43 @@ enum xe_gt_type {
> typedef unsigned long xe_dss_mask_t[BITS_TO_LONGS(32 * XE_MAX_DSS_FUSE_REGS)];
> typedef unsigned long xe_eu_mask_t[BITS_TO_LONGS(32 * XE_MAX_EU_FUSE_REGS)];
>
>+/* Count of GT Correctable and FATAL HW ERRORS */
>+enum intel_gt_hw_errors {
>+ INTEL_GT_HW_ERROR_COR_SUBSLICE = 0,
>+ INTEL_GT_HW_ERROR_COR_L3BANK,
>+ INTEL_GT_HW_ERROR_COR_L3_SNG,
>+ INTEL_GT_HW_ERROR_COR_GUC,
>+ INTEL_GT_HW_ERROR_COR_SAMPLER,
>+ INTEL_GT_HW_ERROR_COR_SLM,
>+ INTEL_GT_HW_ERROR_COR_EU_IC,
>+ INTEL_GT_HW_ERROR_COR_EU_GRF,
>+ INTEL_GT_HW_ERROR_FAT_SUBSLICE,
>+ INTEL_GT_HW_ERROR_FAT_L3BANK,
>+ INTEL_GT_HW_ERROR_FAT_ARR_BIST,
>+ INTEL_GT_HW_ERROR_FAT_FPU,
>+ INTEL_GT_HW_ERROR_FAT_L3_DOUB,
>+ INTEL_GT_HW_ERROR_FAT_L3_ECC_CHK,
>+ INTEL_GT_HW_ERROR_FAT_GUC,
>+ INTEL_GT_HW_ERROR_FAT_IDI_PAR,
>+ INTEL_GT_HW_ERROR_FAT_SQIDI,
>+ INTEL_GT_HW_ERROR_FAT_SAMPLER,
>+ INTEL_GT_HW_ERROR_FAT_SLM,
>+ INTEL_GT_HW_ERROR_FAT_EU_IC,
>+ INTEL_GT_HW_ERROR_FAT_EU_GRF,
>+ INTEL_GT_HW_ERROR_FAT_TLB,
>+ INTEL_GT_HW_ERROR_FAT_L3_FABRIC,
>+ INTEL_GT_HW_ERROR_COUNT
>+};
>+
>+enum xe_gt_driver_errors {
>+ INTEL_GT_DRIVER_ERROR_INTERRUPT = 0,
>+ INTEL_GT_DRIVER_ERROR_COUNT
>+};
>+
>+void xe_gt_log_driver_error(struct xe_gt *gt,
>+ const enum xe_gt_driver_errors error,
>+ const char *fmt, ...);
>+
> struct xe_mmio_range {
> u32 start;
> u32 end;
>@@ -357,6 +394,12 @@ struct xe_gt {
> * of a steered operation
> */
> spinlock_t mcr_lock;
>+
>+ struct intel_hw_errors {
>+ unsigned long hw[INTEL_GT_HW_ERROR_COUNT];
>+ unsigned long driver[INTEL_GT_DRIVER_ERROR_COUNT];
>+ } errors;
>+
> };
>
> #endif
>diff --git a/drivers/gpu/drm/xe/xe_irq.c b/drivers/gpu/drm/xe/xe_irq.c
>index 6b922332bff1..4626f7280aaf 100644
>--- a/drivers/gpu/drm/xe/xe_irq.c
>+++ b/drivers/gpu/drm/xe/xe_irq.c
>@@ -19,6 +19,7 @@
> #include "xe_hw_engine.h"
> #include "xe_mmio.h"
>
>+#define HAS_GT_ERROR_VECTORS(xe) ((xe)->info.has_gt_error_vectors)
> static void gen3_assert_iir_is_zero(struct xe_gt *gt, i915_reg_t reg)
> {
> u32 val = xe_mmio_read32(gt, reg.reg);
>@@ -359,44 +360,281 @@ hardware_error_type_to_str(const enum hardware_error hw_err)
> }
> }
>
>+#define xe_gt_hw_err(gt, fmt, ...) \
>+ drm_err_ratelimited(&gt_to_xe(gt)->drm, HW_ERR "GT%d detected " fmt, \
>+ (gt)->info.id, ##__VA_ARGS__)
>+
> static void
>-xe_gt_hw_error_handler(struct xe_gt *gt, const enum hardware_error hw_err)
>+xe_gt_correctable_hw_error_stats_update(struct xe_gt *gt, unsigned long errstat)
> {
>- const char *hw_err_str = hardware_error_type_to_str(hw_err);
>- u32 other_errors = ~(EU_GRF_ERROR | EU_IC_ERROR);
>- u32 errstat;
>+ u32 errbit, cnt;
>
>- lockdep_assert_held(&gt_to_xe(gt)->irq.lock);
>+ if (!errstat && HAS_GT_ERROR_VECTORS(gt_to_xe(gt)))
>+ return;
>
>- errstat = xe_mmio_read32(gt, ERR_STAT_GT_REG(hw_err).reg);
>+ for_each_set_bit(errbit, &errstat, GT_HW_ERROR_MAX_ERR_BITS) {
>+ if (gt->xe->info.platform == XE_PVC && !(REG_BIT(errbit) & PVC_COR_ERR_MASK)) {
>+ xe_gt_log_driver_error(gt, INTEL_GT_DRIVER_ERROR_INTERRUPT,
>+ "UNKNOWN CORRECTABLE error\n");
>+ continue;
>+ }
>
>- if (unlikely(!errstat)) {
>- DRM_ERROR("ERR_STAT_GT_REG_%s blank!\n", hw_err_str);
>- return;
>+ switch (errbit) {
This needs a tab here.
>+ case L3_SNG_COR_ERR:
>+ gt->errors.hw[INTEL_GT_HW_ERROR_COR_L3_SNG]++;
>+ xe_gt_hw_err(gt, "L3 SINGLE CORRECTABLE error\n");
>+ break;
>+ case GUC_COR_ERR:
>+ gt->errors.hw[INTEL_GT_HW_ERROR_COR_GUC]++;
>+ xe_gt_hw_err(gt, "SINGLE BIT GUC SRAM CORRECTABLE error\n");
>+ break;
>+ case SAMPLER_COR_ERR:
>+ gt->errors.hw[INTEL_GT_HW_ERROR_COR_SAMPLER]++;
>+ xe_gt_hw_err(gt, "SINGLE BIT SAMPLER CORRECTABLE error\n");
>+ break;
>+ case SLM_COR_ERR:
>+ cnt = xe_mmio_read32(gt, SLM_ECC_ERROR_CNTR(HARDWARE_ERROR_CORRECTABLE).reg);
>+ gt->errors.hw[INTEL_GT_HW_ERROR_COR_SLM] = cnt;
>+ xe_gt_hw_err(gt, "%u SINGLE BIT SLM CORRECTABLE error\n", cnt);
>+ break;
>+ case EU_IC_COR_ERR:
>+ gt->errors.hw[INTEL_GT_HW_ERROR_COR_EU_IC]++;
>+ xe_gt_hw_err(gt, "SINGLE BIT EU IC CORRECTABLE error\n");
>+ break;
>+ case EU_GRF_COR_ERR:
>+ gt->errors.hw[INTEL_GT_HW_ERROR_COR_EU_GRF]++;
>+ xe_gt_hw_err(gt, "SINGLE BIT EU GRF CORRECTABLE error\n");
>+ break;
>+ default:
>+ xe_gt_log_driver_error(gt, INTEL_GT_DRIVER_ERROR_INTERRUPT, "UNKNOWN CORRECTABLE error\n");
>+ break;
>+ }
> }
>+}
>
>- /*
>- * TODO: The GT Non Fatal Error Status Register
>- * only has reserved bitfields defined.
>- * Remove once there is something to service.
>- */
>- if (hw_err == HARDWARE_ERROR_NONFATAL) {
>- DRM_ERROR("detected Non-Fatal error\n");
>- xe_mmio_write32(gt, ERR_STAT_GT_REG(hw_err).reg, errstat);
>+static void xe_gt_fatal_hw_error_stats_update(struct xe_gt *gt, unsigned long errstat)
>+{
>+ u32 errbit, cnt;
>+
>+ if (!errstat && HAS_GT_ERROR_VECTORS(gt_to_xe(gt)))
> return;
>+
>+ for_each_set_bit(errbit, &errstat, GT_HW_ERROR_MAX_ERR_BITS) {
>+ if (gt->xe->info.platform == XE_PVC && !(REG_BIT(errbit) & PVC_FAT_ERR_MASK)) {
>+ xe_gt_log_driver_error(gt, INTEL_GT_DRIVER_ERROR_INTERRUPT,
>+ "UNKNOWN FATAL error\n");
>+ continue;
>+ }
>+
>+ switch (errbit) {
Ditto.
Thanks,
Dafna
>+ case ARRAY_BIST_FAT_ERR:
>+ gt->errors.hw[INTEL_GT_HW_ERROR_FAT_ARR_BIST]++;
>+ xe_gt_hw_err(gt, "Array BIST FATAL error\n");
>+ break;
>+ case FPU_UNCORR_FAT_ERR:
>+ gt->errors.hw[INTEL_GT_HW_ERROR_FAT_FPU]++;
>+ xe_gt_hw_err(gt, "FPU FATAL error\n");
>+ break;
>+ case L3_DOUBLE_FAT_ERR:
>+ gt->errors.hw[INTEL_GT_HW_ERROR_FAT_L3_DOUB]++;
>+ xe_gt_hw_err(gt, "L3 Double FATAL error\n");
>+ break;
>+ case L3_ECC_CHK_FAT_ERR:
>+ gt->errors.hw[INTEL_GT_HW_ERROR_FAT_L3_ECC_CHK]++;
>+ xe_gt_hw_err(gt, "L3 ECC Checker FATAL error\n");
>+ break;
>+ case GUC_FAT_ERR:
>+ gt->errors.hw[INTEL_GT_HW_ERROR_FAT_GUC]++;
>+ xe_gt_hw_err(gt, "GUC SRAM FATAL error\n");
>+ break;
>+ case IDI_PAR_FAT_ERR:
>+ gt->errors.hw[INTEL_GT_HW_ERROR_FAT_IDI_PAR]++;
>+ xe_gt_hw_err(gt, "IDI PARITY FATAL error\n");
>+ break;
>+ case SQIDI_FAT_ERR:
>+ gt->errors.hw[INTEL_GT_HW_ERROR_FAT_SQIDI]++;
>+ xe_gt_hw_err(gt, "SQIDI FATAL error\n");
>+ break;
>+ case SAMPLER_FAT_ERR:
>+ gt->errors.hw[INTEL_GT_HW_ERROR_FAT_SAMPLER]++;
>+ xe_gt_hw_err(gt, "SAMPLER FATAL error\n");
>+ break;
>+ case SLM_FAT_ERR:
>+ cnt = xe_mmio_read32(gt, SLM_ECC_ERROR_CNTR(HARDWARE_ERROR_FATAL).reg);
>+ gt->errors.hw[INTEL_GT_HW_ERROR_FAT_SLM] = cnt;
>+ xe_gt_hw_err(gt, "%u SLM FATAL error\n", cnt);
>+ break;
>+ case EU_IC_FAT_ERR:
>+ gt->errors.hw[INTEL_GT_HW_ERROR_FAT_EU_IC]++;
>+ xe_gt_hw_err(gt, "EU IC FATAL error\n");
>+ break;
>+ case EU_GRF_FAT_ERR:
>+ gt->errors.hw[INTEL_GT_HW_ERROR_FAT_EU_GRF]++;
>+ xe_gt_hw_err(gt, "EU GRF FATAL error\n");
>+ break;
>+ default:
>+ xe_gt_log_driver_error(gt, INTEL_GT_DRIVER_ERROR_INTERRUPT,
>+ "UNKNOWN FATAL error\n");
>+ break;
>+ }
> }
>+}
>
>- /*
>- * TODO: The remaining GT errors don't have a
>- * need for targeted logging at the moment. We
>- * still want to log detection of these errors, but
>- * let's aggregate them until someone has a need for them.
>- */
>- if (errstat & other_errors)
>- DRM_ERROR("detected hardware error(s) in ERR_STAT_GT_REG_%s: 0x%08x\n",
>- hw_err_str, errstat & other_errors);
>+static void
>+xe_gt_hw_error_handler(struct xe_gt *gt, const enum hardware_error hw_err)
>+{
>+ const char *hw_err_str = hardware_error_type_to_str(hw_err);
>+ unsigned long errstat;
>+
>+ lockdep_assert_held(&gt_to_xe(gt)->irq.lock);
>
>- xe_mmio_write32(gt, ERR_STAT_GT_REG(hw_err).reg, errstat);
>+ if (!HAS_GT_ERROR_VECTORS(gt_to_xe(gt))) {
>+ errstat = xe_mmio_read32(gt, ERR_STAT_GT_REG(hw_err).reg);
>+ if (unlikely(!errstat)) {
>+ xe_gt_log_driver_error(gt, INTEL_GT_DRIVER_ERROR_INTERRUPT,
>+ "ERR_STAT_GT_REG_%s blank!\n", hw_err_str);
>+ return;
>+ }
>+ }
>+
>+ switch (hw_err) {
>+ case HARDWARE_ERROR_CORRECTABLE:
>+ if (HAS_GT_ERROR_VECTORS(gt_to_xe(gt))) {
>+ bool error = false;
>+ int i;
>+
>+ errstat = 0;
>+ for (i = 0; i < ERR_STAT_GT_COR_VCTR_LEN; i++) {
>+ u32 err_type = ERR_STAT_GT_COR_VCTR_LEN;
>+ unsigned long vctr;
>+ const char *name;
>+
>+ vctr = xe_mmio_read32(gt, ERR_STAT_GT_COR_VCTR_REG(i).reg);
>+ if (!vctr)
>+ continue;
>+
>+ switch (i) {
>+ case ERR_STAT_GT_VCTR0:
>+ case ERR_STAT_GT_VCTR1:
>+ err_type = INTEL_GT_HW_ERROR_COR_SUBSLICE;
>+ gt->errors.hw[err_type] += hweight32(vctr);
>+ name = "SUBSLICE";
>+
>+ /* Avoid second read/write to error status register*/
>+ if (errstat)
>+ break;
>+
>+ errstat = xe_mmio_read32(gt, ERR_STAT_GT_REG(hw_err).reg);
>+ xe_gt_hw_err(gt, "ERR_STAT_GT_CORRECTABLE:0x%08lx\n",
>+ errstat);
>+ xe_gt_correctable_hw_error_stats_update(gt, errstat);
>+ if (errstat)
>+ xe_mmio_write32(gt, ERR_STAT_GT_REG(hw_err).reg,
>+ errstat);
>+ break;
>+
>+ case ERR_STAT_GT_VCTR2:
>+ case ERR_STAT_GT_VCTR3:
>+ err_type = INTEL_GT_HW_ERROR_COR_L3BANK;
>+ gt->errors.hw[err_type] += hweight32(vctr);
>+ name = "L3 BANK";
>+ break;
>+ default:
>+ name = "UNKNOWN";
>+ break;
>+ }
>+ xe_mmio_write32(gt, ERR_STAT_GT_COR_VCTR_REG(i).reg, vctr);
>+ xe_gt_hw_err(gt, "%s CORRECTABLE error, ERR_VECT_GT_CORRECTABLE_%d:0x%08lx\n",
>+ name, i, vctr);
>+ error = true;
>+ }
>+
>+ if (!error)
>+ xe_gt_hw_err(gt, "UNKNOWN CORRECTABLE error\n");
>+ } else {
>+ xe_gt_correctable_hw_error_stats_update(gt, errstat);
>+ xe_gt_hw_err(gt, "ERR_STAT_GT_CORRECTABLE:0x%08lx\n", errstat);
>+ }
>+ break;
>+ case HARDWARE_ERROR_NONFATAL:
>+ /*
>+ * TODO: The GT Non Fatal Error Status Register
>+ * only has reserved bitfields defined.
>+ * Remove once there is something to service.
>+ */
>+ drm_err_ratelimited(&gt_to_xe(gt)->drm, HW_ERR "detected Non-Fatal error\n");
>+ break;
>+ case HARDWARE_ERROR_FATAL:
>+ if (HAS_GT_ERROR_VECTORS(gt_to_xe(gt))) {
>+ bool error = false;
>+ int i;
>+
>+ errstat = 0;
>+ for (i = 0; i < ERR_STAT_GT_FATAL_VCTR_LEN; i++) {
>+ u32 err_type = ERR_STAT_GT_FATAL_VCTR_LEN;
>+ unsigned long vctr;
>+ const char *name;
>+
>+ vctr = xe_mmio_read32(gt, ERR_STAT_GT_FATAL_VCTR_REG(i).reg);
>+ if (!vctr)
>+ continue;
>+
>+ /* i represents the vector register index */
>+ switch (i) {
>+ case ERR_STAT_GT_VCTR0:
>+ case ERR_STAT_GT_VCTR1:
>+ err_type = INTEL_GT_HW_ERROR_FAT_SUBSLICE;
>+ gt->errors.hw[err_type] += hweight32(vctr);
>+ name = "SUBSLICE";
>+
>+ /*Avoid second read/write to error status register.*/
>+ if (errstat)
>+ break;
>+
>+ errstat = xe_mmio_read32(gt, ERR_STAT_GT_REG(hw_err).reg);
>+ xe_gt_hw_err(gt, "ERR_STAT_GT_FATAL:0x%08lx\n", errstat);
>+ xe_gt_fatal_hw_error_stats_update(gt, errstat);
>+ if (errstat)
>+ xe_mmio_write32(gt, ERR_STAT_GT_REG(hw_err).reg,
>+ errstat);
>+ break;
>+
>+ case ERR_STAT_GT_VCTR2:
>+ case ERR_STAT_GT_VCTR3:
>+ err_type = INTEL_GT_HW_ERROR_FAT_L3BANK;
>+ gt->errors.hw[err_type] += hweight32(vctr);
>+ name = "L3 BANK";
>+ break;
>+ case ERR_STAT_GT_VCTR6:
>+ gt->errors.hw[INTEL_GT_HW_ERROR_FAT_TLB] += hweight16(vctr);
>+ name = "TLB";
>+ break;
>+ case ERR_STAT_GT_VCTR7:
>+ gt->errors.hw[INTEL_GT_HW_ERROR_FAT_L3_FABRIC] += hweight8(vctr);
>+ name = "L3 FABRIC";
>+ break;
>+ default:
>+ name = "UNKNOWN";
>+ break;
>+ }
>+ xe_mmio_write32(gt, ERR_STAT_GT_FATAL_VCTR_REG(i).reg, vctr);
>+ xe_gt_hw_err(gt, "%s FATAL error, ERR_VECT_GT_FATAL_%d:0x%08lx\n",
>+ name, i, vctr);
>+ error = true;
>+ }
>+ if (!error)
>+ xe_gt_hw_err(gt, "UNKNOWN FATAL error\n");
>+ } else {
>+ xe_gt_fatal_hw_error_stats_update(gt, errstat);
>+ xe_gt_hw_err(gt, "ERR_STAT_GT_FATAL:0x%08lx\n", errstat);
>+ }
>+ break;
>+ default:
>+ break;
>+ }
>+
>+ if (!HAS_GT_ERROR_VECTORS(gt_to_xe(gt)))
>+ xe_mmio_write32(gt, ERR_STAT_GT_REG(hw_err).reg, errstat);
> }
>
> static void
>@@ -409,7 +647,8 @@ xe_hw_error_source_handler(struct xe_gt *gt, const enum hardware_error hw_err)
> spin_lock_irqsave(&gt_to_xe(gt)->irq.lock, flags);
> errsrc = xe_mmio_read32(gt, DEV_ERR_STAT_REG(hw_err).reg);
> if (unlikely(!errsrc)) {
>- DRM_ERROR("DEV_ERR_STAT_REG_%s blank!\n", hw_err_str);
>+ xe_gt_log_driver_error(gt, INTEL_GT_DRIVER_ERROR_INTERRUPT,
>+ "DEV_ERR_STAT_REG_%s blank!\n", hw_err_str);
> goto out_unlock;
> }
>
>@@ -417,8 +656,9 @@ xe_hw_error_source_handler(struct xe_gt *gt, const enum hardware_error hw_err)
> xe_gt_hw_error_handler(gt, hw_err);
>
> if (errsrc & ~DEV_ERR_STAT_GT_ERROR)
>- DRM_ERROR("non-GT hardware error(s) in DEV_ERR_STAT_REG_%s: 0x%08x\n",
>- hw_err_str, errsrc & ~DEV_ERR_STAT_GT_ERROR);
>+ xe_gt_log_driver_error(gt, INTEL_GT_DRIVER_ERROR_INTERRUPT,
>+ "non-GT hardware error(s) in DEV_ERR_STAT_REG_%s: 0x%08x\n",
>+ hw_err_str, errsrc & ~DEV_ERR_STAT_GT_ERROR);
>
> xe_mmio_write32(gt, DEV_ERR_STAT_REG(hw_err).reg, errsrc);
>
>@@ -634,12 +874,44 @@ static void irq_uninstall(struct drm_device *drm, void *arg)
> pci_disable_msi(pdev);
> }
>
>+/**
>+ * process_hw_errors - checks for the occurrence of HW errors
>+ *
>+ * This checks for the HW Errors including FATAL error that might
>+ * have occurred in the previous boot of the driver which will
>+ * initiate PCIe FLR reset of the device and cause the
>+ * driver to reload.
>+ */
>+static void process_hw_errors(struct xe_device *xe)
>+{
>+ struct xe_gt *gt0 = xe_device_get_gt(xe, 0);
>+ u32 dev_pcieerr_status, master_ctl;
>+ struct xe_gt *gt;
>+ int i;
>+
>+ dev_pcieerr_status = xe_mmio_read32(gt0, DEV_PCIEERR_STATUS.reg);
>+
>+ for_each_gt(gt, xe, i) {
>+ if (dev_pcieerr_status & DEV_PCIEERR_IS_FATAL(i))
>+ xe_hw_error_source_handler(gt, HARDWARE_ERROR_FATAL);
>+
>+ master_ctl = xe_mmio_read32(gt, GEN11_GFX_MSTR_IRQ.reg);
>+ xe_mmio_write32(gt, GEN11_GFX_MSTR_IRQ.reg, master_ctl);
>+ xe_hw_error_irq_handler(gt, master_ctl);
>+ }
>+ if (dev_pcieerr_status)
>+ xe_mmio_write32(gt, DEV_PCIEERR_STATUS.reg, dev_pcieerr_status);
>+}
>+
> int xe_irq_install(struct xe_device *xe)
> {
> int irq = to_pci_dev(xe->drm.dev)->irq;
> irq_handler_t irq_handler;
> int err;
>
>+ if (IS_DGFX(xe))
>+ process_hw_errors(xe);
>+
> irq_handler = xe_irq_handler(xe);
> if (!irq_handler) {
> drm_err(&xe->drm, "No supported interrupt handler");
>diff --git a/drivers/gpu/drm/xe/xe_pci.c b/drivers/gpu/drm/xe/xe_pci.c
>index 1844cff8fba8..69098194cef8 100644
>--- a/drivers/gpu/drm/xe/xe_pci.c
>+++ b/drivers/gpu/drm/xe/xe_pci.c
>@@ -73,6 +73,7 @@ struct xe_device_desc {
> bool has_range_tlb_invalidation;
> bool has_asid;
> bool has_link_copy_engine;
>+ bool has_gt_error_vectors;
> };
>
> __diag_push();
>@@ -232,6 +233,7 @@ static const struct xe_device_desc pvc_desc = {
> .supports_usm = true,
> .has_asid = true,
> .has_link_copy_engine = true,
>+ .has_gt_error_vectors = true,
> };
>
> #define MTL_MEDIA_ENGINES \
>@@ -418,6 +420,7 @@ static int xe_pci_probe(struct pci_dev *pdev, const struct pci_device_id *ent)
> xe->info.vm_max_level = desc->vm_max_level;
> xe->info.supports_usm = desc->supports_usm;
> xe->info.has_asid = desc->has_asid;
>+ xe->info.has_gt_error_vectors = desc->has_gt_error_vectors;
> xe->info.has_flat_ccs = desc->has_flat_ccs;
> xe->info.has_4tile = desc->has_4tile;
> xe->info.has_range_tlb_invalidation = desc->has_range_tlb_invalidation;
>--
>2.25.1
>
* Re: [Intel-xe] [PATCH 2/4] drm/xe/ras: Log the GT hw errors.
2023-04-06 9:26 ` [Intel-xe] [PATCH 2/4] drm/xe/ras: Log the GT hw errors Himal Prasad Ghimiray
2023-04-24 13:37 ` Dafna Hirschfeld
@ 2023-04-25 0:22 ` Matt Roper
2023-04-28 8:00 ` Ghimiray, Himal Prasad
2023-04-25 4:26 ` Iddamsetty, Aravind
2 siblings, 1 reply; 22+ messages in thread
From: Matt Roper @ 2023-04-25 0:22 UTC (permalink / raw)
To: Himal Prasad Ghimiray; +Cc: intel-xe
On Thu, Apr 06, 2023 at 02:56:29PM +0530, Himal Prasad Ghimiray wrote:
> From: Aravind Iddamsetty <aravind.iddamsetty@intel.com>
>
> Count the CORRECTABLE and FATAL GT hardware errors as
> signaled by relevant interrupt and respective registers.
>
> For non relevant interrupts count them as driver interrupt error.
>
> For platform supporting error vector registers count and report
> the respective vector errors.
>
> Co-authored-by: Himal Prasad Ghimiray <himal.prasad.ghimiray@intel.com>
> Signed-off-by: Aravind Iddamsetty <aravind.iddamsetty@intel.com>
> Signed-off-by: Himal Prasad Ghimiray <himal.prasad.ghimiray@intel.com>
> ---
> drivers/gpu/drm/xe/regs/xe_regs.h | 77 ++++++-
> drivers/gpu/drm/xe/xe_device_types.h | 2 +
> drivers/gpu/drm/xe/xe_gt.c | 29 +++
> drivers/gpu/drm/xe/xe_gt_types.h | 43 ++++
> drivers/gpu/drm/xe/xe_irq.c | 332 ++++++++++++++++++++++++---
> drivers/gpu/drm/xe/xe_pci.c | 3 +
> 6 files changed, 453 insertions(+), 33 deletions(-)
>
> diff --git a/drivers/gpu/drm/xe/regs/xe_regs.h b/drivers/gpu/drm/xe/regs/xe_regs.h
> index dff74b093d4e..b3d35d0c5a77 100644
> --- a/drivers/gpu/drm/xe/regs/xe_regs.h
> +++ b/drivers/gpu/drm/xe/regs/xe_regs.h
> @@ -122,14 +122,50 @@ enum hardware_error {
> HARDWARE_ERROR_MAX,
> };
>
> +#define DEV_PCIEERR_STATUS _MMIO(0x100180)
> +#define DEV_PCIEERR_IS_FATAL(x) (REG_BIT(2) << (x * 4))
Just move the math inside the REG_BIT parameter. I.e.
#define DEV_PCIEERR_IS_FATAL(x) REG_BIT(x * 4 + 2)
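A quick user-space check that the two forms select the same bit, as a sketch only. REG_BIT here is a simplified stand-in for the kernel macro, and DEV_PCIEERR_IS_FATAL_SHIFTED and pcieerr_forms_agree() are hypothetical names used just for the comparison. The macro argument is parenthesized as ordinary macro hygiene.

```c
#include <assert.h>
#include <stdint.h>

/* Simplified stand-in for the kernel's REG_BIT(); an assumption for this sketch. */
#define REG_BIT(n) ((uint32_t)1 << (n))

/* Form used in the patch: shift the bit-2 mask by four bits per GT. */
#define DEV_PCIEERR_IS_FATAL_SHIFTED(x) (REG_BIT(2) << ((x) * 4))

/* Suggested form: compute the bit position inside REG_BIT(). */
#define DEV_PCIEERR_IS_FATAL(x) REG_BIT((x) * 4 + 2)

/* Returns 1 when both forms agree for every GT index below num_gt. */
static int pcieerr_forms_agree(unsigned int num_gt)
{
	for (unsigned int gt = 0; gt < num_gt; gt++)
		if (DEV_PCIEERR_IS_FATAL(gt) != DEV_PCIEERR_IS_FATAL_SHIFTED(gt))
			return 0;
	return 1;
}
```

With four bits per GT, GT0 maps to bit 2 and GT1 to bit 6 in either form.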
> #define _DEV_ERR_STAT_FATAL 0x100174
> #define _DEV_ERR_STAT_NONFATAL 0x100178
> #define _DEV_ERR_STAT_CORRECTABLE 0x10017c
> #define DEV_ERR_STAT_REG(x) _MMIO(_PICK_EVEN((x), \
> _DEV_ERR_STAT_CORRECTABLE, \
> _DEV_ERR_STAT_NONFATAL))
> +
> #define DEV_ERR_STAT_GT_ERROR REG_BIT(0)
>
> +enum gt_vctr_registers {
> + ERR_STAT_GT_VCTR0 = 0,
> + ERR_STAT_GT_VCTR1,
> + ERR_STAT_GT_VCTR2,
> + ERR_STAT_GT_VCTR3,
> + ERR_STAT_GT_VCTR4,
> + ERR_STAT_GT_VCTR5,
> + ERR_STAT_GT_VCTR6,
> + ERR_STAT_GT_VCTR7,
> +};
> +
> +#define ERR_STAT_GT_COR_VCTR_LEN (4)
> +#define _ERR_STAT_GT_COR_VCTR_0 0x1002a0
> +#define _ERR_STAT_GT_COR_VCTR_1 0x1002a4
> +#define _ERR_STAT_GT_COR_VCTR_2 0x1002a8
> +#define _ERR_STAT_GT_COR_VCTR_3 0x1002ac
> +#define ERR_STAT_GT_COR_VCTR_REG(x) _MMIO(_PICK_EVEN((x), \
> + _ERR_STAT_GT_COR_VCTR_0, \
> + _ERR_STAT_GT_COR_VCTR_1))
> +
> +#define ERR_STAT_GT_FATAL_VCTR_LEN (8)
> +#define _ERR_STAT_GT_FATAL_VCTR_0 0x100260
> +#define _ERR_STAT_GT_FATAL_VCTR_1 0x100264
> +#define _ERR_STAT_GT_FATAL_VCTR_2 0x100268
> +#define _ERR_STAT_GT_FATAL_VCTR_3 0x10026c
> +#define _ERR_STAT_GT_FATAL_VCTR_4 0x100270
> +#define _ERR_STAT_GT_FATAL_VCTR_5 0x100274
> +#define _ERR_STAT_GT_FATAL_VCTR_6 0x100278
> +#define _ERR_STAT_GT_FATAL_VCTR_7 0x10027c
> +#define ERR_STAT_GT_FATAL_VCTR_REG(x) _MMIO(_PICK_EVEN((x), \
> + _ERR_STAT_GT_FATAL_VCTR_0, \
> + _ERR_STAT_GT_FATAL_VCTR_1))
> +
> #define _ERR_STAT_GT_COR 0x100160
> #define _ERR_STAT_GT_NONFATAL 0x100164
> #define _ERR_STAT_GT_FATAL 0x100168
> @@ -137,7 +173,42 @@ enum hardware_error {
> _ERR_STAT_GT_COR, \
> _ERR_STAT_GT_NONFATAL))
>
> -#define EU_GRF_ERROR REG_BIT(15)
> -#define EU_IC_ERROR REG_BIT(14)
> -
> +#define EU_GRF_COR_ERR (15)
> +#define EU_IC_COR_ERR (14)
> +#define SLM_COR_ERR (13)
> +#define SAMPLER_COR_ERR (12)
> +#define GUC_COR_ERR (1)
> +#define L3_SNG_COR_ERR (0)
> +
> +#define PVC_COR_ERR_MASK \
> + (REG_BIT(GUC_COR_ERR) | \
> + REG_BIT(SLM_COR_ERR) | \
> + REG_BIT(EU_IC_COR_ERR) | \
> + REG_BIT(EU_GRF_COR_ERR))
> +
> +#define EU_GRF_FAT_ERR (15)
> +#define EU_IC_FAT_ERR (14)
> +#define SLM_FAT_ERR (13)
> +#define SAMPLER_FAT_ERR (12)
> +#define SQIDI_FAT_ERR (9)
> +#define IDI_PAR_FAT_ERR (8)
> +#define GUC_FAT_ERR (6)
> +#define L3_ECC_CHK_FAT_ERR (5)
> +#define L3_DOUBLE_FAT_ERR (4)
> +#define FPU_UNCORR_FAT_ERR (3)
> +#define ARRAY_BIST_FAT_ERR (1)
> +
> +#define PVC_FAT_ERR_MASK \
> + (REG_BIT(FPU_UNCORR_FAT_ERR) | \
> + REG_BIT(GUC_FAT_ERR) | \
> + REG_BIT(SLM_FAT_ERR) | \
> + REG_BIT(EU_GRF_FAT_ERR))
> +
> +#define GT_HW_ERROR_MAX_ERR_BITS 16
> +
> +#define _SLM_ECC_ERROR_CNT 0xe7f4
> +#define _SLM_UNCORR_ECC_ERROR_CNT 0xe7c0
> +#define SLM_ECC_ERROR_CNTR(x) _MMIO((x) == HARDWARE_ERROR_CORRECTABLE ? \
> + _SLM_ECC_ERROR_CNT : \
> + _SLM_UNCORR_ECC_ERROR_CNT)
> #endif
> diff --git a/drivers/gpu/drm/xe/xe_device_types.h b/drivers/gpu/drm/xe/xe_device_types.h
> index 88f863edc41c..ecabf4d6690d 100644
> --- a/drivers/gpu/drm/xe/xe_device_types.h
> +++ b/drivers/gpu/drm/xe/xe_device_types.h
> @@ -99,6 +99,8 @@ struct xe_device {
> bool has_link_copy_engine;
> /** @enable_display: display enabled */
> bool enable_display;
> + /** @has_gt_error_vectors: whether platform supports ERROR VECTORS */
> + bool has_gt_error_vectors;
>
> #if IS_ENABLED(CONFIG_DRM_XE_DISPLAY)
> struct xe_device_display_info {
> diff --git a/drivers/gpu/drm/xe/xe_gt.c b/drivers/gpu/drm/xe/xe_gt.c
> index bc821f431c45..ce9ce2748394 100644
> --- a/drivers/gpu/drm/xe/xe_gt.c
> +++ b/drivers/gpu/drm/xe/xe_gt.c
> @@ -44,6 +44,35 @@
> #include "xe_wa.h"
> #include "xe_wopcm.h"
>
> +static const char * const xe_gt_driver_errors_to_str[] = {
> + [INTEL_GT_DRIVER_ERROR_INTERRUPT] = "INTERRUPT",
> +};
> +
> +void xe_gt_log_driver_error(struct xe_gt *gt,
It looks like the things this function gets used for fall into one of
two categories:

- Software assertions that we don't expect to happen unless we screwed
up and missed something in the code.
- Hardware is doing something out of spec (e.g., raising interrupt bits
that aren't documented).

For the first one, I don't see a reason to be counting up such errors;
they're just there for early developers and such mistakes should not be
present in the driver anymore by the time it gets into real users'
hands. For the latter, classifying these as "driver errors" is
incorrect since it's the hardware misbehaving, not the driver.

All of this new infrastructure seems pretty questionable at the moment.
We're doing extra work to count up errors, but then never doing anything
with the counts. You mention in the cover letter that these will be
exposed to userspace eventually, but what's the benefit of that? Which
userspace component is going to actually use this information? What do
you expect userspace to do if it finds out there's been a fatal or
correctable error in some low-level hardware unit? Generally userspace
shouldn't even need to care about the really low-level hardware details;
if something has truly gone fatally wrong, it's game over for userspace
and it probably doesn't matter exactly where in the hardware things are
actually busted.

Without some extra justification from the userspace point of view, it
feels like we're just adding a bunch of code that doesn't have a
real-world purpose.
> + const enum xe_gt_driver_errors error,
> + const char *fmt, ...)
> +{
> + struct va_format vaf;
> + va_list args;
> +
> + va_start(args, fmt);
> + vaf.fmt = fmt;
> + vaf.va = &args;
> +
> + BUILD_BUG_ON(ARRAY_SIZE(xe_gt_driver_errors_to_str) !=
> + INTEL_GT_DRIVER_ERROR_COUNT);
> +
> + WARN_ON_ONCE(error >= INTEL_GT_DRIVER_ERROR_COUNT);
> +
> + gt->errors.driver[error]++;
> +
> + drm_err_ratelimited(&gt_to_xe(gt)->drm, "GT%u [%s] %pV",
> + gt->info.id,
> + xe_gt_driver_errors_to_str[error],
> + &vaf);
> + va_end(args);
> +}
> +
> struct xe_gt *xe_find_full_gt(struct xe_gt *gt)
> {
> struct xe_gt *search;
> diff --git a/drivers/gpu/drm/xe/xe_gt_types.h b/drivers/gpu/drm/xe/xe_gt_types.h
> index 8f29aba455e0..9580a40c0142 100644
> --- a/drivers/gpu/drm/xe/xe_gt_types.h
> +++ b/drivers/gpu/drm/xe/xe_gt_types.h
> @@ -33,6 +33,43 @@ enum xe_gt_type {
> typedef unsigned long xe_dss_mask_t[BITS_TO_LONGS(32 * XE_MAX_DSS_FUSE_REGS)];
> typedef unsigned long xe_eu_mask_t[BITS_TO_LONGS(32 * XE_MAX_EU_FUSE_REGS)];
>
> +/* Count of GT Correctable and FATAL HW ERRORS */
> +enum intel_gt_hw_errors {
> + INTEL_GT_HW_ERROR_COR_SUBSLICE = 0,
> + INTEL_GT_HW_ERROR_COR_L3BANK,
> + INTEL_GT_HW_ERROR_COR_L3_SNG,
> + INTEL_GT_HW_ERROR_COR_GUC,
> + INTEL_GT_HW_ERROR_COR_SAMPLER,
> + INTEL_GT_HW_ERROR_COR_SLM,
> + INTEL_GT_HW_ERROR_COR_EU_IC,
> + INTEL_GT_HW_ERROR_COR_EU_GRF,
> + INTEL_GT_HW_ERROR_FAT_SUBSLICE,
> + INTEL_GT_HW_ERROR_FAT_L3BANK,
> + INTEL_GT_HW_ERROR_FAT_ARR_BIST,
> + INTEL_GT_HW_ERROR_FAT_FPU,
> + INTEL_GT_HW_ERROR_FAT_L3_DOUB,
> + INTEL_GT_HW_ERROR_FAT_L3_ECC_CHK,
> + INTEL_GT_HW_ERROR_FAT_GUC,
> + INTEL_GT_HW_ERROR_FAT_IDI_PAR,
> + INTEL_GT_HW_ERROR_FAT_SQIDI,
> + INTEL_GT_HW_ERROR_FAT_SAMPLER,
> + INTEL_GT_HW_ERROR_FAT_SLM,
> + INTEL_GT_HW_ERROR_FAT_EU_IC,
> + INTEL_GT_HW_ERROR_FAT_EU_GRF,
> + INTEL_GT_HW_ERROR_FAT_TLB,
> + INTEL_GT_HW_ERROR_FAT_L3_FABRIC,
> + INTEL_GT_HW_ERROR_COUNT
> +};
> +
> +enum xe_gt_driver_errors {
> + INTEL_GT_DRIVER_ERROR_INTERRUPT = 0,
> + INTEL_GT_DRIVER_ERROR_COUNT
> +};
> +
> +void xe_gt_log_driver_error(struct xe_gt *gt,
> + const enum xe_gt_driver_errors error,
> + const char *fmt, ...);
> +
> struct xe_mmio_range {
> u32 start;
> u32 end;
> @@ -357,6 +394,12 @@ struct xe_gt {
> * of a steered operation
> */
> spinlock_t mcr_lock;
> +
> + struct intel_hw_errors {
> + unsigned long hw[INTEL_GT_HW_ERROR_COUNT];
> + unsigned long driver[INTEL_GT_DRIVER_ERROR_COUNT];
> + } errors;
> +
> };
>
> #endif
> diff --git a/drivers/gpu/drm/xe/xe_irq.c b/drivers/gpu/drm/xe/xe_irq.c
> index 6b922332bff1..4626f7280aaf 100644
> --- a/drivers/gpu/drm/xe/xe_irq.c
> +++ b/drivers/gpu/drm/xe/xe_irq.c
> @@ -19,6 +19,7 @@
> #include "xe_hw_engine.h"
> #include "xe_mmio.h"
>
> +#define HAS_GT_ERROR_VECTORS(xe) ((xe)->info.has_gt_error_vectors)
> static void gen3_assert_iir_is_zero(struct xe_gt *gt, i915_reg_t reg)
> {
> u32 val = xe_mmio_read32(gt, reg.reg);
> @@ -359,44 +360,281 @@ hardware_error_type_to_str(const enum hardware_error hw_err)
> }
> }
>
> +#define xe_gt_hw_err(gt, fmt, ...) \
> + drm_err_ratelimited(&gt_to_xe(gt)->drm, HW_ERR "GT%d detected " fmt, \
> + (gt)->info.id, ##__VA_ARGS__)
As on the previous patch, it looks like we're printing error-level
kernel messages for correctable errors (i.e., things the hardware caught
and fixed internally like ECC). Generally those kinds of things
shouldn't be putting errors in the kernel log because there's no actual
problem from the end user perspective.
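One way to act on this, as a rough user-space sketch rather than a proposal for the actual code: pick the log level from the error severity instead of always logging at error level. The enum values below are assumed from the register layout in this patch, and hw_log_level and hw_error_log_level() are hypothetical names. Mapping non-fatal errors to a warning is also just an assumption of the sketch.

```c
#include <assert.h>

/* Assumed to mirror the driver's enum; only HARDWARE_ERROR_MAX is visible in this patch. */
enum hardware_error {
	HARDWARE_ERROR_CORRECTABLE = 0,
	HARDWARE_ERROR_NONFATAL = 1,
	HARDWARE_ERROR_FATAL = 2,
	HARDWARE_ERROR_MAX,
};

/* Hypothetical log levels for this sketch. */
enum hw_log_level {
	HW_LOG_DEBUG,
	HW_LOG_WARN,
	HW_LOG_ERR,
};

/*
 * Correctable errors were already fixed by the hardware, so they stay out of
 * the error log; anything fatal or unrecognized is reported at error level.
 */
static enum hw_log_level hw_error_log_level(enum hardware_error hw_err)
{
	switch (hw_err) {
	case HARDWARE_ERROR_CORRECTABLE:
		return HW_LOG_DEBUG;
	case HARDWARE_ERROR_NONFATAL:
		return HW_LOG_WARN;
	default:
		return HW_LOG_ERR;
	}
}
```

In the driver, the debug case would presumably map to something like drm_dbg() instead of drm_err_ratelimited(), while the counters could still be updated regardless of what gets printed.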
> +
> static void
> -xe_gt_hw_error_handler(struct xe_gt *gt, const enum hardware_error hw_err)
> +xe_gt_correctable_hw_error_stats_update(struct xe_gt *gt, unsigned long errstat)
> {
> - const char *hw_err_str = hardware_error_type_to_str(hw_err);
> - u32 other_errors = ~(EU_GRF_ERROR | EU_IC_ERROR);
> - u32 errstat;
> + u32 errbit, cnt;
>
> - lockdep_assert_held(&gt_to_xe(gt)->irq.lock);
> + if (!errstat && HAS_GT_ERROR_VECTORS(gt_to_xe(gt)))
> + return;
>
> - errstat = xe_mmio_read32(gt, ERR_STAT_GT_REG(hw_err).reg);
> + for_each_set_bit(errbit, &errstat, GT_HW_ERROR_MAX_ERR_BITS) {
> + if (gt->xe->info.platform == XE_PVC && !(REG_BIT(errbit) & PVC_COR_ERR_MASK)) {
> + xe_gt_log_driver_error(gt, INTEL_GT_DRIVER_ERROR_INTERRUPT,
> + "UNKNOWN CORRECTABLE error\n");
> + continue;
> + }
>
> - if (unlikely(!errstat)) {
> - DRM_ERROR("ERR_STAT_GT_REG_%s blank!\n", hw_err_str);
> - return;
> + switch (errbit) {
> + case L3_SNG_COR_ERR:
> + gt->errors.hw[INTEL_GT_HW_ERROR_COR_L3_SNG]++;
> + xe_gt_hw_err(gt, "L3 SINGLE CORRECTABLE error\n");
> + break;
> + case GUC_COR_ERR:
> + gt->errors.hw[INTEL_GT_HW_ERROR_COR_GUC]++;
> + xe_gt_hw_err(gt, "SINGLE BIT GUC SRAM CORRECTABLE error\n");
> + break;
> + case SAMPLER_COR_ERR:
> + gt->errors.hw[INTEL_GT_HW_ERROR_COR_SAMPLER]++;
> + xe_gt_hw_err(gt, "SINGLE BIT SAMPLER CORRECTABLE error\n");
> + break;
> + case SLM_COR_ERR:
> + cnt = xe_mmio_read32(gt, SLM_ECC_ERROR_CNTR(HARDWARE_ERROR_CORRECTABLE).reg);
> + gt->errors.hw[INTEL_GT_HW_ERROR_COR_SLM] = cnt;
> + xe_gt_hw_err(gt, "%u SINGLE BIT SLM CORRECTABLE error\n", cnt);
> + break;
> + case EU_IC_COR_ERR:
> + gt->errors.hw[INTEL_GT_HW_ERROR_COR_EU_IC]++;
> + xe_gt_hw_err(gt, "SINGLE BIT EU IC CORRECTABLE error\n");
> + break;
> + case EU_GRF_COR_ERR:
> + gt->errors.hw[INTEL_GT_HW_ERROR_COR_EU_GRF]++;
> + xe_gt_hw_err(gt, "SINGLE BIT EU GRF CORRECTABLE error\n");
> + break;
> + default:
> + xe_gt_log_driver_error(gt, INTEL_GT_DRIVER_ERROR_INTERRUPT, "UNKNOWN CORRECTABLE error\n");
> + break;
> + }
> }
> +}
>
> - /*
> - * TODO: The GT Non Fatal Error Status Register
> - * only has reserved bitfields defined.
> - * Remove once there is something to service.
> - */
> - if (hw_err == HARDWARE_ERROR_NONFATAL) {
> - DRM_ERROR("detected Non-Fatal error\n");
> - xe_mmio_write32(gt, ERR_STAT_GT_REG(hw_err).reg, errstat);
> +static void xe_gt_fatal_hw_error_stats_update(struct xe_gt *gt, unsigned long errstat)
> +{
> + u32 errbit, cnt;
> +
> + if (!errstat && HAS_GT_ERROR_VECTORS(gt_to_xe(gt)))
> return;
> +
> + for_each_set_bit(errbit, &errstat, GT_HW_ERROR_MAX_ERR_BITS) {
> + if (gt->xe->info.platform == XE_PVC && !(REG_BIT(errbit) & PVC_FAT_ERR_MASK)) {
> + xe_gt_log_driver_error(gt, INTEL_GT_DRIVER_ERROR_INTERRUPT,
> + "UNKNOWN FATAL error\n");
> + continue;
> + }
> +
> + switch (errbit) {
> + case ARRAY_BIST_FAT_ERR:
> + gt->errors.hw[INTEL_GT_HW_ERROR_FAT_ARR_BIST]++;
> + xe_gt_hw_err(gt, "Array BIST FATAL error\n");
> + break;
> + case FPU_UNCORR_FAT_ERR:
> + gt->errors.hw[INTEL_GT_HW_ERROR_FAT_FPU]++;
> + xe_gt_hw_err(gt, "FPU FATAL error\n");
> + break;
> + case L3_DOUBLE_FAT_ERR:
> + gt->errors.hw[INTEL_GT_HW_ERROR_FAT_L3_DOUB]++;
> + xe_gt_hw_err(gt, "L3 Double FATAL error\n");
> + break;
> + case L3_ECC_CHK_FAT_ERR:
> + gt->errors.hw[INTEL_GT_HW_ERROR_FAT_L3_ECC_CHK]++;
> + xe_gt_hw_err(gt, "L3 ECC Checker FATAL error\n");
> + break;
> + case GUC_FAT_ERR:
> + gt->errors.hw[INTEL_GT_HW_ERROR_FAT_GUC]++;
> + xe_gt_hw_err(gt, "GUC SRAM FATAL error\n");
> + break;
> + case IDI_PAR_FAT_ERR:
> + gt->errors.hw[INTEL_GT_HW_ERROR_FAT_IDI_PAR]++;
> + xe_gt_hw_err(gt, "IDI PARITY FATAL error\n");
> + break;
> + case SQIDI_FAT_ERR:
> + gt->errors.hw[INTEL_GT_HW_ERROR_FAT_SQIDI]++;
> + xe_gt_hw_err(gt, "SQIDI FATAL error\n");
> + break;
> + case SAMPLER_FAT_ERR:
> + gt->errors.hw[INTEL_GT_HW_ERROR_FAT_SAMPLER]++;
> + xe_gt_hw_err(gt, "SAMPLER FATAL error\n");
> + break;
> + case SLM_FAT_ERR:
> + cnt = xe_mmio_read32(gt, SLM_ECC_ERROR_CNTR(HARDWARE_ERROR_FATAL).reg);
> + gt->errors.hw[INTEL_GT_HW_ERROR_FAT_SLM] = cnt;
> + xe_gt_hw_err(gt, "%u SLM FATAL error\n", cnt);
> + break;
> + case EU_IC_FAT_ERR:
> + gt->errors.hw[INTEL_GT_HW_ERROR_FAT_EU_IC]++;
> + xe_gt_hw_err(gt, "EU IC FATAL error\n");
> + break;
> + case EU_GRF_FAT_ERR:
> + gt->errors.hw[INTEL_GT_HW_ERROR_FAT_EU_GRF]++;
> + xe_gt_hw_err(gt, "EU GRF FATAL error\n");
> + break;
> + default:
> + xe_gt_log_driver_error(gt, INTEL_GT_DRIVER_ERROR_INTERRUPT,
> + "UNKNOWN FATAL error\n");
> + break;
> + }
> }
> +}
>
> - /*
> - * TODO: The remaining GT errors don't have a
> - * need for targeted logging at the moment. We
> - * still want to log detection of these errors, but
> - * let's aggregate them until someone has a need for them.
> - */
> - if (errstat & other_errors)
> - DRM_ERROR("detected hardware error(s) in ERR_STAT_GT_REG_%s: 0x%08x\n",
> - hw_err_str, errstat & other_errors);
> +static void
> +xe_gt_hw_error_handler(struct xe_gt *gt, const enum hardware_error hw_err)
> +{
> + const char *hw_err_str = hardware_error_type_to_str(hw_err);
> + unsigned long errstat;
> +
> + lockdep_assert_held(&gt_to_xe(gt)->irq.lock);
>
> - xe_mmio_write32(gt, ERR_STAT_GT_REG(hw_err).reg, errstat);
> + if (!HAS_GT_ERROR_VECTORS(gt_to_xe(gt))) {
> + errstat = xe_mmio_read32(gt, ERR_STAT_GT_REG(hw_err).reg);
> + if (unlikely(!errstat)) {
> + xe_gt_log_driver_error(gt, INTEL_GT_DRIVER_ERROR_INTERRUPT,
> + "ERR_STAT_GT_REG_%s blank!\n", hw_err_str);
> + return;
> + }
> + }
> +
> + switch (hw_err) {
> + case HARDWARE_ERROR_CORRECTABLE:
> + if (HAS_GT_ERROR_VECTORS(gt_to_xe(gt))) {
> + bool error = false;
> + int i;
> +
> + errstat = 0;
> + for (i = 0; i < ERR_STAT_GT_COR_VCTR_LEN; i++) {
> + u32 err_type = ERR_STAT_GT_COR_VCTR_LEN;
> + unsigned long vctr;
> + const char *name;
> +
> + vctr = xe_mmio_read32(gt, ERR_STAT_GT_COR_VCTR_REG(i).reg);
> + if (!vctr)
> + continue;
> +
> + switch (i) {
> + case ERR_STAT_GT_VCTR0:
> + case ERR_STAT_GT_VCTR1:
> + err_type = INTEL_GT_HW_ERROR_COR_SUBSLICE;
> + gt->errors.hw[err_type] += hweight32(vctr);
> + name = "SUBSLICE";
> +
> + /* Avoid second read/write to error status register*/
> + if (errstat)
> + break;
> +
> + errstat = xe_mmio_read32(gt, ERR_STAT_GT_REG(hw_err).reg);
> + xe_gt_hw_err(gt, "ERR_STAT_GT_CORRECTABLE:0x%08lx\n",
> + errstat);
> + xe_gt_correctable_hw_error_stats_update(gt, errstat);
> + if (errstat)
> + xe_mmio_write32(gt, ERR_STAT_GT_REG(hw_err).reg,
> + errstat);
> + break;
> +
> + case ERR_STAT_GT_VCTR2:
> + case ERR_STAT_GT_VCTR3:
> + err_type = INTEL_GT_HW_ERROR_COR_L3BANK;
> + gt->errors.hw[err_type] += hweight32(vctr);
> + name = "L3 BANK";
> + break;
> + default:
> + name = "UNKNOWN";
> + break;
> + }
> + xe_mmio_write32(gt, ERR_STAT_GT_COR_VCTR_REG(i).reg, vctr);
> + xe_gt_hw_err(gt, "%s CORRECTABLE error, ERR_VECT_GT_CORRECTABLE_%d:0x%08lx\n",
> + name, i, vctr);
> + error = true;
> + }
> +
> + if (!error)
> + xe_gt_hw_err(gt, "UNKNOWN CORRECTABLE error\n");
> + } else {
> + xe_gt_correctable_hw_error_stats_update(gt, errstat);
> + xe_gt_hw_err(gt, "ERR_STAT_GT_CORRECTABLE:0x%08lx\n", errstat);
> + }
> + break;
> + case HARDWARE_ERROR_NONFATAL:
> + /*
> + * TODO: The GT Non Fatal Error Status Register
> + * only has reserved bitfields defined.
> + * Remove once there is something to service.
> + */
> + drm_err_ratelimited(&gt_to_xe(gt)->drm, HW_ERR "detected Non-Fatal error\n");
> + break;
> + case HARDWARE_ERROR_FATAL:
> + if (HAS_GT_ERROR_VECTORS(gt_to_xe(gt))) {
> + bool error = false;
> + int i;
> +
> + errstat = 0;
> + for (i = 0; i < ERR_STAT_GT_FATAL_VCTR_LEN; i++) {
> + u32 err_type = ERR_STAT_GT_FATAL_VCTR_LEN;
> + unsigned long vctr;
> + const char *name;
> +
> + vctr = xe_mmio_read32(gt, ERR_STAT_GT_FATAL_VCTR_REG(i).reg);
> + if (!vctr)
> + continue;
> +
> + /* i represents the vector register index */
> + switch (i) {
> + case ERR_STAT_GT_VCTR0:
> + case ERR_STAT_GT_VCTR1:
> + err_type = INTEL_GT_HW_ERROR_FAT_SUBSLICE;
> + gt->errors.hw[err_type] += hweight32(vctr);
> + name = "SUBSLICE";
> +
> + /*Avoid second read/write to error status register.*/
> + if (errstat)
> + break;
> +
> + errstat = xe_mmio_read32(gt, ERR_STAT_GT_REG(hw_err).reg);
> + xe_gt_hw_err(gt, "ERR_STAT_GT_FATAL:0x%08lx\n", errstat);
> + xe_gt_fatal_hw_error_stats_update(gt, errstat);
> + if (errstat)
> + xe_mmio_write32(gt, ERR_STAT_GT_REG(hw_err).reg,
> + errstat);
> + break;
> +
> + case ERR_STAT_GT_VCTR2:
> + case ERR_STAT_GT_VCTR3:
> + err_type = INTEL_GT_HW_ERROR_FAT_L3BANK;
> + gt->errors.hw[err_type] += hweight32(vctr);
> + name = "L3 BANK";
> + break;
> + case ERR_STAT_GT_VCTR6:
> + gt->errors.hw[INTEL_GT_HW_ERROR_FAT_TLB] += hweight16(vctr);
> + name = "TLB";
> + break;
> + case ERR_STAT_GT_VCTR7:
> + gt->errors.hw[INTEL_GT_HW_ERROR_FAT_L3_FABRIC] += hweight8(vctr);
> + name = "L3 FABRIC";
> + break;
> + default:
> + name = "UNKNOWN";
> + break;
> + }
> + xe_mmio_write32(gt, ERR_STAT_GT_FATAL_VCTR_REG(i).reg, vctr);
> + xe_gt_hw_err(gt, "%s FATAL error, ERR_VECT_GT_FATAL_%d:0x%08lx\n",
> + name, i, vctr);
> + error = true;
> + }
> + if (!error)
> + xe_gt_hw_err(gt, "UNKNOWN FATAL error\n");
> + } else {
> + xe_gt_fatal_hw_error_stats_update(gt, errstat);
> + xe_gt_hw_err(gt, "ERR_STAT_GT_FATAL:0x%08lx\n", errstat);
> + }
> + break;
> + default:
> + break;
> + }
> +
> + if (!HAS_GT_ERROR_VECTORS(gt_to_xe(gt)))
> + xe_mmio_write32(gt, ERR_STAT_GT_REG(hw_err).reg, errstat);
> }
>
> static void
> @@ -409,7 +647,8 @@ xe_hw_error_source_handler(struct xe_gt *gt, const enum hardware_error hw_err)
> spin_lock_irqsave(&gt_to_xe(gt)->irq.lock, flags);
> errsrc = xe_mmio_read32(gt, DEV_ERR_STAT_REG(hw_err).reg);
> if (unlikely(!errsrc)) {
> - DRM_ERROR("DEV_ERR_STAT_REG_%s blank!\n", hw_err_str);
> + xe_gt_log_driver_error(gt, INTEL_GT_DRIVER_ERROR_INTERRUPT,
> + "DEV_ERR_STAT_REG_%s blank!\n", hw_err_str);
> goto out_unlock;
> }
>
> @@ -417,8 +656,9 @@ xe_hw_error_source_handler(struct xe_gt *gt, const enum hardware_error hw_err)
> xe_gt_hw_error_handler(gt, hw_err);
>
> if (errsrc & ~DEV_ERR_STAT_GT_ERROR)
> - DRM_ERROR("non-GT hardware error(s) in DEV_ERR_STAT_REG_%s: 0x%08x\n",
> - hw_err_str, errsrc & ~DEV_ERR_STAT_GT_ERROR);
> + xe_gt_log_driver_error(gt, INTEL_GT_DRIVER_ERROR_INTERRUPT,
> + "non-GT hardware error(s) in DEV_ERR_STAT_REG_%s: 0x%08x\n",
> + hw_err_str, errsrc & ~DEV_ERR_STAT_GT_ERROR);
>
> xe_mmio_write32(gt, DEV_ERR_STAT_REG(hw_err).reg, errsrc);
>
> @@ -634,12 +874,44 @@ static void irq_uninstall(struct drm_device *drm, void *arg)
> pci_disable_msi(pdev);
> }
>
> +/**
> + * process_hw_errors - checks for the occurrence of HW errors
> + *
> + * This checks for the HW Errors including FATAL error that might
> + * have occurred in the previous boot of the driver which will
> + * initiate PCIe FLR reset of the device and cause the
> + * driver to reload.
Is this saying that there's already been a PCIe FLR and you're trying to
read the registers after that reset has happened? The bspec indicates
that these registers have 'DEV' style reset, so they wouldn't be able to
preserve their values across a reset.
> + */
> +static void process_hw_errors(struct xe_device *xe)
> +{
> + struct xe_gt *gt0 = xe_device_get_gt(xe, 0);
> + u32 dev_pcieerr_status, master_ctl;
> + struct xe_gt *gt;
> + int i;
> +
> + dev_pcieerr_status = xe_mmio_read32(gt0, DEV_PCIEERR_STATUS.reg);
> +
> + for_each_gt(gt, xe, i) {
> + if (dev_pcieerr_status & DEV_PCIEERR_IS_FATAL(i))
> + xe_hw_error_source_handler(gt, HARDWARE_ERROR_FATAL);
> +
> + master_ctl = xe_mmio_read32(gt, GEN11_GFX_MSTR_IRQ.reg);
> + xe_mmio_write32(gt, GEN11_GFX_MSTR_IRQ.reg, master_ctl);
> + xe_hw_error_irq_handler(gt, master_ctl);
> + }
> + if (dev_pcieerr_status)
> + xe_mmio_write32(gt, DEV_PCIEERR_STATUS.reg, dev_pcieerr_status);
> +}
> +
> int xe_irq_install(struct xe_device *xe)
> {
> int irq = to_pci_dev(xe->drm.dev)->irq;
> irq_handler_t irq_handler;
> int err;
>
> + if (IS_DGFX(xe))
> + process_hw_errors(xe);
Why is this conditional on DGFX? From what I can see, this also applies
to integrated platforms like MTL.
Matt
> +
> irq_handler = xe_irq_handler(xe);
> if (!irq_handler) {
> drm_err(&xe->drm, "No supported interrupt handler");
> diff --git a/drivers/gpu/drm/xe/xe_pci.c b/drivers/gpu/drm/xe/xe_pci.c
> index 1844cff8fba8..69098194cef8 100644
> --- a/drivers/gpu/drm/xe/xe_pci.c
> +++ b/drivers/gpu/drm/xe/xe_pci.c
> @@ -73,6 +73,7 @@ struct xe_device_desc {
> bool has_range_tlb_invalidation;
> bool has_asid;
> bool has_link_copy_engine;
> + bool has_gt_error_vectors;
> };
>
> __diag_push();
> @@ -232,6 +233,7 @@ static const struct xe_device_desc pvc_desc = {
> .supports_usm = true,
> .has_asid = true,
> .has_link_copy_engine = true,
> + .has_gt_error_vectors = true,
> };
>
> #define MTL_MEDIA_ENGINES \
> @@ -418,6 +420,7 @@ static int xe_pci_probe(struct pci_dev *pdev, const struct pci_device_id *ent)
> xe->info.vm_max_level = desc->vm_max_level;
> xe->info.supports_usm = desc->supports_usm;
> xe->info.has_asid = desc->has_asid;
> + xe->info.has_gt_error_vectors = desc->has_gt_error_vectors;
> xe->info.has_flat_ccs = desc->has_flat_ccs;
> xe->info.has_4tile = desc->has_4tile;
> xe->info.has_range_tlb_invalidation = desc->has_range_tlb_invalidation;
> --
> 2.25.1
>
--
Matt Roper
Graphics Software Engineer
Linux GPU Platform Enablement
Intel Corporation
* Re: [Intel-xe] [PATCH 2/4] drm/xe/ras: Log the GT hw errors.
2023-04-25 0:22 ` Matt Roper
@ 2023-04-28 8:00 ` Ghimiray, Himal Prasad
2023-05-04 0:02 ` Matt Roper
0 siblings, 1 reply; 22+ messages in thread
From: Ghimiray, Himal Prasad @ 2023-04-28 8:00 UTC (permalink / raw)
To: Roper, Matthew D; +Cc: intel-xe@lists.freedesktop.org
> -----Original Message-----
> From: Roper, Matthew D <matthew.d.roper@intel.com>
> Sent: 25 April 2023 05:53
> To: Ghimiray, Himal Prasad <himal.prasad.ghimiray@intel.com>
> Cc: intel-xe@lists.freedesktop.org
> Subject: Re: [Intel-xe] [PATCH 2/4] drm/xe/ras: Log the GT hw errors.
>
> On Thu, Apr 06, 2023 at 02:56:29PM +0530, Himal Prasad Ghimiray wrote:
> > From: Aravind Iddamsetty <aravind.iddamsetty@intel.com>
> >
> > Count the CORRECTABLE and FATAL GT hardware errors as signaled by
> > relevant interrupt and respective registers.
> >
> > For non relevant interrupts count them as driver interrupt error.
> >
> > For platform supporting error vector registers count and report the
> > respective vector errors.
> >
> > Co-authored-by: Himal Prasad Ghimiray
> > <himal.prasad.ghimiray@intel.com>
> > Signed-off-by: Aravind Iddamsetty <aravind.iddamsetty@intel.com>
> > Signed-off-by: Himal Prasad Ghimiray <himal.prasad.ghimiray@intel.com>
> > ---
> > drivers/gpu/drm/xe/regs/xe_regs.h | 77 ++++++-
> > drivers/gpu/drm/xe/xe_device_types.h | 2 +
> > drivers/gpu/drm/xe/xe_gt.c | 29 +++
> > drivers/gpu/drm/xe/xe_gt_types.h | 43 ++++
> > drivers/gpu/drm/xe/xe_irq.c | 332 ++++++++++++++++++++++++---
> > drivers/gpu/drm/xe/xe_pci.c | 3 +
> > 6 files changed, 453 insertions(+), 33 deletions(-)
> >
> > diff --git a/drivers/gpu/drm/xe/regs/xe_regs.h
> > b/drivers/gpu/drm/xe/regs/xe_regs.h
> > index dff74b093d4e..b3d35d0c5a77 100644
> > --- a/drivers/gpu/drm/xe/regs/xe_regs.h
> > +++ b/drivers/gpu/drm/xe/regs/xe_regs.h
> > @@ -122,14 +122,50 @@ enum hardware_error {
> > HARDWARE_ERROR_MAX,
> > };
> >
> > +#define DEV_PCIEERR_STATUS _MMIO(0x100180)
> > +#define DEV_PCIEERR_IS_FATAL(x) (REG_BIT(2) << (x * 4))
>
> Just move the math inside the REG_BIT parameter. I.e.
>
> #define DEV_PCIEERR_IS_FATAL(x) REG_BIT(x * 4 + 2)
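A quick user-space check of this suggestion, assuming a local stand-in
for REG_BIT() (the kernel's real macro is not reproduced here) and
hypothetical macro names, suggests the two forms produce the same mask
for each GT index. The sketch also parenthesizes the macro argument,
which the form quoted above does not do.

```c
#include <assert.h>
#include <stdint.h>

/* Local stand-in for the kernel's REG_BIT(); an assumption for this sketch. */
#define SKETCH_REG_BIT(n) ((uint32_t)1 << (n))

/* Form used in the patch: shift bit 2 by four bits per GT. */
#define PCIEERR_FATAL_SHIFTED(x) (SKETCH_REG_BIT(2) << ((x) * 4))

/* Form suggested in review: compute the bit index first. */
#define PCIEERR_FATAL_INDEXED(x) SKETCH_REG_BIT((x) * 4 + 2)

/* Returns 1 when both forms give the same mask for GT index gt. */
static int pcieerr_forms_agree(unsigned int gt)
{
	assert(gt < 8); /* (gt * 4 + 2) must stay below 32 */
	return PCIEERR_FATAL_SHIFTED(gt) == PCIEERR_FATAL_INDEXED(gt);
}
```

With this stand-in, GT0 maps to bit 2 and GT1 to bit 6 in either form.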
>
> > #define _DEV_ERR_STAT_FATAL 0x100174
> > #define _DEV_ERR_STAT_NONFATAL 0x100178
> > #define _DEV_ERR_STAT_CORRECTABLE 0x10017c
> > #define DEV_ERR_STAT_REG(x) _MMIO(_PICK_EVEN((x), \
> >
> _DEV_ERR_STAT_CORRECTABLE, \
> >
> _DEV_ERR_STAT_NONFATAL))
> > +
> > #define DEV_ERR_STAT_GT_ERROR REG_BIT(0)
> >
> > +enum gt_vctr_registers {
> > + ERR_STAT_GT_VCTR0 = 0,
> > + ERR_STAT_GT_VCTR1,
> > + ERR_STAT_GT_VCTR2,
> > + ERR_STAT_GT_VCTR3,
> > + ERR_STAT_GT_VCTR4,
> > + ERR_STAT_GT_VCTR5,
> > + ERR_STAT_GT_VCTR6,
> > + ERR_STAT_GT_VCTR7,
> > +};
> > +
> > +#define ERR_STAT_GT_COR_VCTR_LEN (4)
> > +#define _ERR_STAT_GT_COR_VCTR_0 0x1002a0
> > +#define _ERR_STAT_GT_COR_VCTR_1 0x1002a4
> > +#define _ERR_STAT_GT_COR_VCTR_2 0x1002a8
> > +#define _ERR_STAT_GT_COR_VCTR_3 0x1002ac
> > +#define ERR_STAT_GT_COR_VCTR_REG(x) _MMIO(_PICK_EVEN((x), \
> > +
> _ERR_STAT_GT_COR_VCTR_0, \
> > +
> _ERR_STAT_GT_COR_VCTR_1))
> > +
> > +#define ERR_STAT_GT_FATAL_VCTR_LEN (8)
> > +#define _ERR_STAT_GT_FATAL_VCTR_0 0x100260
> > +#define _ERR_STAT_GT_FATAL_VCTR_1 0x100264
> > +#define _ERR_STAT_GT_FATAL_VCTR_2 0x100268
> > +#define _ERR_STAT_GT_FATAL_VCTR_3 0x10026c
> > +#define _ERR_STAT_GT_FATAL_VCTR_4 0x100270
> > +#define _ERR_STAT_GT_FATAL_VCTR_5 0x100274
> > +#define _ERR_STAT_GT_FATAL_VCTR_6 0x100278
> > +#define _ERR_STAT_GT_FATAL_VCTR_7 0x10027c
> > +#define ERR_STAT_GT_FATAL_VCTR_REG(x) _MMIO(_PICK_EVEN((x), \
> > + _ERR_STAT_GT_FATAL_VCTR_0, \
> > + _ERR_STAT_GT_FATAL_VCTR_1))
> > +
> > #define _ERR_STAT_GT_COR 0x100160
> > #define _ERR_STAT_GT_NONFATAL 0x100164
> > #define _ERR_STAT_GT_FATAL 0x100168
> > @@ -137,7 +173,42 @@ enum hardware_error {
> > _ERR_STAT_GT_COR, \
> > _ERR_STAT_GT_NONFATAL))
> >
> > -#define EU_GRF_ERROR REG_BIT(15)
> > -#define EU_IC_ERROR REG_BIT(14)
> > -
> > +#define EU_GRF_COR_ERR (15)
> > +#define EU_IC_COR_ERR (14)
> > +#define SLM_COR_ERR (13)
> > +#define SAMPLER_COR_ERR (12)
> > +#define GUC_COR_ERR (1)
> > +#define L3_SNG_COR_ERR (0)
> > +
> > +#define PVC_COR_ERR_MASK \
> > + (REG_BIT(GUC_COR_ERR) | \
> > + REG_BIT(SLM_COR_ERR) | \
> > + REG_BIT(EU_IC_COR_ERR) | \
> > + REG_BIT(EU_GRF_COR_ERR))
> > +
> > +#define EU_GRF_FAT_ERR (15)
> > +#define EU_IC_FAT_ERR (14)
> > +#define SLM_FAT_ERR (13)
> > +#define SAMPLER_FAT_ERR (12)
> > +#define SQIDI_FAT_ERR (9)
> > +#define IDI_PAR_FAT_ERR (8)
> > +#define GUC_FAT_ERR (6)
> > +#define L3_ECC_CHK_FAT_ERR (5)
> > +#define L3_DOUBLE_FAT_ERR (4)
> > +#define FPU_UNCORR_FAT_ERR (3)
> > +#define ARRAY_BIST_FAT_ERR (1)
> > +
> > +#define PVC_FAT_ERR_MASK \
> > + (REG_BIT(FPU_UNCORR_FAT_ERR) | \
> > + REG_BIT(GUC_FAT_ERR) | \
> > + REG_BIT(SLM_FAT_ERR) | \
> > + REG_BIT(EU_GRF_FAT_ERR))
> > +
> > +#define GT_HW_ERROR_MAX_ERR_BITS 16
> > +
> > +#define _SLM_ECC_ERROR_CNT 0xe7f4
> > +#define _SLM_UNCORR_ECC_ERROR_CNT 0xe7c0
> > +#define SLM_ECC_ERROR_CNTR(x) _MMIO((x) ==
> HARDWARE_ERROR_CORRECTABLE ? \
> > + _SLM_ECC_ERROR_CNT : \
> > +
> _SLM_UNCORR_ECC_ERROR_CNT)
> > #endif
> > diff --git a/drivers/gpu/drm/xe/xe_device_types.h
> > b/drivers/gpu/drm/xe/xe_device_types.h
> > index 88f863edc41c..ecabf4d6690d 100644
> > --- a/drivers/gpu/drm/xe/xe_device_types.h
> > +++ b/drivers/gpu/drm/xe/xe_device_types.h
> > @@ -99,6 +99,8 @@ struct xe_device {
> > bool has_link_copy_engine;
> > /** @enable_display: display enabled */
> > bool enable_display;
> > + /** @has_gt_error_vectors: whether platform supports
> ERROR VECTORS */
> > + bool has_gt_error_vectors;
> >
> > #if IS_ENABLED(CONFIG_DRM_XE_DISPLAY)
> > struct xe_device_display_info {
> > diff --git a/drivers/gpu/drm/xe/xe_gt.c b/drivers/gpu/drm/xe/xe_gt.c
> > index bc821f431c45..ce9ce2748394 100644
> > --- a/drivers/gpu/drm/xe/xe_gt.c
> > +++ b/drivers/gpu/drm/xe/xe_gt.c
> > @@ -44,6 +44,35 @@
> > #include "xe_wa.h"
> > #include "xe_wopcm.h"
> >
> > +static const char * const xe_gt_driver_errors_to_str[] = {
> > + [INTEL_GT_DRIVER_ERROR_INTERRUPT] = "INTERRUPT", };
> > +
> > +void xe_gt_log_driver_error(struct xe_gt *gt,
>
> It looks like the things this function gets used for fall into one of two
> categories:
>
> - Software assertions that we don't expect to happen unless we screwed
> up and missed something in the code.
> - Hardware is doing something out of spec (e.g., raising interrupt bits
> that aren't documented).
>
> For the first one, I don't see a reason to be counting up such errors; they're
> just there for early developers and such mistakes should not be present in
> the driver anymore by the time it gets into real users'
> hands. For the latter, classifying these as "driver errors" is incorrect since it's
> the hardware misbehaving, not the driver.
Yes, you are right. I will address this in the next patch.
>
> All of this new infrastructure seems pretty questionable at the moment.
> We're doing extra work to count up errors, but then never doing anything
> with the counts. You mention in the cover letter that these will be exposed
> to userspace eventually, but what's the benefit of that? Which userspace
> component is going to actually use this information? What do you expect
> userspace to do if it finds out there's been a fatal or correctable error in
> some low-level hardware unit? Generally userspace shouldn't even need to
> care about the really low-level hardware details; if something has truly gone
> fatally wrong, it's game over for userspace and it probably doesn't matter
> exactly where in the hardware things are actually busted.
>
> Without some extra justification from the userspace point of view, it feels
> like we're just adding a bunch of code that doesn't have a real-world
> purpose.
The error counters exposed by the KMD will be used by sysman.
They will be mapped to specific error categories in sysman:
https://spec.oneapi.io/level-zero/latest/sysman/api.html#ras
>
> > + const enum xe_gt_driver_errors error,
> > + const char *fmt, ...)
> > +{
> > + struct va_format vaf;
> > + va_list args;
> > +
> > + va_start(args, fmt);
> > + vaf.fmt = fmt;
> > + vaf.va = &args;
> > +
> > + BUILD_BUG_ON(ARRAY_SIZE(xe_gt_driver_errors_to_str) !=
> > + INTEL_GT_DRIVER_ERROR_COUNT);
> > +
> > + WARN_ON_ONCE(error >= INTEL_GT_DRIVER_ERROR_COUNT);
> > +
> > + gt->errors.driver[error]++;
> > +
> > + drm_err_ratelimited(&gt_to_xe(gt)->drm, "GT%u [%s] %pV",
> > + gt->info.id,
> > + xe_gt_driver_errors_to_str[error],
> > + &vaf);
> > + va_end(args);
> > +}
> > +
> > struct xe_gt *xe_find_full_gt(struct xe_gt *gt) {
> > struct xe_gt *search;
> > diff --git a/drivers/gpu/drm/xe/xe_gt_types.h
> > b/drivers/gpu/drm/xe/xe_gt_types.h
> > index 8f29aba455e0..9580a40c0142 100644
> > --- a/drivers/gpu/drm/xe/xe_gt_types.h
> > +++ b/drivers/gpu/drm/xe/xe_gt_types.h
> > @@ -33,6 +33,43 @@ enum xe_gt_type {
> > typedef unsigned long xe_dss_mask_t[BITS_TO_LONGS(32 *
> > XE_MAX_DSS_FUSE_REGS)]; typedef unsigned long
> > xe_eu_mask_t[BITS_TO_LONGS(32 * XE_MAX_EU_FUSE_REGS)];
> >
> > +/* Count of GT Correctable and FATAL HW ERRORS */ enum
> > +intel_gt_hw_errors {
> > + INTEL_GT_HW_ERROR_COR_SUBSLICE = 0,
> > + INTEL_GT_HW_ERROR_COR_L3BANK,
> > + INTEL_GT_HW_ERROR_COR_L3_SNG,
> > + INTEL_GT_HW_ERROR_COR_GUC,
> > + INTEL_GT_HW_ERROR_COR_SAMPLER,
> > + INTEL_GT_HW_ERROR_COR_SLM,
> > + INTEL_GT_HW_ERROR_COR_EU_IC,
> > + INTEL_GT_HW_ERROR_COR_EU_GRF,
> > + INTEL_GT_HW_ERROR_FAT_SUBSLICE,
> > + INTEL_GT_HW_ERROR_FAT_L3BANK,
> > + INTEL_GT_HW_ERROR_FAT_ARR_BIST,
> > + INTEL_GT_HW_ERROR_FAT_FPU,
> > + INTEL_GT_HW_ERROR_FAT_L3_DOUB,
> > + INTEL_GT_HW_ERROR_FAT_L3_ECC_CHK,
> > + INTEL_GT_HW_ERROR_FAT_GUC,
> > + INTEL_GT_HW_ERROR_FAT_IDI_PAR,
> > + INTEL_GT_HW_ERROR_FAT_SQIDI,
> > + INTEL_GT_HW_ERROR_FAT_SAMPLER,
> > + INTEL_GT_HW_ERROR_FAT_SLM,
> > + INTEL_GT_HW_ERROR_FAT_EU_IC,
> > + INTEL_GT_HW_ERROR_FAT_EU_GRF,
> > + INTEL_GT_HW_ERROR_FAT_TLB,
> > + INTEL_GT_HW_ERROR_FAT_L3_FABRIC,
> > + INTEL_GT_HW_ERROR_COUNT
> > +};
> > +
> > +enum xe_gt_driver_errors {
> > + INTEL_GT_DRIVER_ERROR_INTERRUPT = 0,
> > + INTEL_GT_DRIVER_ERROR_COUNT
> > +};
> > +
> > +void xe_gt_log_driver_error(struct xe_gt *gt,
> > + const enum xe_gt_driver_errors error,
> > + const char *fmt, ...);
> > +
> > struct xe_mmio_range {
> > u32 start;
> > u32 end;
> > @@ -357,6 +394,12 @@ struct xe_gt {
> > * of a steered operation
> > */
> > spinlock_t mcr_lock;
> > +
> > + struct intel_hw_errors {
> > + unsigned long hw[INTEL_GT_HW_ERROR_COUNT];
> > + unsigned long driver[INTEL_GT_DRIVER_ERROR_COUNT];
> > + } errors;
> > +
> > };
> >
> > #endif
> > diff --git a/drivers/gpu/drm/xe/xe_irq.c b/drivers/gpu/drm/xe/xe_irq.c
> > index 6b922332bff1..4626f7280aaf 100644
> > --- a/drivers/gpu/drm/xe/xe_irq.c
> > +++ b/drivers/gpu/drm/xe/xe_irq.c
> > @@ -19,6 +19,7 @@
> > #include "xe_hw_engine.h"
> > #include "xe_mmio.h"
> >
> > +#define HAS_GT_ERROR_VECTORS(xe) ((xe)->info.has_gt_error_vectors)
> > static void gen3_assert_iir_is_zero(struct xe_gt *gt, i915_reg_t reg)
> > {
> > u32 val = xe_mmio_read32(gt, reg.reg); @@ -359,44 +360,281 @@
> > hardware_error_type_to_str(const enum hardware_error hw_err)
> > }
> > }
> >
> > +#define xe_gt_hw_err(gt, fmt, ...) \
> > + drm_err_ratelimited(&gt_to_xe(gt)->drm, HW_ERR "GT%d detected "
> fmt, \
> > + (gt)->info.id, ##__VA_ARGS__)
>
> As on the previous patch, it looks like we're printing error-level kernel
> messages for correctable errors (i.e., things the hardware caught and fixed
> internally like ECC). Generally those kinds of things shouldn't be putting
> errors in the kernel log because there's no actual problem from the end user
> perspective.
Agreed. Should we use warning level for correctable errors?
>
> > +
> > static void
> > -xe_gt_hw_error_handler(struct xe_gt *gt, const enum hardware_error
> > hw_err)
> > +xe_gt_correctable_hw_error_stats_update(struct xe_gt *gt, unsigned
> > +long errstat)
> > {
> > - const char *hw_err_str = hardware_error_type_to_str(hw_err);
> > - u32 other_errors = ~(EU_GRF_ERROR | EU_IC_ERROR);
> > - u32 errstat;
> > + u32 errbit, cnt;
> >
> > - lockdep_assert_held(&gt_to_xe(gt)->irq.lock);
> > + if (!errstat && HAS_GT_ERROR_VECTORS(gt_to_xe(gt)))
> > + return;
> >
> > - errstat = xe_mmio_read32(gt, ERR_STAT_GT_REG(hw_err).reg);
> > + for_each_set_bit(errbit, &errstat, GT_HW_ERROR_MAX_ERR_BITS) {
> > + if (gt->xe->info.platform == XE_PVC && !(REG_BIT(errbit) &
> PVC_COR_ERR_MASK)) {
> > + xe_gt_log_driver_error(gt,
> INTEL_GT_DRIVER_ERROR_INTERRUPT,
> > + "UNKNOWN CORRECTABLE
> error\n");
> > + continue;
> > + }
> >
> > - if (unlikely(!errstat)) {
> > - DRM_ERROR("ERR_STAT_GT_REG_%s blank!\n",
> hw_err_str);
> > - return;
> > + switch (errbit) {
> > + case L3_SNG_COR_ERR:
> > + gt->errors.hw[INTEL_GT_HW_ERROR_COR_L3_SNG]++;
> > + xe_gt_hw_err(gt, "L3 SINGLE CORRECTABLE error\n");
> > + break;
> > + case GUC_COR_ERR:
> > + gt->errors.hw[INTEL_GT_HW_ERROR_COR_GUC]++;
> > + xe_gt_hw_err(gt, "SINGLE BIT GUC SRAM CORRECTABLE
> error\n");
> > + break;
> > + case SAMPLER_COR_ERR:
> > + gt->errors.hw[INTEL_GT_HW_ERROR_COR_SAMPLER]++;
> > + xe_gt_hw_err(gt, "SINGLE BIT SAMPLER CORRECTABLE
> error\n");
> > + break;
> > + case SLM_COR_ERR:
> > + cnt = xe_mmio_read32(gt,
> SLM_ECC_ERROR_CNTR(HARDWARE_ERROR_CORRECTABLE).reg);
> > + gt->errors.hw[INTEL_GT_HW_ERROR_COR_SLM] = cnt;
> > + xe_gt_hw_err(gt, "%u SINGLE BIT SLM CORRECTABLE
> error\n", cnt);
> > + break;
> > + case EU_IC_COR_ERR:
> > + gt->errors.hw[INTEL_GT_HW_ERROR_COR_EU_IC]++;
> > + xe_gt_hw_err(gt, "SINGLE BIT EU IC CORRECTABLE error\n");
> > + break;
> > + case EU_GRF_COR_ERR:
> > + gt->errors.hw[INTEL_GT_HW_ERROR_COR_EU_GRF]++;
> > + xe_gt_hw_err(gt, "SINGLE BIT EU GRF CORRECTABLE
> error\n");
> > + break;
> > + default:
> > + xe_gt_log_driver_error(gt,
> INTEL_GT_DRIVER_ERROR_INTERRUPT, "UNKNOWN CORRECTABLE error\n");
> > + break;
> > + }
> > }
> > +}
> >
> > - /*
> > - * TODO: The GT Non Fatal Error Status Register
> > - * only has reserved bitfields defined.
> > - * Remove once there is something to service.
> > - */
> > - if (hw_err == HARDWARE_ERROR_NONFATAL) {
> > - DRM_ERROR("detected Non-Fatal error\n");
> > - xe_mmio_write32(gt, ERR_STAT_GT_REG(hw_err).reg,
> errstat);
> > +static void xe_gt_fatal_hw_error_stats_update(struct xe_gt *gt,
> > +unsigned long errstat) {
> > + u32 errbit, cnt;
> > +
> > + if (!errstat && HAS_GT_ERROR_VECTORS(gt_to_xe(gt)))
> > return;
> > +
> > + for_each_set_bit(errbit, &errstat, GT_HW_ERROR_MAX_ERR_BITS) {
> > + if (gt->xe->info.platform == XE_PVC && !(REG_BIT(errbit) &
> PVC_FAT_ERR_MASK)) {
> > + xe_gt_log_driver_error(gt,
> INTEL_GT_DRIVER_ERROR_INTERRUPT,
> > + "UNKNOWN FATAL error\n");
> > + continue;
> > + }
> > +
> > + switch (errbit) {
> > + case ARRAY_BIST_FAT_ERR:
> > + gt->errors.hw[INTEL_GT_HW_ERROR_FAT_ARR_BIST]++;
> > + xe_gt_hw_err(gt, "Array BIST FATAL error\n");
> > + break;
> > + case FPU_UNCORR_FAT_ERR:
> > + gt->errors.hw[INTEL_GT_HW_ERROR_FAT_FPU]++;
> > + xe_gt_hw_err(gt, "FPU FATAL error\n");
> > + break;
> > + case L3_DOUBLE_FAT_ERR:
> > + gt->errors.hw[INTEL_GT_HW_ERROR_FAT_L3_DOUB]++;
> > + xe_gt_hw_err(gt, "L3 Double FATAL error\n");
> > + break;
> > + case L3_ECC_CHK_FAT_ERR:
> > + gt->errors.hw[INTEL_GT_HW_ERROR_FAT_L3_ECC_CHK]++;
> > + xe_gt_hw_err(gt, "L3 ECC Checker FATAL error\n");
> > + break;
> > + case GUC_FAT_ERR:
> > + gt->errors.hw[INTEL_GT_HW_ERROR_FAT_GUC]++;
> > + xe_gt_hw_err(gt, "GUC SRAM FATAL error\n");
> > + break;
> > + case IDI_PAR_FAT_ERR:
> > + gt->errors.hw[INTEL_GT_HW_ERROR_FAT_IDI_PAR]++;
> > + xe_gt_hw_err(gt, "IDI PARITY FATAL error\n");
> > + break;
> > + case SQIDI_FAT_ERR:
> > + gt->errors.hw[INTEL_GT_HW_ERROR_FAT_SQIDI]++;
> > + xe_gt_hw_err(gt, "SQIDI FATAL error\n");
> > + break;
> > + case SAMPLER_FAT_ERR:
> > + gt->errors.hw[INTEL_GT_HW_ERROR_FAT_SAMPLER]++;
> > + xe_gt_hw_err(gt, "SAMPLER FATAL error\n");
> > + break;
> > + case SLM_FAT_ERR:
> > + cnt = xe_mmio_read32(gt,
> SLM_ECC_ERROR_CNTR(HARDWARE_ERROR_FATAL).reg);
> > + gt->errors.hw[INTEL_GT_HW_ERROR_FAT_SLM] = cnt;
> > + xe_gt_hw_err(gt, "%u SLM FATAL error\n", cnt);
> > + break;
> > + case EU_IC_FAT_ERR:
> > + gt->errors.hw[INTEL_GT_HW_ERROR_FAT_EU_IC]++;
> > + xe_gt_hw_err(gt, "EU IC FATAL error\n");
> > + break;
> > + case EU_GRF_FAT_ERR:
> > + gt->errors.hw[INTEL_GT_HW_ERROR_FAT_EU_GRF]++;
> > + xe_gt_hw_err(gt, "EU GRF FATAL error\n");
> > + break;
> > + default:
> > + xe_gt_log_driver_error(gt,
> INTEL_GT_DRIVER_ERROR_INTERRUPT,
> > + "UNKNOWN FATAL error\n");
> > + break;
> > + }
> > }
> > +}
> >
> > - /*
> > - * TODO: The remaining GT errors don't have a
> > - * need for targeted logging at the moment. We
> > - * still want to log detection of these errors, but
> > - * let's aggregate them until someone has a need for them.
> > - */
> > - if (errstat & other_errors)
> > - DRM_ERROR("detected hardware error(s) in
> ERR_STAT_GT_REG_%s: 0x%08x\n",
> > - hw_err_str, errstat & other_errors);
> > +static void
> > +xe_gt_hw_error_handler(struct xe_gt *gt, const enum hardware_error
> > +hw_err) {
> > + const char *hw_err_str = hardware_error_type_to_str(hw_err);
> > + unsigned long errstat;
> > +
> > + lockdep_assert_held(&gt_to_xe(gt)->irq.lock);
> >
> > - xe_mmio_write32(gt, ERR_STAT_GT_REG(hw_err).reg, errstat);
> > + if (!HAS_GT_ERROR_VECTORS(gt_to_xe(gt))) {
> > + errstat = xe_mmio_read32(gt,
> ERR_STAT_GT_REG(hw_err).reg);
> > + if (unlikely(!errstat)) {
> > + xe_gt_log_driver_error(gt,
> INTEL_GT_DRIVER_ERROR_INTERRUPT,
> > + "ERR_STAT_GT_REG_%s
> blank!\n", hw_err_str);
> > + return;
> > + }
> > + }
> > +
> > + switch (hw_err) {
> > + case HARDWARE_ERROR_CORRECTABLE:
> > + if (HAS_GT_ERROR_VECTORS(gt_to_xe(gt))) {
> > + bool error = false;
> > + int i;
> > +
> > + errstat = 0;
> > + for (i = 0; i < ERR_STAT_GT_COR_VCTR_LEN; i++) {
> > + u32 err_type =
> ERR_STAT_GT_COR_VCTR_LEN;
> > + unsigned long vctr;
> > + const char *name;
> > +
> > + vctr = xe_mmio_read32(gt,
> ERR_STAT_GT_COR_VCTR_REG(i).reg);
> > + if (!vctr)
> > + continue;
> > +
> > + switch (i) {
> > + case ERR_STAT_GT_VCTR0:
> > + case ERR_STAT_GT_VCTR1:
> > + err_type =
> INTEL_GT_HW_ERROR_COR_SUBSLICE;
> > + gt->errors.hw[err_type] +=
> hweight32(vctr);
> > + name = "SUBSLICE";
> > +
> > + /* Avoid second read/write to error
> status register*/
> > + if (errstat)
> > + break;
> > +
> > + errstat = xe_mmio_read32(gt,
> ERR_STAT_GT_REG(hw_err).reg);
> > + xe_gt_hw_err(gt,
> "ERR_STAT_GT_CORRECTABLE:0x%08lx\n",
> > + errstat);
> > +
> xe_gt_correctable_hw_error_stats_update(gt, errstat);
> > + if (errstat)
> > + xe_mmio_write32(gt,
> ERR_STAT_GT_REG(hw_err).reg,
> > + errstat);
> > + break;
> > +
> > + case ERR_STAT_GT_VCTR2:
> > + case ERR_STAT_GT_VCTR3:
> > + err_type =
> INTEL_GT_HW_ERROR_COR_L3BANK;
> > + gt->errors.hw[err_type] +=
> hweight32(vctr);
> > + name = "L3 BANK";
> > + break;
> > + default:
> > + name = "UNKNOWN";
> > + break;
> > + }
> > + xe_mmio_write32(gt,
> ERR_STAT_GT_COR_VCTR_REG(i).reg, vctr);
> > + xe_gt_hw_err(gt, "%s CORRECTABLE error,
> ERR_VECT_GT_CORRECTABLE_%d:0x%08lx\n",
> > + name, i, vctr);
> > + error = true;
> > + }
> > +
> > + if (!error)
> > + xe_gt_hw_err(gt, "UNKNOWN CORRECTABLE
> error\n");
> > + } else {
> > + xe_gt_correctable_hw_error_stats_update(gt,
> errstat);
> > + xe_gt_hw_err(gt,
> "ERR_STAT_GT_CORRECTABLE:0x%08lx\n", errstat);
> > + }
> > + break;
> > + case HARDWARE_ERROR_NONFATAL:
> > + /*
> > + * TODO: The GT Non Fatal Error Status Register
> > + * only has reserved bitfields defined.
> > + * Remove once there is something to service.
> > + */
> > + drm_err_ratelimited(&gt_to_xe(gt)->drm, HW_ERR "detected
> Non-Fatal error\n");
> > + break;
> > + case HARDWARE_ERROR_FATAL:
> > + if (HAS_GT_ERROR_VECTORS(gt_to_xe(gt))) {
> > + bool error = false;
> > + int i;
> > +
> > + errstat = 0;
> > + for (i = 0; i < ERR_STAT_GT_FATAL_VCTR_LEN; i++) {
> > + u32 err_type =
> ERR_STAT_GT_FATAL_VCTR_LEN;
> > + unsigned long vctr;
> > + const char *name;
> > +
> > + vctr = xe_mmio_read32(gt,
> ERR_STAT_GT_FATAL_VCTR_REG(i).reg);
> > + if (!vctr)
> > + continue;
> > +
> > + /* i represents the vector register index */
> > + switch (i) {
> > + case ERR_STAT_GT_VCTR0:
> > + case ERR_STAT_GT_VCTR1:
> > + err_type =
> INTEL_GT_HW_ERROR_FAT_SUBSLICE;
> > + gt->errors.hw[err_type] +=
> hweight32(vctr);
> > + name = "SUBSLICE";
> > +
> > + /*Avoid second read/write to error
> status register.*/
> > + if (errstat)
> > + break;
> > +
> > + errstat = xe_mmio_read32(gt,
> ERR_STAT_GT_REG(hw_err).reg);
> > + xe_gt_hw_err(gt,
> "ERR_STAT_GT_FATAL:0x%08lx\n", errstat);
> > +
> xe_gt_fatal_hw_error_stats_update(gt, errstat);
> > + if (errstat)
> > + xe_mmio_write32(gt,
> ERR_STAT_GT_REG(hw_err).reg,
> > + errstat);
> > + break;
> > +
> > + case ERR_STAT_GT_VCTR2:
> > + case ERR_STAT_GT_VCTR3:
> > + err_type =
> INTEL_GT_HW_ERROR_FAT_L3BANK;
> > + gt->errors.hw[err_type] +=
> hweight32(vctr);
> > + name = "L3 BANK";
> > + break;
> > + case ERR_STAT_GT_VCTR6:
> > + gt-
> >errors.hw[INTEL_GT_HW_ERROR_FAT_TLB] += hweight16(vctr);
> > + name = "TLB";
> > + break;
> > + case ERR_STAT_GT_VCTR7:
> > + gt-
> >errors.hw[INTEL_GT_HW_ERROR_FAT_L3_FABRIC] += hweight8(vctr);
> > + name = "L3 FABRIC";
> > + break;
> > + default:
> > + name = "UNKNOWN";
> > + break;
> > + }
> > + xe_mmio_write32(gt,
> ERR_STAT_GT_FATAL_VCTR_REG(i).reg, vctr);
> > + xe_gt_hw_err(gt, "%s FATAL error,
> ERR_VECT_GT_FATAL_%d:0x%08lx\n",
> > + name, i, vctr);
> > + error = true;
> > + }
> > + if (!error)
> > + xe_gt_hw_err(gt, "UNKNOWN FATAL
> error\n");
> > + } else {
> > + xe_gt_fatal_hw_error_stats_update(gt, errstat);
> > + xe_gt_hw_err(gt, "ERR_STAT_GT_FATAL:0x%08lx\n",
> errstat);
> > + }
> > + break;
> > + default:
> > + break;
> > + }
> > +
> > + if (!HAS_GT_ERROR_VECTORS(gt_to_xe(gt)))
> > + xe_mmio_write32(gt, ERR_STAT_GT_REG(hw_err).reg,
> errstat);
> > }
> >
> > static void
> > @@ -409,7 +647,8 @@ xe_hw_error_source_handler(struct xe_gt *gt,
> const enum hardware_error hw_err)
> > spin_lock_irqsave(&gt_to_xe(gt)->irq.lock, flags);
> > errsrc = xe_mmio_read32(gt, DEV_ERR_STAT_REG(hw_err).reg);
> > if (unlikely(!errsrc)) {
> > - DRM_ERROR("DEV_ERR_STAT_REG_%s blank!\n",
> hw_err_str);
> > + xe_gt_log_driver_error(gt,
> INTEL_GT_DRIVER_ERROR_INTERRUPT,
> > + "DEV_ERR_STAT_REG_%s blank!\n",
> hw_err_str);
> > goto out_unlock;
> > }
> >
> > @@ -417,8 +656,9 @@ xe_hw_error_source_handler(struct xe_gt *gt,
> const enum hardware_error hw_err)
> > xe_gt_hw_error_handler(gt, hw_err);
> >
> > if (errsrc & ~DEV_ERR_STAT_GT_ERROR)
> > - DRM_ERROR("non-GT hardware error(s) in
> DEV_ERR_STAT_REG_%s: 0x%08x\n",
> > - hw_err_str, errsrc & ~DEV_ERR_STAT_GT_ERROR);
> > + xe_gt_log_driver_error(gt,
> INTEL_GT_DRIVER_ERROR_INTERRUPT,
> > + "non-GT hardware error(s) in
> DEV_ERR_STAT_REG_%s: 0x%08x\n",
> > + hw_err_str, errsrc &
> ~DEV_ERR_STAT_GT_ERROR);
> >
> > xe_mmio_write32(gt, DEV_ERR_STAT_REG(hw_err).reg, errsrc);
> >
> > @@ -634,12 +874,44 @@ static void irq_uninstall(struct drm_device *drm,
> void *arg)
> > pci_disable_msi(pdev);
> > }
> >
> > +/**
> > + * process_hw_errors - checks for the occurrence of HW errors
> > + *
> > + * This checks for the HW Errors including FATAL error that might
> > + * have occurred in the previous boot of the driver which will
> > + * initiate PCIe FLR reset of the device and cause the
> > + * driver to reload.
>
> Is this saying that there's already been a PCIe FLR and you're trying to read
> the registers after that reset has happened? The bspec indicates that these
> registers have 'DEV' style reset, so they wouldn't be able to preserve their
> values across a reset.
Registers preserve the value across reset.
BSPEC: 50875
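As a rough illustration of the read-then-clear flow being discussed, here is a minimal userspace sketch. It assumes the status register is sticky and write-1-to-clear, which is how the patch treats it when it writes back the value it read. The names fake_status_reg, reg_read(), reg_write_w1c() and drain_status() are hypothetical and only model the behaviour; they are not xe driver APIs.

```c
#include <stdint.h>

/* Hypothetical stand-in for a sticky error status register. */
struct fake_status_reg {
	uint32_t value;
};

static uint32_t reg_read(const struct fake_status_reg *reg)
{
	return reg->value;
}

/* Assumed write-1-to-clear semantics: each set bit in v clears that bit. */
static void reg_write_w1c(struct fake_status_reg *reg, uint32_t v)
{
	reg->value &= ~v;
}

/*
 * Read whatever is latched (possibly from before a reset), count the
 * set bits, then clear exactly the bits that were observed.
 * Returns the number of error bits that were pending.
 */
static unsigned int drain_status(struct fake_status_reg *reg)
{
	uint32_t status = reg_read(reg);
	unsigned int pending = 0;
	uint32_t bits;

	if (!status)
		return 0;

	/* Clear the lowest set bit on each pass to count set bits. */
	for (bits = status; bits; bits &= bits - 1)
		pending++;

	reg_write_w1c(reg, status);
	return pending;
}
```

Writing back only the bits that were read means an error latched between the read and the write is not lost; it stays set for the next pass.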
>
> > + */
> > +static void process_hw_errors(struct xe_device *xe) {
> > + struct xe_gt *gt0 = xe_device_get_gt(xe, 0);
> > + u32 dev_pcieerr_status, master_ctl;
> > + struct xe_gt *gt;
> > + int i;
> > +
> > + dev_pcieerr_status = xe_mmio_read32(gt0,
> DEV_PCIEERR_STATUS.reg);
> > +
> > + for_each_gt(gt, xe, i) {
> > + if (dev_pcieerr_status & DEV_PCIEERR_IS_FATAL(i))
> > + xe_hw_error_source_handler(gt,
> HARDWARE_ERROR_FATAL);
> > +
> > + master_ctl = xe_mmio_read32(gt,
> GEN11_GFX_MSTR_IRQ.reg);
> > + xe_mmio_write32(gt, GEN11_GFX_MSTR_IRQ.reg,
> master_ctl);
> > + xe_hw_error_irq_handler(gt, master_ctl);
> > + }
> > + if (dev_pcieerr_status)
> > + xe_mmio_write32(gt, DEV_PCIEERR_STATUS.reg,
> dev_pcieerr_status); }
> > +
> > int xe_irq_install(struct xe_device *xe) {
> > int irq = to_pci_dev(xe->drm.dev)->irq;
> > irq_handler_t irq_handler;
> > int err;
> >
> > + if (IS_DGFX(xe))
> > + process_hw_errors(xe);
>
> Why is this conditional on DGFX? From what I can see this also applies to
> integrated platforms like MTL too.
From a RAS perspective, error reporting is required on DGFX only.
Sysman doesn't use these counters on IGFX.
>
>
> Matt
>
> > +
> > irq_handler = xe_irq_handler(xe);
> > if (!irq_handler) {
> > drm_err(&xe->drm, "No supported interrupt handler"); diff --
> git
> > a/drivers/gpu/drm/xe/xe_pci.c b/drivers/gpu/drm/xe/xe_pci.c index
> > 1844cff8fba8..69098194cef8 100644
> > --- a/drivers/gpu/drm/xe/xe_pci.c
> > +++ b/drivers/gpu/drm/xe/xe_pci.c
> > @@ -73,6 +73,7 @@ struct xe_device_desc {
> > bool has_range_tlb_invalidation;
> > bool has_asid;
> > bool has_link_copy_engine;
> > + bool has_gt_error_vectors;
> > };
> >
> > __diag_push();
> > @@ -232,6 +233,7 @@ static const struct xe_device_desc pvc_desc = {
> > .supports_usm = true,
> > .has_asid = true,
> > .has_link_copy_engine = true,
> > + .has_gt_error_vectors = true,
> > };
> >
> > #define MTL_MEDIA_ENGINES \
> > @@ -418,6 +420,7 @@ static int xe_pci_probe(struct pci_dev *pdev, const
> struct pci_device_id *ent)
> > xe->info.vm_max_level = desc->vm_max_level;
> > xe->info.supports_usm = desc->supports_usm;
> > xe->info.has_asid = desc->has_asid;
> > + xe->info.has_gt_error_vectors = desc->has_gt_error_vectors;
> > xe->info.has_flat_ccs = desc->has_flat_ccs;
> > xe->info.has_4tile = desc->has_4tile;
> > xe->info.has_range_tlb_invalidation =
> > desc->has_range_tlb_invalidation;
> > --
> > 2.25.1
> >
>
> --
> Matt Roper
> Graphics Software Engineer
> Linux GPU Platform Enablement
> Intel Corporation
* Re: [Intel-xe] [PATCH 2/4] drm/xe/ras: Log the GT hw errors.
2023-04-28 8:00 ` Ghimiray, Himal Prasad
@ 2023-05-04 0:02 ` Matt Roper
2023-05-05 7:24 ` Iddamsetty, Aravind
0 siblings, 1 reply; 22+ messages in thread
From: Matt Roper @ 2023-05-04 0:02 UTC (permalink / raw)
To: Ghimiray, Himal Prasad; +Cc: intel-xe@lists.freedesktop.org
On Fri, Apr 28, 2023 at 01:00:04AM -0700, Ghimiray, Himal Prasad wrote:
...
> > All of this new infrastructure seems pretty questionable at the moment.
> > We're doing extra work to count up errors, but then never doing anything
> > with the counts. You mention in the cover letter that these will be exposed
> > to userspace eventually, but what's the benefit of that? Which userspace
> > component is going to actually use this information? What do you expect
> > userspace to do if it finds out there's been a fatal or correctable error in
> > some low-level hardware unit? Generally userspace shouldn't even need to
> > care about the really low-level hardware details; if something has truly gone
> > fatally wrong, it's game over for userspace and it probably doesn't matter
> > exactly where in the hardware things are actually busted.
> >
> > Without some extra justification from the userspace point of view, it feels
> > like we're just adding a bunch of code that doesn't have a real-world
> > purpose.
> The error counters exposed by KMD will be used by sysman
> They will be categorized to specific category of error in sysman:
> https://spec.oneapi.io/level-zero/latest/sysman/api.html#ras
L0 sysman looks sort of like a complicated libdrm replacement. I.e.,
it's just a wrapper library over various uapi interfaces, but isn't
really a "consumer" of that uapi itself. We need to understand the
whole top-to-bottom stack to make sure that whatever interfaces and
representation are selected (both at the Xe and Sysman levels) actually
make sense and satisfy the full stack's needs.
Is the actual reporting of these errors going to be done via the
standard Linux RAS/EDAC interfaces? I.e., what's documented at
https://www.kernel.org/doc/html/v6.3/admin-guide/ras.html ? If so, then
there's already a bunch of real userspace tools that work with that, so
that would probably help justify the work here, and make it clear that
we're not just reinventing the wheel. If we're not tying into that,
then we probably need to justify clearly why it can't be used.
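For reference, the EDAC memory-controller counters described in that document are plain text files under /sys/devices/system/edac/mc/ (for example mc0/ce_count and mc0/ue_count). A minimal sketch of how a userspace consumer might read one such counter follows; read_error_count() is a hypothetical helper, not an existing library call, and nothing here implies that Xe exposes such a file today.

```c
#include <stdio.h>

/*
 * Read a single unsigned counter from a sysfs-style text file.
 * Returns 0 on success, -1 if the file cannot be opened or does not
 * start with a decimal number.
 */
static int read_error_count(const char *path, unsigned long *count)
{
	FILE *f = fopen(path, "r");
	int parsed;

	if (!f)
		return -1;

	parsed = fscanf(f, "%lu", count);
	fclose(f);

	return parsed == 1 ? 0 : -1;
}
```

A tool would call this with a path such as /sys/devices/system/edac/mc/mc0/ce_count and compare successive readings to detect new correctable errors.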
> >
> > > + const enum xe_gt_driver_errors error,
> > > + const char *fmt, ...)
> > > +{
> > > + struct va_format vaf;
> > > + va_list args;
> > > +
> > > + va_start(args, fmt);
> > > + vaf.fmt = fmt;
> > > + vaf.va = &args;
> > > +
> > > + BUILD_BUG_ON(ARRAY_SIZE(xe_gt_driver_errors_to_str) !=
> > > + INTEL_GT_DRIVER_ERROR_COUNT);
> > > +
> > > + WARN_ON_ONCE(error >= INTEL_GT_DRIVER_ERROR_COUNT);
> > > +
> > > + gt->errors.driver[error]++;
> > > +
> > > + drm_err_ratelimited(&gt_to_xe(gt)->drm, "GT%u [%s] %pV",
> > > + gt->info.id,
> > > + xe_gt_driver_errors_to_str[error],
> > > + &vaf);
> > > + va_end(args);
> > > +}
> > > +
> > > struct xe_gt *xe_find_full_gt(struct xe_gt *gt) {
> > > struct xe_gt *search;
> > > diff --git a/drivers/gpu/drm/xe/xe_gt_types.h
> > > b/drivers/gpu/drm/xe/xe_gt_types.h
> > > index 8f29aba455e0..9580a40c0142 100644
> > > --- a/drivers/gpu/drm/xe/xe_gt_types.h
> > > +++ b/drivers/gpu/drm/xe/xe_gt_types.h
> > > @@ -33,6 +33,43 @@ enum xe_gt_type {
> > > typedef unsigned long xe_dss_mask_t[BITS_TO_LONGS(32 *
> > > XE_MAX_DSS_FUSE_REGS)]; typedef unsigned long
> > > xe_eu_mask_t[BITS_TO_LONGS(32 * XE_MAX_EU_FUSE_REGS)];
> > >
> > > +/* Count of GT Correctable and FATAL HW ERRORS */ enum
> > > +intel_gt_hw_errors {
> > > + INTEL_GT_HW_ERROR_COR_SUBSLICE = 0,
> > > + INTEL_GT_HW_ERROR_COR_L3BANK,
> > > + INTEL_GT_HW_ERROR_COR_L3_SNG,
> > > + INTEL_GT_HW_ERROR_COR_GUC,
> > > + INTEL_GT_HW_ERROR_COR_SAMPLER,
> > > + INTEL_GT_HW_ERROR_COR_SLM,
> > > + INTEL_GT_HW_ERROR_COR_EU_IC,
> > > + INTEL_GT_HW_ERROR_COR_EU_GRF,
> > > + INTEL_GT_HW_ERROR_FAT_SUBSLICE,
> > > + INTEL_GT_HW_ERROR_FAT_L3BANK,
> > > + INTEL_GT_HW_ERROR_FAT_ARR_BIST,
> > > + INTEL_GT_HW_ERROR_FAT_FPU,
> > > + INTEL_GT_HW_ERROR_FAT_L3_DOUB,
> > > + INTEL_GT_HW_ERROR_FAT_L3_ECC_CHK,
> > > + INTEL_GT_HW_ERROR_FAT_GUC,
> > > + INTEL_GT_HW_ERROR_FAT_IDI_PAR,
> > > + INTEL_GT_HW_ERROR_FAT_SQIDI,
> > > + INTEL_GT_HW_ERROR_FAT_SAMPLER,
> > > + INTEL_GT_HW_ERROR_FAT_SLM,
> > > + INTEL_GT_HW_ERROR_FAT_EU_IC,
> > > + INTEL_GT_HW_ERROR_FAT_EU_GRF,
> > > + INTEL_GT_HW_ERROR_FAT_TLB,
> > > + INTEL_GT_HW_ERROR_FAT_L3_FABRIC,
> > > + INTEL_GT_HW_ERROR_COUNT
> > > +};
> > > +
> > > +enum xe_gt_driver_errors {
> > > + INTEL_GT_DRIVER_ERROR_INTERRUPT = 0,
> > > + INTEL_GT_DRIVER_ERROR_COUNT
> > > +};
> > > +
> > > +void xe_gt_log_driver_error(struct xe_gt *gt,
> > > + const enum xe_gt_driver_errors error,
> > > + const char *fmt, ...);
> > > +
> > > struct xe_mmio_range {
> > > u32 start;
> > > u32 end;
> > > @@ -357,6 +394,12 @@ struct xe_gt {
> > > * of a steered operation
> > > */
> > > spinlock_t mcr_lock;
> > > +
> > > + struct intel_hw_errors {
> > > + unsigned long hw[INTEL_GT_HW_ERROR_COUNT];
> > > + unsigned long driver[INTEL_GT_DRIVER_ERROR_COUNT];
> > > + } errors;
> > > +
> > > };
> > >
> > > #endif
> > > diff --git a/drivers/gpu/drm/xe/xe_irq.c b/drivers/gpu/drm/xe/xe_irq.c
> > > index 6b922332bff1..4626f7280aaf 100644
> > > --- a/drivers/gpu/drm/xe/xe_irq.c
> > > +++ b/drivers/gpu/drm/xe/xe_irq.c
> > > @@ -19,6 +19,7 @@
> > > #include "xe_hw_engine.h"
> > > #include "xe_mmio.h"
> > >
> > > +#define HAS_GT_ERROR_VECTORS(xe) ((xe)->info.has_gt_error_vectors)
> > > static void gen3_assert_iir_is_zero(struct xe_gt *gt, i915_reg_t reg)
> > > {
> > > u32 val = xe_mmio_read32(gt, reg.reg); @@ -359,44 +360,281 @@
> > > hardware_error_type_to_str(const enum hardware_error hw_err)
> > > }
> > > }
> > >
> > > +#define xe_gt_hw_err(gt, fmt, ...) \
> > > + drm_err_ratelimited(&gt_to_xe(gt)->drm, HW_ERR "GT%d detected "
> > fmt, \
> > > + (gt)->info.id, ##__VA_ARGS__)
> >
> > As on the previous patch, it looks like we're printing error-level kernel
> > messages for correctable errors (i.e., things the hardware caught and fixed
> > internally like ECC). Generally those kind of things shouldn't be putting
> > errors in the kernel log because there's no actual problem from the end user
> > perspective.
> Agreed. Should we use warning for correctable errors ?
Presumably we shouldn't be reporting them in dmesg at all by default.
It looks like the EDAC subsystem has its own ways of controlling if/when
stuff shows up in dmesg.
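To make the idea concrete, here is a small sketch of picking a log level from the error severity, so that correctable errors stay out of dmesg by default. The severity names come from the patch, but their ordering, the sketch_log_level enum and hw_error_log_level() are assumptions for illustration only; the mapping of non-fatal to a warning is likewise just one possible choice.

```c
/* Severity names as used in the patch; the ordering here is assumed. */
enum hardware_error {
	HARDWARE_ERROR_CORRECTABLE,
	HARDWARE_ERROR_NONFATAL,
	HARDWARE_ERROR_FATAL,
};

/* Hypothetical log levels, lowest priority first. */
enum sketch_log_level {
	SKETCH_LOG_DEBUG,	/* not shown in dmesg by default */
	SKETCH_LOG_WARN,
	SKETCH_LOG_ERR,
};

/*
 * Map a hardware error severity to a log level: errors the hardware
 * already fixed are only worth a debug message, anything worse gets
 * progressively louder.
 */
static enum sketch_log_level hw_error_log_level(enum hardware_error hw_err)
{
	switch (hw_err) {
	case HARDWARE_ERROR_CORRECTABLE:
		return SKETCH_LOG_DEBUG;
	case HARDWARE_ERROR_NONFATAL:
		return SKETCH_LOG_WARN;
	case HARDWARE_ERROR_FATAL:
	default:
		return SKETCH_LOG_ERR;
	}
}
```

The counters would still be updated for every severity; only the logging becomes quieter.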
> >
> > > +
> > > static void
> > > -xe_gt_hw_error_handler(struct xe_gt *gt, const enum hardware_error
> > > hw_err)
> > > +xe_gt_correctable_hw_error_stats_update(struct xe_gt *gt, unsigned
> > > +long errstat)
> > > {
> > > - const char *hw_err_str = hardware_error_type_to_str(hw_err);
> > > - u32 other_errors = ~(EU_GRF_ERROR | EU_IC_ERROR);
> > > - u32 errstat;
> > > + u32 errbit, cnt;
> > >
> > > - lockdep_assert_held(&gt_to_xe(gt)->irq.lock);
> > > + if (!errstat && HAS_GT_ERROR_VECTORS(gt_to_xe(gt)))
> > > + return;
> > >
> > > - errstat = xe_mmio_read32(gt, ERR_STAT_GT_REG(hw_err).reg);
> > > + for_each_set_bit(errbit, &errstat, GT_HW_ERROR_MAX_ERR_BITS) {
> > > + if (gt->xe->info.platform == XE_PVC && !(REG_BIT(errbit) &
> > PVC_COR_ERR_MASK)) {
> > > + xe_gt_log_driver_error(gt,
> > INTEL_GT_DRIVER_ERROR_INTERRUPT,
> > > + "UNKNOWN CORRECTABLE
> > error\n");
> > > + continue;
> > > + }
> > >
> > > - if (unlikely(!errstat)) {
> > > - DRM_ERROR("ERR_STAT_GT_REG_%s blank!\n",
> > hw_err_str);
> > > - return;
> > > + switch (errbit) {
> > > + case L3_SNG_COR_ERR:
> > > + gt->errors.hw[INTEL_GT_HW_ERROR_COR_L3_SNG]++;
> > > + xe_gt_hw_err(gt, "L3 SINGLE CORRECTABLE error\n");
> > > + break;
> > > + case GUC_COR_ERR:
> > > + gt->errors.hw[INTEL_GT_HW_ERROR_COR_GUC]++;
> > > + xe_gt_hw_err(gt, "SINGLE BIT GUC SRAM CORRECTABLE
> > error\n");
> > > + break;
> > > + case SAMPLER_COR_ERR:
> > > + gt->errors.hw[INTEL_GT_HW_ERROR_COR_SAMPLER]++;
> > > + xe_gt_hw_err(gt, "SINGLE BIT SAMPLER CORRECTABLE
> > error\n");
> > > + break;
> > > + case SLM_COR_ERR:
> > > + cnt = xe_mmio_read32(gt,
> > SLM_ECC_ERROR_CNTR(HARDWARE_ERROR_CORRECTABLE).reg);
> > > + gt->errors.hw[INTEL_GT_HW_ERROR_COR_SLM] = cnt;
> > > + xe_gt_hw_err(gt, "%u SINGLE BIT SLM CORRECTABLE
> > error\n", cnt);
> > > + break;
> > > + case EU_IC_COR_ERR:
> > > + gt->errors.hw[INTEL_GT_HW_ERROR_COR_EU_IC]++;
> > > + xe_gt_hw_err(gt, "SINGLE BIT EU IC CORRECTABLE error\n");
> > > + break;
> > > + case EU_GRF_COR_ERR:
> > > + gt->errors.hw[INTEL_GT_HW_ERROR_COR_EU_GRF]++;
> > > + xe_gt_hw_err(gt, "SINGLE BIT EU GRF CORRECTABLE
> > error\n");
> > > + break;
> > > + default:
> > > + xe_gt_log_driver_error(gt,
> > INTEL_GT_DRIVER_ERROR_INTERRUPT, "UNKNOWN CORRECTABLE error\n");
> > > + break;
> > > + }
> > > }
> > > +}
> > >
> > > - /*
> > > - * TODO: The GT Non Fatal Error Status Register
> > > - * only has reserved bitfields defined.
> > > - * Remove once there is something to service.
> > > - */
> > > - if (hw_err == HARDWARE_ERROR_NONFATAL) {
> > > - DRM_ERROR("detected Non-Fatal error\n");
> > > - xe_mmio_write32(gt, ERR_STAT_GT_REG(hw_err).reg,
> > errstat);
> > > +static void xe_gt_fatal_hw_error_stats_update(struct xe_gt *gt,
> > > +unsigned long errstat) {
> > > + u32 errbit, cnt;
> > > +
> > > + if (!errstat && HAS_GT_ERROR_VECTORS(gt_to_xe(gt)))
> > > return;
> > > +
> > > + for_each_set_bit(errbit, &errstat, GT_HW_ERROR_MAX_ERR_BITS) {
> > > + if (gt->xe->info.platform == XE_PVC && !(REG_BIT(errbit) &
> > PVC_FAT_ERR_MASK)) {
> > > + xe_gt_log_driver_error(gt,
> > INTEL_GT_DRIVER_ERROR_INTERRUPT,
> > > + "UNKNOWN FATAL error\n");
> > > + continue;
> > > + }
> > > +
> > > + switch (errbit) {
> > > + case ARRAY_BIST_FAT_ERR:
> > > + gt->errors.hw[INTEL_GT_HW_ERROR_FAT_ARR_BIST]++;
> > > + xe_gt_hw_err(gt, "Array BIST FATAL error\n");
> > > + break;
> > > + case FPU_UNCORR_FAT_ERR:
> > > + gt->errors.hw[INTEL_GT_HW_ERROR_FAT_FPU]++;
> > > + xe_gt_hw_err(gt, "FPU FATAL error\n");
> > > + break;
> > > + case L3_DOUBLE_FAT_ERR:
> > > + gt->errors.hw[INTEL_GT_HW_ERROR_FAT_L3_DOUB]++;
> > > + xe_gt_hw_err(gt, "L3 Double FATAL error\n");
> > > + break;
> > > + case L3_ECC_CHK_FAT_ERR:
> > > + gt->errors.hw[INTEL_GT_HW_ERROR_FAT_L3_ECC_CHK]++;
> > > + xe_gt_hw_err(gt, "L3 ECC Checker FATAL error\n");
> > > + break;
> > > + case GUC_FAT_ERR:
> > > + gt->errors.hw[INTEL_GT_HW_ERROR_FAT_GUC]++;
> > > + xe_gt_hw_err(gt, "GUC SRAM FATAL error\n");
> > > + break;
> > > + case IDI_PAR_FAT_ERR:
> > > + gt->errors.hw[INTEL_GT_HW_ERROR_FAT_IDI_PAR]++;
> > > + xe_gt_hw_err(gt, "IDI PARITY FATAL error\n");
> > > + break;
> > > + case SQIDI_FAT_ERR:
> > > + gt->errors.hw[INTEL_GT_HW_ERROR_FAT_SQIDI]++;
> > > + xe_gt_hw_err(gt, "SQIDI FATAL error\n");
> > > + break;
> > > + case SAMPLER_FAT_ERR:
> > > + gt->errors.hw[INTEL_GT_HW_ERROR_FAT_SAMPLER]++;
> > > + xe_gt_hw_err(gt, "SAMPLER FATAL error\n");
> > > + break;
> > > + case SLM_FAT_ERR:
> > > + cnt = xe_mmio_read32(gt,
> > SLM_ECC_ERROR_CNTR(HARDWARE_ERROR_FATAL).reg);
> > > + gt->errors.hw[INTEL_GT_HW_ERROR_FAT_SLM] = cnt;
> > > + xe_gt_hw_err(gt, "%u SLM FATAL error\n", cnt);
> > > + break;
> > > + case EU_IC_FAT_ERR:
> > > + gt->errors.hw[INTEL_GT_HW_ERROR_FAT_EU_IC]++;
> > > + xe_gt_hw_err(gt, "EU IC FATAL error\n");
> > > + break;
> > > + case EU_GRF_FAT_ERR:
> > > + gt->errors.hw[INTEL_GT_HW_ERROR_FAT_EU_GRF]++;
> > > + xe_gt_hw_err(gt, "EU GRF FATAL error\n");
> > > + break;
> > > + default:
> > > + xe_gt_log_driver_error(gt,
> > INTEL_GT_DRIVER_ERROR_INTERRUPT,
> > > + "UNKNOWN FATAL error\n");
> > > + break;
> > > + }
> > > }
> > > +}
> > >
> > > - /*
> > > - * TODO: The remaining GT errors don't have a
> > > - * need for targeted logging at the moment. We
> > > - * still want to log detection of these errors, but
> > > - * let's aggregate them until someone has a need for them.
> > > - */
> > > - if (errstat & other_errors)
> > > - DRM_ERROR("detected hardware error(s) in
> > ERR_STAT_GT_REG_%s: 0x%08x\n",
> > > - hw_err_str, errstat & other_errors);
> > > +static void
> > > +xe_gt_hw_error_handler(struct xe_gt *gt, const enum hardware_error
> > > +hw_err) {
> > > + const char *hw_err_str = hardware_error_type_to_str(hw_err);
> > > + unsigned long errstat;
> > > +
> > > + lockdep_assert_held(&gt_to_xe(gt)->irq.lock);
> > >
> > > - xe_mmio_write32(gt, ERR_STAT_GT_REG(hw_err).reg, errstat);
> > > + if (!HAS_GT_ERROR_VECTORS(gt_to_xe(gt))) {
> > > + errstat = xe_mmio_read32(gt,
> > ERR_STAT_GT_REG(hw_err).reg);
> > > + if (unlikely(!errstat)) {
> > > + xe_gt_log_driver_error(gt,
> > INTEL_GT_DRIVER_ERROR_INTERRUPT,
> > > + "ERR_STAT_GT_REG_%s
> > blank!\n", hw_err_str);
> > > + return;
> > > + }
> > > + }
> > > +
> > > + switch (hw_err) {
> > > + case HARDWARE_ERROR_CORRECTABLE:
> > > + if (HAS_GT_ERROR_VECTORS(gt_to_xe(gt))) {
> > > + bool error = false;
> > > + int i;
> > > +
> > > + errstat = 0;
> > > + for (i = 0; i < ERR_STAT_GT_COR_VCTR_LEN; i++) {
> > > + u32 err_type =
> > ERR_STAT_GT_COR_VCTR_LEN;
> > > + unsigned long vctr;
> > > + const char *name;
> > > +
> > > + vctr = xe_mmio_read32(gt,
> > ERR_STAT_GT_COR_VCTR_REG(i).reg);
> > > + if (!vctr)
> > > + continue;
> > > +
> > > + switch (i) {
> > > + case ERR_STAT_GT_VCTR0:
> > > + case ERR_STAT_GT_VCTR1:
> > > + err_type =
> > INTEL_GT_HW_ERROR_COR_SUBSLICE;
> > > + gt->errors.hw[err_type] +=
> > hweight32(vctr);
> > > + name = "SUBSLICE";
> > > +
> > > + /* Avoid second read/write to error
> > status register*/
> > > + if (errstat)
> > > + break;
> > > +
> > > + errstat = xe_mmio_read32(gt,
> > ERR_STAT_GT_REG(hw_err).reg);
> > > + xe_gt_hw_err(gt,
> > "ERR_STAT_GT_CORRECTABLE:0x%08lx\n",
> > > + errstat);
> > > +
> > xe_gt_correctable_hw_error_stats_update(gt, errstat);
> > > + if (errstat)
> > > + xe_mmio_write32(gt,
> > ERR_STAT_GT_REG(hw_err).reg,
> > > + errstat);
> > > + break;
> > > +
> > > + case ERR_STAT_GT_VCTR2:
> > > + case ERR_STAT_GT_VCTR3:
> > > + err_type =
> > INTEL_GT_HW_ERROR_COR_L3BANK;
> > > + gt->errors.hw[err_type] +=
> > hweight32(vctr);
> > > + name = "L3 BANK";
> > > + break;
> > > + default:
> > > + name = "UNKNOWN";
> > > + break;
> > > + }
> > > + xe_mmio_write32(gt,
> > ERR_STAT_GT_COR_VCTR_REG(i).reg, vctr);
> > > + xe_gt_hw_err(gt, "%s CORRECTABLE error,
> > ERR_VECT_GT_CORRECTABLE_%d:0x%08lx\n",
> > > + name, i, vctr);
> > > + error = true;
> > > + }
> > > +
> > > + if (!error)
> > > + xe_gt_hw_err(gt, "UNKNOWN CORRECTABLE
> > error\n");
> > > + } else {
> > > + xe_gt_correctable_hw_error_stats_update(gt,
> > errstat);
> > > + xe_gt_hw_err(gt,
> > "ERR_STAT_GT_CORRECTABLE:0x%08lx\n", errstat);
> > > + }
> > > + break;
> > > + case HARDWARE_ERROR_NONFATAL:
> > > + /*
> > > + * TODO: The GT Non Fatal Error Status Register
> > > + * only has reserved bitfields defined.
> > > + * Remove once there is something to service.
> > > + */
> > > + drm_err_ratelimited(&gt_to_xe(gt)->drm, HW_ERR "detected
> > Non-Fatal error\n");
> > > + break;
> > > + case HARDWARE_ERROR_FATAL:
> > > + if (HAS_GT_ERROR_VECTORS(gt_to_xe(gt))) {
> > > + bool error = false;
> > > + int i;
> > > +
> > > + errstat = 0;
> > > + for (i = 0; i < ERR_STAT_GT_FATAL_VCTR_LEN; i++) {
> > > + u32 err_type =
> > ERR_STAT_GT_FATAL_VCTR_LEN;
> > > + unsigned long vctr;
> > > + const char *name;
> > > +
> > > + vctr = xe_mmio_read32(gt,
> > ERR_STAT_GT_FATAL_VCTR_REG(i).reg);
> > > + if (!vctr)
> > > + continue;
> > > +
> > > + /* i represents the vector register index */
> > > + switch (i) {
> > > + case ERR_STAT_GT_VCTR0:
> > > + case ERR_STAT_GT_VCTR1:
> > > + err_type =
> > INTEL_GT_HW_ERROR_FAT_SUBSLICE;
> > > + gt->errors.hw[err_type] +=
> > hweight32(vctr);
> > > + name = "SUBSLICE";
> > > +
> > > + /*Avoid second read/write to error
> > status register.*/
> > > + if (errstat)
> > > + break;
> > > +
> > > + errstat = xe_mmio_read32(gt,
> > ERR_STAT_GT_REG(hw_err).reg);
> > > + xe_gt_hw_err(gt,
> > "ERR_STAT_GT_FATAL:0x%08lx\n", errstat);
> > > +
> > xe_gt_fatal_hw_error_stats_update(gt, errstat);
> > > + if (errstat)
> > > + xe_mmio_write32(gt,
> > ERR_STAT_GT_REG(hw_err).reg,
> > > + errstat);
> > > + break;
> > > +
> > > + case ERR_STAT_GT_VCTR2:
> > > + case ERR_STAT_GT_VCTR3:
> > > + err_type =
> > INTEL_GT_HW_ERROR_FAT_L3BANK;
> > > + gt->errors.hw[err_type] +=
> > hweight32(vctr);
> > > + name = "L3 BANK";
> > > + break;
> > > + case ERR_STAT_GT_VCTR6:
> > > + gt-
> > >errors.hw[INTEL_GT_HW_ERROR_FAT_TLB] += hweight16(vctr);
> > > + name = "TLB";
> > > + break;
> > > + case ERR_STAT_GT_VCTR7:
> > > + gt-
> > >errors.hw[INTEL_GT_HW_ERROR_FAT_L3_FABRIC] += hweight8(vctr);
> > > + name = "L3 FABRIC";
> > > + break;
> > > + default:
> > > + name = "UNKNOWN";
> > > + break;
> > > + }
> > > + xe_mmio_write32(gt,
> > ERR_STAT_GT_FATAL_VCTR_REG(i).reg, vctr);
> > > + xe_gt_hw_err(gt, "%s FATAL error,
> > ERR_VECT_GT_FATAL_%d:0x%08lx\n",
> > > + name, i, vctr);
> > > + error = true;
> > > + }
> > > + if (!error)
> > > + xe_gt_hw_err(gt, "UNKNOWN FATAL
> > error\n");
> > > + } else {
> > > + xe_gt_fatal_hw_error_stats_update(gt, errstat);
> > > + xe_gt_hw_err(gt, "ERR_STAT_GT_FATAL:0x%08lx\n",
> > errstat);
> > > + }
> > > + break;
> > > + default:
> > > + break;
> > > + }
> > > +
> > > + if (!HAS_GT_ERROR_VECTORS(gt_to_xe(gt)))
> > > + xe_mmio_write32(gt, ERR_STAT_GT_REG(hw_err).reg,
> > errstat);
> > > }
> > >
> > > static void
> > > @@ -409,7 +647,8 @@ xe_hw_error_source_handler(struct xe_gt *gt,
> > const enum hardware_error hw_err)
> > > spin_lock_irqsave(&gt_to_xe(gt)->irq.lock, flags);
> > > errsrc = xe_mmio_read32(gt, DEV_ERR_STAT_REG(hw_err).reg);
> > > if (unlikely(!errsrc)) {
> > > - DRM_ERROR("DEV_ERR_STAT_REG_%s blank!\n",
> > hw_err_str);
> > > + xe_gt_log_driver_error(gt,
> > INTEL_GT_DRIVER_ERROR_INTERRUPT,
> > > + "DEV_ERR_STAT_REG_%s blank!\n",
> > hw_err_str);
> > > goto out_unlock;
> > > }
> > >
> > > @@ -417,8 +656,9 @@ xe_hw_error_source_handler(struct xe_gt *gt,
> > const enum hardware_error hw_err)
> > > xe_gt_hw_error_handler(gt, hw_err);
> > >
> > > if (errsrc & ~DEV_ERR_STAT_GT_ERROR)
> > > - DRM_ERROR("non-GT hardware error(s) in
> > DEV_ERR_STAT_REG_%s: 0x%08x\n",
> > > - hw_err_str, errsrc & ~DEV_ERR_STAT_GT_ERROR);
> > > + xe_gt_log_driver_error(gt,
> > INTEL_GT_DRIVER_ERROR_INTERRUPT,
> > > + "non-GT hardware error(s) in
> > DEV_ERR_STAT_REG_%s: 0x%08x\n",
> > > + hw_err_str, errsrc &
> > ~DEV_ERR_STAT_GT_ERROR);
> > >
> > > xe_mmio_write32(gt, DEV_ERR_STAT_REG(hw_err).reg, errsrc);
> > >
> > > @@ -634,12 +874,44 @@ static void irq_uninstall(struct drm_device *drm,
> > void *arg)
> > > pci_disable_msi(pdev);
> > > }
> > >
> > > +/**
> > > + * process_hw_errors - checks for the occurrence of HW errors
> > > + *
> > > + * This checks for the HW Errors including FATAL error that might
> > > + * have occurred in the previous boot of the driver which will
> > > + * initiate PCIe FLR reset of the device and cause the
> > > + * driver to reload.
> >
> > Is this saying that there's already been a PCIe FLR and you're trying to read
> > the registers after that reset has happened? The bspec indicates that these
> > registers have 'DEV' style reset, so they wouldn't be able to preserve their
> > values across a reset.
> Registers preserve the value across reset.
> BSPEC: 50875
> >
> > > + */
> > > +static void process_hw_errors(struct xe_device *xe) {
> > > + struct xe_gt *gt0 = xe_device_get_gt(xe, 0);
> > > + u32 dev_pcieerr_status, master_ctl;
> > > + struct xe_gt *gt;
> > > + int i;
> > > +
> > > + dev_pcieerr_status = xe_mmio_read32(gt0,
> > DEV_PCIEERR_STATUS.reg);
> > > +
> > > + for_each_gt(gt, xe, i) {
> > > + if (dev_pcieerr_status & DEV_PCIEERR_IS_FATAL(i))
> > > + xe_hw_error_source_handler(gt,
> > HARDWARE_ERROR_FATAL);
> > > +
> > > + master_ctl = xe_mmio_read32(gt,
> > GEN11_GFX_MSTR_IRQ.reg);
> > > + xe_mmio_write32(gt, GEN11_GFX_MSTR_IRQ.reg,
> > master_ctl);
> > > + xe_hw_error_irq_handler(gt, master_ctl);
> > > + }
> > > + if (dev_pcieerr_status)
> > > + xe_mmio_write32(gt, DEV_PCIEERR_STATUS.reg,
> > dev_pcieerr_status); }
> > > +
> > > int xe_irq_install(struct xe_device *xe) {
> > > int irq = to_pci_dev(xe->drm.dev)->irq;
> > > irq_handler_t irq_handler;
> > > int err;
> > >
> > > + if (IS_DGFX(xe))
> > > + process_hw_errors(xe);
> >
> > Why is this conditional on DGFX? From what I can see this also applies to
> > integrated platforms like MTL too.
> From a RAS perspective, error reporting is required on DGFX only.
> Sysman doesn't use these counters on IGFX.
I don't think we should add artificial limitations to the kernel driver
just because one wrapper library (which probably won't even get used by
most users) has them. If the hardware can report errors, and if users
can bypass sysman and just use EDAC stuff, then I don't see a reason to
hide the errors on integrated platforms?
Matt
> >
> >
> > Matt
> >
> > > +
> > > irq_handler = xe_irq_handler(xe);
> > > if (!irq_handler) {
> > > drm_err(&xe->drm, "No supported interrupt handler"); diff --
> > git
> > > a/drivers/gpu/drm/xe/xe_pci.c b/drivers/gpu/drm/xe/xe_pci.c index
> > > 1844cff8fba8..69098194cef8 100644
> > > --- a/drivers/gpu/drm/xe/xe_pci.c
> > > +++ b/drivers/gpu/drm/xe/xe_pci.c
> > > @@ -73,6 +73,7 @@ struct xe_device_desc {
> > > bool has_range_tlb_invalidation;
> > > bool has_asid;
> > > bool has_link_copy_engine;
> > > + bool has_gt_error_vectors;
> > > };
> > >
> > > __diag_push();
> > > @@ -232,6 +233,7 @@ static const struct xe_device_desc pvc_desc = {
> > > .supports_usm = true,
> > > .has_asid = true,
> > > .has_link_copy_engine = true,
> > > + .has_gt_error_vectors = true,
> > > };
> > >
> > > #define MTL_MEDIA_ENGINES \
> > > @@ -418,6 +420,7 @@ static int xe_pci_probe(struct pci_dev *pdev, const
> > struct pci_device_id *ent)
> > > xe->info.vm_max_level = desc->vm_max_level;
> > > xe->info.supports_usm = desc->supports_usm;
> > > xe->info.has_asid = desc->has_asid;
> > > + xe->info.has_gt_error_vectors = desc->has_gt_error_vectors;
> > > xe->info.has_flat_ccs = desc->has_flat_ccs;
> > > xe->info.has_4tile = desc->has_4tile;
> > > xe->info.has_range_tlb_invalidation =
> > > desc->has_range_tlb_invalidation;
> > > --
> > > 2.25.1
> > >
> >
> > --
> > Matt Roper
> > Graphics Software Engineer
> > Linux GPU Platform Enablement
> > Intel Corporation
--
Matt Roper
Graphics Software Engineer
Linux GPU Platform Enablement
Intel Corporation
* Re: [Intel-xe] [PATCH 2/4] drm/xe/ras: Log the GT hw errors.
2023-05-04 0:02 ` Matt Roper
@ 2023-05-05 7:24 ` Iddamsetty, Aravind
2023-05-05 15:10 ` Matt Roper
0 siblings, 1 reply; 22+ messages in thread
From: Iddamsetty, Aravind @ 2023-05-05 7:24 UTC (permalink / raw)
To: Matt Roper, Ghimiray, Himal Prasad; +Cc: intel-xe@lists.freedesktop.org
On 04-05-2023 05:32, Matt Roper wrote:
> On Fri, Apr 28, 2023 at 01:00:04AM -0700, Ghimiray, Himal Prasad wrote:
> ...
>>> All of this new infrastructure seems pretty questionable at the moment.
>>> We're doing extra work to count up errors, but then never doing anything
>>> with the counts. You mention in the cover letter that these will be exposed
>>> to userspace eventually, but what's the benefit of that? Which userspace
>>> component is going to actually use this information? What do you expect
>>> userspace to do if it finds out there's been a fatal or correctable error in
>>> some low-level hardware unit? Generally userspace shouldn't even need to
>>> care about the really low-level hardware details; if something has truly gone
>>> fatally wrong, it's game over for userspace and it probably doesn't matter
>>> exactly where in the hardware things are actually busted.
>>>
>>> Without some extra justification from the userspace point of view, it feels
>>> like we're just adding a bunch of code that doesn't have a real-world
>>> purpose.
>> The error counters exposed by KMD will be used by sysman
>> They will be categorized to specific category of error in sysman:
>> https://spec.oneapi.io/level-zero/latest/sysman/api.html#ras
>
> L0 sysman looks sort of like a complicated libdrm replacement. I.e.,
> it's just a wrapper library over various uapi interfaces, but isn't
> really a "consumer" of that uapi itself. We need to understand the
> whole top-to-bottom stack to make sure that whatever interfaces and
> representation are selected (both at the Xe and Sysman levels) actually
> make sense and satisfy the full stack's needs.
>
> Is the actual reporting of these errors going to be done via the
> standard Linux RAS/EDAC interfaces? I.e., what's documented at
> https://www.kernel.org/doc/html/v6.3/admin-guide/ras.html ? If so, then
> there's already a bunch of real userspace tools that work with that, so
> that would probably help justify the work here, and make it clear that
> we're not just reinventing the wheel. If we're not tying into that,
> then we probably need to justify clearly why it can't be used.
I don't think we can use EDAC. IIUC it expects errors to be reported via
MCA (Machine Check Architecture)/MCE, and our HW doesn't do that. The
correctable and non-fatal errors are reported via MSI, and fatal errors as a
PCI_ERR message.
Thanks,
Aravind.
>
>>>
>>>> + const enum xe_gt_driver_errors error,
>>>> + const char *fmt, ...)
>>>> +{
>>>> + struct va_format vaf;
>>>> + va_list args;
>>>> +
>>>> + va_start(args, fmt);
>>>> + vaf.fmt = fmt;
>>>> + vaf.va = &args;
>>>> +
>>>> + BUILD_BUG_ON(ARRAY_SIZE(xe_gt_driver_errors_to_str) !=
>>>> + INTEL_GT_DRIVER_ERROR_COUNT);
>>>> +
>>>> + WARN_ON_ONCE(error >= INTEL_GT_DRIVER_ERROR_COUNT);
>>>> +
>>>> + gt->errors.driver[error]++;
>>>> +
>>>> + drm_err_ratelimited(&gt_to_xe(gt)->drm, "GT%u [%s] %pV",
>>>> + gt->info.id,
>>>> + xe_gt_driver_errors_to_str[error],
>>>> + &vaf);
>>>> + va_end(args);
>>>> +}
>>>> +
>>>> struct xe_gt *xe_find_full_gt(struct xe_gt *gt) {
>>>> struct xe_gt *search;
>>>> diff --git a/drivers/gpu/drm/xe/xe_gt_types.h
>>>> b/drivers/gpu/drm/xe/xe_gt_types.h
>>>> index 8f29aba455e0..9580a40c0142 100644
>>>> --- a/drivers/gpu/drm/xe/xe_gt_types.h
>>>> +++ b/drivers/gpu/drm/xe/xe_gt_types.h
>>>> @@ -33,6 +33,43 @@ enum xe_gt_type {
>>>> typedef unsigned long xe_dss_mask_t[BITS_TO_LONGS(32 *
>>>> XE_MAX_DSS_FUSE_REGS)]; typedef unsigned long
>>>> xe_eu_mask_t[BITS_TO_LONGS(32 * XE_MAX_EU_FUSE_REGS)];
>>>>
>>>> +/* Count of GT Correctable and FATAL HW ERRORS */ enum
>>>> +intel_gt_hw_errors {
>>>> + INTEL_GT_HW_ERROR_COR_SUBSLICE = 0,
>>>> + INTEL_GT_HW_ERROR_COR_L3BANK,
>>>> + INTEL_GT_HW_ERROR_COR_L3_SNG,
>>>> + INTEL_GT_HW_ERROR_COR_GUC,
>>>> + INTEL_GT_HW_ERROR_COR_SAMPLER,
>>>> + INTEL_GT_HW_ERROR_COR_SLM,
>>>> + INTEL_GT_HW_ERROR_COR_EU_IC,
>>>> + INTEL_GT_HW_ERROR_COR_EU_GRF,
>>>> + INTEL_GT_HW_ERROR_FAT_SUBSLICE,
>>>> + INTEL_GT_HW_ERROR_FAT_L3BANK,
>>>> + INTEL_GT_HW_ERROR_FAT_ARR_BIST,
>>>> + INTEL_GT_HW_ERROR_FAT_FPU,
>>>> + INTEL_GT_HW_ERROR_FAT_L3_DOUB,
>>>> + INTEL_GT_HW_ERROR_FAT_L3_ECC_CHK,
>>>> + INTEL_GT_HW_ERROR_FAT_GUC,
>>>> + INTEL_GT_HW_ERROR_FAT_IDI_PAR,
>>>> + INTEL_GT_HW_ERROR_FAT_SQIDI,
>>>> + INTEL_GT_HW_ERROR_FAT_SAMPLER,
>>>> + INTEL_GT_HW_ERROR_FAT_SLM,
>>>> + INTEL_GT_HW_ERROR_FAT_EU_IC,
>>>> + INTEL_GT_HW_ERROR_FAT_EU_GRF,
>>>> + INTEL_GT_HW_ERROR_FAT_TLB,
>>>> + INTEL_GT_HW_ERROR_FAT_L3_FABRIC,
>>>> + INTEL_GT_HW_ERROR_COUNT
>>>> +};
>>>> +
>>>> +enum xe_gt_driver_errors {
>>>> + INTEL_GT_DRIVER_ERROR_INTERRUPT = 0,
>>>> + INTEL_GT_DRIVER_ERROR_COUNT
>>>> +};
>>>> +
>>>> +void xe_gt_log_driver_error(struct xe_gt *gt,
>>>> + const enum xe_gt_driver_errors error,
>>>> + const char *fmt, ...);
>>>> +
>>>> struct xe_mmio_range {
>>>> u32 start;
>>>> u32 end;
>>>> @@ -357,6 +394,12 @@ struct xe_gt {
>>>> * of a steered operation
>>>> */
>>>> spinlock_t mcr_lock;
>>>> +
>>>> + struct intel_hw_errors {
>>>> + unsigned long hw[INTEL_GT_HW_ERROR_COUNT];
>>>> + unsigned long driver[INTEL_GT_DRIVER_ERROR_COUNT];
>>>> + } errors;
>>>> +
>>>> };
>>>>
>>>> #endif
>>>> diff --git a/drivers/gpu/drm/xe/xe_irq.c b/drivers/gpu/drm/xe/xe_irq.c
>>>> index 6b922332bff1..4626f7280aaf 100644
>>>> --- a/drivers/gpu/drm/xe/xe_irq.c
>>>> +++ b/drivers/gpu/drm/xe/xe_irq.c
>>>> @@ -19,6 +19,7 @@
>>>> #include "xe_hw_engine.h"
>>>> #include "xe_mmio.h"
>>>>
>>>> +#define HAS_GT_ERROR_VECTORS(xe) ((xe)->info.has_gt_error_vectors)
>>>> static void gen3_assert_iir_is_zero(struct xe_gt *gt, i915_reg_t reg)
>>>> {
>>>> u32 val = xe_mmio_read32(gt, reg.reg);
>>>> @@ -359,44 +360,281 @@ hardware_error_type_to_str(const enum hardware_error hw_err)
>>>> }
>>>> }
>>>>
>>>> +#define xe_gt_hw_err(gt, fmt, ...) \
>>>> + drm_err_ratelimited(&gt_to_xe(gt)->drm, HW_ERR "GT%d detected "
>>> fmt, \
>>>> + (gt)->info.id, ##__VA_ARGS__)
>>>
>>> As on the previous patch, it looks like we're printing error-level kernel
>>> messages for correctable errors (i.e., things the hardware caught and fixed
>>> internally like ECC). Generally those kinds of things shouldn't be putting
>>> errors in the kernel log because there's no actual problem from the end user
>>> perspective.
>> Agreed. Should we use a warning for correctable errors?
>
> Presumably we shouldn't be reporting them in dmesg at all by default.
> It looks like the EDAC subsystem has its own ways of controlling if/when
> stuff shows up in dmesg.
>
>>>
>>>> +
>>>> static void
>>>> -xe_gt_hw_error_handler(struct xe_gt *gt, const enum hardware_error
>>>> hw_err)
>>>> +xe_gt_correctable_hw_error_stats_update(struct xe_gt *gt, unsigned
>>>> +long errstat)
>>>> {
>>>> - const char *hw_err_str = hardware_error_type_to_str(hw_err);
>>>> - u32 other_errors = ~(EU_GRF_ERROR | EU_IC_ERROR);
>>>> - u32 errstat;
>>>> + u32 errbit, cnt;
>>>>
>>>> - lockdep_assert_held(&gt_to_xe(gt)->irq.lock);
>>>> + if (!errstat && HAS_GT_ERROR_VECTORS(gt_to_xe(gt)))
>>>> + return;
>>>>
>>>> - errstat = xe_mmio_read32(gt, ERR_STAT_GT_REG(hw_err).reg);
>>>> + for_each_set_bit(errbit, &errstat, GT_HW_ERROR_MAX_ERR_BITS) {
>>>> + if (gt->xe->info.platform == XE_PVC && !(REG_BIT(errbit) &
>>> PVC_COR_ERR_MASK)) {
>>>> + xe_gt_log_driver_error(gt,
>>> INTEL_GT_DRIVER_ERROR_INTERRUPT,
>>>> + "UNKNOWN CORRECTABLE
>>> error\n");
>>>> + continue;
>>>> + }
>>>>
>>>> - if (unlikely(!errstat)) {
>>>> - DRM_ERROR("ERR_STAT_GT_REG_%s blank!\n",
>>> hw_err_str);
>>>> - return;
>>>> + switch (errbit) {
>>>> + case L3_SNG_COR_ERR:
>>>> + gt->errors.hw[INTEL_GT_HW_ERROR_COR_L3_SNG]++;
>>>> + xe_gt_hw_err(gt, "L3 SINGLE CORRECTABLE error\n");
>>>> + break;
>>>> + case GUC_COR_ERR:
>>>> + gt->errors.hw[INTEL_GT_HW_ERROR_COR_GUC]++;
>>>> + xe_gt_hw_err(gt, "SINGLE BIT GUC SRAM CORRECTABLE
>>> error\n");
>>>> + break;
>>>> + case SAMPLER_COR_ERR:
>>>> + gt->errors.hw[INTEL_GT_HW_ERROR_COR_SAMPLER]++;
>>>> + xe_gt_hw_err(gt, "SINGLE BIT SAMPLER CORRECTABLE
>>> error\n");
>>>> + break;
>>>> + case SLM_COR_ERR:
>>>> + cnt = xe_mmio_read32(gt,
>>> SLM_ECC_ERROR_CNTR(HARDWARE_ERROR_CORRECTABLE).reg);
>>>> + gt->errors.hw[INTEL_GT_HW_ERROR_COR_SLM] = cnt;
>>>> + xe_gt_hw_err(gt, "%u SINGLE BIT SLM CORRECTABLE
>>> error\n", cnt);
>>>> + break;
>>>> + case EU_IC_COR_ERR:
>>>> + gt->errors.hw[INTEL_GT_HW_ERROR_COR_EU_IC]++;
>>>> + xe_gt_hw_err(gt, "SINGLE BIT EU IC CORRECTABLE error\n");
>>>> + break;
>>>> + case EU_GRF_COR_ERR:
>>>> + gt->errors.hw[INTEL_GT_HW_ERROR_COR_EU_GRF]++;
>>>> + xe_gt_hw_err(gt, "SINGLE BIT EU GRF CORRECTABLE
>>> error\n");
>>>> + break;
>>>> + default:
>>>> + xe_gt_log_driver_error(gt,
>>> INTEL_GT_DRIVER_ERROR_INTERRUPT, "UNKNOWN CORRECTABLE error\n");
>>>> + break;
>>>> + }
>>>> }
>>>> +}
>>>>
>>>> - /*
>>>> - * TODO: The GT Non Fatal Error Status Register
>>>> - * only has reserved bitfields defined.
>>>> - * Remove once there is something to service.
>>>> - */
>>>> - if (hw_err == HARDWARE_ERROR_NONFATAL) {
>>>> - DRM_ERROR("detected Non-Fatal error\n");
>>>> - xe_mmio_write32(gt, ERR_STAT_GT_REG(hw_err).reg,
>>> errstat);
>>>> +static void xe_gt_fatal_hw_error_stats_update(struct xe_gt *gt,
>>>> +unsigned long errstat) {
>>>> + u32 errbit, cnt;
>>>> +
>>>> + if (!errstat && HAS_GT_ERROR_VECTORS(gt_to_xe(gt)))
>>>> return;
>>>> +
>>>> + for_each_set_bit(errbit, &errstat, GT_HW_ERROR_MAX_ERR_BITS) {
>>>> + if (gt->xe->info.platform == XE_PVC && !(REG_BIT(errbit) &
>>> PVC_FAT_ERR_MASK)) {
>>>> + xe_gt_log_driver_error(gt,
>>> INTEL_GT_DRIVER_ERROR_INTERRUPT,
>>>> + "UNKNOWN FATAL error\n");
>>>> + continue;
>>>> + }
>>>> +
>>>> + switch (errbit) {
>>>> + case ARRAY_BIST_FAT_ERR:
>>>> + gt->errors.hw[INTEL_GT_HW_ERROR_FAT_ARR_BIST]++;
>>>> + xe_gt_hw_err(gt, "Array BIST FATAL error\n");
>>>> + break;
>>>> + case FPU_UNCORR_FAT_ERR:
>>>> + gt->errors.hw[INTEL_GT_HW_ERROR_FAT_FPU]++;
>>>> + xe_gt_hw_err(gt, "FPU FATAL error\n");
>>>> + break;
>>>> + case L3_DOUBLE_FAT_ERR:
>>>> + gt->errors.hw[INTEL_GT_HW_ERROR_FAT_L3_DOUB]++;
>>>> + xe_gt_hw_err(gt, "L3 Double FATAL error\n");
>>>> + break;
>>>> + case L3_ECC_CHK_FAT_ERR:
>>>> + gt->errors.hw[INTEL_GT_HW_ERROR_FAT_L3_ECC_CHK]++;
>>>> + xe_gt_hw_err(gt, "L3 ECC Checker FATAL error\n");
>>>> + break;
>>>> + case GUC_FAT_ERR:
>>>> + gt->errors.hw[INTEL_GT_HW_ERROR_FAT_GUC]++;
>>>> + xe_gt_hw_err(gt, "GUC SRAM FATAL error\n");
>>>> + break;
>>>> + case IDI_PAR_FAT_ERR:
>>>> + gt->errors.hw[INTEL_GT_HW_ERROR_FAT_IDI_PAR]++;
>>>> + xe_gt_hw_err(gt, "IDI PARITY FATAL error\n");
>>>> + break;
>>>> + case SQIDI_FAT_ERR:
>>>> + gt->errors.hw[INTEL_GT_HW_ERROR_FAT_SQIDI]++;
>>>> + xe_gt_hw_err(gt, "SQIDI FATAL error\n");
>>>> + break;
>>>> + case SAMPLER_FAT_ERR:
>>>> + gt->errors.hw[INTEL_GT_HW_ERROR_FAT_SAMPLER]++;
>>>> + xe_gt_hw_err(gt, "SAMPLER FATAL error\n");
>>>> + break;
>>>> + case SLM_FAT_ERR:
>>>> + cnt = xe_mmio_read32(gt,
>>> SLM_ECC_ERROR_CNTR(HARDWARE_ERROR_FATAL).reg);
>>>> + gt->errors.hw[INTEL_GT_HW_ERROR_FAT_SLM] = cnt;
>>>> + xe_gt_hw_err(gt, "%u SLM FATAL error\n", cnt);
>>>> + break;
>>>> + case EU_IC_FAT_ERR:
>>>> + gt->errors.hw[INTEL_GT_HW_ERROR_FAT_EU_IC]++;
>>>> + xe_gt_hw_err(gt, "EU IC FATAL error\n");
>>>> + break;
>>>> + case EU_GRF_FAT_ERR:
>>>> + gt->errors.hw[INTEL_GT_HW_ERROR_FAT_EU_GRF]++;
>>>> + xe_gt_hw_err(gt, "EU GRF FATAL error\n");
>>>> + break;
>>>> + default:
>>>> + xe_gt_log_driver_error(gt,
>>> INTEL_GT_DRIVER_ERROR_INTERRUPT,
>>>> + "UNKNOWN FATAL error\n");
>>>> + break;
>>>> + }
>>>> }
>>>> +}
>>>>
>>>> - /*
>>>> - * TODO: The remaining GT errors don't have a
>>>> - * need for targeted logging at the moment. We
>>>> - * still want to log detection of these errors, but
>>>> - * let's aggregate them until someone has a need for them.
>>>> - */
>>>> - if (errstat & other_errors)
>>>> - DRM_ERROR("detected hardware error(s) in
>>> ERR_STAT_GT_REG_%s: 0x%08x\n",
>>>> - hw_err_str, errstat & other_errors);
>>>> +static void
>>>> +xe_gt_hw_error_handler(struct xe_gt *gt, const enum hardware_error
>>>> +hw_err) {
>>>> + const char *hw_err_str = hardware_error_type_to_str(hw_err);
>>>> + unsigned long errstat;
>>>> +
>>>> + lockdep_assert_held(&gt_to_xe(gt)->irq.lock);
>>>>
>>>> - xe_mmio_write32(gt, ERR_STAT_GT_REG(hw_err).reg, errstat);
>>>> + if (!HAS_GT_ERROR_VECTORS(gt_to_xe(gt))) {
>>>> + errstat = xe_mmio_read32(gt,
>>> ERR_STAT_GT_REG(hw_err).reg);
>>>> + if (unlikely(!errstat)) {
>>>> + xe_gt_log_driver_error(gt,
>>> INTEL_GT_DRIVER_ERROR_INTERRUPT,
>>>> + "ERR_STAT_GT_REG_%s
>>> blank!\n", hw_err_str);
>>>> + return;
>>>> + }
>>>> + }
>>>> +
>>>> + switch (hw_err) {
>>>> + case HARDWARE_ERROR_CORRECTABLE:
>>>> + if (HAS_GT_ERROR_VECTORS(gt_to_xe(gt))) {
>>>> + bool error = false;
>>>> + int i;
>>>> +
>>>> + errstat = 0;
>>>> + for (i = 0; i < ERR_STAT_GT_COR_VCTR_LEN; i++) {
>>>> + u32 err_type =
>>> ERR_STAT_GT_COR_VCTR_LEN;
>>>> + unsigned long vctr;
>>>> + const char *name;
>>>> +
>>>> + vctr = xe_mmio_read32(gt,
>>> ERR_STAT_GT_COR_VCTR_REG(i).reg);
>>>> + if (!vctr)
>>>> + continue;
>>>> +
>>>> + switch (i) {
>>>> + case ERR_STAT_GT_VCTR0:
>>>> + case ERR_STAT_GT_VCTR1:
>>>> + err_type =
>>> INTEL_GT_HW_ERROR_COR_SUBSLICE;
>>>> + gt->errors.hw[err_type] +=
>>> hweight32(vctr);
>>>> + name = "SUBSLICE";
>>>> +
>>>> + /* Avoid second read/write to error
>>> status register*/
>>>> + if (errstat)
>>>> + break;
>>>> +
>>>> + errstat = xe_mmio_read32(gt,
>>> ERR_STAT_GT_REG(hw_err).reg);
>>>> + xe_gt_hw_err(gt,
>>> "ERR_STAT_GT_CORRECTABLE:0x%08lx\n",
>>>> + errstat);
>>>> +
>>> xe_gt_correctable_hw_error_stats_update(gt, errstat);
>>>> + if (errstat)
>>>> + xe_mmio_write32(gt,
>>> ERR_STAT_GT_REG(hw_err).reg,
>>>> + errstat);
>>>> + break;
>>>> +
>>>> + case ERR_STAT_GT_VCTR2:
>>>> + case ERR_STAT_GT_VCTR3:
>>>> + err_type =
>>> INTEL_GT_HW_ERROR_COR_L3BANK;
>>>> + gt->errors.hw[err_type] +=
>>> hweight32(vctr);
>>>> + name = "L3 BANK";
>>>> + break;
>>>> + default:
>>>> + name = "UNKNOWN";
>>>> + break;
>>>> + }
>>>> + xe_mmio_write32(gt,
>>> ERR_STAT_GT_COR_VCTR_REG(i).reg, vctr);
>>>> + xe_gt_hw_err(gt, "%s CORRECTABLE error,
>>> ERR_VECT_GT_CORRECTABLE_%d:0x%08lx\n",
>>>> + name, i, vctr);
>>>> + error = true;
>>>> + }
>>>> +
>>>> + if (!error)
>>>> + xe_gt_hw_err(gt, "UNKNOWN CORRECTABLE
>>> error\n");
>>>> + } else {
>>>> + xe_gt_correctable_hw_error_stats_update(gt,
>>> errstat);
>>>> + xe_gt_hw_err(gt,
>>> "ERR_STAT_GT_CORRECTABLE:0x%08lx\n", errstat);
>>>> + }
>>>> + break;
>>>> + case HARDWARE_ERROR_NONFATAL:
>>>> + /*
>>>> + * TODO: The GT Non Fatal Error Status Register
>>>> + * only has reserved bitfields defined.
>>>> + * Remove once there is something to service.
>>>> + */
>>>> + drm_err_ratelimited(&gt_to_xe(gt)->drm, HW_ERR "detected
>>> Non-Fatal error\n");
>>>> + break;
>>>> + case HARDWARE_ERROR_FATAL:
>>>> + if (HAS_GT_ERROR_VECTORS(gt_to_xe(gt))) {
>>>> + bool error = false;
>>>> + int i;
>>>> +
>>>> + errstat = 0;
>>>> + for (i = 0; i < ERR_STAT_GT_FATAL_VCTR_LEN; i++) {
>>>> + u32 err_type =
>>> ERR_STAT_GT_FATAL_VCTR_LEN;
>>>> + unsigned long vctr;
>>>> + const char *name;
>>>> +
>>>> + vctr = xe_mmio_read32(gt,
>>> ERR_STAT_GT_FATAL_VCTR_REG(i).reg);
>>>> + if (!vctr)
>>>> + continue;
>>>> +
>>>> + /* i represents the vector register index */
>>>> + switch (i) {
>>>> + case ERR_STAT_GT_VCTR0:
>>>> + case ERR_STAT_GT_VCTR1:
>>>> + err_type =
>>> INTEL_GT_HW_ERROR_FAT_SUBSLICE;
>>>> + gt->errors.hw[err_type] +=
>>> hweight32(vctr);
>>>> + name = "SUBSLICE";
>>>> +
>>>> + /*Avoid second read/write to error
>>> status register.*/
>>>> + if (errstat)
>>>> + break;
>>>> +
>>>> + errstat = xe_mmio_read32(gt,
>>> ERR_STAT_GT_REG(hw_err).reg);
>>>> + xe_gt_hw_err(gt,
>>> "ERR_STAT_GT_FATAL:0x%08lx\n", errstat);
>>>> +
>>> xe_gt_fatal_hw_error_stats_update(gt, errstat);
>>>> + if (errstat)
>>>> + xe_mmio_write32(gt,
>>> ERR_STAT_GT_REG(hw_err).reg,
>>>> + errstat);
>>>> + break;
>>>> +
>>>> + case ERR_STAT_GT_VCTR2:
>>>> + case ERR_STAT_GT_VCTR3:
>>>> + err_type =
>>> INTEL_GT_HW_ERROR_FAT_L3BANK;
>>>> + gt->errors.hw[err_type] +=
>>> hweight32(vctr);
>>>> + name = "L3 BANK";
>>>> + break;
>>>> + case ERR_STAT_GT_VCTR6:
>>>> + gt->errors.hw[INTEL_GT_HW_ERROR_FAT_TLB] += hweight16(vctr);
>>>> + name = "TLB";
>>>> + break;
>>>> + case ERR_STAT_GT_VCTR7:
>>>> + gt->errors.hw[INTEL_GT_HW_ERROR_FAT_L3_FABRIC] += hweight8(vctr);
>>>> + name = "L3 FABRIC";
>>>> + break;
>>>> + default:
>>>> + name = "UNKNOWN";
>>>> + break;
>>>> + }
>>>> + xe_mmio_write32(gt,
>>> ERR_STAT_GT_FATAL_VCTR_REG(i).reg, vctr);
>>>> + xe_gt_hw_err(gt, "%s FATAL error,
>>> ERR_VECT_GT_FATAL_%d:0x%08lx\n",
>>>> + name, i, vctr);
>>>> + error = true;
>>>> + }
>>>> + if (!error)
>>>> + xe_gt_hw_err(gt, "UNKNOWN FATAL
>>> error\n");
>>>> + } else {
>>>> + xe_gt_fatal_hw_error_stats_update(gt, errstat);
>>>> + xe_gt_hw_err(gt, "ERR_STAT_GT_FATAL:0x%08lx\n",
>>> errstat);
>>>> + }
>>>> + break;
>>>> + default:
>>>> + break;
>>>> + }
>>>> +
>>>> + if (!HAS_GT_ERROR_VECTORS(gt_to_xe(gt)))
>>>> + xe_mmio_write32(gt, ERR_STAT_GT_REG(hw_err).reg,
>>> errstat);
>>>> }
>>>>
>>>> static void
>>>> @@ -409,7 +647,8 @@ xe_hw_error_source_handler(struct xe_gt *gt,
>>> const enum hardware_error hw_err)
>>>> spin_lock_irqsave(&gt_to_xe(gt)->irq.lock, flags);
>>>> errsrc = xe_mmio_read32(gt, DEV_ERR_STAT_REG(hw_err).reg);
>>>> if (unlikely(!errsrc)) {
>>>> - DRM_ERROR("DEV_ERR_STAT_REG_%s blank!\n",
>>> hw_err_str);
>>>> + xe_gt_log_driver_error(gt,
>>> INTEL_GT_DRIVER_ERROR_INTERRUPT,
>>>> + "DEV_ERR_STAT_REG_%s blank!\n",
>>> hw_err_str);
>>>> goto out_unlock;
>>>> }
>>>>
>>>> @@ -417,8 +656,9 @@ xe_hw_error_source_handler(struct xe_gt *gt,
>>> const enum hardware_error hw_err)
>>>> xe_gt_hw_error_handler(gt, hw_err);
>>>>
>>>> if (errsrc & ~DEV_ERR_STAT_GT_ERROR)
>>>> - DRM_ERROR("non-GT hardware error(s) in
>>> DEV_ERR_STAT_REG_%s: 0x%08x\n",
>>>> - hw_err_str, errsrc & ~DEV_ERR_STAT_GT_ERROR);
>>>> + xe_gt_log_driver_error(gt,
>>> INTEL_GT_DRIVER_ERROR_INTERRUPT,
>>>> + "non-GT hardware error(s) in
>>> DEV_ERR_STAT_REG_%s: 0x%08x\n",
>>>> + hw_err_str, errsrc &
>>> ~DEV_ERR_STAT_GT_ERROR);
>>>>
>>>> xe_mmio_write32(gt, DEV_ERR_STAT_REG(hw_err).reg, errsrc);
>>>>
>>>> @@ -634,12 +874,44 @@ static void irq_uninstall(struct drm_device *drm,
>>> void *arg)
>>>> pci_disable_msi(pdev);
>>>> }
>>>>
>>>> +/**
>>>> + * process_hw_errors - checks for the occurrence of HW errors
>>>> + *
>>>> + * This checks for the HW Errors including FATAL error that might
>>>> + * have occurred in the previous boot of the driver which will
>>>> + * initiate PCIe FLR reset of the device and cause the
>>>> + * driver to reload.
>>>
>>> Is this saying that there's already been a PCIe FLR and you're trying to read
>>> the registers after that reset has happened? The bspec indicates that these
>>> registers have 'DEV' style reset, so they wouldn't be able to preserve their
>>> values across a reset.
>> Registers preserve the value across reset.
>> BSPEC: 50875
>>>
>>>> + */
>>>> +static void process_hw_errors(struct xe_device *xe) {
>>>> + struct xe_gt *gt0 = xe_device_get_gt(xe, 0);
>>>> + u32 dev_pcieerr_status, master_ctl;
>>>> + struct xe_gt *gt;
>>>> + int i;
>>>> +
>>>> + dev_pcieerr_status = xe_mmio_read32(gt0,
>>> DEV_PCIEERR_STATUS.reg);
>>>> +
>>>> + for_each_gt(gt, xe, i) {
>>>> + if (dev_pcieerr_status & DEV_PCIEERR_IS_FATAL(i))
>>>> + xe_hw_error_source_handler(gt,
>>> HARDWARE_ERROR_FATAL);
>>>> +
>>>> + master_ctl = xe_mmio_read32(gt,
>>> GEN11_GFX_MSTR_IRQ.reg);
>>>> + xe_mmio_write32(gt, GEN11_GFX_MSTR_IRQ.reg,
>>> master_ctl);
>>>> + xe_hw_error_irq_handler(gt, master_ctl);
>>>> + }
>>>> + if (dev_pcieerr_status)
>>>> + xe_mmio_write32(gt, DEV_PCIEERR_STATUS.reg,
>>> dev_pcieerr_status); }
>>>> +
>>>> int xe_irq_install(struct xe_device *xe) {
>>>> int irq = to_pci_dev(xe->drm.dev)->irq;
>>>> irq_handler_t irq_handler;
>>>> int err;
>>>>
>>>> + if (IS_DGFX(xe))
>>>> + process_hw_errors(xe);
>>>
>>> Why is this conditional on DGFX? From what I can see this also applies to
>>> integrated platforms like MTL too.
>> From a RAS perspective, error reporting is required on DGFX only.
>> Sysman doesn't use these counters on IGFX.
>
> I don't think we should add artificial limitations to the kernel driver
> just because one wrapper library (which probably won't even get used by
> most users) has them. If the hardware can report errors, and if users
> can bypass sysman and just use EDAC stuff, then I don't see a reason to
> hide the errors on integrated platforms?
>
>
> Matt
>
>>>
>>>
>>> Matt
>>>
>>>> +
>>>> irq_handler = xe_irq_handler(xe);
>>>> if (!irq_handler) {
>>>> drm_err(&xe->drm, "No supported interrupt handler"); diff --
>>> git
>>>> a/drivers/gpu/drm/xe/xe_pci.c b/drivers/gpu/drm/xe/xe_pci.c index
>>>> 1844cff8fba8..69098194cef8 100644
>>>> --- a/drivers/gpu/drm/xe/xe_pci.c
>>>> +++ b/drivers/gpu/drm/xe/xe_pci.c
>>>> @@ -73,6 +73,7 @@ struct xe_device_desc {
>>>> bool has_range_tlb_invalidation;
>>>> bool has_asid;
>>>> bool has_link_copy_engine;
>>>> + bool has_gt_error_vectors;
>>>> };
>>>>
>>>> __diag_push();
>>>> @@ -232,6 +233,7 @@ static const struct xe_device_desc pvc_desc = {
>>>> .supports_usm = true,
>>>> .has_asid = true,
>>>> .has_link_copy_engine = true,
>>>> + .has_gt_error_vectors = true,
>>>> };
>>>>
>>>> #define MTL_MEDIA_ENGINES \
>>>> @@ -418,6 +420,7 @@ static int xe_pci_probe(struct pci_dev *pdev, const
>>> struct pci_device_id *ent)
>>>> xe->info.vm_max_level = desc->vm_max_level;
>>>> xe->info.supports_usm = desc->supports_usm;
>>>> xe->info.has_asid = desc->has_asid;
>>>> + xe->info.has_gt_error_vectors = desc->has_gt_error_vectors;
>>>> xe->info.has_flat_ccs = desc->has_flat_ccs;
>>>> xe->info.has_4tile = desc->has_4tile;
>>>> xe->info.has_range_tlb_invalidation =
>>>> desc->has_range_tlb_invalidation;
>>>> --
>>>> 2.25.1
>>>>
>>>
>>> --
>>> Matt Roper
>>> Graphics Software Engineer
>>> Linux GPU Platform Enablement
>>> Intel Corporation
>
* Re: [Intel-xe] [PATCH 2/4] drm/xe/ras: Log the GT hw errors.
2023-05-05 7:24 ` Iddamsetty, Aravind
@ 2023-05-05 15:10 ` Matt Roper
0 siblings, 0 replies; 22+ messages in thread
From: Matt Roper @ 2023-05-05 15:10 UTC (permalink / raw)
To: Iddamsetty, Aravind; +Cc: intel-xe@lists.freedesktop.org
On Fri, May 05, 2023 at 12:54:20PM +0530, Iddamsetty, Aravind wrote:
>
>
> On 04-05-2023 05:32, Matt Roper wrote:
> > On Fri, Apr 28, 2023 at 01:00:04AM -0700, Ghimiray, Himal Prasad wrote:
> > ...
> >>> All of this new infrastructure seems pretty questionable at the moment.
> >>> We're doing extra work to count up errors, but then never doing anything
> >>> with the counts. You mention in the cover letter that these will be exposed
> >>> to userspace eventually, but what's the benefit of that? Which userspace
> >>> component is going to actually use this information? What do you expect
> >>> userspace to do if it finds out there's been a fatal or correctable error in
> >>> some low-level hardware unit? Generally userspace shouldn't even need to
> >>> care about the really low-level hardware details; if something has truly gone
> >>> fatally wrong, it's game over for userspace and it probably doesn't matter
> >>> exactly where in the hardware things are actually busted.
> >>>
> >>> Without some extra justification from the userspace point of view, it feels
> >>> like we're just adding a bunch of code that doesn't have a real-world
> >>> purpose.
> >> The error counters exposed by the KMD will be used by sysman.
> >> They will be mapped to specific error categories in sysman:
> >> https://spec.oneapi.io/level-zero/latest/sysman/api.html#ras
> >
> > L0 sysman looks sort of like a complicated libdrm replacement. I.e.,
> > it's just a wrapper library over various uapi interfaces, but isn't
> > really a "consumer" of that uapi itself. We need to understand the
> > whole top-to-bottom stack to make sure that whatever interfaces and
> > representation are selected (both at the Xe and Sysman levels) actually
> > make sense and satisfy the full stack's needs.
> >
> > Is the actual reporting of these errors going to be done via the
> > standard Linux RAS/EDAC interfaces? I.e., what's documented at
> > https://www.kernel.org/doc/html/v6.3/admin-guide/ras.html ? If so, then
> > there's already a bunch of real userspace tools that work with that, so
> > that would probably help justify the work here, and make it clear that
> > we're not just reinventing the wheel. If we're not tying into that,
> > then we probably need to justify clearly why it can't be used.
>
> I don't think we can use EDAC. IIUC it expects errors to be reported via
> MCA (Machine Check Architecture)/MCE, and our HW doesn't do that. The
> correctable and non-fatal errors are reported via MSI, and fatal errors as a
> PCI_ERR message.
My impression from skimming the EDAC stuff was that this is under our
control. I.e., our driver can receive the errors any way it wants (such
as MSI), but we'd use the edac_device framework to manage the counts and
to handle reporting it out to userspace through standard sysfs
mechanisms. Is that not the case? I see some of the other drivers
receiving their errors as interrupts and then using
edac_device_handle_ce() / edac_device_handle_ue() to manage and report
the errors.
Matt
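
For readers following the edac_device suggestion, here is a rough userspace
model of the pattern described: an interrupt-style handler decodes a status
word and hands each event to a common "corrected" or "uncorrected" helper
that owns the counters and decides whether anything is printed. This is only
a sketch under stated assumptions. Every name in it (model_edac_dev,
model_handle_ce, model_handle_ue, model_irq_handler, the block enum) is
hypothetical and is not the kernel EDAC API or anything from the patch; the
real edac_device_handle_ce() / edac_device_handle_ue() operate on kernel
structures that are not reproduced here.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>
#include <stdio.h>

/* Hypothetical error blocks; not taken from the patch or from EDAC. */
enum model_block { MODEL_BLOCK_L3, MODEL_BLOCK_GUC, MODEL_BLOCK_COUNT };

/* Userspace stand-in for a per-device error-accounting structure. */
struct model_edac_dev {
	unsigned long ce_count[MODEL_BLOCK_COUNT]; /* corrected errors */
	unsigned long ue_count[MODEL_BLOCK_COUNT]; /* uncorrected errors */
	bool log_ce; /* corrected errors are printed only if set */
};

/* Count a corrected error; logging is optional and off by default. */
void model_handle_ce(struct model_edac_dev *dev, enum model_block blk,
		     const char *msg)
{
	dev->ce_count[blk]++;
	if (dev->log_ce)
		printf("CE block %d: %s\n", (int)blk, msg);
}

/* Count an uncorrected error; these are always reported. */
void model_handle_ue(struct model_edac_dev *dev, enum model_block blk,
		     const char *msg)
{
	dev->ue_count[blk]++;
	fprintf(stderr, "UE block %d: %s\n", (int)blk, msg);
}

/*
 * Interrupt-style entry point. Assumes one status bit per block, which is
 * a simplification chosen for this sketch.
 */
void model_irq_handler(struct model_edac_dev *dev, uint32_t status,
		       bool fatal)
{
	for (int blk = 0; blk < MODEL_BLOCK_COUNT; blk++) {
		if (!(status & (UINT32_C(1) << blk)))
			continue;
		if (fatal)
			model_handle_ue(dev, (enum model_block)blk,
					"fatal error");
		else
			model_handle_ce(dev, (enum model_block)blk,
					"corrected error");
	}
}
```

The point of the split is that the delivery path (MSI, PCI error message, or
anything else) only affects who calls model_irq_handler(); the counting and
the logging policy for corrected errors live in one place.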
>
> Thanks,
> Aravind.
> >
> >>>
> >>>> + const enum xe_gt_driver_errors error,
> >>>> + const char *fmt, ...)
> >>>> +{
> >>>> + struct va_format vaf;
> >>>> + va_list args;
> >>>> +
> >>>> + va_start(args, fmt);
> >>>> + vaf.fmt = fmt;
> >>>> + vaf.va = &args;
> >>>> +
> >>>> + BUILD_BUG_ON(ARRAY_SIZE(xe_gt_driver_errors_to_str) !=
> >>>> + INTEL_GT_DRIVER_ERROR_COUNT);
> >>>> +
> >>>> + WARN_ON_ONCE(error >= INTEL_GT_DRIVER_ERROR_COUNT);
> >>>> +
> >>>> + gt->errors.driver[error]++;
> >>>> +
> >>>> + drm_err_ratelimited(&gt_to_xe(gt)->drm, "GT%u [%s] %pV",
> >>>> + gt->info.id,
> >>>> + xe_gt_driver_errors_to_str[error],
> >>>> + &vaf);
> >>>> + va_end(args);
> >>>> +}
> >>>> +
> >>>> struct xe_gt *xe_find_full_gt(struct xe_gt *gt) {
> >>>> struct xe_gt *search;
> >>>> diff --git a/drivers/gpu/drm/xe/xe_gt_types.h
> >>>> b/drivers/gpu/drm/xe/xe_gt_types.h
> >>>> index 8f29aba455e0..9580a40c0142 100644
> >>>> --- a/drivers/gpu/drm/xe/xe_gt_types.h
> >>>> +++ b/drivers/gpu/drm/xe/xe_gt_types.h
> >>>> @@ -33,6 +33,43 @@ enum xe_gt_type {
> >>>> typedef unsigned long xe_dss_mask_t[BITS_TO_LONGS(32 *
> >>>> XE_MAX_DSS_FUSE_REGS)]; typedef unsigned long
> >>>> xe_eu_mask_t[BITS_TO_LONGS(32 * XE_MAX_EU_FUSE_REGS)];
> >>>>
> >>>> +/* Count of GT Correctable and FATAL HW ERRORS */ enum
> >>>> +intel_gt_hw_errors {
> >>>> + INTEL_GT_HW_ERROR_COR_SUBSLICE = 0,
> >>>> + INTEL_GT_HW_ERROR_COR_L3BANK,
> >>>> + INTEL_GT_HW_ERROR_COR_L3_SNG,
> >>>> + INTEL_GT_HW_ERROR_COR_GUC,
> >>>> + INTEL_GT_HW_ERROR_COR_SAMPLER,
> >>>> + INTEL_GT_HW_ERROR_COR_SLM,
> >>>> + INTEL_GT_HW_ERROR_COR_EU_IC,
> >>>> + INTEL_GT_HW_ERROR_COR_EU_GRF,
> >>>> + INTEL_GT_HW_ERROR_FAT_SUBSLICE,
> >>>> + INTEL_GT_HW_ERROR_FAT_L3BANK,
> >>>> + INTEL_GT_HW_ERROR_FAT_ARR_BIST,
> >>>> + INTEL_GT_HW_ERROR_FAT_FPU,
> >>>> + INTEL_GT_HW_ERROR_FAT_L3_DOUB,
> >>>> + INTEL_GT_HW_ERROR_FAT_L3_ECC_CHK,
> >>>> + INTEL_GT_HW_ERROR_FAT_GUC,
> >>>> + INTEL_GT_HW_ERROR_FAT_IDI_PAR,
> >>>> + INTEL_GT_HW_ERROR_FAT_SQIDI,
> >>>> + INTEL_GT_HW_ERROR_FAT_SAMPLER,
> >>>> + INTEL_GT_HW_ERROR_FAT_SLM,
> >>>> + INTEL_GT_HW_ERROR_FAT_EU_IC,
> >>>> + INTEL_GT_HW_ERROR_FAT_EU_GRF,
> >>>> + INTEL_GT_HW_ERROR_FAT_TLB,
> >>>> + INTEL_GT_HW_ERROR_FAT_L3_FABRIC,
> >>>> + INTEL_GT_HW_ERROR_COUNT
> >>>> +};
> >>>> +
> >>>> +enum xe_gt_driver_errors {
> >>>> + INTEL_GT_DRIVER_ERROR_INTERRUPT = 0,
> >>>> + INTEL_GT_DRIVER_ERROR_COUNT
> >>>> +};
> >>>> +
> >>>> +void xe_gt_log_driver_error(struct xe_gt *gt,
> >>>> + const enum xe_gt_driver_errors error,
> >>>> + const char *fmt, ...);
> >>>> +
> >>>> struct xe_mmio_range {
> >>>> u32 start;
> >>>> u32 end;
> >>>> @@ -357,6 +394,12 @@ struct xe_gt {
> >>>> * of a steered operation
> >>>> */
> >>>> spinlock_t mcr_lock;
> >>>> +
> >>>> + struct intel_hw_errors {
> >>>> + unsigned long hw[INTEL_GT_HW_ERROR_COUNT];
> >>>> + unsigned long driver[INTEL_GT_DRIVER_ERROR_COUNT];
> >>>> + } errors;
> >>>> +
> >>>> };
> >>>>
> >>>> #endif
> >>>> diff --git a/drivers/gpu/drm/xe/xe_irq.c b/drivers/gpu/drm/xe/xe_irq.c
> >>>> index 6b922332bff1..4626f7280aaf 100644
> >>>> --- a/drivers/gpu/drm/xe/xe_irq.c
> >>>> +++ b/drivers/gpu/drm/xe/xe_irq.c
> >>>> @@ -19,6 +19,7 @@
> >>>> #include "xe_hw_engine.h"
> >>>> #include "xe_mmio.h"
> >>>>
> >>>> +#define HAS_GT_ERROR_VECTORS(xe) ((xe)->info.has_gt_error_vectors)
> >>>> static void gen3_assert_iir_is_zero(struct xe_gt *gt, i915_reg_t reg)
> >>>> {
> >>>> u32 val = xe_mmio_read32(gt, reg.reg);
> >>>> @@ -359,44 +360,281 @@ hardware_error_type_to_str(const enum hardware_error hw_err)
> >>>> }
> >>>> }
> >>>>
> >>>> +#define xe_gt_hw_err(gt, fmt, ...) \
> >>>> + drm_err_ratelimited(&gt_to_xe(gt)->drm, HW_ERR "GT%d detected "
> >>> fmt, \
> >>>> + (gt)->info.id, ##__VA_ARGS__)
> >>>
> >>> As on the previous patch, it looks like we're printing error-level kernel
> >>> messages for correctable errors (i.e., things the hardware caught and fixed
> >>> internally like ECC). Generally those kinds of things shouldn't be putting
> >>> errors in the kernel log because there's no actual problem from the end user
> >>> perspective.
> >> Agreed. Should we use a warning for correctable errors?
> >
> > Presumably we shouldn't be reporting them in dmesg at all by default.
> > It looks like the EDAC subsystem has its own ways of controlling if/when
> > stuff shows up in dmesg.
> >
> >>>
> >>>> +
> >>>> static void
> >>>> -xe_gt_hw_error_handler(struct xe_gt *gt, const enum hardware_error
> >>>> hw_err)
> >>>> +xe_gt_correctable_hw_error_stats_update(struct xe_gt *gt, unsigned
> >>>> +long errstat)
> >>>> {
> >>>> - const char *hw_err_str = hardware_error_type_to_str(hw_err);
> >>>> - u32 other_errors = ~(EU_GRF_ERROR | EU_IC_ERROR);
> >>>> - u32 errstat;
> >>>> + u32 errbit, cnt;
> >>>>
> >>>> - lockdep_assert_held(&gt_to_xe(gt)->irq.lock);
> >>>> + if (!errstat && HAS_GT_ERROR_VECTORS(gt_to_xe(gt)))
> >>>> + return;
> >>>>
> >>>> - errstat = xe_mmio_read32(gt, ERR_STAT_GT_REG(hw_err).reg);
> >>>> + for_each_set_bit(errbit, &errstat, GT_HW_ERROR_MAX_ERR_BITS) {
> >>>> + if (gt->xe->info.platform == XE_PVC && !(REG_BIT(errbit) &
> >>> PVC_COR_ERR_MASK)) {
> >>>> + xe_gt_log_driver_error(gt,
> >>> INTEL_GT_DRIVER_ERROR_INTERRUPT,
> >>>> + "UNKNOWN CORRECTABLE
> >>> error\n");
> >>>> + continue;
> >>>> + }
> >>>>
> >>>> - if (unlikely(!errstat)) {
> >>>> - DRM_ERROR("ERR_STAT_GT_REG_%s blank!\n",
> >>> hw_err_str);
> >>>> - return;
> >>>> + switch (errbit) {
> >>>> + case L3_SNG_COR_ERR:
> >>>> + gt->errors.hw[INTEL_GT_HW_ERROR_COR_L3_SNG]++;
> >>>> + xe_gt_hw_err(gt, "L3 SINGLE CORRECTABLE error\n");
> >>>> + break;
> >>>> + case GUC_COR_ERR:
> >>>> + gt->errors.hw[INTEL_GT_HW_ERROR_COR_GUC]++;
> >>>> + xe_gt_hw_err(gt, "SINGLE BIT GUC SRAM CORRECTABLE
> >>> error\n");
> >>>> + break;
> >>>> + case SAMPLER_COR_ERR:
> >>>> + gt->errors.hw[INTEL_GT_HW_ERROR_COR_SAMPLER]++;
> >>>> + xe_gt_hw_err(gt, "SINGLE BIT SAMPLER CORRECTABLE
> >>> error\n");
> >>>> + break;
> >>>> + case SLM_COR_ERR:
> >>>> + cnt = xe_mmio_read32(gt,
> >>> SLM_ECC_ERROR_CNTR(HARDWARE_ERROR_CORRECTABLE).reg);
> >>>> + gt->errors.hw[INTEL_GT_HW_ERROR_COR_SLM] = cnt;
> >>>> + xe_gt_hw_err(gt, "%u SINGLE BIT SLM CORRECTABLE
> >>> error\n", cnt);
> >>>> + break;
> >>>> + case EU_IC_COR_ERR:
> >>>> + gt->errors.hw[INTEL_GT_HW_ERROR_COR_EU_IC]++;
> >>>> + xe_gt_hw_err(gt, "SINGLE BIT EU IC CORRECTABLE error\n");
> >>>> + break;
> >>>> + case EU_GRF_COR_ERR:
> >>>> + gt->errors.hw[INTEL_GT_HW_ERROR_COR_EU_GRF]++;
> >>>> + xe_gt_hw_err(gt, "SINGLE BIT EU GRF CORRECTABLE
> >>> error\n");
> >>>> + break;
> >>>> + default:
> >>>> + xe_gt_log_driver_error(gt,
> >>> INTEL_GT_DRIVER_ERROR_INTERRUPT, "UNKNOWN CORRECTABLE error\n");
> >>>> + break;
> >>>> + }
> >>>> }
> >>>> +}
> >>>>
> >>>> - /*
> >>>> - * TODO: The GT Non Fatal Error Status Register
> >>>> - * only has reserved bitfields defined.
> >>>> - * Remove once there is something to service.
> >>>> - */
> >>>> - if (hw_err == HARDWARE_ERROR_NONFATAL) {
> >>>> - DRM_ERROR("detected Non-Fatal error\n");
> >>>> - xe_mmio_write32(gt, ERR_STAT_GT_REG(hw_err).reg,
> >>> errstat);
> >>>> +static void xe_gt_fatal_hw_error_stats_update(struct xe_gt *gt,
> >>>> +unsigned long errstat) {
> >>>> + u32 errbit, cnt;
> >>>> +
> >>>> + if (!errstat && HAS_GT_ERROR_VECTORS(gt_to_xe(gt)))
> >>>> return;
> >>>> +
> >>>> + for_each_set_bit(errbit, &errstat, GT_HW_ERROR_MAX_ERR_BITS) {
> >>>> + if (gt->xe->info.platform == XE_PVC && !(REG_BIT(errbit) &
> >>> PVC_FAT_ERR_MASK)) {
> >>>> + xe_gt_log_driver_error(gt,
> >>> INTEL_GT_DRIVER_ERROR_INTERRUPT,
> >>>> + "UNKNOWN FATAL error\n");
> >>>> + continue;
> >>>> + }
> >>>> +
> >>>> + switch (errbit) {
> >>>> + case ARRAY_BIST_FAT_ERR:
> >>>> + gt->errors.hw[INTEL_GT_HW_ERROR_FAT_ARR_BIST]++;
> >>>> + xe_gt_hw_err(gt, "Array BIST FATAL error\n");
> >>>> + break;
> >>>> + case FPU_UNCORR_FAT_ERR:
> >>>> + gt->errors.hw[INTEL_GT_HW_ERROR_FAT_FPU]++;
> >>>> + xe_gt_hw_err(gt, "FPU FATAL error\n");
> >>>> + break;
> >>>> + case L3_DOUBLE_FAT_ERR:
> >>>> + gt->errors.hw[INTEL_GT_HW_ERROR_FAT_L3_DOUB]++;
> >>>> + xe_gt_hw_err(gt, "L3 Double FATAL error\n");
> >>>> + break;
> >>>> + case L3_ECC_CHK_FAT_ERR:
> >>>> + gt->errors.hw[INTEL_GT_HW_ERROR_FAT_L3_ECC_CHK]++;
> >>>> + xe_gt_hw_err(gt, "L3 ECC Checker FATAL error\n");
> >>>> + break;
> >>>> + case GUC_FAT_ERR:
> >>>> + gt->errors.hw[INTEL_GT_HW_ERROR_FAT_GUC]++;
> >>>> + xe_gt_hw_err(gt, "GUC SRAM FATAL error\n");
> >>>> + break;
> >>>> + case IDI_PAR_FAT_ERR:
> >>>> + gt->errors.hw[INTEL_GT_HW_ERROR_FAT_IDI_PAR]++;
> >>>> + xe_gt_hw_err(gt, "IDI PARITY FATAL error\n");
> >>>> + break;
> >>>> + case SQIDI_FAT_ERR:
> >>>> + gt->errors.hw[INTEL_GT_HW_ERROR_FAT_SQIDI]++;
> >>>> + xe_gt_hw_err(gt, "SQIDI FATAL error\n");
> >>>> + break;
> >>>> + case SAMPLER_FAT_ERR:
> >>>> + gt->errors.hw[INTEL_GT_HW_ERROR_FAT_SAMPLER]++;
> >>>> + xe_gt_hw_err(gt, "SAMPLER FATAL error\n");
> >>>> + break;
> >>>> + case SLM_FAT_ERR:
> >>>> + cnt = xe_mmio_read32(gt,
> >>> SLM_ECC_ERROR_CNTR(HARDWARE_ERROR_FATAL).reg);
> >>>> + gt->errors.hw[INTEL_GT_HW_ERROR_FAT_SLM] = cnt;
> >>>> + xe_gt_hw_err(gt, "%u SLM FATAL error\n", cnt);
> >>>> + break;
> >>>> + case EU_IC_FAT_ERR:
> >>>> + gt->errors.hw[INTEL_GT_HW_ERROR_FAT_EU_IC]++;
> >>>> + xe_gt_hw_err(gt, "EU IC FATAL error\n");
> >>>> + break;
> >>>> + case EU_GRF_FAT_ERR:
> >>>> + gt->errors.hw[INTEL_GT_HW_ERROR_FAT_EU_GRF]++;
> >>>> + xe_gt_hw_err(gt, "EU GRF FATAL error\n");
> >>>> + break;
> >>>> + default:
> >>>> + xe_gt_log_driver_error(gt,
> >>> INTEL_GT_DRIVER_ERROR_INTERRUPT,
> >>>> + "UNKNOWN FATAL error\n");
> >>>> + break;
> >>>> + }
> >>>> }
> >>>> +}
> >>>>
> >>>> - /*
> >>>> - * TODO: The remaining GT errors don't have a
> >>>> - * need for targeted logging at the moment. We
> >>>> - * still want to log detection of these errors, but
> >>>> - * let's aggregate them until someone has a need for them.
> >>>> - */
> >>>> - if (errstat & other_errors)
> >>>> - DRM_ERROR("detected hardware error(s) in
> >>> ERR_STAT_GT_REG_%s: 0x%08x\n",
> >>>> - hw_err_str, errstat & other_errors);
> >>>> +static void
> >>>> +xe_gt_hw_error_handler(struct xe_gt *gt, const enum hardware_error
> >>>> +hw_err) {
> >>>> + const char *hw_err_str = hardware_error_type_to_str(hw_err);
> >>>> + unsigned long errstat;
> >>>> +
> >>>> + lockdep_assert_held(&gt_to_xe(gt)->irq.lock);
> >>>>
> >>>> - xe_mmio_write32(gt, ERR_STAT_GT_REG(hw_err).reg, errstat);
> >>>> + if (!HAS_GT_ERROR_VECTORS(gt_to_xe(gt))) {
> >>>> + errstat = xe_mmio_read32(gt,
> >>> ERR_STAT_GT_REG(hw_err).reg);
> >>>> + if (unlikely(!errstat)) {
> >>>> + xe_gt_log_driver_error(gt,
> >>> INTEL_GT_DRIVER_ERROR_INTERRUPT,
> >>>> + "ERR_STAT_GT_REG_%s
> >>> blank!\n", hw_err_str);
> >>>> + return;
> >>>> + }
> >>>> + }
> >>>> +
> >>>> + switch (hw_err) {
> >>>> + case HARDWARE_ERROR_CORRECTABLE:
> >>>> + if (HAS_GT_ERROR_VECTORS(gt_to_xe(gt))) {
> >>>> + bool error = false;
> >>>> + int i;
> >>>> +
> >>>> + errstat = 0;
> >>>> + for (i = 0; i < ERR_STAT_GT_COR_VCTR_LEN; i++) {
> >>>> + u32 err_type =
> >>> ERR_STAT_GT_COR_VCTR_LEN;
> >>>> + unsigned long vctr;
> >>>> + const char *name;
> >>>> +
> >>>> + vctr = xe_mmio_read32(gt,
> >>> ERR_STAT_GT_COR_VCTR_REG(i).reg);
> >>>> + if (!vctr)
> >>>> + continue;
> >>>> +
> >>>> + switch (i) {
> >>>> + case ERR_STAT_GT_VCTR0:
> >>>> + case ERR_STAT_GT_VCTR1:
> >>>> + err_type =
> >>> INTEL_GT_HW_ERROR_COR_SUBSLICE;
> >>>> + gt->errors.hw[err_type] +=
> >>> hweight32(vctr);
> >>>> + name = "SUBSLICE";
> >>>> +
> >>>> + /* Avoid second read/write to error
> >>> status register*/
> >>>> + if (errstat)
> >>>> + break;
> >>>> +
> >>>> + errstat = xe_mmio_read32(gt,
> >>> ERR_STAT_GT_REG(hw_err).reg);
> >>>> + xe_gt_hw_err(gt,
> >>> "ERR_STAT_GT_CORRECTABLE:0x%08lx\n",
> >>>> + errstat);
> >>>> +
> >>> xe_gt_correctable_hw_error_stats_update(gt, errstat);
> >>>> + if (errstat)
> >>>> + xe_mmio_write32(gt,
> >>> ERR_STAT_GT_REG(hw_err).reg,
> >>>> + errstat);
> >>>> + break;
> >>>> +
> >>>> + case ERR_STAT_GT_VCTR2:
> >>>> + case ERR_STAT_GT_VCTR3:
> >>>> + err_type =
> >>> INTEL_GT_HW_ERROR_COR_L3BANK;
> >>>> + gt->errors.hw[err_type] +=
> >>> hweight32(vctr);
> >>>> + name = "L3 BANK";
> >>>> + break;
> >>>> + default:
> >>>> + name = "UNKNOWN";
> >>>> + break;
> >>>> + }
> >>>> + xe_mmio_write32(gt,
> >>> ERR_STAT_GT_COR_VCTR_REG(i).reg, vctr);
> >>>> + xe_gt_hw_err(gt, "%s CORRECTABLE error,
> >>> ERR_VECT_GT_CORRECTABLE_%d:0x%08lx\n",
> >>>> + name, i, vctr);
> >>>> + error = true;
> >>>> + }
> >>>> +
> >>>> + if (!error)
> >>>> + xe_gt_hw_err(gt, "UNKNOWN CORRECTABLE
> >>> error\n");
> >>>> + } else {
> >>>> + xe_gt_correctable_hw_error_stats_update(gt,
> >>> errstat);
> >>>> + xe_gt_hw_err(gt,
> >>> "ERR_STAT_GT_CORRECTABLE:0x%08lx\n", errstat);
> >>>> + }
> >>>> + break;
> >>>> + case HARDWARE_ERROR_NONFATAL:
> >>>> + /*
> >>>> + * TODO: The GT Non Fatal Error Status Register
> >>>> + * only has reserved bitfields defined.
> >>>> + * Remove once there is something to service.
> >>>> + */
> >>>> + drm_err_ratelimited(&gt_to_xe(gt)->drm, HW_ERR "detected
> >>> Non-Fatal error\n");
> >>>> + break;
> >>>> + case HARDWARE_ERROR_FATAL:
> >>>> + if (HAS_GT_ERROR_VECTORS(gt_to_xe(gt))) {
> >>>> + bool error = false;
> >>>> + int i;
> >>>> +
> >>>> + errstat = 0;
> >>>> + for (i = 0; i < ERR_STAT_GT_FATAL_VCTR_LEN; i++) {
> >>>> + u32 err_type =
> >>> ERR_STAT_GT_FATAL_VCTR_LEN;
> >>>> + unsigned long vctr;
> >>>> + const char *name;
> >>>> +
> >>>> + vctr = xe_mmio_read32(gt,
> >>> ERR_STAT_GT_FATAL_VCTR_REG(i).reg);
> >>>> + if (!vctr)
> >>>> + continue;
> >>>> +
> >>>> + /* i represents the vector register index */
> >>>> + switch (i) {
> >>>> + case ERR_STAT_GT_VCTR0:
> >>>> + case ERR_STAT_GT_VCTR1:
> >>>> + err_type =
> >>> INTEL_GT_HW_ERROR_FAT_SUBSLICE;
> >>>> + gt->errors.hw[err_type] +=
> >>> hweight32(vctr);
> >>>> + name = "SUBSLICE";
> >>>> +
> >>>> + /*Avoid second read/write to error
> >>> status register.*/
> >>>> + if (errstat)
> >>>> + break;
> >>>> +
> >>>> + errstat = xe_mmio_read32(gt,
> >>> ERR_STAT_GT_REG(hw_err).reg);
> >>>> + xe_gt_hw_err(gt,
> >>> "ERR_STAT_GT_FATAL:0x%08lx\n", errstat);
> >>>> +
> >>> xe_gt_fatal_hw_error_stats_update(gt, errstat);
> >>>> + if (errstat)
> >>>> + xe_mmio_write32(gt,
> >>> ERR_STAT_GT_REG(hw_err).reg,
> >>>> + errstat);
> >>>> + break;
> >>>> +
> >>>> + case ERR_STAT_GT_VCTR2:
> >>>> + case ERR_STAT_GT_VCTR3:
> >>>> + err_type =
> >>> INTEL_GT_HW_ERROR_FAT_L3BANK;
> >>>> + gt->errors.hw[err_type] +=
> >>> hweight32(vctr);
> >>>> + name = "L3 BANK";
> >>>> + break;
> >>>> + case ERR_STAT_GT_VCTR6:
> >>>> + gt->errors.hw[INTEL_GT_HW_ERROR_FAT_TLB] += hweight16(vctr);
> >>>> + name = "TLB";
> >>>> + break;
> >>>> + case ERR_STAT_GT_VCTR7:
> >>>> + gt->errors.hw[INTEL_GT_HW_ERROR_FAT_L3_FABRIC] += hweight8(vctr);
> >>>> + name = "L3 FABRIC";
> >>>> + break;
> >>>> + default:
> >>>> + name = "UNKNOWN";
> >>>> + break;
> >>>> + }
> >>>> + xe_mmio_write32(gt,
> >>> ERR_STAT_GT_FATAL_VCTR_REG(i).reg, vctr);
> >>>> + xe_gt_hw_err(gt, "%s FATAL error,
> >>> ERR_VECT_GT_FATAL_%d:0x%08lx\n",
> >>>> + name, i, vctr);
> >>>> + error = true;
> >>>> + }
> >>>> + if (!error)
> >>>> + xe_gt_hw_err(gt, "UNKNOWN FATAL
> >>> error\n");
> >>>> + } else {
> >>>> + xe_gt_fatal_hw_error_stats_update(gt, errstat);
> >>>> + xe_gt_hw_err(gt, "ERR_STAT_GT_FATAL:0x%08lx\n",
> >>> errstat);
> >>>> + }
> >>>> + break;
> >>>> + default:
> >>>> + break;
> >>>> + }
> >>>> +
> >>>> + if (!HAS_GT_ERROR_VECTORS(gt_to_xe(gt)))
> >>>> + xe_mmio_write32(gt, ERR_STAT_GT_REG(hw_err).reg,
> >>> errstat);
> >>>> }
> >>>>
> >>>> static void
> >>>> @@ -409,7 +647,8 @@ xe_hw_error_source_handler(struct xe_gt *gt,
> >>> const enum hardware_error hw_err)
> >>>> spin_lock_irqsave(&gt_to_xe(gt)->irq.lock, flags);
> >>>> errsrc = xe_mmio_read32(gt, DEV_ERR_STAT_REG(hw_err).reg);
> >>>> if (unlikely(!errsrc)) {
> >>>> - DRM_ERROR("DEV_ERR_STAT_REG_%s blank!\n",
> >>> hw_err_str);
> >>>> + xe_gt_log_driver_error(gt,
> >>> INTEL_GT_DRIVER_ERROR_INTERRUPT,
> >>>> + "DEV_ERR_STAT_REG_%s blank!\n",
> >>> hw_err_str);
> >>>> goto out_unlock;
> >>>> }
> >>>>
> >>>> @@ -417,8 +656,9 @@ xe_hw_error_source_handler(struct xe_gt *gt,
> >>> const enum hardware_error hw_err)
> >>>> xe_gt_hw_error_handler(gt, hw_err);
> >>>>
> >>>> if (errsrc & ~DEV_ERR_STAT_GT_ERROR)
> >>>> - DRM_ERROR("non-GT hardware error(s) in
> >>> DEV_ERR_STAT_REG_%s: 0x%08x\n",
> >>>> - hw_err_str, errsrc & ~DEV_ERR_STAT_GT_ERROR);
> >>>> + xe_gt_log_driver_error(gt,
> >>> INTEL_GT_DRIVER_ERROR_INTERRUPT,
> >>>> + "non-GT hardware error(s) in
> >>> DEV_ERR_STAT_REG_%s: 0x%08x\n",
> >>>> + hw_err_str, errsrc &
> >>> ~DEV_ERR_STAT_GT_ERROR);
> >>>>
> >>>> xe_mmio_write32(gt, DEV_ERR_STAT_REG(hw_err).reg, errsrc);
> >>>>
> >>>> @@ -634,12 +874,44 @@ static void irq_uninstall(struct drm_device *drm,
> >>> void *arg)
> >>>> pci_disable_msi(pdev);
> >>>> }
> >>>>
> >>>> +/**
> >>>> + * process_hw_errors - checks for the occurrence of HW errors
> >>>> + *
> >>>> + * This checks for the HW Errors including FATAL error that might
> >>>> + * have occurred in the previous boot of the driver which will
> >>>> + * initiate PCIe FLR reset of the device and cause the
> >>>> + * driver to reload.
> >>>
> >>> Is this saying that there's already been a PCIe FLR and you're trying to read
> >>> the registers after that reset has happened? The bspec indicates that these
> >>> registers have 'DEV' style reset, so they wouldn't be able to preserve their
> >>> values across a reset.
> >> The registers preserve their values across reset.
> >> BSPEC: 50875
> >>>
> >>>> + */
> >>>> +static void process_hw_errors(struct xe_device *xe) {
> >>>> + struct xe_gt *gt0 = xe_device_get_gt(xe, 0);
> >>>> + u32 dev_pcieerr_status, master_ctl;
> >>>> + struct xe_gt *gt;
> >>>> + int i;
> >>>> +
> >>>> + dev_pcieerr_status = xe_mmio_read32(gt0,
> >>> DEV_PCIEERR_STATUS.reg);
> >>>> +
> >>>> + for_each_gt(gt, xe, i) {
> >>>> + if (dev_pcieerr_status & DEV_PCIEERR_IS_FATAL(i))
> >>>> + xe_hw_error_source_handler(gt,
> >>> HARDWARE_ERROR_FATAL);
> >>>> +
> >>>> + master_ctl = xe_mmio_read32(gt,
> >>> GEN11_GFX_MSTR_IRQ.reg);
> >>>> + xe_mmio_write32(gt, GEN11_GFX_MSTR_IRQ.reg,
> >>> master_ctl);
> >>>> + xe_hw_error_irq_handler(gt, master_ctl);
> >>>> + }
> >>>> + if (dev_pcieerr_status)
> >>>> + xe_mmio_write32(gt, DEV_PCIEERR_STATUS.reg,
> >>> dev_pcieerr_status); }
> >>>> +
> >>>> int xe_irq_install(struct xe_device *xe) {
> >>>> int irq = to_pci_dev(xe->drm.dev)->irq;
> >>>> irq_handler_t irq_handler;
> >>>> int err;
> >>>>
> >>>> + if (IS_DGFX(xe))
> >>>> + process_hw_errors(xe);
> >>>
> >>> Why is this conditional on DGFX? From what I can see this also applies to
> >>> integrated platforms like MTL too.
> >> From a RAS perspective, error reporting is required on DGFX only.
> >> sysman doesn't use these counters on IGFX.
> >
> > I don't think we should add artificial limitations to the kernel driver
> > just because one wrapper library (which probably won't even get used by
> > most users) has them. If the hardware can report errors, and if users
> > can bypass sysman and just use EDAC stuff, then I don't see a reason to
> > hide the errors on integrated platforms?
> >
> >
> > Matt
> >
> >>>
> >>>
> >>> Matt
> >>>
> >>>> +
> >>>> irq_handler = xe_irq_handler(xe);
> >>>> if (!irq_handler) {
> >>>> drm_err(&xe->drm, "No supported interrupt handler"); diff --
> >>> git
> >>>> a/drivers/gpu/drm/xe/xe_pci.c b/drivers/gpu/drm/xe/xe_pci.c index
> >>>> 1844cff8fba8..69098194cef8 100644
> >>>> --- a/drivers/gpu/drm/xe/xe_pci.c
> >>>> +++ b/drivers/gpu/drm/xe/xe_pci.c
> >>>> @@ -73,6 +73,7 @@ struct xe_device_desc {
> >>>> bool has_range_tlb_invalidation;
> >>>> bool has_asid;
> >>>> bool has_link_copy_engine;
> >>>> + bool has_gt_error_vectors;
> >>>> };
> >>>>
> >>>> __diag_push();
> >>>> @@ -232,6 +233,7 @@ static const struct xe_device_desc pvc_desc = {
> >>>> .supports_usm = true,
> >>>> .has_asid = true,
> >>>> .has_link_copy_engine = true,
> >>>> + .has_gt_error_vectors = true,
> >>>> };
> >>>>
> >>>> #define MTL_MEDIA_ENGINES \
> >>>> @@ -418,6 +420,7 @@ static int xe_pci_probe(struct pci_dev *pdev, const
> >>> struct pci_device_id *ent)
> >>>> xe->info.vm_max_level = desc->vm_max_level;
> >>>> xe->info.supports_usm = desc->supports_usm;
> >>>> xe->info.has_asid = desc->has_asid;
> >>>> + xe->info.has_gt_error_vectors = desc->has_gt_error_vectors;
> >>>> xe->info.has_flat_ccs = desc->has_flat_ccs;
> >>>> xe->info.has_4tile = desc->has_4tile;
> >>>> xe->info.has_range_tlb_invalidation =
> >>>> desc->has_range_tlb_invalidation;
> >>>> --
> >>>> 2.25.1
> >>>>
> >>>
> >>> --
> >>> Matt Roper
> >>> Graphics Software Engineer
> >>> Linux GPU Platform Enablement
> >>> Intel Corporation
> >
--
Matt Roper
Graphics Software Engineer
Linux GPU Platform Enablement
Intel Corporation
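
The per-bit walk in the quoted xe_gt_fatal_hw_error_stats_update() can be
tried outside the kernel. The following is only a minimal userspace sketch,
not the driver code: struct fatal_counts and count_fatal_bits() are
hypothetical names, the plain loop stands in for for_each_set_bit(), and the
bit positions are copied from the xe_regs.h hunk of this patch.

```c
#include <stdint.h>

#define BIT(n) (1u << (n))

/* Bit positions copied from the xe_regs.h hunk of this patch. */
#define EU_GRF_FAT_ERR			15
#define SLM_FAT_ERR			13
#define GUC_FAT_ERR			6
#define FPU_UNCORR_FAT_ERR		3
#define GT_HW_ERROR_MAX_ERR_BITS	16

#define PVC_FAT_ERR_MASK \
	(BIT(FPU_UNCORR_FAT_ERR) | BIT(GUC_FAT_ERR) | \
	 BIT(SLM_FAT_ERR) | BIT(EU_GRF_FAT_ERR))

/* Hypothetical stand-in for the gt->errors counters. */
struct fatal_counts {
	unsigned long known;	/* bits covered by PVC_FAT_ERR_MASK */
	unsigned long unknown;	/* bits the PVC mask does not define */
};

/* Visit every set bit of errstat, as for_each_set_bit() does in the patch. */
static void count_fatal_bits(uint32_t errstat, struct fatal_counts *c)
{
	unsigned int bit;

	for (bit = 0; bit < GT_HW_ERROR_MAX_ERR_BITS; bit++) {
		if (!(errstat & BIT(bit)))
			continue;

		if (BIT(bit) & PVC_FAT_ERR_MASK)
			c->known++;	/* the patch bumps a per-type counter here */
		else
			c->unknown++;	/* the patch logs "UNKNOWN FATAL error" */
	}
}
```

Passing a status word with GUC_FAT_ERR, SLM_FAT_ERR and bit 1 set counts two
known bits and one unknown bit, because bit 1 (ARRAY_BIST_FAT_ERR) is not
part of PVC_FAT_ERR_MASK.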
* Re: [Intel-xe] [PATCH 2/4] drm/xe/ras: Log the GT hw errors.
2023-04-06 9:26 ` [Intel-xe] [PATCH 2/4] drm/xe/ras: Log the GT hw errors Himal Prasad Ghimiray
2023-04-24 13:37 ` Dafna Hirschfeld
2023-04-25 0:22 ` Matt Roper
@ 2023-04-25 4:26 ` Iddamsetty, Aravind
2 siblings, 0 replies; 22+ messages in thread
From: Iddamsetty, Aravind @ 2023-04-25 4:26 UTC (permalink / raw)
To: Himal Prasad Ghimiray, intel-xe
On 06-04-2023 14:56, Himal Prasad Ghimiray wrote:
> From: Aravind Iddamsetty <aravind.iddamsetty@intel.com>
>
> Count the CORRECTABLE and FATAL GT hardware errors as
> signaled by the relevant interrupts and their respective registers.
>
> Count non-relevant interrupts as driver interrupt errors.
>
> For platforms that support error vector registers, count and report
> the respective vector errors.
>
> Co-authored-by: Himal Prasad Ghimiray <himal.prasad.ghimiray@intel.com>
> Signed-off-by: Aravind Iddamsetty <aravind.iddamsetty@intel.com>
> Signed-off-by: Himal Prasad Ghimiray <himal.prasad.ghimiray@intel.com>
> ---
> drivers/gpu/drm/xe/regs/xe_regs.h | 77 ++++++-
> drivers/gpu/drm/xe/xe_device_types.h | 2 +
> drivers/gpu/drm/xe/xe_gt.c | 29 +++
> drivers/gpu/drm/xe/xe_gt_types.h | 43 ++++
> drivers/gpu/drm/xe/xe_irq.c | 332 ++++++++++++++++++++++++---
> drivers/gpu/drm/xe/xe_pci.c | 3 +
> 6 files changed, 453 insertions(+), 33 deletions(-)
>
> diff --git a/drivers/gpu/drm/xe/regs/xe_regs.h b/drivers/gpu/drm/xe/regs/xe_regs.h
> index dff74b093d4e..b3d35d0c5a77 100644
> --- a/drivers/gpu/drm/xe/regs/xe_regs.h
> +++ b/drivers/gpu/drm/xe/regs/xe_regs.h
> @@ -122,14 +122,50 @@ enum hardware_error {
> HARDWARE_ERROR_MAX,
> };
>
> +#define DEV_PCIEERR_STATUS _MMIO(0x100180)
> +#define DEV_PCIEERR_IS_FATAL(x) (REG_BIT(2) << (x * 4))
> #define _DEV_ERR_STAT_FATAL 0x100174
> #define _DEV_ERR_STAT_NONFATAL 0x100178
> #define _DEV_ERR_STAT_CORRECTABLE 0x10017c
> #define DEV_ERR_STAT_REG(x) _MMIO(_PICK_EVEN((x), \
> _DEV_ERR_STAT_CORRECTABLE, \
> _DEV_ERR_STAT_NONFATAL))
> +
> #define DEV_ERR_STAT_GT_ERROR REG_BIT(0)
>
> +enum gt_vctr_registers {
> + ERR_STAT_GT_VCTR0 = 0,
> + ERR_STAT_GT_VCTR1,
> + ERR_STAT_GT_VCTR2,
> + ERR_STAT_GT_VCTR3,
> + ERR_STAT_GT_VCTR4,
> + ERR_STAT_GT_VCTR5,
> + ERR_STAT_GT_VCTR6,
> + ERR_STAT_GT_VCTR7,
> +};
> +
> +#define ERR_STAT_GT_COR_VCTR_LEN (4)
> +#define _ERR_STAT_GT_COR_VCTR_0 0x1002a0
> +#define _ERR_STAT_GT_COR_VCTR_1 0x1002a4
> +#define _ERR_STAT_GT_COR_VCTR_2 0x1002a8
> +#define _ERR_STAT_GT_COR_VCTR_3 0x1002ac
> +#define ERR_STAT_GT_COR_VCTR_REG(x) _MMIO(_PICK_EVEN((x), \
> + _ERR_STAT_GT_COR_VCTR_0, \
> + _ERR_STAT_GT_COR_VCTR_1))
> +
> +#define ERR_STAT_GT_FATAL_VCTR_LEN (8)
> +#define _ERR_STAT_GT_FATAL_VCTR_0 0x100260
> +#define _ERR_STAT_GT_FATAL_VCTR_1 0x100264
> +#define _ERR_STAT_GT_FATAL_VCTR_2 0x100268
> +#define _ERR_STAT_GT_FATAL_VCTR_3 0x10026c
> +#define _ERR_STAT_GT_FATAL_VCTR_4 0x100270
> +#define _ERR_STAT_GT_FATAL_VCTR_5 0x100274
> +#define _ERR_STAT_GT_FATAL_VCTR_6 0x100278
> +#define _ERR_STAT_GT_FATAL_VCTR_7 0x10027c
> +#define ERR_STAT_GT_FATAL_VCTR_REG(x) _MMIO(_PICK_EVEN((x), \
> + _ERR_STAT_GT_FATAL_VCTR_0, \
> + _ERR_STAT_GT_FATAL_VCTR_1))
> +
> #define _ERR_STAT_GT_COR 0x100160
> #define _ERR_STAT_GT_NONFATAL 0x100164
> #define _ERR_STAT_GT_FATAL 0x100168
> @@ -137,7 +173,42 @@ enum hardware_error {
> _ERR_STAT_GT_COR, \
> _ERR_STAT_GT_NONFATAL))
>
> -#define EU_GRF_ERROR REG_BIT(15)
> -#define EU_IC_ERROR REG_BIT(14)
> -
> +#define EU_GRF_COR_ERR (15)
> +#define EU_IC_COR_ERR (14)
> +#define SLM_COR_ERR (13)
> +#define SAMPLER_COR_ERR (12)
> +#define GUC_COR_ERR (1)
> +#define L3_SNG_COR_ERR (0)
> +
> +#define PVC_COR_ERR_MASK \
> + (REG_BIT(GUC_COR_ERR) | \
> + REG_BIT(SLM_COR_ERR) | \
> + REG_BIT(EU_IC_COR_ERR) | \
> + REG_BIT(EU_GRF_COR_ERR))
> +
> +#define EU_GRF_FAT_ERR (15)
> +#define EU_IC_FAT_ERR (14)
> +#define SLM_FAT_ERR (13)
> +#define SAMPLER_FAT_ERR (12)
> +#define SQIDI_FAT_ERR (9)
> +#define IDI_PAR_FAT_ERR (8)
> +#define GUC_FAT_ERR (6)
> +#define L3_ECC_CHK_FAT_ERR (5)
> +#define L3_DOUBLE_FAT_ERR (4)
> +#define FPU_UNCORR_FAT_ERR (3)
> +#define ARRAY_BIST_FAT_ERR (1)
> +
> +#define PVC_FAT_ERR_MASK \
> + (REG_BIT(FPU_UNCORR_FAT_ERR) | \
> + REG_BIT(GUC_FAT_ERR) | \
> + REG_BIT(SLM_FAT_ERR) | \
> + REG_BIT(EU_GRF_FAT_ERR))
> +
> +#define GT_HW_ERROR_MAX_ERR_BITS 16
> +
> +#define _SLM_ECC_ERROR_CNT 0xe7f4
> +#define _SLM_UNCORR_ECC_ERROR_CNT 0xe7c0
> +#define SLM_ECC_ERROR_CNTR(x) _MMIO((x) == HARDWARE_ERROR_CORRECTABLE ? \
> + _SLM_ECC_ERROR_CNT : \
> + _SLM_UNCORR_ECC_ERROR_CNT)
> #endif
> diff --git a/drivers/gpu/drm/xe/xe_device_types.h b/drivers/gpu/drm/xe/xe_device_types.h
> index 88f863edc41c..ecabf4d6690d 100644
> --- a/drivers/gpu/drm/xe/xe_device_types.h
> +++ b/drivers/gpu/drm/xe/xe_device_types.h
> @@ -99,6 +99,8 @@ struct xe_device {
> bool has_link_copy_engine;
> /** @enable_display: display enabled */
> bool enable_display;
> + /** @has_gt_error_vectors: whether platform supports ERROR VECTORS */
> + bool has_gt_error_vectors;
>
> #if IS_ENABLED(CONFIG_DRM_XE_DISPLAY)
> struct xe_device_display_info {
> diff --git a/drivers/gpu/drm/xe/xe_gt.c b/drivers/gpu/drm/xe/xe_gt.c
> index bc821f431c45..ce9ce2748394 100644
> --- a/drivers/gpu/drm/xe/xe_gt.c
> +++ b/drivers/gpu/drm/xe/xe_gt.c
> @@ -44,6 +44,35 @@
> #include "xe_wa.h"
> #include "xe_wopcm.h"
>
> +static const char * const xe_gt_driver_errors_to_str[] = {
> + [INTEL_GT_DRIVER_ERROR_INTERRUPT] = "INTERRUPT",
> +};
> +
> +void xe_gt_log_driver_error(struct xe_gt *gt,
> + const enum xe_gt_driver_errors error,
> + const char *fmt, ...)
> +{
> + struct va_format vaf;
> + va_list args;
> +
> + va_start(args, fmt);
> + vaf.fmt = fmt;
> + vaf.va = &args;
> +
> + BUILD_BUG_ON(ARRAY_SIZE(xe_gt_driver_errors_to_str) !=
> + INTEL_GT_DRIVER_ERROR_COUNT);
> +
> + WARN_ON_ONCE(error >= INTEL_GT_DRIVER_ERROR_COUNT);
> +
> + gt->errors.driver[error]++;
> +
> + drm_err_ratelimited(&gt_to_xe(gt)->drm, "GT%u [%s] %pV",
> + gt->info.id,
> + xe_gt_driver_errors_to_str[error],
> + &vaf);
> + va_end(args);
> +}
> +
> struct xe_gt *xe_find_full_gt(struct xe_gt *gt)
> {
> struct xe_gt *search;
> diff --git a/drivers/gpu/drm/xe/xe_gt_types.h b/drivers/gpu/drm/xe/xe_gt_types.h
> index 8f29aba455e0..9580a40c0142 100644
> --- a/drivers/gpu/drm/xe/xe_gt_types.h
> +++ b/drivers/gpu/drm/xe/xe_gt_types.h
> @@ -33,6 +33,43 @@ enum xe_gt_type {
> typedef unsigned long xe_dss_mask_t[BITS_TO_LONGS(32 * XE_MAX_DSS_FUSE_REGS)];
> typedef unsigned long xe_eu_mask_t[BITS_TO_LONGS(32 * XE_MAX_EU_FUSE_REGS)];
>
> +/* Count of GT Correctable and FATAL HW ERRORS */
> +enum intel_gt_hw_errors {
> + INTEL_GT_HW_ERROR_COR_SUBSLICE = 0,
> + INTEL_GT_HW_ERROR_COR_L3BANK,
> + INTEL_GT_HW_ERROR_COR_L3_SNG,
> + INTEL_GT_HW_ERROR_COR_GUC,
> + INTEL_GT_HW_ERROR_COR_SAMPLER,
> + INTEL_GT_HW_ERROR_COR_SLM,
> + INTEL_GT_HW_ERROR_COR_EU_IC,
> + INTEL_GT_HW_ERROR_COR_EU_GRF,
> + INTEL_GT_HW_ERROR_FAT_SUBSLICE,
> + INTEL_GT_HW_ERROR_FAT_L3BANK,
> + INTEL_GT_HW_ERROR_FAT_ARR_BIST,
> + INTEL_GT_HW_ERROR_FAT_FPU,
> + INTEL_GT_HW_ERROR_FAT_L3_DOUB,
> + INTEL_GT_HW_ERROR_FAT_L3_ECC_CHK,
> + INTEL_GT_HW_ERROR_FAT_GUC,
> + INTEL_GT_HW_ERROR_FAT_IDI_PAR,
> + INTEL_GT_HW_ERROR_FAT_SQIDI,
> + INTEL_GT_HW_ERROR_FAT_SAMPLER,
> + INTEL_GT_HW_ERROR_FAT_SLM,
> + INTEL_GT_HW_ERROR_FAT_EU_IC,
> + INTEL_GT_HW_ERROR_FAT_EU_GRF,
> + INTEL_GT_HW_ERROR_FAT_TLB,
> + INTEL_GT_HW_ERROR_FAT_L3_FABRIC,
> + INTEL_GT_HW_ERROR_COUNT
> +};
> +
> +enum xe_gt_driver_errors {
> + INTEL_GT_DRIVER_ERROR_INTERRUPT = 0,
> + INTEL_GT_DRIVER_ERROR_COUNT
> +};
Let's have driver errors as a separate patch and limit this to only HW
errors.
Thanks,
Aravind.
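
The register macros in the quoted xe_regs.h hunk rely on _PICK_EVEN(), which
computes the address of evenly spaced registers from the first two of them.
The sketch below reproduces that arithmetic in userspace so the offsets can
be checked by hand. It rests on assumptions: PICK_EVEN, the helper function
names and the numeric values of enum hardware_error (0, 1, 2) are written
out here for illustration, since only HARDWARE_ERROR_MAX is visible in the
quoted hunk.

```c
#include <stdint.h>

/* Same arithmetic as the kernel's _PICK_EVEN(): a + index * (b - a). */
#define PICK_EVEN(index, a, b) ((a) + (index) * ((b) - (a)))

#define BIT(n) (1u << (n))

/* Assumed values; the quoted hunk only shows HARDWARE_ERROR_MAX. */
enum hardware_error {
	HARDWARE_ERROR_CORRECTABLE = 0,
	HARDWARE_ERROR_NONFATAL = 1,
	HARDWARE_ERROR_FATAL = 2,
};

/* Offsets copied from the patch. */
#define DEV_ERR_STAT_CORRECTABLE	0x10017c
#define DEV_ERR_STAT_NONFATAL		0x100178
#define ERR_STAT_GT_FATAL_VCTR_0	0x100260
#define ERR_STAT_GT_FATAL_VCTR_1	0x100264

/* DEV_ERR_STAT_REG(x): the stride is negative, so FATAL has the lowest offset. */
static uint32_t dev_err_stat_offset(enum hardware_error hw_err)
{
	return PICK_EVEN((int)hw_err, DEV_ERR_STAT_CORRECTABLE,
			 DEV_ERR_STAT_NONFATAL);
}

/* ERR_STAT_GT_FATAL_VCTR_REG(x): eight registers, 4 bytes apart. */
static uint32_t fatal_vctr_offset(int i)
{
	return PICK_EVEN(i, ERR_STAT_GT_FATAL_VCTR_0, ERR_STAT_GT_FATAL_VCTR_1);
}

/* DEV_PCIEERR_IS_FATAL(x): bit 2 of a 4-bit field per GT. */
static uint32_t dev_pcieerr_is_fatal(unsigned int gt_id)
{
	return BIT(2) << (gt_id * 4);
}
```

Under these assumptions, dev_err_stat_offset(HARDWARE_ERROR_FATAL) evaluates
to 0x100174, which agrees with _DEV_ERR_STAT_FATAL in the patch, and
fatal_vctr_offset(7) evaluates to 0x10027c, which agrees with
_ERR_STAT_GT_FATAL_VCTR_7.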
* [Intel-xe] [PATCH 3/4] drm/xe/ras: Count SOC and SGUNIT errors
2023-04-06 9:26 [Intel-xe] [PATCH 0/4] RFC: drm/xe/ras: Supporting RAS on XE Himal Prasad Ghimiray
` (2 preceding siblings ...)
2023-04-06 9:26 ` [Intel-xe] [PATCH 2/4] drm/xe/ras: Log the GT hw errors Himal Prasad Ghimiray
@ 2023-04-06 9:26 ` Himal Prasad Ghimiray
2023-04-06 9:26 ` [Intel-xe] [PATCH 4/4] drm/xe/ras: Add support for reporting CSC HW and FW errors Himal Prasad Ghimiray
2023-04-06 12:25 ` [Intel-xe] [PATCH 0/4] RFC: drm/xe/ras: Supporting RAS on XE Jani Nikula
5 siblings, 0 replies; 22+ messages in thread
From: Himal Prasad Ghimiray @ 2023-04-06 9:26 UTC (permalink / raw)
To: intel-xe; +Cc: Himal Prasad Ghimiray
From: Aravind Iddamsetty <aravind.iddamsetty@intel.com>
Count the SOC and SGUNIT hardware errors as signaled by the relevant
interrupts and their respective registers.
Co-authored-by: Himal Prasad Ghimiray <himal.prasad.ghimiray@intel.com>
Signed-off-by: Aravind Iddamsetty <aravind.iddamsetty@intel.com>
Signed-off-by: Himal Prasad Ghimiray <himal.prasad.ghimiray@intel.com>
---
drivers/gpu/drm/xe/regs/xe_regs.h | 116 ++++++++++++++-
drivers/gpu/drm/xe/xe_device.c | 6 +
drivers/gpu/drm/xe/xe_gt.c | 1 +
drivers/gpu/drm/xe/xe_gt_types.h | 28 ++++
drivers/gpu/drm/xe/xe_irq.c | 236 +++++++++++++++++++++++++++++-
5 files changed, 381 insertions(+), 6 deletions(-)
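
The SOC_ERR_INDEX() macro added to xe_gt_types.h by this patch packs the IEH
number, the register group, the error severity and the error bit into one
xarray index. The following is a rough userspace sketch of that layout, not
the driver code: ilog2(SOC_HW_ERR_MAX_BITS) is written out as 5, the decode
helpers soc_index_ieh(), soc_index_hw_err() and soc_index_errbit() are
hypothetical, and HARDWARE_ERROR_FATAL is assumed to be 2.

```c
#include <stdint.h>

#define BIT(n) (1ul << (n))

/* Field layout from the xe_gt_types.h hunk; ilog2(32) written out as 5. */
#define SOC_HW_ERR_SHIFT	5
#define HW_ERR_TYPE_BITS	2
#define REG_GROUP_BITS		1
#define REG_GROUP_SHIFT		(HW_ERR_TYPE_BITS + SOC_HW_ERR_SHIFT)
#define IEH_SHIFT		(REG_GROUP_SHIFT + REG_GROUP_BITS)
#define IEH_MASK		0x1
#define SOC_ERR_BIT		BIT(IEH_SHIFT + 1)

/* Bits 0-4: error bit, 5-6: severity, 7: register group, 8: IEH, 9: marker. */
#define SOC_ERR_INDEX(ieh, reg_group, hw_err, errbit) \
	(SOC_ERR_BIT | \
	 (unsigned long)(ieh) << IEH_SHIFT | \
	 (unsigned long)(reg_group) << REG_GROUP_SHIFT | \
	 (unsigned long)(hw_err) << SOC_HW_ERR_SHIFT | \
	 (unsigned long)(errbit))

/* Hypothetical decode helpers; the patch itself only decodes the IEH field. */
static unsigned int soc_index_ieh(unsigned long index)
{
	return (index >> IEH_SHIFT) & IEH_MASK;
}

static unsigned int soc_index_hw_err(unsigned long index)
{
	return (index >> SOC_HW_ERR_SHIFT) & (BIT(HW_ERR_TYPE_BITS) - 1);
}

static unsigned int soc_index_errbit(unsigned long index)
{
	return index & (BIT(SOC_HW_ERR_SHIFT) - 1);
}
```

With these values, IEH0 / local / fatal / SOC_PSF_CSC_0 (bit 8) packs to
0x248, and IEH1 / global / fatal / PVC_SOC_CD0 (bit 17) packs to 0x3d1.
Because every field has its own bit range, each case label in
soc_err_index_to_str() maps to a distinct index as long as the combination
of IEH, register group, severity and bit differs.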
diff --git a/drivers/gpu/drm/xe/regs/xe_regs.h b/drivers/gpu/drm/xe/regs/xe_regs.h
index b3d35d0c5a77..422ed63ab32e 100644
--- a/drivers/gpu/drm/xe/regs/xe_regs.h
+++ b/drivers/gpu/drm/xe/regs/xe_regs.h
@@ -130,8 +130,9 @@ enum hardware_error {
#define DEV_ERR_STAT_REG(x) _MMIO(_PICK_EVEN((x), \
_DEV_ERR_STAT_CORRECTABLE, \
_DEV_ERR_STAT_NONFATAL))
-
-#define DEV_ERR_STAT_GT_ERROR REG_BIT(0)
+#define DEV_ERR_STAT_SOC_ERROR REG_BIT(16)
+#define DEV_ERR_STAT_SGUNIT_ERROR REG_BIT(12)
+#define DEV_ERR_STAT_GT_ERROR REG_BIT(0)
enum gt_vctr_registers {
ERR_STAT_GT_VCTR0 = 0,
@@ -211,4 +212,115 @@ enum gt_vctr_registers {
#define SLM_ECC_ERROR_CNTR(x) _MMIO((x) == HARDWARE_ERROR_CORRECTABLE ? \
_SLM_ECC_ERROR_CNT : \
_SLM_UNCORR_ECC_ERROR_CNT)
+#define SOC_PVC_BASE 0x00282000
+#define SOC_PVC_SLAVE_BASE 0x00283000
+
+#define _SOC_LERRCORSTS 0x000294
+#define _SOC_LERRUNCSTS 0x000280
+#define SOC_LOCAL_ERR_STAT_SLAVE_REG(base, x) _MMIO((x) > HARDWARE_ERROR_CORRECTABLE ? \
+ base + _SOC_LERRUNCSTS : \
+ base + _SOC_LERRCORSTS)
+#define SOC_FABRIC_SS1_3 (7)
+#define SOC_FABRIC_SS1_2 (6)
+#define SOC_FABRIC_SS1_1 (5)
+#define SOC_FABRIC_SS1_0 (4)
+
+#define SOC_LOCAL_ERR_STAT_MASTER_REG(base, x) _MMIO((x) > HARDWARE_ERROR_CORRECTABLE ? \
+ base + _SOC_LERRUNCSTS : \
+ base + _SOC_LERRCORSTS)
+#define PVC_SOC_PSF_2 (13)
+#define PVC_SOC_PSF_1 (12)
+#define PVC_SOC_PSF_0 (11)
+#define SOC_PSF_CSC_2 (10)
+#define SOC_PSF_CSC_1 (9)
+#define SOC_PSF_CSC_0 (8)
+#define SOC_FABRIC_SS0_3 (7)
+#define SOC_FABRIC_SS0_2 (6)
+#define SOC_FABRIC_SS0_1 (5)
+#define SOC_FABRIC_SS0_0 (4)
+
+#define _SOC_GSYSEVTCTL 0x000264
+#define SOC_GSYSEVTCTL_REG(base, slave_base, x) _MMIO(_PICK_EVEN((x), \
+ base + _SOC_GSYSEVTCTL, \
+ slave_base + _SOC_GSYSEVTCTL))
+#define _SOC_GCOERRSTS 0x000200
+#define _SOC_GNFERRSTS 0x000210
+#define _SOC_GFAERRSTS 0x000220
+#define SOC_GLOBAL_ERR_STAT_SLAVE_REG(base, x) _MMIO(_PICK_EVEN((x), \
+ base + _SOC_GCOERRSTS, \
+ base + _SOC_GNFERRSTS))
+#define PVC_SOC_HBM_SS3_7 (16)
+#define PVC_SOC_HBM_SS3_6 (15)
+#define PVC_SOC_HBM_SS3_5 (14)
+#define PVC_SOC_HBM_SS3_4 (13)
+#define PVC_SOC_HBM_SS3_3 (12)
+#define PVC_SOC_HBM_SS3_2 (11)
+#define PVC_SOC_HBM_SS3_1 (10)
+#define PVC_SOC_HBM_SS3_0 (9)
+#define PVC_SOC_HBM_SS2_7 (8)
+#define PVC_SOC_HBM_SS2_6 (7)
+#define PVC_SOC_HBM_SS2_5 (6)
+#define PVC_SOC_HBM_SS2_4 (5)
+#define PVC_SOC_HBM_SS2_3 (4)
+#define PVC_SOC_HBM_SS2_2 (3)
+#define PVC_SOC_HBM_SS2_1 (2)
+#define PVC_SOC_HBM_SS2_0 (1)
+#define SOC_HBM_SS1_15 (17)
+#define SOC_HBM_SS1_14 (16)
+#define SOC_HBM_SS1_13 (15)
+#define SOC_HBM_SS1_12 (14)
+#define SOC_HBM_SS1_11 (13)
+#define SOC_HBM_SS1_10 (12)
+#define SOC_HBM_SS1_9 (11)
+#define SOC_HBM_SS1_8 (10)
+#define SOC_HBM_SS1_7 (9)
+#define SOC_HBM_SS1_6 (8)
+#define SOC_HBM_SS1_5 (7)
+#define SOC_HBM_SS1_4 (6)
+#define SOC_HBM_SS1_3 (5)
+#define SOC_HBM_SS1_2 (4)
+#define SOC_HBM_SS1_1 (3)
+#define SOC_HBM_SS1_0 (2)
+#define SOC_FABRIC_SS1_4 (1)
+#define SOC_IEH1_LOCAL_ERR_STATUS (0)
+
+#define SOC_GLOBAL_ERR_STAT_MASTER_REG(base, x) _MMIO(_PICK_EVEN((x), \
+ base + _SOC_GCOERRSTS, \
+ base + _SOC_GNFERRSTS))
+#define PVC_SOC_MDFI_SOUTH (6)
+#define PVC_SOC_MDFI_EAST (4)
+#define PVC_SOC_CD0_MDFI (18)
+#define PVC_SOC_CD0 (17)
+#define PVC_SOC_HBM_SS1_7 (17)
+#define PVC_SOC_HBM_SS1_6 (16)
+#define PVC_SOC_HBM_SS1_5 (15)
+#define PVC_SOC_HBM_SS1_4 (14)
+#define PVC_SOC_HBM_SS1_3 (13)
+#define PVC_SOC_HBM_SS1_2 (12)
+#define PVC_SOC_HBM_SS1_1 (11)
+#define PVC_SOC_HBM_SS1_0 (10)
+#define SOC_MDFI_SOUTH (21)
+#define SOC_MDFI_WEST (20)
+#define SOC_MDFI_EAST (19)
+#define SOC_PUNIT (18)
+#define SOC_HBM_SS0_15 (17)
+#define SOC_HBM_SS0_14 (16)
+#define SOC_HBM_SS0_13 (15)
+#define SOC_HBM_SS0_12 (14)
+#define SOC_HBM_SS0_11 (13)
+#define SOC_HBM_SS0_10 (12)
+#define SOC_HBM_SS0_9 (11)
+#define SOC_HBM_SS0_8 (10)
+#define SOC_HBM_SS0_7 (9)
+#define SOC_HBM_SS0_6 (8)
+#define SOC_HBM_SS0_5 (7)
+#define SOC_HBM_SS0_4 (6)
+#define SOC_HBM_SS0_3 (5)
+#define SOC_HBM_SS0_2 (4)
+#define SOC_HBM_SS0_1 (3)
+#define SOC_HBM_SS0_0 (2)
+#define SOC_SLAVE_IEH (1)
+#define SOC_IEH0_LOCAL_ERR_STATUS (0)
+#define SOC_HW_ERR_MAX_BITS (32)
+
#endif
diff --git a/drivers/gpu/drm/xe/xe_device.c b/drivers/gpu/drm/xe/xe_device.c
index a79f934e3d2d..771ea5382815 100644
--- a/drivers/gpu/drm/xe/xe_device.c
+++ b/drivers/gpu/drm/xe/xe_device.c
@@ -334,8 +334,14 @@ int xe_device_probe(struct xe_device *xe)
static void xe_device_remove_display(struct xe_device *xe)
{
+ struct xe_gt *gt;
+ u32 id;
+
xe_display_unregister(xe);
+ for_each_gt(gt, xe, id)
+ xa_destroy(&gt->errors.soc);
+
drm_dev_unplug(&xe->drm);
xe_display_modset_driver_remove(xe);
}
diff --git a/drivers/gpu/drm/xe/xe_gt.c b/drivers/gpu/drm/xe/xe_gt.c
index ce9ce2748394..518c76553e31 100644
--- a/drivers/gpu/drm/xe/xe_gt.c
+++ b/drivers/gpu/drm/xe/xe_gt.c
@@ -543,6 +543,7 @@ int xe_gt_init(struct xe_gt *gt)
int err;
int i;
+ xa_init(&gt->errors.soc);
 INIT_WORK(&gt->reset.worker, gt_reset_worker);
for (i = 0; i < XE_ENGINE_CLASS_MAX; ++i) {
diff --git a/drivers/gpu/drm/xe/xe_gt_types.h b/drivers/gpu/drm/xe/xe_gt_types.h
index 9580a40c0142..bd4a85959df3 100644
--- a/drivers/gpu/drm/xe/xe_gt_types.h
+++ b/drivers/gpu/drm/xe/xe_gt_types.h
@@ -6,6 +6,7 @@
#ifndef _XE_GT_TYPES_H_
#define _XE_GT_TYPES_H_
+#include "regs/xe_regs.h"
#include "xe_force_wake_types.h"
#include "xe_hw_engine_types.h"
#include "xe_hw_fence_types.h"
@@ -66,6 +67,17 @@ enum xe_gt_driver_errors {
INTEL_GT_DRIVER_ERROR_COUNT
};
+enum intel_soc_num_ieh {
+ INTEL_GT_SOC_IEH0 = 0,
+ INTEL_GT_SOC_IEH1,
+ INTEL_GT_SOC_NUM_IEH
+};
+
+enum intel_soc_ieh_reg_type {
+ INTEL_SOC_REG_LOCAL = 0,
+ INTEL_SOC_REG_GLOBAL
+};
+
void xe_gt_log_driver_error(struct xe_gt *gt,
const enum xe_gt_driver_errors error,
const char *fmt, ...);
@@ -397,9 +409,25 @@ struct xe_gt {
struct intel_hw_errors {
unsigned long hw[INTEL_GT_HW_ERROR_COUNT];
+ struct xarray soc;
+ unsigned long sgunit[HARDWARE_ERROR_MAX];
unsigned long driver[INTEL_GT_DRIVER_ERROR_COUNT];
} errors;
};
+#define SOC_HW_ERR_SHIFT ilog2(SOC_HW_ERR_MAX_BITS)
+#define SOC_ERR_BIT BIT(IEH_SHIFT + 1)
+#define IEH_SHIFT (REG_GROUP_SHIFT + REG_GROUP_BITS)
+#define IEH_MASK (0x1)
+#define REG_GROUP_SHIFT (HW_ERR_TYPE_BITS + SOC_HW_ERR_SHIFT)
+#define REG_GROUP_BITS (1)
+#define HW_ERR_TYPE_BITS (2)
+#define SOC_ERR_INDEX(IEH, REG_GROUP, HW_ERR, ERRBIT) \
+ (SOC_ERR_BIT | \
+ (IEH) << IEH_SHIFT | \
+ (REG_GROUP) << REG_GROUP_SHIFT | \
+ (HW_ERR) << SOC_HW_ERR_SHIFT | \
+ (ERRBIT))
+
#endif
diff --git a/drivers/gpu/drm/xe/xe_irq.c b/drivers/gpu/drm/xe/xe_irq.c
index 4626f7280aaf..c047d9b66a7c 100644
--- a/drivers/gpu/drm/xe/xe_irq.c
+++ b/drivers/gpu/drm/xe/xe_irq.c
@@ -637,6 +637,233 @@ xe_gt_hw_error_handler(struct xe_gt *gt, const enum hardware_error hw_err)
xe_mmio_write32(gt, ERR_STAT_GT_REG(hw_err).reg, errstat);
}
+static const char *
+soc_err_index_to_str(unsigned long index)
+{
+ switch (index) {
+ case SOC_ERR_INDEX(INTEL_GT_SOC_IEH0, INTEL_SOC_REG_LOCAL, HARDWARE_ERROR_FATAL, SOC_PSF_CSC_0):
+ return "PSF CSC0";
+ case SOC_ERR_INDEX(INTEL_GT_SOC_IEH0, INTEL_SOC_REG_LOCAL, HARDWARE_ERROR_FATAL, SOC_PSF_CSC_1):
+ return "PSF CSC1";
+ case SOC_ERR_INDEX(INTEL_GT_SOC_IEH0, INTEL_SOC_REG_LOCAL, HARDWARE_ERROR_FATAL, SOC_PSF_CSC_2):
+ return "PSF CSC2";
+ case SOC_ERR_INDEX(INTEL_GT_SOC_IEH0, INTEL_SOC_REG_LOCAL, HARDWARE_ERROR_FATAL, PVC_SOC_PSF_0):
+ return "PSF0";
+ case SOC_ERR_INDEX(INTEL_GT_SOC_IEH0, INTEL_SOC_REG_LOCAL, HARDWARE_ERROR_FATAL, PVC_SOC_PSF_1):
+ return "PSF1";
+ case SOC_ERR_INDEX(INTEL_GT_SOC_IEH0, INTEL_SOC_REG_LOCAL, HARDWARE_ERROR_FATAL, PVC_SOC_PSF_2):
+ return "PSF2";
+ case SOC_ERR_INDEX(INTEL_GT_SOC_IEH1, INTEL_SOC_REG_GLOBAL, HARDWARE_ERROR_FATAL, PVC_SOC_CD0):
+ return "CD0";
+ case SOC_ERR_INDEX(INTEL_GT_SOC_IEH1, INTEL_SOC_REG_GLOBAL, HARDWARE_ERROR_FATAL, PVC_SOC_CD0_MDFI):
+ return "CD0 MDFI";
+ case SOC_ERR_INDEX(INTEL_GT_SOC_IEH0, INTEL_SOC_REG_LOCAL, HARDWARE_ERROR_FATAL, PVC_SOC_MDFI_EAST):
+ return "MDFI EAST";
+ case SOC_ERR_INDEX(INTEL_GT_SOC_IEH0, INTEL_SOC_REG_LOCAL, HARDWARE_ERROR_FATAL, PVC_SOC_MDFI_SOUTH):
+ return "MDFI SOUTH";
+ case SOC_ERR_INDEX(INTEL_GT_SOC_IEH0, INTEL_SOC_REG_GLOBAL, HARDWARE_ERROR_FATAL, SOC_PUNIT):
+ return "PUNIT";
+ case SOC_ERR_INDEX(INTEL_GT_SOC_IEH0, INTEL_SOC_REG_GLOBAL, HARDWARE_ERROR_FATAL, SOC_HBM_SS0_0):
+ return "HBM SS0: Sbbridge0";
+ case SOC_ERR_INDEX(INTEL_GT_SOC_IEH0, INTEL_SOC_REG_GLOBAL, HARDWARE_ERROR_FATAL, SOC_HBM_SS0_1):
+ return "HBM SS0: Sbbridge1";
+ case SOC_ERR_INDEX(INTEL_GT_SOC_IEH0, INTEL_SOC_REG_GLOBAL, HARDWARE_ERROR_FATAL, SOC_HBM_SS0_2):
+ return "HBM SS0: Sbbridge2";
+ case SOC_ERR_INDEX(INTEL_GT_SOC_IEH0, INTEL_SOC_REG_GLOBAL, HARDWARE_ERROR_FATAL, SOC_HBM_SS0_3):
+ return "HBM SS0: Sbbridge3";
+ case SOC_ERR_INDEX(INTEL_GT_SOC_IEH0, INTEL_SOC_REG_GLOBAL, HARDWARE_ERROR_FATAL, SOC_HBM_SS0_4):
+ return "HBM SS0: Sbbridge4";
+ case SOC_ERR_INDEX(INTEL_GT_SOC_IEH0, INTEL_SOC_REG_GLOBAL, HARDWARE_ERROR_FATAL, SOC_HBM_SS0_5):
+ return "HBM SS0: Sbbridge5";
+ case SOC_ERR_INDEX(INTEL_GT_SOC_IEH0, INTEL_SOC_REG_GLOBAL, HARDWARE_ERROR_FATAL, SOC_HBM_SS0_6):
+ return "HBM SS0: Sbbridge6";
+ case SOC_ERR_INDEX(INTEL_GT_SOC_IEH0, INTEL_SOC_REG_GLOBAL, HARDWARE_ERROR_FATAL, SOC_HBM_SS0_7):
+ return "HBM SS0: Sbbridge7";
+ case SOC_ERR_INDEX(INTEL_GT_SOC_IEH0, INTEL_SOC_REG_GLOBAL, HARDWARE_ERROR_FATAL, PVC_SOC_HBM_SS1_0):
+ return "HBM SS1: Sbbridge0";
+ case SOC_ERR_INDEX(INTEL_GT_SOC_IEH0, INTEL_SOC_REG_GLOBAL, HARDWARE_ERROR_FATAL, PVC_SOC_HBM_SS1_1):
+ return "HBM SS1: Sbbridge1";
+ case SOC_ERR_INDEX(INTEL_GT_SOC_IEH0, INTEL_SOC_REG_GLOBAL, HARDWARE_ERROR_FATAL, PVC_SOC_HBM_SS1_2):
+ return "HBM SS1: Sbbridge2";
+ case SOC_ERR_INDEX(INTEL_GT_SOC_IEH0, INTEL_SOC_REG_GLOBAL, HARDWARE_ERROR_FATAL, PVC_SOC_HBM_SS1_3):
+ return "HBM SS1: Sbbridge3";
+ case SOC_ERR_INDEX(INTEL_GT_SOC_IEH0, INTEL_SOC_REG_GLOBAL, HARDWARE_ERROR_FATAL, PVC_SOC_HBM_SS1_4):
+ return "HBM SS1: Sbbridge4";
+ case SOC_ERR_INDEX(INTEL_GT_SOC_IEH0, INTEL_SOC_REG_GLOBAL, HARDWARE_ERROR_FATAL, PVC_SOC_HBM_SS1_5):
+ return "HBM SS1: Sbbridge5";
+ case SOC_ERR_INDEX(INTEL_GT_SOC_IEH0, INTEL_SOC_REG_GLOBAL, HARDWARE_ERROR_FATAL, PVC_SOC_HBM_SS1_6):
+ return "HBM SS1: Sbbridge6";
+ case SOC_ERR_INDEX(INTEL_GT_SOC_IEH0, INTEL_SOC_REG_GLOBAL, HARDWARE_ERROR_FATAL, PVC_SOC_HBM_SS1_7):
+ return "HBM SS1: Sbbridge7";
+ case SOC_ERR_INDEX(INTEL_GT_SOC_IEH1, INTEL_SOC_REG_GLOBAL, HARDWARE_ERROR_FATAL, PVC_SOC_HBM_SS2_0):
+ return "HBM SS2: Sbbridge0";
+ case SOC_ERR_INDEX(INTEL_GT_SOC_IEH1, INTEL_SOC_REG_GLOBAL, HARDWARE_ERROR_FATAL, PVC_SOC_HBM_SS2_1):
+ return "HBM SS2: Sbbridge1";
+ case SOC_ERR_INDEX(INTEL_GT_SOC_IEH1, INTEL_SOC_REG_GLOBAL, HARDWARE_ERROR_FATAL, PVC_SOC_HBM_SS2_2):
+ return "HBM SS2: Sbbridge2";
+ case SOC_ERR_INDEX(INTEL_GT_SOC_IEH1, INTEL_SOC_REG_GLOBAL, HARDWARE_ERROR_FATAL, PVC_SOC_HBM_SS2_3):
+ return "HBM SS2: Sbbridge3";
+ case SOC_ERR_INDEX(INTEL_GT_SOC_IEH1, INTEL_SOC_REG_GLOBAL, HARDWARE_ERROR_FATAL, PVC_SOC_HBM_SS2_4):
+ return "HBM SS2: Sbbridge4";
+ case SOC_ERR_INDEX(INTEL_GT_SOC_IEH1, INTEL_SOC_REG_GLOBAL, HARDWARE_ERROR_FATAL, PVC_SOC_HBM_SS2_5):
+ return "HBM SS2: Sbbridge5";
+ case SOC_ERR_INDEX(INTEL_GT_SOC_IEH1, INTEL_SOC_REG_GLOBAL, HARDWARE_ERROR_FATAL, PVC_SOC_HBM_SS2_6):
+ return "HBM SS2: Sbbridge6";
+ case SOC_ERR_INDEX(INTEL_GT_SOC_IEH1, INTEL_SOC_REG_GLOBAL, HARDWARE_ERROR_FATAL, PVC_SOC_HBM_SS2_7):
+ return "HBM SS2: Sbbridge7";
+ case SOC_ERR_INDEX(INTEL_GT_SOC_IEH1, INTEL_SOC_REG_GLOBAL, HARDWARE_ERROR_FATAL, PVC_SOC_HBM_SS3_0):
+ return "HBM SS3: Sbbridge0";
+ case SOC_ERR_INDEX(INTEL_GT_SOC_IEH1, INTEL_SOC_REG_GLOBAL, HARDWARE_ERROR_FATAL, PVC_SOC_HBM_SS3_1):
+ return "HBM SS3: Sbbridge1";
+ case SOC_ERR_INDEX(INTEL_GT_SOC_IEH1, INTEL_SOC_REG_GLOBAL, HARDWARE_ERROR_FATAL, PVC_SOC_HBM_SS3_2):
+ return "HBM SS3: Sbbridge2";
+ case SOC_ERR_INDEX(INTEL_GT_SOC_IEH1, INTEL_SOC_REG_GLOBAL, HARDWARE_ERROR_FATAL, PVC_SOC_HBM_SS3_3):
+ return "HBM SS3: Sbbridge3";
+ case SOC_ERR_INDEX(INTEL_GT_SOC_IEH1, INTEL_SOC_REG_GLOBAL, HARDWARE_ERROR_FATAL, PVC_SOC_HBM_SS3_4):
+ return "HBM SS3: Sbbridge4";
+ case SOC_ERR_INDEX(INTEL_GT_SOC_IEH1, INTEL_SOC_REG_GLOBAL, HARDWARE_ERROR_FATAL, PVC_SOC_HBM_SS3_5):
+ return "HBM SS3: Sbbridge5";
+ case SOC_ERR_INDEX(INTEL_GT_SOC_IEH1, INTEL_SOC_REG_GLOBAL, HARDWARE_ERROR_FATAL, PVC_SOC_HBM_SS3_6):
+ return "HBM SS3: Sbbridge6";
+ case SOC_ERR_INDEX(INTEL_GT_SOC_IEH1, INTEL_SOC_REG_GLOBAL, HARDWARE_ERROR_FATAL, PVC_SOC_HBM_SS3_7):
+ return "HBM SS3: Sbbridge7";
+ default:
+ return "UNKNOWN";
+ }
+}
+
+static void update_soc_hw_error_cnt(struct xe_gt *gt, unsigned long index)
+{
+ unsigned long flags;
+ void *entry;
+
+ entry = xa_load(&gt->errors.soc, index);
+ entry = xa_mk_value(xa_to_value(entry) + 1);
+
+ xa_lock_irqsave(&gt->errors.soc, flags);
+ if (xa_is_err(__xa_store(&gt->errors.soc, index, entry, GFP_ATOMIC)))
+ drm_err_ratelimited(&gt->xe->drm,
+ HW_ERR "SOC error reported by IEH%lu on GT %d lost\n",
+ (index >> IEH_SHIFT) & IEH_MASK,
+ gt->info.id);
+ xa_unlock_irqrestore(&gt->errors.soc, flags);
+}
+
+static void
+xe_soc_hw_error_handler(struct xe_gt *gt, const enum hardware_error hw_err)
+{
+ unsigned long mst_glb_errstat, slv_glb_errstat, lcl_errstat, index;
+ u32 errbit, base, slave_base;
+ int i;
+
+ lockdep_assert_held(&gt_to_xe(gt)->irq.lock);
+ if (gt->xe->info.platform != XE_PVC)
+ return;
+
+ base = SOC_PVC_BASE;
+ slave_base = SOC_PVC_SLAVE_BASE;
+
+ xe_gt_hw_err(gt, "SOC %s error\n", hardware_error_type_to_str(hw_err));
+
+ if (hw_err == HARDWARE_ERROR_CORRECTABLE || hw_err == HARDWARE_ERROR_NONFATAL) {
+ for (i = 0; i < INTEL_GT_SOC_NUM_IEH; i++)
+ xe_mmio_write32(gt, SOC_GSYSEVTCTL_REG(base, slave_base, i).reg,
+ ~REG_BIT(hw_err));
+
+ xe_mmio_write32(gt, SOC_GLOBAL_ERR_STAT_MASTER_REG(base, hw_err).reg,
+ REG_GENMASK(31, 0));
+ xe_mmio_write32(gt, SOC_LOCAL_ERR_STAT_MASTER_REG(base, hw_err).reg,
+ REG_GENMASK(31, 0));
+ xe_mmio_write32(gt, SOC_GLOBAL_ERR_STAT_SLAVE_REG(slave_base, hw_err).reg,
+ REG_GENMASK(31, 0));
+ xe_mmio_write32(gt, SOC_LOCAL_ERR_STAT_SLAVE_REG(slave_base, hw_err).reg,
+ REG_GENMASK(31, 0));
+
+ xe_gt_log_driver_error(gt, INTEL_GT_DRIVER_ERROR_INTERRUPT, "UNKNOWN SOC %s error\n",
+ hardware_error_type_to_str(hw_err));
+ }
+
+ /*
+ * Mask error type in GSYSEVTCTL so that no new errors of the type
+ * will be reported. Read the master global IEH error register if
+ * BIT 1 is set then process the slave IEH first. If BIT 0 in
+ * global error register is set then process the corresponding
+ * Local error registers
+ */
+ for (i = 0; i < INTEL_GT_SOC_NUM_IEH; i++)
+ xe_mmio_write32(gt, SOC_GSYSEVTCTL_REG(base, slave_base, i).reg, ~REG_BIT(hw_err));
+
+ mst_glb_errstat = xe_mmio_read32(gt, SOC_GLOBAL_ERR_STAT_MASTER_REG(base, hw_err).reg);
+ xe_gt_hw_err(gt, "SOC_GLOBAL_ERR_STAT_MASTER_REG_FATAL:0x%08lx\n", mst_glb_errstat);
+ if (mst_glb_errstat & REG_BIT(SOC_SLAVE_IEH)) {
+ slv_glb_errstat = xe_mmio_read32(gt, SOC_GLOBAL_ERR_STAT_SLAVE_REG(slave_base,
+ hw_err).reg);
+ xe_gt_hw_err(gt, "SOC_GLOBAL_ERR_STAT_SLAVE_REG_FATAL:0x%08lx\n",
+ slv_glb_errstat);
+
+ if (slv_glb_errstat & REG_BIT(SOC_IEH1_LOCAL_ERR_STATUS)) {
+ lcl_errstat = xe_mmio_read32(gt, SOC_LOCAL_ERR_STAT_SLAVE_REG(slave_base,
+ hw_err).reg);
+ xe_gt_hw_err(gt, "SOC_LOCAL_ERR_STAT_SLAVE_REG_FATAL:0x%08lx\n",
+ lcl_errstat);
+
+ for_each_set_bit(errbit, &lcl_errstat, SOC_HW_ERR_MAX_BITS) {
+ /*
+ * SOC errors have global and local error
+ * registers for each correctable non-fatal
+ * and fatal categories and these are per IEH
+ * on platform. XEHPSDV and PVC have two IEHs
+ */
+ index = SOC_ERR_INDEX(INTEL_GT_SOC_IEH1, INTEL_SOC_REG_LOCAL, hw_err, errbit);
+ update_soc_hw_error_cnt(gt, index);
+ if (gt->xe->info.platform == XE_PVC)
+ xe_gt_hw_err(gt, "%s SOC FATAL error\n",
+ soc_err_index_to_str(index));
+ }
+ xe_mmio_write32(gt, SOC_LOCAL_ERR_STAT_SLAVE_REG(slave_base, hw_err).reg,
+ lcl_errstat);
+ }
+
+ for_each_set_bit(errbit, &slv_glb_errstat, SOC_HW_ERR_MAX_BITS) {
+ index = SOC_ERR_INDEX(INTEL_GT_SOC_IEH1, INTEL_SOC_REG_GLOBAL, hw_err, errbit);
+ update_soc_hw_error_cnt(gt, index);
+ if (gt->xe->info.platform == XE_PVC)
+ xe_gt_hw_err(gt, "%s SOC FATAL error\n",
+ soc_err_index_to_str(index));
+ }
+ xe_mmio_write32(gt, SOC_GLOBAL_ERR_STAT_SLAVE_REG(slave_base, hw_err).reg,
+ slv_glb_errstat);
+ }
+
+ if (mst_glb_errstat & REG_BIT(SOC_IEH0_LOCAL_ERR_STATUS)) {
+ lcl_errstat = xe_mmio_read32(gt, SOC_LOCAL_ERR_STAT_MASTER_REG(base, hw_err).reg);
+ xe_gt_hw_err(gt, "SOC_LOCAL_ERR_STAT_MASTER_REG_FATAL:0x%08lx\n", lcl_errstat);
+ for_each_set_bit(errbit, &lcl_errstat, SOC_HW_ERR_MAX_BITS) {
+ index = SOC_ERR_INDEX(INTEL_GT_SOC_IEH0, INTEL_SOC_REG_LOCAL, hw_err, errbit);
+ update_soc_hw_error_cnt(gt, index);
+ if (gt->xe->info.platform == XE_PVC)
+ xe_gt_hw_err(gt, "%s SOC FATAL error\n",
+ soc_err_index_to_str(index));
+ }
+ xe_mmio_write32(gt, SOC_LOCAL_ERR_STAT_MASTER_REG(base, hw_err).reg,
+ lcl_errstat);
+ }
+
+ for_each_set_bit(errbit, &mst_glb_errstat, SOC_HW_ERR_MAX_BITS) {
+ index = SOC_ERR_INDEX(INTEL_GT_SOC_IEH0, INTEL_SOC_REG_GLOBAL, hw_err, errbit);
+ update_soc_hw_error_cnt(gt, index);
+ if (gt->xe->info.platform == XE_PVC)
+ xe_gt_hw_err(gt, "%s SOC FATAL error\n",
+ soc_err_index_to_str(index));
+ }
+ xe_mmio_write32(gt, SOC_GLOBAL_ERR_STAT_MASTER_REG(base, hw_err).reg,
+ mst_glb_errstat);
+
+ for (i = 0; i < INTEL_GT_SOC_NUM_IEH; i++)
+ xe_mmio_write32(gt, SOC_GSYSEVTCTL_REG(base, slave_base, i).reg,
+ (HARDWARE_ERROR_MAX << 1) + 1);
+}
+
static void
xe_hw_error_source_handler(struct xe_gt *gt, const enum hardware_error hw_err)
{
@@ -655,10 +882,11 @@ xe_hw_error_source_handler(struct xe_gt *gt, const enum hardware_error hw_err)
if (errsrc & DEV_ERR_STAT_GT_ERROR)
xe_gt_hw_error_handler(gt, hw_err);
- if (errsrc & ~DEV_ERR_STAT_GT_ERROR)
- xe_gt_log_driver_error(gt, INTEL_GT_DRIVER_ERROR_INTERRUPT,
- "non-GT hardware error(s) in DEV_ERR_STAT_REG_%s: 0x%08x\n",
- hw_err_str, errsrc & ~DEV_ERR_STAT_GT_ERROR);
+ if (errsrc & DEV_ERR_STAT_SGUNIT_ERROR)
+ gt->errors.sgunit[hw_err]++;
+
+ if (errsrc & DEV_ERR_STAT_SOC_ERROR)
+ xe_soc_hw_error_handler(gt, hw_err);
xe_mmio_write32(gt, DEV_ERR_STAT_REG(hw_err).reg, errsrc);
--
2.25.1
^ permalink raw reply related [flat|nested] 22+ messages in thread
* [Intel-xe] [PATCH 4/4] drm/xe/ras: Add support for reporting CSC HW and FW errors.
2023-04-06 9:26 [Intel-xe] [PATCH 0/4] RFC: drm/xe/ras: Supporting RAS on XE Himal Prasad Ghimiray
` (3 preceding siblings ...)
2023-04-06 9:26 ` [Intel-xe] [PATCH 3/4] drm/xe/ras: Count SOC and SGUNIT errors Himal Prasad Ghimiray
@ 2023-04-06 9:26 ` Himal Prasad Ghimiray
2023-04-06 12:25 ` [Intel-xe] [PATCH 0/4] RFC: drm/xe/ras: Supporting RAS on XE Jani Nikula
5 siblings, 0 replies; 22+ messages in thread
From: Himal Prasad Ghimiray @ 2023-04-06 9:26 UTC (permalink / raw)
To: intel-xe; +Cc: Akeem G Abodunrin
From: Akeem G Abodunrin <akeem.g.abodunrin@intel.com>
Add support for the memory sparing warning interrupt.
This applies to HBM sparing: when memory degradation occurs
and cannot be repaired through PCLS at runtime, the FW
generates a warning interrupt. The KMD then reads the error
registers, counts the errors and communicates them to
userspace via uevent.
Report CSC HW errors not related to HBM and send a
RESET request via uevent.
---
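A minimal userspace sketch of the cause-to-status mapping described above, for
readers who want to see the intended uevent contents in one place. The cause
codes, status names and variable strings are copied from this patch;
cause_to_health() and enum mem_health are hypothetical names used only for
illustration and are not part of the driver.

```c
#include <stdint.h>

/* Cause codes as defined by this patch in xe_regs.h. */
#define BANK_SPARNG_ERR_MITIGATION_DOWNGRADED	0x1000
#define BANK_SPARNG_DIS_PCLS_EXCEEDED		0x1001
#define BANK_SPARNG_ENA_PCLS_UNCORRECTABLE	0x1002

/* Mirrors the anonymous enum in struct xe_mem_sparing_event. */
enum mem_health {
	MEM_HEALTH_OKAY = 0,
	MEM_HEALTH_ALARM,
	MEM_HEALTH_EC_PENDING,
	MEM_HEALTH_DEGRADED,
	MEM_HEALTH_UNKNOWN
};

/*
 * Hypothetical helper (illustration only): map a FW-reported cause to a
 * health status and to the second uevent variable string that
 * xe_mem_health_work() would send after "MEMORY_HEALTH=1".
 */
enum mem_health cause_to_health(uint32_t cause, const char **env)
{
	switch (cause) {
	case BANK_SPARNG_ERR_MITIGATION_DOWNGRADED:
		*env = "MEM_HEALTH_ALARM=1";
		return MEM_HEALTH_ALARM;
	case BANK_SPARNG_DIS_PCLS_EXCEEDED:
		*env = "RESET_REQUIRED=1 EC_PENDING=1";
		return MEM_HEALTH_EC_PENDING;
	case BANK_SPARNG_ENA_PCLS_UNCORRECTABLE:
		*env = "DEGRADED=1 EC_FAILED=1";
		return MEM_HEALTH_DEGRADED;
	default:
		/* Any other value, including OR-ed combinations. */
		*env = "SPARING_STATUS_UNKNOWN=1";
		return MEM_HEALTH_UNKNOWN;
	}
}
```

One thing the sketch makes visible: the interrupt handler accumulates causes
with |=, so if two different causes arrive before the work item runs, the
combined value (for example 0x1001 | 0x1002 = 0x1003) matches none of the
cases and would be reported as SPARING_STATUS_UNKNOWN. Whether that can
happen in practice depends on FW behaviour and is not established here.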
drivers/gpu/drm/xe/regs/xe_regs.h | 32 ++++
drivers/gpu/drm/xe/xe_device_types.h | 2 +
drivers/gpu/drm/xe/xe_gt_types.h | 34 +++++
drivers/gpu/drm/xe/xe_irq.c | 216 +++++++++++++++++++++++++++
drivers/gpu/drm/xe/xe_pci.c | 3 +
5 files changed, 287 insertions(+)
diff --git a/drivers/gpu/drm/xe/regs/xe_regs.h b/drivers/gpu/drm/xe/regs/xe_regs.h
index 422ed63ab32e..fb206e2dd2e1 100644
--- a/drivers/gpu/drm/xe/regs/xe_regs.h
+++ b/drivers/gpu/drm/xe/regs/xe_regs.h
@@ -132,6 +132,7 @@ enum hardware_error {
_DEV_ERR_STAT_NONFATAL))
#define DEV_ERR_STAT_SOC_ERROR REG_BIT(16)
#define DEV_ERR_STAT_SGUNIT_ERROR REG_BIT(12)
+#define DEV_ERR_STAT_GSC_ERROR REG_BIT(8)
#define DEV_ERR_STAT_GT_ERROR REG_BIT(0)
enum gt_vctr_registers {
@@ -323,4 +324,35 @@ enum gt_vctr_registers {
#define SOC_IEH0_LOCAL_ERR_STATUS (0)
#define SOC_HW_ERR_MAX_BITS (32)
+#define PVC_GSC_HECI1_BASE 0x00284000
+#define _GSC_HEC_CORR_ERR_STATUS 0x128
+#define _GSC_HEC_UNCORR_ERR_STATUS 0x118
+#define GSC_HEC_CORR_UNCORR_ERR_STATUS(base, x) _MMIO(_PICK_EVEN((x), \
+ (base) + _GSC_HEC_CORR_ERR_STATUS, \
+ (base) + _GSC_HEC_UNCORR_ERR_STATUS))
+#define GSC_COR_FW_REPORTED_ERR (1)
+#define GSC_COR_SRAM_ECC_SINGLE_BIT_ERR (0)
+
+#define GSC_UNCOR_AON_PARITY_ERR (11)
+#define GSC_UNCOR_SELFMBIST_ERR (10)
+#define GSC_UNCOR_FUSE_CRC_CHECK_ERR (9)
+#define GSC_UNCOR_FUSE_PULL_ERR (8)
+#define GSC_UNCOR_GLITCH_DET_ERR (7)
+#define GSC_UNCOR_FW_REPORTED_ERR (6)
+#define GSC_UNCOR_UCODE_PARITY_ERR (5)
+#define GSC_UNCOR_ROM_PARITY_ERR (4)
+#define GSC_UNCOR_WDG_TIMEOUT_ERR (3)
+#define GSC_UNCOR_SRAM_ECC_ERR (2)
+#define GSC_UNCOR_MIA_INT_ERR (1)
+#define GSC_UNCOR_MIA_SHUTDOWN_ERR (0)
+
+#define GSC_HW_ERROR_MAX_ERR_BITS 12
+
+#define GSC_HEC_CORR_FW_ERR_DW0(base) _MMIO((base) + 0x130)
+#define GSC_HEC_UNCORR_FW_ERR_DW0(base) _MMIO((base) + 0x124)
+
+#define BANK_SPARNG_ERR_MITIGATION_DOWNGRADED 0x1000
+#define BANK_SPARNG_DIS_PCLS_EXCEEDED 0x1001
+#define BANK_SPARNG_ENA_PCLS_UNCORRECTABLE 0x1002
+
#endif
diff --git a/drivers/gpu/drm/xe/xe_device_types.h b/drivers/gpu/drm/xe/xe_device_types.h
index ecabf4d6690d..68e59bfe4ab8 100644
--- a/drivers/gpu/drm/xe/xe_device_types.h
+++ b/drivers/gpu/drm/xe/xe_device_types.h
@@ -101,6 +101,8 @@ struct xe_device {
bool enable_display;
/** @has_gt_error_vectors: whether platform supports ERROR VECTORS */
bool has_gt_error_vectors;
+ /** @has_mem_sparing: whether platform supports HBM status reporting */
+ bool has_mem_sparing;
#if IS_ENABLED(CONFIG_DRM_XE_DISPLAY)
struct xe_device_display_info {
diff --git a/drivers/gpu/drm/xe/xe_gt_types.h b/drivers/gpu/drm/xe/xe_gt_types.h
index bd4a85959df3..4275e8cf5d8c 100644
--- a/drivers/gpu/drm/xe/xe_gt_types.h
+++ b/drivers/gpu/drm/xe/xe_gt_types.h
@@ -78,6 +78,22 @@ enum intel_soc_ieh_reg_type {
INTEL_SOC_REG_GLOBAL
};
+enum intel_gsc_hw_errors {
+ INTEL_GSC_HW_ERROR_COR_SRAM_ECC = 0,
+ INTEL_GSC_HW_ERROR_UNCOR_MIA_SHUTDOWN,
+ INTEL_GSC_HW_ERROR_UNCOR_MIA_INT,
+ INTEL_GSC_HW_ERROR_UNCOR_SRAM_ECC,
+ INTEL_GSC_HW_ERROR_UNCOR_WDG_TIMEOUT,
+ INTEL_GSC_HW_ERROR_UNCOR_ROM_PARITY,
+ INTEL_GSC_HW_ERROR_UNCOR_UCODE_PARITY,
+ INTEL_GSC_HW_ERROR_UNCOR_GLITCH_DET,
+ INTEL_GSC_HW_ERROR_UNCOR_FUSE_PULL,
+ INTEL_GSC_HW_ERROR_UNCOR_FUSE_CRC_CHECK,
+ INTEL_GSC_HW_ERROR_UNCOR_SELFMBIST,
+ INTEL_GSC_HW_ERROR_UNCOR_AON_PARITY,
+ INTEL_GSC_HW_ERROR_COUNT
+};
+
void xe_gt_log_driver_error(struct xe_gt *gt,
const enum xe_gt_driver_errors error,
const char *fmt, ...);
@@ -125,6 +141,18 @@ enum xe_steering_type {
NUM_STEERING_TYPES
};
+struct xe_mem_sparing_event {
+ struct work_struct mem_health_work;
+ u32 cause;
+ enum {
+ MEM_HEALTH_OKAY = 0,
+ MEM_HEALTH_ALARM,
+ MEM_HEALTH_EC_PENDING,
+ MEM_HEALTH_DEGRADED,
+ MEM_HEALTH_UNKNOWN
+ } health_status;
+};
+
/**
* struct xe_gt - Top level struct of a graphics tile
*
@@ -409,11 +437,17 @@ struct xe_gt {
struct intel_hw_errors {
unsigned long hw[INTEL_GT_HW_ERROR_COUNT];
+ unsigned long gsc_hw[INTEL_GSC_HW_ERROR_COUNT];
struct xarray soc;
unsigned long sgunit[HARDWARE_ERROR_MAX];
unsigned long driver[INTEL_GT_DRIVER_ERROR_COUNT];
} errors;
+ struct work_struct gsc_hw_error_work;
+
+ /* Memory sparing data structure for errors reporting on root tile */
+ struct xe_mem_sparing_event mem_sparing;
+
};
#define SOC_HW_ERR_SHIFT ilog2(SOC_HW_ERR_MAX_BITS)
diff --git a/drivers/gpu/drm/xe/xe_irq.c b/drivers/gpu/drm/xe/xe_irq.c
index c047d9b66a7c..67beeec8c187 100644
--- a/drivers/gpu/drm/xe/xe_irq.c
+++ b/drivers/gpu/drm/xe/xe_irq.c
@@ -19,7 +19,13 @@
#include "xe_hw_engine.h"
#include "xe_mmio.h"
+/*TODO Move these macros to headers later to be used by UAPI */
#define HAS_GT_ERROR_VECTORS(xe) ((xe)->info.has_gt_error_vectors)
+#define HAS_MEM_SPARING_SUPPORT(xe) ((xe)->info.has_mem_sparing)
+
+/* TODO move to correct UAPI header */
+#define XE_MEMORY_HEALTH_UEVENT "MEMORY_HEALTH"
+
static void gen3_assert_iir_is_zero(struct xe_gt *gt, i915_reg_t reg)
{
u32 val = xe_mmio_read32(gt, reg.reg);
@@ -864,6 +870,206 @@ xe_soc_hw_error_handler(struct xe_gt *gt, const enum hardware_error hw_err)
(HARDWARE_ERROR_MAX << 1) + 1);
}
+static void xe_gsc_hw_error_work(struct work_struct *work)
+{
+ struct xe_gt *gt =
+ container_of(work, typeof(*gt), gsc_hw_error_work);
+ char *csc_hw_error_event[3];
+
+ csc_hw_error_event[0] = XE_MEMORY_HEALTH_UEVENT "=1";
+ csc_hw_error_event[1] = "SPARING_STATUS_UNKNOWN=1 RESET_REQUIRED=1";
+ csc_hw_error_event[2] = NULL;
+ gt->mem_sparing.health_status = MEM_HEALTH_UNKNOWN;
+
+ dev_notice(gt_to_xe(gt)->drm.dev, "Unknown memory health status, Reset Required\n");
+ kobject_uevent_env(&gt_to_xe(gt)->drm.primary->kdev->kobj, KOBJ_CHANGE,
+ csc_hw_error_event);
+}
+
+static void xe_mem_health_work(struct work_struct *work)
+{
+ struct xe_gt *gt =
+ container_of(work, typeof(*gt), mem_sparing.mem_health_work);
+ u32 cause;
+ int event_idx = 0;
+ char *sparing_event[3];
+
+ spin_lock_irq(&gt_to_xe(gt)->irq.lock);
+ cause = gt->mem_sparing.cause;
+ gt->mem_sparing.cause = 0;
+ spin_unlock_irq(&gt_to_xe(gt)->irq.lock);
+ if (!cause)
+ return;
+
+ sparing_event[event_idx++] = XE_MEMORY_HEALTH_UEVENT "=1";
+ switch (cause) {
+ case BANK_SPARNG_ERR_MITIGATION_DOWNGRADED:
+ gt->mem_sparing.health_status = MEM_HEALTH_ALARM;
+ sparing_event[event_idx++] = "MEM_HEALTH_ALARM=1";
+ dev_notice(gt_to_xe(gt)->drm.dev,
+ "Memory Health Report: Error occurred - No action required.\n"
+ "Error Cause: 0x%x\n", cause);
+ break;
+ case BANK_SPARNG_DIS_PCLS_EXCEEDED:
+ gt->mem_sparing.health_status = MEM_HEALTH_EC_PENDING;
+ sparing_event[event_idx++] = "RESET_REQUIRED=1 EC_PENDING=1";
+ dev_crit(gt_to_xe(gt)->drm.dev,
+ "Memory Health Report: Error correction pending.\n"
+ "Card needs to be reset.\n"
+ "Memory might now be functioning in unreliable state.\n"
+ "Error Cause: 0x%x\n", cause);
+ add_taint(TAINT_MACHINE_CHECK, LOCKDEP_STILL_OK);
+ break;
+ case BANK_SPARNG_ENA_PCLS_UNCORRECTABLE:
+ gt->mem_sparing.health_status = MEM_HEALTH_DEGRADED;
+ sparing_event[event_idx++] = "DEGRADED=1 EC_FAILED=1";
+ dev_crit(gt_to_xe(gt)->drm.dev,
+ "Memory Health Report: Memory Health degraded, and runtime fix not feasible.\n"
+ "Replacing card might be the best option.\n"
+ "Error Cause: 0x%x\n", cause);
+ add_taint(TAINT_MACHINE_CHECK, LOCKDEP_STILL_OK);
+ break;
+ default:
+ gt->mem_sparing.health_status = MEM_HEALTH_UNKNOWN;
+ sparing_event[event_idx++] = "SPARING_STATUS_UNKNOWN=1";
+ dev_notice(gt_to_xe(gt)->drm.dev,
+ "Unknown memory health status\n");
+ break;
+ }
+
+ sparing_event[event_idx++] = NULL;
+
+ kobject_uevent_env(&gt_to_xe(gt)->drm.primary->kdev->kobj, KOBJ_CHANGE,
+ sparing_event);
+}
+
+static void
+xe_gsc_hw_error_handler(struct xe_gt *gt, const enum hardware_error hw_err)
+{
+ u32 base = PVC_GSC_HECI1_BASE;
+ unsigned long err_status;
+ u32 errbit;
+
+ if (!HAS_MEM_SPARING_SUPPORT(gt_to_xe(gt)))
+ return;
+
+ lockdep_assert_held(&gt_to_xe(gt)->irq.lock);
+
+ err_status = xe_mmio_read32(gt, GSC_HEC_CORR_UNCORR_ERR_STATUS(base, hw_err).reg);
+ if (unlikely(!err_status))
+ return;
+
+ switch (hw_err) {
+ case HARDWARE_ERROR_CORRECTABLE:
+ for_each_set_bit(errbit, &err_status, GSC_HW_ERROR_MAX_ERR_BITS) {
+ u32 err_type = GSC_HW_ERROR_MAX_ERR_BITS;
+ const char *name;
+
+ switch (errbit) {
+ case GSC_COR_SRAM_ECC_SINGLE_BIT_ERR:
+ name = "Single bit error on CSME SRAM";
+ err_type = INTEL_GSC_HW_ERROR_COR_SRAM_ECC;
+ break;
+ case GSC_COR_FW_REPORTED_ERR:
+ gt->mem_sparing.cause |=
+ xe_mmio_read32(gt, GSC_HEC_CORR_FW_ERR_DW0(base).reg);
+ if (unlikely(!gt->mem_sparing.cause))
+ goto re_enable_interrupt;
+ schedule_work(&gt->mem_sparing.mem_health_work);
+ break;
+ default:
+ name = "Unknown";
+ break;
+ }
+
+ if (err_type != GSC_HW_ERROR_MAX_ERR_BITS)
+ gt->errors.gsc_hw[err_type]++;
+
+ if (errbit != GSC_COR_FW_REPORTED_ERR)
+ drm_err_ratelimited(&gt_to_xe(gt)->drm, HW_ERR
+ "%s GSC Correctable Error, GSC_HEC_CORR_ERR_STATUS:0x%08lx\n",
+ name, err_status);
+ }
+ break;
+ case HARDWARE_ERROR_NONFATAL:
+ for_each_set_bit(errbit, &err_status, GSC_HW_ERROR_MAX_ERR_BITS) {
+ u32 err_type = GSC_HW_ERROR_MAX_ERR_BITS;
+ const char *name;
+
+ switch (errbit) {
+ case GSC_UNCOR_MIA_SHUTDOWN_ERR:
+ name = "MinuteIA Unexpected Shutdown";
+ err_type = INTEL_GSC_HW_ERROR_UNCOR_MIA_SHUTDOWN;
+ break;
+ case GSC_UNCOR_MIA_INT_ERR:
+ name = "MinuteIA Internal Error";
+ err_type = INTEL_GSC_HW_ERROR_UNCOR_MIA_INT;
+ break;
+ case GSC_UNCOR_SRAM_ECC_ERR:
+ name = "Double bit error on CSME SRAM";
+ err_type = INTEL_GSC_HW_ERROR_UNCOR_SRAM_ECC;
+ break;
+ case GSC_UNCOR_WDG_TIMEOUT_ERR:
+ name = "WDT 2nd Timeout";
+ err_type = INTEL_GSC_HW_ERROR_UNCOR_WDG_TIMEOUT;
+ break;
+ case GSC_UNCOR_ROM_PARITY_ERR:
+ name = "ROM has a parity error";
+ err_type = INTEL_GSC_HW_ERROR_UNCOR_ROM_PARITY;
+ break;
+ case GSC_UNCOR_UCODE_PARITY_ERR:
+ name = "Ucode has a parity error";
+ err_type = INTEL_GSC_HW_ERROR_UNCOR_UCODE_PARITY;
+ break;
+ case GSC_UNCOR_FW_REPORTED_ERR:
+ name = "Errors Reported to FW and Detected by FW";
+ break;
+ case GSC_UNCOR_GLITCH_DET_ERR:
+ name = "Glitch is detected on voltage rail";
+ err_type = INTEL_GSC_HW_ERROR_UNCOR_GLITCH_DET;
+ break;
+ case GSC_UNCOR_FUSE_PULL_ERR:
+ name = "Fuse Pull Error";
+ err_type = INTEL_GSC_HW_ERROR_UNCOR_FUSE_PULL;
+ break;
+ case GSC_UNCOR_FUSE_CRC_CHECK_ERR:
+ name = "Fuse CRC Check Failed on Fuse Pull";
+ err_type = INTEL_GSC_HW_ERROR_UNCOR_FUSE_CRC_CHECK;
+ break;
+ case GSC_UNCOR_SELFMBIST_ERR:
+ name = "Self Mbist Failed";
+ err_type = INTEL_GSC_HW_ERROR_UNCOR_SELFMBIST;
+ break;
+ case GSC_UNCOR_AON_PARITY_ERR:
+ name = "AON RF has parity error";
+ err_type = INTEL_GSC_HW_ERROR_UNCOR_AON_PARITY;
+ break;
+ default:
+ name = "Unknown";
+ break;
+ }
+
+ if (err_type != GSC_HW_ERROR_MAX_ERR_BITS)
+ gt->errors.gsc_hw[err_type]++;
+
+ schedule_work(&gt->gsc_hw_error_work);
+ drm_err_ratelimited(&gt_to_xe(gt)->drm, HW_ERR
+ "%s GSC NON_FATAL Error, GSC_HEC_UNCORR_ERR_STATUS:0x%08lx\n",
+ name, err_status);
+ }
+ break;
+ case HARDWARE_ERROR_FATAL:
+ /* GSC error not handled for Fatal Error status */
+ drm_err_ratelimited(&gt_to_xe(gt)->drm,
+ HW_ERR "Fatal Memory Error Detected\n");
+ default:
+ break;
+ }
+
+re_enable_interrupt:
+ xe_mmio_write32(gt, GSC_HEC_CORR_UNCORR_ERR_STATUS(base, hw_err).reg, err_status);
+}
+
static void
xe_hw_error_source_handler(struct xe_gt *gt, const enum hardware_error hw_err)
{
@@ -888,6 +1094,10 @@ xe_hw_error_source_handler(struct xe_gt *gt, const enum hardware_error hw_err)
if (errsrc & DEV_ERR_STAT_SOC_ERROR)
xe_soc_hw_error_handler(gt, hw_err);
+ /* Memory health status is being tracked on root tile only */
+ if ((errsrc & DEV_ERR_STAT_GSC_ERROR) && gt->info.id == 0)
+ xe_gsc_hw_error_handler(gt, hw_err);
+
xe_mmio_write32(gt, DEV_ERR_STAT_REG(hw_err).reg, errsrc);
out_unlock:
@@ -1133,10 +1343,16 @@ static void process_hw_errors(struct xe_device *xe)
int xe_irq_install(struct xe_device *xe)
{
+ struct xe_gt *gt0 = xe_device_get_gt(xe, 0);
int irq = to_pci_dev(xe->drm.dev)->irq;
irq_handler_t irq_handler;
int err;
+ if (HAS_MEM_SPARING_SUPPORT(xe)) {
+ INIT_WORK(&gt0->gsc_hw_error_work, xe_gsc_hw_error_work);
+ INIT_WORK(&gt0->mem_sparing.mem_health_work, xe_mem_health_work);
+ }
+
if (IS_DGFX(xe))
process_hw_errors(xe);
diff --git a/drivers/gpu/drm/xe/xe_pci.c b/drivers/gpu/drm/xe/xe_pci.c
index 69098194cef8..6eec7f4490df 100644
--- a/drivers/gpu/drm/xe/xe_pci.c
+++ b/drivers/gpu/drm/xe/xe_pci.c
@@ -74,6 +74,7 @@ struct xe_device_desc {
bool has_asid;
bool has_link_copy_engine;
bool has_gt_error_vectors;
+ bool has_mem_sparing;
};
__diag_push();
@@ -234,6 +235,7 @@ static const struct xe_device_desc pvc_desc = {
.has_asid = true,
.has_link_copy_engine = true,
.has_gt_error_vectors = true,
+ .has_mem_sparing = true,
};
#define MTL_MEDIA_ENGINES \
@@ -421,6 +423,7 @@ static int xe_pci_probe(struct pci_dev *pdev, const struct pci_device_id *ent)
xe->info.supports_usm = desc->supports_usm;
xe->info.has_asid = desc->has_asid;
xe->info.has_gt_error_vectors = desc->has_gt_error_vectors;
+ xe->info.has_mem_sparing = desc->has_mem_sparing;
xe->info.has_flat_ccs = desc->has_flat_ccs;
xe->info.has_4tile = desc->has_4tile;
xe->info.has_range_tlb_invalidation = desc->has_range_tlb_invalidation;
--
2.25.1
^ permalink raw reply related [flat|nested] 22+ messages in thread
* Re: [Intel-xe] [PATCH 0/4] RFC: drm/xe/ras: Supporting RAS on XE.
2023-04-06 9:26 [Intel-xe] [PATCH 0/4] RFC: drm/xe/ras: Supporting RAS on XE Himal Prasad Ghimiray
` (4 preceding siblings ...)
2023-04-06 9:26 ` [Intel-xe] [PATCH 4/4] drm/xe/ras: Add support for reporting CSC HW and FW errors Himal Prasad Ghimiray
@ 2023-04-06 12:25 ` Jani Nikula
2023-04-26 12:14 ` Ghimiray, Himal Prasad
5 siblings, 1 reply; 22+ messages in thread
From: Jani Nikula @ 2023-04-06 12:25 UTC (permalink / raw)
To: Himal Prasad Ghimiray, intel-xe; +Cc: Himal Prasad Ghimiray
On Thu, 06 Apr 2023, Himal Prasad Ghimiray <himal.prasad.ghimiray@intel.com> wrote:
> These patches in series are for adding Reliability,
> Availability and Serviceability support on xe.
> Patches provide the infra for various hardware error
> counting and logging. These error counters will be exposed to
> userspace in subsequent patches.
> In current patches:
> 1) We are adding support to handle new interrupts bits.
> 2) Counting of GT errors.
> 3) Soc/SGunit error counting.
> 4) CSC HW and FW error counting and sending uvent.
>
> Akeem G Abodunrin (1):
> drm/xe/ras: Add support for reporting CSC HW and FW errors.
>
> Aravind Iddamsetty (2):
> drm/xe/ras: Log the GT hw errors.
> drm/xe/ras: Count SOC and SGUNIT errors
>
> Himal Prasad Ghimiray (1):
> drm/xe: Handle GRF/IC ECC error irq
>
> drivers/gpu/drm/xe/regs/xe_regs.h | 244 ++++++++
Please don't recreate i915_reg.h in xe. Please add separate regs files
like we've been doing in i915. It's a pain to split a monster register
file later.
BR,
Jani.
> drivers/gpu/drm/xe/xe_device.c | 6 +
> drivers/gpu/drm/xe/xe_device_types.h | 4 +
> drivers/gpu/drm/xe/xe_gt.c | 30 +
> drivers/gpu/drm/xe/xe_gt_types.h | 105 ++++
> drivers/gpu/drm/xe/xe_irq.c | 824 +++++++++++++++++++++++++++
> drivers/gpu/drm/xe/xe_pci.c | 6 +
> 7 files changed, 1219 insertions(+)
--
Jani Nikula, Intel Open Source Graphics Center
^ permalink raw reply [flat|nested] 22+ messages in thread
* Re: [Intel-xe] [PATCH 0/4] RFC: drm/xe/ras: Supporting RAS on XE.
2023-04-06 12:25 ` [Intel-xe] [PATCH 0/4] RFC: drm/xe/ras: Supporting RAS on XE Jani Nikula
@ 2023-04-26 12:14 ` Ghimiray, Himal Prasad
2023-05-02 9:38 ` Jani Nikula
0 siblings, 1 reply; 22+ messages in thread
From: Ghimiray, Himal Prasad @ 2023-04-26 12:14 UTC (permalink / raw)
To: Jani Nikula, intel-xe@lists.freedesktop.org
Hi Jani,
Is the recommendation to create a new .h file for error-related registers?
Can I go ahead and add a file xe_gt_error_regs.h (GT, SOC, GSC) that explicitly covers the registers related to error handling?
BR
Himal Ghimiray
> -----Original Message-----
> From: Jani Nikula <jani.nikula@linux.intel.com>
> Sent: 06 April 2023 17:56
> To: Ghimiray, Himal Prasad <himal.prasad.ghimiray@intel.com>; intel-
> xe@lists.freedesktop.org
> Cc: Ghimiray, Himal Prasad <himal.prasad.ghimiray@intel.com>
> Subject: Re: [Intel-xe] [PATCH 0/4] RFC: drm/xe/ras: Supporting RAS on XE.
>
> On Thu, 06 Apr 2023, Himal Prasad Ghimiray
> <himal.prasad.ghimiray@intel.com> wrote:
> > These patches in series are for adding Reliability, Availability and
> > Serviceability support on xe.
> > Patches provide the infra for various hardware error counting and
> > logging. These error counters will be exposed to userspace in
> > subsequent patches.
> > In current patches:
> > 1) We are adding support to handle new interrupts bits.
> > 2) Counting of GT errors.
> > 3) Soc/SGunit error counting.
> > 4) CSC HW and FW error counting and sending uvent.
> >
> > Akeem G Abodunrin (1):
> > drm/xe/ras: Add support for reporting CSC HW and FW errors.
> >
> > Aravind Iddamsetty (2):
> > drm/xe/ras: Log the GT hw errors.
> > drm/xe/ras: Count SOC and SGUNIT errors
> >
> > Himal Prasad Ghimiray (1):
> > drm/xe: Handle GRF/IC ECC error irq
> >
> > drivers/gpu/drm/xe/regs/xe_regs.h | 244 ++++++++
>
> Please don't recreate i915_reg.h in xe. Please add separate regs files like
> we've been doing in i915. It's pain to split a monster register file later.
>
> BR,
> Jani.
>
>
> > drivers/gpu/drm/xe/xe_device.c | 6 +
> > drivers/gpu/drm/xe/xe_device_types.h | 4 +
> > drivers/gpu/drm/xe/xe_gt.c | 30 +
> > drivers/gpu/drm/xe/xe_gt_types.h | 105 ++++
> > drivers/gpu/drm/xe/xe_irq.c | 824
> +++++++++++++++++++++++++++
> > drivers/gpu/drm/xe/xe_pci.c | 6 +
> > 7 files changed, 1219 insertions(+)
>
> --
> Jani Nikula, Intel Open Source Graphics Center
^ permalink raw reply [flat|nested] 22+ messages in thread
* Re: [Intel-xe] [PATCH 0/4] RFC: drm/xe/ras: Supporting RAS on XE.
2023-04-26 12:14 ` Ghimiray, Himal Prasad
@ 2023-05-02 9:38 ` Jani Nikula
2023-05-02 19:58 ` Rodrigo Vivi
0 siblings, 1 reply; 22+ messages in thread
From: Jani Nikula @ 2023-05-02 9:38 UTC (permalink / raw)
To: Ghimiray, Himal Prasad, intel-xe@lists.freedesktop.org
Cc: Matt Roper, Lucas De Marchi, Rodrigo Vivi
On Wed, 26 Apr 2023, "Ghimiray, Himal Prasad" <himal.prasad.ghimiray@intel.com> wrote:
> Hi Jani,
>
> Is recommendation to create new .h file for error related registers ?
> Can I go ahead with adding file xe_gt_error_regs.h (GT, SOC, GSC) which explicitly mentions registers related to error handling ?
I don't know what the best grouping for this stuff would be. Maybe I'd
go for grouping by hardware blocks rather than functionality like
errors. Cc: Lucas, Matt, Rodrigo, just to pick a few names who might
have a better idea.
Just don't dump register macros to a single file that will bloat to
become unmanageable.
BR,
Jani.
PS. Please also don't top-post on mailing lists.
>
> BR
> Himal Ghimiray
>
>
>> -----Original Message-----
>> From: Jani Nikula <jani.nikula@linux.intel.com>
>> Sent: 06 April 2023 17:56
>> To: Ghimiray, Himal Prasad <himal.prasad.ghimiray@intel.com>; intel-
>> xe@lists.freedesktop.org
>> Cc: Ghimiray, Himal Prasad <himal.prasad.ghimiray@intel.com>
>> Subject: Re: [Intel-xe] [PATCH 0/4] RFC: drm/xe/ras: Supporting RAS on XE.
>>
>> On Thu, 06 Apr 2023, Himal Prasad Ghimiray
>> <himal.prasad.ghimiray@intel.com> wrote:
>> > These patches in series are for adding Reliability, Availability and
>> > Serviceability support on xe.
>> > Patches provide the infra for various hardware error counting and
>> > logging. These error counters will be exposed to userspace in
>> > subsequent patches.
>> > In current patches:
>> > 1) We are adding support to handle new interrupts bits.
>> > 2) Counting of GT errors.
>> > 3) Soc/SGunit error counting.
>> > 4) CSC HW and FW error counting and sending uvent.
>> >
>> > Akeem G Abodunrin (1):
>> > drm/xe/ras: Add support for reporting CSC HW and FW errors.
>> >
>> > Aravind Iddamsetty (2):
>> > drm/xe/ras: Log the GT hw errors.
>> > drm/xe/ras: Count SOC and SGUNIT errors
>> >
>> > Himal Prasad Ghimiray (1):
>> > drm/xe: Handle GRF/IC ECC error irq
>> >
>> > drivers/gpu/drm/xe/regs/xe_regs.h | 244 ++++++++
>>
>> Please don't recreate i915_reg.h in xe. Please add separate regs files like
>> we've been doing in i915. It's pain to split a monster register file later.
>>
>> BR,
>> Jani.
>>
>>
>> > drivers/gpu/drm/xe/xe_device.c | 6 +
>> > drivers/gpu/drm/xe/xe_device_types.h | 4 +
>> > drivers/gpu/drm/xe/xe_gt.c | 30 +
>> > drivers/gpu/drm/xe/xe_gt_types.h | 105 ++++
>> > drivers/gpu/drm/xe/xe_irq.c | 824
>> +++++++++++++++++++++++++++
>> > drivers/gpu/drm/xe/xe_pci.c | 6 +
>> > 7 files changed, 1219 insertions(+)
>>
>> --
>> Jani Nikula, Intel Open Source Graphics Center
--
Jani Nikula, Intel Open Source Graphics Center
^ permalink raw reply [flat|nested] 22+ messages in thread
* Re: [Intel-xe] [PATCH 0/4] RFC: drm/xe/ras: Supporting RAS on XE.
2023-05-02 9:38 ` Jani Nikula
@ 2023-05-02 19:58 ` Rodrigo Vivi
2023-05-02 20:41 ` Lucas De Marchi
0 siblings, 1 reply; 22+ messages in thread
From: Rodrigo Vivi @ 2023-05-02 19:58 UTC (permalink / raw)
To: Jani Nikula
Cc: Lucas De Marchi, Matt Roper, Ghimiray, Himal Prasad,
intel-xe@lists.freedesktop.org, Rodrigo Vivi
On Tue, May 02, 2023 at 12:38:21PM +0300, Jani Nikula wrote:
> On Wed, 26 Apr 2023, "Ghimiray, Himal Prasad" <himal.prasad.ghimiray@intel.com> wrote:
> > Hi Jani,
> >
> > Is recommendation to create new .h file for error related registers ?
> > Can I go ahead with adding file xe_gt_error_regs.h (GT, SOC, GSC) which explicitly mentions registers related to error handling ?
>
> I don't know what the best grouping for this stuff would be. Maybe I'd
> go for grouping by hardware blocks rather than functionality like
> errors. Cc: Lucas, Matt, Rodrigo, just to pick a few names who might
> have a better idea.
I believe the right way is to group by the IP block and/or reset domain,
rather than by functionality.
But Lucas is probably the best one to guide us here. He has some ideas
of tools to generate the regs we use from specs and the organization
might be impacted.
>
> Just don't dump register macros to a single file that will bloat to
> become unmanageable.
>
> BR,
> Jani.
>
>
> PS. Please also don't top-post on mailing lists.
>
>
>
> >
> > BR
> > Himal Ghimiray
> >
> >
> >> -----Original Message-----
> >> From: Jani Nikula <jani.nikula@linux.intel.com>
> >> Sent: 06 April 2023 17:56
> >> To: Ghimiray, Himal Prasad <himal.prasad.ghimiray@intel.com>; intel-
> >> xe@lists.freedesktop.org
> >> Cc: Ghimiray, Himal Prasad <himal.prasad.ghimiray@intel.com>
> >> Subject: Re: [Intel-xe] [PATCH 0/4] RFC: drm/xe/ras: Supporting RAS on XE.
> >>
> >> On Thu, 06 Apr 2023, Himal Prasad Ghimiray
> >> <himal.prasad.ghimiray@intel.com> wrote:
> >> > These patches in series are for adding Reliability, Availability and
> >> > Serviceability support on xe.
> >> > Patches provide the infra for various hardware error counting and
> >> > logging. These error counters will be exposed to userspace in
> >> > subsequent patches.
> >> > In current patches:
> >> > 1) We are adding support to handle new interrupts bits.
> >> > 2) Counting of GT errors.
> >> > 3) Soc/SGunit error counting.
> >> > 4) CSC HW and FW error counting and sending uvent.
> >> >
> >> > Akeem G Abodunrin (1):
> >> > drm/xe/ras: Add support for reporting CSC HW and FW errors.
> >> >
> >> > Aravind Iddamsetty (2):
> >> > drm/xe/ras: Log the GT hw errors.
> >> > drm/xe/ras: Count SOC and SGUNIT errors
> >> >
> >> > Himal Prasad Ghimiray (1):
> >> > drm/xe: Handle GRF/IC ECC error irq
> >> >
> >> > drivers/gpu/drm/xe/regs/xe_regs.h | 244 ++++++++
> >>
> >> Please don't recreate i915_reg.h in xe. Please add separate regs files like
> >> we've been doing in i915. It's pain to split a monster register file later.
> >>
> >> BR,
> >> Jani.
> >>
> >>
> >> > drivers/gpu/drm/xe/xe_device.c | 6 +
> >> > drivers/gpu/drm/xe/xe_device_types.h | 4 +
> >> > drivers/gpu/drm/xe/xe_gt.c | 30 +
> >> > drivers/gpu/drm/xe/xe_gt_types.h | 105 ++++
> >> > drivers/gpu/drm/xe/xe_irq.c | 824 +++++++++++++++++++++++++++
> >> > drivers/gpu/drm/xe/xe_pci.c | 6 +
> >> > 7 files changed, 1219 insertions(+)
> >>
> >> --
> >> Jani Nikula, Intel Open Source Graphics Center
>
> --
> Jani Nikula, Intel Open Source Graphics Center
^ permalink raw reply [flat|nested] 22+ messages in thread
* Re: [Intel-xe] [PATCH 0/4] RFC: drm/xe/ras: Supporting RAS on XE.
2023-05-02 19:58 ` Rodrigo Vivi
@ 2023-05-02 20:41 ` Lucas De Marchi
0 siblings, 0 replies; 22+ messages in thread
From: Lucas De Marchi @ 2023-05-02 20:41 UTC (permalink / raw)
To: Rodrigo Vivi
Cc: Ghimiray, Himal Prasad, Matt Roper,
intel-xe@lists.freedesktop.org, Rodrigo Vivi
On Tue, May 02, 2023 at 03:58:45PM -0400, Rodrigo Vivi wrote:
>On Tue, May 02, 2023 at 12:38:21PM +0300, Jani Nikula wrote:
>> On Wed, 26 Apr 2023, "Ghimiray, Himal Prasad" <himal.prasad.ghimiray@intel.com> wrote:
>> > Hi Jani,
>> >
>> > Is the recommendation to create a new .h file for error-related registers?
>> > Can I go ahead and add a file xe_gt_error_regs.h (GT, SOC, GSC) that explicitly covers the registers related to error handling?
>>
>> I don't know what the best grouping for this stuff would be. Maybe I'd
>> go for grouping by hardware blocks rather than functionality like
>> errors. Cc: Lucas, Matt, Rodrigo, just to pick a few names who might
>> have a better idea.
>
>I believe the right way is to group by the IP block and/or reset domain,
>rather than by functionality.
>
>But Lucas is probably the best one to guide us here. He has some ideas
>of tools to generate the regs we use from specs and the organization
>might be impacted.
What currently guides me when adding registers is the "Graphics Register
Address Map" from bspec. Example: 53616
Note that what we currently have comes mostly from i915 after a lot of
re-organization and changes to adhere to the coding style and
conventions. But the per-file split was kept as is from i915, which I
believe was done by Matt Roper.
My long-term goal is to have a tool to generate these headers directly
from the spec, but I think we are far from it as there are more urgent
things to be done. So:
1) Don't dump everything in xe_regs.h. Just by looking at the offsets
you can get a good idea of where they should be located.
2) Don't append to the end of the header. Sort registers by offset.
3) Don't add registers/bitfields that are not used in the code.
Preferably add them together with the code that uses them.
4) Follow the coding style in these files, now that they are mostly
clean.
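As a rough illustration of points 1) to 3), a per-block register header
could look something like the sketch below. Every name and offset in it
(the EXAMPLE_ prefix, 0x1000, 0x1008, the bit positions) is hypothetical
and chosen only to show the layout; none of it is taken from bspec or
from the existing xe headers, and the static_assert guard is just one
possible way to keep the ordering honest.

```c
#include <assert.h>
#include <stdint.h>

/*
 * Hypothetical sketch: none of these names or offsets come from bspec or
 * from the xe driver. They only illustrate the layout conventions.
 */
#define EXAMPLE_REG_BIT(n)		(UINT32_C(1) << (n))

/* Registers appear in ascending offset order, bitfields indented below. */
#define EXAMPLE_ERR_STAT_CORRECTABLE	0x1000
#define   EXAMPLE_ERR_GT		EXAMPLE_REG_BIT(0)
#define   EXAMPLE_ERR_SOC		EXAMPLE_REG_BIT(16)

#define EXAMPLE_ERR_STAT_FATAL		0x1008
#define   EXAMPLE_ERR_FATAL_GT		EXAMPLE_REG_BIT(0)

/* Compile-time guard: the build fails if these two end up out of order. */
static_assert(EXAMPLE_ERR_STAT_CORRECTABLE < EXAMPLE_ERR_STAT_FATAL,
	      "keep registers sorted by offset");

/* Each bitfield defined above has a user in the same change. */
static inline int example_gt_error_pending(uint32_t correctable, uint32_t fatal)
{
	return (correctable & EXAMPLE_ERR_GT) || (fatal & EXAMPLE_ERR_FATAL_GT);
}

static inline int example_soc_error_pending(uint32_t correctable)
{
	return (correctable & EXAMPLE_ERR_SOC) != 0;
}
```

The helpers take raw register values rather than doing MMIO reads so the
fragment stays self-contained; in a driver the values would come from
whatever register accessor the code already uses.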
Lucas De Marchi
^ permalink raw reply [flat|nested] 22+ messages in thread