Re: [PATCH v3 4/4] drm/xe/ras: Introduce correctable error handling

From: Raag Jadav <raag.jadav@intel.com>
To: "Tauro, Riana" <riana.tauro@intel.com>
Cc: intel-xe@lists.freedesktop.org, matthew.brost@intel.com,
	rodrigo.vivi@intel.com, michal.wajdeczko@intel.com,
	matthew.d.roper@intel.com, umesh.nerlige.ramappa@intel.com,
	mallesh.koujalagi@intel.com, soham.purkait@intel.com,
	anoop.c.vijay@intel.com, aravind.iddamsetty@linux.intel.com
Subject: Re: [PATCH v3 4/4] drm/xe/ras: Introduce correctable error handling
Date: Mon, 23 Mar 2026 12:45:49 +0100
Message-ID: <acEn7YTv53VZL6Jj@black.igk.intel.com>
In-Reply-To: <3aa996d4-1ae2-4bdf-9a07-0c6c345e9328@intel.com>

On Thu, Mar 19, 2026 at 07:30:06PM +0530, Tauro, Riana wrote:
> On 3/12/2026 2:36 PM, Raag Jadav wrote:
> > Add initial support for correctable error handling which is serviced
> > using system controller event. Currently we only log the errors in
> > dmesg but this serves as a foundation for RAS infrastructure and will
> > be further extended to facilitate other RAS features.
> > 
> > Signed-off-by: Raag Jadav <raag.jadav@intel.com>
> > ---
> >   drivers/gpu/drm/xe/Makefile           |  1 +
> >   drivers/gpu/drm/xe/xe_ras.c           | 89 +++++++++++++++++++++++++++
> >   drivers/gpu/drm/xe/xe_ras.h           | 14 +++++
> >   drivers/gpu/drm/xe/xe_ras_types.h     | 73 ++++++++++++++++++++++
> >   drivers/gpu/drm/xe/xe_sysctrl_event.c |  3 +-
> >   5 files changed, 179 insertions(+), 1 deletion(-)
> >   create mode 100644 drivers/gpu/drm/xe/xe_ras.c
> >   create mode 100644 drivers/gpu/drm/xe/xe_ras.h
> >   create mode 100644 drivers/gpu/drm/xe/xe_ras_types.h
> > 
> > diff --git a/drivers/gpu/drm/xe/Makefile b/drivers/gpu/drm/xe/Makefile
> > index 373adb20afb2..9811cf732260 100644
> > --- a/drivers/gpu/drm/xe/Makefile
> > +++ b/drivers/gpu/drm/xe/Makefile
> > @@ -113,6 +113,7 @@ xe-y += xe_bb.o \
> >   	xe_pxp_submit.o \
> >   	xe_query.o \
> >   	xe_range_fence.o \
> > +	xe_ras.o \
> >   	xe_reg_sr.o \
> >   	xe_reg_whitelist.o \
> >   	xe_ring_ops.o \
> > diff --git a/drivers/gpu/drm/xe/xe_ras.c b/drivers/gpu/drm/xe/xe_ras.c
> > new file mode 100644
> > index 000000000000..37a996a6abf8
> > --- /dev/null
> > +++ b/drivers/gpu/drm/xe/xe_ras.c
> > @@ -0,0 +1,89 @@
> > +// SPDX-License-Identifier: MIT
> > +/*
> > + * Copyright © 2026 Intel Corporation
> > + */
> > +
> > +#include "xe_assert.h"
> > +#include "xe_printk.h"
> > +#include "xe_ras.h"
> > +#include "xe_ras_types.h"
> > +#include "xe_sysctrl.h"
> > +#include "xe_sysctrl_event_types.h"
> > +
> > +/* Severity of detected errors */
> > +enum xe_ras_severity {
> > +	XE_RAS_SEV_NOT_SUPPORTED = 0,
> > +	XE_RAS_SEV_CORRECTABLE,
> > +	XE_RAS_SEV_UNCORRECTABLE,
> > +	XE_RAS_SEV_INFORMATIONAL,
> > +	XE_RAS_SEV_MAX
> > +};
> > +
> > +/* Major IP blocks/components where errors can originate */
> > +enum xe_ras_component {
> > +	XE_RAS_COMP_NOT_SUPPORTED = 0,
> > +	XE_RAS_COMP_DEVICE_MEMORY,
> > +	XE_RAS_COMP_CORE_COMPUTE,
> > +	XE_RAS_COMP_RESERVED,
> > +	XE_RAS_COMP_PCIE,
> > +	XE_RAS_COMP_FABRIC,
> > +	XE_RAS_COMP_SOC_INTERNAL,
> > +	XE_RAS_COMP_MAX
> > +};
> > +
> > +static const char *const xe_ras_severities[] = {
> > +	[XE_RAS_SEV_NOT_SUPPORTED]		= "Not Supported",
> > +	[XE_RAS_SEV_CORRECTABLE]		= "Correctable",
> > +	[XE_RAS_SEV_UNCORRECTABLE]		= "Uncorrectable",
> > +	[XE_RAS_SEV_INFORMATIONAL]		= "Informational",
> > +};
> > +static_assert(ARRAY_SIZE(xe_ras_severities) == XE_RAS_SEV_MAX);
> > +
> > +static const char *const xe_ras_components[] = {
> > +	[XE_RAS_COMP_NOT_SUPPORTED]		= "Not Supported",
> > +	[XE_RAS_COMP_DEVICE_MEMORY]		= "Device Memory",
> > +	[XE_RAS_COMP_CORE_COMPUTE]		= "Core Compute",
> > +	[XE_RAS_COMP_RESERVED]			= "Reserved",
> > +	[XE_RAS_COMP_PCIE]			= "PCIe",
> > +	[XE_RAS_COMP_FABRIC]			= "Fabric",
> > +	[XE_RAS_COMP_SOC_INTERNAL]		= "SoC Internal",
> > +};
> > +static_assert(ARRAY_SIZE(xe_ras_components) == XE_RAS_COMP_MAX);
> > +
> > +static inline const char *sev_to_str(struct xe_device *xe, u32 sev)
> > +{
> > +	xe_assert(xe, sev < XE_RAS_SEV_MAX);
> > +
> > +	return sev < XE_RAS_SEV_MAX ? xe_ras_severities[sev] : "Unknown";
> > +}
> > +
> > +static inline const char *comp_to_str(struct xe_device *xe, u32 comp)
> > +{
> > +	xe_assert(xe, comp < XE_RAS_COMP_MAX);
> > +
> > +	return comp < XE_RAS_COMP_MAX ? xe_ras_components[comp] : "Unknown";
> > +}
> > +
> > +void xe_ras_event_log(struct xe_sysctrl *sc, struct xe_sysctrl_event_response *response)
> > +{
> > +	struct xe_ras_event_threshold_crossed *pending = (void *)&response->data;
> > +	struct xe_ras_error *errors = pending->counters;
> > +	struct xe_device *xe = sc_to_xe(sc);
> > +	u32 cid, sev, comp, inst, cause;
> > +	u8 tile;
> > +
> > +	xe_assert(xe, pending->ncounters < XE_RAS_NUM_COUNTERS);
> > +	xe_warn(xe, "[RAS]: threshold crossed, %u new errors\n", pending->ncounters);
> > +
> > +	for (cid = 0; cid < pending->ncounters && cid < XE_RAS_NUM_COUNTERS; cid++) {
> > +		sev = errors[cid].common.severity;
> > +		comp = errors[cid].common.component;
> > +
> > +		tile = errors[cid].product.unit.tile;
> > +		inst = errors[cid].product.unit.instance;
> > +		cause = errors[cid].product.cause.cause;
> > +
> > +		xe_warn(xe, "[RAS]: Tile:%u Instance:%u Component:%s Error:%s Cause:%#x\n",
> > +			tile, inst, comp_to_str(xe, comp), sev_to_str(xe, sev), cause);
> 
> We can have minimal logging here with only severity and component, and add
> additional logging in following patches.

I plan to remove it entirely once we have CPER in place, but until that
happens (and given the fragility of things) it'll be good for debugging.
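
For anyone following along, the log line depends on a bounds-checked table
lookup for each decoded field. Below is a minimal userspace sketch of that
pattern for the severity field only. The ras_ names are hypothetical
stand-ins rather than the driver's API, and the kernel-side xe_assert() is
left out, so treat it as an illustration under those assumptions.

```c
#include <stddef.h>

/* Hypothetical userspace mirror of the severity values in the patch. */
enum ras_severity {
	RAS_SEV_NOT_SUPPORTED = 0,
	RAS_SEV_CORRECTABLE,
	RAS_SEV_UNCORRECTABLE,
	RAS_SEV_INFORMATIONAL,
	RAS_SEV_MAX
};

/* Designated initializers keep each string tied to its enum value. */
static const char *const ras_severities[] = {
	[RAS_SEV_NOT_SUPPORTED]	= "Not Supported",
	[RAS_SEV_CORRECTABLE]	= "Correctable",
	[RAS_SEV_UNCORRECTABLE]	= "Uncorrectable",
	[RAS_SEV_INFORMATIONAL]	= "Informational",
};

/* Fail the build if a new enum value is added without a string. */
_Static_assert(sizeof(ras_severities) / sizeof(ras_severities[0]) == RAS_SEV_MAX,
	       "severity table must cover every enum value");

/* Out-of-range values from the device fall back to "Unknown". */
static const char *ras_sev_to_str(unsigned int sev)
{
	return sev < RAS_SEV_MAX ? ras_severities[sev] : "Unknown";
}
```

With this shape, ras_sev_to_str(RAS_SEV_CORRECTABLE) yields "Correctable",
and any value at or beyond RAS_SEV_MAX yields "Unknown" instead of reading
past the end of the table.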

> > +	}
> > +}
> > diff --git a/drivers/gpu/drm/xe/xe_ras.h b/drivers/gpu/drm/xe/xe_ras.h
> > new file mode 100644
> > index 000000000000..22f035fa498d
> > --- /dev/null
> > +++ b/drivers/gpu/drm/xe/xe_ras.h
> > @@ -0,0 +1,14 @@
> > +/* SPDX-License-Identifier: MIT */
> > +/*
> > + * Copyright © 2026 Intel Corporation
> > + */
> > +
> > +#ifndef _XE_RAS_H_
> > +#define _XE_RAS_H_
> > +
> > +struct xe_sysctrl;
> > +struct xe_sysctrl_event_response;
> > +
> > +void xe_ras_event_log(struct xe_sysctrl *sc, struct xe_sysctrl_event_response *response);
> > +
> > +#endif
> > diff --git a/drivers/gpu/drm/xe/xe_ras_types.h b/drivers/gpu/drm/xe/xe_ras_types.h
> > new file mode 100644
> > index 000000000000..2982c4696b6d
> > --- /dev/null
> > +++ b/drivers/gpu/drm/xe/xe_ras_types.h
> > @@ -0,0 +1,73 @@
> > +/* SPDX-License-Identifier: MIT */
> > +/*
> > + * Copyright © 2026 Intel Corporation
> > + */
> > +
> > +#ifndef _XE_RAS_TYPES_H_
> > +#define _XE_RAS_TYPES_H_
> > +
> > +#include <linux/types.h>
> > +
> > +#define XE_RAS_NUM_COUNTERS			16
> > +
> > +/**
> > + * struct xe_ras_error_common - Error fields that are common across all products
> > + */
> > +struct xe_ras_error_common {
> > +	/** @severity: Error severity */
> > +	u8 severity;
> > +	/** @component: IP block where error originated */
> > +	u8 component;
> > +} __packed;
> > +
> > +/**
> > + * struct xe_ras_error_unit - Error unit information
> > + */
> > +struct xe_ras_error_unit {
> > +	/** @tile: Tile identifier */
> > +	u8 tile;
> > +	/** @instance: Instance identifier specific to IP */
> > +	u32 instance;
> > +} __packed;
> > +
> > +/**
> > + * struct xe_ras_error_cause - Error cause information
> > + */
> > +struct xe_ras_error_cause {
> > +	/** @cause: Cause/checker */
> > +	u32 cause;
> > +	/** @reserved: For future use */
> > +	u8 reserved;
> > +} __packed;
> > +
> > +/**
> > + * struct xe_ras_error_product - Error fields that are specific to the product
> > + */
> > +struct xe_ras_error_product {
> > +	/** @unit: Unit within IP block */
> > +	struct xe_ras_error_unit unit;
> > +	/** @cause: Cause/checker */
> > +	struct xe_ras_error_cause cause;
> > +} __packed;
> > +
> > +/**
> > + * struct xe_ras_error - Combines common and product-specific parts
> > + */
> > +struct xe_ras_error {
> 
> error_class ?

I know that's how it is in the spec, but the spec is full of needless
verbiage that doesn't add much to the meaning. I'm fine either way.

Raag

> > +	/** @common: Common error type and component */
> > +	struct xe_ras_error_common common;
> > +	/** @product: Product-specific unit and cause */
> > +	struct xe_ras_error_product product;
> > +} __packed;
> > +
> > +/**
> > + * struct xe_ras_event_threshold_crossed - Event data for threshold crossed event
> > + */
> > +struct xe_ras_event_threshold_crossed {
> > +	/** @ncounters: Number of error counters that crossed thresholds */
> > +	u32 ncounters;
> > +	/** @counters: Array of error counters that crossed threshold */
> > +	struct xe_ras_error counters[XE_RAS_NUM_COUNTERS];
> > +} __packed;
> > +
> > +#endif
> > diff --git a/drivers/gpu/drm/xe/xe_sysctrl_event.c b/drivers/gpu/drm/xe/xe_sysctrl_event.c
> > index 47afca586bd1..1833ecadd9a1 100644
> > --- a/drivers/gpu/drm/xe/xe_sysctrl_event.c
> > +++ b/drivers/gpu/drm/xe/xe_sysctrl_event.c
> > @@ -6,6 +6,7 @@
> >   #include "xe_device.h"
> >   #include "xe_irq.h"
> >   #include "xe_printk.h"
> > +#include "xe_ras.h"
> >   #include "xe_sysctrl.h"
> >   #include "xe_sysctrl_event_types.h"
> >   #include "xe_sysctrl_mailbox.h"
> > @@ -38,7 +39,7 @@ static void xe_sysctrl_get_pending_event(struct xe_sysctrl *sc,
> >   		}
> >   		if (response.event == XE_SYSCTRL_EVENT_THRESHOLD_CROSSED) {
> > -			xe_warn(xe, "[RAS]: error counter threshold crossed\n");
> > +			xe_ras_event_log(sc, &response);
> >   		} else {
> >   			xe_err(xe, "sysctrl: unexpected event %#x\n", response.event);
> >   			return;
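
The packed structures in xe_ras_types.h describe a byte-exact payload, so
their sizes follow directly from the field widths. The sketch below is a
userspace approximation that checks those sizes at compile time and clamps
the device-reported count before it is used as a loop bound. It assumes a
GCC or Clang style packed attribute in place of the kernel's __packed, and
the ras_ names (including ras_valid_counters()) are hypothetical, not part
of the patch.

```c
#include <stddef.h>
#include <stdint.h>

#define RAS_NUM_COUNTERS 16

/* Userspace stand-ins for the packed layouts in xe_ras_types.h. */
struct ras_error_common {
	uint8_t severity;
	uint8_t component;
} __attribute__((packed));

struct ras_error_unit {
	uint8_t tile;
	uint32_t instance;	/* sits at offset 1 because of packing */
} __attribute__((packed));

struct ras_error_cause {
	uint32_t cause;
	uint8_t reserved;
} __attribute__((packed));

struct ras_error_product {
	struct ras_error_unit unit;
	struct ras_error_cause cause;
} __attribute__((packed));

struct ras_error {
	struct ras_error_common common;
	struct ras_error_product product;
} __attribute__((packed));

struct ras_event_threshold_crossed {
	uint32_t ncounters;
	struct ras_error counters[RAS_NUM_COUNTERS];
} __attribute__((packed));

/* 2 bytes common + 5 bytes unit + 5 bytes cause = 12 bytes per record. */
_Static_assert(sizeof(struct ras_error) == 12, "unexpected record size");
_Static_assert(sizeof(struct ras_event_threshold_crossed) ==
	       4 + RAS_NUM_COUNTERS * 12, "unexpected event size");

/* Never trust the reported count beyond the size of the array. */
static uint32_t ras_valid_counters(const struct ras_event_threshold_crossed *ev)
{
	return ev->ncounters < RAS_NUM_COUNTERS ? ev->ncounters : RAS_NUM_COUNTERS;
}
```

Clamping once up front serves the same purpose as the double condition in
the patch's for loop, and the static assertions catch accidental padding if
a field is added or reordered.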

Thread overview: 19+ messages
2026-03-12  9:06 [PATCH v3 0/4] Introduce Xe Correctable Error Handling Raag Jadav
2026-03-12  9:06 ` [PATCH v3 1/4] drm/xe/sysctrl: Add System Controller Raag Jadav
2026-03-12  9:06 ` [PATCH v3 2/4] drm/xe/sysctrl: Add system controller interrupt handler Raag Jadav
2026-03-17  5:47   ` Mallesh, Koujalagi
2026-03-23 11:32     ` Raag Jadav
2026-03-12  9:06 ` [PATCH v3 3/4] drm/xe/sysctrl: Add system controller event support Raag Jadav
2026-03-19 14:09   ` Tauro, Riana
2026-03-23 11:40     ` Raag Jadav
2026-03-23 12:27       ` Mallesh, Koujalagi
2026-03-24  4:42         ` Raag Jadav
2026-03-12  9:06 ` [PATCH v3 4/4] drm/xe/ras: Introduce correctable error handling Raag Jadav
2026-03-19 14:00   ` Tauro, Riana
2026-03-23 11:45     ` Raag Jadav [this message]
2026-03-23 14:46       ` Mallesh, Koujalagi
2026-03-24  4:43         ` Raag Jadav
2026-03-12 10:27 ` ✗ CI.checkpatch: warning for Introduce Xe Correctable Error Handling (rev3) Patchwork
2026-03-12 10:28 ` ✓ CI.KUnit: success " Patchwork
2026-03-12 11:15 ` ✓ Xe.CI.BAT: " Patchwork
2026-03-13  6:22 ` ✓ Xe.CI.FULL: " Patchwork
