From: "Tauro, Riana" <riana.tauro@intel.com>
To: Raag Jadav <raag.jadav@intel.com>, <intel-xe@lists.freedesktop.org>
Cc: <matthew.brost@intel.com>, <rodrigo.vivi@intel.com>,
<michal.wajdeczko@intel.com>, <matthew.d.roper@intel.com>,
<umesh.nerlige.ramappa@intel.com>, <mallesh.koujalagi@intel.com>,
<soham.purkait@intel.com>, <anoop.c.vijay@intel.com>,
<aravind.iddamsetty@linux.intel.com>
Subject: Re: [PATCH v6 3/3] drm/xe/ras: Introduce correctable error handling
Date: Mon, 27 Apr 2026 20:33:03 +0530
Message-ID: <aa7d04a9-1ced-439d-9135-b95531d61add@intel.com>
In-Reply-To: <20260410102744.427150-4-raag.jadav@intel.com>
On 4/10/2026 3:57 PM, Raag Jadav wrote:
> Add initial support for correctable error handling which is serviced
> using system controller event. Currently we only log the errors in
> dmesg but this serves as a foundation for RAS infrastructure and will
> be further extended to facilitate other RAS features.
>
> Signed-off-by: Raag Jadav <raag.jadav@intel.com>
LGTM
Reviewed-by: Riana Tauro <riana.tauro@intel.com>
> ---
> v4: Fix Severity/Component logging (Mallesh)
> s/xe_ras_error/xe_ras_error_class (Riana)
> v5: Handle unexpected counter threshold crossed (Mallesh)
> v6: Drop unused xe_device parameter (Mallesh)
> Fix unexpected counter threshold logic (Mallesh)
> Use xe_device parameter for xe_ras functions (Riana)
> Shorten dmesg logging (Riana)
> s/xe_ras_threshold_crossed_data/xe_ras_threshold_crossed (Riana)
> ---
> drivers/gpu/drm/xe/Makefile | 1 +
> drivers/gpu/drm/xe/xe_ras.c | 92 +++++++++++++++++++++++++++
> drivers/gpu/drm/xe/xe_ras.h | 15 +++++
> drivers/gpu/drm/xe/xe_ras_types.h | 73 +++++++++++++++++++++
> drivers/gpu/drm/xe/xe_sysctrl_event.c | 3 +-
> 5 files changed, 183 insertions(+), 1 deletion(-)
> create mode 100644 drivers/gpu/drm/xe/xe_ras.c
> create mode 100644 drivers/gpu/drm/xe/xe_ras.h
> create mode 100644 drivers/gpu/drm/xe/xe_ras_types.h
>
> diff --git a/drivers/gpu/drm/xe/Makefile b/drivers/gpu/drm/xe/Makefile
> index 9e6689c86797..0e6e91a6063c 100644
> --- a/drivers/gpu/drm/xe/Makefile
> +++ b/drivers/gpu/drm/xe/Makefile
> @@ -113,6 +113,7 @@ xe-y += xe_bb.o \
> xe_pxp_submit.o \
> xe_query.o \
> xe_range_fence.o \
> + xe_ras.o \
> xe_reg_sr.o \
> xe_reg_whitelist.o \
> xe_ring_ops.o \
> diff --git a/drivers/gpu/drm/xe/xe_ras.c b/drivers/gpu/drm/xe/xe_ras.c
> new file mode 100644
> index 000000000000..08e91348c459
> --- /dev/null
> +++ b/drivers/gpu/drm/xe/xe_ras.c
> @@ -0,0 +1,92 @@
> +// SPDX-License-Identifier: MIT
> +/*
> + * Copyright © 2026 Intel Corporation
> + */
> +
> +#include "xe_printk.h"
> +#include "xe_ras.h"
> +#include "xe_ras_types.h"
> +#include "xe_sysctrl.h"
> +#include "xe_sysctrl_event_types.h"
> +
> +/* Severity of detected errors */
> +enum xe_ras_severity {
> + XE_RAS_SEV_NOT_SUPPORTED = 0,
> + XE_RAS_SEV_CORRECTABLE,
> + XE_RAS_SEV_UNCORRECTABLE,
> + XE_RAS_SEV_INFORMATIONAL,
> + XE_RAS_SEV_MAX
> +};
> +
> +/* Major IP blocks/components where errors can originate */
> +enum xe_ras_component {
> + XE_RAS_COMP_NOT_SUPPORTED = 0,
> + XE_RAS_COMP_DEVICE_MEMORY,
> + XE_RAS_COMP_CORE_COMPUTE,
> + XE_RAS_COMP_RESERVED,
> + XE_RAS_COMP_PCIE,
> + XE_RAS_COMP_FABRIC,
> + XE_RAS_COMP_SOC_INTERNAL,
> + XE_RAS_COMP_MAX
> +};
> +
> +static const char *const xe_ras_severities[] = {
> + [XE_RAS_SEV_NOT_SUPPORTED] = "Not Supported",
> + [XE_RAS_SEV_CORRECTABLE] = "Correctable Error",
> + [XE_RAS_SEV_UNCORRECTABLE] = "Uncorrectable Error",
> + [XE_RAS_SEV_INFORMATIONAL] = "Informational Error",
> +};
> +static_assert(ARRAY_SIZE(xe_ras_severities) == XE_RAS_SEV_MAX);
> +
> +static const char *const xe_ras_components[] = {
> + [XE_RAS_COMP_NOT_SUPPORTED] = "Not Supported",
> + [XE_RAS_COMP_DEVICE_MEMORY] = "Device Memory",
> + [XE_RAS_COMP_CORE_COMPUTE] = "Core Compute",
> + [XE_RAS_COMP_RESERVED] = "Reserved",
> + [XE_RAS_COMP_PCIE] = "PCIe",
> + [XE_RAS_COMP_FABRIC] = "Fabric",
> + [XE_RAS_COMP_SOC_INTERNAL] = "SoC Internal",
> +};
> +static_assert(ARRAY_SIZE(xe_ras_components) == XE_RAS_COMP_MAX);
> +
> +static inline const char *sev_to_str(u8 sev)
> +{
> + if (sev >= XE_RAS_SEV_MAX)
> + sev = XE_RAS_SEV_NOT_SUPPORTED;
> +
> + return xe_ras_severities[sev];
> +}
> +
> +static inline const char *comp_to_str(u8 comp)
> +{
> + if (comp >= XE_RAS_COMP_MAX)
> + comp = XE_RAS_COMP_NOT_SUPPORTED;
> +
> + return xe_ras_components[comp];
> +}
> +
> +void xe_ras_counter_threshold_crossed(struct xe_device *xe,
> + struct xe_sysctrl_event_response *response)
> +{
> + struct xe_ras_threshold_crossed *pending = (void *)&response->data;
> + struct xe_ras_error_class *errors = pending->counters;
> + u32 counter_id, ncounters = pending->ncounters;
> +
> + if (!ncounters || ncounters > XE_RAS_NUM_COUNTERS) {
> + xe_err(xe, "sysctrl: unexpected counter threshold crossed %u\n", ncounters);
> + return;
> + }
> +
> + BUILD_BUG_ON(sizeof(response->data) < sizeof(*pending));
> + xe_warn(xe, "[RAS]: counter threshold crossed, %u new errors\n", ncounters);
> +
> + for (counter_id = 0; counter_id < ncounters; counter_id++) {
> + u8 severity, component;
> +
> + severity = errors[counter_id].common.severity;
> + component = errors[counter_id].common.component;
> +
> + xe_warn(xe, "[RAS]: %s %s detected\n",
> + comp_to_str(component), sev_to_str(severity));
> + }
> +}
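For anyone following the thread: the helpers in this hunk clamp an out-of-range value to the "Not Supported" entry before indexing the string table, and the handler rejects a zero or oversized count before walking the array. A minimal user-space sketch of that pattern follows. The names (`sev`, `sev_names`, `count_is_valid`, `NUM_COUNTERS`) are hypothetical stand-ins that only mirror the patch; this is not the kernel code.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical mirror of XE_RAS_NUM_COUNTERS from the patch. */
#define NUM_COUNTERS 16

/* Hypothetical mirror of enum xe_ras_severity. */
enum sev {
	SEV_NOT_SUPPORTED = 0,
	SEV_CORRECTABLE,
	SEV_UNCORRECTABLE,
	SEV_INFORMATIONAL,
	SEV_MAX
};

static const char *const sev_names[] = {
	[SEV_NOT_SUPPORTED] = "Not Supported",
	[SEV_CORRECTABLE]   = "Correctable Error",
	[SEV_UNCORRECTABLE] = "Uncorrectable Error",
	[SEV_INFORMATIONAL] = "Informational Error",
};

/* Fails the build if an enum value is added without a matching string. */
static_assert(sizeof(sev_names) / sizeof(sev_names[0]) == SEV_MAX,
	      "string table must cover every severity");

/* Clamp untrusted input to a known-good index before the lookup. */
static const char *sev_to_str(uint8_t sev)
{
	if (sev >= SEV_MAX)
		sev = SEV_NOT_SUPPORTED;

	return sev_names[sev];
}

/* Accept only counts in 1..NUM_COUNTERS, as the handler does. */
static int count_is_valid(uint32_t ncounters)
{
	return ncounters && ncounters <= NUM_COUNTERS;
}
```

The build-time assert is what keeps the clamp sufficient: as long as the table has exactly `SEV_MAX` entries, every index that survives the range check has a string behind it.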
> diff --git a/drivers/gpu/drm/xe/xe_ras.h b/drivers/gpu/drm/xe/xe_ras.h
> new file mode 100644
> index 000000000000..ea90593b62dc
> --- /dev/null
> +++ b/drivers/gpu/drm/xe/xe_ras.h
> @@ -0,0 +1,15 @@
> +/* SPDX-License-Identifier: MIT */
> +/*
> + * Copyright © 2026 Intel Corporation
> + */
> +
> +#ifndef _XE_RAS_H_
> +#define _XE_RAS_H_
> +
> +struct xe_device;
> +struct xe_sysctrl_event_response;
> +
> +void xe_ras_counter_threshold_crossed(struct xe_device *xe,
> + struct xe_sysctrl_event_response *response);
> +
> +#endif
> diff --git a/drivers/gpu/drm/xe/xe_ras_types.h b/drivers/gpu/drm/xe/xe_ras_types.h
> new file mode 100644
> index 000000000000..4e63c67f806a
> --- /dev/null
> +++ b/drivers/gpu/drm/xe/xe_ras_types.h
> @@ -0,0 +1,73 @@
> +/* SPDX-License-Identifier: MIT */
> +/*
> + * Copyright © 2026 Intel Corporation
> + */
> +
> +#ifndef _XE_RAS_TYPES_H_
> +#define _XE_RAS_TYPES_H_
> +
> +#include <linux/types.h>
> +
> +#define XE_RAS_NUM_COUNTERS 16
> +
> +/**
> + * struct xe_ras_error_common - Error fields that are common across all products
> + */
> +struct xe_ras_error_common {
> + /** @severity: Error severity */
> + u8 severity;
> + /** @component: IP block where error originated */
> + u8 component;
> +} __packed;
> +
> +/**
> + * struct xe_ras_error_unit - Error unit information
> + */
> +struct xe_ras_error_unit {
> + /** @tile: Tile identifier */
> + u8 tile;
> + /** @instance: Instance identifier specific to IP */
> + u32 instance;
> +} __packed;
> +
> +/**
> + * struct xe_ras_error_cause - Error cause information
> + */
> +struct xe_ras_error_cause {
> + /** @cause: Cause/checker */
> + u32 cause;
> + /** @reserved: For future use */
> + u8 reserved;
> +} __packed;
> +
> +/**
> + * struct xe_ras_error_product - Error fields that are specific to the product
> + */
> +struct xe_ras_error_product {
> + /** @unit: Unit within IP block */
> + struct xe_ras_error_unit unit;
> + /** @cause: Cause/checker */
> + struct xe_ras_error_cause cause;
> +} __packed;
> +
> +/**
> + * struct xe_ras_error_class - Combines common and product-specific parts
> + */
> +struct xe_ras_error_class {
> + /** @common: Common error type and component */
> + struct xe_ras_error_common common;
> + /** @product: Product-specific unit and cause */
> + struct xe_ras_error_product product;
> +} __packed;
> +
> +/**
> + * struct xe_ras_threshold_crossed - Data for threshold crossed event
> + */
> +struct xe_ras_threshold_crossed {
> + /** @ncounters: Number of error counters that crossed thresholds */
> + u32 ncounters;
> + /** @counters: Array of error counters that crossed threshold */
> + struct xe_ras_error_class counters[XE_RAS_NUM_COUNTERS];
> +} __packed;
> +
> +#endif
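As a sanity check on the layout: with every struct packed, one error entry is 2 + (1 + 4) + (4 + 1) = 12 bytes, so the whole payload is 4 + 16 * 12 = 196 bytes. That is the size the BUILD_BUG_ON() in xe_ras.c compares against sizeof(response->data), which comes from patch 2/3 and is not shown here. A user-space sketch of the arithmetic, assuming GCC or Clang and using hypothetical struct names that mirror the ones above:

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define NUM_COUNTERS 16

/* The kernel's __packed expands to this GCC/Clang attribute. */
#define PACKED __attribute__((packed))

/* Hypothetical mirrors of the structs in xe_ras_types.h. */
struct error_common {
	uint8_t severity;
	uint8_t component;
} PACKED;

struct error_unit {
	uint8_t tile;
	uint32_t instance;
} PACKED;

struct error_cause {
	uint32_t cause;
	uint8_t reserved;
} PACKED;

struct error_product {
	struct error_unit unit;
	struct error_cause cause;
} PACKED;

struct error_class {
	struct error_common common;
	struct error_product product;
} PACKED;

struct threshold_crossed {
	uint32_t ncounters;
	struct error_class counters[NUM_COUNTERS];
} PACKED;

/* Packing removes padding, so sizes are the plain sum of the members. */
static_assert(sizeof(struct error_common) == 2, "common is 2 bytes");
static_assert(sizeof(struct error_product) == 10, "unit (5) + cause (5)");
static_assert(sizeof(struct error_class) == 12, "common (2) + product (10)");
static_assert(sizeof(struct threshold_crossed) == 196, "4 + 16 * 12");

/* The counter array starts right after the 4-byte count. */
static_assert(offsetof(struct threshold_crossed, counters) == 4,
	      "no padding after ncounters");
```

Without the packed attribute a typical ABI would pad `error_unit` and `error_cause` to 8 bytes each, so these asserts would no longer hold.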
> diff --git a/drivers/gpu/drm/xe/xe_sysctrl_event.c b/drivers/gpu/drm/xe/xe_sysctrl_event.c
> index 3edde46a9711..c6ea32f3471e 100644
> --- a/drivers/gpu/drm/xe/xe_sysctrl_event.c
> +++ b/drivers/gpu/drm/xe/xe_sysctrl_event.c
> @@ -6,6 +6,7 @@
> #include "xe_device.h"
> #include "xe_irq.h"
> #include "xe_printk.h"
> +#include "xe_ras.h"
> #include "xe_sysctrl.h"
> #include "xe_sysctrl_event_types.h"
> #include "xe_sysctrl_mailbox.h"
> @@ -34,7 +35,7 @@ static void get_pending_event(struct xe_sysctrl *sc, struct xe_sysctrl_mailbox_c
> }
>
> if (response->event == XE_SYSCTRL_EVENT_THRESHOLD_CROSSED)
> - xe_warn(xe, "[RAS]: counter threshold crossed\n");
> + xe_ras_counter_threshold_crossed(xe, response);
> else
> xe_err(xe, "sysctrl: unexpected event %#x\n", response->event);
>