From: Michal Wajdeczko <michal.wajdeczko@intel.com>
To: Mallesh Koujalagi <mallesh.koujalagi@intel.com>,
<intel-xe@lists.freedesktop.org>, <rodrigo.vivi@intel.com>,
<matthew.brost@intel.com>, <thomas.hellstrom@linux.intel.com>
Cc: <anshuman.gupta@intel.com>, <badal.nilawar@intel.com>,
<vinay.belgaumkar@intel.com>, <aravind.iddamsetty@intel.com>,
<riana.tauro@intel.com>, <karthik.poosa@intel.com>,
<sk.anirban@intel.com>, <raag.jadav@intel.com>
Subject: Re: [PATCH 1/2] drm/xe: add structured RAS error logging infrastructure
Date: Fri, 12 Jun 2026 00:20:00 +0200 [thread overview]
Message-ID: <33423af2-7e8d-4682-a3ca-ff99bad28979@intel.com> (raw)
In-Reply-To: <20260611091224.17275-5-mallesh.koujalagi@intel.com>
On 6/11/2026 11:12 AM, Mallesh Koujalagi wrote:
> Introduce SIG_ID defining Signature IDs (SIG IDs) for
> categorising XE GPU errors, and __xe_ras_log() to emit
> structured log lines with SIG ID, CPER severity, tile/GT
> location, errno, and a free-form message.
>
> Signed-off-by: Mallesh Koujalagi <mallesh.koujalagi@intel.com>
> ---
> drivers/gpu/drm/xe/Makefile | 1 +
> drivers/gpu/drm/xe/xe_ras_log.c | 64 ++++++++++++++++++++
> drivers/gpu/drm/xe/xe_sig_ids.h | 101 ++++++++++++++++++++++++++++++++
> 3 files changed, 166 insertions(+)
> create mode 100644 drivers/gpu/drm/xe/xe_ras_log.c
> create mode 100644 drivers/gpu/drm/xe/xe_sig_ids.h
>
> diff --git a/drivers/gpu/drm/xe/Makefile b/drivers/gpu/drm/xe/Makefile
> index 091872771e98..412be4bad952 100644
> --- a/drivers/gpu/drm/xe/Makefile
> +++ b/drivers/gpu/drm/xe/Makefile
> @@ -115,6 +115,7 @@ xe-y += xe_bb.o \
> xe_query.o \
> xe_range_fence.o \
> xe_ras.o \
> + xe_ras_log.o \
> xe_reg_sr.o \
> xe_reg_whitelist.o \
> xe_ring_ops.o \
> diff --git a/drivers/gpu/drm/xe/xe_ras_log.c b/drivers/gpu/drm/xe/xe_ras_log.c
> new file mode 100644
> index 000000000000..b6a2792b61d4
> --- /dev/null
> +++ b/drivers/gpu/drm/xe/xe_ras_log.c
> @@ -0,0 +1,64 @@
> +// SPDX-License-Identifier: MIT
> +/*
> + * Copyright © 2026 Intel Corporation
> + */
> +
> +#include <drm/drm_print.h>
> +
> +#include "xe_device.h"
> +#include "xe_gt.h"
> +#include "xe_sig_ids.h"
> +
> +/**
> + * __xe_ras_log - Emit a structured RAS log entry
> + * @xe: xe device instance
> + * @gt: GT instance where the error occurred, or NULL if device-wide
what about tile-only errors?
> + * @sig_id: signature ID from xe_sig_ids.h identifying the error class
> + * @cper_sev: CPER severity (one of CPER_SEV_FATAL, CPER_SEV_RECOVERABLE, etc.)
> + * @errno_val: negative errno describing the error condition
> + * @fmt: printf-style format string
> + * @...: format arguments
> + *
> + * Formats the message and emits a kernel log line via drm_err() for fatal
> + * events or drm_warn() for all others. CPER record generation and hex dump
> + * are planned as follow-ups.
> + *
> + * Format:
> + * [xe-err] SIG_ID = <id> Severity = <sev> Location = <loc> Errno = <n> Message = "<msg>"
is this some standard format or just our invention?
> + */
> +__printf(6, 7)
> +void __xe_ras_log(struct xe_device *xe, struct xe_gt *gt,
> + u16 sig_id, u32 cper_sev, int errno_val,
> + const char *fmt, ...)
> +{
> + char loc[32];
> + char msg[256];
> + va_list ap;
> + int ret;
> +
> + if (gt)
> + snprintf(loc, sizeof(loc), "tile = %u/gt = %u",
elsewhere in the xe driver we are referring to a tile as "Tile%u" and to a GT as "GT%u"
also, the "=" used here might mess with rest of the items that are also separated by "="
> + gt->tile->id, gt->info.id);
> + else
> + snprintf(loc, sizeof(loc), "device");
> +
> + va_start(ap, fmt);
> + ret = vsnprintf(msg, sizeof(msg), fmt, ap);
can't you just pass va to drm_err/warn below ?
> + va_end(ap);
> +
> + WARN_ON_ONCE(ret >= (int)sizeof(msg));
maybe xe_assert() ? but shouldn't be needed anyway
> +
> + if (cper_sev == CPER_SEV_FATAL)
> + drm_err(&xe->drm,
> + "[xe-err] SIG_ID = %u Severity = %s Location = %s Errno = %d Message = \"%s\"\n",
do we need this "[xe-err]" ? both 'xe' and 'err' will be there already added by drm_err()
[ ] xe 0000:00:02.0: [drm] *ERROR* ...
also passing a raw SIG_ID number is not user friendly ...
and severity is always present, so maybe just print it without any prefix
for location maybe we can use existing xe_gt_err/xe_tile_err/xe_err macros ?
and is errno really needed/useful? we already have severity and details will be in the msg
so maybe instead of seeing this:
[ ] xe 0000:00:02.0: [drm] *ERROR* [xe-err] SIG_ID = 123 Severity = fatal Location = device Errno = -5 Message = "blah blah"
we can try with messages like:
[ ] xe 0000:00:02.0: [drm] *ERROR* [fatal:123] blah blah (-EIO)
[ ] xe 0000:00:02.0: [drm] *ERROR* [recoverable:987] blah blah (-ETIME)
[ ] xe 0000:00:02.0: [drm] *ERROR* Tile0: GT1: [fatal:123] blah blah (-EIO)
[ ] xe 0000:00:02.0: [drm] *ERROR* Tile0: GT1: [recoverable:987] blah blah (-ETIME)
> + sig_id, xe_cper_severity_str(cper_sev), loc,
> + errno_val, msg);
> + else
> + drm_warn(&xe->drm,
> + "[xe-err] SIG_ID = %u Severity = %s Location = %s Errno = %d Message = \"%s\"\n",
> + sig_id, xe_cper_severity_str(cper_sev), loc,
> + errno_val, msg);
> +
> + /* TODO: Add CPER record driver handler */
> + /* TODO: Add RAS dump cper hex handler */
> +}
> diff --git a/drivers/gpu/drm/xe/xe_sig_ids.h b/drivers/gpu/drm/xe/xe_sig_ids.h
> new file mode 100644
> index 000000000000..8abe7554714e
> --- /dev/null
> +++ b/drivers/gpu/drm/xe/xe_sig_ids.h
maybe this should be "xe_ras_errors.h" ? plain "sig_ids" tells nothing
> @@ -0,0 +1,101 @@
> +/* SPDX-License-Identifier: MIT */
> +/*
> + * Copyright © 2026 Intel Corporation
> + */
> +
> +#ifndef _XE_SIG_IDS_H_
> +#define _XE_SIG_IDS_H_
> +
> +#include <linux/cper.h>
> +
> +/**
> + * xe_cper_severity_str - local severity-to-string helper
> + * @sev: CPER severity value (one of CPER_SEV_FATAL, CPER_SEV_RECOVERABLE, etc.)
> + *
> + * Avoids a link-time dependency on cper_severity_str() which is only
> + * compiled when CONFIG_UEFI_CPER=y.
or maybe we should just select UEFI_CPER in our Kconfig ?
duplicating whole function looks like a bad idea
> + *
> + * Return: string representation of the severity, or "unknown".
> + */
> +static inline const char *xe_cper_severity_str(u32 sev)
> +{
> + switch (sev) {
> + case CPER_SEV_RECOVERABLE: return "recoverable";
> + case CPER_SEV_FATAL: return "fatal";
> + case CPER_SEV_CORRECTED: return "corrected";
> + case CPER_SEV_INFORMATIONAL: return "informational";
> + default: return "unknown";
> + }
> +}
> +
> +/*
> + * Driver SIG_IDs
what's the criteria for each new SIG?
is it per component? or per category?
> + */
> +#define XE_SIG_PROBE 1 /* FATAL: probe failed */
> +#define XE_SIG_WEDGED 2 /* FATAL: device wedged */
> +#define XE_SIG_SURVIVABILITY 3 /* FATAL: survivability mode */
> +#define XE_SIG_FW 4 /* RECOVERABLE: GuC/HuC/UC/GSC/CSC/PCODE */
btw, shouldn't we use existing FW_BUG FW_WARN FW_INFO prefixes when reporting our FW errors?
also, are we sure all FW errors are recoverable?
> +#define XE_SIG_GT_TDR 5 /* RECOVERABLE: engine hang / reset */
> +#define XE_SIG_MEM_FAULT 6 /* RECOVERABLE: VM bind, page fault, GTT */
> +#define XE_SIG_IO_BUS 7 /* RECOVERABLE: runtime PCIe/IOMMU/MMIO */
are those IDs supposed to be stable or can be changed later?
> +
> +/*
> + * HW SIG_IDs
> + */
> +#define XE_SIG_HW_DEVICE_MEMORY 8
> +#define XE_SIG_HW_CORE_COMPUTE 9
> +#define XE_SIG_HW_SCALE_UP_LINK 10
> +#define XE_SIG_HW_PCIE 11
> +#define XE_SIG_HW_FABRIC 12
> +#define XE_SIG_HW_SOC_INTERNAL 13
hmm, those IDs look more like HW components names, not errors
> +
> +/* Must be updated when adding new driver SIG IDs */
> +#define XE_SIG_DRIVER_LAST XE_SIG_IO_BUS
> +#define XE_SIG_HW_FIRST XE_SIG_HW_DEVICE_MEMORY
if IDs must be stable than this will not work
> +
> +struct xe_device;
> +struct xe_gt;
> +
> +/*
> + * Common backend helper
> + */
> +__printf(6, 7)
> +void __xe_ras_log(struct xe_device *xe, struct xe_gt *gt,
> + u16 sig_id, u32 cper_sev, int errno_val,
> + const char *fmt, ...);
this is a wrong place to define function prototype,
it should be in xe_ras_log.h instead I guess
> +
> +/*
> + * Driver-facing reporting macros
> + */
> +
> +/* FATAL */
> +#define XE_RAS_PROBE(xe, errno, fmt, ...) \
> + __xe_ras_log((xe), NULL, XE_SIG_PROBE, CPER_SEV_FATAL, \
> + (errno), fmt, ##__VA_ARGS__)
> +
> +#define XE_RAS_WEDGED(xe, errno, fmt, ...) \
> + __xe_ras_log((xe), NULL, XE_SIG_WEDGED, CPER_SEV_FATAL, \
> + (errno), fmt, ##__VA_ARGS__)
> +
> +#define XE_RAS_SURVIVABILITY(xe, errno, fmt, ...) \
> + __xe_ras_log((xe), NULL, XE_SIG_SURVIVABILITY, CPER_SEV_FATAL, \
> + (errno), fmt, ##__VA_ARGS__)
> +
> +/* RECOVERABLE */
> +#define XE_RAS_FW(xe, gt, errno, fmt, ...) \
> + __xe_ras_log((xe), (gt), XE_SIG_FW, CPER_SEV_RECOVERABLE, \
> + (errno), fmt, ##__VA_ARGS__)
> +
> +#define XE_RAS_GT_TDR(xe, gt, errno, fmt, ...) \
> + __xe_ras_log((xe), (gt), XE_SIG_GT_TDR, CPER_SEV_RECOVERABLE, \
> + (errno), fmt, ##__VA_ARGS__)
> +
> +#define XE_RAS_MEM_FAULT(xe, gt, errno, fmt, ...) \
> + __xe_ras_log((xe), (gt), XE_SIG_MEM_FAULT, CPER_SEV_RECOVERABLE, \
> + (errno), fmt, ##__VA_ARGS__)
> +
> +#define XE_RAS_IO_BUS(xe, errno, fmt, ...) \
> + __xe_ras_log((xe), NULL, XE_SIG_IO_BUS, CPER_SEV_RECOVERABLE, \
> + (errno), fmt, ##__VA_ARGS__)
maybe better:
#define xe_ras_fatal(xe, sig, fmt, ...) ..
#define xe_ras_recoverable(xe, sig, fmt, ...) ..
#define xe_gt_ras_fatal(gt, sig, fmt, ...) ..
#define xe_gt_ras_recoverable(gt, sig, fmt, ...) ..
> +
> +#endif /* _XE_SIG_IDS_H_ */
next prev parent reply other threads:[~2026-06-11 22:20 UTC|newest]
Thread overview: 5+ messages / expand[flat|nested] mbox.gz Atom feed top
2026-06-11 9:12 [PATCH 0/2] drm/xe: Structured RAS error logging infrastructure Mallesh Koujalagi
2026-06-11 9:12 ` [PATCH 1/2] drm/xe: add structured " Mallesh Koujalagi
2026-06-11 22:20 ` Michal Wajdeczko [this message]
2026-06-15 8:51 ` Mallesh, Koujalagi
2026-06-11 9:12 ` [PATCH 2/2] drm/xe: use XE_RAS_WEDGED macro in xe_device_declare_wedged Mallesh Koujalagi
Reply instructions:
You may reply publicly to this message via plain-text email
using any one of the following methods:
* Save the following mbox file, import it into your mail client,
and reply-to-all from there: mbox
Avoid top-posting and favor interleaved quoting:
https://en.wikipedia.org/wiki/Posting_style#Interleaved_style
* Reply using the --to, --cc, and --in-reply-to
switches of git-send-email(1):
git send-email \
--in-reply-to=33423af2-7e8d-4682-a3ca-ff99bad28979@intel.com \
--to=michal.wajdeczko@intel.com \
--cc=anshuman.gupta@intel.com \
--cc=aravind.iddamsetty@intel.com \
--cc=badal.nilawar@intel.com \
--cc=intel-xe@lists.freedesktop.org \
--cc=karthik.poosa@intel.com \
--cc=mallesh.koujalagi@intel.com \
--cc=matthew.brost@intel.com \
--cc=raag.jadav@intel.com \
--cc=riana.tauro@intel.com \
--cc=rodrigo.vivi@intel.com \
--cc=sk.anirban@intel.com \
--cc=thomas.hellstrom@linux.intel.com \
--cc=vinay.belgaumkar@intel.com \
/path/to/YOUR_REPLY
https://kernel.org/pub/software/scm/git/docs/git-send-email.html
* If your mail client supports setting the In-Reply-To header
via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line
before the message body.
This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox