Re: [PATCH 1/2] drm/xe: add structured RAS error logging infrastructure

Intel-XE Archive on lore.kernel.org
 help / color / mirror / Atom feed

From: "Mallesh, Koujalagi" <mallesh.koujalagi@intel.com>
To: Michal Wajdeczko <michal.wajdeczko@intel.com>,
	<intel-xe@lists.freedesktop.org>, <rodrigo.vivi@intel.com>,
	<matthew.brost@intel.com>, <thomas.hellstrom@linux.intel.com>
Cc: <anshuman.gupta@intel.com>, <badal.nilawar@intel.com>,
	<vinay.belgaumkar@intel.com>, <aravind.iddamsetty@intel.com>,
	<riana.tauro@intel.com>, <karthik.poosa@intel.com>,
	<sk.anirban@intel.com>, <raag.jadav@intel.com>
Subject: Re: [PATCH 1/2] drm/xe: add structured RAS error logging infrastructure
Date: Mon, 15 Jun 2026 14:21:27 +0530	[thread overview]
Message-ID: <859a917b-a449-4e88-b9bd-e923a8de3504@intel.com> (raw)
In-Reply-To: <33423af2-7e8d-4682-a3ca-ff99bad28979@intel.com>


On 12-06-2026 03:50 am, Michal Wajdeczko wrote:
>
> On 6/11/2026 11:12 AM, Mallesh Koujalagi wrote:
>> Introduce SIG_ID defining Signature IDs (SIG IDs) for
>> categorising XE GPU errors, and __xe_ras_log() to emit
>> structured log lines with SIG ID, CPER severity, tile/GT
>> location, errno, and a free-form message.
>>
>> Signed-off-by: Mallesh Koujalagi <mallesh.koujalagi@intel.com>
>> ---
>>   drivers/gpu/drm/xe/Makefile     |   1 +
>>   drivers/gpu/drm/xe/xe_ras_log.c |  64 ++++++++++++++++++++
>>   drivers/gpu/drm/xe/xe_sig_ids.h | 101 ++++++++++++++++++++++++++++++++
>>   3 files changed, 166 insertions(+)
>>   create mode 100644 drivers/gpu/drm/xe/xe_ras_log.c
>>   create mode 100644 drivers/gpu/drm/xe/xe_sig_ids.h
>>
>> diff --git a/drivers/gpu/drm/xe/Makefile b/drivers/gpu/drm/xe/Makefile
>> index 091872771e98..412be4bad952 100644
>> --- a/drivers/gpu/drm/xe/Makefile
>> +++ b/drivers/gpu/drm/xe/Makefile
>> @@ -115,6 +115,7 @@ xe-y += xe_bb.o \
>>   	xe_query.o \
>>   	xe_range_fence.o \
>>   	xe_ras.o \
>> +	xe_ras_log.o \
>>   	xe_reg_sr.o \
>>   	xe_reg_whitelist.o \
>>   	xe_ring_ops.o \
>> diff --git a/drivers/gpu/drm/xe/xe_ras_log.c b/drivers/gpu/drm/xe/xe_ras_log.c
>> new file mode 100644
>> index 000000000000..b6a2792b61d4
>> --- /dev/null
>> +++ b/drivers/gpu/drm/xe/xe_ras_log.c
>> @@ -0,0 +1,64 @@
>> +// SPDX-License-Identifier: MIT
>> +/*
>> + * Copyright © 2026 Intel Corporation
>> + */
>> +
>> +#include <drm/drm_print.h>
>> +
>> +#include "xe_device.h"
>> +#include "xe_gt.h"
>> +#include "xe_sig_ids.h"
>> +
>> +/**
>> + * __xe_ras_log - Emit a structured RAS log entry
>> + * @xe: xe device instance
>> + * @gt: GT instance where the error occurred, or NULL if device-wide
> what about tile-only errors?

For tile-only errors, we still route through the corresponding gt, and 
from there we also record gt->tile->id.

So the error report includes both GT context and tile identification.

>
>> + * @sig_id: signature ID from xe_sig_ids.h identifying the error class
>> + * @cper_sev: CPER severity (one of CPER_SEV_FATAL, CPER_SEV_RECOVERABLE, etc.)
>> + * @errno_val: negative errno describing the error condition
>> + * @fmt: printf-style format string
>> + * @...: format arguments
>> + *
>> + * Formats the message and emits a kernel log line via drm_err() for fatal
>> + * events or drm_warn() for all others. CPER record generation and hex dump
>> + * are planned as follow-ups.
>> + *
>> + * Format:
>> + *   [xe-err] SIG_ID = <id> Severity = <sev> Location = <loc> Errno = <n> Message = "<msg>"
> is this some standard format or just our invention?
The format is based on specification mentioned by Arch team.
>
>> + */
>> +__printf(6, 7)
>> +void __xe_ras_log(struct xe_device *xe, struct xe_gt *gt,
>> +		  u16 sig_id, u32 cper_sev, int errno_val,
>> +		  const char *fmt, ...)
>> +{
>> +	char loc[32];
>> +	char msg[256];
>> +	va_list ap;
>> +	int ret;
>> +
>> +	if (gt)
>> +		snprintf(loc, sizeof(loc), "tile = %u/gt = %u",
> elsewhere in the xe driver we are referring to a tile as "Tile%u" and to a GT as "GT%u"
>
> also, the "=" used here might mess with rest of the items that are also separated by "="
Good catches on both points. Sure, I will update in next revision.
>
>> +			 gt->tile->id, gt->info.id);
>> +	else
>> +		snprintf(loc, sizeof(loc), "device");
>> +
>> +	va_start(ap, fmt);
>> +	ret = vsnprintf(msg, sizeof(msg), fmt, ap);
> can't you just pass va to drm_err/warn below ?
drm_err/warn have no va accepting variants in the DRM API.
>> +	va_end(ap);
>> +
>> +	WARN_ON_ONCE(ret >= (int)sizeof(msg));
> maybe xe_assert() ? but shouldn't be needed anyway
Ok.
>> +
>> +	if (cper_sev == CPER_SEV_FATAL)
>> +		drm_err(&xe->drm,
>> +			"[xe-err] SIG_ID = %u Severity = %s Location = %s Errno = %d Message = \"%s\"\n",
> do we need this "[xe-err]" ? both 'xe' and 'err' will be there already added by drm_err()
>
> 	[ ] xe 0000:00:02.0: [drm] *ERROR* ...
>
> also passing a raw SIG_ID number is not user friendly ...
>
> and severity is always present, so maybe just print it without any prefix
>
> for location maybe we can use existing xe_gt_err/xe_tile_err/xe_err macros ?
>
> and is errno really needed/useful? we already have severity and details will be in the msg
>
> so maybe instead of seeing this:
>
>    [ ] xe 0000:00:02.0: [drm] *ERROR* [xe-err] SIG_ID = 123 Severity = fatal Location = device Errno = -5 Message = "blah blah"
>
> we can try with messages like:
>
>    [ ] xe 0000:00:02.0: [drm] *ERROR* [fatal:123] blah blah (-EIO)
>    [ ] xe 0000:00:02.0: [drm] *ERROR* [recoverable:987] blah blah (-ETIME)
>
>    [ ] xe 0000:00:02.0: [drm] *ERROR* Tile0: GT1: [fatal:123] blah blah (-EIO)
>    [ ] xe 0000:00:02.0: [drm] *ERROR* Tile0: GT1: [recoverable:987] blah blah (-ETIME)
>
>
All are good points, I need to check with Arch team.
>> +			sig_id, xe_cper_severity_str(cper_sev), loc,
>> +			errno_val, msg);
>> +	else
>> +		drm_warn(&xe->drm,
>> +			 "[xe-err] SIG_ID = %u Severity = %s Location = %s Errno = %d Message = \"%s\"\n",
>> +			 sig_id, xe_cper_severity_str(cper_sev), loc,
>> +			 errno_val, msg);
>> +
>> +	/* TODO: Add CPER record driver handler */
>> +	/* TODO: Add RAS dump cper hex handler */
>> +}
>> diff --git a/drivers/gpu/drm/xe/xe_sig_ids.h b/drivers/gpu/drm/xe/xe_sig_ids.h
>> new file mode 100644
>> index 000000000000..8abe7554714e
>> --- /dev/null
>> +++ b/drivers/gpu/drm/xe/xe_sig_ids.h
> maybe this should be "xe_ras_errors.h" ? plain "sig_ids" tells nothing
Agreed!
>> @@ -0,0 +1,101 @@
>> +/* SPDX-License-Identifier: MIT */
>> +/*
>> + * Copyright © 2026 Intel Corporation
>> + */
>> +
>> +#ifndef _XE_SIG_IDS_H_
>> +#define _XE_SIG_IDS_H_
>> +
>> +#include <linux/cper.h>
>> +
>> +/**
>> + * xe_cper_severity_str - local severity-to-string helper
>> + * @sev: CPER severity value (one of CPER_SEV_FATAL, CPER_SEV_RECOVERABLE, etc.)
>> + *
>> + * Avoids a link-time dependency on cper_severity_str() which is only
>> + * compiled when CONFIG_UEFI_CPER=y.
> or maybe we should just select UEFI_CPER in our Kconfig ?
>
> duplicating whole function looks like a bad idea
Sure!!
>> + *
>> + * Return: string representation of the severity, or "unknown".
>> + */
>> +static inline const char *xe_cper_severity_str(u32 sev)
>> +{
>> +	switch (sev) {
>> +	case CPER_SEV_RECOVERABLE:	return "recoverable";
>> +	case CPER_SEV_FATAL:		return "fatal";
>> +	case CPER_SEV_CORRECTED:	return "corrected";
>> +	case CPER_SEV_INFORMATIONAL:	return "informational";
>> +	default:			return "unknown";
>> +	}
>> +}
>> +
>> +/*
>> + * Driver SIG_IDs
> what's the criteria for each new SIG?
> is it per component? or per category?

The criteria for each new SIG is that when the failure mode is distinct.

It's per error category. SIG_ID indicates the type of failure that occurred.

>> + */
>> +#define XE_SIG_PROBE			1  /* FATAL: probe failed */
>> +#define XE_SIG_WEDGED			2  /* FATAL: device wedged */
>> +#define XE_SIG_SURVIVABILITY		3  /* FATAL: survivability mode */
>> +#define XE_SIG_FW			4  /* RECOVERABLE: GuC/HuC/UC/GSC/CSC/PCODE */
> btw, shouldn't we use existing FW_BUG FW_WARN FW_INFO prefixes when reporting our FW errors?
>
> also, are we sure all FW errors are recoverable?

FW_BUG / FW_WARN / FW_INFO are orthogonal to SIG_IDs.

The former classify the firmware message itself, while SIG_IDs provide a 
consistent encoding for the failure signature.

So using a firmware-related SIG_ID does not replace the existing 
prefixes; it serves a different purpose.

>> +#define XE_SIG_GT_TDR			5  /* RECOVERABLE: engine hang / reset */
>> +#define XE_SIG_MEM_FAULT		6  /* RECOVERABLE: VM bind, page fault, GTT */
>> +#define XE_SIG_IO_BUS			7  /* RECOVERABLE: runtime PCIe/IOMMU/MMIO */
> are those IDs supposed to be stable or can be changed later?

The IDs are stable. We can continue adding new ones over time,

but existing IDs are not meant to change.

>
>> +
>> +/*
>> + * HW SIG_IDs
>> + */
>> +#define XE_SIG_HW_DEVICE_MEMORY		8
>> +#define XE_SIG_HW_CORE_COMPUTE		9
>> +#define XE_SIG_HW_SCALE_UP_LINK		10
>> +#define XE_SIG_HW_PCIE			11
>> +#define XE_SIG_HW_FABRIC		12
>> +#define XE_SIG_HW_SOC_INTERNAL		13
> hmm, those IDs look more like HW components names, not errors

Agreed, these read more like hardware block/component identifiers than 
failure signatures.

If SIG_ID is meant to describe the kind of failure signature around 
those hw block/component.

>> +
>> +/* Must be updated when adding new driver SIG IDs */
>> +#define XE_SIG_DRIVER_LAST		XE_SIG_IO_BUS
>> +#define XE_SIG_HW_FIRST			XE_SIG_HW_DEVICE_MEMORY
> if IDs must be stable than this will not work
Agreed! we need to update after adding new driver or HW SIG IDS.
>> +
>> +struct xe_device;
>> +struct xe_gt;
>> +
>> +/*
>> + * Common backend helper
>> + */
>> +__printf(6, 7)
>> +void __xe_ras_log(struct xe_device *xe, struct xe_gt *gt,
>> +		  u16 sig_id, u32 cper_sev, int errno_val,
>> +		  const char *fmt, ...);
> this is a wrong place to define function prototype,
> it should be in xe_ras_log.h instead I guess
Sure!
>> +
>> +/*
>> + * Driver-facing reporting macros
>> + */
>> +
>> +/* FATAL */
>> +#define XE_RAS_PROBE(xe, errno, fmt, ...) \
>> +	__xe_ras_log((xe), NULL, XE_SIG_PROBE, CPER_SEV_FATAL, \
>> +		     (errno), fmt, ##__VA_ARGS__)
>> +
>> +#define XE_RAS_WEDGED(xe, errno, fmt, ...) \
>> +	__xe_ras_log((xe), NULL, XE_SIG_WEDGED, CPER_SEV_FATAL, \
>> +		     (errno), fmt, ##__VA_ARGS__)
>> +
>> +#define XE_RAS_SURVIVABILITY(xe, errno, fmt, ...) \
>> +	__xe_ras_log((xe), NULL, XE_SIG_SURVIVABILITY, CPER_SEV_FATAL, \
>> +		     (errno), fmt, ##__VA_ARGS__)
>> +
>> +/* RECOVERABLE */
>> +#define XE_RAS_FW(xe, gt, errno, fmt, ...) \
>> +	__xe_ras_log((xe), (gt), XE_SIG_FW, CPER_SEV_RECOVERABLE, \
>> +		     (errno), fmt, ##__VA_ARGS__)
>> +
>> +#define XE_RAS_GT_TDR(xe, gt, errno, fmt, ...) \
>> +	__xe_ras_log((xe), (gt), XE_SIG_GT_TDR, CPER_SEV_RECOVERABLE, \
>> +		     (errno), fmt, ##__VA_ARGS__)
>> +
>> +#define XE_RAS_MEM_FAULT(xe, gt, errno, fmt, ...) \
>> +	__xe_ras_log((xe), (gt), XE_SIG_MEM_FAULT, CPER_SEV_RECOVERABLE, \
>> +		     (errno), fmt, ##__VA_ARGS__)
>> +
>> +#define XE_RAS_IO_BUS(xe, errno, fmt, ...) \
>> +	__xe_ras_log((xe), NULL, XE_SIG_IO_BUS, CPER_SEV_RECOVERABLE, \
>> +		     (errno), fmt, ##__VA_ARGS__)
> maybe better:
>
> #define xe_ras_fatal(xe, sig, fmt, ...) ..
> #define xe_ras_recoverable(xe, sig, fmt, ...) ..
>
> #define xe_gt_ras_fatal(gt, sig, fmt, ...) ..
> #define xe_gt_ras_recoverable(gt, sig, fmt, ...) ..

Agreed!!

Thanks,

-/Mallesh

>> +
>> +#endif /* _XE_SIG_IDS_H_ */

next prev parent reply	other threads:[~2026-06-15  8:51 UTC|newest]

Thread overview: 5+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2026-06-11  9:12 [PATCH 0/2] drm/xe: Structured RAS error logging infrastructure Mallesh Koujalagi
2026-06-11  9:12 ` [PATCH 1/2] drm/xe: add structured " Mallesh Koujalagi
2026-06-11 22:20   ` Michal Wajdeczko
2026-06-15  8:51     ` Mallesh, Koujalagi [this message]
2026-06-11  9:12 ` [PATCH 2/2] drm/xe: use XE_RAS_WEDGED macro in xe_device_declare_wedged Mallesh Koujalagi

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=859a917b-a449-4e88-b9bd-e923a8de3504@intel.com \
    --to=mallesh.koujalagi@intel.com \
    --cc=anshuman.gupta@intel.com \
    --cc=aravind.iddamsetty@intel.com \
    --cc=badal.nilawar@intel.com \
    --cc=intel-xe@lists.freedesktop.org \
    --cc=karthik.poosa@intel.com \
    --cc=matthew.brost@intel.com \
    --cc=michal.wajdeczko@intel.com \
    --cc=raag.jadav@intel.com \
    --cc=riana.tauro@intel.com \
    --cc=rodrigo.vivi@intel.com \
    --cc=sk.anirban@intel.com \
    --cc=thomas.hellstrom@linux.intel.com \
    --cc=vinay.belgaumkar@intel.com \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link

Be sure your reply has a Subject: header at the top and a blank line before the message body.

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox