Re: [PATCH 1/2] drm/xe: add structured RAS error logging infrastructure

Intel-XE Archive on lore.kernel.org
 help / color / mirror / Atom feed

From: Michal Wajdeczko <michal.wajdeczko@intel.com>
To: Mallesh Koujalagi <mallesh.koujalagi@intel.com>,
	<intel-xe@lists.freedesktop.org>, <rodrigo.vivi@intel.com>,
	<matthew.brost@intel.com>, <thomas.hellstrom@linux.intel.com>
Cc: <anshuman.gupta@intel.com>, <badal.nilawar@intel.com>,
	<vinay.belgaumkar@intel.com>, <aravind.iddamsetty@intel.com>,
	<riana.tauro@intel.com>, <karthik.poosa@intel.com>,
	<sk.anirban@intel.com>, <raag.jadav@intel.com>
Subject: Re: [PATCH 1/2] drm/xe: add structured RAS error logging infrastructure
Date: Fri, 12 Jun 2026 00:20:00 +0200	[thread overview]
Message-ID: <33423af2-7e8d-4682-a3ca-ff99bad28979@intel.com> (raw)
In-Reply-To: <20260611091224.17275-5-mallesh.koujalagi@intel.com>



On 6/11/2026 11:12 AM, Mallesh Koujalagi wrote:
> Introduce SIG_ID defining Signature IDs (SIG IDs) for
> categorising XE GPU errors, and __xe_ras_log() to emit
> structured log lines with SIG ID, CPER severity, tile/GT
> location, errno, and a free-form message.
> 
> Signed-off-by: Mallesh Koujalagi <mallesh.koujalagi@intel.com>
> ---
>  drivers/gpu/drm/xe/Makefile     |   1 +
>  drivers/gpu/drm/xe/xe_ras_log.c |  64 ++++++++++++++++++++
>  drivers/gpu/drm/xe/xe_sig_ids.h | 101 ++++++++++++++++++++++++++++++++
>  3 files changed, 166 insertions(+)
>  create mode 100644 drivers/gpu/drm/xe/xe_ras_log.c
>  create mode 100644 drivers/gpu/drm/xe/xe_sig_ids.h
> 
> diff --git a/drivers/gpu/drm/xe/Makefile b/drivers/gpu/drm/xe/Makefile
> index 091872771e98..412be4bad952 100644
> --- a/drivers/gpu/drm/xe/Makefile
> +++ b/drivers/gpu/drm/xe/Makefile
> @@ -115,6 +115,7 @@ xe-y += xe_bb.o \
>  	xe_query.o \
>  	xe_range_fence.o \
>  	xe_ras.o \
> +	xe_ras_log.o \
>  	xe_reg_sr.o \
>  	xe_reg_whitelist.o \
>  	xe_ring_ops.o \
> diff --git a/drivers/gpu/drm/xe/xe_ras_log.c b/drivers/gpu/drm/xe/xe_ras_log.c
> new file mode 100644
> index 000000000000..b6a2792b61d4
> --- /dev/null
> +++ b/drivers/gpu/drm/xe/xe_ras_log.c
> @@ -0,0 +1,64 @@
> +// SPDX-License-Identifier: MIT
> +/*
> + * Copyright © 2026 Intel Corporation
> + */
> +
> +#include <drm/drm_print.h>
> +
> +#include "xe_device.h"
> +#include "xe_gt.h"
> +#include "xe_sig_ids.h"
> +
> +/**
> + * __xe_ras_log - Emit a structured RAS log entry
> + * @xe: xe device instance
> + * @gt: GT instance where the error occurred, or NULL if device-wide

what about tile-only errors?

> + * @sig_id: signature ID from xe_sig_ids.h identifying the error class
> + * @cper_sev: CPER severity (one of CPER_SEV_FATAL, CPER_SEV_RECOVERABLE, etc.)
> + * @errno_val: negative errno describing the error condition
> + * @fmt: printf-style format string
> + * @...: format arguments
> + *
> + * Formats the message and emits a kernel log line via drm_err() for fatal
> + * events or drm_warn() for all others. CPER record generation and hex dump
> + * are planned as follow-ups.
> + *
> + * Format:
> + *   [xe-err] SIG_ID = <id> Severity = <sev> Location = <loc> Errno = <n> Message = "<msg>"

is this some standard format or just our invention?

> + */
> +__printf(6, 7)
> +void __xe_ras_log(struct xe_device *xe, struct xe_gt *gt,
> +		  u16 sig_id, u32 cper_sev, int errno_val,
> +		  const char *fmt, ...)
> +{
> +	char loc[32];
> +	char msg[256];
> +	va_list ap;
> +	int ret;
> +
> +	if (gt)
> +		snprintf(loc, sizeof(loc), "tile = %u/gt = %u",

elsewhere in the xe driver we are referring to a tile as "Tile%u" and to a GT as "GT%u"

also, the "=" used here might mess with rest of the items that are also separated by "="

> +			 gt->tile->id, gt->info.id);
> +	else
> +		snprintf(loc, sizeof(loc), "device");
> +
> +	va_start(ap, fmt);
> +	ret = vsnprintf(msg, sizeof(msg), fmt, ap);

can't you just pass va to drm_err/warn below ? 
> +	va_end(ap);
> +
> +	WARN_ON_ONCE(ret >= (int)sizeof(msg));

maybe xe_assert() ? but shouldn't be needed anyway

> +
> +	if (cper_sev == CPER_SEV_FATAL)
> +		drm_err(&xe->drm,
> +			"[xe-err] SIG_ID = %u Severity = %s Location = %s Errno = %d Message = \"%s\"\n",

do we need this "[xe-err]" ? both 'xe' and 'err' will be there already added by drm_err()

	[ ] xe 0000:00:02.0: [drm] *ERROR* ...

also passing a raw SIG_ID number is not user friendly ...

and severity is always present, so maybe just print it without any prefix

for location maybe we can use existing xe_gt_err/xe_tile_err/xe_err macros ?

and is errno really needed/useful? we already have severity and details will be in the msg

so maybe instead of seeing this:

  [ ] xe 0000:00:02.0: [drm] *ERROR* [xe-err] SIG_ID = 123 Severity = fatal Location = device Errno = -5 Message = "blah blah"

we can try with messages like:

  [ ] xe 0000:00:02.0: [drm] *ERROR* [fatal:123] blah blah (-EIO)
  [ ] xe 0000:00:02.0: [drm] *ERROR* [recoverable:987] blah blah (-ETIME)

  [ ] xe 0000:00:02.0: [drm] *ERROR* Tile0: GT1: [fatal:123] blah blah (-EIO)
  [ ] xe 0000:00:02.0: [drm] *ERROR* Tile0: GT1: [recoverable:987] blah blah (-ETIME)



> +			sig_id, xe_cper_severity_str(cper_sev), loc,
> +			errno_val, msg);
> +	else
> +		drm_warn(&xe->drm,
> +			 "[xe-err] SIG_ID = %u Severity = %s Location = %s Errno = %d Message = \"%s\"\n",
> +			 sig_id, xe_cper_severity_str(cper_sev), loc,
> +			 errno_val, msg);
> +
> +	/* TODO: Add CPER record driver handler */
> +	/* TODO: Add RAS dump cper hex handler */
> +}
> diff --git a/drivers/gpu/drm/xe/xe_sig_ids.h b/drivers/gpu/drm/xe/xe_sig_ids.h
> new file mode 100644
> index 000000000000..8abe7554714e
> --- /dev/null
> +++ b/drivers/gpu/drm/xe/xe_sig_ids.h

maybe this should be "xe_ras_errors.h" ? plain "sig_ids" tells nothing

> @@ -0,0 +1,101 @@
> +/* SPDX-License-Identifier: MIT */
> +/*
> + * Copyright © 2026 Intel Corporation
> + */
> +
> +#ifndef _XE_SIG_IDS_H_
> +#define _XE_SIG_IDS_H_
> +
> +#include <linux/cper.h>
> +
> +/**
> + * xe_cper_severity_str - local severity-to-string helper
> + * @sev: CPER severity value (one of CPER_SEV_FATAL, CPER_SEV_RECOVERABLE, etc.)
> + *
> + * Avoids a link-time dependency on cper_severity_str() which is only
> + * compiled when CONFIG_UEFI_CPER=y.

or maybe we should just select UEFI_CPER in our Kconfig ?

duplicating whole function looks like a bad idea
> + *
> + * Return: string representation of the severity, or "unknown".
> + */
> +static inline const char *xe_cper_severity_str(u32 sev)
> +{
> +	switch (sev) {
> +	case CPER_SEV_RECOVERABLE:	return "recoverable";
> +	case CPER_SEV_FATAL:		return "fatal";
> +	case CPER_SEV_CORRECTED:	return "corrected";
> +	case CPER_SEV_INFORMATIONAL:	return "informational";
> +	default:			return "unknown";
> +	}
> +}
> +
> +/*
> + * Driver SIG_IDs

what's the criteria for each new SIG?
is it per component? or per category?

> + */
> +#define XE_SIG_PROBE			1  /* FATAL: probe failed */
> +#define XE_SIG_WEDGED			2  /* FATAL: device wedged */
> +#define XE_SIG_SURVIVABILITY		3  /* FATAL: survivability mode */
> +#define XE_SIG_FW			4  /* RECOVERABLE: GuC/HuC/UC/GSC/CSC/PCODE */

btw, shouldn't we use existing FW_BUG FW_WARN FW_INFO prefixes when reporting our FW errors?

also, are we sure all FW errors are recoverable?

> +#define XE_SIG_GT_TDR			5  /* RECOVERABLE: engine hang / reset */
> +#define XE_SIG_MEM_FAULT		6  /* RECOVERABLE: VM bind, page fault, GTT */
> +#define XE_SIG_IO_BUS			7  /* RECOVERABLE: runtime PCIe/IOMMU/MMIO */

are those IDs supposed to be stable or can be changed later?

> +
> +/*
> + * HW SIG_IDs
> + */
> +#define XE_SIG_HW_DEVICE_MEMORY		8
> +#define XE_SIG_HW_CORE_COMPUTE		9
> +#define XE_SIG_HW_SCALE_UP_LINK		10
> +#define XE_SIG_HW_PCIE			11
> +#define XE_SIG_HW_FABRIC		12
> +#define XE_SIG_HW_SOC_INTERNAL		13

hmm, those IDs look more like HW components names, not errors 

> +
> +/* Must be updated when adding new driver SIG IDs */
> +#define XE_SIG_DRIVER_LAST		XE_SIG_IO_BUS
> +#define XE_SIG_HW_FIRST			XE_SIG_HW_DEVICE_MEMORY

if IDs must be stable than this will not work

> +
> +struct xe_device;
> +struct xe_gt;
> +
> +/*
> + * Common backend helper
> + */
> +__printf(6, 7)
> +void __xe_ras_log(struct xe_device *xe, struct xe_gt *gt,
> +		  u16 sig_id, u32 cper_sev, int errno_val,
> +		  const char *fmt, ...);

this is a wrong place to define function prototype,
it should be in xe_ras_log.h instead I guess

> +
> +/*
> + * Driver-facing reporting macros
> + */
> +
> +/* FATAL */
> +#define XE_RAS_PROBE(xe, errno, fmt, ...) \
> +	__xe_ras_log((xe), NULL, XE_SIG_PROBE, CPER_SEV_FATAL, \
> +		     (errno), fmt, ##__VA_ARGS__)
> +
> +#define XE_RAS_WEDGED(xe, errno, fmt, ...) \
> +	__xe_ras_log((xe), NULL, XE_SIG_WEDGED, CPER_SEV_FATAL, \
> +		     (errno), fmt, ##__VA_ARGS__)
> +
> +#define XE_RAS_SURVIVABILITY(xe, errno, fmt, ...) \
> +	__xe_ras_log((xe), NULL, XE_SIG_SURVIVABILITY, CPER_SEV_FATAL, \
> +		     (errno), fmt, ##__VA_ARGS__)
> +
> +/* RECOVERABLE */
> +#define XE_RAS_FW(xe, gt, errno, fmt, ...) \
> +	__xe_ras_log((xe), (gt), XE_SIG_FW, CPER_SEV_RECOVERABLE, \
> +		     (errno), fmt, ##__VA_ARGS__)
> +
> +#define XE_RAS_GT_TDR(xe, gt, errno, fmt, ...) \
> +	__xe_ras_log((xe), (gt), XE_SIG_GT_TDR, CPER_SEV_RECOVERABLE, \
> +		     (errno), fmt, ##__VA_ARGS__)
> +
> +#define XE_RAS_MEM_FAULT(xe, gt, errno, fmt, ...) \
> +	__xe_ras_log((xe), (gt), XE_SIG_MEM_FAULT, CPER_SEV_RECOVERABLE, \
> +		     (errno), fmt, ##__VA_ARGS__)
> +
> +#define XE_RAS_IO_BUS(xe, errno, fmt, ...) \
> +	__xe_ras_log((xe), NULL, XE_SIG_IO_BUS, CPER_SEV_RECOVERABLE, \
> +		     (errno), fmt, ##__VA_ARGS__)

maybe better:

#define xe_ras_fatal(xe, sig, fmt, ...) ..
#define xe_ras_recoverable(xe, sig, fmt, ...) ..

#define xe_gt_ras_fatal(gt, sig, fmt, ...) ..
#define xe_gt_ras_recoverable(gt, sig, fmt, ...) ..

> +
> +#endif /* _XE_SIG_IDS_H_ */

next prev parent reply	other threads:[~2026-06-11 22:20 UTC|newest]

Thread overview: 5+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2026-06-11  9:12 [PATCH 0/2] drm/xe: Structured RAS error logging infrastructure Mallesh Koujalagi
2026-06-11  9:12 ` [PATCH 1/2] drm/xe: add structured " Mallesh Koujalagi
2026-06-11 22:20   ` Michal Wajdeczko [this message]
2026-06-15  8:51     ` Mallesh, Koujalagi
2026-06-11  9:12 ` [PATCH 2/2] drm/xe: use XE_RAS_WEDGED macro in xe_device_declare_wedged Mallesh Koujalagi

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=33423af2-7e8d-4682-a3ca-ff99bad28979@intel.com \
    --to=michal.wajdeczko@intel.com \
    --cc=anshuman.gupta@intel.com \
    --cc=aravind.iddamsetty@intel.com \
    --cc=badal.nilawar@intel.com \
    --cc=intel-xe@lists.freedesktop.org \
    --cc=karthik.poosa@intel.com \
    --cc=mallesh.koujalagi@intel.com \
    --cc=matthew.brost@intel.com \
    --cc=raag.jadav@intel.com \
    --cc=riana.tauro@intel.com \
    --cc=rodrigo.vivi@intel.com \
    --cc=sk.anirban@intel.com \
    --cc=thomas.hellstrom@linux.intel.com \
    --cc=vinay.belgaumkar@intel.com \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link

Be sure your reply has a Subject: header at the top and a blank line before the message body.

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox