Re: [RFC PATCH v3 2/3] drm/xe: Add RAS logging helpers

All of lore.kernel.org
 help / color / mirror / Atom feed

From: Michal Wajdeczko <michal.wajdeczko@intel.com>
To: Mallesh Koujalagi <mallesh.koujalagi@intel.com>,
	<intel-xe@lists.freedesktop.org>, <rodrigo.vivi@intel.com>,
	<matthew.brost@intel.com>, <thomas.hellstrom@linux.intel.com>
Cc: <anshuman.gupta@intel.com>, <badal.nilawar@intel.com>,
	<vinay.belgaumkar@intel.com>, <aravind.iddamsetty@intel.com>,
	<riana.tauro@intel.com>, <karthik.poosa@intel.com>,
	<sk.anirban@intel.com>, <raag.jadav@intel.com>
Subject: Re: [RFC PATCH v3 2/3] drm/xe: Add RAS logging helpers
Date: Sat, 20 Jun 2026 21:29:51 +0200	[thread overview]
Message-ID: <6a1dbfd1-9fce-4f02-85bd-7d6038f41ff8@intel.com> (raw)
In-Reply-To: <20260617104711.79646-7-mallesh.koujalagi@intel.com>



On 6/17/2026 12:47 PM, Mallesh Koujalagi wrote:
> Add xe_ras_log.c and xe_ras_log.h so the driver can report
> faults in a structured, machine-readable way.

hmm, but the dmesg logs are mostly for the human admins, no?

> 
> The core is __xe_ras_log(), which emits a single log line with:
> SIG_ID, severity, location (device or tile/GT), errno, and message.

our dmesg logs are already to some extend structured as they contain:

	severity:	<3>
	driver name: 	xe
	device name:	0000:00:02.0
	subsystem name:	[drm]
	error tag:	*ERROR*
	location:	Tile0:
			Tile0: GT0:
	message:	GuC mmio request ...

> Fatal faults go to drm_err(); recoverable ones to drm_warn().

note that we already have xe_err() and xe_warn() wrappers

> 
> Signed-off-by: Mallesh Koujalagi <mallesh.koujalagi@intel.com>
> ---
> v3:
> - Refer "Tile%u" and "GT%u" strings. (Michal Wajdeczko)
> - Remov xe_cper_severity_str(). (Michal Wajdeczko/Riana)
> - Move __xe_ras_log() function to xe_ras_log.h. (Michal Wajdeczko)
> - Make macro function properly.
> - Remove *_FIRST and *_LAST macro. (Michal Wajdeczko/Riana)
> - Add sig id documents. (Riana)
> - Change macro function same prefix as the file.
> - Handle __xe_ras_log() function with variable format.
> ---
>  drivers/gpu/drm/xe/Makefile     |   1 +
>  drivers/gpu/drm/xe/xe_ras_log.c |  63 +++++++++++++++++++
>  drivers/gpu/drm/xe/xe_ras_log.h | 108 ++++++++++++++++++++++++++++++++
>  3 files changed, 172 insertions(+)
>  create mode 100644 drivers/gpu/drm/xe/xe_ras_log.c
>  create mode 100644 drivers/gpu/drm/xe/xe_ras_log.h
> 
> diff --git a/drivers/gpu/drm/xe/Makefile b/drivers/gpu/drm/xe/Makefile
> index 8e7b146880f4..607c9b099d03 100644
> --- a/drivers/gpu/drm/xe/Makefile
> +++ b/drivers/gpu/drm/xe/Makefile
> @@ -114,6 +114,7 @@ xe-y += xe_bb.o \
>  	xe_query.o \
>  	xe_range_fence.o \
>  	xe_ras.o \
> +	xe_ras_log.o \
>  	xe_reg_sr.o \
>  	xe_reg_whitelist.o \
>  	xe_ring_ops.o \
> diff --git a/drivers/gpu/drm/xe/xe_ras_log.c b/drivers/gpu/drm/xe/xe_ras_log.c
> new file mode 100644
> index 000000000000..0e836ef5dcf6
> --- /dev/null
> +++ b/drivers/gpu/drm/xe/xe_ras_log.c
> @@ -0,0 +1,63 @@
> +// SPDX-License-Identifier: MIT
> +/*
> + * Copyright © 2026 Intel Corporation
> + */
> +
> +#include <drm/drm_print.h>
> +
> +#include "xe_device.h"
> +#include "xe_gt.h"
> +#include "xe_ras_log.h"
> +
> +/**
> + * __xe_ras_log - Emit a structured RAS log entry
> + * @xe: xe device instance
> + * @gt: GT instance where the error occurred, or NULL if device-wide
> + * @sig_id: signature ID from xe_sig_ids.h identifying the error class
> + * @cper_sev: CPER severity (one of CPER_SEV_FATAL, CPER_SEV_RECOVERABLE, etc.)
> + * @errno_val: negative errno describing the error condition
> + * @fmt: printf-style format string
> + * @...: format arguments
> + *
> + * Formats the message and emits a kernel log line via drm_err() for fatal
> + * events or drm_warn() for all others. CPER record generation and hex dump
> + * are planned as follow-ups.
> + *
> + * Format:
> + *   [xe-err] SIG_ID = <id> Severity = <sev> Location = <loc> Errno = <n> Message = "<msg>"
> + */
> +__printf(6, 7)
> +void __xe_ras_log(struct xe_device *xe, struct xe_gt *gt,
> +		  u16 sig_id, u32 cper_sev, int errno_val,
> +		  const char *fmt, ...)
> +{
> +	char loc[32];
> +	struct va_format vaf;
> +	va_list ap;
> +
> +	if (gt)
> +		snprintf(loc, sizeof(loc), "tile%u/gt%u",
> +			 gt->tile->id, gt->info.id);
> +	else
> +		snprintf(loc, sizeof(loc), "device");
> +
> +	va_start(ap, fmt);
> +	vaf.fmt = fmt;
> +	vaf.va = &ap;
> +
> +	if (cper_sev == CPER_SEV_FATAL)
> +		drm_err(&xe->drm,
> +			"[xe-err] SIG_ID = %u Severity = %s Location = %s Errno = %d Message = \"%pV\"",
> +			sig_id, cper_severity_str(cper_sev), loc,
> +			errno_val, &vaf);
> +	else
> +		drm_warn(&xe->drm,
> +			 "[xe-err] SIG_ID = %u Severity = %s Location = %s Errno = %d Message = \"%pV\"",
> +			 sig_id, cper_severity_str(cper_sev), loc,
> +			 errno_val, &vaf);

again, this is not a very friendly message for the human admin

can't we just add severity/id pair to our updated xe_printk macros, like

	xe_err_fatal(xe, id, fmt, ...)
	xe_err_recoverable(xe, id, fmt, ...)

	xe_gt_err_fatal(gt, id, fmt, ...)
	xe_gt_err_recoverable(gt, id, fmt, ...)

and print sev/id pair within single tag:

	0000:02.0 xe [drm] *ERROR* [ID=123456] *FATAL* format ...

> +
> +	va_end(ap);
> +
> +	/* TODO: Add CPER record driver handler */
> +	/* TODO: Add RAS dump cper hex handler */
> +}
> diff --git a/drivers/gpu/drm/xe/xe_ras_log.h b/drivers/gpu/drm/xe/xe_ras_log.h
> new file mode 100644
> index 000000000000..08318dea75a9
> --- /dev/null
> +++ b/drivers/gpu/drm/xe/xe_ras_log.h
> @@ -0,0 +1,108 @@
> +/* SPDX-License-Identifier: MIT */
> +/*
> + * Copyright © 2026 Intel Corporation
> + */
> +
> +#ifndef _XE_RAS_LOG_H_
> +#define _XE_RAS_LOG_H_
> +
> +#include <linux/cper.h>
> +
> +#include "xe_sig_ids.h"
> +
> +struct xe_device;
> +struct xe_gt;
> +
> +/**
> + * DOC: RAS structured logging and SIG_IDs
> + *
> + * What this file is for
> + * ---------------------
> + * Use the xe_ras_log_*() macros to report faults in a
> + * machine-readable way.
> + *
> + * What is a SIG_ID?
> + * -----------------
> + * A SIG_ID (defined in xe_sig_ids.h) is a stable numeric error class.

hmm, so a SIG is now a CLASS ?
in the prev patch it was described as 'GPU fault signature'

and this kernel-doc for SIG_IDs should be in .h file where we define those IDs

> + * It tells tooling what kind of fault happened, independent of message text.
> + *
> + * Driver SIG_ID classes:
> + *   XE_SIG_PROBE         - probe/init failure
> + *   XE_SIG_WEDGED        - device unrecoverable

hmm, in drm_dev_wedged_event() we can specify recovery method
so what unrecoverable means here?

> + *   XE_SIG_SURVIVABILITY - degraded/safe mode
> + *   XE_SIG_FW            - firmware fault (GuC/HuC/GSC/CSC/PCODE)
> + *   XE_SIG_GT_TDR        - engine hang or GT reset (TDR)
> + *   XE_SIG_MEM_FAULT     - VM/page-fault/GTT error
> + *   XE_SIG_IO_BUS        - PCIe/IOMMU/MMIO bus error
> + *
> + * HW SIG_ID classes are hardware-defined categories, reported by firmware via
> + * CPER records (for example XE_SIG_HW_DEVICE_MEMORY). Driver code does not
> + * emit HW SIG_IDs directly.

if HW SIGs are HW defined, are below IDs also defined by the HW
or HW just defines different components, and we define IDs for them?

> + *   XE_SIG_HW_DEVICE_MEMORY  - device memory errors (e.g. ECC)
> + *   XE_SIG_HW_CORE_COMPUTE   - compute/shader core errors
> + *   XE_SIG_HW_PCIE           - PCIe interface errors
> + *   XE_SIG_HW_FABRIC         - on-package fabric errors
> + *   XE_SIG_HW_SOC_INTERNAL   - SoC-internal errors
> + *
> + * Why this exists
> + * ---------------
> + * SIG_IDs let RAS tools parse logs reliably, correlate CPER events, and apply

dmesg is not reliable

> + * policy/thresholding without depending on fragile string matching.

if the whole idea is to group our errors per fault class,
then maybe our macros should be used like:

	xe_err_fatal(xe, CLASS, fmt, ...)
	xe_err_recoverable(xe, CLASS, fmt, ...)

and output should be:

	0000:02.0 xe [drm] *ERROR* [PROBE] *FATAL* format ...
	0000:02.0 xe [drm] *ERROR* [TDR] *RECOVERABLE* format ...

as printing magic ID numbers is not very friendly

and any 'machine structured output' should be exposed elsewhere?
maybe as sysfs file on the xe module level (to catch probe errors)

and maybe start with that alternate method first,
and use dmesg logging only as a best-effort informational step?

> + *
> + * When to use xe_ras_log_*()
> + * --------------------------
> + * Use these macros when the event is:
> + *   1) a real hardware/firmware fault,
> + *   2) relevant to production monitoring, and
> + *   3) clearly mapped to one SIG_ID class.
> + *
> + * Do not use for all logs
> + * -----------------------
> + * Keep regular debug/info/driver-state messages on standard logging helpers
> + * (xe_gt_dbg(), xe_gt_info(), xe_gt_warn(), xe_gt_err(), drm_dbg(), drm_info()).
> + * RAS macros are for structured fault reporting only.
> + */
> +
> +/*
> + * Common backend helper
> + */
> +__printf(6, 7)
> +void __xe_ras_log(struct xe_device *xe, struct xe_gt *gt,
> +		  u16 sig_id, u32 cper_sev, int errno_val,
> +		  const char *fmt, ...);
> +
> +/*
> + * Driver-facing reporting macros
> + */
> +
> +/* FATAL */
> +#define xe_ras_log_probe(xe, errno, fmt, ...) \
> +	__xe_ras_log((xe), NULL, XE_SIG_PROBE, CPER_SEV_FATAL, \
> +		     (errno), fmt, ##__VA_ARGS__)
> +
> +#define xe_ras_log_wedged(xe, errno, fmt, ...) \
> +	__xe_ras_log((xe), NULL, XE_SIG_WEDGED, CPER_SEV_FATAL, \
> +		     (errno), fmt, ##__VA_ARGS__)
> +
> +#define xe_ras_log_survivability(xe, errno, fmt, ...) \
> +	__xe_ras_log((xe), NULL, XE_SIG_SURVIVABILITY, CPER_SEV_FATAL, \
> +		     (errno), fmt, ##__VA_ARGS__)
> +
> +/* RECOVERABLE */
> +#define xe_ras_log_fw(xe, gt, errno, fmt, ...) \
> +	__xe_ras_log((xe), (gt), XE_SIG_FW, CPER_SEV_RECOVERABLE, \
> +		     (errno), fmt, ##__VA_ARGS__)
> +
> +#define xe_ras_log_gt_tdr(xe, gt, errno, fmt, ...) \
> +	__xe_ras_log((xe), (gt), XE_SIG_GT_TDR, CPER_SEV_RECOVERABLE, \
> +		     (errno), fmt, ##__VA_ARGS__)
> +
> +#define xe_ras_log_mem_fault(xe, gt, errno, fmt, ...) \
> +	__xe_ras_log((xe), (gt), XE_SIG_MEM_FAULT, CPER_SEV_RECOVERABLE, \
> +		     (errno), fmt, ##__VA_ARGS__)
> +
> +#define xe_ras_log_io_bus(xe, errno, fmt, ...) \
> +	__xe_ras_log((xe), NULL, XE_SIG_IO_BUS, CPER_SEV_RECOVERABLE, \
> +		     (errno), fmt, ##__VA_ARGS__)

IMO those RAS customized macros, should be per severity and take
a CLASS as a param since there is much less severity levels than
your new SIG_IDs fault classes, so it scales better

	xe_err_fatal(xe, PROBE, fmt, ...)
	xe_err_recoverable(xe, IO_BUS, fmt, ...)

and I'm not sure that there is 1:1 relation between CLASS and severity

> +
> +#endif /* _XE_RAS_LOG_H_ */

next prev parent reply	other threads:[~2026-06-20 19:30 UTC|newest]

Thread overview: 17+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2026-06-17 10:47 [RFC PATCH v3 0/3] drm/xe: Structured RAS error logging infrastructure Mallesh Koujalagi
2026-06-17 10:47 ` [RFC PATCH v3 1/3] drm/xe: Add error Signature IDs for RAS logging Mallesh Koujalagi
2026-06-19  5:48   ` Tauro, Riana
2026-06-23  7:41     ` Mallesh, Koujalagi
2026-06-20 17:02   ` Michal Wajdeczko
2026-06-24 11:51     ` Mallesh, Koujalagi
2026-06-17 10:47 ` [RFC PATCH v3 2/3] drm/xe: Add RAS logging helpers Mallesh Koujalagi
2026-06-19  5:43   ` Tauro, Riana
2026-06-23 10:41     ` Mallesh, Koujalagi
2026-06-20 19:29   ` Michal Wajdeczko [this message]
2026-06-24 16:03     ` Mallesh, Koujalagi
2026-06-17 10:47 ` [RFC PATCH v3 3/3] drm/xe: use XE_RAS_WEDGED macro in xe_device_declare_wedged Mallesh Koujalagi
2026-06-17 16:06   ` Bhadane, Dnyaneshwar
2026-06-20 19:39   ` Michal Wajdeczko
2026-06-24 16:16     ` Mallesh, Koujalagi
2026-06-17 11:11 ` ✗ CI.checkpatch: warning for drm/xe: Structured RAS error logging infrastructure (rev3) Patchwork
2026-06-17 11:11 ` ✗ CI.KUnit: failure " Patchwork

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=6a1dbfd1-9fce-4f02-85bd-7d6038f41ff8@intel.com \
    --to=michal.wajdeczko@intel.com \
    --cc=anshuman.gupta@intel.com \
    --cc=aravind.iddamsetty@intel.com \
    --cc=badal.nilawar@intel.com \
    --cc=intel-xe@lists.freedesktop.org \
    --cc=karthik.poosa@intel.com \
    --cc=mallesh.koujalagi@intel.com \
    --cc=matthew.brost@intel.com \
    --cc=raag.jadav@intel.com \
    --cc=riana.tauro@intel.com \
    --cc=rodrigo.vivi@intel.com \
    --cc=sk.anirban@intel.com \
    --cc=thomas.hellstrom@linux.intel.com \
    --cc=vinay.belgaumkar@intel.com \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link

Be sure your reply has a Subject: header at the top and a blank line before the message body.

This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.