Re: [RFC PATCH v3 2/3] drm/xe: Add RAS logging helpers

All of lore.kernel.org
 help / color / mirror / Atom feed

From: "Tauro, Riana" <riana.tauro@intel.com>
To: Mallesh Koujalagi <mallesh.koujalagi@intel.com>,
	<intel-xe@lists.freedesktop.org>, <rodrigo.vivi@intel.com>,
	<matthew.brost@intel.com>, <thomas.hellstrom@linux.intel.com>
Cc: <anshuman.gupta@intel.com>, <badal.nilawar@intel.com>,
	<vinay.belgaumkar@intel.com>, <aravind.iddamsetty@intel.com>,
	<karthik.poosa@intel.com>, <sk.anirban@intel.com>,
	<raag.jadav@intel.com>
Subject: Re: [RFC PATCH v3 2/3] drm/xe: Add RAS logging helpers
Date: Fri, 19 Jun 2026 11:13:38 +0530	[thread overview]
Message-ID: <fcffd5d8-c019-47e8-83cb-c84cdb3b88bb@intel.com> (raw)
In-Reply-To: <20260617104711.79646-7-mallesh.koujalagi@intel.com>


On 17-06-2026 16:17, Mallesh Koujalagi wrote:
> Add xe_ras_log.c and xe_ras_log.h so the driver can report
> faults in a structured, machine-readable way.
>
> The core is __xe_ras_log(), which emits a single log line with:
> SIG_ID, severity, location (device or tile/GT), errno, and message.
> Fatal faults go to drm_err(); recoverable ones to drm_warn().
>
> Signed-off-by: Mallesh Koujalagi <mallesh.koujalagi@intel.com>
> ---
> v3:
> - Refer "Tile%u" and "GT%u" strings. (Michal Wajdeczko)
> - Remov xe_cper_severity_str(). (Michal Wajdeczko/Riana)
> - Move __xe_ras_log() function to xe_ras_log.h. (Michal Wajdeczko)
> - Make macro function properly.
> - Remove *_FIRST and *_LAST macro. (Michal Wajdeczko/Riana)
> - Add sig id documents. (Riana)
> - Change macro function same prefix as the file.
> - Handle __xe_ras_log() function with variable format.
> ---
>   drivers/gpu/drm/xe/Makefile     |   1 +
>   drivers/gpu/drm/xe/xe_ras_log.c |  63 +++++++++++++++++++
>   drivers/gpu/drm/xe/xe_ras_log.h | 108 ++++++++++++++++++++++++++++++++
>   3 files changed, 172 insertions(+)
>   create mode 100644 drivers/gpu/drm/xe/xe_ras_log.c
>   create mode 100644 drivers/gpu/drm/xe/xe_ras_log.h
>
> diff --git a/drivers/gpu/drm/xe/Makefile b/drivers/gpu/drm/xe/Makefile
> index 8e7b146880f4..607c9b099d03 100644
> --- a/drivers/gpu/drm/xe/Makefile
> +++ b/drivers/gpu/drm/xe/Makefile
> @@ -114,6 +114,7 @@ xe-y += xe_bb.o \
>   	xe_query.o \
>   	xe_range_fence.o \
>   	xe_ras.o \
> +	xe_ras_log.o \
>   	xe_reg_sr.o \
>   	xe_reg_whitelist.o \
>   	xe_ring_ops.o \
> diff --git a/drivers/gpu/drm/xe/xe_ras_log.c b/drivers/gpu/drm/xe/xe_ras_log.c
> new file mode 100644
> index 000000000000..0e836ef5dcf6
> --- /dev/null
> +++ b/drivers/gpu/drm/xe/xe_ras_log.c
> @@ -0,0 +1,63 @@
> +// SPDX-License-Identifier: MIT
> +/*
> + * Copyright © 2026 Intel Corporation
> + */
> +
> +#include <drm/drm_print.h>
> +
> +#include "xe_device.h"
> +#include "xe_gt.h"
> +#include "xe_ras_log.h"
> +
> +/**
> + * __xe_ras_log - Emit a structured RAS log entry

Why __?

> + * @xe: xe device instance
> + * @gt: GT instance where the error occurred, or NULL if device-wide
> + * @sig_id: signature ID from xe_sig_ids.h identifying the error class
> + * @cper_sev: CPER severity (one of CPER_SEV_FATAL, CPER_SEV_RECOVERABLE, etc.)

Add enum  name instead

> + * @errno_val: negative errno describing the error condition

Why do we need error no?

> + * @fmt: printf-style format string
> + * @...: format arguments
> + *
> + * Formats the message and emits a kernel log line via drm_err() for fatal
> + * events or drm_warn() for all others. CPER record generation and hex dump
> + * are planned as follow-ups.
> + *
> + * Format:
> + *   [xe-err] SIG_ID = <id> Severity = <sev> Location = <loc> Errno = <n> Message = "<msg>"

You are using both error and warn but here it's only xe-err.
Isn't this misleading?

> + */
> +__printf(6, 7)
> +void __xe_ras_log(struct xe_device *xe, struct xe_gt *gt,
> +		  u16 sig_id, u32 cper_sev, int errno_val,
> +		  const char *fmt, ...)
> +{
> +	char loc[32];
> +	struct va_format vaf;
> +	va_list ap;
> +
> +	if (gt)
> +		snprintf(loc, sizeof(loc), "tile%u/gt%u",
> +			 gt->tile->id, gt->info.id);
> +	else
> +		snprintf(loc, sizeof(loc), "device");
> +
> +	va_start(ap, fmt);
> +	vaf.fmt = fmt;
> +	vaf.va = &ap;
> +
> +	if (cper_sev == CPER_SEV_FATAL)
> +		drm_err(&xe->drm,
> +			"[xe-err] SIG_ID = %u Severity = %s Location = %s Errno = %d Message = \"%pV\"",
> +			sig_id, cper_severity_str(cper_sev), loc,
> +			errno_val, &vaf);
> +	else
> +		drm_warn(&xe->drm,
> +			 "[xe-err] SIG_ID = %u Severity = %s Location = %s Errno = %d Message = \"%pV\"",
> +			 sig_id, cper_severity_str(cper_sev), loc,
> +			 errno_val, &vaf);
> +
> +	va_end(ap);
> +
> +	/* TODO: Add CPER record driver handler */
> +	/* TODO: Add RAS dump cper hex handler */
> +}
> diff --git a/drivers/gpu/drm/xe/xe_ras_log.h b/drivers/gpu/drm/xe/xe_ras_log.h
> new file mode 100644
> index 000000000000..08318dea75a9
> --- /dev/null
> +++ b/drivers/gpu/drm/xe/xe_ras_log.h
> @@ -0,0 +1,108 @@
> +/* SPDX-License-Identifier: MIT */
> +/*
> + * Copyright © 2026 Intel Corporation
> + */
> +
> +#ifndef _XE_RAS_LOG_H_
> +#define _XE_RAS_LOG_H_
> +
> +#include <linux/cper.h>
> +
> +#include "xe_sig_ids.h"
> +
> +struct xe_device;
> +struct xe_gt;
> +
> +/**
> + * DOC: RAS structured logging and SIG_IDs

Link this document.  Add a  brief heading

> + *
> + * What this file is for
> + * ---------------------

Once  this document is generated.
"this file" does not indicate what you are referring to.
Use general doc headings

> + * Use the xe_ras_log_*() macros to report faults in a
> + * machine-readable way.
> + *
> + * What is a SIG_ID?
> + * -----------------
> + * A SIG_ID (defined in xe_sig_ids.h) is a stable numeric error class.

Add full form. Define what is SIG ID

> + * It tells tooling what kind of fault happened, independent of message text.
> + *
> + * Driver SIG_ID classes:
> + *   XE_SIG_PROBE         - probe/init failure
> + *   XE_SIG_WEDGED        - device unrecoverable

wedged does not mean device is not recoverable

> + *   XE_SIG_SURVIVABILITY - degraded/safe mode
>

not really safe mode. Please use actual definition

> + *   XE_SIG_FW            - firmware fault (GuC/HuC/GSC/CSC/PCODE)
> + *   XE_SIG_GT_TDR        - engine hang or GT reset (TDR)
> + *   XE_SIG_MEM_FAULT     - VM/page-fault/GTT error
> + *   XE_SIG_IO_BUS        - PCIe/IOMMU/MMIO bus error
> + *
> + * HW SIG_ID classes are hardware-defined categories, reported by firmware via
> + * CPER records (for example XE_SIG_HW_DEVICE_MEMORY). Driver code does not
> + * emit HW SIG_IDs directly.
> + *   XE_SIG_HW_DEVICE_MEMORY  - device memory errors (e.g. ECC)
> + *   XE_SIG_HW_CORE_COMPUTE   - compute/shader core errors
> + *   XE_SIG_HW_PCIE           - PCIe interface errors
> + *   XE_SIG_HW_FABRIC         - on-package fabric errors

on-package?

> + *   XE_SIG_HW_SOC_INTERNAL   - SoC-internal errors
> + *
> + * Why this exists
> + * ---------------
> + * SIG_IDs let RAS tools parse logs reliably, correlate CPER events, and apply
> + * policy/thresholding without depending on fragile string matching.

What do you mean?

> + *
> + * When to use xe_ras_log_*()
> + * --------------------------
> + * Use these macros when the event is:
> + *   1) a real hardware/firmware fault,
> + *   2) relevant to production monitoring, and

How does one decide if it's relevant?

> + *   3) clearly mapped to one SIG_ID class.
> + *
> + * Do not use for all logs
> + * -----------------------
> + * Keep regular debug/info/driver-state messages on standard logging helpers
> + * (xe_gt_dbg(), xe_gt_info(), xe_gt_warn(), xe_gt_err(), drm_dbg(), drm_info()).
> + * RAS macros are for structured fault reporting only.
> + */
> +
> +/*
> + * Common backend helper
> + */
> +__printf(6, 7)
> +void __xe_ras_log(struct xe_device *xe, struct xe_gt *gt,
> +		  u16 sig_id, u32 cper_sev, int errno_val,
> +		  const char *fmt, ...);
> +
> +/*
> + * Driver-facing reporting macros
> + */
> +
> +/* FATAL */
> +#define xe_ras_log_probe(xe, errno, fmt, ...) \
> +	__xe_ras_log((xe), NULL, XE_SIG_PROBE, CPER_SEV_FATAL, \
> +		     (errno), fmt, ##__VA_ARGS__)
> +
> +#define xe_ras_log_wedged(xe, errno, fmt, ...) \
> +	__xe_ras_log((xe), NULL, XE_SIG_WEDGED, CPER_SEV_FATAL, \
> +		     (errno), fmt, ##__VA_ARGS__)
> +
> +#define xe_ras_log_survivability(xe, errno, fmt, ...) \
> +	__xe_ras_log((xe), NULL, XE_SIG_SURVIVABILITY, CPER_SEV_FATAL, \
> +		     (errno), fmt, ##__VA_ARGS__)
> +
> +/* RECOVERABLE */
> +#define xe_ras_log_fw(xe, gt, errno, fmt, ...) \
> +	__xe_ras_log((xe), (gt), XE_SIG_FW, CPER_SEV_RECOVERABLE, \
> +		     (errno), fmt, ##__VA_ARGS__)
> +

Why force a type of severity here?

Thanks
Riana

> +#define xe_ras_log_gt_tdr(xe, gt, errno, fmt, ...) \
> +	__xe_ras_log((xe), (gt), XE_SIG_GT_TDR, CPER_SEV_RECOVERABLE, \
> +		     (errno), fmt, ##__VA_ARGS__)
> +
> +#define xe_ras_log_mem_fault(xe, gt, errno, fmt, ...) \
> +	__xe_ras_log((xe), (gt), XE_SIG_MEM_FAULT, CPER_SEV_RECOVERABLE, \
> +		     (errno), fmt, ##__VA_ARGS__)
> +
> +#define xe_ras_log_io_bus(xe, errno, fmt, ...) \
> +	__xe_ras_log((xe), NULL, XE_SIG_IO_BUS, CPER_SEV_RECOVERABLE, \
> +		     (errno), fmt, ##__VA_ARGS__)
> +
> +#endif /* _XE_RAS_LOG_H_ */

next prev parent reply	other threads:[~2026-06-19  5:43 UTC|newest]

Thread overview: 17+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2026-06-17 10:47 [RFC PATCH v3 0/3] drm/xe: Structured RAS error logging infrastructure Mallesh Koujalagi
2026-06-17 10:47 ` [RFC PATCH v3 1/3] drm/xe: Add error Signature IDs for RAS logging Mallesh Koujalagi
2026-06-19  5:48   ` Tauro, Riana
2026-06-23  7:41     ` Mallesh, Koujalagi
2026-06-20 17:02   ` Michal Wajdeczko
2026-06-24 11:51     ` Mallesh, Koujalagi
2026-06-17 10:47 ` [RFC PATCH v3 2/3] drm/xe: Add RAS logging helpers Mallesh Koujalagi
2026-06-19  5:43   ` Tauro, Riana [this message]
2026-06-23 10:41     ` Mallesh, Koujalagi
2026-06-20 19:29   ` Michal Wajdeczko
2026-06-24 16:03     ` Mallesh, Koujalagi
2026-06-17 10:47 ` [RFC PATCH v3 3/3] drm/xe: use XE_RAS_WEDGED macro in xe_device_declare_wedged Mallesh Koujalagi
2026-06-17 16:06   ` Bhadane, Dnyaneshwar
2026-06-20 19:39   ` Michal Wajdeczko
2026-06-24 16:16     ` Mallesh, Koujalagi
2026-06-17 11:11 ` ✗ CI.checkpatch: warning for drm/xe: Structured RAS error logging infrastructure (rev3) Patchwork
2026-06-17 11:11 ` ✗ CI.KUnit: failure " Patchwork

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=fcffd5d8-c019-47e8-83cb-c84cdb3b88bb@intel.com \
    --to=riana.tauro@intel.com \
    --cc=anshuman.gupta@intel.com \
    --cc=aravind.iddamsetty@intel.com \
    --cc=badal.nilawar@intel.com \
    --cc=intel-xe@lists.freedesktop.org \
    --cc=karthik.poosa@intel.com \
    --cc=mallesh.koujalagi@intel.com \
    --cc=matthew.brost@intel.com \
    --cc=raag.jadav@intel.com \
    --cc=rodrigo.vivi@intel.com \
    --cc=sk.anirban@intel.com \
    --cc=thomas.hellstrom@linux.intel.com \
    --cc=vinay.belgaumkar@intel.com \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link

Be sure your reply has a Subject: header at the top and a blank line before the message body.

This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.