Re: [PATCH v3 07/10] drm/xe/xe_ras: Add support for Uncorrectable Core-Compute errors

public inbox for intel-xe@lists.freedesktop.org
 help / color / mirror / Atom feed

From: Raag Jadav <raag.jadav@intel.com>
To: Riana Tauro <riana.tauro@intel.com>
Cc: intel-xe@lists.freedesktop.org, anshuman.gupta@intel.com,
	rodrigo.vivi@intel.com, aravind.iddamsetty@linux.intel.com,
	badal.nilawar@intel.com, ravi.kishore.koppuravuri@intel.com,
	mallesh.koujalagi@intel.com, soham.purkait@intel.com
Subject: Re: [PATCH v3 07/10] drm/xe/xe_ras: Add support for Uncorrectable Core-Compute errors
Date: Wed, 8 Apr 2026 13:15:19 +0200	[thread overview]
Message-ID: <adY4x7iYSfs04ufg@black.igk.intel.com> (raw)
In-Reply-To: <20260402070131.1603828-19-riana.tauro@intel.com>

On Thu, Apr 02, 2026 at 12:31:38PM +0530, Riana Tauro wrote:
> Uncorrectable Core-Compute errors are classified into Global and Local
> errors.
> 
> Global error is an error that affects the entire device requiring a
> reset. This type of error is not isolated. When an AER is reported and
> error_detected is invoked return PCI_ERS_RESULT_NEED_RESET.
> 
> A Local error is confined to a specific component or context like a
> engine. These errors can be contained and recovered by resetting
> only the affected part without distrupting the rest of the device.
> 
> Upon detection of an Uncorrectable Local Core-Compute error, an AER is
> generated and GuC is notified of the error which triggers a engine reset.
> Return Recovered from PCI error callbacks for these errors.
> 
> Signed-off-by: Riana Tauro <riana.tauro@intel.com>
> ---
> v2: add newline and fix log
>     add bounds check (Mallesh)
>     add ras specific enum (Raag)
>     helper for sysctrl prepare command
>     process all errors before deciding recovery action
> 
> v3: remove TODO from commit message
>     remove redundant rlen check
>     fix loop
>     add check for sysctrl flooding (Raag)
>     do not use xe_ras prefix for static functions (Soham)
> ---
>  drivers/gpu/drm/xe/xe_ras.c       | 131 ++++++++++++++++++++++++++++++
>  drivers/gpu/drm/xe/xe_ras.h       |   3 +
>  drivers/gpu/drm/xe/xe_ras_types.h |  18 ++++
>  3 files changed, 152 insertions(+)
> 
> diff --git a/drivers/gpu/drm/xe/xe_ras.c b/drivers/gpu/drm/xe/xe_ras.c
> index 515c28167a69..6542ed4a6a24 100644
> --- a/drivers/gpu/drm/xe/xe_ras.c
> +++ b/drivers/gpu/drm/xe/xe_ras.c
> @@ -5,7 +5,16 @@
>  
>  #include "xe_assert.h"
>  #include "xe_device_types.h"
> +#include "xe_printk.h"
>  #include "xe_ras.h"
> +#include "xe_ras_types.h"
> +#include "xe_sysctrl_mailbox.h"
> +#include "xe_sysctrl_mailbox_types.h"
> +
> +#define COMPUTE_ERROR_SEVERITY_MASK		GENMASK(26, 25)
> +#define GLOBAL_UNCORR_ERROR			2
> +/* Modify as needed */
> +#define XE_SYSCTRL_ERROR_FLOOD			16
>  
>  /* Severity classification of detected errors */
>  enum xe_ras_severity {
> @@ -61,6 +70,128 @@ static inline const char *comp_to_str(struct xe_device *xe, u32 comp)
>  	return comp < XE_RAS_COMPONENT_MAX ? xe_ras_components[comp] : "Unknown";
>  }
>  
> +static enum xe_ras_recovery_action handle_compute_errors(struct xe_device *xe,
> +							 struct xe_ras_error_array *arr)
> +{
> +	struct xe_ras_compute_error *error_info = (struct xe_ras_compute_error *)arr->error_details;
> +	struct xe_ras_error_common common = arr->error_class.common;
> +	u8 uncorr_type;
> +
> +	uncorr_type = FIELD_GET(COMPUTE_ERROR_SEVERITY_MASK, error_info->error_log_header);
> +
> +	xe_err(xe, "[RAS]: %s %s Error detected", severity_to_str(xe, common.severity),
> +	       comp_to_str(xe, common.component));
> +
> +	/* Request a RESET if error is global */
> +	if (uncorr_type == GLOBAL_UNCORR_ERROR)
> +		return XE_RAS_RECOVERY_ACTION_RESET;
> +
> +	/* Local errors are recovered using a engine reset by GuC */

So do we need to synchronization with GuC before returning?

> +	return XE_RAS_RECOVERY_ACTION_RECOVERED;
> +}
> +
> +static void prepare_sysctrl_command(struct xe_sysctrl_mailbox_command *command,
> +				    u32 cmd_mask, void *request, size_t request_len,
> +				    void *response, size_t response_len)
> +{
> +	struct xe_sysctrl_app_msg_hdr hdr = {0};
> +	u32 req_hdr;

With a single member, not sure if it's worth having a local variable.

> +	req_hdr = FIELD_PREP(APP_HDR_GROUP_ID_MASK, XE_SYSCTRL_GROUP_GFSP) |

Can simply be hdr.data.

> +		  FIELD_PREP(APP_HDR_COMMAND_MASK, cmd_mask);
> +
> +	hdr.data = req_hdr;

With above in place, this won't be needed.

> +	command->header = hdr;
> +	command->data_in = request;
> +	command->data_in_len = request_len;
> +	command->data_out = response;
> +	command->data_out_len = response_len;
> +}
> +
> +/**
> + * xe_ras_process_errors - Process and contain hardware errors
> + * @xe: xe device instance
> + *
> + * Get error details from system controller and return recovery
> + * method. Called only from PCI error handling.
> + *
> + * Returns: recovery action to be taken
> + */
> +enum xe_ras_recovery_action xe_ras_process_errors(struct xe_device *xe)
> +{
> +	struct xe_sysctrl_mailbox_command command = {0};
> +	struct xe_ras_get_error_response response;
> +	enum xe_ras_recovery_action final_action;
> +	u32 count = 0;
> +	size_t rlen;
> +	int ret;
> +
> +	/* Default action */
> +	final_action = XE_RAS_RECOVERY_ACTION_RECOVERED;
> +
> +	if (!xe->info.has_sysctrl)

This looks like should be at the top.

> +		return XE_RAS_RECOVERY_ACTION_RESET;
> +
> +	prepare_sysctrl_command(&command, XE_SYSCTRL_CMD_GET_SOC_ERROR, NULL, 0,
> +				&response, sizeof(response));
> +
> +	do {
> +		memset(&response, 0, sizeof(response));
> +		rlen = 0;

Is this needed? This is guaranteed to be updated if command succeeds.

> +		ret = xe_sysctrl_send_command(&xe->sc, &command, &rlen);
> +		if (ret) {
> +			xe_err(xe, "[RAS]: Sysctrl error ret %d\n", ret);
> +			goto err;
> +		}
> +
> +		if (rlen != sizeof(response)) {
> +			xe_err(xe, "[RAS]: Sysctrl response size mismatch. Expected %zu, got %zu\n",
> +			       sizeof(response), rlen);
> +			goto err;
> +		}
> +
> +		for (int i = 0; i < response.num_errors &&  i < XE_RAS_NUM_ERROR_ARR; i++) {

I know this was my suggestion, but Mallesh later suggested that we fail on
out of spec behavior, which is not guaranteed here. So we can go back to
v2 implementation if it makes more sense. I'll leave it to you all.

Raag

> +			struct xe_ras_error_array arr = response.error_arr[i];
> +			enum xe_ras_recovery_action action;
> +			struct xe_ras_error_class error_class;
> +			u8 component;
> +
> +			error_class = arr.error_class;
> +			component = error_class.common.component;
> +
> +			switch (component) {
> +			case XE_RAS_COMPONENT_CORE_COMPUTE:
> +				action = handle_compute_errors(xe, &arr);
> +				break;
> +			default:
> +				xe_err(xe, "[RAS]: Unknown error component %u\n", component);
> +				action = XE_RAS_RECOVERY_ACTION_RESET;
> +				break;
> +			}
> +
> +			/*
> +			 * Retain the highest severity action. Process and log all errors
> +			 * and then take appropriate recovery action.
> +			 */
> +			if (action > final_action)
> +				final_action = action;
> +		}
> +
> +		/* Break if system controller floods responses */
> +		if (++count > XE_SYSCTRL_ERROR_FLOOD) {
> +			xe_err(xe, "[RAS]: Sysctrl response flooding\n");
> +			break;
> +		}
> +
> +	} while (response.additional_errors);
> +
> +	return final_action;
> +
> +err:
> +	return XE_RAS_RECOVERY_ACTION_RESET;
> +}
> +
>  #ifdef CONFIG_PCIEAER
>  static void aer_unmask_and_downgrade_internal_error(struct xe_device *xe)
>  {
> diff --git a/drivers/gpu/drm/xe/xe_ras.h b/drivers/gpu/drm/xe/xe_ras.h
> index 14cb973603e7..e191ab80080c 100644
> --- a/drivers/gpu/drm/xe/xe_ras.h
> +++ b/drivers/gpu/drm/xe/xe_ras.h
> @@ -6,8 +6,11 @@
>  #ifndef _XE_RAS_H_
>  #define _XE_RAS_H_
>  
> +#include "xe_ras_types.h"
> +
>  struct xe_device;
>  
>  void xe_ras_init(struct xe_device *xe);
> +enum xe_ras_recovery_action  xe_ras_process_errors(struct xe_device *xe);
>  
>  #endif
> diff --git a/drivers/gpu/drm/xe/xe_ras_types.h b/drivers/gpu/drm/xe/xe_ras_types.h
> index ed69654e5bae..e37dd12bffa3 100644
> --- a/drivers/gpu/drm/xe/xe_ras_types.h
> +++ b/drivers/gpu/drm/xe/xe_ras_types.h
> @@ -11,6 +11,24 @@
>  #define XE_RAS_NUM_ERROR_ARR		3
>  #define XE_RAS_MAX_ERROR_DETAILS	16
>  
> +/**
> + * enum xe_ras_recovery_action - RAS recovery actions
> + *
> + * @XE_RAS_RECOVERY_ACTION_RECOVERED: Error recovered
> + * @XE_RAS_RECOVERY_ACTION_RESET: Requires reset
> + * @XE_RAS_RECOVERY_ACTION_DISCONNECT: Requires disconnect
> + * @XE_RAS_RECOVERY_ACTION_MAX: Max action value
> + *
> + * This enum defines the possible recovery actions that can be taken in response
> + * to RAS errors.
> + */
> +enum xe_ras_recovery_action {
> +	XE_RAS_RECOVERY_ACTION_RECOVERED = 0,
> +	XE_RAS_RECOVERY_ACTION_RESET,
> +	XE_RAS_RECOVERY_ACTION_DISCONNECT,
> +	XE_RAS_RECOVERY_ACTION_MAX
> +};
> +
>  /**
>   * struct xe_ras_error_common - Common RAS error class
>   *
> -- 
> 2.47.1
>

next prev parent reply	other threads:[~2026-04-08 11:15 UTC|newest]

Thread overview: 20+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2026-04-02  7:01 [PATCH v3 00/10] Introduce Xe Uncorrectable Error Handling Riana Tauro
2026-04-02  7:01 ` [PATCH v3 01/10] drm/xe/xe_survivability: Decouple survivability info from boot survivability Riana Tauro
2026-04-02  7:01 ` [PATCH v3 02/10] drm/xe/xe_pci_error: Implement PCI error recovery callbacks Riana Tauro
2026-04-07  4:50   ` Matthew Brost
2026-04-02  7:01 ` [PATCH v3 03/10] drm/xe/xe_pci_error: Group all devres to release them on PCIe slot reset Riana Tauro
2026-04-02  7:01 ` [PATCH v3 04/10] drm/xe: Skip device access during PCI error recovery Riana Tauro
2026-04-02  7:01 ` [PATCH v3 05/10] drm/xe/xe_ras: Initialize Uncorrectable AER Registers Riana Tauro
2026-04-07  5:50   ` Raag Jadav
2026-04-02  7:01 ` [PATCH v3 06/10] drm/xe/xe_ras: Add structures and commands for Uncorrectable Core Compute Errors Riana Tauro
2026-04-07  5:59   ` Raag Jadav
2026-04-02  7:01 ` [PATCH v3 07/10] drm/xe/xe_ras: Add support for Uncorrectable Core-Compute errors Riana Tauro
2026-04-08 11:15   ` Raag Jadav [this message]
2026-04-02  7:01 ` [PATCH v3 08/10] drm/xe/xe_ras: Add structures for SoC Internal errors Riana Tauro
2026-04-08 11:18   ` Raag Jadav
2026-04-02  7:01 ` [PATCH v3 09/10] drm/xe/xe_ras: Handle Uncorrectable " Riana Tauro
2026-04-02  7:01 ` [PATCH v3 10/10] drm/xe/xe_pci_error: Process errors in mmio_enabled Riana Tauro
2026-04-02  8:01 ` ✗ CI.checkpatch: warning for Introduce Xe Uncorrectable Error Handling (rev3) Patchwork
2026-04-02  8:02 ` ✓ CI.KUnit: success " Patchwork
2026-04-02  8:50 ` ✓ Xe.CI.BAT: " Patchwork
2026-04-02 15:08 ` ✓ Xe.CI.FULL: " Patchwork

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=adY4x7iYSfs04ufg@black.igk.intel.com \
    --to=raag.jadav@intel.com \
    --cc=anshuman.gupta@intel.com \
    --cc=aravind.iddamsetty@linux.intel.com \
    --cc=badal.nilawar@intel.com \
    --cc=intel-xe@lists.freedesktop.org \
    --cc=mallesh.koujalagi@intel.com \
    --cc=ravi.kishore.koppuravuri@intel.com \
    --cc=riana.tauro@intel.com \
    --cc=rodrigo.vivi@intel.com \
    --cc=soham.purkait@intel.com \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link

Be sure your reply has a Subject: header at the top and a blank line before the message body.

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox