Re: [PATCH v5 08/14] drm/xe/xe_ras: Handle uncorrectable SoC Internal errors

All of lore.kernel.org
 help / color / mirror / Atom feed

From: "Mallesh, Koujalagi" <mallesh.koujalagi@intel.com>
To: Riana Tauro <riana.tauro@intel.com>
Cc: <anshuman.gupta@intel.com>, <rodrigo.vivi@intel.com>,
	<aravind.iddamsetty@linux.intel.com>, <badal.nilawar@intel.com>,
	<raag.jadav@intel.com>, <ravi.kishore.koppuravuri@intel.com>,
	<soham.purkait@intel.com>, <intel-xe@lists.freedesktop.org>
Subject: Re: [PATCH v5 08/14] drm/xe/xe_ras: Handle uncorrectable SoC Internal errors
Date: Fri, 15 May 2026 13:40:40 +0530	[thread overview]
Message-ID: <5a19f206-3e68-4c24-bc18-c014123a0fc2@intel.com> (raw)
In-Reply-To: <20260511172908.1122252-24-riana.tauro@intel.com>

[-- Attachment #1: Type: text/plain, Size: 6290 bytes --]

Hi Riana,

On 11-05-2026 10:59 pm, Riana Tauro wrote:
> Some critical errors such as CSC firmware and Punit are reported under SoC
> internal errors and require special handling.
>
> CSC errors are classified into hardware errors and firmware errors.
> Hardware errors can be recovered using a SBR (Secondary Bus Reset) whereas
> firmware errors are critical and require a firmware flash. On such errors,
> device is wedged and runtime survivability mode will be enabled to notify
> userspace that a firmware flash is required.
>
> PUNIT uncorrectable errors can only be recovered through a cold reset.
>
> Signed-off-by: Riana Tauro<riana.tauro@intel.com>
> ---
> v2: simplify soc structures
>      return error code for each SoC error (Mallesh)
>
> v3: squash patches (Raag)
>
> v4: re-use csc work
> ---
>   drivers/gpu/drm/xe/xe_ras.c       | 53 +++++++++++++++++++++++++++++++
>   drivers/gpu/drm/xe/xe_ras_types.h | 50 +++++++++++++++++++++++++++++
>   2 files changed, 103 insertions(+)
>
> diff --git a/drivers/gpu/drm/xe/xe_ras.c b/drivers/gpu/drm/xe/xe_ras.c
> index ecadad5857b2..0d4e2116ef61 100644
> --- a/drivers/gpu/drm/xe/xe_ras.c
> +++ b/drivers/gpu/drm/xe/xe_ras.c
> @@ -103,6 +103,56 @@ static enum xe_ras_recovery_action handle_core_compute_errors(struct xe_device *
>   	return XE_RAS_RECOVERY_ACTION_RECOVERED;
>   }
>   
> +static enum xe_ras_recovery_action handle_soc_internal_errors(struct xe_device *xe,
> +							      struct xe_ras_error_array *arr)
> +{
> +	struct xe_ras_soc_error *error_info = (struct xe_ras_soc_error *)arr->error_details;
> +	struct xe_ras_soc_error_source *source = &error_info->error_source;
> +	struct xe_ras_error_class *error_class = &arr->error_class;
> +	u8 tile_id = error_class->product.unit.tile;
> +	struct xe_tile *tile;
> +
> +	if (tile_id >= xe->info.tile_count) {
> +		xe_err(xe, "sysctrl: SOC internal error reported from invalid tile %u\n", tile_id);
> +		return XE_RAS_RECOVERY_ACTION_RESET;
> +	}
> +
> +	tile = &xe->tiles[tile_id];
> +
> +	if (source->csc) {
> +		struct xe_ras_csc_error *csc_error = (struct xe_ras_csc_error *)error_info->additional_details;
> +
> +		/*
> +		 * CSC uncorrectable errors are classified as hardware errors and firmware errors.
> +		 * CSC firmware errors are critical errors that can be recovered only by firmware
> +		 * update via SPI driver. On a CSC firmware error, PCODE enables FDO mode and sets
> +		 * the bit in the capability register. On receiving this error, the driver enables
> +		 * runtime survivability mode which notifies userspace that a firmware update
> +		 * is required.
> +		 */
> +		if (csc_error->hec_uncorr_fw_err_dw0) {
> +			xe_err(xe, "[RAS]: CSC %s detected: 0x%x\n",
> +			       sev_to_str(error_class->common.severity),
> +			       csc_error->hec_uncorr_fw_err_dw0);
> +			schedule_work(&tile->csc_hw_error_work);

Potential race condition: If the PCI device is removed (after 
disconnect) while csc_hw_error_work is still queued, the work might run 
after the device is gone, causing a crash right?

> +			return XE_RAS_RECOVERY_ACTION_DISCONNECT;
> +		}

We need to check and log hec_uncorr_err_status as well.

Thanks,

-/Mallesh

> +	} else if (source->ieh) {
> +		struct xe_ras_ieh_error *ieh_error = (struct xe_ras_ieh_error *)error_info->additional_details;
> +
> +		if (ieh_error->global_error_status & XE_RAS_SOC_IEH_PUNIT) {
> +			xe_err(xe, "[RAS]: PUNIT %s detected: 0x%x\n",
> +			       sev_to_str(error_class->common.severity),
> +			       ieh_error->global_error_status);
> +			/* TODO: Add PUNIT error handling */
> +			return XE_RAS_RECOVERY_ACTION_DISCONNECT;
> +		}
> +	}
> +
> +	/* For other SOC internal errors, request a reset as recovery mechanism */
> +	return XE_RAS_RECOVERY_ACTION_RESET;
> +}
> +
>   void xe_ras_counter_threshold_crossed(struct xe_device *xe,
>   				      struct xe_sysctrl_event_response *response)
>   {
> @@ -193,6 +243,9 @@ enum xe_ras_recovery_action xe_ras_process_errors(struct xe_device *xe)
>   			case XE_RAS_COMP_CORE_COMPUTE:
>   				action = handle_core_compute_errors(xe, arr);
>   				break;
> +			case XE_RAS_COMP_SOC_INTERNAL:
> +				action = handle_soc_internal_errors(xe, arr);
> +				break;
>   			default:
>   				/* For any other component, reset */
>   				action = XE_RAS_RECOVERY_ACTION_RESET;
> diff --git a/drivers/gpu/drm/xe/xe_ras_types.h b/drivers/gpu/drm/xe/xe_ras_types.h
> index e97026fd6ff9..c5a283317d90 100644
> --- a/drivers/gpu/drm/xe/xe_ras_types.h
> +++ b/drivers/gpu/drm/xe/xe_ras_types.h
> @@ -10,6 +10,7 @@
>   
>   #define XE_RAS_NUM_ERROR_ARR			3
>   #define XE_RAS_NUM_COUNTERS			16
> +#define XE_RAS_SOC_IEH_PUNIT			BIT(1)
>   
>   /**
>    * enum xe_ras_recovery_action - RAS recovery actions
> @@ -125,4 +126,53 @@ struct xe_ras_compute_error {
>   	u32 reserved[15];
>   } __packed;
>   
> +/**
> + * struct xe_ras_soc_error_source - Source of SoC error
> + */
> +struct xe_ras_soc_error_source {
> +	/** @csc: CSC */
> +	u32 csc:1;
> +	/** @ieh: IEH (Integrated Error Handler) */
> +	u32 ieh:1;
> +	/** @reserved: Reserved for future use */
> +	u32 reserved:30;
> +} __packed;
> +
> +/**
> + * struct xe_ras_soc_error - Error details of SoC internal error
> + */
> +struct xe_ras_soc_error {
> +	/** @error_source: Error source */
> +	struct xe_ras_soc_error_source error_source;
> +	/** @additional_details: Additional details */
> +	u32 additional_details[15];
> +} __packed;
> +
> +/**
> + * struct xe_ras_csc_error - CSC error details
> + */
> +struct xe_ras_csc_error {
> +	/** @hec_uncorr_err_status: CSC hardware error status */
> +	u32 hec_uncorr_err_status;
> +	/** @hec_uncorr_fw_err_dw0: CSC firmware error */
> +	u32 hec_uncorr_fw_err_dw0;
> +} __packed;
> +
> +/**
> + * struct xe_ras_ieh_error - SoC IEH (Integrated Error Handler) error details
> + */
> +struct xe_ras_ieh_error {
> +	/** @ieh_instance: IEH instance */
> +	u32 ieh_instance:2;
> +	/** @reserved: Reserved for future use */
> +	u32 reserved:30;
> +	/** @global_error_status: Global error status */
> +	u32 global_error_status;
> +	/** @local_error_status: Local error status */
> +	u32 local_error_status;
> +	/** @gerr_mask: Global error mask */
> +	u32 gerr_mask;
> +	/** @additional_info: Additional information */
> +	u32 additional_info[10];
> +} __packed;
>   #endif

[-- Attachment #2: Type: text/html, Size: 6975 bytes --]

next prev parent reply	other threads:[~2026-05-15  8:10 UTC|newest]

Thread overview: 26+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2026-05-11 17:29 [PATCH v5 00/14] Introduce Xe Uncorrectable Error Handling Riana Tauro
2026-05-11 17:29 ` [PATCH v5 01/14] drm/xe/xe_survivability: Decouple survivability info from boot survivability Riana Tauro
2026-05-11 17:29 ` [PATCH v5 02/14] drm/xe/xe_sysctrl: Make sysctrl flood limit reusable Riana Tauro
2026-05-14 12:51   ` Mallesh, Koujalagi
2026-05-15  7:46   ` Raag Jadav
2026-05-11 17:29 ` [PATCH v5 03/14] drm/xe/xe_pci_error: Implement PCI error recovery callbacks Riana Tauro
2026-05-14 13:15   ` Mallesh, Koujalagi
2026-05-11 17:29 ` [PATCH v5 04/14] drm/xe/xe_pci_error: Group all devres to release them on PCIe slot reset Riana Tauro
2026-05-11 17:29 ` [PATCH v5 05/14] drm/xe: Skip device access during PCI error recovery Riana Tauro
2026-05-11 17:29 ` [PATCH v5 06/14] drm/xe/xe_ras: Initialize Uncorrectable AER Registers Riana Tauro
2026-05-14 17:40   ` Raag Jadav
2026-05-15  8:30   ` Mallesh, Koujalagi
2026-05-11 17:29 ` [PATCH v5 07/14] drm/xe/xe_ras: Add support for uncorrectable core-compute errors Riana Tauro
2026-05-11 17:29 ` [PATCH v5 08/14] drm/xe/xe_ras: Handle uncorrectable SoC Internal errors Riana Tauro
2026-05-15  8:10   ` Mallesh, Koujalagi [this message]
2026-05-11 17:29 ` [PATCH v5 09/14] drm/xe/xe_ras: Add support to query device memory errors Riana Tauro
2026-05-11 17:29 ` [PATCH v5 10/14] drm/xe/xe_ras: Add support to query page offline queue and list Riana Tauro
2026-05-11 17:29 ` [PATCH v5 11/14] drm/xe/xe_ras: Query errors from system controller on probe Riana Tauro
2026-05-11 21:56   ` Umesh Nerlige Ramappa
2026-05-11 17:29 ` [PATCH v5 12/14] drm/xe/xe_pci_error: Process errors in mmio_enabled Riana Tauro
2026-05-11 17:29 ` [RFC PATCH v5 13/14] drm/xe/xe_ras: Add support to offline/decline a page address Riana Tauro
2026-05-11 17:29 ` [RFC PATCH v5 14/14] drm/xe/xe_ras: Process pages from offlined list and queue Riana Tauro
2026-05-12  1:05 ` ✗ CI.checkpatch: warning for Introduce Xe Uncorrectable Error Handling (rev5) Patchwork
2026-05-12  1:06 ` ✓ CI.KUnit: success " Patchwork
2026-05-12  2:29 ` ✓ Xe.CI.BAT: " Patchwork
2026-05-12  6:26 ` ✗ Xe.CI.FULL: failure " Patchwork

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=5a19f206-3e68-4c24-bc18-c014123a0fc2@intel.com \
    --to=mallesh.koujalagi@intel.com \
    --cc=anshuman.gupta@intel.com \
    --cc=aravind.iddamsetty@linux.intel.com \
    --cc=badal.nilawar@intel.com \
    --cc=intel-xe@lists.freedesktop.org \
    --cc=raag.jadav@intel.com \
    --cc=ravi.kishore.koppuravuri@intel.com \
    --cc=riana.tauro@intel.com \
    --cc=rodrigo.vivi@intel.com \
    --cc=soham.purkait@intel.com \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link

Be sure your reply has a Subject: header at the top and a blank line before the message body.

This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.