public inbox for intel-xe@lists.freedesktop.org
From: Aravind Iddamsetty <aravind.iddamsetty@linux.intel.com>
To: Riana Tauro <riana.tauro@intel.com>,
	"intel-xe@lists.freedesktop.org" <intel-xe@lists.freedesktop.org>
Cc: anshuman.gupta@intel.com, rodrigo.vivi@intel.com,
	badal.nilawar@intel.com, raag.jadav@intel.com,
	ravi.kishore.koppuravuri@intel.com, mallesh.koujalagi@intel.com
Subject: Re: [PATCH 5/8] drm/xe/xe_ras: Initialize Uncorrectable AER Registers
Date: Wed, 4 Feb 2026 14:08:22 +0530
Message-ID: <1b3f2913-36fa-4028-ae9d-36e19f8047e4@linux.intel.com>
In-Reply-To: <20260122100613.3631582-15-riana.tauro@intel.com>

Hi Riana,

On 22-01-2026 15:36, Riana Tauro wrote:
> Uncorrectable errors from different endpoints in the device are steered to
> the USP which is a PCI Advanced Error Reporting (AER) Compliant device.
> Downgrade all the errors to non-fatal to prevent PCIe bus driver
> from triggering a Secondary Bus Reset (SBR). This allows error
> detection, containment and recovery in the driver.
>
> The Uncorrectable Error Severity Register has the 'Uncorrectable
> Internal Error Severity' set to fatal by default. Set this to
> non-fatal and unmask the error.
>
> Signed-off-by: Riana Tauro <riana.tauro@intel.com>
> ---
>  drivers/gpu/drm/xe/Makefile    |  1 +
>  drivers/gpu/drm/xe/xe_device.c |  3 ++
>  drivers/gpu/drm/xe/xe_ras.c    | 71 ++++++++++++++++++++++++++++++++++
>  drivers/gpu/drm/xe/xe_ras.h    | 13 +++++++
>  4 files changed, 88 insertions(+)
>  create mode 100644 drivers/gpu/drm/xe/xe_ras.c
>  create mode 100644 drivers/gpu/drm/xe/xe_ras.h
>
> diff --git a/drivers/gpu/drm/xe/Makefile b/drivers/gpu/drm/xe/Makefile
> index 5581f2180b5c..85ec53eb0b62 100644
> --- a/drivers/gpu/drm/xe/Makefile
> +++ b/drivers/gpu/drm/xe/Makefile
> @@ -110,6 +110,7 @@ xe-y += xe_bb.o \
>  	xe_pxp_debugfs.o \
>  	xe_pxp_submit.o \
>  	xe_query.o \
> +	xe_ras.o \
>  	xe_range_fence.o \
>  	xe_reg_sr.o \
>  	xe_reg_whitelist.o \
> diff --git a/drivers/gpu/drm/xe/xe_device.c b/drivers/gpu/drm/xe/xe_device.c
> index f418ebf04f0f..be89ffc9eade 100644
> --- a/drivers/gpu/drm/xe/xe_device.c
> +++ b/drivers/gpu/drm/xe/xe_device.c
> @@ -59,6 +59,7 @@
>  #include "xe_psmi.h"
>  #include "xe_pxp.h"
>  #include "xe_query.h"
> +#include "xe_ras.h"
>  #include "xe_shrinker.h"
>  #include "xe_soc_remapper.h"
>  #include "xe_survivability_mode.h"
> @@ -1019,6 +1020,8 @@ int xe_device_probe(struct xe_device *xe)
>  
>  	xe_vsec_init(xe);
>  
> +	xe_ras_init(xe);
> +
>  	err = xe_sriov_init_late(xe);
>  	if (err)
>  		goto err_unregister_display;
> diff --git a/drivers/gpu/drm/xe/xe_ras.c b/drivers/gpu/drm/xe/xe_ras.c
> new file mode 100644
> index 000000000000..ba5ed37aed28
> --- /dev/null
> +++ b/drivers/gpu/drm/xe/xe_ras.c
> @@ -0,0 +1,71 @@
> +// SPDX-License-Identifier: MIT
> +/*
> + * Copyright © 2026 Intel Corporation
> + */
> +#include <linux/pci.h>
> +
> +#include "xe_device_types.h"
> +#include "xe_ras.h"
> +
> +#ifdef CONFIG_PCIEAER
> +static void unmask_and_downgrade_internal_error(struct xe_device *xe)
> +{
> +	struct pci_dev *pdev = to_pci_dev(xe->drm.dev);
> +	struct pci_dev *vsp, *usp;
> +	u32 aer_uncorr_sev, aer_uncorr_mask;
> +	u16 aer_cap;
> +
> +	/* Gfx Device Hierarchy: USP-->VSP-->SGunit */
> +	vsp = pci_upstream_bridge(pdev);
> +	if (!vsp)
> +		return;
> +
> +	usp = pci_upstream_bridge(vsp);
> +	if (!usp)
> +		return;
> +
> +	aer_cap = usp->aer_cap;
> +
> +	if (!aer_cap)
> +		return;
> +
> +	/*
> +	 * All errors are steered to USP which is a PCIe AER Compliant device.
> +	 * Downgrade all the errors to non-fatal to prevent PCIe bus driver
> +	 * from triggering a Secondary Bus Reset (SBR). This allows error
> +	 * detection, containment and recovery in the driver.
> +	 *
> +	 * The Uncorrectable Error Severity Register has the 'Uncorrectable
> +	 * Internal Error Severity' set to fatal by default. Set this to
> +	 * non-fatal and unmask the error.
> +	 */
> +

Before unmasking the PCI_ERR_UNC_INTN bit, we should clear any stale
event in the PCI_ERR_UNCOR_STATUS register; otherwise it would be
signaled as soon as we unmask the bit (assuming the bit wasn't already
unmasked).
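
To illustrate the ordering I mean, here is a minimal user-space sketch,
not driver code. The struct fake_aer and the fake_*() helpers are
hypothetical stand-ins for the USP's AER capability and the
pci_{read,write}_config_dword() calls; only the register offsets and the
PCI_ERR_UNC_INTN value are taken from pci_regs.h. The model assumes the
status register is write-1-to-clear and that a logged, unmasked bit is
what gets signaled.

```c
#include <assert.h>
#include <stdint.h>

/* Offsets and bit value as defined in include/uapi/linux/pci_regs.h */
#define PCI_ERR_UNCOR_STATUS	0x04
#define PCI_ERR_UNCOR_MASK	0x08
#define PCI_ERR_UNCOR_SEVER	0x0c
#define PCI_ERR_UNC_INTN	0x00400000

/* Hypothetical model of the USP's uncorrectable AER registers */
struct fake_aer {
	uint32_t status;
	uint32_t mask;
	uint32_t sever;
};

static uint32_t fake_read(const struct fake_aer *aer, int off)
{
	switch (off) {
	case PCI_ERR_UNCOR_STATUS: return aer->status;
	case PCI_ERR_UNCOR_MASK:   return aer->mask;
	case PCI_ERR_UNCOR_SEVER:  return aer->sever;
	}
	assert(!"unmodeled register");
	return 0;
}

static void fake_write(struct fake_aer *aer, int off, uint32_t val)
{
	switch (off) {
	case PCI_ERR_UNCOR_STATUS:
		aer->status &= ~val;	/* RW1C: writing 1 clears the bit */
		break;
	case PCI_ERR_UNCOR_MASK:
		aer->mask = val;
		break;
	case PCI_ERR_UNCOR_SEVER:
		aer->sever = val;
		break;
	default:
		assert(!"unmodeled register");
	}
}

/* Assumption: an internal error is signaled if logged and not masked */
static int fake_signaled(const struct fake_aer *aer)
{
	return (aer->status & ~aer->mask & PCI_ERR_UNC_INTN) != 0;
}

/* Downgrade to non-fatal, clear the stale status bit, then unmask */
static void downgrade_clear_unmask(struct fake_aer *aer)
{
	uint32_t val;

	val = fake_read(aer, PCI_ERR_UNCOR_SEVER);
	fake_write(aer, PCI_ERR_UNCOR_SEVER, val & ~PCI_ERR_UNC_INTN);

	/* Write only the internal error bit so other logged errors survive */
	fake_write(aer, PCI_ERR_UNCOR_STATUS, PCI_ERR_UNC_INTN);

	val = fake_read(aer, PCI_ERR_UNCOR_MASK);
	fake_write(aer, PCI_ERR_UNCOR_MASK, val & ~PCI_ERR_UNC_INTN);
}
```

In the model, skipping the status write leaves fake_signaled() true right
after the unmask whenever a stale bit was logged, which is the case I am
worried about.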

There is a pci_aer_unmask_internal_errors() helper defined in
drivers/pci/pcie/aer.c, which we could probably use by exporting it.

Also, do you think it makes more sense to move this to PCI quirks? In a
virtualized environment the Xe KMD might run in a VM (passthrough model)
while the USP stays in the host, in which case this would not work.

> +	/* Initialize Uncorrectable Error Severity Register */
> +	pci_read_config_dword(usp, aer_cap + PCI_ERR_UNCOR_SEVER, &aer_uncorr_sev);
> +	aer_uncorr_sev &= ~PCI_ERR_UNC_INTN;
> +	pci_write_config_dword(usp, aer_cap + PCI_ERR_UNCOR_SEVER, aer_uncorr_sev);
> +
> +	/* Initialize Uncorrectable Error Mask Register */
> +	pci_read_config_dword(usp, aer_cap + PCI_ERR_UNCOR_MASK, &aer_uncorr_mask);
> +	aer_uncorr_mask &= ~PCI_ERR_UNC_INTN;
> +	pci_write_config_dword(usp, aer_cap + PCI_ERR_UNCOR_MASK, aer_uncorr_mask);
> +
> +	pci_save_state(usp);
> +}
> +#endif
> +
> +/**
> + * xe_ras_init - Initialize Xe RAS
> + * @xe: xe device instance
> + *
> + * Initialize Xe RAS
> + */
> +void xe_ras_init(struct xe_device *xe)
> +{
> +	if (!xe->info.has_sysctrl)
> +		return;
> +
> +#ifdef CONFIG_PCIEAER
> +	unmask_and_downgrade_internal_error(xe);
> +#endif
> +}
> diff --git a/drivers/gpu/drm/xe/xe_ras.h b/drivers/gpu/drm/xe/xe_ras.h
> new file mode 100644
> index 000000000000..14cb973603e7
> --- /dev/null
> +++ b/drivers/gpu/drm/xe/xe_ras.h
> @@ -0,0 +1,13 @@
> +/* SPDX-License-Identifier: MIT */
> +/*
> + * Copyright © 2026 Intel Corporation
> + */
> +
> +#ifndef _XE_RAS_H_
> +#define _XE_RAS_H_
> +
> +struct xe_device;
> +
> +void xe_ras_init(struct xe_device *xe);
> +
> +#endif

Thanks,
Aravind.
