From: Raag Jadav <raag.jadav@intel.com>
To: Riana Tauro <riana.tauro@intel.com>
Cc: intel-xe@lists.freedesktop.org, anshuman.gupta@intel.com,
rodrigo.vivi@intel.com, aravind.iddamsetty@linux.intel.com,
badal.nilawar@intel.com, ravi.kishore.koppuravuri@intel.com,
mallesh.koujalagi@intel.com, soham.purkait@intel.com
Subject: Re: [PATCH v5 06/14] drm/xe/xe_ras: Initialize Uncorrectable AER Registers
Date: Thu, 14 May 2026 19:40:30 +0200
Message-ID: <agYJDqkslpMvjprk@black.igk.intel.com>
In-Reply-To: <20260511172908.1122252-22-riana.tauro@intel.com>

On Mon, May 11, 2026 at 10:59:13PM +0530, Riana Tauro wrote:
> Uncorrectable errors from different endpoints in the device are steered to
> the USP(Upstream Switch Port) which is a PCI Advanced Error Reporting (AER)
> Compliant device. Downgrade all the errors to non-fatal to prevent PCIe
> bus driver from triggering a Secondary Bus Reset (SBR). This allows error
> detection, containment and recovery in the driver.
>
> The Uncorrectable Error Severity Register has the 'Uncorrectable
> Internal Error Severity' set to fatal by default. Set this to
> non-fatal and unmask the error.
>
> Signed-off-by: Riana Tauro <riana.tauro@intel.com>
> ---
> v2: clear stale uncorrectable internal status in status register
> (Aravind)
>
> v3: abbrevate TLA's (Raag)
> add a info message if USP does not support AER
>
> v4: add a success log (Raag)
> ---
> drivers/gpu/drm/xe/xe_device.c | 3 ++
> drivers/gpu/drm/xe/xe_ras.c | 78 ++++++++++++++++++++++++++++++++++
> drivers/gpu/drm/xe/xe_ras.h | 2 +-
> 3 files changed, 82 insertions(+), 1 deletion(-)
>
> diff --git a/drivers/gpu/drm/xe/xe_device.c b/drivers/gpu/drm/xe/xe_device.c
> index 4b45b617a039..200d6bbb1b70 100644
> --- a/drivers/gpu/drm/xe/xe_device.c
> +++ b/drivers/gpu/drm/xe/xe_device.c
> @@ -62,6 +62,7 @@
> #include "xe_psmi.h"
> #include "xe_pxp.h"
> #include "xe_query.h"
> +#include "xe_ras.h"
> #include "xe_shrinker.h"
> #include "xe_soc_remapper.h"
> #include "xe_survivability_mode.h"
> @@ -1048,6 +1049,8 @@ int xe_device_probe(struct xe_device *xe)
> if (err)
> goto err_unregister_display;
>
> + xe_ras_init(xe);
> +
> err = xe_device_sysfs_init(xe);
> if (err)
> goto err_unregister_display;
> diff --git a/drivers/gpu/drm/xe/xe_ras.c b/drivers/gpu/drm/xe/xe_ras.c
> index 4cb16b419b0c..24642c309967 100644
> --- a/drivers/gpu/drm/xe/xe_ras.c
> +++ b/drivers/gpu/drm/xe/xe_ras.c
> @@ -91,3 +91,81 @@ void xe_ras_counter_threshold_crossed(struct xe_device *xe,
> comp_to_str(component), sev_to_str(severity));
> }
> }
> +
> +#ifdef CONFIG_PCIEAER

I think all the PCI stuff should be part of xe_pci_error.c but I'll leave
it to you all.

Raag

> +static void aer_unmask_and_downgrade_internal_error(struct xe_device *xe)
> +{
> + struct pci_dev *pdev = to_pci_dev(xe->drm.dev);
> + struct pci_dev *vsp, *usp;
> + u32 aer_uncorr_mask, aer_uncorr_sev, aer_uncorr_status;
> + u16 aer_cap;
> +
> + /*
> + * Device Hierarchy:
> + *
> + * Upstream Switch Port (USP)--> Virtual Switch Port (VSP)--> SGunit (GPU endpoint)
> + */
> + vsp = pci_upstream_bridge(pdev);
> + if (!vsp)
> + return;
> +
> + usp = pci_upstream_bridge(vsp);
> + if (!usp)
> + return;
> +
> + aer_cap = usp->aer_cap;
> +
> + if (!aer_cap) {
> + dev_info(&usp->dev, "USP doesn't support AER capability\n");
> + return;
> + }
> +
> + /*
> + * Clear any stale Uncorrectable Internal Error Status event in Uncorrectable Error
> + * Status Register.
> + */
> + pci_read_config_dword(usp, aer_cap + PCI_ERR_UNCOR_STATUS, &aer_uncorr_status);
> + if (aer_uncorr_status & PCI_ERR_UNC_INTN)
> + pci_write_config_dword(usp, aer_cap + PCI_ERR_UNCOR_STATUS, PCI_ERR_UNC_INTN);
> +
> + /*
> + * All errors are steered to USP which is a PCIe AER Compliant device.
> + * Downgrade all the errors to non-fatal to prevent PCIe bus driver
> + * from triggering a Secondary Bus Reset (SBR). This allows error
> + * detection, containment and recovery in the driver.
> + *
> + * The Uncorrectable Error Severity Register has the 'Uncorrectable
> + * Internal Error Severity' set to fatal by default. Set this to
> + * non-fatal and unmask the error.
> + */
> +
> + /* Initialize Uncorrectable Error Severity Register */
> + pci_read_config_dword(usp, aer_cap + PCI_ERR_UNCOR_SEVER, &aer_uncorr_sev);
> + aer_uncorr_sev &= ~PCI_ERR_UNC_INTN;
> + pci_write_config_dword(usp, aer_cap + PCI_ERR_UNCOR_SEVER, aer_uncorr_sev);
> +
> + /* Initialize Uncorrectable Error Mask Register */
> + pci_read_config_dword(usp, aer_cap + PCI_ERR_UNCOR_MASK, &aer_uncorr_mask);
> + aer_uncorr_mask &= ~PCI_ERR_UNC_INTN;
> + pci_write_config_dword(usp, aer_cap + PCI_ERR_UNCOR_MASK, aer_uncorr_mask);
> +
> + pci_save_state(usp);
> + dev_dbg(&usp->dev, "Uncorrectable Internal Errors downgraded and unmasked\n");
> +}
> +#endif
> +
> +/**
> + * xe_ras_init - Initialize Xe RAS
> + * @xe: xe device instance
> + *
> + * Initialize Xe RAS
> + */
> +void xe_ras_init(struct xe_device *xe)
> +{
> + if (!xe->info.has_sysctrl || IS_SRIOV_VF(xe))
> + return;
> +
> +#ifdef CONFIG_PCIEAER
> + aer_unmask_and_downgrade_internal_error(xe);
> +#endif
> +}
> diff --git a/drivers/gpu/drm/xe/xe_ras.h b/drivers/gpu/drm/xe/xe_ras.h
> index ea90593b62dc..a88ea0a46766 100644
> --- a/drivers/gpu/drm/xe/xe_ras.h
> +++ b/drivers/gpu/drm/xe/xe_ras.h
> @@ -11,5 +11,5 @@ struct xe_sysctrl_event_response;
>
> void xe_ras_counter_threshold_crossed(struct xe_device *xe,
> struct xe_sysctrl_event_response *response);
> -
> +void xe_ras_init(struct xe_device *xe);
> #endif
> --
> 2.47.1
>
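
For readers following the register sequence in the quoted function, here is a
minimal user-space sketch of the same three steps (clear stale status,
downgrade severity, unmask). It is not part of the patch: struct aer_model and
the model_* helpers are hypothetical stand-ins for the USP's AER capability and
pci_read_config_dword()/pci_write_config_dword(). The register offsets and the
PCI_ERR_UNC_INTN bit are copied from include/uapi/linux/pci_regs.h, and the
status register is modeled as write-1-to-clear, which is why the patch writes
only PCI_ERR_UNC_INTN back to it.

```c
#include <assert.h>
#include <stdint.h>

/* Values as defined in include/uapi/linux/pci_regs.h */
#define PCI_ERR_UNCOR_STATUS	0x04		/* Uncorrectable Error Status */
#define PCI_ERR_UNCOR_MASK	0x08		/* Uncorrectable Error Mask */
#define PCI_ERR_UNCOR_SEVER	0x0c		/* Uncorrectable Error Severity */
#define PCI_ERR_UNC_INTN	0x00400000u	/* Uncorrectable Internal Error */

/* Hypothetical model of the three AER registers the patch touches */
struct aer_model {
	uint32_t status;	/* write-1-to-clear */
	uint32_t mask;		/* 1 = error masked */
	uint32_t sever;		/* 1 = fatal, 0 = non-fatal */
};

/* Stand-in for pci_read_config_dword(usp, aer_cap + off, &val) */
static uint32_t model_read(const struct aer_model *m, int off)
{
	switch (off) {
	case PCI_ERR_UNCOR_STATUS:
		return m->status;
	case PCI_ERR_UNCOR_MASK:
		return m->mask;
	case PCI_ERR_UNCOR_SEVER:
		return m->sever;
	default:
		return 0;
	}
}

/* Stand-in for pci_write_config_dword(usp, aer_cap + off, val) */
static void model_write(struct aer_model *m, int off, uint32_t val)
{
	switch (off) {
	case PCI_ERR_UNCOR_STATUS:
		m->status &= ~val;	/* writing 1 clears the bit */
		break;
	case PCI_ERR_UNCOR_MASK:
		m->mask = val;
		break;
	case PCI_ERR_UNCOR_SEVER:
		m->sever = val;
		break;
	default:
		break;
	}
}

/* Same order of operations as the function in the quoted patch */
static void model_unmask_and_downgrade(struct aer_model *m)
{
	uint32_t val;

	/* Clear a stale internal error event, leaving other bits alone */
	val = model_read(m, PCI_ERR_UNCOR_STATUS);
	if (val & PCI_ERR_UNC_INTN)
		model_write(m, PCI_ERR_UNCOR_STATUS, PCI_ERR_UNC_INTN);

	/* Downgrade internal errors from fatal to non-fatal */
	val = model_read(m, PCI_ERR_UNCOR_SEVER);
	val &= ~PCI_ERR_UNC_INTN;
	model_write(m, PCI_ERR_UNCOR_SEVER, val);

	/* Unmask internal errors so they are reported */
	val = model_read(m, PCI_ERR_UNCOR_MASK);
	val &= ~PCI_ERR_UNC_INTN;
	model_write(m, PCI_ERR_UNCOR_MASK, val);
}
```

The sketch only touches bit 22 in each register, so unrelated error bits keep
whatever state firmware or the PCI core left them in.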

Thread overview: 26+ messages
2026-05-11 17:29 [PATCH v5 00/14] Introduce Xe Uncorrectable Error Handling Riana Tauro
2026-05-11 17:29 ` [PATCH v5 01/14] drm/xe/xe_survivability: Decouple survivability info from boot survivability Riana Tauro
2026-05-11 17:29 ` [PATCH v5 02/14] drm/xe/xe_sysctrl: Make sysctrl flood limit reusable Riana Tauro
2026-05-14 12:51 ` Mallesh, Koujalagi
2026-05-15 7:46 ` Raag Jadav
2026-05-11 17:29 ` [PATCH v5 03/14] drm/xe/xe_pci_error: Implement PCI error recovery callbacks Riana Tauro
2026-05-14 13:15 ` Mallesh, Koujalagi
2026-05-11 17:29 ` [PATCH v5 04/14] drm/xe/xe_pci_error: Group all devres to release them on PCIe slot reset Riana Tauro
2026-05-11 17:29 ` [PATCH v5 05/14] drm/xe: Skip device access during PCI error recovery Riana Tauro
2026-05-11 17:29 ` [PATCH v5 06/14] drm/xe/xe_ras: Initialize Uncorrectable AER Registers Riana Tauro
2026-05-14 17:40 ` Raag Jadav [this message]
2026-05-15 8:30 ` Mallesh, Koujalagi
2026-05-11 17:29 ` [PATCH v5 07/14] drm/xe/xe_ras: Add support for uncorrectable core-compute errors Riana Tauro
2026-05-11 17:29 ` [PATCH v5 08/14] drm/xe/xe_ras: Handle uncorrectable SoC Internal errors Riana Tauro
2026-05-15 8:10 ` Mallesh, Koujalagi
2026-05-11 17:29 ` [PATCH v5 09/14] drm/xe/xe_ras: Add support to query device memory errors Riana Tauro
2026-05-11 17:29 ` [PATCH v5 10/14] drm/xe/xe_ras: Add support to query page offline queue and list Riana Tauro
2026-05-11 17:29 ` [PATCH v5 11/14] drm/xe/xe_ras: Query errors from system controller on probe Riana Tauro
2026-05-11 21:56 ` Umesh Nerlige Ramappa
2026-05-11 17:29 ` [PATCH v5 12/14] drm/xe/xe_pci_error: Process errors in mmio_enabled Riana Tauro
2026-05-11 17:29 ` [RFC PATCH v5 13/14] drm/xe/xe_ras: Add support to offline/decline a page address Riana Tauro
2026-05-11 17:29 ` [RFC PATCH v5 14/14] drm/xe/xe_ras: Process pages from offlined list and queue Riana Tauro
2026-05-12 1:05 ` ✗ CI.checkpatch: warning for Introduce Xe Uncorrectable Error Handling (rev5) Patchwork
2026-05-12 1:06 ` ✓ CI.KUnit: success " Patchwork
2026-05-12 2:29 ` ✓ Xe.CI.BAT: " Patchwork
2026-05-12 6:26 ` ✗ Xe.CI.FULL: failure " Patchwork