From: "Tauro, Riana" <riana.tauro@intel.com>
To: Raag Jadav <raag.jadav@intel.com>
Cc: <intel-xe@lists.freedesktop.org>, <anshuman.gupta@intel.com>,
<rodrigo.vivi@intel.com>, <aravind.iddamsetty@linux.intel.com>,
<badal.nilawar@intel.com>, <ravi.kishore.koppuravuri@intel.com>,
<mallesh.koujalagi@intel.com>, <soham.purkait@intel.com>
Subject: Re: [PATCH v4 05/13] drm/xe/xe_ras: Initialize Uncorrectable AER Registers
Date: Tue, 5 May 2026 10:52:26 +0530
Message-ID: <5f1bc117-d6a2-4547-8fce-73cd920bf2e2@intel.com>
In-Reply-To: <ae8WrfTLjnJiSW4C@black.igk.intel.com>
On 4/27/2026 1:26 PM, Raag Jadav wrote:
> On Fri, Apr 17, 2026 at 02:28:17PM +0530, Riana Tauro wrote:
>> Uncorrectable errors from different endpoints in the device are steered to
>> the USP(Upstream Switch Port) which is a PCI Advanced Error Reporting (AER)
>> Compliant device. Downgrade all the errors to non-fatal to prevent PCIe
>> bus driver from triggering a Secondary Bus Reset (SBR). This allows error
>> detection, containment and recovery in the driver.
>>
>> The Uncorrectable Error Severity Register has the 'Uncorrectable
>> Internal Error Severity' set to fatal by default. Set this to
>> non-fatal and unmask the error.
>>
>> Signed-off-by: Riana Tauro <riana.tauro@intel.com>
>> ---
>> v2: clear stale uncorrectable internal status in status register
>> (Aravind)
>>
>> v3: Abbreviate TLAs (Raag)
>> Add an info message if USP does not support AER
>> ---
>> drivers/gpu/drm/xe/Makefile | 1 +
>> drivers/gpu/drm/xe/xe_device.c | 3 ++
>> drivers/gpu/drm/xe/xe_ras.c | 84 ++++++++++++++++++++++++++++++++++
>> drivers/gpu/drm/xe/xe_ras.h | 13 ++++++
>> 4 files changed, 101 insertions(+)
>> create mode 100644 drivers/gpu/drm/xe/xe_ras.c
>> create mode 100644 drivers/gpu/drm/xe/xe_ras.h
>>
>> diff --git a/drivers/gpu/drm/xe/Makefile b/drivers/gpu/drm/xe/Makefile
>> index 69c233d9a488..e29a4ae99ac6 100644
>> --- a/drivers/gpu/drm/xe/Makefile
>> +++ b/drivers/gpu/drm/xe/Makefile
>> @@ -113,6 +113,7 @@ xe-y += xe_bb.o \
>>  	xe_pxp_debugfs.o \
>>  	xe_pxp_submit.o \
>>  	xe_query.o \
>> +	xe_ras.o \
>>  	xe_range_fence.o \
>>  	xe_reg_sr.o \
>>  	xe_reg_whitelist.o \
>> diff --git a/drivers/gpu/drm/xe/xe_device.c b/drivers/gpu/drm/xe/xe_device.c
>> index cbdf7426e09c..c1c54836ac73 100644
>> --- a/drivers/gpu/drm/xe/xe_device.c
>> +++ b/drivers/gpu/drm/xe/xe_device.c
>> @@ -62,6 +62,7 @@
>> #include "xe_psmi.h"
>> #include "xe_pxp.h"
>> #include "xe_query.h"
>> +#include "xe_ras.h"
>> #include "xe_shrinker.h"
>> #include "xe_soc_remapper.h"
>> #include "xe_survivability_mode.h"
>> @@ -1074,6 +1075,8 @@ int xe_device_probe(struct xe_device *xe)
>>
>>  	xe_vsec_init(xe);
>>
>> +	xe_ras_init(xe);
>> +
>>  	err = xe_sriov_init_late(xe);
>>  	if (err)
>>  		goto err_unregister_display;
>> diff --git a/drivers/gpu/drm/xe/xe_ras.c b/drivers/gpu/drm/xe/xe_ras.c
>> new file mode 100644
>> index 000000000000..4f705deaeefa
>> --- /dev/null
>> +++ b/drivers/gpu/drm/xe/xe_ras.c
>> @@ -0,0 +1,84 @@
>> +// SPDX-License-Identifier: MIT
>> +/*
>> + * Copyright © 2026 Intel Corporation
>> + */
>> +
>> +#include "xe_device_types.h"
>> +#include "xe_ras.h"
>> +
>> +#ifdef CONFIG_PCIEAER
>> +static void aer_unmask_and_downgrade_internal_error(struct xe_device *xe)
>> +{
>> +	struct pci_dev *pdev = to_pci_dev(xe->drm.dev);
>> +	struct pci_dev *vsp, *usp;
>> +	u32 aer_uncorr_mask, aer_uncorr_sev, aer_uncorr_status;
>> +	u16 aer_cap;
>> +
>> +	/*
>> +	 * Device Hierarchy:
>> +	 *
>> +	 * Upstream Switch Port (USP)--> Virtual Switch Port (VSP)--> SGunit (GPU endpoint)
>> +	 */
>> +	vsp = pci_upstream_bridge(pdev);
>> +	if (!vsp)
>> +		return;
>> +
>> +	usp = pci_upstream_bridge(vsp);
>> +	if (!usp)
>> +		return;
>> +
>> +	aer_cap = usp->aer_cap;
>> +
>> +	if (!aer_cap) {
>> +		dev_info(&usp->dev, "USP doesn't support AER capability\n");
>> +		return;
>> +	}
>> +
>> +	/*
>> +	 * Clear any stale Uncorrectable Internal Error Status event in Uncorrectable Error
>> +	 * Status Register.
>> +	 */
>> +	pci_read_config_dword(usp, aer_cap + PCI_ERR_UNCOR_STATUS, &aer_uncorr_status);
>> +	if (aer_uncorr_status & PCI_ERR_UNC_INTN)
>> +		pci_write_config_dword(usp, aer_cap + PCI_ERR_UNCOR_STATUS, PCI_ERR_UNC_INTN);
>> +
>> +	/*
>> +	 * All errors are steered to USP which is a PCIe AER Compliant device.
>> +	 * Downgrade all the errors to non-fatal to prevent PCIe bus driver
>> +	 * from triggering a Secondary Bus Reset (SBR). This allows error
>> +	 * detection, containment and recovery in the driver.
>> +	 *
>> +	 * The Uncorrectable Error Severity Register has the 'Uncorrectable
>> +	 * Internal Error Severity' set to fatal by default. Set this to
>> +	 * non-fatal and unmask the error.
>> +	 */
>> +
>> +	/* Initialize Uncorrectable Error Severity Register */
>> +	pci_read_config_dword(usp, aer_cap + PCI_ERR_UNCOR_SEVER, &aer_uncorr_sev);
>> +	aer_uncorr_sev &= ~PCI_ERR_UNC_INTN;
>> +	pci_write_config_dword(usp, aer_cap + PCI_ERR_UNCOR_SEVER, aer_uncorr_sev);
>> +
>> +	/* Initialize Uncorrectable Error Mask Register */
>> +	pci_read_config_dword(usp, aer_cap + PCI_ERR_UNCOR_MASK, &aer_uncorr_mask);
>> +	aer_uncorr_mask &= ~PCI_ERR_UNC_INTN;
>> +	pci_write_config_dword(usp, aer_cap + PCI_ERR_UNCOR_MASK, aer_uncorr_mask);
>> +
>> +	pci_save_state(usp);
>> +}
>> +#endif
>> +
>> +/**
>> + * xe_ras_init - Initialize Xe RAS
>> + * @xe: xe device instance
>> + *
>> + * Initialize Xe RAS
>> + */
>> +void xe_ras_init(struct xe_device *xe)
>> +{
>> +	if (!xe->info.has_sysctrl)
>> +		return;
>> +
>> +#ifdef CONFIG_PCIEAER
>> +	aer_unmask_and_downgrade_internal_error(xe);
> If we fail silently we'd most likely be clueless why RAS isn't working.
> So either add error log here or have an explicit success log inside
> downgrade function.

We already log when AER is not supported. Handling the return values of
pci_read_config_dword()/pci_write_config_dword() is unnecessary: they fail
only when the device is disconnected or something is wrong with the device.

Yeah, I can add a debug success log.

Thanks,
Riana

>
> Raag
>
>> +#endif
>> +}
>> diff --git a/drivers/gpu/drm/xe/xe_ras.h b/drivers/gpu/drm/xe/xe_ras.h
>> new file mode 100644
>> index 000000000000..14cb973603e7
>> --- /dev/null
>> +++ b/drivers/gpu/drm/xe/xe_ras.h
>> @@ -0,0 +1,13 @@
>> +/* SPDX-License-Identifier: MIT */
>> +/*
>> + * Copyright © 2026 Intel Corporation
>> + */
>> +
>> +#ifndef _XE_RAS_H_
>> +#define _XE_RAS_H_
>> +
>> +struct xe_device;
>> +
>> +void xe_ras_init(struct xe_device *xe);
>> +
>> +#endif
>> --
>> 2.47.1
>>
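
For reference, the register sequence and the success log discussed above can
be sketched in userspace. This is only an illustration under stated
assumptions, not the driver code: struct mock_pci_dev, cfg_read() and
cfg_write() are hypothetical stand-ins for struct pci_dev and the
pci_read/write_config_dword() helpers, the bool return value is an assumption
added so a caller could tell the two outcomes apart, and the wording of the
success message is made up. The register offsets and the PCI_ERR_UNC_INTN bit
mirror include/uapi/linux/pci_regs.h.

```c
#include <stdbool.h>
#include <stdint.h>
#include <stdio.h>
#include <string.h>

/* Values mirror include/uapi/linux/pci_regs.h */
#define PCI_ERR_UNCOR_STATUS	0x04	/* Uncorrectable Error Status */
#define PCI_ERR_UNCOR_MASK	0x08	/* Uncorrectable Error Mask */
#define PCI_ERR_UNCOR_SEVER	0x0c	/* Uncorrectable Error Severity */
#define PCI_ERR_UNC_INTN	0x00400000	/* Uncorrectable Internal Error */

/* Hypothetical stand-in for struct pci_dev: a small mocked config space. */
struct mock_pci_dev {
	uint16_t aer_cap;	/* 0 means no AER capability */
	uint8_t cfg[0x200];
};

static uint32_t cfg_read(const struct mock_pci_dev *dev, uint16_t off)
{
	uint32_t val;

	memcpy(&val, &dev->cfg[off], sizeof(val));
	return val;
}

static void cfg_write(struct mock_pci_dev *dev, uint16_t off, uint32_t val)
{
	/* The AER status register is write-1-to-clear. */
	if (dev->aer_cap && off == dev->aer_cap + PCI_ERR_UNCOR_STATUS)
		val = cfg_read(dev, off) & ~val;

	memcpy(&dev->cfg[off], &val, sizeof(val));
}

/* Returns true if the registers were programmed, false if AER is missing. */
static bool aer_unmask_and_downgrade_internal_error(struct mock_pci_dev *usp)
{
	uint16_t aer_cap = usp->aer_cap;
	uint32_t val;

	if (!aer_cap) {
		fprintf(stderr, "USP doesn't support AER capability\n");
		return false;
	}

	/* Clear a stale Uncorrectable Internal Error status bit, if set. */
	if (cfg_read(usp, aer_cap + PCI_ERR_UNCOR_STATUS) & PCI_ERR_UNC_INTN)
		cfg_write(usp, aer_cap + PCI_ERR_UNCOR_STATUS, PCI_ERR_UNC_INTN);

	/* Severity bit cleared: the internal error is reported as non-fatal. */
	val = cfg_read(usp, aer_cap + PCI_ERR_UNCOR_SEVER);
	cfg_write(usp, aer_cap + PCI_ERR_UNCOR_SEVER, val & ~PCI_ERR_UNC_INTN);

	/* Mask bit cleared: the internal error is unmasked. */
	val = cfg_read(usp, aer_cap + PCI_ERR_UNCOR_MASK);
	cfg_write(usp, aer_cap + PCI_ERR_UNCOR_MASK, val & ~PCI_ERR_UNC_INTN);

	/* Explicit success log, so a silent early return is distinguishable. */
	fprintf(stderr, "USP internal error unmasked and set to non-fatal\n");
	return true;
}
```

In the driver itself the equivalent message would presumably go through a
debug-level log on the USP device rather than stderr; that choice is left to
the next revision.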