public inbox for intel-xe@lists.freedesktop.org
 help / color / mirror / Atom feed
* [PATCH v4 00/13] Introduce Xe Uncorrectable Error Handling
@ 2026-04-17  8:58 Riana Tauro
  2026-04-17  8:58 ` [PATCH v4 01/13] drm/xe/xe_survivability: Decouple survivability info from boot survivability Riana Tauro
                   ` (16 more replies)
  0 siblings, 17 replies; 27+ messages in thread
From: Riana Tauro @ 2026-04-17  8:58 UTC (permalink / raw)
  To: intel-xe
  Cc: riana.tauro, anshuman.gupta, rodrigo.vivi, aravind.iddamsetty,
	badal.nilawar, raag.jadav, ravi.kishore.koppuravuri,
	mallesh.koujalagi, soham.purkait

This series adds support for XE Uncorrectable Error Handling

The first four patches implement PCI error recovery callbacks for AER events.
On fatal errors, the device is wedged in error_detected and a Secondary
Bus reset (SBR) is requested from PCI core by returning
PCI_ERS_RESULT_NEED_RESET.

On non-fatal errors, the mmio_enabled callback is invoked to query the
error and attempt the required recovery.

This series adds support for handling Uncorrectable core compute,
SoC internal and device memory errors.

Core Compute Errors: Uncorrectable Core-Compute errors are classified
into Global and Local errors.
Global error is an error that affects the entire device requiring
a reset to recover. When an AER is reported and error_detected is invoked
return PCI_ERS_RESULT_NEED_RESET.
A Local error is confined to a specific component or context like a
engine. These errors can be contained and recovered by resetting
only the affected part without distrupting the rest of the device.

SoC Internal errors: Most of the uncorrectable SoC internal errors
are recovered using a SBR apart from CSC firmware and Punit errors.
CSC firmware errors requires a firmware flash to be recovered whereas
Punit error requires cold-reset.

Device memory errors: Most of the uncorrectable memory errors are
recovered using a SBR. However, Double-bit ECC error requires page
offlining for containment. Add helpers to instruct firmware to offline
page. Also add helpers to get pages offlined by firmware on module load.
The page offlining patch will be integrated after the patch
https://patchwork.freedesktop.org/series/161473/ is merged.

Rev2: Add support for SoC internal errors
      fix review comments

Rev3: remove in_recovery flag for disconnect error
      prevent sysctrl flooding
      use minimal logging
      simplify soc structures
      add error_count to GT structures

Rev4: add device memory errors
      add helpers for memory errors
      fix cosmetic review comments

Riana Tauro (13):
  drm/xe/xe_survivability: Decouple survivability info from boot
    survivability
  drm/xe/xe_pci_error: Implement PCI error recovery callbacks
  drm/xe/xe_pci_error: Group all devres to release them on PCIe slot
    reset
  drm/xe: Skip device access during PCI error recovery
  drm/xe/xe_ras: Initialize Uncorrectable AER Registers
  drm/xe/xe_ras: Add basic structures and commands for uncorrectable
    errors
  drm/xe/xe_ras: Add support for uncorrectable core-compute errors
  drm/xe/xe_ras: Handle uncorrectable SoC Internal errors
  drm/xe/xe_ras: Handle uncorrectable device memory errors
  drm/xe/xe_ras: Add support to offline/decline a page
  drm/xe/xe_ras: Add support for page offline list and queue commands
  drm/xe/xe_ras: Query errors from system controller on probe
  drm/xe/xe_pci_error: Process errors in mmio_enabled

 drivers/gpu/drm/xe/Makefile                   |   2 +
 drivers/gpu/drm/xe/xe_device.c                |  26 +-
 drivers/gpu/drm/xe/xe_device.h                |  15 +
 drivers/gpu/drm/xe/xe_device_types.h          |   6 +
 drivers/gpu/drm/xe/xe_gt.c                    |  14 +-
 drivers/gpu/drm/xe/xe_guc_submit.c            |   9 +-
 drivers/gpu/drm/xe/xe_pci.c                   |   3 +
 drivers/gpu/drm/xe/xe_pci_error.c             | 121 +++++
 drivers/gpu/drm/xe/xe_ras.c                   | 489 ++++++++++++++++++
 drivers/gpu/drm/xe/xe_ras.h                   |  18 +
 drivers/gpu/drm/xe/xe_ras_types.h             | 339 ++++++++++++
 drivers/gpu/drm/xe/xe_survivability_mode.c    |  13 +-
 drivers/gpu/drm/xe/xe_sysctrl_mailbox_types.h |  19 +
 13 files changed, 1057 insertions(+), 17 deletions(-)
 create mode 100644 drivers/gpu/drm/xe/xe_pci_error.c
 create mode 100644 drivers/gpu/drm/xe/xe_ras.c
 create mode 100644 drivers/gpu/drm/xe/xe_ras.h
 create mode 100644 drivers/gpu/drm/xe/xe_ras_types.h

-- 
2.47.1


^ permalink raw reply	[flat|nested] 27+ messages in thread

end of thread, other threads:[~2026-04-21  9:10 UTC | newest]

Thread overview: 27+ messages (download: mbox.gz follow: Atom feed
-- links below jump to the message on this page --
2026-04-17  8:58 [PATCH v4 00/13] Introduce Xe Uncorrectable Error Handling Riana Tauro
2026-04-17  8:58 ` [PATCH v4 01/13] drm/xe/xe_survivability: Decouple survivability info from boot survivability Riana Tauro
2026-04-17  8:58 ` [PATCH v4 02/13] drm/xe/xe_pci_error: Implement PCI error recovery callbacks Riana Tauro
2026-04-17  8:58 ` [PATCH v4 03/13] drm/xe/xe_pci_error: Group all devres to release them on PCIe slot reset Riana Tauro
2026-04-17  8:58 ` [PATCH v4 04/13] drm/xe: Skip device access during PCI error recovery Riana Tauro
2026-04-17  8:58 ` [PATCH v4 05/13] drm/xe/xe_ras: Initialize Uncorrectable AER Registers Riana Tauro
2026-04-17  8:58 ` [PATCH v4 06/13] drm/xe/xe_ras: Add basic structures and commands for uncorrectable errors Riana Tauro
2026-04-17 17:38   ` Matt Roper
2026-04-17 21:25     ` Jadav, Raag
2026-04-17 21:32       ` Matt Roper
2026-04-20  5:34         ` Tauro, Riana
2026-04-20  7:49           ` Raag Jadav
2026-04-17  8:58 ` [PATCH v4 07/13] drm/xe/xe_ras: Add support for uncorrectable core-compute errors Riana Tauro
2026-04-17  8:58 ` [PATCH v4 08/13] drm/xe/xe_ras: Handle uncorrectable SoC Internal errors Riana Tauro
2026-04-17  8:58 ` [PATCH v4 09/13] drm/xe/xe_ras: Handle uncorrectable device memory errors Riana Tauro
2026-04-21  6:08   ` Upadhyay, Tejas
2026-04-17  8:58 ` [PATCH v4 10/13] drm/xe/xe_ras: Add support to offline/decline a page Riana Tauro
2026-04-21  6:21   ` Upadhyay, Tejas
2026-04-17  8:58 ` [PATCH v4 11/13] drm/xe/xe_ras: Add support for page offline list and queue commands Riana Tauro
2026-04-21  6:19   ` Upadhyay, Tejas
2026-04-21  9:10   ` Upadhyay, Tejas
2026-04-17  8:58 ` [PATCH v4 12/13] drm/xe/xe_ras: Query errors from system controller on probe Riana Tauro
2026-04-17  8:58 ` [PATCH v4 13/13] drm/xe/xe_pci_error: Process errors in mmio_enabled Riana Tauro
2026-04-20 13:33 ` ✗ CI.checkpatch: warning for Introduce Xe Uncorrectable Error Handling (rev4) Patchwork
2026-04-20 13:35 ` ✓ CI.KUnit: success " Patchwork
2026-04-20 14:42 ` ✓ Xe.CI.BAT: " Patchwork
2026-04-20 17:14 ` ✗ Xe.CI.FULL: failure " Patchwork

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox