* [PATCH v3 00/10] Introduce Xe Uncorrectable Error Handling
@ 2026-04-02  7:01 Riana Tauro
From: Riana Tauro @ 2026-04-02  7:01 UTC
  To: intel-xe
  Cc: riana.tauro, anshuman.gupta, rodrigo.vivi, aravind.iddamsetty,
	badal.nilawar, raag.jadav, ravi.kishore.koppuravuri,
	mallesh.koujalagi, soham.purkait

This series adds base support for Xe uncorrectable error handling
on top of the system controller patch [1].

The first four patches implement PCI error recovery callbacks for AER events.
On fatal errors, the device is wedged in error_detected and a Secondary
Bus Reset (SBR) is requested from the PCI core by returning
PCI_ERS_RESULT_NEED_RESET.

On non-fatal errors, the mmio_enabled callback is invoked to query the
error and attempt the required recovery.
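
The fatal / non-fatal split described above can be sketched as a small
userspace model in C. This is not the code from the series: the names
channel_state, ers_result, model_device and model_error_detected are
hypothetical stand-ins that loosely mirror the kernel's
pci_channel_state_t and pci_ers_result_t. Returning a "can recover"
result for the non-fatal case is an assumption, based on how the PCI core
normally proceeds to the mmio_enabled step.

```c
#include <stdbool.h>

/* Hypothetical stand-ins for the kernel's channel states. */
enum channel_state {
	CHANNEL_IO_NORMAL,	/* non-fatal error: MMIO still works */
	CHANNEL_IO_FROZEN,	/* fatal error: I/O to the device is blocked */
};

/* Hypothetical stand-ins for the kernel's recovery results. */
enum ers_result {
	ERS_RESULT_CAN_RECOVER,	/* ask the core to continue with mmio_enabled */
	ERS_RESULT_NEED_RESET,	/* ask the core for a reset (SBR) */
};

/* Hypothetical device state: only the wedged flag matters here. */
struct model_device {
	bool wedged;
};

/*
 * Model of an error_detected callback:
 * - fatal: wedge the device and request a reset;
 * - non-fatal: leave the device alone and let a later step query the error.
 */
enum ers_result model_error_detected(struct model_device *dev,
				     enum channel_state state)
{
	if (state == CHANNEL_IO_FROZEN) {
		dev->wedged = true;
		return ERS_RESULT_NEED_RESET;
	}

	return ERS_RESULT_CAN_RECOVER;
}
```

In the real driver these decisions live in the callbacks registered through
struct pci_error_handlers; the model only captures the return-value logic.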

This series adds support for handling Uncorrectable Core compute
and SoC internal errors.

Core Compute Errors: Uncorrectable Core-Compute errors are classified
into Global and Local errors.
A Global error affects the entire device and requires a reset to
recover. When an AER is reported and error_detected is invoked, the
driver returns PCI_ERS_RESULT_NEED_RESET.
A Local error is confined to a specific component or context, such as
an engine. These errors can be contained and recovered by resetting
only the affected part without disrupting the rest of the device.
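
A minimal sketch of the Global / Local classification, again as a
userspace C model under stated assumptions. The enum and function names
(cc_error_scope, cc_recovery, cc_recovery_for) are hypothetical and do not
come from xe_ras.c; the sketch only encodes the rule from the text that a
Global error needs a device reset while a Local error needs only the
affected part reset.

```c
/* Hypothetical scope of an uncorrectable core-compute error. */
enum cc_error_scope {
	CC_ERROR_GLOBAL,	/* affects the entire device */
	CC_ERROR_LOCAL,		/* confined to one component, e.g. an engine */
};

/* Hypothetical recovery actions for core-compute errors. */
enum cc_recovery {
	CC_RECOVERY_DEVICE_RESET,	/* reset the whole device */
	CC_RECOVERY_COMPONENT_RESET,	/* reset only the affected part */
};

/* Map the error scope to the recovery action described in the text. */
enum cc_recovery cc_recovery_for(enum cc_error_scope scope)
{
	if (scope == CC_ERROR_LOCAL)
		return CC_RECOVERY_COMPONENT_RESET;

	/* Global errors cannot be contained, so the device is reset. */
	return CC_RECOVERY_DEVICE_RESET;
}
```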

SoC Internal errors: Most uncorrectable SoC internal errors are
recovered using an SBR, apart from CSC firmware and Punit errors.
CSC firmware errors require a firmware flash to recover, whereas
Punit errors require a cold reset.
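
The SoC recovery rules above amount to a lookup from error source to
recovery method. The following userspace C sketch shows that lookup; the
names soc_error_source, soc_recovery and soc_recovery_for are hypothetical
and are not the structures added by this series.

```c
/* Hypothetical sources of an uncorrectable SoC internal error. */
enum soc_error_source {
	SOC_ERROR_CSC_FIRMWARE,
	SOC_ERROR_PUNIT,
	SOC_ERROR_OTHER,	/* any other SoC internal error */
};

/* Hypothetical recovery methods named in the cover letter. */
enum soc_recovery {
	SOC_RECOVERY_SBR,		/* Secondary Bus Reset */
	SOC_RECOVERY_FIRMWARE_FLASH,	/* firmware must be reflashed */
	SOC_RECOVERY_COLD_RESET,	/* cold reset */
};

/* Pick the recovery method for a given SoC error source. */
enum soc_recovery soc_recovery_for(enum soc_error_source source)
{
	switch (source) {
	case SOC_ERROR_CSC_FIRMWARE:
		return SOC_RECOVERY_FIRMWARE_FLASH;
	case SOC_ERROR_PUNIT:
		return SOC_RECOVERY_COLD_RESET;
	default:
		/* Most SoC internal errors are recovered with an SBR. */
		return SOC_RECOVERY_SBR;
	}
}
```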

Rev2: Add support for SoC internal errors
      fix review comments

Rev3: remove in_recovery flag for disconnect error
      prevent sysctrl flooding
      use minimal logging
      simplify soc structures
      add error_count to GT structures

Riana Tauro (10):
  drm/xe/xe_survivability: Decouple survivability info from boot
    survivability
  drm/xe/xe_pci_error: Implement PCI error recovery callbacks
  drm/xe/xe_pci_error: Group all devres to release them on PCIe slot
    reset
  drm/xe: Skip device access during PCI error recovery
  drm/xe/xe_ras: Initialize Uncorrectable AER Registers
  drm/xe/xe_ras: Add structures and commands for Uncorrectable Core
    Compute Errors
  drm/xe/xe_ras: Add support for Uncorrectable Core-Compute errors
  drm/xe/xe_ras: Add structures for SoC Internal errors
  drm/xe/xe_ras: Handle Uncorrectable SoC Internal errors
  drm/xe/xe_pci_error: Process errors in mmio_enabled

 drivers/gpu/drm/xe/Makefile                   |   2 +
 drivers/gpu/drm/xe/xe_device.c                |  10 +
 drivers/gpu/drm/xe/xe_device.h                |  15 +
 drivers/gpu/drm/xe/xe_device_types.h          |   6 +
 drivers/gpu/drm/xe/xe_gt.c                    |  14 +-
 drivers/gpu/drm/xe/xe_guc_submit.c            |   9 +-
 drivers/gpu/drm/xe/xe_pci.c                   |   3 +
 drivers/gpu/drm/xe/xe_pci_error.c             | 118 +++++++
 drivers/gpu/drm/xe/xe_ras.c                   | 313 ++++++++++++++++++
 drivers/gpu/drm/xe/xe_ras.h                   |  16 +
 drivers/gpu/drm/xe/xe_ras_types.h             | 203 ++++++++++++
 drivers/gpu/drm/xe/xe_survivability_mode.c    |  12 +-
 drivers/gpu/drm/xe/xe_sysctrl_mailbox_types.h |  13 +
 13 files changed, 726 insertions(+), 8 deletions(-)
 create mode 100644 drivers/gpu/drm/xe/xe_pci_error.c
 create mode 100644 drivers/gpu/drm/xe/xe_ras.c
 create mode 100644 drivers/gpu/drm/xe/xe_ras.h
 create mode 100644 drivers/gpu/drm/xe/xe_ras_types.h

-- 
2.47.1



Thread overview: 20+ messages
2026-04-02  7:01 [PATCH v3 00/10] Introduce Xe Uncorrectable Error Handling Riana Tauro
2026-04-02  7:01 ` [PATCH v3 01/10] drm/xe/xe_survivability: Decouple survivability info from boot survivability Riana Tauro
2026-04-02  7:01 ` [PATCH v3 02/10] drm/xe/xe_pci_error: Implement PCI error recovery callbacks Riana Tauro
2026-04-07  4:50   ` Matthew Brost
2026-04-02  7:01 ` [PATCH v3 03/10] drm/xe/xe_pci_error: Group all devres to release them on PCIe slot reset Riana Tauro
2026-04-02  7:01 ` [PATCH v3 04/10] drm/xe: Skip device access during PCI error recovery Riana Tauro
2026-04-02  7:01 ` [PATCH v3 05/10] drm/xe/xe_ras: Initialize Uncorrectable AER Registers Riana Tauro
2026-04-07  5:50   ` Raag Jadav
2026-04-02  7:01 ` [PATCH v3 06/10] drm/xe/xe_ras: Add structures and commands for Uncorrectable Core Compute Errors Riana Tauro
2026-04-07  5:59   ` Raag Jadav
2026-04-02  7:01 ` [PATCH v3 07/10] drm/xe/xe_ras: Add support for Uncorrectable Core-Compute errors Riana Tauro
2026-04-08 11:15   ` Raag Jadav
2026-04-02  7:01 ` [PATCH v3 08/10] drm/xe/xe_ras: Add structures for SoC Internal errors Riana Tauro
2026-04-08 11:18   ` Raag Jadav
2026-04-02  7:01 ` [PATCH v3 09/10] drm/xe/xe_ras: Handle Uncorrectable SoC Internal errors Riana Tauro
2026-04-02  7:01 ` [PATCH v3 10/10] drm/xe/xe_pci_error: Process errors in mmio_enabled Riana Tauro
2026-04-02  8:01 ` ✗ CI.checkpatch: warning for Introduce Xe Uncorrectable Error Handling (rev3) Patchwork
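2026-04-02  8:02 ` ✓ CI.KUnit: success for Introduce Xe Uncorrectable Error Handling (rev3) Patchwork
2026-04-02  8:50 ` ✓ Xe.CI.BAT: success for Introduce Xe Uncorrectable Error Handling (rev3) Patchwork
2026-04-02 15:08 ` ✓ Xe.CI.FULL: success for Introduce Xe Uncorrectable Error Handling (rev3) Patchwork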
