Intel-XE Archive on lore.kernel.org
* [PATCH v5 00/14] Introduce Xe Uncorrectable Error Handling
@ 2026-05-11 17:29 Riana Tauro
From: Riana Tauro @ 2026-05-11 17:29 UTC (permalink / raw)
  To: intel-xe
  Cc: riana.tauro, anshuman.gupta, rodrigo.vivi, aravind.iddamsetty,
	badal.nilawar, raag.jadav, ravi.kishore.koppuravuri,
	mallesh.koujalagi, soham.purkait

This series adds support for Xe uncorrectable error handling.

The first four patches implement PCI error recovery callbacks for AER events.
On fatal errors, the device is wedged in error_detected and a Secondary
Bus Reset (SBR) is requested from PCI core by returning
PCI_ERS_RESULT_NEED_RESET.

On non-fatal errors, the mmio_enabled callback is invoked to query the
error and attempt the required recovery.
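
As a rough illustration of the flow described above, here is a minimal
userspace sketch of the decision made in error_detected. It is not code
from this series: the sketch_* types are hypothetical stand-ins for the
kernel's pci_channel_state_t and pci_ers_result_t, struct sketch_dev is
invented for the example, and the order of the checks (disconnect when the
device is already wedged or in survivability mode, as the Rev5 notes
suggest) is an assumption.

```c
#include <stdbool.h>

/* Hypothetical stand-ins for pci_channel_state_t values. */
enum sketch_channel_state {
	SKETCH_IO_NORMAL,       /* non-fatal error, MMIO still usable */
	SKETCH_IO_FROZEN,       /* fatal error, I/O to the device is blocked */
	SKETCH_IO_PERM_FAILURE, /* device is considered lost */
};

/* Hypothetical stand-ins for pci_ers_result_t values. */
enum sketch_ers_result {
	SKETCH_ERS_CAN_RECOVER, /* ask PCI core to call mmio_enabled */
	SKETCH_ERS_NEED_RESET,  /* ask PCI core for a reset (SBR) */
	SKETCH_ERS_DISCONNECT,  /* give up on the device */
};

/* Invented device state, just enough for the example. */
struct sketch_dev {
	bool wedged;
	bool survivability_mode;
};

enum sketch_ers_result sketch_error_detected(struct sketch_dev *dev,
					     enum sketch_channel_state state)
{
	/* Assumption: nothing to recover if the device is already unusable. */
	if (state == SKETCH_IO_PERM_FAILURE || dev->wedged ||
	    dev->survivability_mode)
		return SKETCH_ERS_DISCONNECT;

	/* Fatal error: wedge the device and request a reset. */
	if (state == SKETCH_IO_FROZEN) {
		dev->wedged = true;
		return SKETCH_ERS_NEED_RESET;
	}

	/* Non-fatal error: let mmio_enabled query the error and recover. */
	return SKETCH_ERS_CAN_RECOVER;
}
```

In the real PCI error recovery flow, returning PCI_ERS_RESULT_CAN_RECOVER
from error_detected is what makes PCI core invoke mmio_enabled next.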

This series adds support for handling uncorrectable core-compute,
SoC internal, and device memory errors.

Core Compute errors: Uncorrectable core-compute errors are classified
into Global and Local errors.
A Global error affects the entire device and requires a reset to
recover. When an AER is reported and error_detected is invoked, the driver
returns PCI_ERS_RESULT_NEED_RESET.
A Local error is confined to a specific component or context, such as an
engine. These errors can be contained and recovered by resetting
only the affected part without disrupting the rest of the device.

SoC Internal errors: Most uncorrectable SoC internal errors
are recovered using an SBR, apart from CSC firmware and Punit errors.
CSC firmware errors require a firmware flash to recover, whereas
Punit errors require a cold reset.

Device memory errors: Most uncorrectable memory errors are
recovered using an SBR. However, double-bit ECC errors require page
offlining in both software (done in a later patch) and firmware.
Add a helper to send the offline/decline command to firmware.
These pages are also saved by firmware in flash and need to be offlined
by software on module load. Add helpers to retrieve the list and queue
from firmware.
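
The recovery rules described for the three error classes can be summarized
as a small lookup. This is only a userspace sketch under stated
assumptions: the enum names are hypothetical and do not come from
xe_ras_types.h, and the SKETCH_*_OTHER cases simplify "most errors" to
"all remaining errors".

```c
/* Hypothetical error kinds, one per case named in this cover letter. */
enum sketch_error {
	SKETCH_CORE_COMPUTE_GLOBAL,
	SKETCH_CORE_COMPUTE_LOCAL,
	SKETCH_SOC_CSC_FIRMWARE,
	SKETCH_SOC_PUNIT,
	SKETCH_SOC_OTHER,
	SKETCH_MEM_DOUBLE_BIT_ECC,
	SKETCH_MEM_OTHER,
};

/* Hypothetical recovery actions. */
enum sketch_action {
	SKETCH_ACTION_SBR,            /* Secondary Bus Reset via PCI core */
	SKETCH_ACTION_LOCAL_RESET,    /* reset only the affected part */
	SKETCH_ACTION_FIRMWARE_FLASH, /* firmware must be reflashed */
	SKETCH_ACTION_COLD_RESET,     /* cold reset required */
	SKETCH_ACTION_PAGE_OFFLINE,   /* offline the page in s/w and firmware */
};

enum sketch_action sketch_recovery_action(enum sketch_error err)
{
	switch (err) {
	case SKETCH_CORE_COMPUTE_LOCAL:
		return SKETCH_ACTION_LOCAL_RESET;
	case SKETCH_SOC_CSC_FIRMWARE:
		return SKETCH_ACTION_FIRMWARE_FLASH;
	case SKETCH_SOC_PUNIT:
		return SKETCH_ACTION_COLD_RESET;
	case SKETCH_MEM_DOUBLE_BIT_ECC:
		return SKETCH_ACTION_PAGE_OFFLINE;
	case SKETCH_CORE_COMPUTE_GLOBAL:
	case SKETCH_SOC_OTHER:
	case SKETCH_MEM_OTHER:
		return SKETCH_ACTION_SBR;
	}

	/* Unknown kinds fall back to a full reset (assumption). */
	return SKETCH_ACTION_SBR;
}
```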

Rev2: Add support for SoC internal errors
      fix review comments

Rev3: remove in_recovery flag for disconnect error
      prevent sysctrl flooding
      use minimal logging
      simplify soc structures
      add error_count to GT structures

Rev4: add device memory errors
      add helpers for memory errors
      fix cosmetic review comments

Rev5: simplify structures in all patches
      disconnect on wedged or survivability mode
      rename in_recovery to in_reset
      add minimal integration patch for device memory errors
      rename system controller flooding macro
      fix comments

Riana Tauro (14):
  drm/xe/xe_survivability: Decouple survivability info from boot
    survivability
  drm/xe/xe_sysctrl: Make sysctrl flood limit reusable
  drm/xe/xe_pci_error: Implement PCI error recovery callbacks
  drm/xe/xe_pci_error: Group all devres to release them on PCIe slot
    reset
  drm/xe: Skip device access during PCI error recovery
  drm/xe/xe_ras: Initialize Uncorrectable AER Registers
  drm/xe/xe_ras: Add support for uncorrectable core-compute errors
  drm/xe/xe_ras: Handle uncorrectable SoC Internal errors
  drm/xe/xe_ras: Add initial device memory error processing
  drm/xe/xe_ras: Add support to query page offline queue and list
  drm/xe/xe_ras: Query errors from system controller on probe
  drm/xe/xe_pci_error: Process errors in mmio_enabled
  drm/xe/xe_ras: Add support to offline/decline a page address
  drm/xe/xe_ras: Process pages from offlined list and queue

 drivers/gpu/drm/xe/Makefile                   |   1 +
 drivers/gpu/drm/xe/xe_device.c                |  19 +-
 drivers/gpu/drm/xe/xe_device.h                |  15 +
 drivers/gpu/drm/xe/xe_device_types.h          |   6 +
 drivers/gpu/drm/xe/xe_gt.c                    |  14 +-
 drivers/gpu/drm/xe/xe_guc_submit.c            |   9 +-
 drivers/gpu/drm/xe/xe_pci.c                   |  10 +
 drivers/gpu/drm/xe/xe_pci_error.c             | 138 +++++
 drivers/gpu/drm/xe/xe_ras.c                   | 488 ++++++++++++++++++
 drivers/gpu/drm/xe/xe_ras.h                   |   5 +-
 drivers/gpu/drm/xe/xe_ras_types.h             | 215 ++++++++
 drivers/gpu/drm/xe/xe_survivability_mode.c    |  13 +-
 drivers/gpu/drm/xe/xe_sysctrl_event.c         |   2 +-
 drivers/gpu/drm/xe/xe_sysctrl_event_types.h   |   2 +-
 drivers/gpu/drm/xe/xe_sysctrl_mailbox_types.h |  11 +
 15 files changed, 928 insertions(+), 20 deletions(-)
 create mode 100644 drivers/gpu/drm/xe/xe_pci_error.c

-- 
2.47.1




Thread overview: 23+ messages
2026-05-11 17:29 [PATCH v5 00/14] Introduce Xe Uncorrectable Error Handling Riana Tauro
2026-05-11 17:29 ` [PATCH v5 01/14] drm/xe/xe_survivability: Decouple survivability info from boot survivability Riana Tauro
2026-05-11 17:29 ` [PATCH v5 02/14] drm/xe/xe_sysctrl: Make sysctrl flood limit reusable Riana Tauro
2026-05-14 12:51   ` Mallesh, Koujalagi
2026-05-11 17:29 ` [PATCH v5 03/14] drm/xe/xe_pci_error: Implement PCI error recovery callbacks Riana Tauro
2026-05-14 13:15   ` Mallesh, Koujalagi
2026-05-11 17:29 ` [PATCH v5 04/14] drm/xe/xe_pci_error: Group all devres to release them on PCIe slot reset Riana Tauro
2026-05-11 17:29 ` [PATCH v5 05/14] drm/xe: Skip device access during PCI error recovery Riana Tauro
2026-05-11 17:29 ` [PATCH v5 06/14] drm/xe/xe_ras: Initialize Uncorrectable AER Registers Riana Tauro
2026-05-14 17:40   ` Raag Jadav
2026-05-11 17:29 ` [PATCH v5 07/14] drm/xe/xe_ras: Add support for uncorrectable core-compute errors Riana Tauro
2026-05-11 17:29 ` [PATCH v5 08/14] drm/xe/xe_ras: Handle uncorrectable SoC Internal errors Riana Tauro
2026-05-11 17:29 ` [PATCH v5 09/14] drm/xe/xe_ras: Add support to query device memory errors Riana Tauro
2026-05-11 17:29 ` [PATCH v5 10/14] drm/xe/xe_ras: Add support to query page offline queue and list Riana Tauro
2026-05-11 17:29 ` [PATCH v5 11/14] drm/xe/xe_ras: Query errors from system controller on probe Riana Tauro
2026-05-11 21:56   ` Umesh Nerlige Ramappa
2026-05-11 17:29 ` [PATCH v5 12/14] drm/xe/xe_pci_error: Process errors in mmio_enabled Riana Tauro
2026-05-11 17:29 ` [RFC PATCH v5 13/14] drm/xe/xe_ras: Add support to offline/decline a page address Riana Tauro
2026-05-11 17:29 ` [RFC PATCH v5 14/14] drm/xe/xe_ras: Process pages from offlined list and queue Riana Tauro
2026-05-12  1:05 ` ✗ CI.checkpatch: warning for Introduce Xe Uncorrectable Error Handling (rev5) Patchwork
2026-05-12  1:06 ` ✓ CI.KUnit: success " Patchwork
2026-05-12  2:29 ` ✓ Xe.CI.BAT: " Patchwork
2026-05-12  6:26 ` ✗ Xe.CI.FULL: failure " Patchwork
