* [PATCH v8 00/15] Introduce Xe Uncorrectable Error Handling
@ 2026-06-08  8:47 Riana Tauro
From: Riana Tauro @ 2026-06-08  8:47 UTC
  To: intel-xe
  Cc: riana.tauro, anshuman.gupta, rodrigo.vivi, aravind.iddamsetty,
	badal.nilawar, raag.jadav, ravi.kishore.koppuravuri,
	mallesh.koujalagi, soham.purkait

This series adds support for Xe uncorrectable error handling.

The first four patches implement PCI error recovery callbacks for AER
events.
On fatal errors, the device is wedged in error_detected and a Secondary
Bus Reset (SBR) is requested from the PCI core by returning
PCI_ERS_RESULT_NEED_RESET.

On non-fatal errors, the mmio_enabled callback is invoked to query the
error and attempt the required recovery.
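
The callback flow above can be sketched in plain userspace C. This is
only an illustration: the enums are local stand-ins that mirror names
from the kernel PCI error recovery interface, and sketch_error_detected
and struct sketch_device are hypothetical, not the code added by this
series. It assumes that a fatal error arrives with the channel frozen,
that a non-fatal error arrives with the channel in the normal state, and
that returning the "can recover" result is what leads the PCI core to
invoke mmio_enabled. Other channel states are omitted.

```c
#include <assert.h>

/* Local stand-ins mirroring kernel names; not the kernel definitions. */
enum sketch_channel_state {
	SKETCH_IO_NORMAL,	/* non-fatal error: MMIO is still usable */
	SKETCH_IO_FROZEN,	/* fatal error: I/O to the device is blocked */
};

enum sketch_ers_result {
	SKETCH_ERS_CAN_RECOVER,	/* core is expected to call mmio_enabled next */
	SKETCH_ERS_NEED_RESET,	/* core is asked to reset the device (SBR) */
};

/* Hypothetical device state; the real driver tracks far more than this. */
struct sketch_device {
	int wedged;
};

/* Sketch of error_detected: wedge and ask for a reset on fatal errors. */
enum sketch_ers_result
sketch_error_detected(struct sketch_device *dev, enum sketch_channel_state state)
{
	assert(dev);

	if (state == SKETCH_IO_FROZEN) {
		dev->wedged = 1;
		return SKETCH_ERS_NEED_RESET;
	}

	/* Non-fatal: defer the error query and recovery to mmio_enabled. */
	return SKETCH_ERS_CAN_RECOVER;
}
```

In this sketch a frozen channel always wedges the device before the
reset is requested, matching the ordering described above.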

This series adds support for handling Uncorrectable core compute,
SoC internal and device memory errors.

Core Compute Errors: Uncorrectable core-compute errors are classified
into Global and Local errors.
A Global error affects the entire device and requires a reset to
recover. When an AER is reported and error_detected is invoked, the
driver returns PCI_ERS_RESULT_NEED_RESET.
A Local error is confined to a specific component or context, such as
an engine. These errors can be contained and recovered by resetting
only the affected part without disrupting the rest of the device.
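
A minimal sketch of the Global/Local split, again in userspace C. All
names here (struct sketch_cc_error, sketch_cc_needs_device_reset, the
engine_id field) are hypothetical and chosen for illustration; how the
driver actually identifies the affected component is not shown in this
cover letter.

```c
#include <assert.h>
#include <stdbool.h>

enum sketch_cc_scope {
	SKETCH_CC_LOCAL,	/* confined to one component, e.g. an engine */
	SKETCH_CC_GLOBAL,	/* affects the entire device */
};

/* Hypothetical error record; engine_id matters only for local errors. */
struct sketch_cc_error {
	enum sketch_cc_scope scope;
	int engine_id;
};

/*
 * Returns true when the whole device must be reset. For a local error,
 * returns false and stores the engine to reset in *engine_to_reset;
 * for a global error *engine_to_reset is set to -1.
 */
bool sketch_cc_needs_device_reset(const struct sketch_cc_error *err,
				  int *engine_to_reset)
{
	assert(err && engine_to_reset);

	if (err->scope == SKETCH_CC_GLOBAL) {
		*engine_to_reset = -1;
		return true;
	}

	*engine_to_reset = err->engine_id;
	return false;
}
```

A caller would translate a true return into the device-wide reset
request and a false return into a reset of only the reported engine.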

SoC Internal errors: Most uncorrectable SoC internal errors are
recovered using an SBR, apart from CSC firmware and Punit errors.
CSC firmware errors require a firmware flash to recover, whereas
Punit errors require a cold reset.
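
The mapping from SoC internal error source to recovery action can be
written as a small lookup. The enum and function names below are
hypothetical; only the three outcomes (SBR, firmware flash, cold reset)
come from the description above.

```c
#include <assert.h>

enum sketch_soc_source {
	SKETCH_SOC_CSC_FW,	/* CSC firmware error */
	SKETCH_SOC_PUNIT,	/* Punit error */
	SKETCH_SOC_OTHER,	/* any other uncorrectable SoC internal error */
};

enum sketch_soc_recovery {
	SKETCH_SOC_RECOVER_SBR,
	SKETCH_SOC_RECOVER_FW_FLASH,
	SKETCH_SOC_RECOVER_COLD_RESET,
};

/* Pick the recovery action for an SoC internal error source. */
enum sketch_soc_recovery sketch_soc_recovery_for(enum sketch_soc_source src)
{
	/* Reject values past the last enumerator. */
	assert(src <= SKETCH_SOC_OTHER);

	switch (src) {
	case SKETCH_SOC_CSC_FW:
		return SKETCH_SOC_RECOVER_FW_FLASH;
	case SKETCH_SOC_PUNIT:
		return SKETCH_SOC_RECOVER_COLD_RESET;
	default:
		/* Everything else is recovered with a Secondary Bus Reset. */
		return SKETCH_SOC_RECOVER_SBR;
	}
}
```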

Device memory errors: The recovery action for memory errors depends on
the error category. Double-bit ECC (Error Correcting Code) errors will be
handled using page offlining in a later patch. Poison and data parity
errors are only logged. The remaining errors require an SBR to recover.
Add a helper to send the offline/decline command to firmware.
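
The per-category handling of device memory errors can be sketched the
same way. The category and action names are hypothetical stand-ins; the
actual categories reported by firmware are defined in the patches, not
here.

```c
#include <assert.h>

enum sketch_mem_category {
	SKETCH_MEM_DOUBLE_BIT_ECC,
	SKETCH_MEM_POISON,
	SKETCH_MEM_DATA_PARITY,
	SKETCH_MEM_OTHER,	/* every remaining category */
};

enum sketch_mem_action {
	SKETCH_MEM_PAGE_OFFLINE,	/* offline the affected page */
	SKETCH_MEM_LOG_ONLY,		/* log the error, take no action */
	SKETCH_MEM_SBR,			/* Secondary Bus Reset */
};

/* Pick the recovery action for a device memory error category. */
enum sketch_mem_action sketch_mem_action_for(enum sketch_mem_category cat)
{
	/* Reject values past the last enumerator. */
	assert(cat <= SKETCH_MEM_OTHER);

	switch (cat) {
	case SKETCH_MEM_DOUBLE_BIT_ECC:
		return SKETCH_MEM_PAGE_OFFLINE;
	case SKETCH_MEM_POISON:
	case SKETCH_MEM_DATA_PARITY:
		return SKETCH_MEM_LOG_ONLY;
	default:
		return SKETCH_MEM_SBR;
	}
}
```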

Firmware also saves the affected pages in flash, and software needs to
offline them on module load. Add helpers to retrieve the page offline
list and queue from firmware.

Rev2: Add support for SoC internal errors
      fix review comments

Rev3: remove in_recovery flag for disconnect error
      prevent sysctrl flooding
      use minimal logging
      simplify soc structures
      add error_count to GT structures

Rev4: add device memory errors
      add helpers for memory errors
      fix cosmetic review comments

Rev5: simplify structures in all patches
      disconnect on wedged or survivability mode
      rename in_recovery to in_reset
      add minimal integration patch for device memory errors
      rename system controller flooding macro
      fix comments

Rev6: rename function to prepare_reset.
      call pci_prepare_reset() before requesting SBR
      fix cosmetic review comments
      add patch to skip run_ticks while reading fdinfo
      when SBR is in flight

Rev7: rename sysctrl build command function
      rename pci prepare reset function
      use wedged state management
      split function for usp dev

Rev8: fix kunit error
      add error handling for poison and data parity memory errors
      add request structure for page offlining list

Raag Jadav (1):
  drm/xe: Improve wedged state management

Riana Tauro (14):
  drm/xe/xe_survivability: Decouple survivability info from boot
    survivability
  drm/xe/xe_sysctrl: Make sysctrl flood limit reusable
  drm/xe/xe_pci_error: Implement PCI error recovery callbacks
  drm/xe/xe_pci_error: Group all devres to release them on PCIe slot
    reset
  drm/xe: Skip device access during PCI error recovery
  drm/xe/xe_ras: Initialize Uncorrectable AER Registers
  drm/xe/xe_ras: Add support for uncorrectable core-compute errors
  drm/xe/xe_ras: Handle uncorrectable SoC Internal errors
  drm/xe/xe_ras: Query errors from system controller on probe
  drm/xe/xe_pci_error: Process errors in mmio_enabled
  drm/xe/xe_ras: Add support to query device memory errors
  drm/xe/xe_ras: Add support to query page offline queue and list
  drm/xe/xe_ras: Add support to offline/decline a page address
  drm/xe/xe_ras: Process pages from offlined list and queue

 drivers/gpu/drm/xe/Makefile                   |   1 +
 drivers/gpu/drm/xe/xe_device.c                |  24 +-
 drivers/gpu/drm/xe/xe_device.h                |  27 +-
 drivers/gpu/drm/xe/xe_device_types.h          |  12 +-
 drivers/gpu/drm/xe/xe_gt.c                    |  14 +-
 drivers/gpu/drm/xe/xe_guc_submit.c            |   9 +-
 drivers/gpu/drm/xe/xe_pci.c                   |   9 +
 drivers/gpu/drm/xe/xe_pci_error.c             | 135 +++++
 drivers/gpu/drm/xe/xe_pci_error.h             |  13 +
 drivers/gpu/drm/xe/xe_ras.c                   | 494 ++++++++++++++++++
 drivers/gpu/drm/xe/xe_ras.h                   |   4 +
 drivers/gpu/drm/xe/xe_ras_types.h             | 227 ++++++++
 drivers/gpu/drm/xe/xe_survivability_mode.c    |  13 +-
 drivers/gpu/drm/xe/xe_sysctrl_event.c         |   2 +-
 drivers/gpu/drm/xe/xe_sysctrl_event_types.h   |   3 -
 drivers/gpu/drm/xe/xe_sysctrl_mailbox.c       |  28 +
 drivers/gpu/drm/xe/xe_sysctrl_mailbox.h       |   4 +-
 drivers/gpu/drm/xe/xe_sysctrl_mailbox_types.h |  11 +
 18 files changed, 1002 insertions(+), 28 deletions(-)
 create mode 100644 drivers/gpu/drm/xe/xe_pci_error.c
 create mode 100644 drivers/gpu/drm/xe/xe_pci_error.h

-- 
2.47.1



Thread overview: 26+ messages
2026-06-08  8:47 [PATCH v8 00/15] Introduce Xe Uncorrectable Error Handling Riana Tauro
2026-06-08  8:47 ` [PATCH v8 01/15] drm/xe/xe_survivability: Decouple survivability info from boot survivability Riana Tauro
2026-06-08  8:47 ` [PATCH v8 02/15] drm/xe/xe_sysctrl: Make sysctrl flood limit reusable Riana Tauro
2026-06-08  8:47 ` [PATCH v8 03/15] drm/xe: Improve wedged state management Riana Tauro
2026-06-08  8:47 ` [PATCH v8 04/15] drm/xe/xe_pci_error: Implement PCI error recovery callbacks Riana Tauro
2026-06-19 10:47   ` Raag Jadav
2026-06-19 11:22     ` Tauro, Riana
2026-06-08  8:47 ` [PATCH v8 05/15] drm/xe/xe_pci_error: Group all devres to release them on PCIe slot reset Riana Tauro
2026-06-08  8:47 ` [PATCH v8 06/15] drm/xe: Skip device access during PCI error recovery Riana Tauro
2026-06-08  8:47 ` [PATCH v8 07/15] drm/xe/xe_ras: Initialize Uncorrectable AER Registers Riana Tauro
2026-06-08  8:47 ` [PATCH v8 08/15] drm/xe/xe_ras: Add support for uncorrectable core-compute errors Riana Tauro
2026-06-12  1:43   ` Mallesh, Koujalagi
2026-06-08  8:47 ` [PATCH v8 09/15] drm/xe/xe_ras: Handle uncorrectable SoC Internal errors Riana Tauro
2026-06-08  8:47 ` [PATCH v8 10/15] drm/xe/xe_ras: Query errors from system controller on probe Riana Tauro
2026-06-08  8:47 ` [PATCH v8 11/15] drm/xe/xe_pci_error: Process errors in mmio_enabled Riana Tauro
2026-06-08 10:18   ` Mallesh, Koujalagi
2026-06-08  8:47 ` [PATCH v8 12/15] drm/xe/xe_ras: Add support to query device memory errors Riana Tauro
2026-06-08  8:47 ` [PATCH v8 13/15] drm/xe/xe_ras: Add support to query page offline queue and list Riana Tauro
2026-06-08  8:47 ` [RFC PATCH v8 14/15] drm/xe/xe_ras: Add support to offline and decline a page address Riana Tauro
2026-06-08  8:47 ` [RFC PATCH v8 15/15] drm/xe/xe_ras: Process pages from offlined list and queue Riana Tauro
2026-06-08 12:50 ` ✗ CI.checkpatch: warning for Introduce Xe Uncorrectable Error Handling (rev8) Patchwork
2026-06-08 12:52 ` ✓ CI.KUnit: success " Patchwork
2026-06-09  5:28 ` ✗ CI.checkpatch: warning for Introduce Xe Uncorrectable Error Handling (rev9) Patchwork
2026-06-09  5:29 ` ✓ CI.KUnit: success " Patchwork
2026-06-09  6:07 ` ✓ Xe.CI.BAT: " Patchwork
2026-06-09 14:53 ` ✗ Xe.CI.FULL: failure " Patchwork
