public inbox for intel-xe@lists.freedesktop.org
* [PATCH v2 00/11] Introduce Xe Uncorrectable Error Handling
@ 2026-03-02 10:21 Riana Tauro
  2026-03-02 10:21 ` [PATCH v2 01/11] drm/xe/xe_sysctrl: Add System controller patch Riana Tauro
                   ` (14 more replies)
  0 siblings, 15 replies; 43+ messages in thread
From: Riana Tauro @ 2026-03-02 10:21 UTC
  To: intel-xe
  Cc: riana.tauro, anshuman.gupta, rodrigo.vivi, aravind.iddamsetty,
	badal.nilawar, raag.jadav, ravi.kishore.koppuravuri,
	mallesh.koujalagi

This series adds base support for Xe Uncorrectable Error Handling
on top of the system controller patch [1].

The first four patches implement PCI error recovery callbacks for AER events.
On fatal errors, the device is wedged in error_detected and a Secondary
Bus Reset (SBR) is requested from the PCI core by returning
PCI_ERS_RESULT_NEED_RESET.

On non-fatal errors, the mmio_enabled callback is invoked to query the
error and attempt the required recovery.
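
As a rough illustration of the fatal/non-fatal split described above, the
following userspace sketch models the decision made in error_detected. All
mock_* names are hypothetical stand-ins that only mirror the kernel's
pci_channel_state_t and pci_ers_result_t naming; this is not the code from
the series. The handling of a permanent failure is an assumption.

```c
#include <stdbool.h>

/* Hypothetical stand-ins mirroring pci_channel_state_t (illustrative only). */
enum mock_channel_state {
	MOCK_CHANNEL_IO_NORMAL,		/* non-fatal: MMIO is still usable */
	MOCK_CHANNEL_IO_FROZEN,		/* fatal: I/O to the device is blocked */
	MOCK_CHANNEL_IO_PERM_FAILURE,	/* device is permanently lost */
};

/* Hypothetical stand-ins mirroring pci_ers_result_t (illustrative only). */
enum mock_ers_result {
	MOCK_ERS_RESULT_CAN_RECOVER,	/* PCI core goes on to call mmio_enabled */
	MOCK_ERS_RESULT_NEED_RESET,	/* PCI core is asked to reset the device */
	MOCK_ERS_RESULT_DISCONNECT,	/* give up on the device */
};

struct mock_xe_device {
	bool wedged;
};

/* Models error_detected: wedge on fatal errors and ask for a reset. */
static enum mock_ers_result
mock_error_detected(struct mock_xe_device *xe, enum mock_channel_state state)
{
	switch (state) {
	case MOCK_CHANNEL_IO_FROZEN:
		xe->wedged = true;
		return MOCK_ERS_RESULT_NEED_RESET;
	case MOCK_CHANNEL_IO_PERM_FAILURE:
		/* Assumption: a permanent failure also wedges the device. */
		xe->wedged = true;
		return MOCK_ERS_RESULT_DISCONNECT;
	case MOCK_CHANNEL_IO_NORMAL:
	default:
		/* Non-fatal: defer the error query to mmio_enabled. */
		return MOCK_ERS_RESULT_CAN_RECOVER;
	}
}
```

In this model only the frozen (fatal) case requests a reset. The non-fatal
case leaves the device untouched so that the error can be queried later.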

This series adds support for handling Uncorrectable Core-Compute
and SoC internal errors.

Core-Compute errors: Uncorrectable Core-Compute errors are classified
into Global and Local errors.
A Global error affects the entire device and requires a reset. This type
of error is not isolated. When an AER is reported and error_detected is
invoked, the driver returns PCI_ERS_RESULT_NEED_RESET.
A Local error is confined to a specific component or context, such as an
engine. These errors can be contained and recovered by resetting
only the affected part without disrupting the rest of the device.
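
The Global/Local distinction can be sketched as a small dispatch routine.
This is a minimal userspace model under stated assumptions: the mock_*
types, the engine count, and the choice to escalate an out-of-range engine
id to a device reset are all hypothetical and not taken from the patches.

```c
#include <stdbool.h>

#define MOCK_NUM_ENGINES 4	/* hypothetical engine count */

enum mock_error_scope {
	MOCK_SCOPE_GLOBAL,	/* affects the entire device */
	MOCK_SCOPE_LOCAL,	/* confined to one engine */
};

enum mock_recovery {
	MOCK_RECOVERY_DEVICE_RESET,
	MOCK_RECOVERY_ENGINE_RESET,
};

struct mock_core_compute_error {
	enum mock_error_scope scope;
	unsigned int engine_id;	/* meaningful only for local errors */
};

struct mock_device {
	unsigned int engine_resets[MOCK_NUM_ENGINES];
	bool needs_device_reset;
};

/* Contain local errors to the affected engine; escalate everything else. */
static enum mock_recovery
mock_handle_core_compute_error(struct mock_device *dev,
			       const struct mock_core_compute_error *err)
{
	if (err->scope == MOCK_SCOPE_LOCAL &&
	    err->engine_id < MOCK_NUM_ENGINES) {
		dev->engine_resets[err->engine_id]++;
		return MOCK_RECOVERY_ENGINE_RESET;
	}

	/* Global error, or (assumption) a local error on an unknown engine. */
	dev->needs_device_reset = true;
	return MOCK_RECOVERY_DEVICE_RESET;
}
```

A local error on engine 2, for example, only bumps that engine's reset
count, while a global error marks the whole device as needing a reset.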

SoC Internal errors: Most uncorrectable SoC internal errors
are recovered using an SBR, apart from CSC firmware and Punit errors.
CSC firmware errors require a firmware flash to recover, whereas
Punit errors require a cold reset.
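
The recovery rules for SoC internal errors amount to a lookup from error
source to recovery method. The sketch below models only that mapping. The
enum names are hypothetical, and MOCK_SOC_OTHER stands in for the remaining
sources, which the text above says are mostly recovered with an SBR.

```c
/* Hypothetical error sources; only the two exceptions are named. */
enum mock_soc_source {
	MOCK_SOC_CSC_FIRMWARE,
	MOCK_SOC_PUNIT,
	MOCK_SOC_OTHER,
};

enum mock_soc_recovery {
	MOCK_SOC_RECOVER_SBR,		/* Secondary Bus Reset */
	MOCK_SOC_RECOVER_FW_FLASH,	/* firmware must be reflashed */
	MOCK_SOC_RECOVER_COLD_RESET,	/* full power cycle */
};

/* Map an error source to the recovery method described above. */
static enum mock_soc_recovery
mock_soc_recovery_for(enum mock_soc_source src)
{
	switch (src) {
	case MOCK_SOC_CSC_FIRMWARE:
		return MOCK_SOC_RECOVER_FW_FLASH;
	case MOCK_SOC_PUNIT:
		return MOCK_SOC_RECOVER_COLD_RESET;
	case MOCK_SOC_OTHER:
	default:
		return MOCK_SOC_RECOVER_SBR;
	}
}
```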

Rev2: Add support for SoC internal errors
      Address review comments

Anoop Vijay (1):
  drm/xe/xe_sysctrl: Add System controller patch

Riana Tauro (10):
  drm/xe/xe_survivability: Decouple survivability info from boot
    survivability
  drm/xe/xe_pci_error: Implement PCI error recovery callbacks
  drm/xe/xe_pci_error: Group all devres to release them on PCIe slot
    reset
  drm/xe: Skip device access during PCI error recovery
  drm/xe/xe_ras: Initialize Uncorrectable AER Registers
  drm/xe/xe_ras: Add structures and commands for Uncorrectable Core
    Compute Errors
  drm/xe/xe_ras: Add support for Uncorrectable Core-Compute errors
  drm/xe/xe_ras: Add structures for SoC Internal errors
  drm/xe/xe_ras: Handle Uncorrectable SoC Internal errors
  drm/xe/xe_pci_error: Process errors in mmio_enabled

 drivers/gpu/drm/xe/Makefile                   |   4 +
 drivers/gpu/drm/xe/regs/xe_sysctrl_regs.h     |  36 ++
 drivers/gpu/drm/xe/xe_device.c                |  15 +
 drivers/gpu/drm/xe/xe_device.h                |  15 +
 drivers/gpu/drm/xe/xe_device_types.h          |  12 +
 drivers/gpu/drm/xe/xe_gt.c                    |  11 +-
 drivers/gpu/drm/xe/xe_guc_submit.c            |   9 +-
 drivers/gpu/drm/xe/xe_pci.c                   |   5 +
 drivers/gpu/drm/xe/xe_pci_error.c             | 111 +++++
 drivers/gpu/drm/xe/xe_pci_types.h             |   1 +
 drivers/gpu/drm/xe/xe_ras.c                   | 325 +++++++++++++++
 drivers/gpu/drm/xe/xe_ras.h                   |  16 +
 drivers/gpu/drm/xe/xe_ras_types.h             | 282 +++++++++++++
 drivers/gpu/drm/xe/xe_survivability_mode.c    |  12 +-
 drivers/gpu/drm/xe/xe_sysctrl.c               |  80 ++++
 drivers/gpu/drm/xe/xe_sysctrl.h               |  13 +
 drivers/gpu/drm/xe/xe_sysctrl_mailbox.c       | 390 ++++++++++++++++++
 drivers/gpu/drm/xe/xe_sysctrl_mailbox.h       |  35 ++
 drivers/gpu/drm/xe/xe_sysctrl_mailbox_types.h |  55 +++
 drivers/gpu/drm/xe/xe_sysctrl_types.h         |  33 ++
 20 files changed, 1452 insertions(+), 8 deletions(-)
 create mode 100644 drivers/gpu/drm/xe/regs/xe_sysctrl_regs.h
 create mode 100644 drivers/gpu/drm/xe/xe_pci_error.c
 create mode 100644 drivers/gpu/drm/xe/xe_ras.c
 create mode 100644 drivers/gpu/drm/xe/xe_ras.h
 create mode 100644 drivers/gpu/drm/xe/xe_ras_types.h
 create mode 100644 drivers/gpu/drm/xe/xe_sysctrl.c
 create mode 100644 drivers/gpu/drm/xe/xe_sysctrl.h
 create mode 100644 drivers/gpu/drm/xe/xe_sysctrl_mailbox.c
 create mode 100644 drivers/gpu/drm/xe/xe_sysctrl_mailbox.h
 create mode 100644 drivers/gpu/drm/xe/xe_sysctrl_mailbox_types.h
 create mode 100644 drivers/gpu/drm/xe/xe_sysctrl_types.h

-- 
2.47.1



end of thread, other threads:[~2026-04-01  6:47 UTC | newest]

Thread overview: 43+ messages
2026-03-02 10:21 [PATCH v2 00/11] Introduce Xe Uncorrectable Error Handling Riana Tauro
2026-03-02 10:21 ` [PATCH v2 01/11] drm/xe/xe_sysctrl: Add System controller patch Riana Tauro
2026-03-02 10:21 ` [PATCH v2 02/11] drm/xe/xe_survivability: Decouple survivability info from boot survivability Riana Tauro
2026-03-02 17:00   ` Raag Jadav
2026-03-03  8:18     ` Mallesh, Koujalagi
2026-03-30 12:56       ` Tauro, Riana
2026-03-30 13:00     ` Tauro, Riana
2026-03-02 10:21 ` [PATCH v2 03/11] drm/xe/xe_pci_error: Implement PCI error recovery callbacks Riana Tauro
2026-03-02 17:37   ` Raag Jadav
2026-03-03  5:09     ` Riana Tauro
2026-03-04 10:38   ` Mallesh, Koujalagi
2026-03-31  5:18     ` Tauro, Riana
2026-03-02 10:21 ` [PATCH v2 04/11] drm/xe/xe_pci_error: Group all devres to release them on PCIe slot reset Riana Tauro
2026-03-02 10:22 ` [PATCH v2 05/11] drm/xe: Skip device access during PCI error recovery Riana Tauro
2026-03-04 10:59   ` Mallesh, Koujalagi
2026-03-02 10:22 ` [PATCH v2 06/11] drm/xe/xe_ras: Initialize Uncorrectable AER Registers Riana Tauro
2026-03-02 10:22 ` [PATCH v2 07/11] drm/xe/xe_ras: Add structures and commands for Uncorrectable Core Compute Errors Riana Tauro
2026-03-04 16:32   ` Raag Jadav
2026-03-31 16:14     ` Tauro, Riana
2026-04-01  6:25       ` Raag Jadav
2026-04-01  6:39         ` Tauro, Riana
2026-03-02 10:22 ` [PATCH v2 08/11] drm/xe/xe_ras: Add support for Uncorrectable Core-Compute errors Riana Tauro
2026-03-04 16:52   ` Raag Jadav
2026-03-06 18:37     ` Raag Jadav
2026-03-31 16:24     ` Tauro, Riana
2026-04-01  6:34       ` Raag Jadav
2026-04-01  6:47         ` Tauro, Riana
2026-03-06  3:50   ` [v2,08/11] " Purkait, Soham
2026-03-31 16:16     ` Tauro, Riana
2026-03-02 10:22 ` [PATCH v2 09/11] drm/xe/xe_ras: Add structures for SoC Internal errors Riana Tauro
2026-03-10 13:02   ` Mallesh, Koujalagi
2026-03-11 14:51     ` Riana Tauro
2026-03-02 10:22 ` [PATCH v2 10/11] drm/xe/xe_ras: Handle Uncorrectable " Riana Tauro
2026-03-10 13:29   ` Mallesh, Koujalagi
2026-03-11 14:55     ` Riana Tauro
2026-03-02 10:22 ` [PATCH v2 11/11] drm/xe/xe_pci_error: Process errors in mmio_enabled Riana Tauro
2026-03-11  7:10   ` Mallesh, Koujalagi
2026-03-11 14:39     ` Riana Tauro
2026-03-12  8:08       ` Mallesh, Koujalagi
2026-03-02 16:10 ` ✗ CI.checkpatch: warning for Introduce Xe Uncorrectable Error Handling (rev2) Patchwork
2026-03-02 16:11 ` ✓ CI.KUnit: success " Patchwork
2026-03-02 16:48 ` ✓ Xe.CI.BAT: " Patchwork
2026-03-02 18:29 ` ✗ Xe.CI.FULL: failure " Patchwork
