Intel-XE Archive on lore.kernel.org
* [PATCH v4 0/9] Handle Firmware reported Hardware Errors
@ 2025-07-09 11:20 Riana Tauro
From: Riana Tauro @ 2025-07-09 11:20 UTC
  To: intel-xe
  Cc: riana.tauro, anshuman.gupta, rodrigo.vivi, lucas.demarchi,
	aravind.iddamsetty, raag.jadav, umesh.nerlige.ramappa,
	frank.scarbrough, sk.anirban

Add support to handle firmware reported errors. When CSC firmware
errors are encountered, an error interrupt is received by the GFX device as
an MSI interrupt.

The Device Source control registers indicate the source of the error as CSC.
The HEC error status register indicates that the error is firmware reported.
Depending on the type of firmware error, the error cause is written to the HEC
Firmware error register.
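
The decode order described above can be sketched as a small standalone C
model. This is only an illustration: the register snapshot struct, the bit
masks, and the function name are hypothetical and are not taken from the
patches or from the Bspec pages, which define the real offsets and layouts.

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical bit positions, for illustration only. */
#define SRC_CTRL_CSC_ERROR   (1u << 0) /* assumed: error source is CSC */
#define HEC_STATUS_FW_ERROR  (1u << 0) /* assumed: error is firmware reported */

/* Hypothetical snapshot of the three registers named in the cover letter. */
struct hw_error_regs {
	uint32_t dev_src_ctrl;   /* Device Source control register */
	uint32_t hec_err_status; /* HEC error status register */
	uint32_t hec_fw_err;     /* HEC Firmware error register (error cause) */
};

/*
 * Returns true and stores the cause when the snapshot describes a CSC
 * firmware reported error; returns false otherwise and leaves *cause alone.
 */
static bool csc_fw_error_cause(const struct hw_error_regs *regs, uint32_t *cause)
{
	/* Step 1: is CSC the source of the error? */
	if (!(regs->dev_src_ctrl & SRC_CTRL_CSC_ERROR))
		return false;

	/* Step 2: is the error firmware reported? */
	if (!(regs->hec_err_status & HEC_STATUS_FW_ERROR))
		return false;

	/* Step 3: the cause is whatever the firmware wrote. */
	*cause = regs->hec_fw_err;
	return true;
}
```

A caller would fill the struct from the hardware and act on the cause only
when the function returns true.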

On encountering such CSC firmware errors, the graphics device is
wedged and the only way to recover from these errors is a firmware flash.

Add a vendor-specific recovery method to the drm device wedged uevent.
When a firmware flash is required, the device enters runtime survivability
mode and sends a drm device wedged uevent to notify userspace.
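
As a rough userspace-style model of how a set of recovery methods could map
to the WEDGED= property, consider the sketch below. The enum, its bit values,
and format_wedged() are hypothetical names made up for this example and are
not copied from the patch. The strings "none", "rebind" and "bus-reset" are
assumed to match the methods already described in drm-uapi.rst, and
"vendor-specific" is the value shown in the uevent in this cover letter.

```c
#include <stddef.h>
#include <stdio.h>

/* Hypothetical flag values for this sketch; not copied from the patch. */
enum wedge_recovery {
	WEDGE_RECOVERY_NONE      = 1u << 0,
	WEDGE_RECOVERY_REBIND    = 1u << 1,
	WEDGE_RECOVERY_BUS_RESET = 1u << 2,
	WEDGE_RECOVERY_VENDOR    = 1u << 3,
};

static const struct {
	unsigned int flag;
	const char *name;
} wedge_methods[] = {
	{ WEDGE_RECOVERY_NONE,      "none" },
	{ WEDGE_RECOVERY_REBIND,    "rebind" },
	{ WEDGE_RECOVERY_BUS_RESET, "bus-reset" },
	{ WEDGE_RECOVERY_VENDOR,    "vendor-specific" },
};

/*
 * Writes "WEDGED=<m1>[,<m2>...]" into buf. Returns 0 on success, or -1 when
 * the mask selects no known method or the buffer is too small.
 */
static int format_wedged(unsigned int mask, char *buf, size_t len)
{
	size_t used, i;
	int first = 1;
	int n = snprintf(buf, len, "WEDGED=");

	if (n < 0 || (size_t)n >= len)
		return -1;
	used = (size_t)n;

	for (i = 0; i < sizeof(wedge_methods) / sizeof(wedge_methods[0]); i++) {
		if (!(mask & wedge_methods[i].flag))
			continue;
		n = snprintf(buf + used, len - used, "%s%s",
			     first ? "" : ",", wedge_methods[i].name);
		if (n < 0 || (size_t)n >= len - used)
			return -1;
		used += (size_t)n;
		first = 0;
	}
	return first ? -1 : 0;
}
```

Keeping the flag-to-name mapping in one table means a new method only needs
one new row in this model.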

$ udevadm monitor --property --kernel
monitor will print the received events for:
KERNEL - the kernel uevent

KERNEL[754.709341] change   /devices/pci0000:00/0000:00:01.0/0000:01:00.0/0000:02:01.0/0000:03:00.0/drm/card0 (drm)
ACTION=change
DEVPATH=/devices/pci0000:00/0000:00:01.0/0000:01:00.0/0000:02:01.0/0000:03:00.0/drm/card0
SUBSYSTEM=drm
WEDGED=vendor-specific
DEVNAME=/dev/dri/card0
DEVTYPE=drm_minor
SEQNUM=5973
MAJOR=226
MINOR=0
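
A userspace consumer could pick the WEDGED value out of a property block like
the one above and check whether a given recovery method is listed. The sketch
below is a minimal, self-contained example operating on plain text. The helper
names uevent_get() and wedged_has_method() are hypothetical, and the code
assumes the value may be a comma-separated list of methods. A real listener
would more likely receive the properties through libudev than parse text.

```c
#include <stdbool.h>
#include <stddef.h>
#include <string.h>

/*
 * Looks up key in a newline-separated "KEY=VALUE" block, as printed by
 * `udevadm monitor --property`. Copies the value into out and returns true
 * on success; returns false if the key is absent or out is too small.
 */
static bool uevent_get(const char *props, const char *key, char *out, size_t len)
{
	size_t klen = strlen(key);
	const char *line = props;

	while (line && *line) {
		const char *end = strchr(line, '\n');
		size_t llen = end ? (size_t)(end - line) : strlen(line);

		if (llen > klen && strncmp(line, key, klen) == 0 &&
		    line[klen] == '=') {
			size_t vlen = llen - klen - 1;

			if (vlen >= len)
				return false;
			memcpy(out, line + klen + 1, vlen);
			out[vlen] = '\0';
			return true;
		}
		line = end ? end + 1 : NULL;
	}
	return false;
}

/* Returns true when method is one entry of the comma-separated value. */
static bool wedged_has_method(const char *value, const char *method)
{
	size_t mlen = strlen(method);
	const char *p = value;

	while (*p) {
		const char *comma = strchr(p, ',');
		size_t tlen = comma ? (size_t)(comma - p) : strlen(p);

		if (tlen == mlen && strncmp(p, method, mlen) == 0)
			return true;
		if (!comma)
			break;
		p = comma + 1;
	}
	return false;
}
```

With the event shown above, uevent_get(props, "WEDGED", ...) would yield
"vendor-specific", at which point a tool could decide to start a firmware
flash.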

Bspec: 50875, 53073, 53074, 53075, 53076

IGT: https://patchwork.freedesktop.org/patch/660122/

Rev2: add a fault injection for csc errors
      fix review comments

Rev3: add a vendor-specific recovery method
      add support for runtime survivability mode
      enable runtime survivability mode when csc errors are reported

Rev4: refactor survivability code

Riana Tauro (9):
  drm: Add a vendor-specific recovery method to device wedged uevent
  drm/xe: Set GT as wedged before sending wedged uevent
  drm/xe: Add a helper function to set recovery method
  drm/xe/xe_survivability: Refactor survivability mode
  drm/xe/xe_survivability: Add support for Runtime survivability mode
  drm/xe/doc: Document device wedged and runtime survivability
  drm/xe: Add support to handle hardware errors
  drm/xe/xe_hw_error: Handle CSC Firmware reported Hardware errors
  drm/xe/xe_hw_error: Add fault injection to trigger csc error handler

 Documentation/gpu/drm-uapi.rst                |   9 +-
 Documentation/gpu/xe/index.rst                |   1 +
 Documentation/gpu/xe/xe_device.rst            |  10 +
 Documentation/gpu/xe/xe_pcode.rst             |   6 +-
 drivers/gpu/drm/drm_drv.c                     |   2 +
 drivers/gpu/drm/xe/Makefile                   |   1 +
 drivers/gpu/drm/xe/regs/xe_gsc_regs.h         |   2 +
 drivers/gpu/drm/xe/regs/xe_hw_error_regs.h    |  20 ++
 drivers/gpu/drm/xe/regs/xe_irq_regs.h         |   1 +
 drivers/gpu/drm/xe/xe_debugfs.c               |   2 +
 drivers/gpu/drm/xe/xe_device.c                |  53 ++++-
 drivers/gpu/drm/xe/xe_device.h                |   1 +
 drivers/gpu/drm/xe/xe_device_types.h          |   5 +
 drivers/gpu/drm/xe/xe_heci_gsc.c              |   2 +-
 drivers/gpu/drm/xe/xe_hw_error.c              | 185 ++++++++++++++++++
 drivers/gpu/drm/xe/xe_hw_error.h              |  15 ++
 drivers/gpu/drm/xe/xe_irq.c                   |   4 +
 drivers/gpu/drm/xe/xe_pci.c                   |   6 +-
 drivers/gpu/drm/xe/xe_survivability_mode.c    | 164 +++++++++++++---
 drivers/gpu/drm/xe/xe_survivability_mode.h    |   5 +-
 .../gpu/drm/xe/xe_survivability_mode_types.h  |   8 +
 include/drm/drm_device.h                      |   4 +
 22 files changed, 454 insertions(+), 52 deletions(-)
 create mode 100644 Documentation/gpu/xe/xe_device.rst
 create mode 100644 drivers/gpu/drm/xe/regs/xe_hw_error_regs.h
 create mode 100644 drivers/gpu/drm/xe/xe_hw_error.c
 create mode 100644 drivers/gpu/drm/xe/xe_hw_error.h

-- 
2.47.1


Thread overview: 48+ messages
2025-07-09 11:20 [PATCH v4 0/9] Handle Firmware reported Hardware Errors Riana Tauro
2025-07-09 11:20 ` [PATCH v4 1/9] drm: Add a vendor-specific recovery method to device wedged uevent Riana Tauro
2025-07-09 13:41   ` Simona Vetter
2025-07-09 14:09     ` Christian König
2025-07-09 14:18       ` Raag Jadav
2025-07-09 16:52         ` Rodrigo Vivi
2025-07-10  9:01           ` Simona Vetter
2025-07-10  9:37             ` Christian König
2025-07-10 10:24               ` Raag Jadav
2025-07-10 19:00                 ` Rodrigo Vivi
2025-07-10 21:46                   ` Raag Jadav
2025-07-11  5:17                     ` Riana Tauro
2025-07-11  6:08                       ` Raag Jadav
2025-07-11  8:56                   ` Simona Vetter
2025-07-11  8:59               ` Simona Vetter
2025-07-14  5:27                 ` Riana Tauro
2025-07-14 12:33                   ` Simona Vetter
2025-07-09 14:46     ` Riana Tauro
2025-07-09 11:20 ` [PATCH v4 2/9] drm/xe: Set GT as wedged before sending " Riana Tauro
2025-07-09 17:26   ` Matthew Brost
2025-07-09 11:20 ` [PATCH v4 3/9] drm/xe: Add a helper function to set recovery method Riana Tauro
2025-07-09 11:20 ` [PATCH v4 4/9] drm/xe/xe_survivability: Refactor survivability mode Riana Tauro
2025-07-09 11:20 ` [PATCH v4 5/9] drm/xe/xe_survivability: Add support for Runtime " Riana Tauro
2025-07-09 23:44   ` Umesh Nerlige Ramappa
2025-07-10  5:59     ` Riana Tauro
2025-07-10 17:12       ` Umesh Nerlige Ramappa
2025-07-11  5:23         ` Riana Tauro
2025-07-09 11:20 ` [PATCH v4 6/9] drm/xe/doc: Document device wedged and runtime survivability Riana Tauro
2025-07-11  5:39   ` Raag Jadav
2025-07-11  6:09     ` Riana Tauro
2025-07-12  5:45       ` Raag Jadav
2025-07-14  9:04         ` Riana Tauro
2025-07-09 11:20 ` [PATCH v4 7/9] drm/xe: Add support to handle hardware errors Riana Tauro
2025-07-10 21:09   ` Umesh Nerlige Ramappa
2025-07-11  5:35     ` Riana Tauro
2025-07-11 17:34       ` Umesh Nerlige Ramappa
2025-07-09 11:20 ` [PATCH v4 8/9] drm/xe/xe_hw_error: Handle CSC Firmware reported Hardware errors Riana Tauro
2025-07-11  0:36   ` Umesh Nerlige Ramappa
2025-07-11  5:46     ` Riana Tauro
2025-07-11 17:38       ` Umesh Nerlige Ramappa
2025-07-09 11:20 ` [PATCH v4 9/9] drm/xe/xe_hw_error: Add fault injection to trigger csc error handler Riana Tauro
2025-07-11 17:41   ` Umesh Nerlige Ramappa
2025-07-14  7:07     ` Riana Tauro
2025-07-09 12:28 ` ✗ CI.checkpatch: warning for Handle Firmware reported Hardware Errors (rev4) Patchwork
2025-07-09 12:30 ` ✓ CI.KUnit: success " Patchwork
2025-07-09 12:44 ` ✗ CI.checksparse: warning " Patchwork
2025-07-09 13:06 ` ✓ Xe.CI.BAT: success " Patchwork
2025-07-09 15:02 ` ✗ Xe.CI.Full: failure " Patchwork
