public inbox for intel-xe@lists.freedesktop.org
* [PATCH v2 0/5] Introduce cold reset recovery method
@ 2026-03-18  6:40 Mallesh Koujalagi
  2026-03-18  6:40 ` [PATCH v2 1/5] Introduce Xe Uncorrectable Error Handling Mallesh Koujalagi
                   ` (8 more replies)
  0 siblings, 9 replies; 25+ messages in thread
From: Mallesh Koujalagi @ 2026-03-18  6:40 UTC (permalink / raw)
  To: intel-xe, dri-devel, rodrigo.vivi
  Cc: andrealmeid, christian.koenig, airlied, simona.vetter, mripard,
	anshuman.gupta, badal.nilawar, riana.tauro, karthik.poosa,
	sk.anirban, raag.jadav, Mallesh Koujalagi

This series builds on top of "Introduce Xe Uncorrectable Error Handling" [1]
and adds support for handling power management unit (PMU) errors
that require a complete device power cycle to recover.

Certain PMU error conditions leave the device in a persistent hardware
error state that cannot be cleared through existing recovery mechanisms
such as driver reload or PCIe reset. In these cases, functionality can
only be restored by performing a cold reset (complete power cycle).

To support this, the series introduces a new DRM wedging recovery
method, DRM_WEDGE_RECOVERY_COLD_RESET (BIT(4)). When a device is wedged
with this method, the DRM core notifies userspace via a uevent that a cold
reset is required. This allows userspace to take appropriate action to
power-cycle the device.
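
As an illustration of how the new flag sits next to the existing recovery
methods, here is a minimal userspace sketch. It is not the kernel code. The
bit positions for the four pre-existing methods and their uevent strings are
assumed to match include/drm/drm_device.h and drm_drv.c; only BIT(4) and the
"cold-reset" string come from this cover letter. The WEDGE_RECOVERY_* names
and wedge_recovery_name() are hypothetical.

```c
#include <stddef.h>
#include <string.h>

/*
 * Userspace mirror of the DRM wedge recovery bits (hypothetical names).
 * Bits 0-3 are assumed to match the existing kernel definitions;
 * bit 4 is the method introduced by this series.
 */
#define WEDGE_RECOVERY_NONE       (1UL << 0)
#define WEDGE_RECOVERY_REBIND     (1UL << 1)
#define WEDGE_RECOVERY_BUS_RESET  (1UL << 2)
#define WEDGE_RECOVERY_VENDOR     (1UL << 3)
#define WEDGE_RECOVERY_COLD_RESET (1UL << 4)

/* Map a single recovery bit to the string expected in WEDGED=...;
 * returns NULL for an unknown or combined value. */
static const char *wedge_recovery_name(unsigned long method)
{
	switch (method) {
	case WEDGE_RECOVERY_NONE:
		return "none";
	case WEDGE_RECOVERY_REBIND:
		return "rebind";
	case WEDGE_RECOVERY_BUS_RESET:
		return "bus-reset";
	case WEDGE_RECOVERY_VENDOR:
		return "vendor-specific";
	case WEDGE_RECOVERY_COLD_RESET:
		return "cold-reset";
	default:
		return NULL;
	}
}
```

The helper takes one bit at a time; a caller holding a mask of several
methods would test each bit and join the resulting names with commas.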

Example uevent received:
  SUBSYSTEM=drm
  WEDGED=cold-reset
  DEVPATH=/devices/.../drm/card0
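
A userspace consumer might check the uevent along the following lines. This
is only a sketch under stated assumptions: it assumes each uevent property
arrives as a "KEY=value" string and that WEDGED may carry a comma-separated
list of methods, as the existing wedged uevent documentation describes. The
function names are hypothetical, and how the properties are received
(libudev, netlink) is left out.

```c
#include <stdbool.h>
#include <stddef.h>
#include <string.h>

/* Return true if the comma-separated list `value` contains `method`
 * as a whole token (no partial matches). The input is not modified. */
static bool wedged_has_method(const char *value, const char *method)
{
	size_t want = strlen(method);

	while (*value) {
		size_t len = strcspn(value, ",");

		if (len == want && strncmp(value, method, len) == 0)
			return true;
		value += len;
		if (*value == ',')
			value++;
	}
	return false;
}

/* Return true if `prop` is a WEDGED= property that lists cold-reset. */
static bool uevent_needs_cold_reset(const char *prop)
{
	static const char key[] = "WEDGED=";

	if (strncmp(prop, key, sizeof(key) - 1) != 0)
		return false;
	return wedged_has_method(prop + sizeof(key) - 1, "cold-reset");
}
```

When uevent_needs_cold_reset() returns true, the consumer would hand off to
whatever platform-specific mechanism can power-cycle the device; that part
is outside the scope of this sketch.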

The cold reset recovery path can be exercised through the debugfs
interface:

  echo 1 > /sys/kernel/debug/dri/N/trigger_punit_error

This triggers the PMU error handler, wedges the device using the cold
reset recovery method, and emits the corresponding uevent to userspace.

Detailed descriptions are in the individual commit messages.

[1] https://patchwork.freedesktop.org/series/160482/
This patch series adds a call to xe_punit_error_handler() from
within handle_soc_internal_errors() when PMU errors are detected.

v2:
- Add use case: Handling errors from power management unit,
  which requires a complete power cycle (cold reset)
  to recover. (Christian)
- Say "several" instead of a specific number to avoid future updates. (Jani)

Cc: André Almeida <andrealmeid@igalia.com>
Cc: Christian König <christian.koenig@amd.com>
Cc: David Airlie <airlied@gmail.com>
Cc: Simona Vetter <simona.vetter@ffwll.ch>
Cc: Maxime Ripard <mripard@kernel.org>

Mallesh Koujalagi (4):
  drm: Add DRM_WEDGE_RECOVERY_COLD_RESET for power management unit error
  drm/doc: Document DRM_WEDGE_RECOVERY_COLD_RESET recovery method
  drm/xe: Add handler for power management unit errors which require
    cold-reset
  drm/xe/debugfs: Add interface to trigger power management unit error
    handler

Riana Tauro (1):
  Introduce Xe Uncorrectable Error Handling

 Documentation/gpu/drm-uapi.rst                |  73 +++-
 drivers/gpu/drm/drm_drv.c                     |   2 +
 drivers/gpu/drm/xe/Makefile                   |   4 +
 drivers/gpu/drm/xe/regs/xe_sysctrl_regs.h     |  36 ++
 drivers/gpu/drm/xe/xe_debugfs.c               |  38 ++
 drivers/gpu/drm/xe/xe_device.c                |  15 +
 drivers/gpu/drm/xe/xe_device.h                |  15 +
 drivers/gpu/drm/xe/xe_device_types.h          |  12 +
 drivers/gpu/drm/xe/xe_gt.c                    |  11 +-
 drivers/gpu/drm/xe/xe_guc_submit.c            |   9 +-
 drivers/gpu/drm/xe/xe_hw_error.c              |  27 ++
 drivers/gpu/drm/xe/xe_hw_error.h              |   1 +
 drivers/gpu/drm/xe/xe_pci.c                   |   5 +
 drivers/gpu/drm/xe/xe_pci_error.c             | 111 +++++
 drivers/gpu/drm/xe/xe_pci_types.h             |   1 +
 drivers/gpu/drm/xe/xe_ras.c                   | 332 +++++++++++++++
 drivers/gpu/drm/xe/xe_ras.h                   |  16 +
 drivers/gpu/drm/xe/xe_ras_types.h             | 282 +++++++++++++
 drivers/gpu/drm/xe/xe_survivability_mode.c    |  12 +-
 drivers/gpu/drm/xe/xe_sysctrl.c               |  80 ++++
 drivers/gpu/drm/xe/xe_sysctrl.h               |  13 +
 drivers/gpu/drm/xe/xe_sysctrl_mailbox.c       | 390 ++++++++++++++++++
 drivers/gpu/drm/xe/xe_sysctrl_mailbox.h       |  35 ++
 drivers/gpu/drm/xe/xe_sysctrl_mailbox_types.h |  55 +++
 drivers/gpu/drm/xe/xe_sysctrl_types.h         |  33 ++
 include/drm/drm_device.h                      |   1 +
 26 files changed, 1600 insertions(+), 9 deletions(-)
 create mode 100644 drivers/gpu/drm/xe/regs/xe_sysctrl_regs.h
 create mode 100644 drivers/gpu/drm/xe/xe_pci_error.c
 create mode 100644 drivers/gpu/drm/xe/xe_ras.c
 create mode 100644 drivers/gpu/drm/xe/xe_ras.h
 create mode 100644 drivers/gpu/drm/xe/xe_ras_types.h
 create mode 100644 drivers/gpu/drm/xe/xe_sysctrl.c
 create mode 100644 drivers/gpu/drm/xe/xe_sysctrl.h
 create mode 100644 drivers/gpu/drm/xe/xe_sysctrl_mailbox.c
 create mode 100644 drivers/gpu/drm/xe/xe_sysctrl_mailbox.h
 create mode 100644 drivers/gpu/drm/xe/xe_sysctrl_mailbox_types.h
 create mode 100644 drivers/gpu/drm/xe/xe_sysctrl_types.h

-- 
2.34.1



end of thread, other threads:[~2026-04-06 12:49 UTC | newest]

Thread overview: 25+ messages
2026-03-18  6:40 [PATCH v2 0/5] Introduce cold reset recovery method Mallesh Koujalagi
2026-03-18  6:40 ` [PATCH v2 1/5] Introduce Xe Uncorrectable Error Handling Mallesh Koujalagi
2026-03-18 19:35   ` kernel test robot
2026-03-19 14:42   ` kernel test robot
2026-03-19 20:02   ` kernel test robot
2026-03-18  6:40 ` [PATCH v2 2/5] drm: Add DRM_WEDGE_RECOVERY_COLD_RESET for power management unit error Mallesh Koujalagi
2026-03-30  5:26   ` Tauro, Riana
2026-03-18  6:40 ` [PATCH v2 3/5] drm/doc: Document DRM_WEDGE_RECOVERY_COLD_RESET recovery method Mallesh Koujalagi
2026-03-30  5:00   ` Tauro, Riana
2026-03-30 14:02     ` Mallesh, Koujalagi
2026-04-02  8:16   ` Raag Jadav
2026-04-06 12:26     ` Mallesh, Koujalagi
2026-03-18  6:40 ` [PATCH v2 4/5] drm/xe: Add handler for power management unit errors which require cold-reset Mallesh Koujalagi
2026-03-30  4:54   ` Tauro, Riana
2026-03-30 13:50     ` Mallesh, Koujalagi
2026-04-02  8:19   ` Raag Jadav
2026-03-18  6:40 ` [PATCH v2 5/5] drm/xe/debugfs: Add interface to trigger power management unit error handler Mallesh Koujalagi
2026-03-30  4:55   ` Tauro, Riana
2026-03-30 13:40     ` Mallesh, Koujalagi
2026-04-02  8:31       ` Raag Jadav
2026-04-06 12:49         ` Mallesh, Koujalagi
2026-03-18  6:49 ` ✗ CI.checkpatch: warning for Introduce cold reset recovery method Patchwork
2026-03-18  6:50 ` ✓ CI.KUnit: success " Patchwork
2026-03-18  7:33 ` ✓ Xe.CI.BAT: " Patchwork
2026-03-19 20:20 ` ✓ Xe.CI.FULL: " Patchwork
