Intel-XE Archive on lore.kernel.org
* [RFC PATCH 0/4] Add cold reset recovery method for critical errors
From: Mallesh Koujalagi @ 2026-02-11 11:59 UTC
  To: intel-xe, dri-devel, rodrigo.vivi
  Cc: andrealmeid, christian.koenig, airlied, simona.vetter, mripard,
	anshuman.gupta, badal.nilawar, riana.tauro, karthik.poosa,
	sk.anirban, raag.jadav, Mallesh Koujalagi

This RFC series introduces a new DRM wedge recovery method,
DRM_WEDGE_RECOVERY_COLD_RESET, for critical errors that cannot be
recovered from through the existing software-based mechanisms.

Background
----------
Current recovery methods (driver rebind, bus reset, FLR) are effective
for most error scenarios. However, certain critical errors
affect device-level persistent state that survives warm resets and
software recovery attempts. These errors require complete device power
cycling to restore functionality.

Proposed Solution
-----------------
This series adds DRM_WEDGE_RECOVERY_COLD_RESET (BIT(4)) as a new
recovery method to the DRM wedging framework. When this method is set,
it signals to userspace that only a complete device cold reset (power
cycle) can restore normal operation.
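
As a rough illustration of where the new bit sits, here is a minimal
userspace sketch. Only DRM_WEDGE_RECOVERY_COLD_RESET (BIT(4)) and the
"cold-reset" string come from this series. The other four defines and
their strings are assumptions about the existing header, BIT() is
mocked for userspace, and wedge_recovery_name() is a hypothetical
helper, not the function the patches touch.

```c
#include <stddef.h>

/* Userspace stand-in for the kernel's BIT() macro. */
#define BIT(n) (1UL << (n))

/* Assumed layout of the existing recovery methods. */
#define DRM_WEDGE_RECOVERY_NONE       BIT(0)
#define DRM_WEDGE_RECOVERY_REBIND     BIT(1)
#define DRM_WEDGE_RECOVERY_BUS_RESET  BIT(2)
#define DRM_WEDGE_RECOVERY_VENDOR     BIT(3)
/* Proposed by this series. */
#define DRM_WEDGE_RECOVERY_COLD_RESET BIT(4)

/*
 * Hypothetical helper: map a single method bit to the string carried
 * in the WEDGED= uevent property. Returns NULL for unknown bits.
 */
static const char *wedge_recovery_name(unsigned long method)
{
	switch (method) {
	case DRM_WEDGE_RECOVERY_NONE:
		return "none";
	case DRM_WEDGE_RECOVERY_REBIND:
		return "rebind";
	case DRM_WEDGE_RECOVERY_BUS_RESET:
		return "bus-reset";
	case DRM_WEDGE_RECOVERY_VENDOR:
		return "vendor-specific";
	case DRM_WEDGE_RECOVERY_COLD_RESET:
		return "cold-reset";
	default:
		return NULL;
	}
}
```

Because each method is a separate bit, adding a fifth one should not
disturb the values of the existing methods.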

Example uevent received:
  SUBSYSTEM=drm
  WEDGED=cold-reset
  DEVPATH=/devices/.../drm/card0
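
A userspace consumer could check the WEDGED property roughly as
sketched below. This assumes the property value may be a
comma-separated list of method names, as with the existing methods.
wedged_has_method() is a hypothetical helper, and how the property
string is obtained (udev, netlink) is left out.

```c
#include <stdbool.h>
#include <string.h>

/*
 * Hypothetical helper: return true if the comma-separated WEDGED
 * value contains exactly the given method name. A NULL value is
 * treated as "no methods listed".
 */
static bool wedged_has_method(const char *value, const char *method)
{
	size_t len = strlen(method);
	const char *p = value;

	while (p && *p) {
		const char *end = strchr(p, ',');
		size_t tok = end ? (size_t)(end - p) : strlen(p);

		/* Compare whole tokens only, so "cold-reset-x" does not match. */
		if (tok == len && strncmp(p, method, len) == 0)
			return true;

		p = end ? end + 1 : NULL;
	}

	return false;
}
```

A consumer that sees wedged_has_method(value, "cold-reset") return
true would know that rebind or bus reset is not expected to help and
that the device needs a power cycle.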

Testing
-------
The debugfs interface allows testing the cold reset recovery path:

  echo 1 > /sys/kernel/debug/dri/N/trigger_critical_error

This triggers the critical error handler, wedges the device with the
cold reset method, and sends the appropriate uevent to userspace.
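
For a test program that wants to do the same thing as the echo above,
a minimal sketch could look like this. write_trigger() is a
hypothetical helper. The caller supplies the full debugfs path with
the card index filled in, and writing "1" mirrors the shell command
rather than any format documented elsewhere.

```c
#include <errno.h>
#include <fcntl.h>
#include <unistd.h>

/*
 * Hypothetical helper: write "1" to the given debugfs file.
 * Returns 0 on success or a negative errno value on failure.
 */
static int write_trigger(const char *path)
{
	ssize_t n;
	int err = 0;
	int fd;

	fd = open(path, O_WRONLY);
	if (fd < 0)
		return -errno;

	n = write(fd, "1\n", 2);
	if (n != 2)
		err = n < 0 ? -errno : -EIO;

	close(fd);
	return err;
}
```

Debugfs is normally accessible to root only, so a caller should
expect a permission or not-found error on systems without the right
privileges or without the patch applied.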

Cc: André Almeida <andrealmeid@igalia.com>
Cc: Christian König <christian.koenig@amd.com>
Cc: David Airlie <airlied@gmail.com>
Cc: Simona Vetter <simona.vetter@ffwll.ch>
Cc: Maxime Ripard <mripard@kernel.org>

Mallesh Koujalagi (4):
  drm: Add DRM_WEDGE_RECOVERY_COLD_RESET for critical error
  drm/doc: Document DRM_WEDGE_RECOVERY_COLD_RESET recovery method
  drm/xe: Add handler for critical errors which require cold-reset
  drm/xe/debugfs: Add interface to trigger critical error handler

 Documentation/gpu/drm-uapi.rst   | 73 +++++++++++++++++++++++++++++++-
 drivers/gpu/drm/drm_drv.c        |  2 +
 drivers/gpu/drm/xe/xe_debugfs.c  | 38 +++++++++++++++++
 drivers/gpu/drm/xe/xe_hw_error.c | 28 ++++++++++++
 drivers/gpu/drm/xe/xe_hw_error.h |  1 +
 include/drm/drm_device.h         |  1 +
 6 files changed, 142 insertions(+), 1 deletion(-)

-- 
2.34.1




Thread overview: 13+ messages
2026-02-11 11:59 [RFC PATCH 0/4] Add cold reset recovery method for critical errors Mallesh Koujalagi
2026-02-11 11:59 ` [PATCH 1/4] drm: Add DRM_WEDGE_RECOVERY_COLD_RESET for critical error Mallesh Koujalagi
2026-02-11 11:59 ` [PATCH 2/4] drm/doc: Document DRM_WEDGE_RECOVERY_COLD_RESET recovery method Mallesh Koujalagi
2026-02-11 13:29   ` Jani Nikula
2026-02-12  7:54     ` Mallesh, Koujalagi
2026-02-11 11:59 ` [PATCH 3/4] drm/xe: Add handler for critical errors which require cold-reset Mallesh Koujalagi
2026-02-11 11:59 ` [PATCH 4/4] drm/xe/debugfs: Add interface to trigger critical error handler Mallesh Koujalagi
2026-02-11 12:27 ` [RFC PATCH 0/4] Add cold reset recovery method for critical errors Christian König
2026-02-13 10:39   ` Mallesh, Koujalagi
2026-02-11 15:02 ` ✓ CI.KUnit: success for " Patchwork
2026-02-11 15:23 ` ✗ CI.checksparse: warning " Patchwork
2026-02-11 16:16 ` ✗ Xe.CI.BAT: failure " Patchwork
2026-02-12 22:30 ` ✗ Xe.CI.FULL: " Patchwork
