Intel-XE Archive on lore.kernel.org
* [PATCH v5 0/5] Introduce cold reset recovery method
@ 2026-05-12 13:26 Mallesh Koujalagi
  2026-05-12 13:26 ` [PATCH v5 1/5] Introduce Xe Uncorrectable Error Handling Mallesh Koujalagi
                   ` (8 more replies)
  0 siblings, 9 replies; 16+ messages in thread
From: Mallesh Koujalagi @ 2026-05-12 13:26 UTC
  To: intel-xe, dri-devel, rodrigo.vivi
  Cc: andrealmeid, christian.koenig, airlied, simona.vetter, mripard,
	maarten.lankhorst, tzimmermann, anshuman.gupta, badal.nilawar,
	riana.tauro, karthik.poosa, sk.anirban, raag.jadav,
	Mallesh Koujalagi

This series builds on top of "Introduce Xe Uncorrectable Error Handling" [1]
and adds support for handling errors that require a complete
device power cycle (cold reset) to recover.

Certain error conditions leave the device in a persistent hardware
error state that cannot be cleared through existing recovery mechanisms
such as driver reload or PCIe reset. In these cases, functionality can
only be restored by performing a cold reset.

To support this, the series introduces a new DRM wedging recovery
method, DRM_WEDGE_RECOVERY_COLD_RESET (BIT(4)). When a device is wedged
with this method, the DRM core notifies userspace via a uevent that a cold
reset is required. This allows userspace to take appropriate action to
power-cycle the device.

Example uevent received:
  SUBSYSTEM=drm
  WEDGED=cold-reset
  DEVPATH=/devices/.../drm/card0
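
For illustration, a userspace consumer could check a single uevent record
for the new method roughly as sketched below. This is a minimal sketch,
not part of the series: wedged_has_method() and record_needs_cold_reset()
are hypothetical helper names, the caller is assumed to have already split
the uevent payload into individual "KEY=value" strings, and the WEDGED
value is assumed to be a comma-separated list of method names.

```c
#include <stdbool.h>
#include <string.h>

/*
 * Hypothetical helper: return true if the comma-separated WEDGED value
 * (e.g. "cold-reset" or "rebind,bus-reset") contains exactly `method`.
 */
static bool wedged_has_method(const char *value, const char *method)
{
	size_t mlen = strlen(method);
	const char *p = value;

	while (*p) {
		/* Length of the current token, up to the next comma or NUL. */
		size_t len = strcspn(p, ",");

		if (len == mlen && strncmp(p, method, mlen) == 0)
			return true;

		p += len;
		if (*p == ',')
			p++;
	}

	return false;
}

/*
 * Hypothetical helper: return true if `record` is a WEDGED property
 * that requests a cold reset. Other properties are ignored.
 */
static bool record_needs_cold_reset(const char *record)
{
	static const char key[] = "WEDGED=";

	if (strncmp(record, key, sizeof(key) - 1) != 0)
		return false;

	return wedged_has_method(record + sizeof(key) - 1, "cold-reset");
}
```

Matching whole tokens rather than substrings keeps "bus-reset" from being
mistaken for another method that merely shares a suffix.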

Detailed description in commit message.

[1] https://patchwork.freedesktop.org/series/160482/

This patch series introduces a call to punit_error_handler() from
within handle_soc_internal_errors() when PUNIT errors are detected.
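
The call flow can be modelled in a few lines of standalone C. This is only
a sketch under stated assumptions, not the code from the patches: the two
function names and the BIT(4) value come from this cover letter, while
struct fake_device, its wedge_method field, the SOC_ERR_PUNIT bit and the
function signatures are hypothetical stand-ins chosen so the example
compiles outside the kernel.

```c
/* Userspace stand-in for the kernel's BIT() macro. */
#define BIT(n) (1UL << (n))

/* From the cover letter: the new recovery method occupies bit 4. */
#define DRM_WEDGE_RECOVERY_COLD_RESET BIT(4)

/* Hypothetical error-source bit, for illustration only. */
#define SOC_ERR_PUNIT BIT(0)

/* Hypothetical stand-in for per-device driver state. */
struct fake_device {
	unsigned long wedge_method;	/* recovery method to advertise */
};

/* Assumed behaviour: a PUNIT error selects cold-reset recovery. */
static void punit_error_handler(struct fake_device *dev)
{
	dev->wedge_method = DRM_WEDGE_RECOVERY_COLD_RESET;
}

/* Dispatch: only PUNIT errors reach punit_error_handler(). */
static void handle_soc_internal_errors(struct fake_device *dev,
				       unsigned long errors)
{
	if (errors & SOC_ERR_PUNIT)
		punit_error_handler(dev);
}
```

In the real driver the selected method would then be reported to userspace
when the device is wedged; that step is left out of this model.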

v2:
- Add use case: Handling errors from power management unit,
  which requires a complete power cycle to
  recover. (Christian)
- Use "several" instead of a specific number to avoid future updates. (Jani)

v3:
- Update any scenario that requires cold-reset. (Riana)
- Update document with generic scenario. (Riana)
- Consistent with terminology. (Raag)
- Remove already covered information.
- Use PUNIT instead of PMU. (Riana)
- Use consistent wording.
- Remove log. (Raag)

v4:
- Rename cold reset to power cycle. (Raag)
- Update doc. (Raag/Riana)
- Change commit message. (Raag)
- Make function static. (Raag)

v5:
- Make it consistent with consumer expectations. (Raag)
- Update commit message.
- Remove unbind.
- Simplify cold-reset script.
- Remove kdoc for static function.
- Remove xe_ prefix for static function.

Cc: André Almeida <andrealmeid@igalia.com>
Cc: Christian König <christian.koenig@amd.com>
Cc: David Airlie <airlied@gmail.com>
Cc: Simona Vetter <simona.vetter@ffwll.ch>
Cc: Maxime Ripard <mripard@kernel.org>
Cc: Maarten Lankhorst <maarten.lankhorst@linux.intel.com>
Cc: Thomas Zimmermann <tzimmermann@suse.de>

Mallesh Koujalagi (4):
  drm: Add DRM_WEDGE_RECOVERY_COLD_RESET recovery method
  drm/doc: Document DRM_WEDGE_RECOVERY_COLD_RESET recovery method
  drm/xe: Handle PUNIT errors by requesting cold-reset recovery
  drm/xe: Suppress Surprise Link Down on non-hotplug device

Riana Tauro (1):
  Introduce Xe Uncorrectable Error Handling

 Documentation/gpu/drm-uapi.rst                |  64 +-
 drivers/gpu/drm/drm_drv.c                     |   2 +
 drivers/gpu/drm/xe/Makefile                   |   1 +
 drivers/gpu/drm/xe/xe_device.c                |  19 +-
 drivers/gpu/drm/xe/xe_device.h                |  15 +
 drivers/gpu/drm/xe/xe_device_types.h          |   6 +
 drivers/gpu/drm/xe/xe_gt.c                    |  14 +-
 drivers/gpu/drm/xe/xe_guc_submit.c            |   9 +-
 drivers/gpu/drm/xe/xe_pci.c                   |  10 +
 drivers/gpu/drm/xe/xe_pci_error.c             | 138 +++++
 drivers/gpu/drm/xe/xe_ras.c                   | 552 ++++++++++++++++++
 drivers/gpu/drm/xe/xe_ras.h                   |   5 +-
 drivers/gpu/drm/xe/xe_ras_types.h             | 215 +++++++
 drivers/gpu/drm/xe/xe_survivability_mode.c    |  13 +-
 drivers/gpu/drm/xe/xe_sysctrl_event.c         |   2 +-
 drivers/gpu/drm/xe/xe_sysctrl_event_types.h   |   2 +-
 drivers/gpu/drm/xe/xe_sysctrl_mailbox_types.h |  11 +
 include/drm/drm_device.h                      |   1 +
 18 files changed, 1058 insertions(+), 21 deletions(-)
 create mode 100644 drivers/gpu/drm/xe/xe_pci_error.c

-- 
2.34.1



Thread overview: 16+ messages
2026-05-12 13:26 [PATCH v5 0/5] Introduce cold reset recovery method Mallesh Koujalagi
2026-05-12 13:26 ` [PATCH v5 1/5] Introduce Xe Uncorrectable Error Handling Mallesh Koujalagi
2026-05-12 13:26 ` [PATCH v5 2/5] drm: Add DRM_WEDGE_RECOVERY_COLD_RESET recovery method Mallesh Koujalagi
2026-05-14  7:59   ` Raag Jadav
2026-05-14  9:12   ` Tauro, Riana
2026-05-12 13:26 ` [PATCH v5 3/5] drm/doc: Document " Mallesh Koujalagi
2026-05-14  8:50   ` Raag Jadav
2026-05-12 13:26 ` [PATCH v5 4/5] drm/xe: Handle PUNIT errors by requesting cold-reset recovery Mallesh Koujalagi
2026-05-14  8:13   ` Raag Jadav
2026-05-12 13:26 ` [PATCH v5 5/5] drm/xe: Suppress Surprise Link Down on non-hotplug device Mallesh Koujalagi
2026-05-14  8:35   ` Raag Jadav
2026-05-14  9:36   ` Tauro, Riana
2026-05-12 20:01 ` ✗ CI.checkpatch: warning for Introduce cold reset recovery method (rev4) Patchwork
2026-05-12 20:03 ` ✓ CI.KUnit: success " Patchwork
2026-05-12 21:42 ` ✓ Xe.CI.BAT: " Patchwork
2026-05-13 12:34 ` ✗ Xe.CI.FULL: failure " Patchwork
