* [PATCH v8 0/6] Introduce cold reset recovery method
@ 2026-06-12  8:07 Mallesh Koujalagi
From: Mallesh Koujalagi @ 2026-06-12  8:07 UTC
  To: intel-xe, dri-devel, rodrigo.vivi
  Cc: andrealmeid, christian.koenig, airlied, simona.vetter, mripard,
	maarten.lankhorst, tzimmermann, anshuman.gupta, badal.nilawar,
	riana.tauro, karthik.poosa, sk.anirban, raag.jadav,
	Mallesh Koujalagi

This series builds on top of "Introduce Xe Uncorrectable Error Handling" [1]
and adds support for handling errors that require a complete
device power cycle (cold reset) to recover.

Certain error conditions leave the device in a persistent hardware
error state that cannot be cleared through existing recovery mechanisms
such as driver reload or PCIe reset. In these cases, functionality can
only be restored by performing a cold reset.

To support this, the series introduces a new DRM wedging recovery
method, DRM_WEDGE_RECOVERY_COLD_RESET (BIT(4)). When a device is wedged
with this method, the DRM core notifies userspace via a uevent that a cold
reset is required. This allows userspace to take appropriate action to
power-cycle the device.
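
As a rough illustration of how a recovery-method bitmask could map to the
comma-separated WEDGED= value, here is a minimal Python sketch. The names for
bits 0-3 are assumed from the existing DRM wedging documentation, and
"cold-reset" for bit 4 is taken from the example uevent in this cover letter.
The helper wedged_value() is hypothetical and is not part of the series.

```python
# Sketch only: models the bitmask-to-string mapping, not the kernel code itself.
# Bits 0-3 are assumed from the existing DRM wedging documentation;
# bit 4 is the method proposed by this series.
WEDGE_RECOVERY_NAMES = {
    0: "none",
    1: "rebind",
    2: "bus-reset",
    3: "vendor-specific",
    4: "cold-reset",
}


def wedged_value(method_mask: int) -> str:
    """Return the comma-separated recovery names for the bits set in method_mask."""
    names = [
        name
        for bit, name in sorted(WEDGE_RECOVERY_NAMES.items())
        if method_mask & (1 << bit)
    ]
    return ",".join(names)


if __name__ == "__main__":
    print(wedged_value(1 << 4))
    print(wedged_value((1 << 1) | (1 << 2)))
```

Unknown bits are ignored in this sketch; a real consumer may prefer to treat
them as an error.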

Example uevent received:
  SUBSYSTEM=drm
  WEDGED=cold-reset
  DEVPATH=/devices/.../drm/card0
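
A userspace consumer could recognise such an event roughly as follows. This is
a minimal Python sketch, assuming the event arrives as the usual NUL-separated
KEY=VALUE payload of a kernel uevent. The function names parse_uevent() and
needs_cold_reset() are hypothetical, and the actual power-cycle action is
platform specific and deliberately left out.

```python
def parse_uevent(payload: bytes) -> dict:
    """Split a NUL-separated uevent payload into a KEY -> VALUE dict.

    Items without '=' (such as the leading "action@devpath" header)
    are skipped.
    """
    fields = {}
    for item in payload.split(b"\0"):
        key, sep, value = item.decode("utf-8", "replace").partition("=")
        if sep:
            fields[key] = value
    return fields


def needs_cold_reset(fields: dict) -> bool:
    """True if this is a DRM wedged event that lists cold-reset as a method."""
    if fields.get("SUBSYSTEM") != "drm":
        return False
    # WEDGED may carry several methods separated by commas.
    return "cold-reset" in fields.get("WEDGED", "").split(",")


if __name__ == "__main__":
    # DEVPATH is kept elided exactly as in the example above.
    sample = (
        b"change@/devices/.../drm/card0\0"
        b"SUBSYSTEM=drm\0"
        b"WEDGED=cold-reset\0"
        b"DEVPATH=/devices/.../drm/card0\0"
    )
    event = parse_uevent(sample)
    if needs_cold_reset(event):
        print("cold reset required for", event["DEVPATH"])
```

Matching on the split list rather than on a substring avoids false positives
if another method name ever contains "cold-reset" as a fragment.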

See the individual commit messages for a detailed description.

[1] https://patchwork.freedesktop.org/series/160482/

This patch series introduces a call to punit_error_handler() from
within handle_soc_internal_errors() when PUNIT errors are detected.

v2:
- Add use case: Handling errors from power management unit,
  which requires a complete power cycle to
  recover. (Christian)
- Use "several" instead of a specific number to avoid future updates. (Jani)

v3:
- Update any scenario that requires cold-reset. (Riana)
- Update document with generic scenario. (Riana)
- Consistent with terminology. (Raag)
- Remove already covered information.
- Use PUNIT instead of PMU. (Riana)
- Use consistent wording.
- Remove log. (Raag)

v4:
- Rename cold reset to power cycle. (Raag)
- Update doc. (Raag/Riana)
- Change commit message. (Raag)
- Make function static. (Raag)

v5:
- Make it consistent with consumer expectations. (Raag)
- Update commit message.
- Remove unbind.
- Simplify cold-reset script.
- Remove kdoc for static function.
- Remove xe_ prefix for static function.

v6:
- Drop "last resort" wording. (Riana)
- Look up the hotplug slot in DEVPATH instead of scanning
  every PCI slot on the system. (Raag)
- Drop arbitrary sleep values from the example script.
- Expand commit message to explain why SUR_DN is masked. (Raag/Riana)
- Check Slot Implemented bit before reading Slot Capabilities, per
  PCIe spec. (Riana)
- Add debug log.

v7:
- Update recovery script. (Raag)
- Handle surprise link down event properly. (Aravind/Riana)
- Update commit message. (Riana)
- Correct log message.

v8:
- Add rescan instead of reset. (Raag)
- Use find_usp_dev() in punit_error_handler() function.

Cc: André Almeida <andrealmeid@igalia.com>
Cc: Christian König <christian.koenig@amd.com>
Cc: David Airlie <airlied@gmail.com>
Cc: Simona Vetter <simona.vetter@ffwll.ch>
Cc: Maxime Ripard <mripard@kernel.org>
Cc: Maarten Lankhorst <maarten.lankhorst@linux.intel.com>
Cc: Thomas Zimmermann <tzimmermann@suse.de>

Mallesh Koujalagi (5):
  drm: Add DRM_WEDGE_RECOVERY_COLD_RESET recovery method
  drm/doc: Document DRM_WEDGE_RECOVERY_COLD_RESET recovery method
  drm/xe: Handle PUNIT errors by requesting cold-reset recovery
  drm/xe: Suppress Surprise Link Down on device
  drm/xe/ras: Add debugfs entry to inject punit error

Riana Tauro (1):
  Introduce Xe Uncorrectable Error Handling

 Documentation/gpu/drm-uapi.rst                |  85 ++-
 drivers/gpu/drm/drm_drv.c                     |   2 +
 drivers/gpu/drm/xe/Makefile                   |   1 +
 drivers/gpu/drm/xe/xe_debugfs.c               |   3 +
 drivers/gpu/drm/xe/xe_device.c                |  24 +-
 drivers/gpu/drm/xe/xe_device.h                |  27 +-
 drivers/gpu/drm/xe/xe_device_types.h          |  12 +-
 drivers/gpu/drm/xe/xe_gt.c                    |  14 +-
 drivers/gpu/drm/xe/xe_guc_submit.c            |   9 +-
 drivers/gpu/drm/xe/xe_pci.c                   |   9 +
 drivers/gpu/drm/xe/xe_pci_error.c             | 135 +++++
 drivers/gpu/drm/xe/xe_pci_error.h             |  13 +
 drivers/gpu/drm/xe/xe_ras.c                   | 570 ++++++++++++++++++
 drivers/gpu/drm/xe/xe_ras.h                   |  11 +
 drivers/gpu/drm/xe/xe_ras_types.h             | 227 +++++++
 drivers/gpu/drm/xe/xe_survivability_mode.c    |  13 +-
 drivers/gpu/drm/xe/xe_sysctrl_event.c         |   2 +-
 drivers/gpu/drm/xe/xe_sysctrl_event_types.h   |   3 -
 drivers/gpu/drm/xe/xe_sysctrl_mailbox.c       |  28 +
 drivers/gpu/drm/xe/xe_sysctrl_mailbox.h       |   4 +-
 drivers/gpu/drm/xe/xe_sysctrl_mailbox_types.h |  11 +
 include/drm/drm_device.h                      |   1 +
 22 files changed, 1175 insertions(+), 29 deletions(-)
 create mode 100644 drivers/gpu/drm/xe/xe_pci_error.c
 create mode 100644 drivers/gpu/drm/xe/xe_pci_error.h

-- 
2.34.1


Thread overview: 17+ messages
2026-06-12  8:07 [PATCH v8 0/6] Introduce cold reset recovery method Mallesh Koujalagi
2026-06-12  8:07 ` [PATCH v8 1/6] Introduce Xe Uncorrectable Error Handling Mallesh Koujalagi
2026-06-12  8:24   ` sashiko-bot
2026-06-12  8:07 ` [PATCH v8 2/6] drm: Add DRM_WEDGE_RECOVERY_COLD_RESET recovery method Mallesh Koujalagi
2026-06-12  8:07 ` [PATCH v8 3/6] drm/doc: Document " Mallesh Koujalagi
2026-06-12  8:07 ` [PATCH v8 4/6] drm/xe: Handle PUNIT errors by requesting cold-reset recovery Mallesh Koujalagi
2026-06-12  8:27   ` sashiko-bot
2026-06-12  8:07 ` [PATCH v8 5/6] drm/xe: Suppress Surprise Link Down on device Mallesh Koujalagi
2026-06-12  8:21   ` sashiko-bot
2026-06-15  8:06   ` Tauro, Riana
2026-06-18 13:24   ` Raag Jadav
2026-06-12  8:07 ` [PATCH v8 6/6] drm/xe/ras: Add debugfs entry to inject punit error Mallesh Koujalagi
2026-06-12  8:23   ` sashiko-bot
2026-06-12  8:16 ` ✗ CI.checkpatch: warning for Introduce cold reset recovery method (rev8) Patchwork
2026-06-12  8:18 ` ✓ CI.KUnit: success " Patchwork
2026-06-12  9:03 ` ✓ Xe.CI.BAT: " Patchwork
2026-06-13  1:18 ` ✓ Xe.CI.FULL: " Patchwork
