Intel-XE Archive on lore.kernel.org
 help / color / mirror / Atom feed
* [Intel-xe] [PATCH v5 0/4] Supporting RAS on XE
@ 2023-08-23  8:58 Himal Prasad Ghimiray
  2023-08-23  8:58 ` [Intel-xe] [PATCH v5 1/4] drm/xe: Handle errors from various components Himal Prasad Ghimiray
                   ` (8 more replies)
  0 siblings, 9 replies; 21+ messages in thread
From: Himal Prasad Ghimiray @ 2023-08-23  8:58 UTC (permalink / raw)
  To: intel-xe

Our platforms support Reliability, Availability and Serviceability(RAS).
In case of hardware errors, our hardwares provides the causes via
sending interrupt or pcie errors. The fatal errors are propogated 
as pci errors and non fatal errors as MSI. This series focuses on 
loging and updating counters for these errors, which will be  helpful to avoid, 
detect and repair hardware faults.

This [1] series proposes mechanism to expose this counters to userspace.
[1]: https://patchwork.freedesktop.org/series/118435/

The error counters  exposed by KMD will be used by L0/sysman 
They will be categorized to specific category of error in sysman:
https://spec.oneapi.io/level-zero/latest/sysman/api.html#ras

Current version covers the error propagation at tile level and possible 
causes from GT subsystem. Future patches will cover the possible causes
from other subsystems (for example SOC, GSC/CSC).

We have very limited capabilities for error injection to validate the
code flow.
Output of L3 fabric fatal injection from PVC is:
xe 0000:8c:00.0: [drm] *ERROR* [Hardware Error]: TILE0 detected GT FATAL error bit[0] is set
xe 0000:8c:00.0: [drm] *ERROR* [Hardware Error]: GT0 detected L3 FABRIC FATAL error. ERR_VECT_GT_FATAL[7]:0x00000087

v2
- Use different headers for error registers. (Nikula)
- Correctable errors shouldn't be considered as dmesg errors (Matt)
- Limit series to HW errors.(Aravind)

v3
- Rebase

v4
- Use xe_regs.h only for registers, move enums out of it.
- Make sure global data/structures are immutable.
- Avoid adding custom error logging macro's.
- Redesign the registers error name and counter index
structures for maintainability. (Nikula)

v5
- move struct hw_err_regs out of CONFIG_DRM_XE_DISPLAY.

Himal Prasad Ghimiray (4):
  drm/xe: Handle errors from various components.
  drm/xe: Log and count the GT hardware errors.
  drm/xe: Support GT hardware error reporting for PVC.
  drm/xe: Process fatal hardware errors.

 drivers/gpu/drm/xe/Makefile                  |   1 +
 drivers/gpu/drm/xe/regs/xe_gt_error_regs.h   |  29 ++
 drivers/gpu/drm/xe/regs/xe_regs.h            |   5 +-
 drivers/gpu/drm/xe/regs/xe_tile_error_regs.h |  15 +
 drivers/gpu/drm/xe/xe_device_types.h         |  12 +
 drivers/gpu/drm/xe/xe_gt_types.h             |   7 +
 drivers/gpu/drm/xe/xe_hw_error.c             | 454 +++++++++++++++++++
 drivers/gpu/drm/xe/xe_hw_error.h             | 109 +++++
 drivers/gpu/drm/xe/xe_irq.c                  |   3 +
 9 files changed, 634 insertions(+), 1 deletion(-)
 create mode 100644 drivers/gpu/drm/xe/regs/xe_gt_error_regs.h
 create mode 100644 drivers/gpu/drm/xe/regs/xe_tile_error_regs.h
 create mode 100644 drivers/gpu/drm/xe/xe_hw_error.c
 create mode 100644 drivers/gpu/drm/xe/xe_hw_error.h

-- 
2.25.1


^ permalink raw reply	[flat|nested] 21+ messages in thread

end of thread, other threads:[~2023-10-05  3:59 UTC | newest]

Thread overview: 21+ messages (download: mbox.gz follow: Atom feed
-- links below jump to the message on this page --
2023-08-23  8:58 [Intel-xe] [PATCH v5 0/4] Supporting RAS on XE Himal Prasad Ghimiray
2023-08-23  8:58 ` [Intel-xe] [PATCH v5 1/4] drm/xe: Handle errors from various components Himal Prasad Ghimiray
2023-09-26  4:20   ` Aravind Iddamsetty
2023-09-26  4:57     ` Ghimiray, Himal Prasad
2023-09-26 10:09       ` Aravind Iddamsetty
2023-10-04 12:07   ` Aravind Iddamsetty
2023-10-05  4:01     ` Aravind Iddamsetty
2023-08-23  8:58 ` [Intel-xe] [PATCH v5 2/4] drm/xe: Log and count the GT hardware errors Himal Prasad Ghimiray
2023-09-26  4:20   ` Aravind Iddamsetty
2023-09-26  5:08     ` Ghimiray, Himal Prasad
2023-08-23  8:58 ` [Intel-xe] [PATCH v5 3/4] drm/xe: Support GT hardware error reporting for PVC Himal Prasad Ghimiray
2023-09-26  4:21   ` Aravind Iddamsetty
2023-09-26  5:11     ` Ghimiray, Himal Prasad
2023-08-23  8:58 ` [Intel-xe] [PATCH v5 4/4] drm/xe: Process fatal hardware errors Himal Prasad Ghimiray
2023-09-26  4:21   ` Aravind Iddamsetty
2023-09-26 10:24     ` Ghimiray, Himal Prasad
2023-08-23  9:00 ` [Intel-xe] ✓ CI.Patch_applied: success for Supporting RAS on XE (rev4) Patchwork
2023-08-23  9:00 ` [Intel-xe] ✗ CI.checkpatch: warning " Patchwork
2023-08-23  9:01 ` [Intel-xe] ✓ CI.KUnit: success " Patchwork
2023-08-23  9:05 ` [Intel-xe] ✓ CI.Build: " Patchwork
2023-08-23  9:05 ` [Intel-xe] ✗ CI.Hooks: failure " Patchwork

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox