From: Aravind Iddamsetty <aravind.iddamsetty@linux.intel.com>
To: intel-xe@lists.freedesktop.org
Cc: riana.tauro@intel.com, rodrigo.vivi@intel.com,
himal.prasad.ghimiray@intel.com, anshuman.gupta@intel.com
Subject: [PATCH 00/10] Supporting RAS on XE
Date: Wed, 30 Jul 2025 11:18:04 +0530
Message-ID: <20250730054814.1376770-1-aravind.iddamsetty@linux.intel.com>
Rebasing this series as preparation for the netlink infrastructure
series, which will be posted as a follow-up.
Our platforms support Reliability, Availability and Serviceability (RAS).
In case of hardware errors, the hardware reports the cause by raising
an interrupt or a PCIe error. Fatal errors are propagated as PCI errors
and non-fatal errors as MSI. This series focuses on logging these errors
and updating counters for them, which will help detect and repair
hardware faults.
The series at [1] proposes a mechanism to expose these counters to
userspace.
[1]: https://patchwork.freedesktop.org/series/118435/
The error counters exposed by the KMD will be used by L0/sysman.
They will be mapped to specific error categories in sysman:
https://spec.oneapi.io/level-zero/latest/sysman/api.html#ras
We have very limited capabilities for error injection to validate the
code flow.
Output of L3 fabric fatal injection from PVC is:
xe 0000:8c:00.0: [drm] *ERROR* [Hardware Error]: TILE0 detected GT FATAL error bit[0] is set
xe 0000:8c:00.0: [drm] *ERROR* [Hardware Error]: GT0 detected L3 FABRIC FATAL error. ERR_VECT_GT_FATAL[7]:0x00000087
Cc: Riana Tauro <riana.tauro@intel.com>
Cc: Rodrigo Vivi <rodrigo.vivi@intel.com>
Cc: Himal Prasad Ghimiray <himal.prasad.ghimiray@intel.com>
Cc: Anshuman Gupta <anshuman.gupta@intel.com>
Himal Prasad Ghimiray (10):
drm/xe: Handle errors from various components.
drm/xe: Add new helpers to log hardware errrors.
drm/xe: Log and count the GT hardware errors.
drm/xe: Support GT hardware error reporting for PVC.
drm/xe: Support GSC hardware error reporting for PVC.
drm/xe: Support SOC FATAL error handling for PVC.
drm/xe: Support SOC NONFATAL error handling for PVC.
drm/xe: Handle MDFI error severity.
drm/xe: Clear SOC CORRECTABLE error registers.
drm/xe: Clear all SoC errors post warm reset.
drivers/gpu/drm/xe/Makefile | 1 +
drivers/gpu/drm/xe/regs/xe_gt_error_regs.h | 29 +
drivers/gpu/drm/xe/regs/xe_irq_regs.h | 1 +
drivers/gpu/drm/xe/regs/xe_regs.h | 3 +
drivers/gpu/drm/xe/regs/xe_tile_error_regs.h | 66 ++
drivers/gpu/drm/xe/xe_device.c | 16 +
drivers/gpu/drm/xe/xe_device_types.h | 17 +
drivers/gpu/drm/xe/xe_gt.c | 1 +
drivers/gpu/drm/xe/xe_gt_printk.h | 7 +
drivers/gpu/drm/xe/xe_gt_types.h | 6 +
drivers/gpu/drm/xe/xe_hw_error.c | 896 +++++++++++++++++++
drivers/gpu/drm/xe/xe_hw_error.h | 194 ++++
drivers/gpu/drm/xe/xe_irq.c | 1 +
drivers/gpu/drm/xe/xe_tile.c | 2 +
14 files changed, 1240 insertions(+)
create mode 100644 drivers/gpu/drm/xe/regs/xe_gt_error_regs.h
create mode 100644 drivers/gpu/drm/xe/regs/xe_tile_error_regs.h
create mode 100644 drivers/gpu/drm/xe/xe_hw_error.c
create mode 100644 drivers/gpu/drm/xe/xe_hw_error.h
--
2.25.1
Thread overview: 18+ messages
2025-07-30 5:48 Aravind Iddamsetty [this message]
2025-07-30 5:48 ` [PATCH 01/10] drm/xe: Handle errors from various components Aravind Iddamsetty
2025-07-30 9:08 ` Michal Wajdeczko
2025-07-30 19:59 ` Rodrigo Vivi
2025-07-30 5:48 ` [PATCH 02/10] drm/xe: Add new helpers to log hardware errrors Aravind Iddamsetty
2025-07-30 8:55 ` Michal Wajdeczko
2025-07-30 5:48 ` [PATCH 03/10] drm/xe: Log and count the GT hardware errors Aravind Iddamsetty
2025-07-30 5:48 ` [PATCH 04/10] drm/xe: Support GT hardware error reporting for PVC Aravind Iddamsetty
2025-07-30 5:48 ` [PATCH 05/10] drm/xe: Support GSC " Aravind Iddamsetty
2025-07-30 5:48 ` [PATCH 06/10] drm/xe: Support SOC FATAL error handling " Aravind Iddamsetty
2025-07-30 5:48 ` [PATCH 07/10] drm/xe: Support SOC NONFATAL " Aravind Iddamsetty
2025-07-30 5:48 ` [PATCH 08/10] drm/xe: Handle MDFI error severity Aravind Iddamsetty
2025-07-30 5:48 ` [PATCH 09/10] drm/xe: Clear SOC CORRECTABLE error registers Aravind Iddamsetty
2025-07-30 5:48 ` [PATCH 10/10] drm/xe: Clear all SoC errors post warm reset Aravind Iddamsetty
2025-07-30 5:57 ` ✗ CI.checkpatch: warning for Supporting RAS on XE Patchwork
2025-07-30 5:58 ` ✓ CI.KUnit: success " Patchwork
2025-07-30 6:59 ` ✗ Xe.CI.BAT: failure " Patchwork
2025-07-30 8:03 ` ✗ Xe.CI.Full: " Patchwork