public inbox for intel-xe@lists.freedesktop.org
* ✗ CI.checkpatch: warning for Introduce DRM_RAS using generic netlink for RAS (rev4)
  2026-01-19  4:00 [PATCH v4 0/4] Introduce DRM_RAS using generic netlink for RAS Riana Tauro
@ 2026-01-19  3:36 ` Patchwork
  2026-01-19  3:37 ` ✓ CI.KUnit: success " Patchwork
                   ` (7 subsequent siblings)
  8 siblings, 0 replies; 22+ messages in thread
From: Patchwork @ 2026-01-19  3:36 UTC
  To: Rodrigo Vivi; +Cc: intel-xe

== Series Details ==

Series: Introduce DRM_RAS using generic netlink for RAS (rev4)
URL   : https://patchwork.freedesktop.org/series/155188/
State : warning

== Summary ==

+ KERNEL=/kernel
+ git clone https://gitlab.freedesktop.org/drm/maintainer-tools mt
Cloning into 'mt'...
warning: redirecting to https://gitlab.freedesktop.org/drm/maintainer-tools.git/
+ git -C mt rev-list -n1 origin/master
ee83616c430ce70bd254bd2774d143a5733c8666
+ cd /kernel
+ git config --global --add safe.directory /kernel
+ git log -n1
commit e8649becc11e54d51a4d464dfb18ce00b93b8915
Author: Riana Tauro <riana.tauro@intel.com>
Date:   Mon Jan 19 09:30:26 2026 +0530

    drm/xe/xe_hw_error: Add support for PVC SOC errors
    
    Report the SOC nonfatal/fatal hardware error and update the counters.
    
    Co-developed-by: Himal Prasad Ghimiray <himal.prasad.ghimiray@intel.com>
    Signed-off-by: Himal Prasad Ghimiray <himal.prasad.ghimiray@intel.com>
    Signed-off-by: Riana Tauro <riana.tauro@intel.com>
+ /mt/dim checkpatch af18785a6a8621b1a5805ba8a1b35d290cb4bcac drm-intel
a0c8c3d0add5 drm/ras: Introduce the DRM RAS infrastructure over generic netlink
-:57: WARNING:FILE_PATH_CHANGES: added, moved or deleted file(s), does MAINTAINERS need updating?
#57: 
new file mode 100644

-:805: WARNING:LONG_LINE: line length of 114 exceeds 100 columns
#805: FILE: drivers/gpu/drm/drm_ras_nl.c:13:
+static const struct nla_policy drm_ras_get_error_counters_nl_policy[DRM_RAS_A_ERROR_COUNTER_ATTRS_NODE_ID + 1] = {

-:810: WARNING:LONG_LINE: line length of 116 exceeds 100 columns
#810: FILE: drivers/gpu/drm/drm_ras_nl.c:18:
+static const struct nla_policy drm_ras_query_error_counter_nl_policy[DRM_RAS_A_ERROR_COUNTER_ATTRS_ERROR_ID + 1] = {

total: 0 errors, 3 warnings, 0 checks, 905 lines checked
b59f1bb3a603 drm/xe/xe_drm_ras: Add support for drm ras
-:27: WARNING:COMMIT_LOG_LONG_LINE: Prefer a maximum 75 chars per line (possible unwrapped commit description?)
#27: 
$ sudo ynl --family drm_ras  --dump get-error-counters --json '{"node-id":0}'

-:73: WARNING:FILE_PATH_CHANGES: added, moved or deleted file(s), does MAINTAINERS need updating?
#73: 
new file mode 100644

-:269: CHECK:MACRO_ARG_REUSE: Macro argument reuse 'i' - possible side-effects?
#269: FILE: drivers/gpu/drm/xe/xe_drm_ras.h:10:
+#define for_each_error_severity(i)	\
+	for (i = 0; i < DRM_XE_RAS_ERROR_SEVERITY_MAX; i++)

total: 0 errors, 2 warnings, 1 checks, 474 lines checked
c0b38ced1943 drm/xe/xe_hw_error: Add support for GT hardware errors
-:83: CHECK:MACRO_ARG_PRECEDENCE: Macro argument 'hw_err' may be better as '(hw_err)' to avoid precedence issues
#83: FILE: drivers/gpu/drm/xe/regs/xe_hw_error_regs.h:61:
+#define ERR_STAT_GT_VECTOR_REG(hw_err, x)	(hw_err == HARDWARE_ERROR_CORRECTABLE ? \
+						ERR_STAT_GT_COR_VECTOR_REG(x) : \
+						ERR_STAT_GT_FATAL_VECTOR_REG(x))

-:83: CHECK:MACRO_ARG_REUSE: Macro argument reuse 'x' - possible side-effects?
#83: FILE: drivers/gpu/drm/xe/regs/xe_hw_error_regs.h:61:
+#define ERR_STAT_GT_VECTOR_REG(hw_err, x)	(hw_err == HARDWARE_ERROR_CORRECTABLE ? \
+						ERR_STAT_GT_COR_VECTOR_REG(x) : \
+						ERR_STAT_GT_FATAL_VECTOR_REG(x))

-:106: WARNING:SPACE_BEFORE_TAB: please, no space before tabs
#106: FILE: drivers/gpu/drm/xe/xe_hw_error.c:20:
+#define  HEC_UNCORR_FW_ERR_BITS ^I4$

total: 0 errors, 1 warnings, 2 checks, 299 lines checked
e8649becc11e drm/xe/xe_hw_error: Add support for PVC SOC errors
-:33: CHECK:MACRO_ARG_REUSE: Macro argument reuse 'base' - possible side-effects?
#33: FILE: drivers/gpu/drm/xe/regs/xe_hw_error_regs.h:71:
+#define SOC_GLOBAL_ERR_STAT_REG(base, x)	XE_REG(_PICK_EVEN((x), \
+								  (base) + SOC_GCOERRSTS, \
+								  (base) + SOC_GNFERRSTS))

-:47: CHECK:MACRO_ARG_REUSE: Macro argument reuse 'base' - possible side-effects?
#47: FILE: drivers/gpu/drm/xe/regs/xe_hw_error_regs.h:85:
+#define SOC_LOCAL_ERR_STAT_REG(base, hw_err)	XE_REG(hw_err == HARDWARE_ERROR_CORRECTABLE ? \
+						      (base) + SOC_LERRCORSTS : \
+						      (base) + SOC_LERRUNCSTS)

-:47: CHECK:MACRO_ARG_PRECEDENCE: Macro argument 'hw_err' may be better as '(hw_err)' to avoid precedence issues
#47: FILE: drivers/gpu/drm/xe/regs/xe_hw_error_regs.h:85:
+#define SOC_LOCAL_ERR_STAT_REG(base, hw_err)	XE_REG(hw_err == HARDWARE_ERROR_CORRECTABLE ? \
+						      (base) + SOC_LERRCORSTS : \
+						      (base) + SOC_LERRUNCSTS)

-:60: WARNING:SPACE_BEFORE_TAB: please, no space before tabs
#60: FILE: drivers/gpu/drm/xe/xe_hw_error.c:22:
+#define  XE_SOC_NUM_IEH ^I^I2$

total: 0 errors, 1 warnings, 3 checks, 272 lines checked
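
For readers unfamiliar with the two CHECK types reported above, here is a minimal
sketch of what MACRO_ARG_PRECEDENCE and MACRO_ARG_REUSE guard against. All DEMO_*
and demo_* names are hypothetical and are not part of the series. Whether the
flagged xe macros can misbehave depends on their callers: if they only ever pass
plain identifiers or constants, the two forms behave the same.

```c
/*
 * Illustration only: the DEMO_ and demo_ names are hypothetical and do not
 * come from the patches under review.
 */

/* 'sev' is used unparenthesized: the shape MACRO_ARG_PRECEDENCE flags. */
#define DEMO_PICK_UNSAFE(sev, cor, fatal) \
	(sev == 0 ? (cor) : (fatal))

/* Same macro with the argument parenthesized. */
#define DEMO_PICK_SAFE(sev, cor, fatal) \
	((sev) == 0 ? (cor) : (fatal))

/* 'x' is expanded twice: the shape MACRO_ARG_REUSE flags. */
#define DEMO_SQUARE(x) ((x) * (x))

static int demo_calls;

/* Helper with a side effect: counts how many times it is evaluated. */
static int demo_next(void)
{
	return ++demo_calls;
}

int demo_pick_unsafe(int cond, int a, int b)
{
	/* Expands to: cond ? a : (b == 0 ? 10 : 20) */
	return DEMO_PICK_UNSAFE(cond ? a : b, 10, 20);
}

int demo_pick_safe(int cond, int a, int b)
{
	/* Expands to: (cond ? a : b) == 0 ? 10 : 20 */
	return DEMO_PICK_SAFE(cond ? a : b, 10, 20);
}

int demo_square_once(int *calls)
{
	int value;

	demo_calls = 0;
	/* Expands to: (demo_next()) * (demo_next()), so the helper runs twice. */
	value = DEMO_SQUARE(demo_next());
	*calls = demo_calls;
	return value;
}
```

With cond = 1, a = 0, b = 1, the unparenthesized macro yields a itself (0), while
the parenthesized one yields 10. DEMO_SQUARE(demo_next()) evaluates the helper
twice, which is the side effect the MACRO_ARG_REUSE check warns about.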




* ✓ CI.KUnit: success for Introduce DRM_RAS using generic netlink for RAS (rev4)
  2026-01-19  4:00 [PATCH v4 0/4] Introduce DRM_RAS using generic netlink for RAS Riana Tauro
  2026-01-19  3:36 ` ✗ CI.checkpatch: warning for Introduce DRM_RAS using generic netlink for RAS (rev4) Patchwork
@ 2026-01-19  3:37 ` Patchwork
  2026-01-19  3:52 ` ✗ CI.checksparse: warning " Patchwork
                   ` (6 subsequent siblings)
  8 siblings, 0 replies; 22+ messages in thread
From: Patchwork @ 2026-01-19  3:37 UTC
  To: Rodrigo Vivi; +Cc: intel-xe

== Series Details ==

Series: Introduce DRM_RAS using generic netlink for RAS (rev4)
URL   : https://patchwork.freedesktop.org/series/155188/
State : success

== Summary ==

+ trap cleanup EXIT
+ /kernel/tools/testing/kunit/kunit.py run --kunitconfig /kernel/drivers/gpu/drm/xe/.kunitconfig
[03:36:09] Configuring KUnit Kernel ...
Generating .config ...
Populating config with:
$ make ARCH=um O=.kunit olddefconfig
[03:36:13] Building KUnit Kernel ...
Populating config with:
$ make ARCH=um O=.kunit olddefconfig
Building with:
$ make all compile_commands.json scripts_gdb ARCH=um O=.kunit --jobs=48
[03:36:45] Starting KUnit Kernel (1/1)...
[03:36:45] ============================================================
Running tests with:
$ .kunit/linux kunit.enable=1 mem=1G console=tty kunit_shutdown=halt
[03:36:45] ================== guc_buf (11 subtests) ===================
[03:36:45] [PASSED] test_smallest
[03:36:45] [PASSED] test_largest
[03:36:45] [PASSED] test_granular
[03:36:45] [PASSED] test_unique
[03:36:45] [PASSED] test_overlap
[03:36:45] [PASSED] test_reusable
[03:36:45] [PASSED] test_too_big
[03:36:45] [PASSED] test_flush
[03:36:45] [PASSED] test_lookup
[03:36:45] [PASSED] test_data
[03:36:45] [PASSED] test_class
[03:36:45] ===================== [PASSED] guc_buf =====================
[03:36:45] =================== guc_dbm (7 subtests) ===================
[03:36:45] [PASSED] test_empty
[03:36:45] [PASSED] test_default
[03:36:45] ======================== test_size  ========================
[03:36:45] [PASSED] 4
[03:36:45] [PASSED] 8
[03:36:45] [PASSED] 32
[03:36:45] [PASSED] 256
[03:36:45] ==================== [PASSED] test_size ====================
[03:36:45] ======================= test_reuse  ========================
[03:36:45] [PASSED] 4
[03:36:45] [PASSED] 8
[03:36:45] [PASSED] 32
[03:36:45] [PASSED] 256
[03:36:45] =================== [PASSED] test_reuse ====================
[03:36:45] =================== test_range_overlap  ====================
[03:36:45] [PASSED] 4
[03:36:45] [PASSED] 8
[03:36:45] [PASSED] 32
[03:36:45] [PASSED] 256
[03:36:45] =============== [PASSED] test_range_overlap ================
[03:36:45] =================== test_range_compact  ====================
[03:36:45] [PASSED] 4
[03:36:45] [PASSED] 8
[03:36:45] [PASSED] 32
[03:36:45] [PASSED] 256
[03:36:45] =============== [PASSED] test_range_compact ================
[03:36:45] ==================== test_range_spare  =====================
[03:36:45] [PASSED] 4
[03:36:45] [PASSED] 8
[03:36:45] [PASSED] 32
[03:36:45] [PASSED] 256
[03:36:45] ================ [PASSED] test_range_spare =================
[03:36:45] ===================== [PASSED] guc_dbm =====================
[03:36:45] =================== guc_idm (6 subtests) ===================
[03:36:45] [PASSED] bad_init
[03:36:45] [PASSED] no_init
[03:36:45] [PASSED] init_fini
[03:36:45] [PASSED] check_used
[03:36:45] [PASSED] check_quota
[03:36:45] [PASSED] check_all
[03:36:45] ===================== [PASSED] guc_idm =====================
[03:36:45] ================== no_relay (3 subtests) ===================
[03:36:45] [PASSED] xe_drops_guc2pf_if_not_ready
[03:36:45] [PASSED] xe_drops_guc2vf_if_not_ready
[03:36:45] [PASSED] xe_rejects_send_if_not_ready
[03:36:45] ==================== [PASSED] no_relay =====================
[03:36:45] ================== pf_relay (14 subtests) ==================
[03:36:45] [PASSED] pf_rejects_guc2pf_too_short
[03:36:45] [PASSED] pf_rejects_guc2pf_too_long
[03:36:45] [PASSED] pf_rejects_guc2pf_no_payload
[03:36:45] [PASSED] pf_fails_no_payload
[03:36:45] [PASSED] pf_fails_bad_origin
[03:36:45] [PASSED] pf_fails_bad_type
[03:36:45] [PASSED] pf_txn_reports_error
[03:36:45] [PASSED] pf_txn_sends_pf2guc
[03:36:45] [PASSED] pf_sends_pf2guc
[03:36:45] [SKIPPED] pf_loopback_nop
[03:36:45] [SKIPPED] pf_loopback_echo
[03:36:45] [SKIPPED] pf_loopback_fail
[03:36:45] [SKIPPED] pf_loopback_busy
[03:36:45] [SKIPPED] pf_loopback_retry
[03:36:45] ==================== [PASSED] pf_relay =====================
[03:36:45] ================== vf_relay (3 subtests) ===================
[03:36:45] [PASSED] vf_rejects_guc2vf_too_short
[03:36:45] [PASSED] vf_rejects_guc2vf_too_long
[03:36:45] [PASSED] vf_rejects_guc2vf_no_payload
[03:36:45] ==================== [PASSED] vf_relay =====================
[03:36:45] ================ pf_gt_config (6 subtests) =================
[03:36:45] [PASSED] fair_contexts_1vf
[03:36:45] [PASSED] fair_doorbells_1vf
[03:36:45] [PASSED] fair_ggtt_1vf
[03:36:45] ====================== fair_contexts  ======================
[03:36:45] [PASSED] 1 VF
[03:36:45] [PASSED] 2 VFs
[03:36:45] [PASSED] 3 VFs
[03:36:45] [PASSED] 4 VFs
[03:36:45] [PASSED] 5 VFs
[03:36:45] [PASSED] 6 VFs
[03:36:45] [PASSED] 7 VFs
[03:36:45] [PASSED] 8 VFs
[03:36:45] [PASSED] 9 VFs
[03:36:45] [PASSED] 10 VFs
[03:36:45] [PASSED] 11 VFs
[03:36:45] [PASSED] 12 VFs
[03:36:45] [PASSED] 13 VFs
[03:36:45] [PASSED] 14 VFs
[03:36:45] [PASSED] 15 VFs
[03:36:45] [PASSED] 16 VFs
[03:36:45] [PASSED] 17 VFs
[03:36:45] [PASSED] 18 VFs
[03:36:45] [PASSED] 19 VFs
[03:36:45] [PASSED] 20 VFs
[03:36:45] [PASSED] 21 VFs
[03:36:45] [PASSED] 22 VFs
[03:36:45] [PASSED] 23 VFs
[03:36:45] [PASSED] 24 VFs
[03:36:45] [PASSED] 25 VFs
[03:36:45] [PASSED] 26 VFs
[03:36:45] [PASSED] 27 VFs
[03:36:45] [PASSED] 28 VFs
[03:36:45] [PASSED] 29 VFs
[03:36:45] [PASSED] 30 VFs
[03:36:45] [PASSED] 31 VFs
[03:36:45] [PASSED] 32 VFs
[03:36:45] [PASSED] 33 VFs
[03:36:45] [PASSED] 34 VFs
[03:36:45] [PASSED] 35 VFs
[03:36:45] [PASSED] 36 VFs
[03:36:45] [PASSED] 37 VFs
[03:36:45] [PASSED] 38 VFs
[03:36:45] [PASSED] 39 VFs
[03:36:45] [PASSED] 40 VFs
[03:36:45] [PASSED] 41 VFs
[03:36:45] [PASSED] 42 VFs
[03:36:45] [PASSED] 43 VFs
[03:36:45] [PASSED] 44 VFs
[03:36:45] [PASSED] 45 VFs
[03:36:45] [PASSED] 46 VFs
[03:36:45] [PASSED] 47 VFs
[03:36:45] [PASSED] 48 VFs
[03:36:45] [PASSED] 49 VFs
[03:36:45] [PASSED] 50 VFs
[03:36:45] [PASSED] 51 VFs
[03:36:45] [PASSED] 52 VFs
[03:36:45] [PASSED] 53 VFs
[03:36:45] [PASSED] 54 VFs
[03:36:45] [PASSED] 55 VFs
[03:36:45] [PASSED] 56 VFs
[03:36:45] [PASSED] 57 VFs
[03:36:45] [PASSED] 58 VFs
[03:36:45] [PASSED] 59 VFs
[03:36:45] [PASSED] 60 VFs
[03:36:45] [PASSED] 61 VFs
[03:36:45] [PASSED] 62 VFs
[03:36:45] [PASSED] 63 VFs
[03:36:45] ================== [PASSED] fair_contexts ==================
[03:36:45] ===================== fair_doorbells  ======================
[03:36:45] [PASSED] 1 VF
[03:36:45] [PASSED] 2 VFs
[03:36:45] [PASSED] 3 VFs
[03:36:45] [PASSED] 4 VFs
[03:36:45] [PASSED] 5 VFs
[03:36:45] [PASSED] 6 VFs
[03:36:45] [PASSED] 7 VFs
[03:36:45] [PASSED] 8 VFs
[03:36:45] [PASSED] 9 VFs
[03:36:45] [PASSED] 10 VFs
[03:36:45] [PASSED] 11 VFs
[03:36:45] [PASSED] 12 VFs
[03:36:45] [PASSED] 13 VFs
[03:36:45] [PASSED] 14 VFs
[03:36:45] [PASSED] 15 VFs
[03:36:45] [PASSED] 16 VFs
[03:36:45] [PASSED] 17 VFs
[03:36:45] [PASSED] 18 VFs
[03:36:45] [PASSED] 19 VFs
[03:36:45] [PASSED] 20 VFs
[03:36:45] [PASSED] 21 VFs
[03:36:45] [PASSED] 22 VFs
[03:36:45] [PASSED] 23 VFs
[03:36:45] [PASSED] 24 VFs
[03:36:45] [PASSED] 25 VFs
[03:36:45] [PASSED] 26 VFs
[03:36:45] [PASSED] 27 VFs
[03:36:45] [PASSED] 28 VFs
[03:36:45] [PASSED] 29 VFs
[03:36:45] [PASSED] 30 VFs
[03:36:45] [PASSED] 31 VFs
[03:36:45] [PASSED] 32 VFs
[03:36:45] [PASSED] 33 VFs
[03:36:45] [PASSED] 34 VFs
[03:36:45] [PASSED] 35 VFs
[03:36:45] [PASSED] 36 VFs
[03:36:45] [PASSED] 37 VFs
[03:36:45] [PASSED] 38 VFs
[03:36:45] [PASSED] 39 VFs
[03:36:45] [PASSED] 40 VFs
[03:36:45] [PASSED] 41 VFs
[03:36:45] [PASSED] 42 VFs
[03:36:45] [PASSED] 43 VFs
[03:36:45] [PASSED] 44 VFs
[03:36:45] [PASSED] 45 VFs
[03:36:45] [PASSED] 46 VFs
[03:36:45] [PASSED] 47 VFs
[03:36:45] [PASSED] 48 VFs
[03:36:45] [PASSED] 49 VFs
[03:36:45] [PASSED] 50 VFs
[03:36:45] [PASSED] 51 VFs
[03:36:45] [PASSED] 52 VFs
[03:36:45] [PASSED] 53 VFs
[03:36:45] [PASSED] 54 VFs
[03:36:45] [PASSED] 55 VFs
[03:36:45] [PASSED] 56 VFs
[03:36:45] [PASSED] 57 VFs
[03:36:45] [PASSED] 58 VFs
[03:36:45] [PASSED] 59 VFs
[03:36:45] [PASSED] 60 VFs
[03:36:45] [PASSED] 61 VFs
[03:36:45] [PASSED] 62 VFs
[03:36:45] [PASSED] 63 VFs
[03:36:45] ================= [PASSED] fair_doorbells ==================
[03:36:45] ======================== fair_ggtt  ========================
[03:36:45] [PASSED] 1 VF
[03:36:45] [PASSED] 2 VFs
[03:36:45] [PASSED] 3 VFs
[03:36:45] [PASSED] 4 VFs
[03:36:45] [PASSED] 5 VFs
[03:36:45] [PASSED] 6 VFs
[03:36:45] [PASSED] 7 VFs
[03:36:45] [PASSED] 8 VFs
[03:36:45] [PASSED] 9 VFs
[03:36:45] [PASSED] 10 VFs
[03:36:45] [PASSED] 11 VFs
[03:36:45] [PASSED] 12 VFs
[03:36:45] [PASSED] 13 VFs
[03:36:45] [PASSED] 14 VFs
[03:36:45] [PASSED] 15 VFs
[03:36:45] [PASSED] 16 VFs
[03:36:45] [PASSED] 17 VFs
[03:36:45] [PASSED] 18 VFs
[03:36:45] [PASSED] 19 VFs
[03:36:45] [PASSED] 20 VFs
[03:36:45] [PASSED] 21 VFs
[03:36:45] [PASSED] 22 VFs
[03:36:45] [PASSED] 23 VFs
[03:36:45] [PASSED] 24 VFs
[03:36:45] [PASSED] 25 VFs
[03:36:45] [PASSED] 26 VFs
[03:36:45] [PASSED] 27 VFs
[03:36:45] [PASSED] 28 VFs
[03:36:45] [PASSED] 29 VFs
[03:36:45] [PASSED] 30 VFs
[03:36:45] [PASSED] 31 VFs
[03:36:45] [PASSED] 32 VFs
[03:36:45] [PASSED] 33 VFs
[03:36:45] [PASSED] 34 VFs
[03:36:45] [PASSED] 35 VFs
[03:36:45] [PASSED] 36 VFs
[03:36:45] [PASSED] 37 VFs
[03:36:45] [PASSED] 38 VFs
[03:36:45] [PASSED] 39 VFs
[03:36:45] [PASSED] 40 VFs
[03:36:45] [PASSED] 41 VFs
[03:36:45] [PASSED] 42 VFs
[03:36:45] [PASSED] 43 VFs
[03:36:45] [PASSED] 44 VFs
[03:36:45] [PASSED] 45 VFs
[03:36:45] [PASSED] 46 VFs
[03:36:45] [PASSED] 47 VFs
[03:36:45] [PASSED] 48 VFs
[03:36:45] [PASSED] 49 VFs
[03:36:45] [PASSED] 50 VFs
[03:36:45] [PASSED] 51 VFs
[03:36:45] [PASSED] 52 VFs
[03:36:45] [PASSED] 53 VFs
[03:36:45] [PASSED] 54 VFs
[03:36:45] [PASSED] 55 VFs
[03:36:45] [PASSED] 56 VFs
[03:36:45] [PASSED] 57 VFs
[03:36:45] [PASSED] 58 VFs
[03:36:45] [PASSED] 59 VFs
[03:36:45] [PASSED] 60 VFs
[03:36:45] [PASSED] 61 VFs
[03:36:45] [PASSED] 62 VFs
[03:36:45] [PASSED] 63 VFs
[03:36:45] ==================== [PASSED] fair_ggtt ====================
[03:36:45] ================== [PASSED] pf_gt_config ===================
[03:36:45] ===================== lmtt (1 subtest) =====================
[03:36:45] ======================== test_ops  =========================
[03:36:45] [PASSED] 2-level
[03:36:45] [PASSED] multi-level
[03:36:45] ==================== [PASSED] test_ops =====================
[03:36:45] ====================== [PASSED] lmtt =======================
[03:36:45] ================= pf_service (11 subtests) =================
[03:36:45] [PASSED] pf_negotiate_any
[03:36:45] [PASSED] pf_negotiate_base_match
[03:36:45] [PASSED] pf_negotiate_base_newer
[03:36:45] [PASSED] pf_negotiate_base_next
[03:36:45] [SKIPPED] pf_negotiate_base_older
[03:36:45] [PASSED] pf_negotiate_base_prev
[03:36:45] [PASSED] pf_negotiate_latest_match
[03:36:45] [PASSED] pf_negotiate_latest_newer
[03:36:45] [PASSED] pf_negotiate_latest_next
[03:36:45] [SKIPPED] pf_negotiate_latest_older
[03:36:45] [SKIPPED] pf_negotiate_latest_prev
[03:36:45] =================== [PASSED] pf_service ====================
[03:36:45] ================= xe_guc_g2g (2 subtests) ==================
[03:36:45] ============== xe_live_guc_g2g_kunit_default  ==============
[03:36:45] ========= [SKIPPED] xe_live_guc_g2g_kunit_default ==========
[03:36:45] ============== xe_live_guc_g2g_kunit_allmem  ===============
[03:36:45] ========== [SKIPPED] xe_live_guc_g2g_kunit_allmem ==========
[03:36:45] =================== [SKIPPED] xe_guc_g2g ===================
[03:36:45] =================== xe_mocs (2 subtests) ===================
[03:36:45] ================ xe_live_mocs_kernel_kunit  ================
[03:36:45] =========== [SKIPPED] xe_live_mocs_kernel_kunit ============
[03:36:45] ================ xe_live_mocs_reset_kunit  =================
[03:36:45] ============ [SKIPPED] xe_live_mocs_reset_kunit ============
[03:36:45] ==================== [SKIPPED] xe_mocs =====================
[03:36:45] ================= xe_migrate (2 subtests) ==================
[03:36:45] ================= xe_migrate_sanity_kunit  =================
[03:36:45] ============ [SKIPPED] xe_migrate_sanity_kunit =============
[03:36:45] ================== xe_validate_ccs_kunit  ==================
[03:36:45] ============= [SKIPPED] xe_validate_ccs_kunit ==============
[03:36:45] =================== [SKIPPED] xe_migrate ===================
[03:36:45] ================== xe_dma_buf (1 subtest) ==================
[03:36:45] ==================== xe_dma_buf_kunit  =====================
[03:36:45] ================ [SKIPPED] xe_dma_buf_kunit ================
[03:36:45] =================== [SKIPPED] xe_dma_buf ===================
[03:36:45] ================= xe_bo_shrink (1 subtest) =================
[03:36:45] =================== xe_bo_shrink_kunit  ====================
[03:36:45] =============== [SKIPPED] xe_bo_shrink_kunit ===============
[03:36:45] ================== [SKIPPED] xe_bo_shrink ==================
[03:36:45] ==================== xe_bo (2 subtests) ====================
[03:36:45] ================== xe_ccs_migrate_kunit  ===================
[03:36:45] ============== [SKIPPED] xe_ccs_migrate_kunit ==============
[03:36:45] ==================== xe_bo_evict_kunit  ====================
[03:36:45] =============== [SKIPPED] xe_bo_evict_kunit ================
[03:36:45] ===================== [SKIPPED] xe_bo ======================
[03:36:45] ==================== args (13 subtests) ====================
[03:36:45] [PASSED] count_args_test
[03:36:45] [PASSED] call_args_example
[03:36:45] [PASSED] call_args_test
[03:36:45] [PASSED] drop_first_arg_example
[03:36:45] [PASSED] drop_first_arg_test
[03:36:45] [PASSED] first_arg_example
[03:36:45] [PASSED] first_arg_test
[03:36:45] [PASSED] last_arg_example
[03:36:45] [PASSED] last_arg_test
[03:36:45] [PASSED] pick_arg_example
[03:36:45] [PASSED] if_args_example
[03:36:45] [PASSED] if_args_test
[03:36:45] [PASSED] sep_comma_example
[03:36:45] ====================== [PASSED] args =======================
[03:36:45] =================== xe_pci (3 subtests) ====================
[03:36:45] ==================== check_graphics_ip  ====================
[03:36:45] [PASSED] 12.00 Xe_LP
[03:36:45] [PASSED] 12.10 Xe_LP+
[03:36:45] [PASSED] 12.55 Xe_HPG
[03:36:45] [PASSED] 12.60 Xe_HPC
[03:36:45] [PASSED] 12.70 Xe_LPG
[03:36:45] [PASSED] 12.71 Xe_LPG
[03:36:45] [PASSED] 12.74 Xe_LPG+
[03:36:45] [PASSED] 20.01 Xe2_HPG
[03:36:45] [PASSED] 20.02 Xe2_HPG
[03:36:45] [PASSED] 20.04 Xe2_LPG
[03:36:45] [PASSED] 30.00 Xe3_LPG
[03:36:45] [PASSED] 30.01 Xe3_LPG
[03:36:45] [PASSED] 30.03 Xe3_LPG
[03:36:45] [PASSED] 30.04 Xe3_LPG
[03:36:45] [PASSED] 30.05 Xe3_LPG
[03:36:45] [PASSED] 35.11 Xe3p_XPC
[03:36:45] ================ [PASSED] check_graphics_ip ================
[03:36:45] ===================== check_media_ip  ======================
[03:36:45] [PASSED] 12.00 Xe_M
[03:36:45] [PASSED] 12.55 Xe_HPM
[03:36:45] [PASSED] 13.00 Xe_LPM+
[03:36:45] [PASSED] 13.01 Xe2_HPM
[03:36:45] [PASSED] 20.00 Xe2_LPM
[03:36:45] [PASSED] 30.00 Xe3_LPM
[03:36:45] [PASSED] 30.02 Xe3_LPM
[03:36:45] [PASSED] 35.00 Xe3p_LPM
[03:36:45] [PASSED] 35.03 Xe3p_HPM
[03:36:45] ================= [PASSED] check_media_ip ==================
[03:36:45] =================== check_platform_desc  ===================
[03:36:45] [PASSED] 0x9A60 (TIGERLAKE)
[03:36:45] [PASSED] 0x9A68 (TIGERLAKE)
[03:36:45] [PASSED] 0x9A70 (TIGERLAKE)
[03:36:45] [PASSED] 0x9A40 (TIGERLAKE)
[03:36:45] [PASSED] 0x9A49 (TIGERLAKE)
[03:36:45] [PASSED] 0x9A59 (TIGERLAKE)
[03:36:45] [PASSED] 0x9A78 (TIGERLAKE)
[03:36:45] [PASSED] 0x9AC0 (TIGERLAKE)
[03:36:45] [PASSED] 0x9AC9 (TIGERLAKE)
[03:36:45] [PASSED] 0x9AD9 (TIGERLAKE)
[03:36:45] [PASSED] 0x9AF8 (TIGERLAKE)
[03:36:45] [PASSED] 0x4C80 (ROCKETLAKE)
[03:36:45] [PASSED] 0x4C8A (ROCKETLAKE)
[03:36:45] [PASSED] 0x4C8B (ROCKETLAKE)
[03:36:45] [PASSED] 0x4C8C (ROCKETLAKE)
[03:36:45] [PASSED] 0x4C90 (ROCKETLAKE)
[03:36:45] [PASSED] 0x4C9A (ROCKETLAKE)
[03:36:45] [PASSED] 0x4680 (ALDERLAKE_S)
[03:36:45] [PASSED] 0x4682 (ALDERLAKE_S)
[03:36:45] [PASSED] 0x4688 (ALDERLAKE_S)
[03:36:45] [PASSED] 0x468A (ALDERLAKE_S)
[03:36:45] [PASSED] 0x468B (ALDERLAKE_S)
[03:36:45] [PASSED] 0x4690 (ALDERLAKE_S)
[03:36:45] [PASSED] 0x4692 (ALDERLAKE_S)
[03:36:45] [PASSED] 0x4693 (ALDERLAKE_S)
[03:36:45] [PASSED] 0x46A0 (ALDERLAKE_P)
[03:36:45] [PASSED] 0x46A1 (ALDERLAKE_P)
[03:36:45] [PASSED] 0x46A2 (ALDERLAKE_P)
[03:36:45] [PASSED] 0x46A3 (ALDERLAKE_P)
[03:36:45] [PASSED] 0x46A6 (ALDERLAKE_P)
[03:36:45] [PASSED] 0x46A8 (ALDERLAKE_P)
[03:36:45] [PASSED] 0x46AA (ALDERLAKE_P)
[03:36:45] [PASSED] 0x462A (ALDERLAKE_P)
[03:36:45] [PASSED] 0x4626 (ALDERLAKE_P)
[03:36:45] [PASSED] 0x4628 (ALDERLAKE_P)
stty: 'standard input': Inappropriate ioctl for device
[03:36:45] [PASSED] 0x46B0 (ALDERLAKE_P)
[03:36:45] [PASSED] 0x46B1 (ALDERLAKE_P)
[03:36:45] [PASSED] 0x46B2 (ALDERLAKE_P)
[03:36:45] [PASSED] 0x46B3 (ALDERLAKE_P)
[03:36:45] [PASSED] 0x46C0 (ALDERLAKE_P)
[03:36:45] [PASSED] 0x46C1 (ALDERLAKE_P)
[03:36:45] [PASSED] 0x46C2 (ALDERLAKE_P)
[03:36:45] [PASSED] 0x46C3 (ALDERLAKE_P)
[03:36:45] [PASSED] 0x46D0 (ALDERLAKE_N)
[03:36:45] [PASSED] 0x46D1 (ALDERLAKE_N)
[03:36:45] [PASSED] 0x46D2 (ALDERLAKE_N)
[03:36:45] [PASSED] 0x46D3 (ALDERLAKE_N)
[03:36:45] [PASSED] 0x46D4 (ALDERLAKE_N)
[03:36:45] [PASSED] 0xA721 (ALDERLAKE_P)
[03:36:45] [PASSED] 0xA7A1 (ALDERLAKE_P)
[03:36:45] [PASSED] 0xA7A9 (ALDERLAKE_P)
[03:36:45] [PASSED] 0xA7AC (ALDERLAKE_P)
[03:36:45] [PASSED] 0xA7AD (ALDERLAKE_P)
[03:36:45] [PASSED] 0xA720 (ALDERLAKE_P)
[03:36:45] [PASSED] 0xA7A0 (ALDERLAKE_P)
[03:36:45] [PASSED] 0xA7A8 (ALDERLAKE_P)
[03:36:45] [PASSED] 0xA7AA (ALDERLAKE_P)
[03:36:45] [PASSED] 0xA7AB (ALDERLAKE_P)
[03:36:45] [PASSED] 0xA780 (ALDERLAKE_S)
[03:36:45] [PASSED] 0xA781 (ALDERLAKE_S)
[03:36:45] [PASSED] 0xA782 (ALDERLAKE_S)
[03:36:45] [PASSED] 0xA783 (ALDERLAKE_S)
[03:36:45] [PASSED] 0xA788 (ALDERLAKE_S)
[03:36:45] [PASSED] 0xA789 (ALDERLAKE_S)
[03:36:45] [PASSED] 0xA78A (ALDERLAKE_S)
[03:36:45] [PASSED] 0xA78B (ALDERLAKE_S)
[03:36:45] [PASSED] 0x4905 (DG1)
[03:36:45] [PASSED] 0x4906 (DG1)
[03:36:45] [PASSED] 0x4907 (DG1)
[03:36:45] [PASSED] 0x4908 (DG1)
[03:36:45] [PASSED] 0x4909 (DG1)
[03:36:45] [PASSED] 0x56C0 (DG2)
[03:36:45] [PASSED] 0x56C2 (DG2)
[03:36:45] [PASSED] 0x56C1 (DG2)
[03:36:45] [PASSED] 0x7D51 (METEORLAKE)
[03:36:45] [PASSED] 0x7DD1 (METEORLAKE)
[03:36:45] [PASSED] 0x7D41 (METEORLAKE)
[03:36:45] [PASSED] 0x7D67 (METEORLAKE)
[03:36:45] [PASSED] 0xB640 (METEORLAKE)
[03:36:45] [PASSED] 0x56A0 (DG2)
[03:36:45] [PASSED] 0x56A1 (DG2)
[03:36:45] [PASSED] 0x56A2 (DG2)
[03:36:45] [PASSED] 0x56BE (DG2)
[03:36:45] [PASSED] 0x56BF (DG2)
[03:36:45] [PASSED] 0x5690 (DG2)
[03:36:45] [PASSED] 0x5691 (DG2)
[03:36:45] [PASSED] 0x5692 (DG2)
[03:36:45] [PASSED] 0x56A5 (DG2)
[03:36:45] [PASSED] 0x56A6 (DG2)
[03:36:45] [PASSED] 0x56B0 (DG2)
[03:36:45] [PASSED] 0x56B1 (DG2)
[03:36:45] [PASSED] 0x56BA (DG2)
[03:36:45] [PASSED] 0x56BB (DG2)
[03:36:45] [PASSED] 0x56BC (DG2)
[03:36:45] [PASSED] 0x56BD (DG2)
[03:36:45] [PASSED] 0x5693 (DG2)
[03:36:45] [PASSED] 0x5694 (DG2)
[03:36:45] [PASSED] 0x5695 (DG2)
[03:36:45] [PASSED] 0x56A3 (DG2)
[03:36:45] [PASSED] 0x56A4 (DG2)
[03:36:45] [PASSED] 0x56B2 (DG2)
[03:36:45] [PASSED] 0x56B3 (DG2)
[03:36:45] [PASSED] 0x5696 (DG2)
[03:36:45] [PASSED] 0x5697 (DG2)
[03:36:45] [PASSED] 0xB69 (PVC)
[03:36:45] [PASSED] 0xB6E (PVC)
[03:36:45] [PASSED] 0xBD4 (PVC)
[03:36:45] [PASSED] 0xBD5 (PVC)
[03:36:45] [PASSED] 0xBD6 (PVC)
[03:36:45] [PASSED] 0xBD7 (PVC)
[03:36:45] [PASSED] 0xBD8 (PVC)
[03:36:45] [PASSED] 0xBD9 (PVC)
[03:36:45] [PASSED] 0xBDA (PVC)
[03:36:45] [PASSED] 0xBDB (PVC)
[03:36:45] [PASSED] 0xBE0 (PVC)
[03:36:45] [PASSED] 0xBE1 (PVC)
[03:36:45] [PASSED] 0xBE5 (PVC)
[03:36:45] [PASSED] 0x7D40 (METEORLAKE)
[03:36:45] [PASSED] 0x7D45 (METEORLAKE)
[03:36:45] [PASSED] 0x7D55 (METEORLAKE)
[03:36:45] [PASSED] 0x7D60 (METEORLAKE)
[03:36:45] [PASSED] 0x7DD5 (METEORLAKE)
[03:36:45] [PASSED] 0x6420 (LUNARLAKE)
[03:36:45] [PASSED] 0x64A0 (LUNARLAKE)
[03:36:45] [PASSED] 0x64B0 (LUNARLAKE)
[03:36:45] [PASSED] 0xE202 (BATTLEMAGE)
[03:36:45] [PASSED] 0xE209 (BATTLEMAGE)
[03:36:45] [PASSED] 0xE20B (BATTLEMAGE)
[03:36:45] [PASSED] 0xE20C (BATTLEMAGE)
[03:36:45] [PASSED] 0xE20D (BATTLEMAGE)
[03:36:45] [PASSED] 0xE210 (BATTLEMAGE)
[03:36:45] [PASSED] 0xE211 (BATTLEMAGE)
[03:36:45] [PASSED] 0xE212 (BATTLEMAGE)
[03:36:45] [PASSED] 0xE216 (BATTLEMAGE)
[03:36:45] [PASSED] 0xE220 (BATTLEMAGE)
[03:36:45] [PASSED] 0xE221 (BATTLEMAGE)
[03:36:45] [PASSED] 0xE222 (BATTLEMAGE)
[03:36:45] [PASSED] 0xE223 (BATTLEMAGE)
[03:36:45] [PASSED] 0xB080 (PANTHERLAKE)
[03:36:45] [PASSED] 0xB081 (PANTHERLAKE)
[03:36:45] [PASSED] 0xB082 (PANTHERLAKE)
[03:36:45] [PASSED] 0xB083 (PANTHERLAKE)
[03:36:45] [PASSED] 0xB084 (PANTHERLAKE)
[03:36:45] [PASSED] 0xB085 (PANTHERLAKE)
[03:36:45] [PASSED] 0xB086 (PANTHERLAKE)
[03:36:45] [PASSED] 0xB087 (PANTHERLAKE)
[03:36:45] [PASSED] 0xB08F (PANTHERLAKE)
[03:36:45] [PASSED] 0xB090 (PANTHERLAKE)
[03:36:45] [PASSED] 0xB0A0 (PANTHERLAKE)
[03:36:45] [PASSED] 0xB0B0 (PANTHERLAKE)
[03:36:45] [PASSED] 0xFD80 (PANTHERLAKE)
[03:36:45] [PASSED] 0xFD81 (PANTHERLAKE)
[03:36:45] [PASSED] 0xD740 (NOVALAKE_S)
[03:36:45] [PASSED] 0xD741 (NOVALAKE_S)
[03:36:45] [PASSED] 0xD742 (NOVALAKE_S)
[03:36:45] [PASSED] 0xD743 (NOVALAKE_S)
[03:36:45] [PASSED] 0xD744 (NOVALAKE_S)
[03:36:45] [PASSED] 0xD745 (NOVALAKE_S)
[03:36:45] [PASSED] 0x674C (CRESCENTISLAND)
[03:36:45] =============== [PASSED] check_platform_desc ===============
[03:36:45] ===================== [PASSED] xe_pci ======================
[03:36:45] =================== xe_rtp (2 subtests) ====================
[03:36:45] =============== xe_rtp_process_to_sr_tests  ================
[03:36:45] [PASSED] coalesce-same-reg
[03:36:45] [PASSED] no-match-no-add
[03:36:45] [PASSED] match-or
[03:36:45] [PASSED] match-or-xfail
[03:36:45] [PASSED] no-match-no-add-multiple-rules
[03:36:45] [PASSED] two-regs-two-entries
[03:36:45] [PASSED] clr-one-set-other
[03:36:45] [PASSED] set-field
[03:36:45] [PASSED] conflict-duplicate
[03:36:45] [PASSED] conflict-not-disjoint
[03:36:45] [PASSED] conflict-reg-type
[03:36:45] =========== [PASSED] xe_rtp_process_to_sr_tests ============
[03:36:45] ================== xe_rtp_process_tests  ===================
[03:36:45] [PASSED] active1
[03:36:45] [PASSED] active2
[03:36:45] [PASSED] active-inactive
[03:36:45] [PASSED] inactive-active
[03:36:45] [PASSED] inactive-1st_or_active-inactive
[03:36:45] [PASSED] inactive-2nd_or_active-inactive
[03:36:45] [PASSED] inactive-last_or_active-inactive
[03:36:45] [PASSED] inactive-no_or_active-inactive
[03:36:45] ============== [PASSED] xe_rtp_process_tests ===============
[03:36:45] ===================== [PASSED] xe_rtp ======================
[03:36:45] ==================== xe_wa (1 subtest) =====================
[03:36:45] ======================== xe_wa_gt  =========================
[03:36:45] [PASSED] TIGERLAKE B0
[03:36:45] [PASSED] DG1 A0
[03:36:45] [PASSED] DG1 B0
[03:36:45] [PASSED] ALDERLAKE_S A0
[03:36:45] [PASSED] ALDERLAKE_S B0
[03:36:45] [PASSED] ALDERLAKE_S C0
[03:36:45] [PASSED] ALDERLAKE_S D0
[03:36:45] [PASSED] ALDERLAKE_P A0
[03:36:45] [PASSED] ALDERLAKE_P B0
[03:36:45] [PASSED] ALDERLAKE_P C0
[03:36:45] [PASSED] ALDERLAKE_S RPLS D0
[03:36:45] [PASSED] ALDERLAKE_P RPLU E0
[03:36:45] [PASSED] DG2 G10 C0
[03:36:45] [PASSED] DG2 G11 B1
[03:36:45] [PASSED] DG2 G12 A1
[03:36:45] [PASSED] METEORLAKE 12.70(Xe_LPG) A0 13.00(Xe_LPM+) A0
[03:36:45] [PASSED] METEORLAKE 12.71(Xe_LPG) A0 13.00(Xe_LPM+) A0
[03:36:45] [PASSED] METEORLAKE 12.74(Xe_LPG+) A0 13.00(Xe_LPM+) A0
[03:36:45] [PASSED] LUNARLAKE 20.04(Xe2_LPG) A0 20.00(Xe2_LPM) A0
[03:36:45] [PASSED] LUNARLAKE 20.04(Xe2_LPG) B0 20.00(Xe2_LPM) A0
[03:36:45] [PASSED] BATTLEMAGE 20.01(Xe2_HPG) A0 13.01(Xe2_HPM) A1
[03:36:45] [PASSED] PANTHERLAKE 30.00(Xe3_LPG) A0 30.00(Xe3_LPM) A0
[03:36:45] ==================== [PASSED] xe_wa_gt =====================
[03:36:45] ====================== [PASSED] xe_wa ======================
[03:36:45] ============================================================
[03:36:45] Testing complete. Ran 512 tests: passed: 494, skipped: 18
[03:36:45] Elapsed time: 36.239s total, 4.209s configuring, 31.559s building, 0.460s running
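
The summary line above can be cross-checked mechanically: 494 passed plus 18
skipped accounts for all 512 tests. Below is a minimal sketch of such a check.
It assumes the line has exactly the shape shown in this log; other kunit.py runs
may print additional counters that this sketch does not handle. The demo_*
function names are hypothetical.

```c
#include <stdio.h>
#include <string.h>

/*
 * Parse a summary line of the form seen in this log:
 *   "... Ran <N> tests: passed: <P>, skipped: <S>"
 * Returns 0 on success, -1 if the line does not match that shape.
 */
int demo_parse_kunit_summary(const char *line, int *ran, int *passed,
			     int *skipped)
{
	const char *p = strstr(line, "Ran ");

	if (!p)
		return -1;
	if (sscanf(p, "Ran %d tests: passed: %d, skipped: %d",
		   ran, passed, skipped) != 3)
		return -1;
	return 0;
}

/* True when every test that ran is accounted for as passed or skipped. */
int demo_counts_consistent(int ran, int passed, int skipped)
{
	return ran == passed + skipped;
}
```

Feeding it the line from this run gives ran = 512, passed = 494, skipped = 18,
and the consistency check holds.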

+ /kernel/tools/testing/kunit/kunit.py run --kunitconfig /kernel/drivers/gpu/drm/tests/.kunitconfig
[03:36:46] Configuring KUnit Kernel ...
Regenerating .config ...
Populating config with:
$ make ARCH=um O=.kunit olddefconfig
[03:36:47] Building KUnit Kernel ...
Populating config with:
$ make ARCH=um O=.kunit olddefconfig
Building with:
$ make all compile_commands.json scripts_gdb ARCH=um O=.kunit --jobs=48
[03:37:12] Starting KUnit Kernel (1/1)...
[03:37:12] ============================================================
Running tests with:
$ .kunit/linux kunit.enable=1 mem=1G console=tty kunit_shutdown=halt
[03:37:13] ============ drm_test_pick_cmdline (2 subtests) ============
[03:37:13] [PASSED] drm_test_pick_cmdline_res_1920_1080_60
[03:37:13] =============== drm_test_pick_cmdline_named  ===============
[03:37:13] [PASSED] NTSC
[03:37:13] [PASSED] NTSC-J
[03:37:13] [PASSED] PAL
[03:37:13] [PASSED] PAL-M
[03:37:13] =========== [PASSED] drm_test_pick_cmdline_named ===========
[03:37:13] ============== [PASSED] drm_test_pick_cmdline ==============
[03:37:13] == drm_test_atomic_get_connector_for_encoder (1 subtest) ===
[03:37:13] [PASSED] drm_test_drm_atomic_get_connector_for_encoder
[03:37:13] ==== [PASSED] drm_test_atomic_get_connector_for_encoder ====
[03:37:13] =========== drm_validate_clone_mode (2 subtests) ===========
[03:37:13] ============== drm_test_check_in_clone_mode  ===============
[03:37:13] [PASSED] in_clone_mode
[03:37:13] [PASSED] not_in_clone_mode
[03:37:13] ========== [PASSED] drm_test_check_in_clone_mode ===========
[03:37:13] =============== drm_test_check_valid_clones  ===============
[03:37:13] [PASSED] not_in_clone_mode
[03:37:13] [PASSED] valid_clone
[03:37:13] [PASSED] invalid_clone
[03:37:13] =========== [PASSED] drm_test_check_valid_clones ===========
[03:37:13] ============= [PASSED] drm_validate_clone_mode =============
[03:37:13] ============= drm_validate_modeset (1 subtest) =============
[03:37:13] [PASSED] drm_test_check_connector_changed_modeset
[03:37:13] ============== [PASSED] drm_validate_modeset ===============
[03:37:13] ====== drm_test_bridge_get_current_state (2 subtests) ======
[03:37:13] [PASSED] drm_test_drm_bridge_get_current_state_atomic
[03:37:13] [PASSED] drm_test_drm_bridge_get_current_state_legacy
[03:37:13] ======== [PASSED] drm_test_bridge_get_current_state ========
[03:37:13] ====== drm_test_bridge_helper_reset_crtc (3 subtests) ======
[03:37:13] [PASSED] drm_test_drm_bridge_helper_reset_crtc_atomic
[03:37:13] [PASSED] drm_test_drm_bridge_helper_reset_crtc_atomic_disabled
[03:37:13] [PASSED] drm_test_drm_bridge_helper_reset_crtc_legacy
[03:37:13] ======== [PASSED] drm_test_bridge_helper_reset_crtc ========
[03:37:13] ============== drm_bridge_alloc (2 subtests) ===============
[03:37:13] [PASSED] drm_test_drm_bridge_alloc_basic
[03:37:13] [PASSED] drm_test_drm_bridge_alloc_get_put
[03:37:13] ================ [PASSED] drm_bridge_alloc =================
[03:37:13] ================== drm_buddy (8 subtests) ==================
[03:37:13] [PASSED] drm_test_buddy_alloc_limit
[03:37:13] [PASSED] drm_test_buddy_alloc_optimistic
[03:37:13] [PASSED] drm_test_buddy_alloc_pessimistic
[03:37:13] [PASSED] drm_test_buddy_alloc_pathological
[03:37:13] [PASSED] drm_test_buddy_alloc_contiguous
[03:37:13] [PASSED] drm_test_buddy_alloc_clear
[03:37:13] [PASSED] drm_test_buddy_alloc_range_bias
[03:37:13] [PASSED] drm_test_buddy_fragmentation_performance
[03:37:13] ==================== [PASSED] drm_buddy ====================
[03:37:13] ============= drm_cmdline_parser (40 subtests) =============
[03:37:13] [PASSED] drm_test_cmdline_force_d_only
[03:37:13] [PASSED] drm_test_cmdline_force_D_only_dvi
[03:37:13] [PASSED] drm_test_cmdline_force_D_only_hdmi
[03:37:13] [PASSED] drm_test_cmdline_force_D_only_not_digital
[03:37:13] [PASSED] drm_test_cmdline_force_e_only
[03:37:13] [PASSED] drm_test_cmdline_res
[03:37:13] [PASSED] drm_test_cmdline_res_vesa
[03:37:13] [PASSED] drm_test_cmdline_res_vesa_rblank
[03:37:13] [PASSED] drm_test_cmdline_res_rblank
[03:37:13] [PASSED] drm_test_cmdline_res_bpp
[03:37:13] [PASSED] drm_test_cmdline_res_refresh
[03:37:13] [PASSED] drm_test_cmdline_res_bpp_refresh
[03:37:13] [PASSED] drm_test_cmdline_res_bpp_refresh_interlaced
[03:37:13] [PASSED] drm_test_cmdline_res_bpp_refresh_margins
[03:37:13] [PASSED] drm_test_cmdline_res_bpp_refresh_force_off
[03:37:13] [PASSED] drm_test_cmdline_res_bpp_refresh_force_on
[03:37:13] [PASSED] drm_test_cmdline_res_bpp_refresh_force_on_analog
[03:37:13] [PASSED] drm_test_cmdline_res_bpp_refresh_force_on_digital
[03:37:13] [PASSED] drm_test_cmdline_res_bpp_refresh_interlaced_margins_force_on
[03:37:13] [PASSED] drm_test_cmdline_res_margins_force_on
[03:37:13] [PASSED] drm_test_cmdline_res_vesa_margins
[03:37:13] [PASSED] drm_test_cmdline_name
[03:37:13] [PASSED] drm_test_cmdline_name_bpp
[03:37:13] [PASSED] drm_test_cmdline_name_option
[03:37:13] [PASSED] drm_test_cmdline_name_bpp_option
[03:37:13] [PASSED] drm_test_cmdline_rotate_0
[03:37:13] [PASSED] drm_test_cmdline_rotate_90
[03:37:13] [PASSED] drm_test_cmdline_rotate_180
[03:37:13] [PASSED] drm_test_cmdline_rotate_270
[03:37:13] [PASSED] drm_test_cmdline_hmirror
[03:37:13] [PASSED] drm_test_cmdline_vmirror
[03:37:13] [PASSED] drm_test_cmdline_margin_options
[03:37:13] [PASSED] drm_test_cmdline_multiple_options
[03:37:13] [PASSED] drm_test_cmdline_bpp_extra_and_option
[03:37:13] [PASSED] drm_test_cmdline_extra_and_option
[03:37:13] [PASSED] drm_test_cmdline_freestanding_options
[03:37:13] [PASSED] drm_test_cmdline_freestanding_force_e_and_options
[03:37:13] [PASSED] drm_test_cmdline_panel_orientation
[03:37:13] ================ drm_test_cmdline_invalid  =================
[03:37:13] [PASSED] margin_only
[03:37:13] [PASSED] interlace_only
[03:37:13] [PASSED] res_missing_x
[03:37:13] [PASSED] res_missing_y
[03:37:13] [PASSED] res_bad_y
[03:37:13] [PASSED] res_missing_y_bpp
[03:37:13] [PASSED] res_bad_bpp
[03:37:13] [PASSED] res_bad_refresh
[03:37:13] [PASSED] res_bpp_refresh_force_on_off
[03:37:13] [PASSED] res_invalid_mode
[03:37:13] [PASSED] res_bpp_wrong_place_mode
[03:37:13] [PASSED] name_bpp_refresh
[03:37:13] [PASSED] name_refresh
[03:37:13] [PASSED] name_refresh_wrong_mode
[03:37:13] [PASSED] name_refresh_invalid_mode
[03:37:13] [PASSED] rotate_multiple
[03:37:13] [PASSED] rotate_invalid_val
[03:37:13] [PASSED] rotate_truncated
[03:37:13] [PASSED] invalid_option
[03:37:13] [PASSED] invalid_tv_option
[03:37:13] [PASSED] truncated_tv_option
[03:37:13] ============ [PASSED] drm_test_cmdline_invalid =============
[03:37:13] =============== drm_test_cmdline_tv_options  ===============
[03:37:13] [PASSED] NTSC
[03:37:13] [PASSED] NTSC_443
[03:37:13] [PASSED] NTSC_J
[03:37:13] [PASSED] PAL
[03:37:13] [PASSED] PAL_M
[03:37:13] [PASSED] PAL_N
[03:37:13] [PASSED] SECAM
[03:37:13] [PASSED] MONO_525
[03:37:13] [PASSED] MONO_625
[03:37:13] =========== [PASSED] drm_test_cmdline_tv_options ===========
[03:37:13] =============== [PASSED] drm_cmdline_parser ================
[03:37:13] ========== drmm_connector_hdmi_init (20 subtests) ==========
[03:37:13] [PASSED] drm_test_connector_hdmi_init_valid
[03:37:13] [PASSED] drm_test_connector_hdmi_init_bpc_8
[03:37:13] [PASSED] drm_test_connector_hdmi_init_bpc_10
[03:37:13] [PASSED] drm_test_connector_hdmi_init_bpc_12
[03:37:13] [PASSED] drm_test_connector_hdmi_init_bpc_invalid
[03:37:13] [PASSED] drm_test_connector_hdmi_init_bpc_null
[03:37:13] [PASSED] drm_test_connector_hdmi_init_formats_empty
[03:37:13] [PASSED] drm_test_connector_hdmi_init_formats_no_rgb
[03:37:13] === drm_test_connector_hdmi_init_formats_yuv420_allowed  ===
[03:37:13] [PASSED] supported_formats=0x9 yuv420_allowed=1
[03:37:13] [PASSED] supported_formats=0x9 yuv420_allowed=0
[03:37:13] [PASSED] supported_formats=0x3 yuv420_allowed=1
[03:37:13] [PASSED] supported_formats=0x3 yuv420_allowed=0
[03:37:13] === [PASSED] drm_test_connector_hdmi_init_formats_yuv420_allowed ===
[03:37:13] [PASSED] drm_test_connector_hdmi_init_null_ddc
[03:37:13] [PASSED] drm_test_connector_hdmi_init_null_product
[03:37:13] [PASSED] drm_test_connector_hdmi_init_null_vendor
[03:37:13] [PASSED] drm_test_connector_hdmi_init_product_length_exact
[03:37:13] [PASSED] drm_test_connector_hdmi_init_product_length_too_long
[03:37:13] [PASSED] drm_test_connector_hdmi_init_product_valid
[03:37:13] [PASSED] drm_test_connector_hdmi_init_vendor_length_exact
[03:37:13] [PASSED] drm_test_connector_hdmi_init_vendor_length_too_long
[03:37:13] [PASSED] drm_test_connector_hdmi_init_vendor_valid
[03:37:13] ========= drm_test_connector_hdmi_init_type_valid  =========
[03:37:13] [PASSED] HDMI-A
[03:37:13] [PASSED] HDMI-B
[03:37:13] ===== [PASSED] drm_test_connector_hdmi_init_type_valid =====
[03:37:13] ======== drm_test_connector_hdmi_init_type_invalid  ========
[03:37:13] [PASSED] Unknown
[03:37:13] [PASSED] VGA
[03:37:13] [PASSED] DVI-I
[03:37:13] [PASSED] DVI-D
[03:37:13] [PASSED] DVI-A
[03:37:13] [PASSED] Composite
[03:37:13] [PASSED] SVIDEO
[03:37:13] [PASSED] LVDS
[03:37:13] [PASSED] Component
[03:37:13] [PASSED] DIN
[03:37:13] [PASSED] DP
[03:37:13] [PASSED] TV
[03:37:13] [PASSED] eDP
[03:37:13] [PASSED] Virtual
[03:37:13] [PASSED] DSI
[03:37:13] [PASSED] DPI
[03:37:13] [PASSED] Writeback
[03:37:13] [PASSED] SPI
[03:37:13] [PASSED] USB
[03:37:13] ==== [PASSED] drm_test_connector_hdmi_init_type_invalid ====
[03:37:13] ============ [PASSED] drmm_connector_hdmi_init =============
[03:37:13] ============= drmm_connector_init (3 subtests) =============
[03:37:13] [PASSED] drm_test_drmm_connector_init
[03:37:13] [PASSED] drm_test_drmm_connector_init_null_ddc
[03:37:13] ========= drm_test_drmm_connector_init_type_valid  =========
[03:37:13] [PASSED] Unknown
[03:37:13] [PASSED] VGA
[03:37:13] [PASSED] DVI-I
[03:37:13] [PASSED] DVI-D
[03:37:13] [PASSED] DVI-A
[03:37:13] [PASSED] Composite
[03:37:13] [PASSED] SVIDEO
[03:37:13] [PASSED] LVDS
[03:37:13] [PASSED] Component
[03:37:13] [PASSED] DIN
[03:37:13] [PASSED] DP
[03:37:13] [PASSED] HDMI-A
[03:37:13] [PASSED] HDMI-B
[03:37:13] [PASSED] TV
[03:37:13] [PASSED] eDP
[03:37:13] [PASSED] Virtual
[03:37:13] [PASSED] DSI
[03:37:13] [PASSED] DPI
[03:37:13] [PASSED] Writeback
[03:37:13] [PASSED] SPI
[03:37:13] [PASSED] USB
[03:37:13] ===== [PASSED] drm_test_drmm_connector_init_type_valid =====
[03:37:13] =============== [PASSED] drmm_connector_init ===============
[03:37:13] ========= drm_connector_dynamic_init (6 subtests) ==========
[03:37:13] [PASSED] drm_test_drm_connector_dynamic_init
[03:37:13] [PASSED] drm_test_drm_connector_dynamic_init_null_ddc
[03:37:13] [PASSED] drm_test_drm_connector_dynamic_init_not_added
[03:37:13] [PASSED] drm_test_drm_connector_dynamic_init_properties
[03:37:13] ===== drm_test_drm_connector_dynamic_init_type_valid  ======
[03:37:13] [PASSED] Unknown
[03:37:13] [PASSED] VGA
[03:37:13] [PASSED] DVI-I
[03:37:13] [PASSED] DVI-D
[03:37:13] [PASSED] DVI-A
[03:37:13] [PASSED] Composite
[03:37:13] [PASSED] SVIDEO
[03:37:13] [PASSED] LVDS
[03:37:13] [PASSED] Component
[03:37:13] [PASSED] DIN
[03:37:13] [PASSED] DP
[03:37:13] [PASSED] HDMI-A
[03:37:13] [PASSED] HDMI-B
[03:37:13] [PASSED] TV
[03:37:13] [PASSED] eDP
[03:37:13] [PASSED] Virtual
[03:37:13] [PASSED] DSI
[03:37:13] [PASSED] DPI
[03:37:13] [PASSED] Writeback
[03:37:13] [PASSED] SPI
[03:37:13] [PASSED] USB
[03:37:13] = [PASSED] drm_test_drm_connector_dynamic_init_type_valid ==
[03:37:13] ======== drm_test_drm_connector_dynamic_init_name  =========
[03:37:13] [PASSED] Unknown
[03:37:13] [PASSED] VGA
[03:37:13] [PASSED] DVI-I
[03:37:13] [PASSED] DVI-D
[03:37:13] [PASSED] DVI-A
[03:37:13] [PASSED] Composite
[03:37:13] [PASSED] SVIDEO
[03:37:13] [PASSED] LVDS
[03:37:13] [PASSED] Component
[03:37:13] [PASSED] DIN
[03:37:13] [PASSED] DP
[03:37:13] [PASSED] HDMI-A
[03:37:13] [PASSED] HDMI-B
[03:37:13] [PASSED] TV
[03:37:13] [PASSED] eDP
[03:37:13] [PASSED] Virtual
[03:37:13] [PASSED] DSI
[03:37:13] [PASSED] DPI
[03:37:13] [PASSED] Writeback
[03:37:13] [PASSED] SPI
[03:37:13] [PASSED] USB
[03:37:13] ==== [PASSED] drm_test_drm_connector_dynamic_init_name =====
[03:37:13] =========== [PASSED] drm_connector_dynamic_init ============
[03:37:13] ==== drm_connector_dynamic_register_early (4 subtests) =====
[03:37:13] [PASSED] drm_test_drm_connector_dynamic_register_early_on_list
[03:37:13] [PASSED] drm_test_drm_connector_dynamic_register_early_defer
[03:37:13] [PASSED] drm_test_drm_connector_dynamic_register_early_no_init
[03:37:13] [PASSED] drm_test_drm_connector_dynamic_register_early_no_mode_object
[03:37:13] ====== [PASSED] drm_connector_dynamic_register_early =======
[03:37:13] ======= drm_connector_dynamic_register (7 subtests) ========
[03:37:13] [PASSED] drm_test_drm_connector_dynamic_register_on_list
[03:37:13] [PASSED] drm_test_drm_connector_dynamic_register_no_defer
[03:37:13] [PASSED] drm_test_drm_connector_dynamic_register_no_init
[03:37:13] [PASSED] drm_test_drm_connector_dynamic_register_mode_object
[03:37:13] [PASSED] drm_test_drm_connector_dynamic_register_sysfs
[03:37:13] [PASSED] drm_test_drm_connector_dynamic_register_sysfs_name
[03:37:13] [PASSED] drm_test_drm_connector_dynamic_register_debugfs
[03:37:13] ========= [PASSED] drm_connector_dynamic_register ==========
[03:37:13] = drm_connector_attach_broadcast_rgb_property (2 subtests) =
[03:37:13] [PASSED] drm_test_drm_connector_attach_broadcast_rgb_property
[03:37:13] [PASSED] drm_test_drm_connector_attach_broadcast_rgb_property_hdmi_connector
[03:37:13] === [PASSED] drm_connector_attach_broadcast_rgb_property ===
[03:37:13] ========== drm_get_tv_mode_from_name (2 subtests) ==========
[03:37:13] ========== drm_test_get_tv_mode_from_name_valid  ===========
[03:37:13] [PASSED] NTSC
[03:37:13] [PASSED] NTSC-443
[03:37:13] [PASSED] NTSC-J
[03:37:13] [PASSED] PAL
[03:37:13] [PASSED] PAL-M
[03:37:13] [PASSED] PAL-N
[03:37:13] [PASSED] SECAM
[03:37:13] [PASSED] Mono
[03:37:13] ====== [PASSED] drm_test_get_tv_mode_from_name_valid =======
[03:37:13] [PASSED] drm_test_get_tv_mode_from_name_truncated
[03:37:13] ============ [PASSED] drm_get_tv_mode_from_name ============
[03:37:13] = drm_test_connector_hdmi_compute_mode_clock (12 subtests) =
[03:37:13] [PASSED] drm_test_drm_hdmi_compute_mode_clock_rgb
[03:37:13] [PASSED] drm_test_drm_hdmi_compute_mode_clock_rgb_10bpc
[03:37:13] [PASSED] drm_test_drm_hdmi_compute_mode_clock_rgb_10bpc_vic_1
[03:37:13] [PASSED] drm_test_drm_hdmi_compute_mode_clock_rgb_12bpc
[03:37:13] [PASSED] drm_test_drm_hdmi_compute_mode_clock_rgb_12bpc_vic_1
[03:37:13] [PASSED] drm_test_drm_hdmi_compute_mode_clock_rgb_double
[03:37:13] = drm_test_connector_hdmi_compute_mode_clock_yuv420_valid  =
[03:37:13] [PASSED] VIC 96
[03:37:13] [PASSED] VIC 97
[03:37:13] [PASSED] VIC 101
[03:37:13] [PASSED] VIC 102
[03:37:13] [PASSED] VIC 106
[03:37:13] [PASSED] VIC 107
[03:37:13] === [PASSED] drm_test_connector_hdmi_compute_mode_clock_yuv420_valid ===
[03:37:13] [PASSED] drm_test_connector_hdmi_compute_mode_clock_yuv420_10_bpc
[03:37:13] [PASSED] drm_test_connector_hdmi_compute_mode_clock_yuv420_12_bpc
[03:37:13] [PASSED] drm_test_connector_hdmi_compute_mode_clock_yuv422_8_bpc
[03:37:13] [PASSED] drm_test_connector_hdmi_compute_mode_clock_yuv422_10_bpc
[03:37:13] [PASSED] drm_test_connector_hdmi_compute_mode_clock_yuv422_12_bpc
[03:37:13] === [PASSED] drm_test_connector_hdmi_compute_mode_clock ====
[03:37:13] == drm_hdmi_connector_get_broadcast_rgb_name (2 subtests) ==
[03:37:13] === drm_test_drm_hdmi_connector_get_broadcast_rgb_name  ====
[03:37:13] [PASSED] Automatic
[03:37:13] [PASSED] Full
[03:37:13] [PASSED] Limited 16:235
[03:37:13] === [PASSED] drm_test_drm_hdmi_connector_get_broadcast_rgb_name ===
[03:37:13] [PASSED] drm_test_drm_hdmi_connector_get_broadcast_rgb_name_invalid
[03:37:13] ==== [PASSED] drm_hdmi_connector_get_broadcast_rgb_name ====
[03:37:13] == drm_hdmi_connector_get_output_format_name (2 subtests) ==
[03:37:13] === drm_test_drm_hdmi_connector_get_output_format_name  ====
[03:37:13] [PASSED] RGB
[03:37:13] [PASSED] YUV 4:2:0
[03:37:13] [PASSED] YUV 4:2:2
[03:37:13] [PASSED] YUV 4:4:4
[03:37:13] === [PASSED] drm_test_drm_hdmi_connector_get_output_format_name ===
[03:37:13] [PASSED] drm_test_drm_hdmi_connector_get_output_format_name_invalid
[03:37:13] ==== [PASSED] drm_hdmi_connector_get_output_format_name ====
[03:37:13] ============= drm_damage_helper (21 subtests) ==============
[03:37:13] [PASSED] drm_test_damage_iter_no_damage
[03:37:13] [PASSED] drm_test_damage_iter_no_damage_fractional_src
[03:37:13] [PASSED] drm_test_damage_iter_no_damage_src_moved
[03:37:13] [PASSED] drm_test_damage_iter_no_damage_fractional_src_moved
[03:37:13] [PASSED] drm_test_damage_iter_no_damage_not_visible
[03:37:13] [PASSED] drm_test_damage_iter_no_damage_no_crtc
[03:37:13] [PASSED] drm_test_damage_iter_no_damage_no_fb
[03:37:13] [PASSED] drm_test_damage_iter_simple_damage
[03:37:13] [PASSED] drm_test_damage_iter_single_damage
[03:37:13] [PASSED] drm_test_damage_iter_single_damage_intersect_src
[03:37:13] [PASSED] drm_test_damage_iter_single_damage_outside_src
[03:37:13] [PASSED] drm_test_damage_iter_single_damage_fractional_src
[03:37:13] [PASSED] drm_test_damage_iter_single_damage_intersect_fractional_src
[03:37:13] [PASSED] drm_test_damage_iter_single_damage_outside_fractional_src
[03:37:13] [PASSED] drm_test_damage_iter_single_damage_src_moved
[03:37:13] [PASSED] drm_test_damage_iter_single_damage_fractional_src_moved
[03:37:13] [PASSED] drm_test_damage_iter_damage
[03:37:13] [PASSED] drm_test_damage_iter_damage_one_intersect
[03:37:13] [PASSED] drm_test_damage_iter_damage_one_outside
[03:37:13] [PASSED] drm_test_damage_iter_damage_src_moved
[03:37:13] [PASSED] drm_test_damage_iter_damage_not_visible
[03:37:13] ================ [PASSED] drm_damage_helper ================
[03:37:13] ============== drm_dp_mst_helper (3 subtests) ==============
[03:37:13] ============== drm_test_dp_mst_calc_pbn_mode  ==============
[03:37:13] [PASSED] Clock 154000 BPP 30 DSC disabled
[03:37:13] [PASSED] Clock 234000 BPP 30 DSC disabled
[03:37:13] [PASSED] Clock 297000 BPP 24 DSC disabled
[03:37:13] [PASSED] Clock 332880 BPP 24 DSC enabled
[03:37:13] [PASSED] Clock 324540 BPP 24 DSC enabled
[03:37:13] ========== [PASSED] drm_test_dp_mst_calc_pbn_mode ==========
[03:37:13] ============== drm_test_dp_mst_calc_pbn_div  ===============
[03:37:13] [PASSED] Link rate 2000000 lane count 4
[03:37:13] [PASSED] Link rate 2000000 lane count 2
[03:37:13] [PASSED] Link rate 2000000 lane count 1
[03:37:13] [PASSED] Link rate 1350000 lane count 4
[03:37:13] [PASSED] Link rate 1350000 lane count 2
[03:37:13] [PASSED] Link rate 1350000 lane count 1
[03:37:13] [PASSED] Link rate 1000000 lane count 4
[03:37:13] [PASSED] Link rate 1000000 lane count 2
[03:37:13] [PASSED] Link rate 1000000 lane count 1
[03:37:13] [PASSED] Link rate 810000 lane count 4
[03:37:13] [PASSED] Link rate 810000 lane count 2
[03:37:13] [PASSED] Link rate 810000 lane count 1
[03:37:13] [PASSED] Link rate 540000 lane count 4
[03:37:13] [PASSED] Link rate 540000 lane count 2
[03:37:13] [PASSED] Link rate 540000 lane count 1
[03:37:13] [PASSED] Link rate 270000 lane count 4
[03:37:13] [PASSED] Link rate 270000 lane count 2
[03:37:13] [PASSED] Link rate 270000 lane count 1
[03:37:13] [PASSED] Link rate 162000 lane count 4
[03:37:13] [PASSED] Link rate 162000 lane count 2
[03:37:13] [PASSED] Link rate 162000 lane count 1
[03:37:13] ========== [PASSED] drm_test_dp_mst_calc_pbn_div ===========
[03:37:13] ========= drm_test_dp_mst_sideband_msg_req_decode  =========
[03:37:13] [PASSED] DP_ENUM_PATH_RESOURCES with port number
[03:37:13] [PASSED] DP_POWER_UP_PHY with port number
[03:37:13] [PASSED] DP_POWER_DOWN_PHY with port number
[03:37:13] [PASSED] DP_ALLOCATE_PAYLOAD with SDP stream sinks
[03:37:13] [PASSED] DP_ALLOCATE_PAYLOAD with port number
[03:37:13] [PASSED] DP_ALLOCATE_PAYLOAD with VCPI
[03:37:13] [PASSED] DP_ALLOCATE_PAYLOAD with PBN
[03:37:13] [PASSED] DP_QUERY_PAYLOAD with port number
[03:37:13] [PASSED] DP_QUERY_PAYLOAD with VCPI
[03:37:13] [PASSED] DP_REMOTE_DPCD_READ with port number
[03:37:13] [PASSED] DP_REMOTE_DPCD_READ with DPCD address
[03:37:13] [PASSED] DP_REMOTE_DPCD_READ with max number of bytes
[03:37:13] [PASSED] DP_REMOTE_DPCD_WRITE with port number
[03:37:13] [PASSED] DP_REMOTE_DPCD_WRITE with DPCD address
[03:37:13] [PASSED] DP_REMOTE_DPCD_WRITE with data array
[03:37:13] [PASSED] DP_REMOTE_I2C_READ with port number
[03:37:13] [PASSED] DP_REMOTE_I2C_READ with I2C device ID
[03:37:13] [PASSED] DP_REMOTE_I2C_READ with transactions array
[03:37:13] [PASSED] DP_REMOTE_I2C_WRITE with port number
[03:37:13] [PASSED] DP_REMOTE_I2C_WRITE with I2C device ID
[03:37:13] [PASSED] DP_REMOTE_I2C_WRITE with data array
[03:37:13] [PASSED] DP_QUERY_STREAM_ENC_STATUS with stream ID
[03:37:13] [PASSED] DP_QUERY_STREAM_ENC_STATUS with client ID
[03:37:13] [PASSED] DP_QUERY_STREAM_ENC_STATUS with stream event
[03:37:13] [PASSED] DP_QUERY_STREAM_ENC_STATUS with valid stream event
[03:37:13] [PASSED] DP_QUERY_STREAM_ENC_STATUS with stream behavior
[03:37:13] [PASSED] DP_QUERY_STREAM_ENC_STATUS with a valid stream behavior
[03:37:13] ===== [PASSED] drm_test_dp_mst_sideband_msg_req_decode =====
[03:37:13] ================ [PASSED] drm_dp_mst_helper ================
[03:37:13] ================== drm_exec (7 subtests) ===================
[03:37:13] [PASSED] sanitycheck
[03:37:13] [PASSED] test_lock
[03:37:13] [PASSED] test_lock_unlock
[03:37:13] [PASSED] test_duplicates
[03:37:13] [PASSED] test_prepare
[03:37:13] [PASSED] test_prepare_array
[03:37:13] [PASSED] test_multiple_loops
[03:37:13] ==================== [PASSED] drm_exec =====================
[03:37:13] =========== drm_format_helper_test (17 subtests) ===========
[03:37:13] ============== drm_test_fb_xrgb8888_to_gray8  ==============
[03:37:13] [PASSED] single_pixel_source_buffer
[03:37:13] [PASSED] single_pixel_clip_rectangle
[03:37:13] [PASSED] well_known_colors
[03:37:13] [PASSED] destination_pitch
[03:37:13] ========== [PASSED] drm_test_fb_xrgb8888_to_gray8 ==========
[03:37:13] ============= drm_test_fb_xrgb8888_to_rgb332  ==============
[03:37:13] [PASSED] single_pixel_source_buffer
[03:37:13] [PASSED] single_pixel_clip_rectangle
[03:37:13] [PASSED] well_known_colors
[03:37:13] [PASSED] destination_pitch
[03:37:13] ========= [PASSED] drm_test_fb_xrgb8888_to_rgb332 ==========
[03:37:13] ============= drm_test_fb_xrgb8888_to_rgb565  ==============
[03:37:13] [PASSED] single_pixel_source_buffer
[03:37:13] [PASSED] single_pixel_clip_rectangle
[03:37:13] [PASSED] well_known_colors
[03:37:13] [PASSED] destination_pitch
[03:37:13] ========= [PASSED] drm_test_fb_xrgb8888_to_rgb565 ==========
[03:37:13] ============ drm_test_fb_xrgb8888_to_xrgb1555  =============
[03:37:13] [PASSED] single_pixel_source_buffer
[03:37:13] [PASSED] single_pixel_clip_rectangle
[03:37:13] [PASSED] well_known_colors
[03:37:13] [PASSED] destination_pitch
[03:37:13] ======== [PASSED] drm_test_fb_xrgb8888_to_xrgb1555 =========
[03:37:13] ============ drm_test_fb_xrgb8888_to_argb1555  =============
[03:37:13] [PASSED] single_pixel_source_buffer
[03:37:13] [PASSED] single_pixel_clip_rectangle
[03:37:13] [PASSED] well_known_colors
[03:37:13] [PASSED] destination_pitch
[03:37:13] ======== [PASSED] drm_test_fb_xrgb8888_to_argb1555 =========
[03:37:13] ============ drm_test_fb_xrgb8888_to_rgba5551  =============
[03:37:13] [PASSED] single_pixel_source_buffer
[03:37:13] [PASSED] single_pixel_clip_rectangle
[03:37:13] [PASSED] well_known_colors
[03:37:13] [PASSED] destination_pitch
[03:37:13] ======== [PASSED] drm_test_fb_xrgb8888_to_rgba5551 =========
[03:37:13] ============= drm_test_fb_xrgb8888_to_rgb888  ==============
[03:37:13] [PASSED] single_pixel_source_buffer
[03:37:13] [PASSED] single_pixel_clip_rectangle
[03:37:13] [PASSED] well_known_colors
[03:37:13] [PASSED] destination_pitch
[03:37:13] ========= [PASSED] drm_test_fb_xrgb8888_to_rgb888 ==========
[03:37:13] ============= drm_test_fb_xrgb8888_to_bgr888  ==============
[03:37:13] [PASSED] single_pixel_source_buffer
[03:37:13] [PASSED] single_pixel_clip_rectangle
[03:37:13] [PASSED] well_known_colors
[03:37:13] [PASSED] destination_pitch
[03:37:13] ========= [PASSED] drm_test_fb_xrgb8888_to_bgr888 ==========
[03:37:13] ============ drm_test_fb_xrgb8888_to_argb8888  =============
[03:37:13] [PASSED] single_pixel_source_buffer
[03:37:13] [PASSED] single_pixel_clip_rectangle
[03:37:13] [PASSED] well_known_colors
[03:37:13] [PASSED] destination_pitch
[03:37:13] ======== [PASSED] drm_test_fb_xrgb8888_to_argb8888 =========
[03:37:13] =========== drm_test_fb_xrgb8888_to_xrgb2101010  ===========
[03:37:13] [PASSED] single_pixel_source_buffer
[03:37:13] [PASSED] single_pixel_clip_rectangle
[03:37:13] [PASSED] well_known_colors
[03:37:13] [PASSED] destination_pitch
[03:37:13] ======= [PASSED] drm_test_fb_xrgb8888_to_xrgb2101010 =======
[03:37:13] =========== drm_test_fb_xrgb8888_to_argb2101010  ===========
[03:37:13] [PASSED] single_pixel_source_buffer
[03:37:13] [PASSED] single_pixel_clip_rectangle
[03:37:13] [PASSED] well_known_colors
[03:37:13] [PASSED] destination_pitch
[03:37:13] ======= [PASSED] drm_test_fb_xrgb8888_to_argb2101010 =======
[03:37:13] ============== drm_test_fb_xrgb8888_to_mono  ===============
[03:37:13] [PASSED] single_pixel_source_buffer
[03:37:13] [PASSED] single_pixel_clip_rectangle
[03:37:13] [PASSED] well_known_colors
[03:37:13] [PASSED] destination_pitch
[03:37:13] ========== [PASSED] drm_test_fb_xrgb8888_to_mono ===========
[03:37:13] ==================== drm_test_fb_swab  =====================
[03:37:13] [PASSED] single_pixel_source_buffer
[03:37:13] [PASSED] single_pixel_clip_rectangle
[03:37:13] [PASSED] well_known_colors
[03:37:13] [PASSED] destination_pitch
[03:37:13] ================ [PASSED] drm_test_fb_swab =================
[03:37:13] ============ drm_test_fb_xrgb8888_to_xbgr8888  =============
[03:37:13] [PASSED] single_pixel_source_buffer
[03:37:13] [PASSED] single_pixel_clip_rectangle
[03:37:13] [PASSED] well_known_colors
[03:37:13] [PASSED] destination_pitch
[03:37:13] ======== [PASSED] drm_test_fb_xrgb8888_to_xbgr8888 =========
[03:37:13] ============ drm_test_fb_xrgb8888_to_abgr8888  =============
[03:37:13] [PASSED] single_pixel_source_buffer
[03:37:13] [PASSED] single_pixel_clip_rectangle
[03:37:13] [PASSED] well_known_colors
[03:37:13] [PASSED] destination_pitch
[03:37:13] ======== [PASSED] drm_test_fb_xrgb8888_to_abgr8888 =========
[03:37:13] ================= drm_test_fb_clip_offset  =================
[03:37:13] [PASSED] pass through
[03:37:13] [PASSED] horizontal offset
[03:37:13] [PASSED] vertical offset
[03:37:13] [PASSED] horizontal and vertical offset
[03:37:13] [PASSED] horizontal offset (custom pitch)
[03:37:13] [PASSED] vertical offset (custom pitch)
[03:37:13] [PASSED] horizontal and vertical offset (custom pitch)
[03:37:13] ============= [PASSED] drm_test_fb_clip_offset =============
[03:37:13] =================== drm_test_fb_memcpy  ====================
[03:37:13] [PASSED] single_pixel_source_buffer: XR24 little-endian (0x34325258)
[03:37:13] [PASSED] single_pixel_source_buffer: XRA8 little-endian (0x38415258)
[03:37:13] [PASSED] single_pixel_source_buffer: YU24 little-endian (0x34325559)
[03:37:13] [PASSED] single_pixel_clip_rectangle: XB24 little-endian (0x34324258)
[03:37:13] [PASSED] single_pixel_clip_rectangle: XRA8 little-endian (0x38415258)
[03:37:13] [PASSED] single_pixel_clip_rectangle: YU24 little-endian (0x34325559)
[03:37:13] [PASSED] well_known_colors: XB24 little-endian (0x34324258)
[03:37:13] [PASSED] well_known_colors: XRA8 little-endian (0x38415258)
[03:37:13] [PASSED] well_known_colors: YU24 little-endian (0x34325559)
[03:37:13] [PASSED] destination_pitch: XB24 little-endian (0x34324258)
[03:37:13] [PASSED] destination_pitch: XRA8 little-endian (0x38415258)
[03:37:13] [PASSED] destination_pitch: YU24 little-endian (0x34325559)
[03:37:13] =============== [PASSED] drm_test_fb_memcpy ================
[03:37:13] ============= [PASSED] drm_format_helper_test ==============
[03:37:13] ================= drm_format (18 subtests) =================
[03:37:13] [PASSED] drm_test_format_block_width_invalid
[03:37:13] [PASSED] drm_test_format_block_width_one_plane
[03:37:13] [PASSED] drm_test_format_block_width_two_plane
[03:37:13] [PASSED] drm_test_format_block_width_three_plane
[03:37:13] [PASSED] drm_test_format_block_width_tiled
[03:37:13] [PASSED] drm_test_format_block_height_invalid
[03:37:13] [PASSED] drm_test_format_block_height_one_plane
[03:37:13] [PASSED] drm_test_format_block_height_two_plane
[03:37:13] [PASSED] drm_test_format_block_height_three_plane
[03:37:13] [PASSED] drm_test_format_block_height_tiled
[03:37:13] [PASSED] drm_test_format_min_pitch_invalid
[03:37:13] [PASSED] drm_test_format_min_pitch_one_plane_8bpp
[03:37:13] [PASSED] drm_test_format_min_pitch_one_plane_16bpp
[03:37:13] [PASSED] drm_test_format_min_pitch_one_plane_24bpp
[03:37:13] [PASSED] drm_test_format_min_pitch_one_plane_32bpp
[03:37:13] [PASSED] drm_test_format_min_pitch_two_plane
[03:37:13] [PASSED] drm_test_format_min_pitch_three_plane_8bpp
[03:37:13] [PASSED] drm_test_format_min_pitch_tiled
[03:37:13] =================== [PASSED] drm_format ====================
[03:37:13] ============== drm_framebuffer (10 subtests) ===============
[03:37:13] ========== drm_test_framebuffer_check_src_coords  ==========
[03:37:13] [PASSED] Success: source fits into fb
[03:37:13] [PASSED] Fail: overflowing fb with x-axis coordinate
[03:37:13] [PASSED] Fail: overflowing fb with y-axis coordinate
[03:37:13] [PASSED] Fail: overflowing fb with source width
[03:37:13] [PASSED] Fail: overflowing fb with source height
[03:37:13] ====== [PASSED] drm_test_framebuffer_check_src_coords ======
[03:37:13] [PASSED] drm_test_framebuffer_cleanup
[03:37:13] =============== drm_test_framebuffer_create  ===============
[03:37:13] [PASSED] ABGR8888 normal sizes
[03:37:13] [PASSED] ABGR8888 max sizes
[03:37:13] [PASSED] ABGR8888 pitch greater than min required
[03:37:13] [PASSED] ABGR8888 pitch less than min required
[03:37:13] [PASSED] ABGR8888 Invalid width
[03:37:13] [PASSED] ABGR8888 Invalid buffer handle
[03:37:13] [PASSED] No pixel format
[03:37:13] [PASSED] ABGR8888 Width 0
[03:37:13] [PASSED] ABGR8888 Height 0
[03:37:13] [PASSED] ABGR8888 Out of bound height * pitch combination
[03:37:13] [PASSED] ABGR8888 Large buffer offset
[03:37:13] [PASSED] ABGR8888 Buffer offset for inexistent plane
[03:37:13] [PASSED] ABGR8888 Invalid flag
[03:37:13] [PASSED] ABGR8888 Set DRM_MODE_FB_MODIFIERS without modifiers
[03:37:13] [PASSED] ABGR8888 Valid buffer modifier
[03:37:13] [PASSED] ABGR8888 Invalid buffer modifier(DRM_FORMAT_MOD_SAMSUNG_64_32_TILE)
[03:37:13] [PASSED] ABGR8888 Extra pitches without DRM_MODE_FB_MODIFIERS
[03:37:13] [PASSED] ABGR8888 Extra pitches with DRM_MODE_FB_MODIFIERS
[03:37:13] [PASSED] NV12 Normal sizes
[03:37:13] [PASSED] NV12 Max sizes
[03:37:13] [PASSED] NV12 Invalid pitch
[03:37:13] [PASSED] NV12 Invalid modifier/missing DRM_MODE_FB_MODIFIERS flag
[03:37:13] [PASSED] NV12 different  modifier per-plane
[03:37:13] [PASSED] NV12 with DRM_FORMAT_MOD_SAMSUNG_64_32_TILE
[03:37:13] [PASSED] NV12 Valid modifiers without DRM_MODE_FB_MODIFIERS
[03:37:13] [PASSED] NV12 Modifier for inexistent plane
[03:37:13] [PASSED] NV12 Handle for inexistent plane
[03:37:13] [PASSED] NV12 Handle for inexistent plane without DRM_MODE_FB_MODIFIERS
[03:37:13] [PASSED] YVU420 DRM_MODE_FB_MODIFIERS set without modifier
[03:37:13] [PASSED] YVU420 Normal sizes
[03:37:13] [PASSED] YVU420 Max sizes
[03:37:13] [PASSED] YVU420 Invalid pitch
[03:37:13] [PASSED] YVU420 Different pitches
[03:37:13] [PASSED] YVU420 Different buffer offsets/pitches
[03:37:13] [PASSED] YVU420 Modifier set just for plane 0, without DRM_MODE_FB_MODIFIERS
[03:37:13] [PASSED] YVU420 Modifier set just for planes 0, 1, without DRM_MODE_FB_MODIFIERS
[03:37:13] [PASSED] YVU420 Modifier set just for plane 0, 1, with DRM_MODE_FB_MODIFIERS
[03:37:13] [PASSED] YVU420 Valid modifier
[03:37:13] [PASSED] YVU420 Different modifiers per plane
[03:37:13] [PASSED] YVU420 Modifier for inexistent plane
[03:37:13] [PASSED] YUV420_10BIT Invalid modifier(DRM_FORMAT_MOD_LINEAR)
[03:37:13] [PASSED] X0L2 Normal sizes
[03:37:13] [PASSED] X0L2 Max sizes
[03:37:13] [PASSED] X0L2 Invalid pitch
[03:37:13] [PASSED] X0L2 Pitch greater than minimum required
[03:37:13] [PASSED] X0L2 Handle for inexistent plane
[03:37:13] [PASSED] X0L2 Offset for inexistent plane, without DRM_MODE_FB_MODIFIERS set
[03:37:13] [PASSED] X0L2 Modifier without DRM_MODE_FB_MODIFIERS set
[03:37:13] [PASSED] X0L2 Valid modifier
[03:37:13] [PASSED] X0L2 Modifier for inexistent plane
[03:37:13] =========== [PASSED] drm_test_framebuffer_create ===========
[03:37:13] [PASSED] drm_test_framebuffer_free
[03:37:13] [PASSED] drm_test_framebuffer_init
[03:37:13] [PASSED] drm_test_framebuffer_init_bad_format
[03:37:13] [PASSED] drm_test_framebuffer_init_dev_mismatch
[03:37:13] [PASSED] drm_test_framebuffer_lookup
[03:37:13] [PASSED] drm_test_framebuffer_lookup_inexistent
[03:37:13] [PASSED] drm_test_framebuffer_modifiers_not_supported
[03:37:13] ================= [PASSED] drm_framebuffer =================
[03:37:13] ================ drm_gem_shmem (8 subtests) ================
[03:37:13] [PASSED] drm_gem_shmem_test_obj_create
[03:37:13] [PASSED] drm_gem_shmem_test_obj_create_private
[03:37:13] [PASSED] drm_gem_shmem_test_pin_pages
[03:37:13] [PASSED] drm_gem_shmem_test_vmap
[03:37:13] [PASSED] drm_gem_shmem_test_get_sg_table
[03:37:13] [PASSED] drm_gem_shmem_test_get_pages_sgt
[03:37:13] [PASSED] drm_gem_shmem_test_madvise
[03:37:13] [PASSED] drm_gem_shmem_test_purge
[03:37:13] ================== [PASSED] drm_gem_shmem ==================
[03:37:13] === drm_atomic_helper_connector_hdmi_check (27 subtests) ===
[03:37:13] [PASSED] drm_test_check_broadcast_rgb_auto_cea_mode
[03:37:13] [PASSED] drm_test_check_broadcast_rgb_auto_cea_mode_vic_1
[03:37:13] [PASSED] drm_test_check_broadcast_rgb_full_cea_mode
[03:37:13] [PASSED] drm_test_check_broadcast_rgb_full_cea_mode_vic_1
[03:37:13] [PASSED] drm_test_check_broadcast_rgb_limited_cea_mode
[03:37:13] [PASSED] drm_test_check_broadcast_rgb_limited_cea_mode_vic_1
[03:37:13] ====== drm_test_check_broadcast_rgb_cea_mode_yuv420  =======
[03:37:13] [PASSED] Automatic
[03:37:13] [PASSED] Full
[03:37:13] [PASSED] Limited 16:235
[03:37:13] == [PASSED] drm_test_check_broadcast_rgb_cea_mode_yuv420 ===
[03:37:13] [PASSED] drm_test_check_broadcast_rgb_crtc_mode_changed
[03:37:13] [PASSED] drm_test_check_broadcast_rgb_crtc_mode_not_changed
[03:37:13] [PASSED] drm_test_check_disable_connector
[03:37:13] [PASSED] drm_test_check_hdmi_funcs_reject_rate
[03:37:13] [PASSED] drm_test_check_max_tmds_rate_bpc_fallback_rgb
[03:37:13] [PASSED] drm_test_check_max_tmds_rate_bpc_fallback_yuv420
[03:37:13] [PASSED] drm_test_check_max_tmds_rate_bpc_fallback_ignore_yuv422
[03:37:13] [PASSED] drm_test_check_max_tmds_rate_bpc_fallback_ignore_yuv420
[03:37:13] [PASSED] drm_test_check_driver_unsupported_fallback_yuv420
[03:37:13] [PASSED] drm_test_check_output_bpc_crtc_mode_changed
[03:37:13] [PASSED] drm_test_check_output_bpc_crtc_mode_not_changed
[03:37:13] [PASSED] drm_test_check_output_bpc_dvi
[03:37:13] [PASSED] drm_test_check_output_bpc_format_vic_1
[03:37:13] [PASSED] drm_test_check_output_bpc_format_display_8bpc_only
[03:37:13] [PASSED] drm_test_check_output_bpc_format_display_rgb_only
[03:37:13] [PASSED] drm_test_check_output_bpc_format_driver_8bpc_only
[03:37:13] [PASSED] drm_test_check_output_bpc_format_driver_rgb_only
[03:37:13] [PASSED] drm_test_check_tmds_char_rate_rgb_8bpc
[03:37:13] [PASSED] drm_test_check_tmds_char_rate_rgb_10bpc
[03:37:13] [PASSED] drm_test_check_tmds_char_rate_rgb_12bpc
[03:37:13] ===== [PASSED] drm_atomic_helper_connector_hdmi_check ======
[03:37:13] === drm_atomic_helper_connector_hdmi_reset (6 subtests) ====
[03:37:13] [PASSED] drm_test_check_broadcast_rgb_value
[03:37:13] [PASSED] drm_test_check_bpc_8_value
[03:37:13] [PASSED] drm_test_check_bpc_10_value
[03:37:13] [PASSED] drm_test_check_bpc_12_value
[03:37:13] [PASSED] drm_test_check_format_value
[03:37:13] [PASSED] drm_test_check_tmds_char_value
[03:37:13] ===== [PASSED] drm_atomic_helper_connector_hdmi_reset ======
[03:37:13] = drm_atomic_helper_connector_hdmi_mode_valid (4 subtests) =
[03:37:13] [PASSED] drm_test_check_mode_valid
[03:37:13] [PASSED] drm_test_check_mode_valid_reject
[03:37:13] [PASSED] drm_test_check_mode_valid_reject_rate
[03:37:13] [PASSED] drm_test_check_mode_valid_reject_max_clock
[03:37:13] === [PASSED] drm_atomic_helper_connector_hdmi_mode_valid ===
[03:37:13] ================= drm_managed (2 subtests) =================
[03:37:13] [PASSED] drm_test_managed_release_action
[03:37:13] [PASSED] drm_test_managed_run_action
[03:37:13] =================== [PASSED] drm_managed ===================
[03:37:13] =================== drm_mm (6 subtests) ====================
[03:37:13] [PASSED] drm_test_mm_init
[03:37:13] [PASSED] drm_test_mm_debug
[03:37:13] [PASSED] drm_test_mm_align32
[03:37:13] [PASSED] drm_test_mm_align64
[03:37:13] [PASSED] drm_test_mm_lowest
[03:37:13] [PASSED] drm_test_mm_highest
[03:37:13] ===================== [PASSED] drm_mm ======================
[03:37:13] ============= drm_modes_analog_tv (5 subtests) =============
[03:37:13] [PASSED] drm_test_modes_analog_tv_mono_576i
[03:37:13] [PASSED] drm_test_modes_analog_tv_ntsc_480i
[03:37:13] [PASSED] drm_test_modes_analog_tv_ntsc_480i_inlined
[03:37:13] [PASSED] drm_test_modes_analog_tv_pal_576i
[03:37:13] [PASSED] drm_test_modes_analog_tv_pal_576i_inlined
[03:37:13] =============== [PASSED] drm_modes_analog_tv ===============
[03:37:13] ============== drm_plane_helper (2 subtests) ===============
[03:37:13] =============== drm_test_check_plane_state  ================
[03:37:13] [PASSED] clipping_simple
[03:37:13] [PASSED] clipping_rotate_reflect
[03:37:13] [PASSED] positioning_simple
[03:37:13] [PASSED] upscaling
[03:37:13] [PASSED] downscaling
[03:37:13] [PASSED] rounding1
[03:37:13] [PASSED] rounding2
[03:37:13] [PASSED] rounding3
[03:37:13] [PASSED] rounding4
[03:37:13] =========== [PASSED] drm_test_check_plane_state ============
[03:37:13] =========== drm_test_check_invalid_plane_state  ============
[03:37:13] [PASSED] positioning_invalid
[03:37:13] [PASSED] upscaling_invalid
[03:37:13] [PASSED] downscaling_invalid
[03:37:13] ======= [PASSED] drm_test_check_invalid_plane_state ========
[03:37:13] ================ [PASSED] drm_plane_helper =================
[03:37:13] ====== drm_connector_helper_tv_get_modes (1 subtest) =======
[03:37:13] ====== drm_test_connector_helper_tv_get_modes_check  =======
[03:37:13] [PASSED] None
[03:37:13] [PASSED] PAL
[03:37:13] [PASSED] NTSC
[03:37:13] [PASSED] Both, NTSC Default
[03:37:13] [PASSED] Both, PAL Default
[03:37:13] [PASSED] Both, NTSC Default, with PAL on command-line
[03:37:13] [PASSED] Both, PAL Default, with NTSC on command-line
[03:37:13] == [PASSED] drm_test_connector_helper_tv_get_modes_check ===
[03:37:13] ======== [PASSED] drm_connector_helper_tv_get_modes ========
[03:37:13] ================== drm_rect (9 subtests) ===================
[03:37:13] [PASSED] drm_test_rect_clip_scaled_div_by_zero
[03:37:13] [PASSED] drm_test_rect_clip_scaled_not_clipped
[03:37:13] [PASSED] drm_test_rect_clip_scaled_clipped
[03:37:13] [PASSED] drm_test_rect_clip_scaled_signed_vs_unsigned
[03:37:13] ================= drm_test_rect_intersect  =================
[03:37:13] [PASSED] top-left x bottom-right: 2x2+1+1 x 2x2+0+0
[03:37:13] [PASSED] top-right x bottom-left: 2x2+0+0 x 2x2+1-1
[03:37:13] [PASSED] bottom-left x top-right: 2x2+1-1 x 2x2+0+0
[03:37:13] [PASSED] bottom-right x top-left: 2x2+0+0 x 2x2+1+1
[03:37:13] [PASSED] right x left: 2x1+0+0 x 3x1+1+0
[03:37:13] [PASSED] left x right: 3x1+1+0 x 2x1+0+0
[03:37:13] [PASSED] up x bottom: 1x2+0+0 x 1x3+0-1
[03:37:13] [PASSED] bottom x up: 1x3+0-1 x 1x2+0+0
[03:37:13] [PASSED] touching corner: 1x1+0+0 x 2x2+1+1
[03:37:13] [PASSED] touching side: 1x1+0+0 x 1x1+1+0
[03:37:13] [PASSED] equal rects: 2x2+0+0 x 2x2+0+0
[03:37:13] [PASSED] inside another: 2x2+0+0 x 1x1+1+1
[03:37:13] [PASSED] far away: 1x1+0+0 x 1x1+3+6
[03:37:13] [PASSED] points intersecting: 0x0+5+10 x 0x0+5+10
[03:37:13] [PASSED] points not intersecting: 0x0+0+0 x 0x0+5+10
[03:37:13] ============= [PASSED] drm_test_rect_intersect =============
[03:37:13] ================ drm_test_rect_calc_hscale  ================
[03:37:13] [PASSED] normal use
[03:37:13] [PASSED] out of max range
[03:37:13] [PASSED] out of min range
[03:37:13] [PASSED] zero dst
[03:37:13] [PASSED] negative src
[03:37:13] [PASSED] negative dst
[03:37:13] ============ [PASSED] drm_test_rect_calc_hscale ============
[03:37:13] ================ drm_test_rect_calc_vscale  ================
[03:37:13] [PASSED] normal use
stty: 'standard input': Inappropriate ioctl for device
[03:37:13] [PASSED] out of max range
[03:37:13] [PASSED] out of min range
[03:37:13] [PASSED] zero dst
[03:37:13] [PASSED] negative src
[03:37:13] [PASSED] negative dst
[03:37:13] ============ [PASSED] drm_test_rect_calc_vscale ============
[03:37:13] ================== drm_test_rect_rotate  ===================
[03:37:13] [PASSED] reflect-x
[03:37:13] [PASSED] reflect-y
[03:37:13] [PASSED] rotate-0
[03:37:13] [PASSED] rotate-90
[03:37:13] [PASSED] rotate-180
[03:37:13] [PASSED] rotate-270
[03:37:13] ============== [PASSED] drm_test_rect_rotate ===============
[03:37:13] ================ drm_test_rect_rotate_inv  =================
[03:37:13] [PASSED] reflect-x
[03:37:13] [PASSED] reflect-y
[03:37:13] [PASSED] rotate-0
[03:37:13] [PASSED] rotate-90
[03:37:13] [PASSED] rotate-180
[03:37:13] [PASSED] rotate-270
[03:37:13] ============ [PASSED] drm_test_rect_rotate_inv =============
[03:37:13] ==================== [PASSED] drm_rect =====================
[03:37:13] ============ drm_sysfb_modeset_test (1 subtest) ============
[03:37:13] ============ drm_test_sysfb_build_fourcc_list  =============
[03:37:13] [PASSED] no native formats
[03:37:13] [PASSED] XRGB8888 as native format
[03:37:13] [PASSED] remove duplicates
[03:37:13] [PASSED] convert alpha formats
[03:37:13] [PASSED] random formats
[03:37:13] ======== [PASSED] drm_test_sysfb_build_fourcc_list =========
[03:37:13] ============= [PASSED] drm_sysfb_modeset_test ==============
[03:37:13] ================== drm_fixp (2 subtests) ===================
[03:37:13] [PASSED] drm_test_int2fixp
[03:37:13] [PASSED] drm_test_sm2fixp
[03:37:13] ==================== [PASSED] drm_fixp =====================
[03:37:13] ============================================================
[03:37:13] Testing complete. Ran 624 tests: passed: 624
[03:37:13] Elapsed time: 27.302s total, 1.634s configuring, 25.248s building, 0.373s running

+ /kernel/tools/testing/kunit/kunit.py run --kunitconfig /kernel/drivers/gpu/drm/ttm/tests/.kunitconfig
[03:37:13] Configuring KUnit Kernel ...
Regenerating .config ...
Populating config with:
$ make ARCH=um O=.kunit olddefconfig
[03:37:15] Building KUnit Kernel ...
Populating config with:
$ make ARCH=um O=.kunit olddefconfig
Building with:
$ make all compile_commands.json scripts_gdb ARCH=um O=.kunit --jobs=48
[03:37:24] Starting KUnit Kernel (1/1)...
[03:37:24] ============================================================
Running tests with:
$ .kunit/linux kunit.enable=1 mem=1G console=tty kunit_shutdown=halt
[03:37:24] ================= ttm_device (5 subtests) ==================
[03:37:24] [PASSED] ttm_device_init_basic
[03:37:24] [PASSED] ttm_device_init_multiple
[03:37:24] [PASSED] ttm_device_fini_basic
[03:37:24] [PASSED] ttm_device_init_no_vma_man
[03:37:24] ================== ttm_device_init_pools  ==================
[03:37:24] [PASSED] No DMA allocations, no DMA32 required
[03:37:24] [PASSED] DMA allocations, DMA32 required
[03:37:24] [PASSED] No DMA allocations, DMA32 required
[03:37:24] [PASSED] DMA allocations, no DMA32 required
[03:37:24] ============== [PASSED] ttm_device_init_pools ==============
[03:37:24] =================== [PASSED] ttm_device ====================
[03:37:24] ================== ttm_pool (8 subtests) ===================
[03:37:24] ================== ttm_pool_alloc_basic  ===================
[03:37:24] [PASSED] One page
[03:37:24] [PASSED] More than one page
[03:37:24] [PASSED] Above the allocation limit
[03:37:24] [PASSED] One page, with coherent DMA mappings enabled
[03:37:24] [PASSED] Above the allocation limit, with coherent DMA mappings enabled
[03:37:24] ============== [PASSED] ttm_pool_alloc_basic ===============
[03:37:24] ============== ttm_pool_alloc_basic_dma_addr  ==============
[03:37:24] [PASSED] One page
[03:37:24] [PASSED] More than one page
[03:37:24] [PASSED] Above the allocation limit
[03:37:24] [PASSED] One page, with coherent DMA mappings enabled
[03:37:24] [PASSED] Above the allocation limit, with coherent DMA mappings enabled
[03:37:24] ========== [PASSED] ttm_pool_alloc_basic_dma_addr ==========
[03:37:24] [PASSED] ttm_pool_alloc_order_caching_match
[03:37:24] [PASSED] ttm_pool_alloc_caching_mismatch
[03:37:24] [PASSED] ttm_pool_alloc_order_mismatch
[03:37:24] [PASSED] ttm_pool_free_dma_alloc
[03:37:24] [PASSED] ttm_pool_free_no_dma_alloc
[03:37:24] [PASSED] ttm_pool_fini_basic
[03:37:24] ==================== [PASSED] ttm_pool =====================
[03:37:24] ================ ttm_resource (8 subtests) =================
[03:37:24] ================= ttm_resource_init_basic  =================
[03:37:24] [PASSED] Init resource in TTM_PL_SYSTEM
[03:37:24] [PASSED] Init resource in TTM_PL_VRAM
[03:37:24] [PASSED] Init resource in a private placement
[03:37:24] [PASSED] Init resource in TTM_PL_SYSTEM, set placement flags
[03:37:24] ============= [PASSED] ttm_resource_init_basic =============
[03:37:24] [PASSED] ttm_resource_init_pinned
[03:37:24] [PASSED] ttm_resource_fini_basic
[03:37:24] [PASSED] ttm_resource_manager_init_basic
[03:37:24] [PASSED] ttm_resource_manager_usage_basic
[03:37:24] [PASSED] ttm_resource_manager_set_used_basic
[03:37:24] [PASSED] ttm_sys_man_alloc_basic
[03:37:24] [PASSED] ttm_sys_man_free_basic
[03:37:24] ================== [PASSED] ttm_resource ===================
[03:37:24] =================== ttm_tt (15 subtests) ===================
[03:37:24] ==================== ttm_tt_init_basic  ====================
[03:37:24] [PASSED] Page-aligned size
[03:37:24] [PASSED] Extra pages requested
[03:37:24] ================ [PASSED] ttm_tt_init_basic ================
[03:37:24] [PASSED] ttm_tt_init_misaligned
[03:37:24] [PASSED] ttm_tt_fini_basic
[03:37:24] [PASSED] ttm_tt_fini_sg
[03:37:24] [PASSED] ttm_tt_fini_shmem
[03:37:24] [PASSED] ttm_tt_create_basic
[03:37:24] [PASSED] ttm_tt_create_invalid_bo_type
[03:37:24] [PASSED] ttm_tt_create_ttm_exists
[03:37:24] [PASSED] ttm_tt_create_failed
[03:37:24] [PASSED] ttm_tt_destroy_basic
[03:37:24] [PASSED] ttm_tt_populate_null_ttm
[03:37:24] [PASSED] ttm_tt_populate_populated_ttm
[03:37:24] [PASSED] ttm_tt_unpopulate_basic
[03:37:24] [PASSED] ttm_tt_unpopulate_empty_ttm
[03:37:24] [PASSED] ttm_tt_swapin_basic
[03:37:24] ===================== [PASSED] ttm_tt ======================
[03:37:24] =================== ttm_bo (14 subtests) ===================
[03:37:24] =========== ttm_bo_reserve_optimistic_no_ticket  ===========
[03:37:24] [PASSED] Cannot be interrupted and sleeps
[03:37:24] [PASSED] Cannot be interrupted, locks straight away
[03:37:24] [PASSED] Can be interrupted, sleeps
[03:37:24] ======= [PASSED] ttm_bo_reserve_optimistic_no_ticket =======
[03:37:24] [PASSED] ttm_bo_reserve_locked_no_sleep
[03:37:24] [PASSED] ttm_bo_reserve_no_wait_ticket
[03:37:24] [PASSED] ttm_bo_reserve_double_resv
[03:37:24] [PASSED] ttm_bo_reserve_interrupted
[03:37:24] [PASSED] ttm_bo_reserve_deadlock
[03:37:24] [PASSED] ttm_bo_unreserve_basic
[03:37:24] [PASSED] ttm_bo_unreserve_pinned
[03:37:24] [PASSED] ttm_bo_unreserve_bulk
[03:37:24] [PASSED] ttm_bo_fini_basic
[03:37:24] [PASSED] ttm_bo_fini_shared_resv
[03:37:24] [PASSED] ttm_bo_pin_basic
[03:37:24] [PASSED] ttm_bo_pin_unpin_resource
[03:37:24] [PASSED] ttm_bo_multiple_pin_one_unpin
[03:37:24] ===================== [PASSED] ttm_bo ======================
[03:37:24] ============== ttm_bo_validate (21 subtests) ===============
[03:37:24] ============== ttm_bo_init_reserved_sys_man  ===============
[03:37:24] [PASSED] Buffer object for userspace
[03:37:24] [PASSED] Kernel buffer object
[03:37:24] [PASSED] Shared buffer object
[03:37:24] ========== [PASSED] ttm_bo_init_reserved_sys_man ===========
[03:37:24] ============== ttm_bo_init_reserved_mock_man  ==============
[03:37:24] [PASSED] Buffer object for userspace
[03:37:24] [PASSED] Kernel buffer object
[03:37:24] [PASSED] Shared buffer object
[03:37:24] ========== [PASSED] ttm_bo_init_reserved_mock_man ==========
[03:37:24] [PASSED] ttm_bo_init_reserved_resv
[03:37:24] ================== ttm_bo_validate_basic  ==================
[03:37:24] [PASSED] Buffer object for userspace
[03:37:24] [PASSED] Kernel buffer object
[03:37:24] [PASSED] Shared buffer object
[03:37:24] ============== [PASSED] ttm_bo_validate_basic ==============
[03:37:24] [PASSED] ttm_bo_validate_invalid_placement
[03:37:24] ============= ttm_bo_validate_same_placement  ==============
[03:37:24] [PASSED] System manager
[03:37:24] [PASSED] VRAM manager
[03:37:24] ========= [PASSED] ttm_bo_validate_same_placement ==========
[03:37:24] [PASSED] ttm_bo_validate_failed_alloc
[03:37:24] [PASSED] ttm_bo_validate_pinned
[03:37:24] [PASSED] ttm_bo_validate_busy_placement
[03:37:24] ================ ttm_bo_validate_multihop  =================
[03:37:24] [PASSED] Buffer object for userspace
[03:37:24] [PASSED] Kernel buffer object
[03:37:24] [PASSED] Shared buffer object
[03:37:24] ============ [PASSED] ttm_bo_validate_multihop =============
[03:37:24] ========== ttm_bo_validate_no_placement_signaled  ==========
[03:37:24] [PASSED] Buffer object in system domain, no page vector
[03:37:24] [PASSED] Buffer object in system domain with an existing page vector
[03:37:24] ====== [PASSED] ttm_bo_validate_no_placement_signaled ======
[03:37:24] ======== ttm_bo_validate_no_placement_not_signaled  ========
[03:37:24] [PASSED] Buffer object for userspace
[03:37:24] [PASSED] Kernel buffer object
[03:37:24] [PASSED] Shared buffer object
[03:37:24] ==== [PASSED] ttm_bo_validate_no_placement_not_signaled ====
[03:37:24] [PASSED] ttm_bo_validate_move_fence_signaled
[03:37:24] ========= ttm_bo_validate_move_fence_not_signaled  =========
[03:37:24] [PASSED] Waits for GPU
[03:37:24] [PASSED] Tries to lock straight away
[03:37:24] ===== [PASSED] ttm_bo_validate_move_fence_not_signaled =====
[03:37:24] [PASSED] ttm_bo_validate_happy_evict
[03:37:24] [PASSED] ttm_bo_validate_all_pinned_evict
[03:37:24] [PASSED] ttm_bo_validate_allowed_only_evict
[03:37:24] [PASSED] ttm_bo_validate_deleted_evict
[03:37:24] [PASSED] ttm_bo_validate_busy_domain_evict
[03:37:24] [PASSED] ttm_bo_validate_evict_gutting
[03:37:24] [PASSED] ttm_bo_validate_recrusive_evict
stty: 'standard input': Inappropriate ioctl for device
[03:37:24] ================= [PASSED] ttm_bo_validate =================
[03:37:24] ============================================================
[03:37:24] Testing complete. Ran 101 tests: passed: 101
[03:37:24] Elapsed time: 11.210s total, 1.708s configuring, 9.286s building, 0.185s running

+ cleanup
++ stat -c %u:%g /kernel
+ chown -R 1003:1003 /kernel



^ permalink raw reply	[flat|nested] 22+ messages in thread

* ✗ CI.checksparse: warning for Introduce DRM_RAS using generic netlink for RAS (rev4)
  2026-01-19  4:00 [PATCH v4 0/4] Introduce DRM_RAS using generic netlink for RAS Riana Tauro
  2026-01-19  3:36 ` ✗ CI.checkpatch: warning for Introduce DRM_RAS using generic netlink for RAS (rev4) Patchwork
  2026-01-19  3:37 ` ✓ CI.KUnit: success " Patchwork
@ 2026-01-19  3:52 ` Patchwork
  2026-01-19  4:00 ` [PATCH v4 1/4] drm/ras: Introduce the DRM RAS infrastructure over generic netlink Riana Tauro
                   ` (5 subsequent siblings)
  8 siblings, 0 replies; 22+ messages in thread
From: Patchwork @ 2026-01-19  3:52 UTC (permalink / raw)
  To: Rodrigo Vivi; +Cc: intel-xe

== Series Details ==

Series: Introduce DRM_RAS using generic netlink for RAS (rev4)
URL   : https://patchwork.freedesktop.org/series/155188/
State : warning

== Summary ==

+ trap cleanup EXIT
+ KERNEL=/kernel
+ MT=/root/linux/maintainer-tools
+ git clone https://gitlab.freedesktop.org/drm/maintainer-tools /root/linux/maintainer-tools
Cloning into '/root/linux/maintainer-tools'...
warning: redirecting to https://gitlab.freedesktop.org/drm/maintainer-tools.git/
+ make -C /root/linux/maintainer-tools
make: Entering directory '/root/linux/maintainer-tools'
cc -O2 -g -Wextra -o remap-log remap-log.c
make: Leaving directory '/root/linux/maintainer-tools'
+ cd /kernel
+ git config --global --add safe.directory /kernel
+ /root/linux/maintainer-tools/dim sparse --fast af18785a6a8621b1a5805ba8a1b35d290cb4bcac
Sparse version: 0.6.4 (Ubuntu: 0.6.4-4ubuntu3)
Fast mode used, each commit won't be checked separately.
-
+drivers/gpu/drm/drm_drv.c:61:1: error: bad constant expression
+drivers/gpu/drm/drm_drv.c:62:1: error: bad constant expression
+drivers/gpu/drm/drm_drv.c:63:1: error: bad constant expression
+drivers/gpu/drm/drm_drv.c:63:1: error: bad constant expression

+ cleanup
++ stat -c %u:%g /kernel
+ chown -R 1003:1003 /kernel



^ permalink raw reply	[flat|nested] 22+ messages in thread

* [PATCH v4 0/4] Introduce DRM_RAS using generic netlink for RAS
@ 2026-01-19  4:00 Riana Tauro
  2026-01-19  3:36 ` ✗ CI.checkpatch: warning for Introduce DRM_RAS using generic netlink for RAS (rev4) Patchwork
                   ` (8 more replies)
  0 siblings, 9 replies; 22+ messages in thread
From: Riana Tauro @ 2026-01-19  4:00 UTC (permalink / raw)
  To: intel-xe, dri-devel
  Cc: aravind.iddamsetty, anshuman.gupta, rodrigo.vivi, joonas.lahtinen,
	simona.vetter, airlied, pratik.bari, joshua.santosh.ranjan,
	ashwin.kumar.kulkarni, shubham.kumar, ravi.kishore.koppuravuri,
	raag.jadav, Riana Tauro

This work is a continuation of the great work started by Aravind ([1] and [2])
in order to fulfill the RAS requirements and proposal as previously discussed
and agreed in the Linux Plumbers accelerator's bof of 2022 [3].

[1]: https://lore.kernel.org/dri-devel/20250730064956.1385855-1-aravind.iddamsetty@linux.intel.com/
[2]: https://lore.kernel.org/all/4cbdfcc5-5020-a942-740e-a602d4c00cc2@linux.intel.com/
[3]: https://airlied.blogspot.com/2022/09/accelerators-bof-outcomes-summary.html

During the past review round, Lukas pointed out that netlink had evolved
in parallel during these years and that now, any new usage of netlink families
would require the usage of the YAML description and scripts.

With this new requirement in place, the family name is hardcoded in the YAML file,
so we are forced to have a single family name for the entire drm and, as a result,
we are now forced to have a registration.

So, while doing the registration, we created the concept of drm-ras-node.
For now, the only node type supported is the agreed error-counter, but that could
be expanded to other cases like telemetry, as requested by Zack for the Qualcomm
accel driver.

In this first version, only querying counters is supported, but this can be extended
in the future with multicast notifications and clearing of the counters.

This design with multiple nodes per device is already flexible enough for a driver
to decide whether it wants to handle errors per device, per IP block, or per error
category. I believe this fully addresses the AMD feedback from the earlier
reviews.

So, my proposal is to start simple with this case as is, and then iterate with
drm-ras in tree so we evolve together according to the various drivers' RAS
needs.

I have provided documentation and the first Xe implementation of the counters
as a reference.

Also, it is worth mentioning that we have an in-tree pyynl/cli.py tool that fully
exercises this new API, so I hope it can serve as the reference code for the uAPI
usage while we continue with the plan of introducing IGT tests and tools for this,
and of adjusting the internal vendor tools to open up along with the open source
developments and changing them to support these flows.

Example:

List Nodes:

$ sudo ynl --family drm_ras --dump list-nodes
[{'device-name': '0000:03:00.0',
  'node-id': 0,
  'node-name': 'correctable-errors',
  'node-type': 'error-counter'},
 {'device-name': '0000:03:00.0',
  'node-id': 1,
  'node-name': 'uncorrectable-errors',
  'node-type': 'error-counter'}]

Get Error counters:

$ sudo ynl --family drm_ras  --dump get-error-counters --json '{"node-id":0}'
[{'error-id': 1, 'error-name': 'GT', 'error-value': 0},
 {'error-id': 2, 'error-name': 'SoC', 'error-value': 0}]

Query Error counter:

$ sudo ynl --family drm_ras --do query-error-counter  --json '{"node-id":0, "error-id":1}'
{'error-id': 1, 'error-name': 'GT', 'error-value': 0}
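
As a rough illustration only (not part of this series): the replies shown above
are printed as Python literals, so a monitoring script could plausibly parse them
with ast.literal_eval. The sketch below assumes that output format, uses sample
strings copied from the examples above instead of a live device, and the helper
names (parse_ynl_output, node_id_by_name, counters_by_name) are hypothetical.

```python
import ast

# Sample replies copied from the examples above; a real script would capture
# the tool's stdout instead (assumption: the output stays a Python literal).
LIST_NODES_OUTPUT = """
[{'device-name': '0000:03:00.0',
  'node-id': 0,
  'node-name': 'correctable-errors',
  'node-type': 'error-counter'},
 {'device-name': '0000:03:00.0',
  'node-id': 1,
  'node-name': 'uncorrectable-errors',
  'node-type': 'error-counter'}]
"""

GET_ERROR_COUNTERS_OUTPUT = """
[{'error-id': 1, 'error-name': 'GT', 'error-value': 0},
 {'error-id': 2, 'error-name': 'SoC', 'error-value': 0}]
"""


def parse_ynl_output(text):
    """Turn a pretty-printed reply (a Python literal) into lists and dicts."""
    return ast.literal_eval(text.strip())


def node_id_by_name(nodes, device, name):
    """Return the node-id of the node called `name` on `device`."""
    for node in nodes:
        if node["device-name"] == device and node["node-name"] == name:
            return node["node-id"]
    raise KeyError(f"{name} not found on {device}")


def counters_by_name(counters):
    """Map each error-name to its current error-value."""
    return {c["error-name"]: c["error-value"] for c in counters}


nodes = parse_ynl_output(LIST_NODES_OUTPUT)
counters = parse_ynl_output(GET_ERROR_COUNTERS_OUTPUT)

# Enumerate first, then use the discovered node-id for later queries.
node_id = node_id_by_name(nodes, "0000:03:00.0", "correctable-errors")
print(node_id, counters_by_name(counters))
```

The lookup mirrors the intended flow: list the nodes first, then use the
discovered node-id in the get-error-counters or query-error-counter requests.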

IGT : https://patchwork.freedesktop.org/patch/689729/?series=157409&rev=3

Rev2: Fix review comments
      Add support for GT and SOC errors

Rev3: Add uAPI for errors and nodes
      Update documentation
       
Rev4: Use only correctable and uncorrectable error nodes
      use REG_BIT
      remove redundant error strings

Riana Tauro (3):
  drm/xe/xe_drm_ras: Add support for drm ras
  drm/xe/xe_hw_error: Add support for GT hardware errors
  drm/xe/xe_hw_error: Add support for PVC SOC errors

Rodrigo Vivi (1):
  drm/ras: Introduce the DRM RAS infrastructure over generic netlink

 Documentation/gpu/drm-ras.rst              | 109 ++++++
 Documentation/gpu/index.rst                |   1 +
 Documentation/netlink/specs/drm_ras.yaml   | 130 +++++++
 drivers/gpu/drm/Kconfig                    |   9 +
 drivers/gpu/drm/Makefile                   |   1 +
 drivers/gpu/drm/drm_drv.c                  |   6 +
 drivers/gpu/drm/drm_ras.c                  | 351 +++++++++++++++++
 drivers/gpu/drm/drm_ras_genl_family.c      |  42 ++
 drivers/gpu/drm/drm_ras_nl.c               |  54 +++
 drivers/gpu/drm/xe/Makefile                |   1 +
 drivers/gpu/drm/xe/regs/xe_hw_error_regs.h |  77 +++-
 drivers/gpu/drm/xe/xe_device_types.h       |   4 +
 drivers/gpu/drm/xe/xe_drm_ras.c            | 176 +++++++++
 drivers/gpu/drm/xe/xe_drm_ras.h            |  15 +
 drivers/gpu/drm/xe/xe_drm_ras_types.h      |  49 +++
 drivers/gpu/drm/xe/xe_hw_error.c           | 431 +++++++++++++++++++--
 include/drm/drm_ras.h                      |  76 ++++
 include/drm/drm_ras_genl_family.h          |  17 +
 include/drm/drm_ras_nl.h                   |  24 ++
 include/uapi/drm/drm_ras.h                 |  49 +++
 include/uapi/drm/xe_drm.h                  |  79 ++++
 21 files changed, 1663 insertions(+), 38 deletions(-)
 create mode 100644 Documentation/gpu/drm-ras.rst
 create mode 100644 Documentation/netlink/specs/drm_ras.yaml
 create mode 100644 drivers/gpu/drm/drm_ras.c
 create mode 100644 drivers/gpu/drm/drm_ras_genl_family.c
 create mode 100644 drivers/gpu/drm/drm_ras_nl.c
 create mode 100644 drivers/gpu/drm/xe/xe_drm_ras.c
 create mode 100644 drivers/gpu/drm/xe/xe_drm_ras.h
 create mode 100644 drivers/gpu/drm/xe/xe_drm_ras_types.h
 create mode 100644 include/drm/drm_ras.h
 create mode 100644 include/drm/drm_ras_genl_family.h
 create mode 100644 include/drm/drm_ras_nl.h
 create mode 100644 include/uapi/drm/drm_ras.h

-- 
2.47.1


^ permalink raw reply	[flat|nested] 22+ messages in thread

* [PATCH v4 1/4] drm/ras: Introduce the DRM RAS infrastructure over generic netlink
  2026-01-19  4:00 [PATCH v4 0/4] Introduce DRM_RAS using generic netlink for RAS Riana Tauro
                   ` (2 preceding siblings ...)
  2026-01-19  3:52 ` ✗ CI.checksparse: warning " Patchwork
@ 2026-01-19  4:00 ` Riana Tauro
  2026-01-22 21:51   ` Zack McKevitt
  2026-01-19  4:00 ` [PATCH v4 2/4] drm/xe/xe_drm_ras: Add support for drm ras Riana Tauro
                   ` (4 subsequent siblings)
  8 siblings, 1 reply; 22+ messages in thread
From: Riana Tauro @ 2026-01-19  4:00 UTC (permalink / raw)
  To: intel-xe, dri-devel
  Cc: aravind.iddamsetty, anshuman.gupta, rodrigo.vivi, joonas.lahtinen,
	simona.vetter, airlied, pratik.bari, joshua.santosh.ranjan,
	ashwin.kumar.kulkarni, shubham.kumar, ravi.kishore.koppuravuri,
	raag.jadav, Zack McKevitt, Lijo Lazar, Hawking Zhang,
	Jakub Kicinski, David S. Miller, Paolo Abeni, Eric Dumazet,
	netdev, Riana Tauro

From: Rodrigo Vivi <rodrigo.vivi@intel.com>

Introduces the DRM RAS infrastructure over generic netlink.

The new interface allows drivers to expose RAS nodes and their
associated error counters to userspace in a structured and extensible
way. Each drm_ras node can register its own set of error counters, which
are then discoverable and queryable through netlink operations. This
lays the groundwork for reporting and managing hardware error states
in a unified manner across different DRM drivers.

Currently it only supports error-counter nodes. But it can be
extended later.

The registration is also not tied to any drm node, so it can be
used by accel devices as well.

It uses the new and mandatory YAML description format stored in
Documentation/netlink/specs/. This forces a single generic netlink
family namespace for the entire drm: "drm-ras".
But multiple endpoints are supported within the single family.

Any modification to this API needs to be applied to
Documentation/netlink/specs/drm_ras.yaml before regenerating the
code:

$ tools/net/ynl/pyynl/ynl_gen_c.py --spec \
 Documentation/netlink/specs/drm_ras.yaml --mode uapi --header \
 > include/uapi/drm/drm_ras.h

$ tools/net/ynl/pyynl/ynl_gen_c.py --spec \
 Documentation/netlink/specs/drm_ras.yaml --mode kernel --header \
 > include/drm/drm_ras_nl.h

$ tools/net/ynl/pyynl/ynl_gen_c.py --spec \
 Documentation/netlink/specs/drm_ras.yaml --mode kernel --source \
 > drivers/gpu/drm/drm_ras_nl.c

Cc: Zack McKevitt <zachary.mckevitt@oss.qualcomm.com>
Cc: Lijo Lazar <lijo.lazar@amd.com>
Cc: Hawking Zhang <Hawking.Zhang@amd.com>
Cc: Jakub Kicinski <kuba@kernel.org>
Cc: David S. Miller <davem@davemloft.net>
Cc: Paolo Abeni <pabeni@redhat.com>
Cc: Eric Dumazet <edumazet@google.com>
Cc: netdev@vger.kernel.org
Co-developed-by: Aravind Iddamsetty <aravind.iddamsetty@linux.intel.com>
Signed-off-by: Aravind Iddamsetty <aravind.iddamsetty@linux.intel.com>
Signed-off-by: Rodrigo Vivi <rodrigo.vivi@intel.com>
Signed-off-by: Riana Tauro <riana.tauro@intel.com>
---
v2: fix doc and memory leak
    use xe_for_each_start
    use standard genlmsg_iput (Jakub Kicinski)

v3: add documentation to index
    modify documentation to mention uAPI requirements (Rodrigo)

v4: fix typo (Zack)
---
 Documentation/gpu/drm-ras.rst            | 109 +++++++
 Documentation/gpu/index.rst              |   1 +
 Documentation/netlink/specs/drm_ras.yaml | 130 +++++++++
 drivers/gpu/drm/Kconfig                  |   9 +
 drivers/gpu/drm/Makefile                 |   1 +
 drivers/gpu/drm/drm_drv.c                |   6 +
 drivers/gpu/drm/drm_ras.c                | 351 +++++++++++++++++++++++
 drivers/gpu/drm/drm_ras_genl_family.c    |  42 +++
 drivers/gpu/drm/drm_ras_nl.c             |  54 ++++
 include/drm/drm_ras.h                    |  76 +++++
 include/drm/drm_ras_genl_family.h        |  17 ++
 include/drm/drm_ras_nl.h                 |  24 ++
 include/uapi/drm/drm_ras.h               |  49 ++++
 13 files changed, 869 insertions(+)
 create mode 100644 Documentation/gpu/drm-ras.rst
 create mode 100644 Documentation/netlink/specs/drm_ras.yaml
 create mode 100644 drivers/gpu/drm/drm_ras.c
 create mode 100644 drivers/gpu/drm/drm_ras_genl_family.c
 create mode 100644 drivers/gpu/drm/drm_ras_nl.c
 create mode 100644 include/drm/drm_ras.h
 create mode 100644 include/drm/drm_ras_genl_family.h
 create mode 100644 include/drm/drm_ras_nl.h
 create mode 100644 include/uapi/drm/drm_ras.h

diff --git a/Documentation/gpu/drm-ras.rst b/Documentation/gpu/drm-ras.rst
new file mode 100644
index 000000000000..cec60cf5d17d
--- /dev/null
+++ b/Documentation/gpu/drm-ras.rst
@@ -0,0 +1,109 @@
+.. SPDX-License-Identifier: GPL-2.0+
+
+============================
+DRM RAS over Generic Netlink
+============================
+
+The DRM RAS (Reliability, Availability, Serviceability) interface provides a
+standardized way for GPU/accelerator drivers to expose error counters and
+other reliability nodes to user space via Generic Netlink. This allows
+diagnostic tools, monitoring daemons, or test infrastructure to query hardware
+health in a uniform way across different DRM drivers.
+
+Key Goals:
+
+* Provide a standardized RAS solution for GPU and accelerator drivers, enabling
+  data center monitoring and reliability operations.
+* Implement a single drm-ras Generic Netlink family to meet modern Netlink YAML
+  specifications and centralize all RAS-related communication in one namespace.
+* Support a basic error counter interface, addressing the immediate, essential
+  monitoring needs.
+* Offer a flexible, future-proof interface that can be extended to support
+  additional types of RAS data in the future.
+* Allow multiple nodes per driver, enabling drivers to register separate
+  nodes for different IP blocks, sub-blocks, or other logical subdivisions
+  as applicable.
+
+Nodes
+=====
+
+Nodes are logical abstractions representing an error source or block within
+the device. Currently, only error counter nodes are supported.
+
+Drivers are responsible for registering and unregistering nodes via the
+`drm_ras_node_register()` and `drm_ras_node_unregister()` APIs.
+
+Node Management
+-------------------
+
+.. kernel-doc:: drivers/gpu/drm/drm_ras.c
+   :doc: DRM RAS Node Management
+.. kernel-doc:: drivers/gpu/drm/drm_ras.c
+   :internal:
+
+Generic Netlink Usage
+=====================
+
+The interface is implemented as a Generic Netlink family named ``drm-ras``.
+User space tools can:
+
+* List registered nodes with the ``list-nodes`` command.
+* List all error counters in a node with the ``get-error-counters`` command.
+* Query error counters using the ``query-error-counter`` command.
+
+YAML-based Interface
+--------------------
+
+The interface is described in a YAML specification:
+
+``Documentation/netlink/specs/drm_ras.yaml``
+
+This YAML is used to auto-generate user space bindings via
+``tools/net/ynl/pyynl/ynl_gen_c.py``, and drives the structure of netlink
+attributes and operations.
+
+Usage Notes
+-----------
+
+* User space must first enumerate nodes to obtain their IDs.
+* Node IDs are used for all further queries, such as error counters.
+* Error counters are queried by Error ID; replies also carry the error name.
+* Query parameters should be defined as part of the uAPI to ensure user interface stability.
+* The interface supports future extension by adding new node types and
+  additional attributes.
+
+Example: List nodes using ynl
+
+.. code-block:: bash
+
+    sudo ynl --family drm_ras --dump list-nodes
+    [{'device-name': '0000:03:00.0',
+      'node-id': 0,
+      'node-name': 'correctable-errors',
+      'node-type': 'error-counter'},
+     {'device-name': '0000:03:00.0',
+      'node-id': 1,
+      'node-name': 'nonfatal-errors',
+      'node-type': 'error-counter'},
+     {'device-name': '0000:03:00.0',
+      'node-id': 2,
+      'node-name': 'fatal-errors',
+      'node-type': 'error-counter'}]
+
+Example: List all error counters using ynl
+
+.. code-block:: bash
+
+
+   sudo ynl --family drm_ras --dump get-error-counters --json '{"node-id":1}'
+   [{'error-id': 1, 'error-name': 'error_name_1', 'error-value': 0},
+    {'error-id': 2, 'error-name': 'error_name_2', 'error-value': 0}]
+
+
+Example: Query an error counter for a given node
+
+.. code-block:: bash
+
+   sudo ynl --family drm_ras --do query-error-counter --json '{"node-id":2, "error-id":1}'
+   {'error-id': 1, 'error-name': 'error_name_1', 'error-value': 0}
+
diff --git a/Documentation/gpu/index.rst b/Documentation/gpu/index.rst
index 7dcb15850afd..60c73fdcfeed 100644
--- a/Documentation/gpu/index.rst
+++ b/Documentation/gpu/index.rst
@@ -9,6 +9,7 @@ GPU Driver Developer's Guide
    drm-mm
    drm-kms
    drm-kms-helpers
+   drm-ras
    drm-uapi
    drm-usage-stats
    driver-uapi
diff --git a/Documentation/netlink/specs/drm_ras.yaml b/Documentation/netlink/specs/drm_ras.yaml
new file mode 100644
index 000000000000..be0e379c5bc9
--- /dev/null
+++ b/Documentation/netlink/specs/drm_ras.yaml
@@ -0,0 +1,130 @@
+# SPDX-License-Identifier: ((GPL-2.0 WITH Linux-syscall-note) OR BSD-3-Clause)
+---
+name: drm-ras
+protocol: genetlink
+uapi-header: drm/drm_ras.h
+
+doc: >-
+  DRM RAS (Reliability, Availability, Serviceability) over Generic Netlink.
+  Provides a standardized mechanism for DRM drivers to register "nodes"
+  representing hardware/software components capable of reporting error counters.
+  Userspace tools can query the list of nodes or individual error counters
+  via the Generic Netlink interface.
+
+definitions:
+  -
+    type: enum
+    name: node-type
+    value-start: 1
+    entries: [error-counter]
+    doc: >-
+         Type of the node. Currently, only error-counter nodes are
+         supported, which expose reliability counters for a hardware/software
+         component.
+
+attribute-sets:
+  -
+    name: node-attrs
+    attributes:
+      -
+        name: node-id
+        type: u32
+        doc: >-
+             Unique identifier for the node.
+             Assigned dynamically by the DRM RAS core upon registration.
+      -
+        name: device-name
+        type: string
+        doc: >-
+             Device name chosen by the driver at registration.
+             Can be a PCI BDF, UUID, or module name if unique.
+      -
+        name: node-name
+        type: string
+        doc: >-
+             Node name chosen by the driver at registration.
+             Can be an IP block name, or any name that identifies the
+             RAS node inside the device.
+      -
+        name: node-type
+        type: u32
+        doc: Type of this node, identifying its function.
+        enum: node-type
+  -
+    name: error-counter-attrs
+    attributes:
+      -
+        name: node-id
+        type: u32
+        doc: Node ID targeted by this error counter operation.
+      -
+        name: error-id
+        type: u32
+        doc: Unique identifier for a specific error counter within a node.
+      -
+        name: error-name
+        type: string
+        doc: Name of the error.
+      -
+        name: error-value
+        type: u32
+        doc: Current value of the requested error counter.
+
+operations:
+  list:
+    -
+      name: list-nodes
+      doc: >-
+           Retrieve the full list of currently registered DRM RAS nodes.
+           Each node includes its dynamically assigned ID, name, and type.
+           **Important:** User space must call this operation first to obtain
+           the node IDs. These IDs are required for all subsequent
+           operations on nodes, such as querying error counters.
+      attribute-set: node-attrs
+      flags: [admin-perm]
+      dump:
+        reply:
+          attributes:
+            - node-id
+            - device-name
+            - node-name
+            - node-type
+    -
+      name: get-error-counters
+      doc: >-
+           Retrieve the full list of error counters for a given node.
+           The response includes the ID, the name, and the current
+           value of each counter.
+      attribute-set: error-counter-attrs
+      flags: [admin-perm]
+      dump:
+        request:
+          attributes:
+            - node-id
+        reply:
+          attributes:
+            - error-id
+            - error-name
+            - error-value
+    -
+      name: query-error-counter
+      doc: >-
+           Query the information of a specific error counter for a given node.
+           Users must provide the node ID and the error counter ID.
+           The response contains the id, the name, and the current value
+           of the counter.
+      attribute-set: error-counter-attrs
+      flags: [admin-perm]
+      do:
+        request:
+          attributes:
+            - node-id
+            - error-id
+        reply:
+          attributes:
+            - error-id
+            - error-name
+            - error-value
+
+kernel-family:
+  headers: ["drm/drm_ras_nl.h"]
diff --git a/drivers/gpu/drm/Kconfig b/drivers/gpu/drm/Kconfig
index a33b90251530..f378e77048c8 100644
--- a/drivers/gpu/drm/Kconfig
+++ b/drivers/gpu/drm/Kconfig
@@ -130,6 +130,15 @@ config DRM_PANIC_SCREEN_QR_VERSION
 	  Smaller QR code are easier to read, but will contain less debugging
 	  data. Default is 40.
 
+config DRM_RAS
+	bool "DRM RAS support"
+	depends on DRM
+	help
+	  Enables the DRM RAS (Reliability, Availability and Serviceability)
+	  support for DRM drivers. This provides a Generic Netlink interface
+	  for error reporting and queries.
+	  If in doubt, say "N".
+
 config DRM_DEBUG_DP_MST_TOPOLOGY_REFS
         bool "Enable refcount backtrace history in the DP MST helpers"
 	depends on STACKTRACE_SUPPORT
diff --git a/drivers/gpu/drm/Makefile b/drivers/gpu/drm/Makefile
index 0deee72ef935..2eea3f54db53 100644
--- a/drivers/gpu/drm/Makefile
+++ b/drivers/gpu/drm/Makefile
@@ -95,6 +95,7 @@ drm-$(CONFIG_DRM_ACCEL) += ../../accel/drm_accel.o
 drm-$(CONFIG_DRM_PANIC) += drm_panic.o
 drm-$(CONFIG_DRM_DRAW) += drm_draw.o
 drm-$(CONFIG_DRM_PANIC_SCREEN_QR_CODE) += drm_panic_qr.o
+drm-$(CONFIG_DRM_RAS) += drm_ras.o drm_ras_nl.o drm_ras_genl_family.o
 obj-$(CONFIG_DRM)	+= drm.o
 
 obj-$(CONFIG_DRM_PANEL_ORIENTATION_QUIRKS) += drm_panel_orientation_quirks.o
diff --git a/drivers/gpu/drm/drm_drv.c b/drivers/gpu/drm/drm_drv.c
index 2915118436ce..6b965c3d3307 100644
--- a/drivers/gpu/drm/drm_drv.c
+++ b/drivers/gpu/drm/drm_drv.c
@@ -53,6 +53,7 @@
 #include <drm/drm_panic.h>
 #include <drm/drm_print.h>
 #include <drm/drm_privacy_screen_machine.h>
+#include <drm/drm_ras_genl_family.h>
 
 #include "drm_crtc_internal.h"
 #include "drm_internal.h"
@@ -1223,6 +1224,7 @@ static const struct file_operations drm_stub_fops = {
 
 static void drm_core_exit(void)
 {
+	drm_ras_genl_family_unregister();
 	drm_privacy_screen_lookup_exit();
 	drm_panic_exit();
 	accel_core_exit();
@@ -1261,6 +1263,10 @@ static int __init drm_core_init(void)
 
 	drm_privacy_screen_lookup_init();
 
+	ret = drm_ras_genl_family_register();
+	if (ret < 0)
+		goto error;
+
 	drm_core_init_complete = true;
 
 	DRM_DEBUG("Initialized\n");
diff --git a/drivers/gpu/drm/drm_ras.c b/drivers/gpu/drm/drm_ras.c
new file mode 100644
index 000000000000..7bc77ea24fe2
--- /dev/null
+++ b/drivers/gpu/drm/drm_ras.c
@@ -0,0 +1,351 @@
+// SPDX-License-Identifier: MIT
+/*
+ * Copyright © 2025 Intel Corporation
+ */
+
+#include <linux/module.h>
+#include <linux/kernel.h>
+#include <linux/netdevice.h>
+#include <linux/xarray.h>
+#include <net/genetlink.h>
+
+#include <drm/drm_ras.h>
+
+/**
+ * DOC: DRM RAS Node Management
+ *
+ * This module provides the infrastructure to manage RAS (Reliability,
+ * Availability, and Serviceability) nodes for DRM drivers. Each
+ * DRM driver may register one or more RAS nodes, which represent
+ * logical components capable of reporting error counters and other
+ * reliability metrics.
+ *
+ * The nodes are stored in a global xarray `drm_ras_xa` to allow
+ * efficient lookup by ID. Nodes can be registered or unregistered
+ * dynamically at runtime.
+ *
+ * A Generic Netlink family `drm-ras` exposes three main operations to
+ * userspace:
+ *
+ * 1. LIST_NODES: Dump all currently registered RAS nodes.
+ *    The user receives an array of node IDs, names, and types.
+ *
+ * 2. GET_ERROR_COUNTERS: Dump all error counters of a given node.
+ *    The user receives an array of error IDs, names, and current values.
+ *
+ * 3. QUERY_ERROR_COUNTER: Query a specific error counter for a given node.
+ *    Userspace must provide the node ID and the counter ID, and
+ *    receives the ID, the error name, and its current value.
+ *
+ * Node registration:
+ * - drm_ras_node_register(): Registers a new node and assigns
+ *   it a unique ID in the xarray.
+ * - drm_ras_node_unregister(): Removes a previously registered
+ *   node from the xarray.
+ *
+ * Node type:
+ * - ERROR_COUNTER:
+ *     + Currently, only error counters are supported.
+ *     + The driver must implement the query_error_counter() callback to provide
+ *       the name and the value of the error counter.
+ *     + The driver must provide an error_counter_range.last value informing the
+ *       last valid error ID.
+ *     + The driver can provide an error_counter_range.first value informing the
+ *       first valid error ID.
+ *     + The error counters in the driver don't need to be contiguous, but the
+ *       driver must return -ENOENT from query_error_counter() as an indication
+ *       that the ID should be skipped and not listed in the netlink API.
+ *
+ * Netlink handlers:
+ * - drm_ras_nl_list_nodes_dumpit(): Implements the LIST_NODES
+ *   operation, iterating over the xarray.
+ * - drm_ras_nl_get_error_counters_dumpit(): Implements the GET_ERROR_COUNTERS
+ *   operation, iterating over the known valid error_counter_range.
+ * - drm_ras_nl_query_error_counter_doit(): Implements the QUERY_ERROR_COUNTER
+ *   operation, fetching a counter value from a specific node.
+ */
+
+static DEFINE_XARRAY_ALLOC(drm_ras_xa);
+
+/*
+ * The netlink callback context carries dump state across multiple dumpit calls
+ */
+struct drm_ras_ctx {
+	/* Which xarray id to restart the dump from */
+	unsigned long restart;
+};
+
+/**
+ * drm_ras_nl_list_nodes_dumpit() - Dump all registered RAS nodes
+ * @skb: Netlink message buffer
+ * @cb: Callback context for multi-part dumps
+ *
+ * Iterates over all registered RAS nodes in the global xarray and appends
+ * their attributes (ID, name, type) to the given netlink message buffer.
+ * Uses @cb->ctx to track progress in case the message buffer fills up, allowing
+ * multi-part dump support. On buffer overflow, updates the context to resume
+ * from the last node on the next invocation.
+ *
+ * Return: 0 if all nodes fit in @skb, -EMSGSIZE if the buffer filled up
+ *          (requires multi-part continuation), or another negative error
+ *          code on failure.
+ */
+int drm_ras_nl_list_nodes_dumpit(struct sk_buff *skb,
+				 struct netlink_callback *cb)
+{
+	const struct genl_info *info = genl_info_dump(cb);
+	struct drm_ras_ctx *ctx = (void *)cb->ctx;
+	struct drm_ras_node *node;
+	struct nlattr *hdr;
+	unsigned long id;
+	int ret = 0;
+
+	xa_for_each_start(&drm_ras_xa, id, node, ctx->restart) {
+		hdr = genlmsg_iput(skb, info);
+		if (!hdr) {
+			ret = -EMSGSIZE;
+			break;
+		}
+
+		ret = nla_put_u32(skb, DRM_RAS_A_NODE_ATTRS_NODE_ID, node->id);
+		if (ret) {
+			genlmsg_cancel(skb, hdr);
+			break;
+		}
+
+		ret = nla_put_string(skb, DRM_RAS_A_NODE_ATTRS_DEVICE_NAME,
+				     node->device_name);
+		if (ret) {
+			genlmsg_cancel(skb, hdr);
+			break;
+		}
+
+		ret = nla_put_string(skb, DRM_RAS_A_NODE_ATTRS_NODE_NAME,
+				     node->node_name);
+		if (ret) {
+			genlmsg_cancel(skb, hdr);
+			break;
+		}
+
+		ret = nla_put_u32(skb, DRM_RAS_A_NODE_ATTRS_NODE_TYPE,
+				  node->type);
+		if (ret) {
+			genlmsg_cancel(skb, hdr);
+			break;
+		}
+
+		genlmsg_end(skb, hdr);
+	}
+
+	if (ret == -EMSGSIZE)
+		ctx->restart = id;
+
+	return ret;
+}
+
+static int get_node_error_counter(u32 node_id, u32 error_id,
+				  const char **name, u32 *value)
+{
+	struct drm_ras_node *node;
+
+	node = xa_load(&drm_ras_xa, node_id);
+	if (!node || !node->query_error_counter)
+		return -ENOENT;
+
+	if (error_id < node->error_counter_range.first ||
+	    error_id > node->error_counter_range.last)
+		return -EINVAL;
+
+	return node->query_error_counter(node, error_id, name, value);
+}
+
+static int msg_reply_value(struct sk_buff *msg, u32 error_id,
+			   const char *error_name, u32 value)
+{
+	int ret;
+
+	ret = nla_put_u32(msg, DRM_RAS_A_ERROR_COUNTER_ATTRS_ERROR_ID, error_id);
+	if (ret)
+		return ret;
+
+	ret = nla_put_string(msg, DRM_RAS_A_ERROR_COUNTER_ATTRS_ERROR_NAME,
+			     error_name);
+	if (ret)
+		return ret;
+
+	return nla_put_u32(msg, DRM_RAS_A_ERROR_COUNTER_ATTRS_ERROR_VALUE,
+			   value);
+}
+
+static int doit_reply_value(struct genl_info *info, u32 node_id,
+			    u32 error_id)
+{
+	struct sk_buff *msg;
+	struct nlattr *hdr;
+	const char *error_name;
+	u32 value;
+	int ret;
+
+	msg = genlmsg_new(NLMSG_GOODSIZE, GFP_KERNEL);
+	if (!msg)
+		return -ENOMEM;
+
+	hdr = genlmsg_iput(msg, info);
+	if (!hdr) {
+		nlmsg_free(msg);
+		return -EMSGSIZE;
+	}
+
+	ret = get_node_error_counter(node_id, error_id,
+				     &error_name, &value);
+	if (!ret)
+		ret = msg_reply_value(msg, error_id, error_name, value);
+
+	/* Free the reply on any failure so that it is not leaked */
+	if (ret) {
+		genlmsg_cancel(msg, hdr);
+		nlmsg_free(msg);
+		return ret;
+	}
+
+	genlmsg_end(msg, hdr);
+
+	return genlmsg_reply(msg, info);
+}
+
+/**
+ * drm_ras_nl_get_error_counters_dumpit() - Dump all Error Counters
+ * @skb: Netlink message buffer
+ * @cb: Callback context for multi-part dumps
+ *
+ * Iterates over all error counters in a given Node and appends
+ * their attributes (ID, name, value) to the given netlink message buffer.
+ * Uses @cb->ctx to track progress in case the message buffer fills up, allowing
+ * multi-part dump support. On buffer overflow, updates the context to resume
+ * from the last error counter on the next invocation.
+ *
+ * Return: 0 if all errors fit in @skb, -EMSGSIZE if the buffer filled up
+ *          (requires multi-part continuation), or another negative error
+ *          code on failure.
+ */
+int drm_ras_nl_get_error_counters_dumpit(struct sk_buff *skb,
+					 struct netlink_callback *cb)
+{
+	const struct genl_info *info = genl_info_dump(cb);
+	struct drm_ras_ctx *ctx = (void *)cb->ctx;
+	struct drm_ras_node *node;
+	struct nlattr *hdr;
+	const char *error_name;
+	u32 node_id, error_id, value;
+	int ret = 0;
+
+	if (!info->attrs || !info->attrs[DRM_RAS_A_ERROR_COUNTER_ATTRS_NODE_ID])
+		return -EINVAL;
+
+	node_id = nla_get_u32(info->attrs[DRM_RAS_A_ERROR_COUNTER_ATTRS_NODE_ID]);
+
+	node = xa_load(&drm_ras_xa, node_id);
+	if (!node)
+		return -ENOENT;
+
+	for (error_id = max(node->error_counter_range.first, ctx->restart);
+	     error_id <= node->error_counter_range.last;
+	     error_id++) {
+		ret = get_node_error_counter(node_id, error_id,
+					     &error_name, &value);
+		/*
+		 * -ENOENT marks a gap in a non-contiguous range: skip this ID.
+		 */
+		if (ret == -ENOENT) {
+			ret = 0;
+			continue;
+		}
+		if (ret)
+			return ret;
+
+		hdr = genlmsg_iput(skb, info);
+		if (!hdr) {
+			ret = -EMSGSIZE;
+			break;
+		}
+
+		ret = msg_reply_value(skb, error_id, error_name, value);
+		if (ret) {
+			genlmsg_cancel(skb, hdr);
+			break;
+		}
+
+		genlmsg_end(skb, hdr);
+	}
+
+	if (ret == -EMSGSIZE)
+		ctx->restart = error_id;
+
+	return ret;
+}
+
+/**
+ * drm_ras_nl_query_error_counter_doit() - Query an error counter of a node
+ * @skb: Netlink message buffer
+ * @info: Generic Netlink info containing attributes of the request
+ *
+ * Extracts the node ID and error ID from the netlink attributes and
+ * retrieves the current value of the corresponding error counter. Sends the
+ * result back to the requesting user via the standard Genl reply.
+ *
+ * Return: 0 on success, or negative errno on failure.
+ */
+int drm_ras_nl_query_error_counter_doit(struct sk_buff *skb,
+					struct genl_info *info)
+{
+	u32 node_id, error_id;
+
+	if (!info->attrs ||
+	    !info->attrs[DRM_RAS_A_ERROR_COUNTER_ATTRS_NODE_ID] ||
+	    !info->attrs[DRM_RAS_A_ERROR_COUNTER_ATTRS_ERROR_ID])
+		return -EINVAL;
+
+	node_id = nla_get_u32(info->attrs[DRM_RAS_A_ERROR_COUNTER_ATTRS_NODE_ID]);
+	error_id = nla_get_u32(info->attrs[DRM_RAS_A_ERROR_COUNTER_ATTRS_ERROR_ID]);
+
+	return doit_reply_value(info, node_id, error_id);
+}
+
+/**
+ * drm_ras_node_register() - Register a new RAS node
+ * @node: Node structure to register
+ *
+ * Adds the given RAS node to the global node xarray and assigns it
+ * a unique ID. @node's device_name, node_name and type must all be valid.
+ *
+ * Return: 0 on success, or negative errno on failure.
+ */
+int drm_ras_node_register(struct drm_ras_node *node)
+{
+	if (!node->device_name || !node->node_name)
+		return -EINVAL;
+
+	/* Currently, only Error Counter nodes are supported */
+	if (node->type != DRM_RAS_NODE_TYPE_ERROR_COUNTER)
+		return -EINVAL;
+
+	/* Mandatory entries for Error Counter Node */
+	if (node->type == DRM_RAS_NODE_TYPE_ERROR_COUNTER &&
+	    (!node->error_counter_range.last || !node->query_error_counter))
+		return -EINVAL;
+
+	return xa_alloc(&drm_ras_xa, &node->id, node, xa_limit_32b, GFP_KERNEL);
+}
+EXPORT_SYMBOL(drm_ras_node_register);
+
+/**
+ * drm_ras_node_unregister() - Unregister a previously registered node
+ * @node: Node structure to unregister
+ *
+ * Removes the given node from the global node xarray using its ID.
+ */
+void drm_ras_node_unregister(struct drm_ras_node *node)
+{
+	xa_erase(&drm_ras_xa, node->id);
+}
+EXPORT_SYMBOL(drm_ras_node_unregister);
diff --git a/drivers/gpu/drm/drm_ras_genl_family.c b/drivers/gpu/drm/drm_ras_genl_family.c
new file mode 100644
index 000000000000..2d818b8c3808
--- /dev/null
+++ b/drivers/gpu/drm/drm_ras_genl_family.c
@@ -0,0 +1,42 @@
+// SPDX-License-Identifier: MIT
+/*
+ * Copyright © 2025 Intel Corporation
+ */
+
+#include <drm/drm_ras_genl_family.h>
+#include <drm/drm_ras_nl.h>
+
+/* Track family registration so that drm_core_exit() can be called at any time */
+static bool registered;
+
+/**
+ * drm_ras_genl_family_register() - Register drm-ras genl family
+ *
+ * Only to be called once, from drm_core_init().
+ */
+int drm_ras_genl_family_register(void)
+{
+	int ret;
+
+	registered = false;
+
+	ret = genl_register_family(&drm_ras_nl_family);
+	if (ret)
+		return ret;
+
+	registered = true;
+	return 0;
+}
+
+/**
+ * drm_ras_genl_family_unregister() - Unregister drm-ras genl family
+ *
+ * Can be called from drm_core_exit() at any moment, but only once.
+ */
+void drm_ras_genl_family_unregister(void)
+{
+	if (registered) {
+		genl_unregister_family(&drm_ras_nl_family);
+		registered = false;
+	}
+}
diff --git a/drivers/gpu/drm/drm_ras_nl.c b/drivers/gpu/drm/drm_ras_nl.c
new file mode 100644
index 000000000000..fcd1392410e4
--- /dev/null
+++ b/drivers/gpu/drm/drm_ras_nl.c
@@ -0,0 +1,54 @@
+// SPDX-License-Identifier: ((GPL-2.0 WITH Linux-syscall-note) OR BSD-3-Clause)
+/* Do not edit directly, auto-generated from: */
+/*	Documentation/netlink/specs/drm_ras.yaml */
+/* YNL-GEN kernel source */
+
+#include <net/netlink.h>
+#include <net/genetlink.h>
+
+#include <uapi/drm/drm_ras.h>
+#include <drm/drm_ras_nl.h>
+
+/* DRM_RAS_CMD_GET_ERROR_COUNTERS - dump */
+static const struct nla_policy drm_ras_get_error_counters_nl_policy[DRM_RAS_A_ERROR_COUNTER_ATTRS_NODE_ID + 1] = {
+	[DRM_RAS_A_ERROR_COUNTER_ATTRS_NODE_ID] = { .type = NLA_U32, },
+};
+
+/* DRM_RAS_CMD_QUERY_ERROR_COUNTER - do */
+static const struct nla_policy drm_ras_query_error_counter_nl_policy[DRM_RAS_A_ERROR_COUNTER_ATTRS_ERROR_ID + 1] = {
+	[DRM_RAS_A_ERROR_COUNTER_ATTRS_NODE_ID] = { .type = NLA_U32, },
+	[DRM_RAS_A_ERROR_COUNTER_ATTRS_ERROR_ID] = { .type = NLA_U32, },
+};
+
+/* Ops table for drm_ras */
+static const struct genl_split_ops drm_ras_nl_ops[] = {
+	{
+		.cmd	= DRM_RAS_CMD_LIST_NODES,
+		.dumpit	= drm_ras_nl_list_nodes_dumpit,
+		.flags	= GENL_ADMIN_PERM | GENL_CMD_CAP_DUMP,
+	},
+	{
+		.cmd		= DRM_RAS_CMD_GET_ERROR_COUNTERS,
+		.dumpit		= drm_ras_nl_get_error_counters_dumpit,
+		.policy		= drm_ras_get_error_counters_nl_policy,
+		.maxattr	= DRM_RAS_A_ERROR_COUNTER_ATTRS_NODE_ID,
+		.flags		= GENL_ADMIN_PERM | GENL_CMD_CAP_DUMP,
+	},
+	{
+		.cmd		= DRM_RAS_CMD_QUERY_ERROR_COUNTER,
+		.doit		= drm_ras_nl_query_error_counter_doit,
+		.policy		= drm_ras_query_error_counter_nl_policy,
+		.maxattr	= DRM_RAS_A_ERROR_COUNTER_ATTRS_ERROR_ID,
+		.flags		= GENL_ADMIN_PERM | GENL_CMD_CAP_DO,
+	},
+};
+
+struct genl_family drm_ras_nl_family __ro_after_init = {
+	.name		= DRM_RAS_FAMILY_NAME,
+	.version	= DRM_RAS_FAMILY_VERSION,
+	.netnsok	= true,
+	.parallel_ops	= true,
+	.module		= THIS_MODULE,
+	.split_ops	= drm_ras_nl_ops,
+	.n_split_ops	= ARRAY_SIZE(drm_ras_nl_ops),
+};
diff --git a/include/drm/drm_ras.h b/include/drm/drm_ras.h
new file mode 100644
index 000000000000..bba47a282ef8
--- /dev/null
+++ b/include/drm/drm_ras.h
@@ -0,0 +1,76 @@
+/* SPDX-License-Identifier: MIT */
+/*
+ * Copyright © 2025 Intel Corporation
+ */
+
+#ifndef __DRM_RAS_H__
+#define __DRM_RAS_H__
+
+#include "drm_ras_nl.h"
+
+/**
+ * struct drm_ras_node - A DRM RAS Node
+ */
+struct drm_ras_node {
+	/** @id: Unique identifier for the node. Dynamically assigned. */
+	u32 id;
+	/**
+	 * @device_name: Human-readable name of the device. Given by the driver.
+	 */
+	const char *device_name;
+	/** @node_name: Human-readable name of the node. Given by the driver. */
+	const char *node_name;
+	/** @type: Type of the node (enum drm_ras_node_type). */
+	enum drm_ras_node_type type;
+
+	/* Error-Counter Related Callback and Variables */
+
+	/** @error_counter_range: Range of valid Error IDs for this node. */
+	struct {
+		/** @first: First valid Error ID. */
+		u32 first;
+		/** @last: Last valid Error ID. Mandatory entry. */
+		u32 last;
+	} error_counter_range;
+
+	/**
+	 * @query_error_counter:
+	 *
+	 * This callback is used by drm-ras to query a specific error counter
+	 * supported by this node. It is used both for single-counter queries
+	 * and to iterate over all counters.
+	 *
+	 * Drivers should expect query_error_counter() to be called with
+	 * error_id from `error_counter_range.first` to
+	 * `error_counter_range.last`.
+	 *
+	 * The @query_error_counter is a mandatory callback for
+	 * error-counter nodes.
+	 *
+	 * Returns: 0 on success,
+	 *          -ENOENT when error_id is not supported as an indication that
+	 *                  drm_ras should silently skip this entry. Used for
+	 *                  supporting non-contiguous error ranges.
+	 *                  Driver is responsible for maintaining the list of
+	 *                  supported error IDs in the range of first to last.
+	 *          Other negative values on errors that should terminate the
+	 *          netlink query.
+	 */
+	int (*query_error_counter)(struct drm_ras_node *ep, u32 error_id,
+				   const char **name, u32 *val);
+
+	/** @priv: Driver private data */
+	void *priv;
+};
+
+struct drm_device;
+
+#if IS_ENABLED(CONFIG_DRM_RAS)
+int drm_ras_node_register(struct drm_ras_node *ep);
+void drm_ras_node_unregister(struct drm_ras_node *ep);
+#else
+static inline int drm_ras_node_register(struct drm_ras_node *ep) { return 0; }
+static inline void drm_ras_node_unregister(struct drm_ras_node *ep) { }
+#endif
+
+#endif
diff --git a/include/drm/drm_ras_genl_family.h b/include/drm/drm_ras_genl_family.h
new file mode 100644
index 000000000000..5931b53429f1
--- /dev/null
+++ b/include/drm/drm_ras_genl_family.h
@@ -0,0 +1,17 @@
+/* SPDX-License-Identifier: MIT */
+/*
+ * Copyright © 2025 Intel Corporation
+ */
+
+#ifndef __DRM_RAS_GENL_FAMILY_H__
+#define __DRM_RAS_GENL_FAMILY_H__
+
+#if IS_ENABLED(CONFIG_DRM_RAS)
+int drm_ras_genl_family_register(void);
+void drm_ras_genl_family_unregister(void);
+#else
+static inline int drm_ras_genl_family_register(void) { return 0; }
+static inline void drm_ras_genl_family_unregister(void) { }
+#endif
+
+#endif
diff --git a/include/drm/drm_ras_nl.h b/include/drm/drm_ras_nl.h
new file mode 100644
index 000000000000..9613b7d9ffdb
--- /dev/null
+++ b/include/drm/drm_ras_nl.h
@@ -0,0 +1,24 @@
+/* SPDX-License-Identifier: ((GPL-2.0 WITH Linux-syscall-note) OR BSD-3-Clause) */
+/* Do not edit directly, auto-generated from: */
+/*	Documentation/netlink/specs/drm_ras.yaml */
+/* YNL-GEN kernel header */
+
+#ifndef _LINUX_DRM_RAS_GEN_H
+#define _LINUX_DRM_RAS_GEN_H
+
+#include <net/netlink.h>
+#include <net/genetlink.h>
+
+#include <uapi/drm/drm_ras.h>
+#include <drm/drm_ras_nl.h>
+
+int drm_ras_nl_list_nodes_dumpit(struct sk_buff *skb,
+				 struct netlink_callback *cb);
+int drm_ras_nl_get_error_counters_dumpit(struct sk_buff *skb,
+					 struct netlink_callback *cb);
+int drm_ras_nl_query_error_counter_doit(struct sk_buff *skb,
+					struct genl_info *info);
+
+extern struct genl_family drm_ras_nl_family;
+
+#endif /* _LINUX_DRM_RAS_GEN_H */
diff --git a/include/uapi/drm/drm_ras.h b/include/uapi/drm/drm_ras.h
new file mode 100644
index 000000000000..3415ba345ac8
--- /dev/null
+++ b/include/uapi/drm/drm_ras.h
@@ -0,0 +1,49 @@
+/* SPDX-License-Identifier: ((GPL-2.0 WITH Linux-syscall-note) OR BSD-3-Clause) */
+/* Do not edit directly, auto-generated from: */
+/*	Documentation/netlink/specs/drm_ras.yaml */
+/* YNL-GEN uapi header */
+
+#ifndef _UAPI_LINUX_DRM_RAS_H
+#define _UAPI_LINUX_DRM_RAS_H
+
+#define DRM_RAS_FAMILY_NAME	"drm-ras"
+#define DRM_RAS_FAMILY_VERSION	1
+
+/*
+ * Type of the node. Currently, only error-counter nodes are supported, which
+ * expose reliability counters for a hardware/software component.
+ */
+enum drm_ras_node_type {
+	DRM_RAS_NODE_TYPE_ERROR_COUNTER = 1,
+};
+
+enum {
+	DRM_RAS_A_NODE_ATTRS_NODE_ID = 1,
+	DRM_RAS_A_NODE_ATTRS_DEVICE_NAME,
+	DRM_RAS_A_NODE_ATTRS_NODE_NAME,
+	DRM_RAS_A_NODE_ATTRS_NODE_TYPE,
+
+	__DRM_RAS_A_NODE_ATTRS_MAX,
+	DRM_RAS_A_NODE_ATTRS_MAX = (__DRM_RAS_A_NODE_ATTRS_MAX - 1)
+};
+
+enum {
+	DRM_RAS_A_ERROR_COUNTER_ATTRS_NODE_ID = 1,
+	DRM_RAS_A_ERROR_COUNTER_ATTRS_ERROR_ID,
+	DRM_RAS_A_ERROR_COUNTER_ATTRS_ERROR_NAME,
+	DRM_RAS_A_ERROR_COUNTER_ATTRS_ERROR_VALUE,
+
+	__DRM_RAS_A_ERROR_COUNTER_ATTRS_MAX,
+	DRM_RAS_A_ERROR_COUNTER_ATTRS_MAX = (__DRM_RAS_A_ERROR_COUNTER_ATTRS_MAX - 1)
+};
+
+enum {
+	DRM_RAS_CMD_LIST_NODES = 1,
+	DRM_RAS_CMD_GET_ERROR_COUNTERS,
+	DRM_RAS_CMD_QUERY_ERROR_COUNTER,
+
+	__DRM_RAS_CMD_MAX,
+	DRM_RAS_CMD_MAX = (__DRM_RAS_CMD_MAX - 1)
+};
+
+#endif /* _UAPI_LINUX_DRM_RAS_H */
-- 
2.47.1



* [PATCH v4 2/4] drm/xe/xe_drm_ras: Add support for drm ras
  2026-01-19  4:00 [PATCH v4 0/4] Introduce DRM_RAS using generic netlink for RAS Riana Tauro
                   ` (3 preceding siblings ...)
  2026-01-19  4:00 ` [PATCH v4 1/4] drm/ras: Introduce the DRM RAS infrastructure over generic netlink Riana Tauro
@ 2026-01-19  4:00 ` Riana Tauro
  2026-01-20 17:01   ` Raag Jadav
  2026-01-19  4:00 ` [PATCH v4 3/4] drm/xe/xe_hw_error: Add support for GT hardware errors Riana Tauro
                   ` (3 subsequent siblings)
  8 siblings, 1 reply; 22+ messages in thread
From: Riana Tauro @ 2026-01-19  4:00 UTC (permalink / raw)
  To: intel-xe, dri-devel
  Cc: aravind.iddamsetty, anshuman.gupta, rodrigo.vivi, joonas.lahtinen,
	simona.vetter, airlied, pratik.bari, joshua.santosh.ranjan,
	ashwin.kumar.kulkarni, shubham.kumar, ravi.kishore.koppuravuri,
	raag.jadav, Riana Tauro

Allocate correctable and uncorrectable nodes for every xe device.
Each node contains error classes, counters and the respective
query counter functions.

Add basic functionality to create and register drm nodes.
The operations below can be performed using the Generic Netlink DRM RAS
interface:

List Nodes:

$ sudo ynl --family drm_ras --dump list-nodes
[{'device-name': '0000:03:00.0',
  'node-id': 0,
  'node-name': 'correctable-errors',
  'node-type': 'error-counter'},
 {'device-name': '0000:03:00.0',
  'node-id': 1,
  'node-name': 'uncorrectable-errors',
  'node-type': 'error-counter'}]
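
The list-nodes reply above is printed as a Python literal, so a script can
turn it into a lookup table before issuing further queries. The following is
only a minimal sketch under that assumption: the sample text is copied from
the output above, and the helper name node_ids_by_name is hypothetical and
not part of ynl.

```python
import ast

# Sample reply copied from the list-nodes output above.
LIST_NODES_OUTPUT = """
[{'device-name': '0000:03:00.0',
  'node-id': 0,
  'node-name': 'correctable-errors',
  'node-type': 'error-counter'},
 {'device-name': '0000:03:00.0',
  'node-id': 1,
  'node-name': 'uncorrectable-errors',
  'node-type': 'error-counter'}]
"""


def node_ids_by_name(reply_text):
    """Map (device-name, node-name) to node-id (hypothetical helper)."""
    nodes = ast.literal_eval(reply_text.strip())
    return {(n['device-name'], n['node-name']): n['node-id'] for n in nodes}


if __name__ == "__main__":
    for key, node_id in node_ids_by_name(LIST_NODES_OUTPUT).items():
        print(key, node_id)
```

Keying on both the device name and the node name keeps the lookup unambiguous
when several devices register nodes with the same name.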

Get Error counters:

$ sudo ynl --family drm_ras --dump get-error-counters --json '{"node-id":0}'
[{'error-id': 1, 'error-name': 'GT', 'error-value': 0},
 {'error-id': 2, 'error-name': 'SoC', 'error-value': 0}]

Query Error counter:

$ sudo ynl --family drm_ras --do query-error-counter --json '{"node-id":0, "error-id":1}'
{'error-id': 1, 'error-name': 'GT', 'error-value': 0}
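
Because node IDs are assigned at registration time, a script would likely
fill in the --json argument rather than hard-code it. The following is a
minimal sketch that reuses the ynl command line shown above; the helper name
query_counter_cmd is hypothetical, and the command is only printed, not
executed.

```python
import json
import shlex


def query_counter_cmd(node_id, error_id):
    """Build the argv for a query-error-counter request (hypothetical helper)."""
    request = {"node-id": node_id, "error-id": error_id}
    return ["sudo", "ynl", "--family", "drm_ras", "--do",
            "query-error-counter", "--json", json.dumps(request)]


if __name__ == "__main__":
    # Print a shell-quoted command line; the list could instead be handed to
    # subprocess.run() on a system that provides the drm-ras family.
    print(shlex.join(query_counter_cmd(0, 1)))
```

Building the request with json.dumps() avoids quoting mistakes when the IDs
come from an earlier list-nodes or get-error-counters reply.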

Signed-off-by: Riana Tauro <riana.tauro@intel.com>
---
v2: Add IDs and names as uAPI (Rodrigo)
    Add documentation
    Modify commit message

v3: remove 'error' from counters
    use drmm_kcalloc
    add a for_each for severity
    differentiate error classes and severity in uapi
    Use GT instead of Core Compute(Raag)
    Use correctable and uncorrectable in uapi (Pratik / Aravind)
---
 drivers/gpu/drm/xe/Makefile           |   1 +
 drivers/gpu/drm/xe/xe_device_types.h  |   4 +
 drivers/gpu/drm/xe/xe_drm_ras.c       | 176 ++++++++++++++++++++++++++
 drivers/gpu/drm/xe/xe_drm_ras.h       |  15 +++
 drivers/gpu/drm/xe/xe_drm_ras_types.h |  49 +++++++
 drivers/gpu/drm/xe/xe_hw_error.c      |  63 ++++-----
 include/uapi/drm/xe_drm.h             |  79 ++++++++++++
 7 files changed, 358 insertions(+), 29 deletions(-)
 create mode 100644 drivers/gpu/drm/xe/xe_drm_ras.c
 create mode 100644 drivers/gpu/drm/xe/xe_drm_ras.h
 create mode 100644 drivers/gpu/drm/xe/xe_drm_ras_types.h

diff --git a/drivers/gpu/drm/xe/Makefile b/drivers/gpu/drm/xe/Makefile
index b39cbb756232..b25564649492 100644
--- a/drivers/gpu/drm/xe/Makefile
+++ b/drivers/gpu/drm/xe/Makefile
@@ -41,6 +41,7 @@ xe-y += xe_bb.o \
 	xe_device_sysfs.o \
 	xe_dma_buf.o \
 	xe_drm_client.o \
+	xe_drm_ras.o \
 	xe_eu_stall.o \
 	xe_exec.o \
 	xe_exec_queue.o \
diff --git a/drivers/gpu/drm/xe/xe_device_types.h b/drivers/gpu/drm/xe/xe_device_types.h
index 34feef79fa4e..2e863fcb2f08 100644
--- a/drivers/gpu/drm/xe/xe_device_types.h
+++ b/drivers/gpu/drm/xe/xe_device_types.h
@@ -13,6 +13,7 @@
 #include <drm/ttm/ttm_device.h>
 
 #include "xe_devcoredump_types.h"
+#include "xe_drm_ras_types.h"
 #include "xe_heci_gsc.h"
 #include "xe_late_bind_fw_types.h"
 #include "xe_lmtt_types.h"
@@ -674,6 +675,9 @@ struct xe_device {
 	/** @pmu: performance monitoring unit */
 	struct xe_pmu pmu;
 
+	/** @ras: RAS structure for device */
+	struct xe_drm_ras ras;
+
 	/** @i2c: I2C host controller */
 	struct xe_i2c *i2c;
 
diff --git a/drivers/gpu/drm/xe/xe_drm_ras.c b/drivers/gpu/drm/xe/xe_drm_ras.c
new file mode 100644
index 000000000000..a665f53ac191
--- /dev/null
+++ b/drivers/gpu/drm/xe/xe_drm_ras.c
@@ -0,0 +1,176 @@
+// SPDX-License-Identifier: MIT
+/*
+ * Copyright © 2026 Intel Corporation
+ */
+
+#include <drm/drm_managed.h>
+#include <drm/drm_print.h>
+#include <drm/drm_ras.h>
+#include <linux/bitmap.h>
+
+#include "xe_device_types.h"
+#include "xe_drm_ras.h"
+
+static const char * const errors[] = DRM_XE_RAS_ERROR_CLASS_NAMES;
+static const char * const error_severity[] = DRM_XE_RAS_ERROR_SEVERITY_NAMES;
+
+static int hw_query_error_counter(struct xe_drm_ras_counter *info,
+				  u32 error_id, const char **name, u32 *val)
+{
+	if (error_id < DRM_XE_RAS_ERROR_CLASS_GT || error_id >= DRM_XE_RAS_ERROR_CLASS_MAX)
+		return -EINVAL;
+
+	if (!info[error_id].name)
+		return -ENOENT;
+
+	*name = info[error_id].name;
+	*val = atomic64_read(&info[error_id].counter);
+
+	return 0;
+}
+
+static int query_uncorrectable_error_counters(struct drm_ras_node *ep,
+					      u32 error_id, const char **name,
+					      u32 *val)
+{
+	struct xe_device *xe = ep->priv;
+	struct xe_drm_ras *ras = &xe->ras;
+	struct xe_drm_ras_counter *info = ras->info[DRM_XE_RAS_ERROR_SEVERITY_UNCORRECTABLE];
+
+	return hw_query_error_counter(info, error_id, name, val);
+}
+
+static int query_correctable_error_counters(struct drm_ras_node *ep,
+					    u32 error_id, const char **name,
+					    u32 *val)
+{
+	struct xe_device *xe = ep->priv;
+	struct xe_drm_ras *ras = &xe->ras;
+	struct xe_drm_ras_counter *info = ras->info[DRM_XE_RAS_ERROR_SEVERITY_CORRECTABLE];
+
+	return hw_query_error_counter(info, error_id, name, val);
+}
+
+static struct xe_drm_ras_counter *allocate_and_copy_counters(struct xe_device *xe)
+{
+	struct xe_drm_ras_counter *counter;
+	int i;
+
+	counter = drmm_kcalloc(&xe->drm, DRM_XE_RAS_ERROR_CLASS_MAX,
+			       sizeof(struct xe_drm_ras_counter), GFP_KERNEL);
+	if (!counter)
+		return ERR_PTR(-ENOMEM);
+
+	for (i = 0; i < DRM_XE_RAS_ERROR_CLASS_MAX; i++) {
+		if (!errors[i])
+			continue;
+
+		counter[i].name = errors[i];
+		atomic64_set(&counter[i].counter, 0);
+	}
+
+	return counter;
+}
+
+static int assign_node_params(struct xe_device *xe, struct drm_ras_node *node,
+			      const enum drm_xe_ras_error_severity severity)
+{
+	struct pci_dev *pdev = to_pci_dev(xe->drm.dev);
+	struct xe_drm_ras *ras = &xe->ras;
+	const char *device_name;
+
+	device_name = kasprintf(GFP_KERNEL, "%04x:%02x:%02x.%d",
+				pci_domain_nr(pdev->bus), pdev->bus->number,
+				PCI_SLOT(pdev->devfn), PCI_FUNC(pdev->devfn));
+
+	node->device_name = device_name;
+	node->node_name = error_severity[severity];
+	node->type = DRM_RAS_NODE_TYPE_ERROR_COUNTER;
+	node->error_counter_range.first = DRM_XE_RAS_ERROR_CLASS_GT;
+	node->error_counter_range.last = DRM_XE_RAS_ERROR_CLASS_MAX - 1;
+	node->priv = xe;
+
+	ras->info[severity] = allocate_and_copy_counters(xe);
+	if (IS_ERR(ras->info[severity]))
+		return PTR_ERR(ras->info[severity]);
+
+	if (severity == DRM_XE_RAS_ERROR_SEVERITY_CORRECTABLE)
+		node->query_error_counter = query_correctable_error_counters;
+	else
+		node->query_error_counter = query_uncorrectable_error_counters;
+
+	return 0;
+}
+
+static int register_nodes(struct xe_device *xe)
+{
+	struct xe_drm_ras *ras = &xe->ras;
+	int i;
+
+	for_each_error_severity(i) {
+		struct drm_ras_node *node = &ras->node[i];
+		int ret;
+
+		ret = assign_node_params(xe, node, i);
+		if (ret)
+			return ret;
+
+		ret = drm_ras_node_register(node);
+		if (ret)
+			return ret;
+	}
+
+	return 0;
+}
+
+static void xe_drm_ras_unregister_nodes(void *arg)
+{
+	struct xe_device *xe = arg;
+	struct xe_drm_ras *ras = &xe->ras;
+	int i;
+
+	for_each_error_severity(i) {
+		struct drm_ras_node *node = &ras->node[i];
+
+		drm_ras_node_unregister(node);
+
+		if (i == 0)
+			kfree(node->device_name);
+	}
+}
+
+/**
+ * xe_drm_ras_allocate_nodes - Allocate DRM RAS nodes
+ * @xe: xe device instance
+ *
+ * Allocate and register DRM RAS nodes per device
+ *
+ * Return: 0 on success, error code on failure
+ */
+int xe_drm_ras_allocate_nodes(struct xe_device *xe)
+{
+	struct xe_drm_ras *ras = &xe->ras;
+	struct drm_ras_node *node;
+	int err;
+
+	node = drmm_kcalloc(&xe->drm, DRM_XE_RAS_ERROR_SEVERITY_MAX, sizeof(struct drm_ras_node),
+			    GFP_KERNEL);
+	if (!node)
+		return -ENOMEM;
+
+	ras->node = node;
+
+	err = register_nodes(xe);
+	if (err) {
+		drm_err(&xe->drm, "Failed to register drm ras node\n");
+		return err;
+	}
+
+	err = devm_add_action_or_reset(xe->drm.dev, xe_drm_ras_unregister_nodes, xe);
+	if (err) {
+		drm_err(&xe->drm, "Failed to add action for xe drm_ras\n");
+		return err;
+	}
+
+	return 0;
+}
diff --git a/drivers/gpu/drm/xe/xe_drm_ras.h b/drivers/gpu/drm/xe/xe_drm_ras.h
new file mode 100644
index 000000000000..2d714342e4e5
--- /dev/null
+++ b/drivers/gpu/drm/xe/xe_drm_ras.h
@@ -0,0 +1,15 @@
+/* SPDX-License-Identifier: MIT */
+/*
+ * Copyright © 2026 Intel Corporation
+ */
+#ifndef XE_DRM_RAS_H_
+#define XE_DRM_RAS_H_
+
+struct xe_device;
+
+#define for_each_error_severity(i)	\
+	for (i = 0; i < DRM_XE_RAS_ERROR_SEVERITY_MAX; i++)
+
+int xe_drm_ras_allocate_nodes(struct xe_device *xe);
+
+#endif
diff --git a/drivers/gpu/drm/xe/xe_drm_ras_types.h b/drivers/gpu/drm/xe/xe_drm_ras_types.h
new file mode 100644
index 000000000000..528c708e57da
--- /dev/null
+++ b/drivers/gpu/drm/xe/xe_drm_ras_types.h
@@ -0,0 +1,49 @@
+/* SPDX-License-Identifier: MIT */
+/*
+ * Copyright © 2026 Intel Corporation
+ */
+
+#ifndef _XE_DRM_RAS_TYPES_H_
+#define _XE_DRM_RAS_TYPES_H_
+
+#include <drm/xe_drm.h>
+#include <linux/atomic.h>
+
+struct drm_ras_node;
+
+/* Error categories reported by hardware */
+enum hardware_error {
+	HARDWARE_ERROR_CORRECTABLE = 0,
+	HARDWARE_ERROR_NONFATAL = 1,
+	HARDWARE_ERROR_FATAL = 2,
+	HARDWARE_ERROR_MAX,
+};
+
+/**
+ * struct xe_drm_ras_counter - XE RAS counter
+ *
+ * This structure contains error class and counter information
+ */
+struct xe_drm_ras_counter {
+	/** @name: error class name */
+	const char *name;
+
+	/** @counter: count of error */
+	atomic64_t counter;
+};
+
+/**
+ * struct xe_drm_ras - XE DRM RAS structure
+ *
+ * This structure has details of error counters
+ */
+struct xe_drm_ras {
+	/** @node: DRM RAS node */
+	struct drm_ras_node *node;
+
+	/** @info: info array for all types of errors */
+	struct xe_drm_ras_counter *info[DRM_XE_RAS_ERROR_SEVERITY_MAX];
+
+};
+
+#endif
diff --git a/drivers/gpu/drm/xe/xe_hw_error.c b/drivers/gpu/drm/xe/xe_hw_error.c
index 8c65291f36fc..b42495d3015a 100644
--- a/drivers/gpu/drm/xe/xe_hw_error.c
+++ b/drivers/gpu/drm/xe/xe_hw_error.c
@@ -10,20 +10,14 @@
 #include "regs/xe_irq_regs.h"
 
 #include "xe_device.h"
+#include "xe_drm_ras.h"
 #include "xe_hw_error.h"
 #include "xe_mmio.h"
 #include "xe_survivability_mode.h"
 
 #define  HEC_UNCORR_FW_ERR_BITS 4
 extern struct fault_attr inject_csc_hw_error;
-
-/* Error categories reported by hardware */
-enum hardware_error {
-	HARDWARE_ERROR_CORRECTABLE = 0,
-	HARDWARE_ERROR_NONFATAL = 1,
-	HARDWARE_ERROR_FATAL = 2,
-	HARDWARE_ERROR_MAX,
-};
+static const char * const error_severity[] = DRM_XE_RAS_ERROR_SEVERITY_NAMES;
 
 static const char * const hec_uncorrected_fw_errors[] = {
 	"Fatal",
@@ -32,23 +26,17 @@ static const char * const hec_uncorrected_fw_errors[] = {
 	"Data Corruption"
 };
 
-static const char *hw_error_to_str(const enum hardware_error hw_err)
+static bool fault_inject_csc_hw_error(void)
 {
-	switch (hw_err) {
-	case HARDWARE_ERROR_CORRECTABLE:
-		return "CORRECTABLE";
-	case HARDWARE_ERROR_NONFATAL:
-		return "NONFATAL";
-	case HARDWARE_ERROR_FATAL:
-		return "FATAL";
-	default:
-		return "UNKNOWN";
-	}
+	return IS_ENABLED(CONFIG_DEBUG_FS) && should_fail(&inject_csc_hw_error, 1);
 }
 
-static bool fault_inject_csc_hw_error(void)
+static enum drm_xe_ras_error_severity hw_err_to_severity(enum hardware_error hw_err)
 {
-	return IS_ENABLED(CONFIG_DEBUG_FS) && should_fail(&inject_csc_hw_error, 1);
+	if (hw_err == HARDWARE_ERROR_CORRECTABLE)
+		return DRM_XE_RAS_ERROR_SEVERITY_CORRECTABLE;
+
+	return DRM_XE_RAS_ERROR_SEVERITY_UNCORRECTABLE;
 }
 
 static void csc_hw_error_work(struct work_struct *work)
@@ -64,7 +52,8 @@ static void csc_hw_error_work(struct work_struct *work)
 
 static void csc_hw_error_handler(struct xe_tile *tile, const enum hardware_error hw_err)
 {
-	const char *hw_err_str = hw_error_to_str(hw_err);
+	const enum drm_xe_ras_error_severity severity = hw_err_to_severity(hw_err);
+	const char *severity_str = error_severity[severity];
 	struct xe_device *xe = tile_to_xe(tile);
 	struct xe_mmio *mmio = &tile->mmio;
 	u32 base, err_bit, err_src;
@@ -77,8 +66,8 @@ static void csc_hw_error_handler(struct xe_tile *tile, const enum hardware_error
 	lockdep_assert_held(&xe->irq.lock);
 	err_src = xe_mmio_read32(mmio, HEC_UNCORR_ERR_STATUS(base));
 	if (!err_src) {
-		drm_err_ratelimited(&xe->drm, HW_ERR "Tile%d reported HEC_ERR_STATUS_%s blank\n",
-				    tile->id, hw_err_str);
+		drm_err_ratelimited(&xe->drm, HW_ERR "Tile%d reported %s HEC_ERR_STATUS register blank\n",
+				    tile->id, severity_str);
 		return;
 	}
 
@@ -86,8 +75,8 @@ static void csc_hw_error_handler(struct xe_tile *tile, const enum hardware_error
 		fw_err = xe_mmio_read32(mmio, HEC_UNCORR_FW_ERR_DW0(base));
 		for_each_set_bit(err_bit, &fw_err, HEC_UNCORR_FW_ERR_BITS) {
 			drm_err_ratelimited(&xe->drm, HW_ERR
-					    "%s: HEC Uncorrected FW %s error reported, bit[%d] is set\n",
-					     hw_err_str, hec_uncorrected_fw_errors[err_bit],
+					    "HEC FW %s error reported, bit[%d] is set\n",
+					     hec_uncorrected_fw_errors[err_bit],
 					     err_bit);
 
 			schedule_work(&tile->csc_hw_error_work);
@@ -99,7 +88,8 @@ static void csc_hw_error_handler(struct xe_tile *tile, const enum hardware_error
 
 static void hw_error_source_handler(struct xe_tile *tile, const enum hardware_error hw_err)
 {
-	const char *hw_err_str = hw_error_to_str(hw_err);
+	const enum drm_xe_ras_error_severity severity = hw_err_to_severity(hw_err);
+	const char *severity_str = error_severity[severity];
 	struct xe_device *xe = tile_to_xe(tile);
 	unsigned long flags;
 	u32 err_src;
@@ -110,8 +100,8 @@ static void hw_error_source_handler(struct xe_tile *tile, const enum hardware_er
 	spin_lock_irqsave(&xe->irq.lock, flags);
 	err_src = xe_mmio_read32(&tile->mmio, DEV_ERR_STAT_REG(hw_err));
 	if (!err_src) {
-		drm_err_ratelimited(&xe->drm, HW_ERR "Tile%d reported DEV_ERR_STAT_%s blank!\n",
-				    tile->id, hw_err_str);
+		drm_err_ratelimited(&xe->drm, HW_ERR "Tile%d reported %s DEV_ERR_STAT register blank!\n",
+				    tile->id, severity_str);
 		goto unlock;
 	}
 
@@ -146,6 +136,20 @@ void xe_hw_error_irq_handler(struct xe_tile *tile, const u32 master_ctl)
 			hw_error_source_handler(tile, hw_err);
 }
 
+static int hw_error_info_init(struct xe_device *xe)
+{
+	int ret;
+
+	if (xe->info.platform != XE_PVC)
+		return 0;
+
+	ret = xe_drm_ras_allocate_nodes(xe);
+	if (ret)
+		return ret;
+
+	return 0;
+}
+
 /*
  * Process hardware errors during boot
  */
@@ -178,5 +182,6 @@ void xe_hw_error_init(struct xe_device *xe)
 
 	INIT_WORK(&tile->csc_hw_error_work, csc_hw_error_work);
 
+	hw_error_info_init(xe);
 	process_hw_errors(xe);
 }
diff --git a/include/uapi/drm/xe_drm.h b/include/uapi/drm/xe_drm.h
index 077e66a682e2..5a08d46f29a7 100644
--- a/include/uapi/drm/xe_drm.h
+++ b/include/uapi/drm/xe_drm.h
@@ -2357,6 +2357,85 @@ struct drm_xe_exec_queue_set_property {
 	__u64 reserved[2];
 };
 
+/**
+ * DOC: Xe DRM RAS
+ *
+ * The enums and strings defined below map to the attributes of the DRM RAS Netlink Interface.
+ * Refer to Documentation/netlink/specs/drm_ras.yaml for complete interface specification.
+ *
+ * Node Registration
+ * =================
+ *
+ * The driver registers DRM RAS nodes for each error severity level.
+ * enum drm_xe_ras_error_severity defines the node-id, while DRM_XE_RAS_ERROR_SEVERITY_NAMES maps
+ * node-id to node-name.
+ *
+ * Error Classification
+ * ====================
+ *
+ * Each node contains a list of error counters. Each error is identified by an error-id and
+ * an error-name. enum drm_xe_ras_error_class defines the error-id, while
+ * DRM_XE_RAS_ERROR_CLASS_NAMES maps error-id to error-name.
+ *
+ * User Interface
+ * ==============
+ *
+ * To retrieve error values of an error counter, userspace applications should
+ * follow the below steps:
+ *
+ * 1. Use command LIST_NODES to enumerate all available nodes
+ * 2. Select node by node-id or node-name
+ * 3. Use command GET_ERROR_COUNTERS to list errors of specific node
+ * 4. Query specific error values using either error-id or error-name
+ *
+ * .. code-block:: C
+ *
+ *	// Lookup tables for ID-to-name resolution
+ *	static const char *nodes[] = DRM_XE_RAS_ERROR_SEVERITY_NAMES;
+ *	static const char *errors[] = DRM_XE_RAS_ERROR_CLASS_NAMES;
+ *
+ */
+
+/**
+ * enum drm_xe_ras_error_severity - DRM RAS error severity.
+ */
+enum drm_xe_ras_error_severity {
+	/** @DRM_XE_RAS_ERROR_SEVERITY_CORRECTABLE: Correctable Error */
+	DRM_XE_RAS_ERROR_SEVERITY_CORRECTABLE = 0,
+	/** @DRM_XE_RAS_ERROR_SEVERITY_UNCORRECTABLE: Uncorrectable Error */
+	DRM_XE_RAS_ERROR_SEVERITY_UNCORRECTABLE,
+	/** @DRM_XE_RAS_ERROR_SEVERITY_MAX: Max severity */
+	DRM_XE_RAS_ERROR_SEVERITY_MAX /* non-ABI */
+};
+
+/**
+ * enum drm_xe_ras_error_class - DRM RAS error classes.
+ */
+enum drm_xe_ras_error_class {
+	/** @DRM_XE_RAS_ERROR_CLASS_GT: GT Error */
+	DRM_XE_RAS_ERROR_CLASS_GT = 1,
+	/** @DRM_XE_RAS_ERROR_CLASS_SOC: SoC Error */
+	DRM_XE_RAS_ERROR_CLASS_SOC,
+	/** @DRM_XE_RAS_ERROR_CLASS_MAX: Max Error */
+	DRM_XE_RAS_ERROR_CLASS_MAX	/* non-ABI */
+};
+
+/*
+ * Error severity to name mapping.
+ */
+#define DRM_XE_RAS_ERROR_SEVERITY_NAMES {					\
+	[DRM_XE_RAS_ERROR_SEVERITY_CORRECTABLE] = "correctable-errors",		\
+	[DRM_XE_RAS_ERROR_SEVERITY_UNCORRECTABLE] = "uncorrectable-errors",	\
+}
+
+/*
+ * Error class to name mapping.
+ */
+#define DRM_XE_RAS_ERROR_CLASS_NAMES {					\
+	[DRM_XE_RAS_ERROR_CLASS_GT] = "GT",				\
+	[DRM_XE_RAS_ERROR_CLASS_SOC] = "SoC"				\
+}
+
 #if defined(__cplusplus)
 }
 #endif
-- 
2.47.1
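
For readers following the "User Interface" steps in the DOC comment above, here is a minimal userspace sketch of the ID-to-name resolution. It assumes, as the DOC comment states, that node-id and error-id index directly into the two name tables. The enums and macros are local copies of the uAPI definitions from this patch so that the sketch builds without the kernel headers; the helpers ras_node_name() and ras_error_name() and the ARRAY_LEN macro are hypothetical names used only for illustration.

```c
#include <stddef.h>

/* Local copies of the uAPI definitions added by this patch. */
enum drm_xe_ras_error_severity {
	DRM_XE_RAS_ERROR_SEVERITY_CORRECTABLE = 0,
	DRM_XE_RAS_ERROR_SEVERITY_UNCORRECTABLE,
	DRM_XE_RAS_ERROR_SEVERITY_MAX
};

enum drm_xe_ras_error_class {
	DRM_XE_RAS_ERROR_CLASS_GT = 1,
	DRM_XE_RAS_ERROR_CLASS_SOC,
	DRM_XE_RAS_ERROR_CLASS_MAX
};

#define DRM_XE_RAS_ERROR_SEVERITY_NAMES {					\
	[DRM_XE_RAS_ERROR_SEVERITY_CORRECTABLE] = "correctable-errors",		\
	[DRM_XE_RAS_ERROR_SEVERITY_UNCORRECTABLE] = "uncorrectable-errors",	\
}

#define DRM_XE_RAS_ERROR_CLASS_NAMES {					\
	[DRM_XE_RAS_ERROR_CLASS_GT] = "GT",				\
	[DRM_XE_RAS_ERROR_CLASS_SOC] = "SoC"				\
}

/* Lookup tables for ID-to-name resolution, as in the DOC comment. */
static const char * const nodes[] = DRM_XE_RAS_ERROR_SEVERITY_NAMES;
static const char * const errors[] = DRM_XE_RAS_ERROR_CLASS_NAMES;

/* Hypothetical helper macro: number of elements in a static array. */
#define ARRAY_LEN(a) (sizeof(a) / sizeof((a)[0]))

/* Hypothetical helper: node-name for a node-id, or NULL if out of range. */
const char *ras_node_name(unsigned int node_id)
{
	return node_id < ARRAY_LEN(nodes) ? nodes[node_id] : NULL;
}

/*
 * Hypothetical helper: error-name for an error-id, or NULL if unknown.
 * Error classes start at 1, so slot 0 of the table is a NULL hole.
 */
const char *ras_error_name(unsigned int error_id)
{
	return error_id < ARRAY_LEN(errors) ? errors[error_id] : NULL;
}
```

Because the error classes start at 1, a caller has to treat a NULL result as "no such error" rather than printing it; this mirrors the -ENOENT path in hw_query_error_counter().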



* [PATCH v4 3/4] drm/xe/xe_hw_error: Add support for GT hardware errors
  2026-01-19  4:00 [PATCH v4 0/4] Introduce DRM_RAS using generic netlink for RAS Riana Tauro
                   ` (4 preceding siblings ...)
  2026-01-19  4:00 ` [PATCH v4 2/4] drm/xe/xe_drm_ras: Add support for drm ras Riana Tauro
@ 2026-01-19  4:00 ` Riana Tauro
  2026-01-19  9:06   ` kernel test robot
  2026-01-21  7:09   ` Raag Jadav
  2026-01-19  4:00 ` [PATCH v4 4/4] drm/xe/xe_hw_error: Add support for PVC SOC errors Riana Tauro
                   ` (2 subsequent siblings)
  8 siblings, 2 replies; 22+ messages in thread
From: Riana Tauro @ 2026-01-19  4:00 UTC (permalink / raw)
  To: intel-xe, dri-devel
  Cc: aravind.iddamsetty, anshuman.gupta, rodrigo.vivi, joonas.lahtinen,
	simona.vetter, airlied, pratik.bari, joshua.santosh.ranjan,
	ashwin.kumar.kulkarni, shubham.kumar, ravi.kishore.koppuravuri,
	raag.jadav, Riana Tauro, Himal Prasad Ghimiray

PVC supports GT error reporting via vector registers along with the
error status register. Add support to report these errors and
update the respective counters. In case of a Subslice error reported
by a vector register, process the error status register
for applicable bits.

Incorporate the counters inside the driver itself and start
using the drm_ras generic netlink interface to report them.

Co-developed-by: Himal Prasad Ghimiray <himal.prasad.ghimiray@intel.com>
Signed-off-by: Himal Prasad Ghimiray <himal.prasad.ghimiray@intel.com>
Signed-off-by: Riana Tauro <riana.tauro@intel.com>
---
v2: Add ID's and names as uAPI (Rodrigo)

v3: use REG_BIT
    do not use _ffs
    use a single function for GT errors
    remove redundant errors from logs (Raag)
    use only correctable/uncorrectable error severity (Pratik/Aravind)
---
 drivers/gpu/drm/xe/regs/xe_hw_error_regs.h |  53 +++++-
 drivers/gpu/drm/xe/xe_hw_error.c           | 182 +++++++++++++++++++--
 2 files changed, 220 insertions(+), 15 deletions(-)

diff --git a/drivers/gpu/drm/xe/regs/xe_hw_error_regs.h b/drivers/gpu/drm/xe/regs/xe_hw_error_regs.h
index c146b9ef44eb..5eeb0be27300 100644
--- a/drivers/gpu/drm/xe/regs/xe_hw_error_regs.h
+++ b/drivers/gpu/drm/xe/regs/xe_hw_error_regs.h
@@ -6,15 +6,60 @@
 #ifndef _XE_HW_ERROR_REGS_H_
 #define _XE_HW_ERROR_REGS_H_
 
-#define HEC_UNCORR_ERR_STATUS(base)                    XE_REG((base) + 0x118)
-#define    UNCORR_FW_REPORTED_ERR                      BIT(6)
+#define HEC_UNCORR_ERR_STATUS(base)		XE_REG((base) + 0x118)
+#define   UNCORR_FW_REPORTED_ERR		REG_BIT(6)
 
-#define HEC_UNCORR_FW_ERR_DW0(base)                    XE_REG((base) + 0x124)
+#define HEC_UNCORR_FW_ERR_DW0(base)		XE_REG((base) + 0x124)
+
+#define ERR_STAT_GT_COR				0x100160
+#define   EU_GRF_COR_ERR			REG_BIT(15)
+#define   EU_IC_COR_ERR				REG_BIT(14)
+#define   SLM_COR_ERR				REG_BIT(13)
+#define   GUC_COR_ERR				REG_BIT(1)
+
+#define ERR_STAT_GT_NONFATAL			0x100164
+#define ERR_STAT_GT_FATAL			0x100168
+#define   EU_GRF_FAT_ERR			REG_BIT(15)
+#define   SLM_FAT_ERR				REG_BIT(13)
+#define   GUC_FAT_ERR				REG_BIT(6)
+#define   FPU_FAT_ERR				REG_BIT(3)
+
+#define ERR_STAT_GT_REG(x)			XE_REG(_PICK_EVEN((x), \
+								  ERR_STAT_GT_COR, \
+								  ERR_STAT_GT_NONFATAL))
+
+#define PVC_COR_ERR_MASK			(GUC_COR_ERR | SLM_COR_ERR | EU_IC_COR_ERR | \
+						 EU_GRF_COR_ERR)
+
+#define PVC_FAT_ERR_MASK			(FPU_FAT_ERR | GUC_FAT_ERR | EU_GRF_FAT_ERR | \
+						 SLM_FAT_ERR)
 
 #define DEV_ERR_STAT_NONFATAL			0x100178
 #define DEV_ERR_STAT_CORRECTABLE		0x10017c
 #define DEV_ERR_STAT_REG(x)			XE_REG(_PICK_EVEN((x), \
 								  DEV_ERR_STAT_CORRECTABLE, \
 								  DEV_ERR_STAT_NONFATAL))
-#define   XE_CSC_ERROR				BIT(17)
+
+#define   XE_CSC_ERROR				17
+#define   XE_GT_ERROR				0
+
+#define ERR_STAT_GT_FATAL_VECTOR_0		0x100260
+#define ERR_STAT_GT_FATAL_VECTOR_1		0x100264
+
+#define ERR_STAT_GT_FATAL_VECTOR_REG(x)		XE_REG(_PICK_EVEN((x), \
+								  ERR_STAT_GT_FATAL_VECTOR_0, \
+								  ERR_STAT_GT_FATAL_VECTOR_1))
+
+#define ERR_STAT_GT_COR_VECTOR_0		0x1002a0
+#define ERR_STAT_GT_COR_VECTOR_1		0x1002a4
+
+#define ERR_STAT_GT_COR_VECTOR_REG(x)		XE_REG(_PICK_EVEN((x), \
+								  ERR_STAT_GT_COR_VECTOR_0, \
+								  ERR_STAT_GT_COR_VECTOR_1))
+#define ERR_STAT_GT_COR_VECTOR_LEN		4
+
+#define ERR_STAT_GT_VECTOR_REG(hw_err, x)	(hw_err == HARDWARE_ERROR_CORRECTABLE ? \
+						ERR_STAT_GT_COR_VECTOR_REG(x) : \
+						ERR_STAT_GT_FATAL_VECTOR_REG(x))
+
 #endif
diff --git a/drivers/gpu/drm/xe/xe_hw_error.c b/drivers/gpu/drm/xe/xe_hw_error.c
index b42495d3015a..bd0cf61741ca 100644
--- a/drivers/gpu/drm/xe/xe_hw_error.c
+++ b/drivers/gpu/drm/xe/xe_hw_error.c
@@ -3,6 +3,7 @@
  * Copyright © 2025 Intel Corporation
  */
 
+#include <linux/bitmap.h>
 #include <linux/fault-inject.h>
 
 #include "regs/xe_gsc_regs.h"
@@ -15,7 +16,10 @@
 #include "xe_mmio.h"
 #include "xe_survivability_mode.h"
 
-#define  HEC_UNCORR_FW_ERR_BITS 4
+#define  GT_HW_ERROR_MAX_ERR_BITS	16
+#define  HEC_UNCORR_FW_ERR_BITS 	4
+#define  XE_RAS_REG_SIZE		32
+
 extern struct fault_attr inject_csc_hw_error;
 static const char * const error_severity[] = DRM_XE_RAS_ERROR_SEVERITY_NAMES;
 
@@ -26,10 +30,21 @@ static const char * const hec_uncorrected_fw_errors[] = {
 	"Data Corruption"
 };
 
-static bool fault_inject_csc_hw_error(void)
-{
-	return IS_ENABLED(CONFIG_DEBUG_FS) && should_fail(&inject_csc_hw_error, 1);
-}
+static const unsigned long xe_hw_error_map[] = {
+	[XE_GT_ERROR] = DRM_XE_RAS_ERROR_CLASS_GT,
+};
+
+enum gt_vector_regs {
+	ERR_STAT_GT_VECTOR0 = 0,
+	ERR_STAT_GT_VECTOR1,
+	ERR_STAT_GT_VECTOR2,
+	ERR_STAT_GT_VECTOR3,
+	ERR_STAT_GT_VECTOR4,
+	ERR_STAT_GT_VECTOR5,
+	ERR_STAT_GT_VECTOR6,
+	ERR_STAT_GT_VECTOR7,
+	ERR_STAT_GT_VECTOR_MAX,
+};
 
 static enum drm_xe_ras_error_severity hw_err_to_severity(enum hardware_error hw_err)
 {
@@ -39,6 +54,11 @@ static enum drm_xe_ras_error_severity hw_err_to_severity(enum hardware_error hw_
 	return DRM_XE_RAS_ERROR_SEVERITY_UNCORRECTABLE;
 }
 
+static bool fault_inject_csc_hw_error(void)
+{
+	return IS_ENABLED(CONFIG_DEBUG_FS) && should_fail(&inject_csc_hw_error, 1);
+}
+
 static void csc_hw_error_work(struct work_struct *work)
 {
 	struct xe_tile *tile = container_of(work, typeof(*tile), csc_hw_error_work);
@@ -86,15 +106,121 @@ static void csc_hw_error_handler(struct xe_tile *tile, const enum hardware_error
 	xe_mmio_write32(mmio, HEC_UNCORR_ERR_STATUS(base), err_src);
 }
 
+static void log_hw_error(struct xe_tile *tile, const char *name,
+			 const enum drm_xe_ras_error_severity severity)
+{
+	const char *severity_str = error_severity[severity];
+	struct xe_device *xe = tile_to_xe(tile);
+
+	if (severity == DRM_XE_RAS_ERROR_SEVERITY_UNCORRECTABLE)
+		drm_err_ratelimited(&xe->drm, "%s %s detected\n", name, severity_str);
+	else
+		drm_warn(&xe->drm, "%s %s detected\n", name, severity_str);
+}
+
+static void log_gt_err(struct xe_tile *tile, const char *name, int i, u32 err,
+		       const enum drm_xe_ras_error_severity severity)
+{
+	const char *severity_str = error_severity[severity];
+	struct xe_device *xe = tile_to_xe(tile);
+
+	if (severity == DRM_XE_RAS_ERROR_SEVERITY_UNCORRECTABLE)
+		drm_err_ratelimited(&xe->drm, "%s %s detected, ERROR_STAT_GT_VECTOR%d:0x%08x\n",
+				    name, severity_str, i, err);
+	else
+		drm_warn(&xe->drm, "%s %s detected, ERROR_STAT_GT_VECTOR%d:0x%08x\n",
+			 name, severity_str, i, err);
+}
+
+static void gt_hw_error_handler(struct xe_tile *tile, const enum hardware_error hw_err,
+				u32 error_id)
+{
+	const enum drm_xe_ras_error_severity severity = hw_err_to_severity(hw_err);
+	struct xe_device *xe = tile_to_xe(tile);
+	struct xe_drm_ras *ras = &xe->ras;
+	struct xe_drm_ras_counter *info = ras->info[severity];
+	struct xe_mmio *mmio = &tile->mmio;
+	unsigned long err_stat = 0;
+	int i, len;
+
+	if (xe->info.platform != XE_PVC)
+		return;
+
+	if (hw_err == HARDWARE_ERROR_NONFATAL) {
+		atomic64_inc(&info[error_id].counter);
+		log_hw_error(tile, info[error_id].name, severity);
+		return;
+	}
+
+	len = (hw_err == HARDWARE_ERROR_CORRECTABLE) ? ERR_STAT_GT_COR_VECTOR_LEN
+						     : ERR_STAT_GT_VECTOR_MAX;
+
+	for (i = 0; i < len; i++) {
+		u32 vector, val;
+
+		vector = xe_mmio_read32(mmio, ERR_STAT_GT_VECTOR_REG(hw_err, i));
+		if (!vector)
+			continue;
+
+		switch (i) {
+		case ERR_STAT_GT_VECTOR0:
+		case ERR_STAT_GT_VECTOR1:
+			u32 errbit;
+
+			val = hweight32(vector);
+			atomic64_add(val, &info[error_id].counter);
+			log_gt_err(tile, "Subslice", i, vector, severity);
+
+			/* Read Error Status Register once */
+			if (err_stat)
+				break;
+
+			err_stat = xe_mmio_read32(mmio, ERR_STAT_GT_REG(hw_err));
+			for_each_set_bit(errbit, &err_stat, GT_HW_ERROR_MAX_ERR_BITS) {
+				if (hw_err == HARDWARE_ERROR_CORRECTABLE &&
+				    (BIT(errbit) & PVC_COR_ERR_MASK))
+					atomic64_inc(&info[error_id].counter);
+				if (hw_err == HARDWARE_ERROR_FATAL &&
+				    (BIT(errbit) & PVC_FAT_ERR_MASK))
+					atomic64_inc(&info[error_id].counter);
+			}
+			if (err_stat)
+				xe_mmio_write32(mmio, ERR_STAT_GT_REG(hw_err), err_stat);
+			break;
+		case ERR_STAT_GT_VECTOR2:
+		case ERR_STAT_GT_VECTOR3:
+			val = hweight32(vector);
+			atomic64_add(val, &info[error_id].counter);
+			log_gt_err(tile, "L3 BANK", i, vector, severity);
+			break;
+		case ERR_STAT_GT_VECTOR6:
+			val = hweight32(vector);
+			atomic64_add(val, &info[error_id].counter);
+			log_gt_err(tile, "TLB", i, vector, severity);
+			break;
+		case ERR_STAT_GT_VECTOR7:
+			val = hweight32(vector);
+			atomic64_add(val, &info[error_id].counter);
+			break;
+		default:
+			log_gt_err(tile, "Undefined", i, vector, severity);
+		}
+
+		xe_mmio_write32(mmio, ERR_STAT_GT_VECTOR_REG(hw_err, i), vector);
+	}
+}
+
 static void hw_error_source_handler(struct xe_tile *tile, const enum hardware_error hw_err)
 {
 	const enum drm_xe_ras_error_severity severity = hw_err_to_severity(hw_err);
 	const char *severity_str = error_severity[severity];
 	struct xe_device *xe = tile_to_xe(tile);
-	unsigned long flags;
-	u32 err_src;
+	struct xe_drm_ras *ras = &xe->ras;
+	struct xe_drm_ras_counter *info = ras->info[severity];
+	unsigned long flags, err_src;
+	u32 err_bit;
 
-	if (xe->info.platform != XE_BATTLEMAGE)
+	if (!IS_DGFX(xe))
 		return;
 
 	spin_lock_irqsave(&xe->irq.lock, flags);
@@ -105,11 +231,44 @@ static void hw_error_source_handler(struct xe_tile *tile, const enum hardware_er
 		goto unlock;
 	}
 
-	if (err_src & XE_CSC_ERROR)
+	/*
+	 * On encountering CSC firmware errors, the graphics device is non-recoverable.
+	 * The only way to recover from these errors is firmware flash. The device will
+	 * enter Runtime Survivability mode when such errors are detected.
+	 */
+	if (err_src & BIT(XE_CSC_ERROR)) {
 		csc_hw_error_handler(tile, hw_err);
+		goto clear_reg;
+	}
 
-	xe_mmio_write32(&tile->mmio, DEV_ERR_STAT_REG(hw_err), err_src);
+	if (!info) {
+		drm_err_ratelimited(&xe->drm, HW_ERR "Errors undefined\n");
+		goto clear_reg;
+	}
+
+	for_each_set_bit(err_bit, &err_src, XE_RAS_REG_SIZE) {
+		u32 error_id = xe_hw_error_map[err_bit];
+		const char *name;
+
+		name = info[error_id].name;
+		if (!name)
+			goto clear_reg;
 
+		if (severity == DRM_XE_RAS_ERROR_SEVERITY_UNCORRECTABLE) {
+			drm_err_ratelimited(&xe->drm, HW_ERR
+					    "TILE%d reported %s %s, bit[%d] is set\n",
+					    tile->id, name, severity_str, err_bit);
+		} else {
+			drm_warn(&xe->drm, HW_ERR
+				 "TILE%d reported %s %s, bit[%d] is set\n",
+				 tile->id, name, severity_str, err_bit);
+		}
+		if (err_bit == XE_GT_ERROR)
+			gt_hw_error_handler(tile, hw_err, error_id);
+	}
+
+clear_reg:
+	xe_mmio_write32(&tile->mmio, DEV_ERR_STAT_REG(hw_err), err_src);
 unlock:
 	spin_unlock_irqrestore(&xe->irq.lock, flags);
 }
@@ -131,9 +290,10 @@ void xe_hw_error_irq_handler(struct xe_tile *tile, const u32 master_ctl)
 	if (fault_inject_csc_hw_error())
 		schedule_work(&tile->csc_hw_error_work);
 
-	for (hw_err = 0; hw_err < HARDWARE_ERROR_MAX; hw_err++)
+	for (hw_err = 0; hw_err < HARDWARE_ERROR_MAX; hw_err++) {
 		if (master_ctl & ERROR_IRQ(hw_err))
 			hw_error_source_handler(tile, hw_err);
+	}
 }
 
 static int hw_error_info_init(struct xe_device *xe)
-- 
2.47.1
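
As a reading aid for the counting logic in gt_hw_error_handler(), here is a standalone sketch of the arithmetic for one Subslice vector register. It is a simplification under stated assumptions: the driver reads the error status register only once per interrupt, whereas this sketch takes both register values as plain arguments. BIT32, popcount32(), pick_even() and subslice_error_count() are hypothetical names; popcount32() stands in for the kernel's hweight32() and pick_even() mirrors the arithmetic of _PICK_EVEN().

```c
#include <stdint.h>

/* Hypothetical stand-in for the kernel's REG_BIT()/BIT() on 32-bit values. */
#define BIT32(n)	(UINT32_C(1) << (n))

/* Status bits counted per severity, copied from xe_hw_error_regs.h in this patch. */
#define PVC_COR_ERR_MASK	(BIT32(1) | BIT32(13) | BIT32(14) | BIT32(15))
#define PVC_FAT_ERR_MASK	(BIT32(3) | BIT32(6) | BIT32(13) | BIT32(15))

/* Only the low 16 status bits are scanned (GT_HW_ERROR_MAX_ERR_BITS). */
#define GT_STATUS_BITS_MASK	UINT32_C(0xffff)

/* Portable population count, standing in for hweight32(). */
unsigned int popcount32(uint32_t v)
{
	unsigned int count = 0;

	while (v) {
		v &= v - 1;	/* clear the lowest set bit */
		count++;
	}
	return count;
}

/* Same arithmetic as _PICK_EVEN(): evenly spaced registers starting at a. */
uint32_t pick_even(unsigned int index, uint32_t a, uint32_t b)
{
	return a + index * (b - a);
}

/*
 * Counter increment for one Subslice vector: one per set bit in the vector,
 * plus one per set status bit that belongs to the severity's mask.
 */
unsigned int subslice_error_count(uint32_t vector, uint32_t err_stat, uint32_t mask)
{
	return popcount32(vector) + popcount32(err_stat & mask & GT_STATUS_BITS_MASK);
}
```

With hw_err values 0, 1 and 2, pick_even(hw_err, 0x100160, 0x100164) yields 0x100160, 0x100164 and 0x100168, which match ERR_STAT_GT_COR, ERR_STAT_GT_NONFATAL and ERR_STAT_GT_FATAL in the register header.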



* [PATCH v4 4/4] drm/xe/xe_hw_error: Add support for PVC SOC errors
  2026-01-19  4:00 [PATCH v4 0/4] Introduce DRM_RAS using generic netlink for RAS Riana Tauro
                   ` (5 preceding siblings ...)
  2026-01-19  4:00 ` [PATCH v4 3/4] drm/xe/xe_hw_error: Add support for GT hardware errors Riana Tauro
@ 2026-01-19  4:00 ` Riana Tauro
  2026-01-23 10:33   ` Raag Jadav
  2026-01-19  4:11 ` ✗ Xe.CI.BAT: failure for Introduce DRM_RAS using generic netlink for RAS (rev4) Patchwork
  2026-01-19  5:33 ` ✗ Xe.CI.Full: " Patchwork
  8 siblings, 1 reply; 22+ messages in thread
From: Riana Tauro @ 2026-01-19  4:00 UTC (permalink / raw)
  To: intel-xe, dri-devel
  Cc: aravind.iddamsetty, anshuman.gupta, rodrigo.vivi, joonas.lahtinen,
	simona.vetter, airlied, pratik.bari, joshua.santosh.ranjan,
	ashwin.kumar.kulkarni, shubham.kumar, ravi.kishore.koppuravuri,
	raag.jadav, Riana Tauro, Himal Prasad Ghimiray

Report the SOC nonfatal/fatal hardware error and update the counters.

Co-developed-by: Himal Prasad Ghimiray <himal.prasad.ghimiray@intel.com>
Signed-off-by: Himal Prasad Ghimiray <himal.prasad.ghimiray@intel.com>
Signed-off-by: Riana Tauro <riana.tauro@intel.com>
---
v2: Add ID's and names as uAPI (Rodrigo)

v3: reorder and align arrays
    remove redundant string err
    use REG_BIT
    fix aesthetic review comments (Raag)
    use only correctable/uncorrectable error severity (Aravind)
---
 drivers/gpu/drm/xe/regs/xe_hw_error_regs.h |  24 +++
 drivers/gpu/drm/xe/xe_hw_error.c           | 200 ++++++++++++++++++++-
 2 files changed, 223 insertions(+), 1 deletion(-)

diff --git a/drivers/gpu/drm/xe/regs/xe_hw_error_regs.h b/drivers/gpu/drm/xe/regs/xe_hw_error_regs.h
index 5eeb0be27300..b9e072f9e56c 100644
--- a/drivers/gpu/drm/xe/regs/xe_hw_error_regs.h
+++ b/drivers/gpu/drm/xe/regs/xe_hw_error_regs.h
@@ -41,6 +41,7 @@
 								  DEV_ERR_STAT_NONFATAL))
 
 #define   XE_CSC_ERROR				17
+#define   XE_SOC_ERROR				16
 #define   XE_GT_ERROR				0
 
 #define ERR_STAT_GT_FATAL_VECTOR_0		0x100260
@@ -62,4 +63,27 @@
 						ERR_STAT_GT_COR_VECTOR_REG(x) : \
 						ERR_STAT_GT_FATAL_VECTOR_REG(x))
 
+#define SOC_PVC_MASTER_BASE			0x282000
+#define SOC_PVC_SLAVE_BASE			0x283000
+
+#define SOC_GCOERRSTS				0x200
+#define SOC_GNFERRSTS				0x210
+#define SOC_GLOBAL_ERR_STAT_REG(base, x)	XE_REG(_PICK_EVEN((x), \
+								  (base) + SOC_GCOERRSTS, \
+								  (base) + SOC_GNFERRSTS))
+#define   SOC_SLAVE_IEH				REG_BIT(1)
+#define   SOC_IEH0_LOCAL_ERR_STATUS		REG_BIT(0)
+#define   SOC_IEH1_LOCAL_ERR_STATUS		REG_BIT(0)
+
+#define SOC_GSYSEVTCTL				0x264
+#define SOC_GSYSEVTCTL_REG(base, slave_base, x)	XE_REG(_PICK_EVEN((x), \
+								  (base) + SOC_GSYSEVTCTL, \
+								  (slave_base) + SOC_GSYSEVTCTL))
+
+#define SOC_LERRUNCSTS				0x280
+#define SOC_LERRCORSTS				0x294
+#define SOC_LOCAL_ERR_STAT_REG(base, hw_err)	XE_REG(hw_err == HARDWARE_ERROR_CORRECTABLE ? \
+						      (base) + SOC_LERRCORSTS : \
+						      (base) + SOC_LERRUNCSTS)
+
 #endif
diff --git a/drivers/gpu/drm/xe/xe_hw_error.c b/drivers/gpu/drm/xe/xe_hw_error.c
index bd0cf61741ca..d1c30bb199d3 100644
--- a/drivers/gpu/drm/xe/xe_hw_error.c
+++ b/drivers/gpu/drm/xe/xe_hw_error.c
@@ -19,6 +19,7 @@
 #define  GT_HW_ERROR_MAX_ERR_BITS	16
 #define  HEC_UNCORR_FW_ERR_BITS 	4
 #define  XE_RAS_REG_SIZE		32
+#define  XE_SOC_NUM_IEH 		2
 
 extern struct fault_attr inject_csc_hw_error;
 static const char * const error_severity[] = DRM_XE_RAS_ERROR_SEVERITY_NAMES;
@@ -31,7 +32,8 @@ static const char * const hec_uncorrected_fw_errors[] = {
 };
 
 static const unsigned long xe_hw_error_map[] = {
-	[XE_GT_ERROR] = DRM_XE_RAS_ERROR_CLASS_GT,
+	[XE_GT_ERROR]	= DRM_XE_RAS_ERROR_CLASS_GT,
+	[XE_SOC_ERROR]	= DRM_XE_RAS_ERROR_CLASS_SOC,
 };
 
 enum gt_vector_regs {
@@ -54,6 +56,92 @@ static enum drm_xe_ras_error_severity hw_err_to_severity(enum hardware_error hw_
 	return DRM_XE_RAS_ERROR_SEVERITY_UNCORRECTABLE;
 }
 
+static const char * const pvc_master_global_err_reg[] = {
+	[0 ... 1]	= "Undefined",
+	[2]		= "HBM SS0: Channel0",
+	[3]		= "HBM SS0: Channel1",
+	[4]		= "HBM SS0: Channel2",
+	[5]		= "HBM SS0: Channel3",
+	[6]		= "HBM SS0: Channel4",
+	[7]		= "HBM SS0: Channel5",
+	[8]		= "HBM SS0: Channel6",
+	[9]		= "HBM SS0: Channel7",
+	[10]		= "HBM SS1: Channel0",
+	[11]		= "HBM SS1: Channel1",
+	[12]		= "HBM SS1: Channel2",
+	[13]		= "HBM SS1: Channel3",
+	[14]		= "HBM SS1: Channel4",
+	[15]		= "HBM SS1: Channel5",
+	[16]		= "HBM SS1: Channel6",
+	[17]		= "HBM SS1: Channel7",
+	[18 ... 31]	= "Undefined",
+};
+
+static const char * const pvc_slave_global_err_reg[] = {
+	[0]		= "Undefined",
+	[1]		= "HBM SS2: Channel0",
+	[2]		= "HBM SS2: Channel1",
+	[3]		= "HBM SS2: Channel2",
+	[4]		= "HBM SS2: Channel3",
+	[5]		= "HBM SS2: Channel4",
+	[6]		= "HBM SS2: Channel5",
+	[7]		= "HBM SS2: Channel6",
+	[8]		= "HBM SS2: Channel7",
+	[9]		= "HBM SS3: Channel0",
+	[10]		= "HBM SS3: Channel1",
+	[11]		= "HBM SS3: Channel2",
+	[12]		= "HBM SS3: Channel3",
+	[13]		= "HBM SS3: Channel4",
+	[14]		= "HBM SS3: Channel5",
+	[15]		= "HBM SS3: Channel6",
+	[16]		= "HBM SS3: Channel7",
+	[17]		= "Undefined",
+	[18]		= "ANR MDFI",
+	[19 ... 31]	= "Undefined",
+};
+
+static const char * const pvc_slave_local_fatal_err_reg[] = {
+	[0]		= "Local IEH: Malformed PCIe AER",
+	[1]		= "Local IEH: Malformed PCIe ERR",
+	[2]		= "Local IEH: UR conditions in IEH",
+	[3]		= "Local IEH: From SERR Sources",
+	[4 ... 19]	= "Undefined",
+	[20]		= "Malformed MCA error packet (HBM/Punit)",
+	[21 ... 31]	= "Undefined",
+};
+
+static const char * const pvc_master_local_fatal_err_reg[] = {
+	[0]		= "Local IEH: Malformed IOSF PCIe AER",
+	[1]		= "Local IEH: Malformed IOSF PCIe ERR",
+	[2]		= "Local IEH: UR RESPONSE",
+	[3]		= "Local IEH: From SERR SPI controller",
+	[4]		= "Base Die MDFI T2T",
+	[5]		= "Undefined",
+	[6]		= "Base Die MDFI T2C",
+	[7]		= "Undefined",
+	[8]		= "Invalid CSC PSF Command Parity",
+	[9]		= "Invalid CSC PSF Unexpected Completion",
+	[10]		= "Invalid CSC PSF Unsupported Request",
+	[11]		= "Invalid PCIe PSF Command Parity",
+	[12]		= "PCIe PSF Unexpected Completion",
+	[13]		= "PCIe PSF Unsupported Request",
+	[14 ... 19]	= "Undefined",
+	[20]		= "Malformed MCA error packet (HBM/Punit)",
+	[21 ... 31]	= "Undefined",
+};
+
+static const char * const pvc_master_local_nonfatal_err_reg[] = {
+	[0 ... 3]	= "Undefined",
+	[4]		= "Base Die MDFI T2T",
+	[5]		= "Undefined",
+	[6]		= "Base Die MDFI T2C",
+	[7]		= "Undefined",
+	[8]		= "Invalid CSC PSF Command Parity",
+	[9]		= "Invalid CSC PSF Unexpected Completion",
+	[10]		= "Invalid PCIe PSF Command Parity",
+	[11 ... 31]	= "Undefined",
+};
+
 static bool fault_inject_csc_hw_error(void)
 {
 	return IS_ENABLED(CONFIG_DEBUG_FS) && should_fail(&inject_csc_hw_error, 1);
@@ -132,6 +220,26 @@ static void log_gt_err(struct xe_tile *tile, const char *name, int i, u32 err,
 			 name, severity_str, i, err);
 }
 
+static void log_soc_error(struct xe_tile *tile, const char * const *reg_info,
+			  const enum drm_xe_ras_error_severity severity, u32 err_bit, u32 index)
+{
+	const char *severity_str = error_severity[severity];
+	struct xe_device *xe = tile_to_xe(tile);
+	struct xe_drm_ras *ras = &xe->ras;
+	struct xe_drm_ras_counter *info = ras->info[severity];
+	const char *name;
+
+	name = reg_info[err_bit];
+
+	if (strcmp(name, "Undefined")) {
+		if (severity == DRM_XE_RAS_ERROR_SEVERITY_UNCORRECTABLE)
+			drm_err_ratelimited(&xe->drm, "%s SOC %s detected\n", name, severity_str);
+		else
+			drm_warn(&xe->drm, "%s SOC %s detected\n", name, severity_str);
+		atomic64_inc(&info[index].counter);
+	}
+}
+
 static void gt_hw_error_handler(struct xe_tile *tile, const enum hardware_error hw_err,
 				u32 error_id)
 {
@@ -210,6 +318,93 @@ static void gt_hw_error_handler(struct xe_tile *tile, const enum hardware_error
 	}
 }
 
+static void soc_hw_error_handler(struct xe_tile *tile, const enum hardware_error hw_err,
+				 u32 error_id)
+{
+	const enum drm_xe_ras_error_severity severity = hw_err_to_severity(hw_err);
+	struct xe_device *xe = tile_to_xe(tile);
+	struct xe_mmio *mmio = &tile->mmio;
+	unsigned long master_global_errstat, slave_global_errstat;
+	unsigned long master_local_errstat, slave_local_errstat;
+	u32 base, slave_base, regbit;
+	int i;
+
+	if (xe->info.platform != XE_PVC)
+		return;
+
+	base = SOC_PVC_MASTER_BASE;
+	slave_base = SOC_PVC_SLAVE_BASE;
+
+	/* Mask error type in GSYSEVTCTL so that no new errors of the type will be reported */
+	for (i = 0; i < XE_SOC_NUM_IEH; i++)
+		xe_mmio_write32(mmio, SOC_GSYSEVTCTL_REG(base, slave_base, i), ~REG_BIT(hw_err));
+
+	if (hw_err == HARDWARE_ERROR_CORRECTABLE) {
+		xe_mmio_write32(mmio, SOC_GLOBAL_ERR_STAT_REG(base, hw_err), REG_GENMASK(31, 0));
+		xe_mmio_write32(mmio, SOC_LOCAL_ERR_STAT_REG(base, hw_err), REG_GENMASK(31, 0));
+		xe_mmio_write32(mmio, SOC_GLOBAL_ERR_STAT_REG(slave_base, hw_err),
+				REG_GENMASK(31, 0));
+		xe_mmio_write32(mmio, SOC_LOCAL_ERR_STAT_REG(slave_base, hw_err),
+				REG_GENMASK(31, 0));
+		goto unmask_gsysevtctl;
+	}
+
+	/*
+	 * Read the master global IEH error register. If BIT 1 is set, process
+	 * the slave IEH first. If BIT 0 in a global error register is set, process
+	 * the corresponding local error register.
+	 */
+	master_global_errstat = xe_mmio_read32(mmio, SOC_GLOBAL_ERR_STAT_REG(base, hw_err));
+	if (master_global_errstat & SOC_SLAVE_IEH) {
+		slave_global_errstat = xe_mmio_read32(mmio,
+						      SOC_GLOBAL_ERR_STAT_REG(slave_base, hw_err));
+		if (slave_global_errstat & SOC_IEH1_LOCAL_ERR_STATUS) {
+			slave_local_errstat = xe_mmio_read32(mmio,
+							     SOC_LOCAL_ERR_STAT_REG(slave_base,
+										    hw_err));
+
+			if (hw_err == HARDWARE_ERROR_FATAL) {
+				for_each_set_bit(regbit, &slave_local_errstat, XE_RAS_REG_SIZE)
+					log_soc_error(tile, pvc_slave_local_fatal_err_reg,
+						      severity, regbit, error_id);
+			}
+
+			xe_mmio_write32(mmio, SOC_LOCAL_ERR_STAT_REG(slave_base, hw_err),
+					slave_local_errstat);
+		}
+
+		for_each_set_bit(regbit, &slave_global_errstat, XE_RAS_REG_SIZE)
+			log_soc_error(tile, pvc_slave_global_err_reg, severity, regbit, error_id);
+
+		xe_mmio_write32(mmio, SOC_GLOBAL_ERR_STAT_REG(slave_base, hw_err),
+				slave_global_errstat);
+	}
+
+	if (master_global_errstat & SOC_IEH0_LOCAL_ERR_STATUS) {
+		master_local_errstat = xe_mmio_read32(mmio, SOC_LOCAL_ERR_STAT_REG(base, hw_err));
+
+		for_each_set_bit(regbit, &master_local_errstat, XE_RAS_REG_SIZE) {
+			const char * const *reg_info = (hw_err == HARDWARE_ERROR_FATAL) ?
+						       pvc_master_local_fatal_err_reg :
+						       pvc_master_local_nonfatal_err_reg;
+
+			log_soc_error(tile, reg_info, severity, regbit, error_id);
+		}
+
+		xe_mmio_write32(mmio, SOC_LOCAL_ERR_STAT_REG(base, hw_err), master_local_errstat);
+	}
+
+	for_each_set_bit(regbit, &master_global_errstat, XE_RAS_REG_SIZE)
+		log_soc_error(tile, pvc_master_global_err_reg, severity, regbit, error_id);
+
+	xe_mmio_write32(mmio, SOC_GLOBAL_ERR_STAT_REG(base, hw_err), master_global_errstat);
+
+unmask_gsysevtctl:
+	for (i = 0; i < XE_SOC_NUM_IEH; i++)
+		xe_mmio_write32(mmio, SOC_GSYSEVTCTL_REG(base, slave_base, i),
+				(DRM_XE_RAS_ERROR_SEVERITY_MAX << 1) + 1);
+}
+
 static void hw_error_source_handler(struct xe_tile *tile, const enum hardware_error hw_err)
 {
 	const enum drm_xe_ras_error_severity severity = hw_err_to_severity(hw_err);
@@ -263,8 +458,11 @@ static void hw_error_source_handler(struct xe_tile *tile, const enum hardware_er
 				 "TILE%d reported %s %s, bit[%d] is set\n",
 				 tile->id, name, severity_str, err_bit);
 		}
+
 		if (err_bit == XE_GT_ERROR)
 			gt_hw_error_handler(tile, hw_err, error_id);
+		if (err_bit == XE_SOC_ERROR)
+			soc_hw_error_handler(tile, hw_err, error_id);
 	}
 
 clear_reg:
-- 
2.47.1


* ✗ Xe.CI.BAT: failure for Introduce DRM_RAS using generic netlink for RAS (rev4)
  2026-01-19  4:00 [PATCH v4 0/4] Introduce DRM_RAS using generic netlink for RAS Riana Tauro
                   ` (6 preceding siblings ...)
  2026-01-19  4:00 ` [PATCH v4 4/4] drm/xe/xe_hw_error: Add support for PVC SOC errors Riana Tauro
@ 2026-01-19  4:11 ` Patchwork
  2026-01-19  5:33 ` ✗ Xe.CI.Full: " Patchwork
  8 siblings, 0 replies; 22+ messages in thread
From: Patchwork @ 2026-01-19  4:11 UTC (permalink / raw)
  To: Rodrigo Vivi; +Cc: intel-xe

[-- Attachment #1: Type: text/plain, Size: 3375 bytes --]

== Series Details ==

Series: Introduce DRM_RAS using generic netlink for RAS (rev4)
URL   : https://patchwork.freedesktop.org/series/155188/
State : failure

== Summary ==

CI Bug Log - changes from xe-4406-af18785a6a8621b1a5805ba8a1b35d290cb4bcac_BAT -> xe-pw-155188v4_BAT
====================================================

Summary
-------

  **FAILURE**

  Serious unknown changes coming with xe-pw-155188v4_BAT absolutely need to be
  verified manually.
  
  If you think the reported changes have nothing to do with the changes
  introduced in xe-pw-155188v4_BAT, please notify your bug team (I915-ci-infra@lists.freedesktop.org) to allow them
  to document this new failure mode, which will reduce false positives in CI.

  

Participating hosts (12 -> 12)
------------------------------

  No changes in participating hosts

Possible new issues
-------------------

  Here are the unknown changes that may have been introduced in xe-pw-155188v4_BAT:

### IGT changes ###

#### Possible regressions ####

  * igt@xe_module_load@load:
    - bat-dg2-oem2:       [PASS][1] -> [DMESG-WARN][2]
   [1]: https://intel-gfx-ci.01.org/tree/intel-xe/xe-4406-af18785a6a8621b1a5805ba8a1b35d290cb4bcac/bat-dg2-oem2/igt@xe_module_load@load.html
   [2]: https://intel-gfx-ci.01.org/tree/intel-xe/xe-pw-155188v4/bat-dg2-oem2/igt@xe_module_load@load.html
    - bat-atsm-2:         [PASS][3] -> [DMESG-WARN][4]
   [3]: https://intel-gfx-ci.01.org/tree/intel-xe/xe-4406-af18785a6a8621b1a5805ba8a1b35d290cb4bcac/bat-atsm-2/igt@xe_module_load@load.html
   [4]: https://intel-gfx-ci.01.org/tree/intel-xe/xe-pw-155188v4/bat-atsm-2/igt@xe_module_load@load.html
    - bat-bmg-2:          [PASS][5] -> [DMESG-WARN][6]
   [5]: https://intel-gfx-ci.01.org/tree/intel-xe/xe-4406-af18785a6a8621b1a5805ba8a1b35d290cb4bcac/bat-bmg-2/igt@xe_module_load@load.html
   [6]: https://intel-gfx-ci.01.org/tree/intel-xe/xe-pw-155188v4/bat-bmg-2/igt@xe_module_load@load.html
    - bat-bmg-3:          [PASS][7] -> [DMESG-WARN][8]
   [7]: https://intel-gfx-ci.01.org/tree/intel-xe/xe-4406-af18785a6a8621b1a5805ba8a1b35d290cb4bcac/bat-bmg-3/igt@xe_module_load@load.html
   [8]: https://intel-gfx-ci.01.org/tree/intel-xe/xe-pw-155188v4/bat-bmg-3/igt@xe_module_load@load.html
    - bat-bmg-1:          [PASS][9] -> [DMESG-WARN][10]
   [9]: https://intel-gfx-ci.01.org/tree/intel-xe/xe-4406-af18785a6a8621b1a5805ba8a1b35d290cb4bcac/bat-bmg-1/igt@xe_module_load@load.html
   [10]: https://intel-gfx-ci.01.org/tree/intel-xe/xe-pw-155188v4/bat-bmg-1/igt@xe_module_load@load.html

  
Known issues
------------

  Here are the changes found in xe-pw-155188v4_BAT that come from known issues:

### IGT changes ###

#### Possible fixes ####

  * igt@xe_waitfence@engine:
    - bat-bmg-2:          [FAIL][11] -> [PASS][12]
   [11]: https://intel-gfx-ci.01.org/tree/intel-xe/xe-4406-af18785a6a8621b1a5805ba8a1b35d290cb4bcac/bat-bmg-2/igt@xe_waitfence@engine.html
   [12]: https://intel-gfx-ci.01.org/tree/intel-xe/xe-pw-155188v4/bat-bmg-2/igt@xe_waitfence@engine.html

  


Build changes
-------------

  * Linux: xe-4406-af18785a6a8621b1a5805ba8a1b35d290cb4bcac -> xe-pw-155188v4

  IGT_8704: 8704
  xe-4406-af18785a6a8621b1a5805ba8a1b35d290cb4bcac: af18785a6a8621b1a5805ba8a1b35d290cb4bcac
  xe-pw-155188v4: 155188v4

== Logs ==

For more details see: https://intel-gfx-ci.01.org/tree/intel-xe/xe-pw-155188v4/index.html

* ✗ Xe.CI.Full: failure for Introduce DRM_RAS using generic netlink for RAS (rev4)
  2026-01-19  4:00 [PATCH v4 0/4] Introduce DRM_RAS using generic netlink for RAS Riana Tauro
                   ` (7 preceding siblings ...)
  2026-01-19  4:11 ` ✗ Xe.CI.BAT: failure for Introduce DRM_RAS using generic netlink for RAS (rev4) Patchwork
@ 2026-01-19  5:33 ` Patchwork
  8 siblings, 0 replies; 22+ messages in thread
From: Patchwork @ 2026-01-19  5:33 UTC (permalink / raw)
  To: Rodrigo Vivi; +Cc: intel-xe

[-- Attachment #1: Type: text/plain, Size: 28255 bytes --]

== Series Details ==

Series: Introduce DRM_RAS using generic netlink for RAS (rev4)
URL   : https://patchwork.freedesktop.org/series/155188/
State : failure

== Summary ==

CI Bug Log - changes from xe-4406-af18785a6a8621b1a5805ba8a1b35d290cb4bcac_FULL -> xe-pw-155188v4_FULL
====================================================

Summary
-------

  **WARNING**

  Minor unknown changes coming with xe-pw-155188v4_FULL need to be verified
  manually.
  
  If you think the reported changes have nothing to do with the changes
  introduced in xe-pw-155188v4_FULL, please notify your bug team (I915-ci-infra@lists.freedesktop.org) to allow them
  to document this new failure mode, which will reduce false positives in CI.

  

Participating hosts (2 -> 2)
------------------------------

  No changes in participating hosts

Possible new issues
-------------------

  Here are the unknown changes that may have been introduced in xe-pw-155188v4_FULL:

### IGT changes ###

#### Warnings ####

  * igt@xe_module_load@load:
    - shard-bmg:          ([PASS][1], [PASS][2], [PASS][3], [PASS][4], [PASS][5], [PASS][6], [PASS][7], [PASS][8], [PASS][9], [PASS][10], [PASS][11], [PASS][12], [PASS][13], [PASS][14], [PASS][15], [PASS][16], [PASS][17], [PASS][18], [PASS][19], [PASS][20], [PASS][21], [PASS][22], [PASS][23], [PASS][24], [SKIP][25], [PASS][26]) ([Intel XE#2457]) -> ([PASS][27], [PASS][28], [PASS][29], [PASS][30], [PASS][31], [PASS][32], [PASS][33], [PASS][34], [PASS][35], [DMESG-WARN][36], [PASS][37], [PASS][38], [PASS][39], [DMESG-WARN][40], [DMESG-WARN][41], [PASS][42], [PASS][43], [DMESG-WARN][44], [DMESG-WARN][45], [DMESG-WARN][46], [DMESG-WARN][47], [DMESG-WARN][48], [DMESG-WARN][49], [DMESG-WARN][50], [DMESG-WARN][51])
   [1]: https://intel-gfx-ci.01.org/tree/intel-xe/xe-4406-af18785a6a8621b1a5805ba8a1b35d290cb4bcac/shard-bmg-10/igt@xe_module_load@load.html
   [2]: https://intel-gfx-ci.01.org/tree/intel-xe/xe-4406-af18785a6a8621b1a5805ba8a1b35d290cb4bcac/shard-bmg-2/igt@xe_module_load@load.html
   [3]: https://intel-gfx-ci.01.org/tree/intel-xe/xe-4406-af18785a6a8621b1a5805ba8a1b35d290cb4bcac/shard-bmg-2/igt@xe_module_load@load.html
   [4]: https://intel-gfx-ci.01.org/tree/intel-xe/xe-4406-af18785a6a8621b1a5805ba8a1b35d290cb4bcac/shard-bmg-2/igt@xe_module_load@load.html
   [5]: https://intel-gfx-ci.01.org/tree/intel-xe/xe-4406-af18785a6a8621b1a5805ba8a1b35d290cb4bcac/shard-bmg-10/igt@xe_module_load@load.html
   [6]: https://intel-gfx-ci.01.org/tree/intel-xe/xe-4406-af18785a6a8621b1a5805ba8a1b35d290cb4bcac/shard-bmg-8/igt@xe_module_load@load.html
   [7]: https://intel-gfx-ci.01.org/tree/intel-xe/xe-4406-af18785a6a8621b1a5805ba8a1b35d290cb4bcac/shard-bmg-8/igt@xe_module_load@load.html
   [8]: https://intel-gfx-ci.01.org/tree/intel-xe/xe-4406-af18785a6a8621b1a5805ba8a1b35d290cb4bcac/shard-bmg-7/igt@xe_module_load@load.html
   [9]: https://intel-gfx-ci.01.org/tree/intel-xe/xe-4406-af18785a6a8621b1a5805ba8a1b35d290cb4bcac/shard-bmg-7/igt@xe_module_load@load.html
   [10]: https://intel-gfx-ci.01.org/tree/intel-xe/xe-4406-af18785a6a8621b1a5805ba8a1b35d290cb4bcac/shard-bmg-7/igt@xe_module_load@load.html
   [11]: https://intel-gfx-ci.01.org/tree/intel-xe/xe-4406-af18785a6a8621b1a5805ba8a1b35d290cb4bcac/shard-bmg-1/igt@xe_module_load@load.html
   [12]: https://intel-gfx-ci.01.org/tree/intel-xe/xe-4406-af18785a6a8621b1a5805ba8a1b35d290cb4bcac/shard-bmg-8/igt@xe_module_load@load.html
   [13]: https://intel-gfx-ci.01.org/tree/intel-xe/xe-4406-af18785a6a8621b1a5805ba8a1b35d290cb4bcac/shard-bmg-1/igt@xe_module_load@load.html
   [14]: https://intel-gfx-ci.01.org/tree/intel-xe/xe-4406-af18785a6a8621b1a5805ba8a1b35d290cb4bcac/shard-bmg-10/igt@xe_module_load@load.html
   [15]: https://intel-gfx-ci.01.org/tree/intel-xe/xe-4406-af18785a6a8621b1a5805ba8a1b35d290cb4bcac/shard-bmg-10/igt@xe_module_load@load.html
   [16]: https://intel-gfx-ci.01.org/tree/intel-xe/xe-4406-af18785a6a8621b1a5805ba8a1b35d290cb4bcac/shard-bmg-3/igt@xe_module_load@load.html
   [17]: https://intel-gfx-ci.01.org/tree/intel-xe/xe-4406-af18785a6a8621b1a5805ba8a1b35d290cb4bcac/shard-bmg-3/igt@xe_module_load@load.html
   [18]: https://intel-gfx-ci.01.org/tree/intel-xe/xe-4406-af18785a6a8621b1a5805ba8a1b35d290cb4bcac/shard-bmg-2/igt@xe_module_load@load.html
   [19]: https://intel-gfx-ci.01.org/tree/intel-xe/xe-4406-af18785a6a8621b1a5805ba8a1b35d290cb4bcac/shard-bmg-3/igt@xe_module_load@load.html
   [20]: https://intel-gfx-ci.01.org/tree/intel-xe/xe-4406-af18785a6a8621b1a5805ba8a1b35d290cb4bcac/shard-bmg-9/igt@xe_module_load@load.html
   [21]: https://intel-gfx-ci.01.org/tree/intel-xe/xe-4406-af18785a6a8621b1a5805ba8a1b35d290cb4bcac/shard-bmg-9/igt@xe_module_load@load.html
   [22]: https://intel-gfx-ci.01.org/tree/intel-xe/xe-4406-af18785a6a8621b1a5805ba8a1b35d290cb4bcac/shard-bmg-9/igt@xe_module_load@load.html
   [23]: https://intel-gfx-ci.01.org/tree/intel-xe/xe-4406-af18785a6a8621b1a5805ba8a1b35d290cb4bcac/shard-bmg-9/igt@xe_module_load@load.html
   [24]: https://intel-gfx-ci.01.org/tree/intel-xe/xe-4406-af18785a6a8621b1a5805ba8a1b35d290cb4bcac/shard-bmg-2/igt@xe_module_load@load.html
   [25]: https://intel-gfx-ci.01.org/tree/intel-xe/xe-4406-af18785a6a8621b1a5805ba8a1b35d290cb4bcac/shard-bmg-10/igt@xe_module_load@load.html
   [26]: https://intel-gfx-ci.01.org/tree/intel-xe/xe-4406-af18785a6a8621b1a5805ba8a1b35d290cb4bcac/shard-bmg-1/igt@xe_module_load@load.html
   [27]: https://intel-gfx-ci.01.org/tree/intel-xe/xe-pw-155188v4/shard-bmg-10/igt@xe_module_load@load.html
   [28]: https://intel-gfx-ci.01.org/tree/intel-xe/xe-pw-155188v4/shard-bmg-10/igt@xe_module_load@load.html
   [29]: https://intel-gfx-ci.01.org/tree/intel-xe/xe-pw-155188v4/shard-bmg-1/igt@xe_module_load@load.html
   [30]: https://intel-gfx-ci.01.org/tree/intel-xe/xe-pw-155188v4/shard-bmg-3/igt@xe_module_load@load.html
   [31]: https://intel-gfx-ci.01.org/tree/intel-xe/xe-pw-155188v4/shard-bmg-7/igt@xe_module_load@load.html
   [32]: https://intel-gfx-ci.01.org/tree/intel-xe/xe-pw-155188v4/shard-bmg-7/igt@xe_module_load@load.html
   [33]: https://intel-gfx-ci.01.org/tree/intel-xe/xe-pw-155188v4/shard-bmg-3/igt@xe_module_load@load.html
   [34]: https://intel-gfx-ci.01.org/tree/intel-xe/xe-pw-155188v4/shard-bmg-1/igt@xe_module_load@load.html
   [35]: https://intel-gfx-ci.01.org/tree/intel-xe/xe-pw-155188v4/shard-bmg-8/igt@xe_module_load@load.html
   [36]: https://intel-gfx-ci.01.org/tree/intel-xe/xe-pw-155188v4/shard-bmg-2/igt@xe_module_load@load.html
   [37]: https://intel-gfx-ci.01.org/tree/intel-xe/xe-pw-155188v4/shard-bmg-8/igt@xe_module_load@load.html
   [38]: https://intel-gfx-ci.01.org/tree/intel-xe/xe-pw-155188v4/shard-bmg-1/igt@xe_module_load@load.html
   [39]: https://intel-gfx-ci.01.org/tree/intel-xe/xe-pw-155188v4/shard-bmg-2/igt@xe_module_load@load.html
   [40]: https://intel-gfx-ci.01.org/tree/intel-xe/xe-pw-155188v4/shard-bmg-9/igt@xe_module_load@load.html
   [41]: https://intel-gfx-ci.01.org/tree/intel-xe/xe-pw-155188v4/shard-bmg-9/igt@xe_module_load@load.html
   [42]: https://intel-gfx-ci.01.org/tree/intel-xe/xe-pw-155188v4/shard-bmg-8/igt@xe_module_load@load.html
   [43]: https://intel-gfx-ci.01.org/tree/intel-xe/xe-pw-155188v4/shard-bmg-8/igt@xe_module_load@load.html
   [44]: https://intel-gfx-ci.01.org/tree/intel-xe/xe-pw-155188v4/shard-bmg-9/igt@xe_module_load@load.html
   [45]: https://intel-gfx-ci.01.org/tree/intel-xe/xe-pw-155188v4/shard-bmg-9/igt@xe_module_load@load.html
   [46]: https://intel-gfx-ci.01.org/tree/intel-xe/xe-pw-155188v4/shard-bmg-9/igt@xe_module_load@load.html
   [47]: https://intel-gfx-ci.01.org/tree/intel-xe/xe-pw-155188v4/shard-bmg-9/igt@xe_module_load@load.html
   [48]: https://intel-gfx-ci.01.org/tree/intel-xe/xe-pw-155188v4/shard-bmg-9/igt@xe_module_load@load.html
   [49]: https://intel-gfx-ci.01.org/tree/intel-xe/xe-pw-155188v4/shard-bmg-9/igt@xe_module_load@load.html
   [50]: https://intel-gfx-ci.01.org/tree/intel-xe/xe-pw-155188v4/shard-bmg-9/igt@xe_module_load@load.html
   [51]: https://intel-gfx-ci.01.org/tree/intel-xe/xe-pw-155188v4/shard-bmg-9/igt@xe_module_load@load.html

  
Known issues
------------

  Here are the changes found in xe-pw-155188v4_FULL that come from known issues:

### IGT changes ###

#### Issues hit ####

  * igt@kms_big_fb@4-tiled-32bpp-rotate-90:
    - shard-bmg:          NOTRUN -> [SKIP][52] ([Intel XE#2327]) +4 other tests skip
   [52]: https://intel-gfx-ci.01.org/tree/intel-xe/xe-pw-155188v4/shard-bmg-8/igt@kms_big_fb@4-tiled-32bpp-rotate-90.html

  * igt@kms_big_fb@y-tiled-8bpp-rotate-90:
    - shard-bmg:          NOTRUN -> [SKIP][53] ([Intel XE#1124]) +4 other tests skip
   [53]: https://intel-gfx-ci.01.org/tree/intel-xe/xe-pw-155188v4/shard-bmg-8/igt@kms_big_fb@y-tiled-8bpp-rotate-90.html

  * igt@kms_big_fb@y-tiled-addfb:
    - shard-bmg:          NOTRUN -> [SKIP][54] ([Intel XE#2328])
   [54]: https://intel-gfx-ci.01.org/tree/intel-xe/xe-pw-155188v4/shard-bmg-10/igt@kms_big_fb@y-tiled-addfb.html

  * igt@kms_big_fb@y-tiled-addfb-size-offset-overflow:
    - shard-bmg:          NOTRUN -> [SKIP][55] ([Intel XE#607])
   [55]: https://intel-gfx-ci.01.org/tree/intel-xe/xe-pw-155188v4/shard-bmg-8/igt@kms_big_fb@y-tiled-addfb-size-offset-overflow.html

  * igt@kms_bw@connected-linear-tiling-4-displays-2160x1440p:
    - shard-bmg:          NOTRUN -> [SKIP][56] ([Intel XE#2314] / [Intel XE#2894])
   [56]: https://intel-gfx-ci.01.org/tree/intel-xe/xe-pw-155188v4/shard-bmg-10/igt@kms_bw@connected-linear-tiling-4-displays-2160x1440p.html

  * igt@kms_bw@linear-tiling-4-displays-2560x1440p:
    - shard-bmg:          NOTRUN -> [SKIP][57] ([Intel XE#367]) +1 other test skip
   [57]: https://intel-gfx-ci.01.org/tree/intel-xe/xe-pw-155188v4/shard-bmg-8/igt@kms_bw@linear-tiling-4-displays-2560x1440p.html

  * igt@kms_ccs@crc-primary-basic-yf-tiled-ccs:
    - shard-bmg:          NOTRUN -> [SKIP][58] ([Intel XE#2887]) +5 other tests skip
   [58]: https://intel-gfx-ci.01.org/tree/intel-xe/xe-pw-155188v4/shard-bmg-8/igt@kms_ccs@crc-primary-basic-yf-tiled-ccs.html

  * igt@kms_ccs@crc-primary-suspend-4-tiled-lnl-ccs@pipe-d-hdmi-a-3:
    - shard-bmg:          NOTRUN -> [SKIP][59] ([Intel XE#2652] / [Intel XE#787]) +26 other tests skip
   [59]: https://intel-gfx-ci.01.org/tree/intel-xe/xe-pw-155188v4/shard-bmg-8/igt@kms_ccs@crc-primary-suspend-4-tiled-lnl-ccs@pipe-d-hdmi-a-3.html

  * igt@kms_ccs@crc-primary-suspend-yf-tiled-ccs:
    - shard-bmg:          NOTRUN -> [SKIP][60] ([Intel XE#3432]) +2 other tests skip
   [60]: https://intel-gfx-ci.01.org/tree/intel-xe/xe-pw-155188v4/shard-bmg-8/igt@kms_ccs@crc-primary-suspend-yf-tiled-ccs.html

  * igt@kms_chamelium_color@ctm-0-50:
    - shard-bmg:          NOTRUN -> [SKIP][61] ([Intel XE#2325]) +2 other tests skip
   [61]: https://intel-gfx-ci.01.org/tree/intel-xe/xe-pw-155188v4/shard-bmg-10/igt@kms_chamelium_color@ctm-0-50.html

  * igt@kms_chamelium_hpd@hdmi-hpd-after-suspend:
    - shard-bmg:          NOTRUN -> [SKIP][62] ([Intel XE#2252]) +4 other tests skip
   [62]: https://intel-gfx-ci.01.org/tree/intel-xe/xe-pw-155188v4/shard-bmg-10/igt@kms_chamelium_hpd@hdmi-hpd-after-suspend.html

  * igt@kms_content_protection@type1:
    - shard-bmg:          NOTRUN -> [SKIP][63] ([Intel XE#2341])
   [63]: https://intel-gfx-ci.01.org/tree/intel-xe/xe-pw-155188v4/shard-bmg-8/igt@kms_content_protection@type1.html

  * igt@kms_cursor_crc@cursor-offscreen-64x21:
    - shard-bmg:          NOTRUN -> [SKIP][64] ([Intel XE#2320])
   [64]: https://intel-gfx-ci.01.org/tree/intel-xe/xe-pw-155188v4/shard-bmg-8/igt@kms_cursor_crc@cursor-offscreen-64x21.html

  * igt@kms_cursor_crc@cursor-sliding-512x170:
    - shard-bmg:          NOTRUN -> [SKIP][65] ([Intel XE#2321]) +1 other test skip
   [65]: https://intel-gfx-ci.01.org/tree/intel-xe/xe-pw-155188v4/shard-bmg-8/igt@kms_cursor_crc@cursor-sliding-512x170.html

  * igt@kms_dirtyfb@psr-dirtyfb-ioctl:
    - shard-bmg:          NOTRUN -> [SKIP][66] ([Intel XE#1508])
   [66]: https://intel-gfx-ci.01.org/tree/intel-xe/xe-pw-155188v4/shard-bmg-8/igt@kms_dirtyfb@psr-dirtyfb-ioctl.html

  * igt@kms_dsc@dsc-with-bpc-formats:
    - shard-bmg:          NOTRUN -> [SKIP][67] ([Intel XE#2244])
   [67]: https://intel-gfx-ci.01.org/tree/intel-xe/xe-pw-155188v4/shard-bmg-8/igt@kms_dsc@dsc-with-bpc-formats.html

  * igt@kms_flip_scaled_crc@flip-64bpp-ytile-to-16bpp-ytile-downscaling:
    - shard-bmg:          NOTRUN -> [SKIP][68] ([Intel XE#2293] / [Intel XE#2380]) +1 other test skip
   [68]: https://intel-gfx-ci.01.org/tree/intel-xe/xe-pw-155188v4/shard-bmg-10/igt@kms_flip_scaled_crc@flip-64bpp-ytile-to-16bpp-ytile-downscaling.html

  * igt@kms_flip_scaled_crc@flip-64bpp-ytile-to-16bpp-ytile-downscaling@pipe-a-valid-mode:
    - shard-bmg:          NOTRUN -> [SKIP][69] ([Intel XE#2293]) +1 other test skip
   [69]: https://intel-gfx-ci.01.org/tree/intel-xe/xe-pw-155188v4/shard-bmg-10/igt@kms_flip_scaled_crc@flip-64bpp-ytile-to-16bpp-ytile-downscaling@pipe-a-valid-mode.html

  * igt@kms_frontbuffer_tracking@drrs-abgr161616f-draw-render:
    - shard-bmg:          NOTRUN -> [SKIP][70] ([Intel XE#7061]) +2 other tests skip
   [70]: https://intel-gfx-ci.01.org/tree/intel-xe/xe-pw-155188v4/shard-bmg-10/igt@kms_frontbuffer_tracking@drrs-abgr161616f-draw-render.html

  * igt@kms_frontbuffer_tracking@fbc-2p-scndscrn-spr-indfb-draw-mmap-wc:
    - shard-bmg:          NOTRUN -> [SKIP][71] ([Intel XE#4141]) +2 other tests skip
   [71]: https://intel-gfx-ci.01.org/tree/intel-xe/xe-pw-155188v4/shard-bmg-8/igt@kms_frontbuffer_tracking@fbc-2p-scndscrn-spr-indfb-draw-mmap-wc.html

  * igt@kms_frontbuffer_tracking@fbcdrrs-2p-scndscrn-cur-indfb-draw-mmap-wc:
    - shard-bmg:          NOTRUN -> [SKIP][72] ([Intel XE#2311]) +17 other tests skip
   [72]: https://intel-gfx-ci.01.org/tree/intel-xe/xe-pw-155188v4/shard-bmg-8/igt@kms_frontbuffer_tracking@fbcdrrs-2p-scndscrn-cur-indfb-draw-mmap-wc.html

  * igt@kms_frontbuffer_tracking@psr-2p-scndscrn-indfb-pgflip-blt:
    - shard-bmg:          NOTRUN -> [SKIP][73] ([Intel XE#2313]) +20 other tests skip
   [73]: https://intel-gfx-ci.01.org/tree/intel-xe/xe-pw-155188v4/shard-bmg-8/igt@kms_frontbuffer_tracking@psr-2p-scndscrn-indfb-pgflip-blt.html

  * igt@kms_hdr@bpc-switch-dpms@pipe-a-dp-2:
    - shard-bmg:          [PASS][74] -> [ABORT][75] ([Intel XE#6740]) +1 other test abort
   [74]: https://intel-gfx-ci.01.org/tree/intel-xe/xe-4406-af18785a6a8621b1a5805ba8a1b35d290cb4bcac/shard-bmg-10/igt@kms_hdr@bpc-switch-dpms@pipe-a-dp-2.html
   [75]: https://intel-gfx-ci.01.org/tree/intel-xe/xe-pw-155188v4/shard-bmg-8/igt@kms_hdr@bpc-switch-dpms@pipe-a-dp-2.html

  * igt@kms_joiner@basic-force-ultra-joiner:
    - shard-bmg:          NOTRUN -> [SKIP][76] ([Intel XE#6911])
   [76]: https://intel-gfx-ci.01.org/tree/intel-xe/xe-pw-155188v4/shard-bmg-8/igt@kms_joiner@basic-force-ultra-joiner.html

  * igt@kms_pipe_stress@stress-xrgb8888-yftiled:
    - shard-bmg:          NOTRUN -> [SKIP][77] ([Intel XE#6912])
   [77]: https://intel-gfx-ci.01.org/tree/intel-xe/xe-pw-155188v4/shard-bmg-10/igt@kms_pipe_stress@stress-xrgb8888-yftiled.html

  * igt@kms_plane_multiple@tiling-y:
    - shard-bmg:          NOTRUN -> [SKIP][78] ([Intel XE#5020])
   [78]: https://intel-gfx-ci.01.org/tree/intel-xe/xe-pw-155188v4/shard-bmg-8/igt@kms_plane_multiple@tiling-y.html

  * igt@kms_pm_backlight@basic-brightness:
    - shard-bmg:          NOTRUN -> [SKIP][79] ([Intel XE#870])
   [79]: https://intel-gfx-ci.01.org/tree/intel-xe/xe-pw-155188v4/shard-bmg-8/igt@kms_pm_backlight@basic-brightness.html

  * igt@kms_pm_rpm@modeset-lpsp-stress:
    - shard-bmg:          NOTRUN -> [SKIP][80] ([Intel XE#1439] / [Intel XE#3141] / [Intel XE#836])
   [80]: https://intel-gfx-ci.01.org/tree/intel-xe/xe-pw-155188v4/shard-bmg-10/igt@kms_pm_rpm@modeset-lpsp-stress.html

  * igt@kms_psr2_sf@psr2-overlay-plane-move-continuous-exceed-sf:
    - shard-bmg:          NOTRUN -> [SKIP][81] ([Intel XE#1406] / [Intel XE#1489]) +3 other tests skip
   [81]: https://intel-gfx-ci.01.org/tree/intel-xe/xe-pw-155188v4/shard-bmg-10/igt@kms_psr2_sf@psr2-overlay-plane-move-continuous-exceed-sf.html

  * igt@kms_psr@fbc-pr-primary-page-flip:
    - shard-bmg:          NOTRUN -> [SKIP][82] ([Intel XE#1406] / [Intel XE#2234] / [Intel XE#2850]) +8 other tests skip
   [82]: https://intel-gfx-ci.01.org/tree/intel-xe/xe-pw-155188v4/shard-bmg-10/igt@kms_psr@fbc-pr-primary-page-flip.html

  * igt@kms_rotation_crc@bad-pixel-format:
    - shard-bmg:          NOTRUN -> [SKIP][83] ([Intel XE#3414] / [Intel XE#3904]) +2 other tests skip
   [83]: https://intel-gfx-ci.01.org/tree/intel-xe/xe-pw-155188v4/shard-bmg-8/igt@kms_rotation_crc@bad-pixel-format.html

  * igt@kms_sharpness_filter@filter-rotations:
    - shard-bmg:          NOTRUN -> [SKIP][84] ([Intel XE#6503])
   [84]: https://intel-gfx-ci.01.org/tree/intel-xe/xe-pw-155188v4/shard-bmg-10/igt@kms_sharpness_filter@filter-rotations.html

  * igt@kms_vrr@cmrr@pipe-a-edp-1:
    - shard-lnl:          [PASS][85] -> [FAIL][86] ([Intel XE#4459]) +1 other test fail
   [85]: https://intel-gfx-ci.01.org/tree/intel-xe/xe-4406-af18785a6a8621b1a5805ba8a1b35d290cb4bcac/shard-lnl-1/igt@kms_vrr@cmrr@pipe-a-edp-1.html
   [86]: https://intel-gfx-ci.01.org/tree/intel-xe/xe-pw-155188v4/shard-lnl-1/igt@kms_vrr@cmrr@pipe-a-edp-1.html

  * igt@kms_vrr@seamless-rr-switch-vrr:
    - shard-bmg:          NOTRUN -> [SKIP][87] ([Intel XE#1499])
   [87]: https://intel-gfx-ci.01.org/tree/intel-xe/xe-pw-155188v4/shard-bmg-8/igt@kms_vrr@seamless-rr-switch-vrr.html

  * igt@xe_compute@ccs-mode-compute-kernel:
    - shard-bmg:          NOTRUN -> [SKIP][88] ([Intel XE#6599])
   [88]: https://intel-gfx-ci.01.org/tree/intel-xe/xe-pw-155188v4/shard-bmg-8/igt@xe_compute@ccs-mode-compute-kernel.html

  * igt@xe_eudebug@basic-vm-bind-vm-destroy:
    - shard-bmg:          NOTRUN -> [SKIP][89] ([Intel XE#4837]) +3 other tests skip
   [89]: https://intel-gfx-ci.01.org/tree/intel-xe/xe-pw-155188v4/shard-bmg-8/igt@xe_eudebug@basic-vm-bind-vm-destroy.html

  * igt@xe_eudebug_online@single-step:
    - shard-bmg:          NOTRUN -> [SKIP][90] ([Intel XE#4837] / [Intel XE#6665]) +2 other tests skip
   [90]: https://intel-gfx-ci.01.org/tree/intel-xe/xe-pw-155188v4/shard-bmg-8/igt@xe_eudebug_online@single-step.html

  * igt@xe_exec_basic@multigpu-many-execqueues-many-vm-null-defer-mmap:
    - shard-bmg:          NOTRUN -> [SKIP][91] ([Intel XE#2322]) +4 other tests skip
   [91]: https://intel-gfx-ci.01.org/tree/intel-xe/xe-pw-155188v4/shard-bmg-8/igt@xe_exec_basic@multigpu-many-execqueues-many-vm-null-defer-mmap.html

  * igt@xe_exec_fault_mode@many-userptr-invalidate-race:
    - shard-lnl:          [PASS][92] -> [DMESG-WARN][93] ([Intel XE#7063]) +9 other tests dmesg-warn
   [92]: https://intel-gfx-ci.01.org/tree/intel-xe/xe-4406-af18785a6a8621b1a5805ba8a1b35d290cb4bcac/shard-lnl-5/igt@xe_exec_fault_mode@many-userptr-invalidate-race.html
   [93]: https://intel-gfx-ci.01.org/tree/intel-xe/xe-pw-155188v4/shard-lnl-4/igt@xe_exec_fault_mode@many-userptr-invalidate-race.html

  * igt@xe_exec_multi_queue@few-execs-preempt-mode-dyn-priority-smem:
    - shard-bmg:          NOTRUN -> [SKIP][94] ([Intel XE#6874]) +17 other tests skip
   [94]: https://intel-gfx-ci.01.org/tree/intel-xe/xe-pw-155188v4/shard-bmg-10/igt@xe_exec_multi_queue@few-execs-preempt-mode-dyn-priority-smem.html

  * igt@xe_exec_reset@gt-reset-stress:
    - shard-lnl:          [PASS][95] -> [DMESG-WARN][96] ([Intel XE#7023])
   [95]: https://intel-gfx-ci.01.org/tree/intel-xe/xe-4406-af18785a6a8621b1a5805ba8a1b35d290cb4bcac/shard-lnl-2/igt@xe_exec_reset@gt-reset-stress.html
   [96]: https://intel-gfx-ci.01.org/tree/intel-xe/xe-pw-155188v4/shard-lnl-5/igt@xe_exec_reset@gt-reset-stress.html

  * igt@xe_exec_system_allocator@many-64k-mmap-free-huge:
    - shard-bmg:          NOTRUN -> [SKIP][97] ([Intel XE#5007]) +1 other test skip
   [97]: https://intel-gfx-ci.01.org/tree/intel-xe/xe-pw-155188v4/shard-bmg-10/igt@xe_exec_system_allocator@many-64k-mmap-free-huge.html

  * igt@xe_exec_system_allocator@many-execqueues-mmap-huge-nomemset:
    - shard-bmg:          NOTRUN -> [SKIP][98] ([Intel XE#4943]) +14 other tests skip
   [98]: https://intel-gfx-ci.01.org/tree/intel-xe/xe-pw-155188v4/shard-bmg-10/igt@xe_exec_system_allocator@many-execqueues-mmap-huge-nomemset.html

  * igt@xe_exec_system_allocator@twice-large-mmap-free-nomemset:
    - shard-lnl:          [PASS][99] -> [DMESG-WARN][100] ([Intel XE#4537] / [Intel XE#7063])
   [99]: https://intel-gfx-ci.01.org/tree/intel-xe/xe-4406-af18785a6a8621b1a5805ba8a1b35d290cb4bcac/shard-lnl-5/igt@xe_exec_system_allocator@twice-large-mmap-free-nomemset.html
   [100]: https://intel-gfx-ci.01.org/tree/intel-xe/xe-pw-155188v4/shard-lnl-4/igt@xe_exec_system_allocator@twice-large-mmap-free-nomemset.html

  * igt@xe_fault_injection@exec-queue-create-fail-xe_pxp_exec_queue_add:
    - shard-bmg:          NOTRUN -> [SKIP][101] ([Intel XE#6281])
   [101]: https://intel-gfx-ci.01.org/tree/intel-xe/xe-pw-155188v4/shard-bmg-8/igt@xe_fault_injection@exec-queue-create-fail-xe_pxp_exec_queue_add.html

  * igt@xe_media_fill@media-fill:
    - shard-bmg:          NOTRUN -> [SKIP][102] ([Intel XE#2459] / [Intel XE#2596])
   [102]: https://intel-gfx-ci.01.org/tree/intel-xe/xe-pw-155188v4/shard-bmg-10/igt@xe_media_fill@media-fill.html

  * igt@xe_module_load@force-load:
    - shard-bmg:          NOTRUN -> [SKIP][103] ([Intel XE#2457])
   [103]: https://intel-gfx-ci.01.org/tree/intel-xe/xe-pw-155188v4/shard-bmg-8/igt@xe_module_load@force-load.html

  * igt@xe_pm@d3hot-i2c:
    - shard-bmg:          NOTRUN -> [SKIP][104] ([Intel XE#5742])
   [104]: https://intel-gfx-ci.01.org/tree/intel-xe/xe-pw-155188v4/shard-bmg-8/igt@xe_pm@d3hot-i2c.html

  * igt@xe_query@multigpu-query-mem-usage:
    - shard-bmg:          NOTRUN -> [SKIP][105] ([Intel XE#944])
   [105]: https://intel-gfx-ci.01.org/tree/intel-xe/xe-pw-155188v4/shard-bmg-8/igt@xe_query@multigpu-query-mem-usage.html

  
#### Possible fixes ####

  * igt@kms_flip@flip-vs-expired-vblank@c-edp1:
    - shard-lnl:          [FAIL][106] ([Intel XE#301] / [Intel XE#3149]) -> [PASS][107] +1 other test pass
   [106]: https://intel-gfx-ci.01.org/tree/intel-xe/xe-4406-af18785a6a8621b1a5805ba8a1b35d290cb4bcac/shard-lnl-5/igt@kms_flip@flip-vs-expired-vblank@c-edp1.html
   [107]: https://intel-gfx-ci.01.org/tree/intel-xe/xe-pw-155188v4/shard-lnl-4/igt@kms_flip@flip-vs-expired-vblank@c-edp1.html

  * igt@kms_vblank@query-busy:
    - shard-bmg:          [ABORT][108] ([Intel XE#5545]) -> [PASS][109] +3 other tests pass
   [108]: https://intel-gfx-ci.01.org/tree/intel-xe/xe-4406-af18785a6a8621b1a5805ba8a1b35d290cb4bcac/shard-bmg-2/igt@kms_vblank@query-busy.html
   [109]: https://intel-gfx-ci.01.org/tree/intel-xe/xe-pw-155188v4/shard-bmg-10/igt@kms_vblank@query-busy.html

  
#### Warnings ####

  * igt@kms_frontbuffer_tracking@fbcdrrs-2p-scndscrn-shrfb-pgflip-blt:
    - shard-bmg:          [SKIP][110] ([Intel XE#2312]) -> [SKIP][111] ([Intel XE#2311])
   [110]: https://intel-gfx-ci.01.org/tree/intel-xe/xe-4406-af18785a6a8621b1a5805ba8a1b35d290cb4bcac/shard-bmg-3/igt@kms_frontbuffer_tracking@fbcdrrs-2p-scndscrn-shrfb-pgflip-blt.html
   [111]: https://intel-gfx-ci.01.org/tree/intel-xe/xe-pw-155188v4/shard-bmg-3/igt@kms_frontbuffer_tracking@fbcdrrs-2p-scndscrn-shrfb-pgflip-blt.html

  
  [Intel XE#1124]: https://gitlab.freedesktop.org/drm/xe/kernel/issues/1124
  [Intel XE#1406]: https://gitlab.freedesktop.org/drm/xe/kernel/issues/1406
  [Intel XE#1439]: https://gitlab.freedesktop.org/drm/xe/kernel/issues/1439
  [Intel XE#1489]: https://gitlab.freedesktop.org/drm/xe/kernel/issues/1489
  [Intel XE#1499]: https://gitlab.freedesktop.org/drm/xe/kernel/issues/1499
  [Intel XE#1508]: https://gitlab.freedesktop.org/drm/xe/kernel/issues/1508
  [Intel XE#2234]: https://gitlab.freedesktop.org/drm/xe/kernel/issues/2234
  [Intel XE#2244]: https://gitlab.freedesktop.org/drm/xe/kernel/issues/2244
  [Intel XE#2252]: https://gitlab.freedesktop.org/drm/xe/kernel/issues/2252
  [Intel XE#2293]: https://gitlab.freedesktop.org/drm/xe/kernel/issues/2293
  [Intel XE#2311]: https://gitlab.freedesktop.org/drm/xe/kernel/issues/2311
  [Intel XE#2312]: https://gitlab.freedesktop.org/drm/xe/kernel/issues/2312
  [Intel XE#2313]: https://gitlab.freedesktop.org/drm/xe/kernel/issues/2313
  [Intel XE#2314]: https://gitlab.freedesktop.org/drm/xe/kernel/issues/2314
  [Intel XE#2320]: https://gitlab.freedesktop.org/drm/xe/kernel/issues/2320
  [Intel XE#2321]: https://gitlab.freedesktop.org/drm/xe/kernel/issues/2321
  [Intel XE#2322]: https://gitlab.freedesktop.org/drm/xe/kernel/issues/2322
  [Intel XE#2325]: https://gitlab.freedesktop.org/drm/xe/kernel/issues/2325
  [Intel XE#2327]: https://gitlab.freedesktop.org/drm/xe/kernel/issues/2327
  [Intel XE#2328]: https://gitlab.freedesktop.org/drm/xe/kernel/issues/2328
  [Intel XE#2341]: https://gitlab.freedesktop.org/drm/xe/kernel/issues/2341
  [Intel XE#2380]: https://gitlab.freedesktop.org/drm/xe/kernel/issues/2380
  [Intel XE#2457]: https://gitlab.freedesktop.org/drm/xe/kernel/issues/2457
  [Intel XE#2459]: https://gitlab.freedesktop.org/drm/xe/kernel/issues/2459
  [Intel XE#2596]: https://gitlab.freedesktop.org/drm/xe/kernel/issues/2596
  [Intel XE#2652]: https://gitlab.freedesktop.org/drm/xe/kernel/issues/2652
  [Intel XE#2850]: https://gitlab.freedesktop.org/drm/xe/kernel/issues/2850
  [Intel XE#2887]: https://gitlab.freedesktop.org/drm/xe/kernel/issues/2887
  [Intel XE#2894]: https://gitlab.freedesktop.org/drm/xe/kernel/issues/2894
  [Intel XE#301]: https://gitlab.freedesktop.org/drm/xe/kernel/issues/301
  [Intel XE#3141]: https://gitlab.freedesktop.org/drm/xe/kernel/issues/3141
  [Intel XE#3149]: https://gitlab.freedesktop.org/drm/xe/kernel/issues/3149
  [Intel XE#3414]: https://gitlab.freedesktop.org/drm/xe/kernel/issues/3414
  [Intel XE#3432]: https://gitlab.freedesktop.org/drm/xe/kernel/issues/3432
  [Intel XE#367]: https://gitlab.freedesktop.org/drm/xe/kernel/issues/367
  [Intel XE#3904]: https://gitlab.freedesktop.org/drm/xe/kernel/issues/3904
  [Intel XE#4141]: https://gitlab.freedesktop.org/drm/xe/kernel/issues/4141
  [Intel XE#4459]: https://gitlab.freedesktop.org/drm/xe/kernel/issues/4459
  [Intel XE#4537]: https://gitlab.freedesktop.org/drm/xe/kernel/issues/4537
  [Intel XE#4837]: https://gitlab.freedesktop.org/drm/xe/kernel/issues/4837
  [Intel XE#4943]: https://gitlab.freedesktop.org/drm/xe/kernel/issues/4943
  [Intel XE#5007]: https://gitlab.freedesktop.org/drm/xe/kernel/issues/5007
  [Intel XE#5020]: https://gitlab.freedesktop.org/drm/xe/kernel/issues/5020
  [Intel XE#5545]: https://gitlab.freedesktop.org/drm/xe/kernel/issues/5545
  [Intel XE#5742]: https://gitlab.freedesktop.org/drm/xe/kernel/issues/5742
  [Intel XE#607]: https://gitlab.freedesktop.org/drm/xe/kernel/issues/607
  [Intel XE#6281]: https://gitlab.freedesktop.org/drm/xe/kernel/issues/6281
  [Intel XE#6503]: https://gitlab.freedesktop.org/drm/xe/kernel/issues/6503
  [Intel XE#6599]: https://gitlab.freedesktop.org/drm/xe/kernel/issues/6599
  [Intel XE#6665]: https://gitlab.freedesktop.org/drm/xe/kernel/issues/6665
  [Intel XE#6740]: https://gitlab.freedesktop.org/drm/xe/kernel/issues/6740
  [Intel XE#6874]: https://gitlab.freedesktop.org/drm/xe/kernel/issues/6874
  [Intel XE#6911]: https://gitlab.freedesktop.org/drm/xe/kernel/issues/6911
  [Intel XE#6912]: https://gitlab.freedesktop.org/drm/xe/kernel/issues/6912
  [Intel XE#7023]: https://gitlab.freedesktop.org/drm/xe/kernel/issues/7023
  [Intel XE#7061]: https://gitlab.freedesktop.org/drm/xe/kernel/issues/7061
  [Intel XE#7063]: https://gitlab.freedesktop.org/drm/xe/kernel/issues/7063
  [Intel XE#787]: https://gitlab.freedesktop.org/drm/xe/kernel/issues/787
  [Intel XE#836]: https://gitlab.freedesktop.org/drm/xe/kernel/issues/836
  [Intel XE#870]: https://gitlab.freedesktop.org/drm/xe/kernel/issues/870
  [Intel XE#944]: https://gitlab.freedesktop.org/drm/xe/kernel/issues/944


Build changes
-------------

  * Linux: xe-4406-af18785a6a8621b1a5805ba8a1b35d290cb4bcac -> xe-pw-155188v4

  IGT_8704: 8704
  xe-4406-af18785a6a8621b1a5805ba8a1b35d290cb4bcac: af18785a6a8621b1a5805ba8a1b35d290cb4bcac
  xe-pw-155188v4: 155188v4

== Logs ==

For more details see: https://intel-gfx-ci.01.org/tree/intel-xe/xe-pw-155188v4/index.html



* Re: [PATCH v4 3/4] drm/xe/xe_hw_error: Add support for GT hardware errors
  2026-01-19  4:00 ` [PATCH v4 3/4] drm/xe/xe_hw_error: Add support for GT hardware errors Riana Tauro
@ 2026-01-19  9:06   ` kernel test robot
  2026-01-21  7:09   ` Raag Jadav
  1 sibling, 0 replies; 22+ messages in thread
From: kernel test robot @ 2026-01-19  9:06 UTC (permalink / raw)
  To: Riana Tauro, intel-xe, dri-devel
  Cc: oe-kbuild-all, aravind.iddamsetty, anshuman.gupta, rodrigo.vivi,
	joonas.lahtinen, simona.vetter, airlied, pratik.bari,
	joshua.santosh.ranjan, ashwin.kumar.kulkarni, shubham.kumar,
	ravi.kishore.koppuravuri, raag.jadav, Riana Tauro,
	Himal Prasad Ghimiray

Hi Riana,

kernel test robot noticed the following build errors:

[auto build test ERROR on drm-xe/drm-xe-next]
[also build test ERROR on drm/drm-next next-20260116]
[cannot apply to drm-misc/drm-misc-next linus/master v6.19-rc6]
[If your patch is applied to the wrong git tree, kindly drop us a note.
And when submitting patch, we suggest to use '--base' as documented in
https://git-scm.com/docs/git-format-patch#_base_tree_information]

url:    https://github.com/intel-lab-lkp/linux/commits/Riana-Tauro/drm-ras-Introduce-the-DRM-RAS-infrastructure-over-generic-netlink/20260119-113326
base:   https://gitlab.freedesktop.org/drm/xe/kernel.git drm-xe-next
patch link:    https://lore.kernel.org/r/20260119040023.2821518-9-riana.tauro%40intel.com
patch subject: [PATCH v4 3/4] drm/xe/xe_hw_error: Add support for GT hardware errors
config: alpha-randconfig-r072-20260119 (https://download.01.org/0day-ci/archive/20260119/202601191630.EAGtgUQy-lkp@intel.com/config)
compiler: alpha-linux-gcc (GCC) 8.5.0
smatch version: v0.5.0-8985-g2614ff1a
reproduce (this is a W=1 build): (https://download.01.org/0day-ci/archive/20260119/202601191630.EAGtgUQy-lkp@intel.com/reproduce)

If you fix the issue in a separate patch/commit (i.e. not just a new version of
the same patch/commit), kindly add following tags
| Reported-by: kernel test robot <lkp@intel.com>
| Closes: https://lore.kernel.org/oe-kbuild-all/202601191630.EAGtgUQy-lkp@intel.com/

All errors (new ones prefixed by >>):

   drivers/gpu/drm/xe/xe_hw_error.c: In function 'gt_hw_error_handler':
>> drivers/gpu/drm/xe/xe_hw_error.c:168:4: error: a label can only be part of a statement and a declaration is not a statement
       u32 errbit;
       ^~~


vim +168 drivers/gpu/drm/xe/xe_hw_error.c

   134	
   135	static void gt_hw_error_handler(struct xe_tile *tile, const enum hardware_error hw_err,
   136					u32 error_id)
   137	{
   138		const enum drm_xe_ras_error_severity severity = hw_err_to_severity(hw_err);
   139		struct xe_device *xe = tile_to_xe(tile);
   140		struct xe_drm_ras *ras = &xe->ras;
   141		struct xe_drm_ras_counter *info = ras->info[severity];
   142		struct xe_mmio *mmio = &tile->mmio;
   143		unsigned long err_stat = 0;
   144		int i, len;
   145	
   146		if (xe->info.platform != XE_PVC)
   147			return;
   148	
   149		if (hw_err == HARDWARE_ERROR_NONFATAL) {
   150			atomic64_inc(&info[error_id].counter);
   151			log_hw_error(tile, info[error_id].name, severity);
   152			return;
   153		}
   154	
   155		len = (hw_err == HARDWARE_ERROR_CORRECTABLE) ? ERR_STAT_GT_COR_VECTOR_LEN
   156							     : ERR_STAT_GT_VECTOR_MAX;
   157	
   158		for (i = 0; i < len; i++) {
   159			u32 vector, val;
   160	
   161			vector = xe_mmio_read32(mmio, ERR_STAT_GT_VECTOR_REG(hw_err, i));
   162			if (!vector)
   163				continue;
   164	
   165			switch (i) {
   166			case ERR_STAT_GT_VECTOR0:
   167			case ERR_STAT_GT_VECTOR1:
 > 168				u32 errbit;
   169	
   170				val = hweight32(vector);
   171				atomic64_add(val, &info[error_id].counter);
   172				log_gt_err(tile, "Subslice", i, vector, severity);
   173	
   174				/* Read Error Status Register once */
   175				if (err_stat)
   176					break;
   177	
   178				err_stat = xe_mmio_read32(mmio, ERR_STAT_GT_REG(hw_err));
   179				for_each_set_bit(errbit, &err_stat, GT_HW_ERROR_MAX_ERR_BITS) {
   180					if (hw_err == HARDWARE_ERROR_CORRECTABLE &&
   181					    (BIT(errbit) & PVC_COR_ERR_MASK))
   182						atomic64_inc(&info[error_id].counter);
   183					if (hw_err == HARDWARE_ERROR_FATAL &&
   184					    (BIT(errbit) & PVC_FAT_ERR_MASK))
   185						atomic64_inc(&info[error_id].counter);
   186				}
   187				if (err_stat)
   188					xe_mmio_write32(mmio, ERR_STAT_GT_REG(hw_err), err_stat);
   189				break;
   190			case ERR_STAT_GT_VECTOR2:
   191			case ERR_STAT_GT_VECTOR3:
   192				val = hweight32(vector);
   193				atomic64_add(val, &info[error_id].counter);
   194				log_gt_err(tile, "L3 BANK", i, vector, severity);
   195				break;
   196			case ERR_STAT_GT_VECTOR6:
   197				val = hweight32(vector);
   198				atomic64_add(val, &info[error_id].counter);
   199				log_gt_err(tile, "TLB", i, vector, severity);
   200				break;
   201			case ERR_STAT_GT_VECTOR7:
   202				val = hweight32(vector);
   203				atomic64_add(val, &info[error_id].counter);
   204				break;
   205			default:
   206				log_gt_err(tile, "Undefined", i, vector, severity);
   207			}
   208	
   209			xe_mmio_write32(mmio, ERR_STAT_GT_VECTOR_REG(hw_err, i), vector);
   210		}
   211	}
   212	

-- 
0-DAY CI Kernel Test Service
https://github.com/intel/lkp-tests/wiki


* Re: [PATCH v4 2/4] drm/xe/xe_drm_ras: Add support for drm ras
  2026-01-19  4:00 ` [PATCH v4 2/4] drm/xe/xe_drm_ras: Add support for drm ras Riana Tauro
@ 2026-01-20 17:01   ` Raag Jadav
  2026-01-28  6:51     ` Riana Tauro
  0 siblings, 1 reply; 22+ messages in thread
From: Raag Jadav @ 2026-01-20 17:01 UTC (permalink / raw)
  To: Riana Tauro
  Cc: intel-xe, dri-devel, aravind.iddamsetty, anshuman.gupta,
	rodrigo.vivi, joonas.lahtinen, simona.vetter, airlied,
	pratik.bari, joshua.santosh.ranjan, ashwin.kumar.kulkarni,
	shubham.kumar, ravi.kishore.koppuravuri

On Mon, Jan 19, 2026 at 09:30:24AM +0530, Riana Tauro wrote:
> Allocate correctable, uncorrectable nodes for every xe device
> Each node contains error classes, counters and respective
> query counter functions.

...

> +static int hw_query_error_counter(struct xe_drm_ras_counter *info,
> +				  u32 error_id, const char **name, u32 *val)
> +{
> +	if (error_id < DRM_XE_RAS_ERROR_CLASS_GT || error_id >= DRM_XE_RAS_ERROR_CLASS_MAX)

This looks like it can be in_range().
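
For reference, a minimal userspace sketch of the range check being suggested. It assumes the semantics of the kernel's in_range(val, start, len) helper, i.e. true when start <= val < start + len. The in_range_u32() stand-in, error_id_is_valid() and the ERROR_CLASS_* names are illustrative only; the enum values mirror the uAPI enum quoted later in this patch.

```c
#include <stdbool.h>
#include <stdint.h>

/* Mirrors the class enum quoted in this patch (GT = 1, SOC, MAX). */
enum { ERROR_CLASS_GT = 1, ERROR_CLASS_SOC, ERROR_CLASS_MAX };

/*
 * Hypothetical userspace stand-in for in_range(val, start, len):
 * true when start <= val < start + len. The unsigned subtraction
 * wraps for val < start, so one comparison covers both bounds.
 */
static bool in_range_u32(uint32_t val, uint32_t start, uint32_t len)
{
	return (uint32_t)(val - start) < len;
}

/* Equivalent of the open-coded bounds check in hw_query_error_counter(). */
static bool error_id_is_valid(uint32_t error_id)
{
	return in_range_u32(error_id, ERROR_CLASS_GT,
			    ERROR_CLASS_MAX - ERROR_CLASS_GT);
}
```

With these values, IDs 1 and 2 are accepted and everything else is rejected, which matches the original two-sided comparison.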

> +		return -EINVAL;
> +
> +	if (!info[error_id].name)
> +		return -ENOENT;
> +
> +	*name = info[error_id].name;
> +	*val = atomic64_read(&info[error_id].counter);
> +
> +	return 0;
> +}
> +
> +static int query_uncorrectable_error_counters(struct drm_ras_node *ep,

This is named as 'counters' but I only see a single call here. What am
I missing?

> +					      u32 error_id, const char **name,
> +					      u32 *val)

Can this be fewer lines?

> +{
> +	struct xe_device *xe = ep->priv;
> +	struct xe_drm_ras *ras = &xe->ras;
> +	struct xe_drm_ras_counter *info = ras->info[DRM_XE_RAS_ERROR_SEVERITY_UNCORRECTABLE];
> +
> +	return hw_query_error_counter(info, error_id, name, val);
> +}
> +
> +static int query_correctable_error_counters(struct drm_ras_node *ep,

Same as above.

> +					    u32 error_id, const char **name,
> +					    u32 *val)

Same as above.

> +{
> +	struct xe_device *xe = ep->priv;
> +	struct xe_drm_ras *ras = &xe->ras;
> +	struct xe_drm_ras_counter *info = ras->info[DRM_XE_RAS_ERROR_SEVERITY_CORRECTABLE];
> +
> +	return hw_query_error_counter(info, error_id, name, val);
> +}
> +
> +static struct xe_drm_ras_counter *allocate_and_copy_counters(struct xe_device *xe)
> +{
> +	struct xe_drm_ras_counter *counter;
> +	int i;
> +
> +	counter = drmm_kcalloc(&xe->drm, DRM_XE_RAS_ERROR_CLASS_MAX,
> +			       sizeof(struct xe_drm_ras_counter), GFP_KERNEL);

I'd make this robust against type changes, i.e. sizeof(*counter).
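
A small userspace sketch of the idiom, using calloc() as a stand-in for drmm_kcalloc(). The struct ras_counter type and alloc_counters() are hypothetical names used only for illustration.

```c
#include <stdlib.h>

/* Illustrative stand-in for struct xe_drm_ras_counter. */
struct ras_counter {
	const char *name;
	long long counter;
};

static struct ras_counter *alloc_counters(size_t n)
{
	struct ras_counter *counter;

	/*
	 * sizeof(*counter) is derived from the pointer itself, so the
	 * element size stays correct if the pointer's type is changed.
	 */
	counter = calloc(n, sizeof(*counter));

	return counter;
}
```

Like calloc(), the kcalloc-style allocators return zeroed memory, so every element starts with a NULL name and a zero count.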

> +	if (!counter)
> +		return ERR_PTR(-ENOMEM);
> +
> +	for (i = 0; i < DRM_XE_RAS_ERROR_CLASS_MAX; i++) {
> +		if (!errors[i])
> +			continue;
> +
> +		counter[i].name = errors[i];
> +		atomic64_set(&counter[i].counter, 0);

Doesn't drmm_kcalloc() already take care of this?

> +	}
> +
> +	return counter;
> +}
> +
> +static int assign_node_params(struct xe_device *xe, struct drm_ras_node *node,
> +			      const enum drm_xe_ras_error_severity severity)
> +{
> +	struct pci_dev *pdev = to_pci_dev(xe->drm.dev);
> +	struct xe_drm_ras *ras = &xe->ras;
> +	const char *device_name;
> +
> +	device_name = kasprintf(GFP_KERNEL, "%04x:%02x:%02x.%d",
> +				pci_domain_nr(pdev->bus), pdev->bus->number,
> +				PCI_SLOT(pdev->devfn), PCI_FUNC(pdev->devfn));
> +
> +	node->device_name = device_name;
> +	node->node_name = error_severity[severity];
> +	node->type = DRM_RAS_NODE_TYPE_ERROR_COUNTER;
> +	node->error_counter_range.first = DRM_XE_RAS_ERROR_CLASS_GT;
> +	node->error_counter_range.last = DRM_XE_RAS_ERROR_CLASS_MAX - 1;
> +	node->priv = xe;
> +
> +	ras->info[severity] = allocate_and_copy_counters(xe);
> +	if (IS_ERR(ras->info[severity]))
> +		return PTR_ERR(ras->info[severity]);
> +
> +	if (severity == DRM_XE_RAS_ERROR_SEVERITY_CORRECTABLE)
> +		node->query_error_counter = query_correctable_error_counters;
> +	else
> +		node->query_error_counter = query_uncorrectable_error_counters;

Shouldn't this have an explicit severity check, at least for future-proofing?

> +
> +	return 0;
> +}
> +
> +static int register_nodes(struct xe_device *xe)
> +{
> +	struct xe_drm_ras *ras = &xe->ras;
> +	int i;
> +
> +	for_each_error_severity(i) {
> +		struct drm_ras_node *node = &ras->node[i];
> +		int ret;
> +
> +		ret = assign_node_params(xe, node, i);
> +		if (ret)
> +			return ret;
> +
> +		ret = drm_ras_node_register(node);
> +		if (ret)
> +			return ret;
> +	}
> +
> +	return 0;
> +}
> +
> +static void xe_drm_ras_unregister_nodes(void *arg)
> +{
> +	struct xe_device *xe = arg;
> +	struct xe_drm_ras *ras = &xe->ras;
> +	int i;
> +
> +	for_each_error_severity(i) {
> +		struct drm_ras_node *node = &ras->node[i];
> +
> +		drm_ras_node_unregister(node);
> +
> +		if (i == 0)
> +			kfree(node->device_name);

Aren't we allocating this for each node?

> +	}
> +}
> +
> +/**
> + * xe_drm_ras_allocate_nodes - Allocate DRM RAS nodes
> + * @xe: xe device instance
> + *
> + * Allocate and register DRM RAS nodes per device
> + *
> + * Return: 0 on success, error code on failure
> + */
> +int xe_drm_ras_allocate_nodes(struct xe_device *xe)
> +{
> +	struct xe_drm_ras *ras = &xe->ras;
> +	struct drm_ras_node *node;
> +	int err;
> +
> +	node = drmm_kcalloc(&xe->drm, DRM_XE_RAS_ERROR_SEVERITY_MAX, sizeof(struct drm_ras_node),

Ditto for robustness against type changes.

> +			    GFP_KERNEL);
> +	if (!node)
> +		return -ENOMEM;
> +
> +	ras->node = node;
> +
> +	err = register_nodes(xe);
> +	if (err) {
> +		drm_err(&xe->drm, "Failed to register drm ras node\n");
> +		return err;
> +	}
> +
> +	err = devm_add_action_or_reset(xe->drm.dev, xe_drm_ras_unregister_nodes, xe);
> +	if (err) {
> +		drm_err(&xe->drm, "Failed to add action for xe drm_ras\n");
> +		return err;
> +	}
> +
> +	return 0;
> +}
> diff --git a/drivers/gpu/drm/xe/xe_drm_ras.h b/drivers/gpu/drm/xe/xe_drm_ras.h
> new file mode 100644
> index 000000000000..2d714342e4e5
> --- /dev/null
> +++ b/drivers/gpu/drm/xe/xe_drm_ras.h
> @@ -0,0 +1,15 @@
> +/* SPDX-License-Identifier: MIT */
> +/*
> + * Copyright © 2026 Intel Corporation
> + */
> +#ifndef XE_DRM_RAS_H_
> +#define XE_DRM_RAS_H_
> +
> +struct xe_device;
> +
> +#define for_each_error_severity(i)	\
> +	for (i = 0; i < DRM_XE_RAS_ERROR_SEVERITY_MAX; i++)
> +
> +int xe_drm_ras_allocate_nodes(struct xe_device *xe);
> +
> +#endif
> diff --git a/drivers/gpu/drm/xe/xe_drm_ras_types.h b/drivers/gpu/drm/xe/xe_drm_ras_types.h
> new file mode 100644
> index 000000000000..528c708e57da
> --- /dev/null
> +++ b/drivers/gpu/drm/xe/xe_drm_ras_types.h
> @@ -0,0 +1,49 @@
> +/* SPDX-License-Identifier: MIT */
> +/*
> + * Copyright © 2026 Intel Corporation
> + */
> +
> +#ifndef _XE_DRM_RAS_TYPES_H_
> +#define _XE_DRM_RAS_TYPES_H_
> +
> +#include <drm/xe_drm.h>
> +#include <linux/atomic.h>
> +
> +struct drm_ras_node;
> +
> +/* Error categories reported by hardware */
> +enum hardware_error {
> +	HARDWARE_ERROR_CORRECTABLE = 0,
> +	HARDWARE_ERROR_NONFATAL = 1,
> +	HARDWARE_ERROR_FATAL = 2,
> +	HARDWARE_ERROR_MAX,
> +};
> +
> +/**
> + * struct xe_drm_ras_counter - XE RAS counter
> + *
> + * This structure contains error class and counter information
> + */
> +struct xe_drm_ras_counter {
> +	/** @name: error class name */
> +	const char *name;
> +
> +	/** @counter: count of error */
> +	atomic64_t counter;
> +};
> +
> +/**
> + * struct xe_drm_ras - XE DRM RAS structure
> + *
> + * This structure has details of error counters
> + */
> +struct xe_drm_ras {
> +	/** @node: DRM RAS node */
> +	struct drm_ras_node *node;
> +
> +	/** @info: info array for all types of errors */
> +	struct xe_drm_ras_counter *info[DRM_XE_RAS_ERROR_SEVERITY_MAX];
> +

Nit: Redundant blank line.

> +};
> +
> +#endif
> diff --git a/drivers/gpu/drm/xe/xe_hw_error.c b/drivers/gpu/drm/xe/xe_hw_error.c
> index 8c65291f36fc..b42495d3015a 100644
> --- a/drivers/gpu/drm/xe/xe_hw_error.c
> +++ b/drivers/gpu/drm/xe/xe_hw_error.c
> @@ -10,20 +10,14 @@
>  #include "regs/xe_irq_regs.h"
>  
>  #include "xe_device.h"
> +#include "xe_drm_ras.h"
>  #include "xe_hw_error.h"
>  #include "xe_mmio.h"
>  #include "xe_survivability_mode.h"
>  
>  #define  HEC_UNCORR_FW_ERR_BITS 4
>  extern struct fault_attr inject_csc_hw_error;
> -
> -/* Error categories reported by hardware */
> -enum hardware_error {
> -	HARDWARE_ERROR_CORRECTABLE = 0,
> -	HARDWARE_ERROR_NONFATAL = 1,
> -	HARDWARE_ERROR_FATAL = 2,
> -	HARDWARE_ERROR_MAX,
> -};
> +static const char * const error_severity[] = DRM_XE_RAS_ERROR_SEVERITY_NAMES;

This is unrelated to uapi changes, shouldn't we split this into a separate
patch?

...

> +/**
> + * enum drm_xe_ras_error_severity - DRM RAS error severity.
> + */
> +enum drm_xe_ras_error_severity {
> +	/** @DRM_XE_RAS_ERROR_SEVERITY_CORRECTABLE: Correctable Error */
> +	DRM_XE_RAS_ERROR_SEVERITY_CORRECTABLE = 0,

DRM_XE_RAS_ERR_SEV_*? (and same for this entire file)

> +	/** @DRM_XE_RAS_ERROR_UNCORRECTABLE: Uncorrectable Error */

Match with actual name.

> +	DRM_XE_RAS_ERROR_SEVERITY_UNCORRECTABLE,
> +	/** @DRM_XE_RAS_ERROR_SEVERITY_MAX: Max severity */
> +	DRM_XE_RAS_ERROR_SEVERITY_MAX /* non-ABI */
> +};
> +
> +/**
> + * enum drm_xe_ras_error_class - DRM RAS error classes.
> + */
> +enum drm_xe_ras_error_class {
> +	/** @DRM_XE_RAS_ERROR_CLASS_GT: GT Error */
> +	DRM_XE_RAS_ERROR_CLASS_GT = 1,
> +	/** @DRM_XE_RAS_ERROR_CLASS_SOC: SoC Error */
> +	DRM_XE_RAS_ERROR_CLASS_SOC,
> +	/** @DRM_XE_RAS_ERROR_CLASS_MAX: Max Error */
> +	DRM_XE_RAS_ERROR_CLASS_MAX	/* non-ABI */

I don't find 'CLASS' very descriptive since it can inherently mean
anything, but I'm not sure if this is meant to match the spec naming.

PS: I've used 'COMP' for component in my series[1], but up to you.
Also, please help review it in case I've missed anything.

[1] https://lore.kernel.org/intel-xe/20260116093432.914040-1-raag.jadav@intel.com/

Raag

> +};
> +
> +/*
> + * Error severity to name mapping.
> + */
> +#define DRM_XE_RAS_ERROR_SEVERITY_NAMES {					\
> +	[DRM_XE_RAS_ERROR_SEVERITY_CORRECTABLE] = "correctable-errors",		\
> +	[DRM_XE_RAS_ERROR_SEVERITY_UNCORRECTABLE] = "uncorrectable-errors",	\
> +}
> +
> +/*
> + * Error class to name mapping.
> + */
> +#define DRM_XE_RAS_ERROR_CLASS_NAMES {					\
> +	[DRM_XE_RAS_ERROR_CLASS_GT] = "GT",				\
> +	[DRM_XE_RAS_ERROR_CLASS_SOC] = "SoC"				\
> +}
> +
>  #if defined(__cplusplus)
>  }
>  #endif
> -- 
> 2.47.1
> 


* Re: [PATCH v4 3/4] drm/xe/xe_hw_error: Add support for GT hardware errors
  2026-01-19  4:00 ` [PATCH v4 3/4] drm/xe/xe_hw_error: Add support for GT hardware errors Riana Tauro
  2026-01-19  9:06   ` kernel test robot
@ 2026-01-21  7:09   ` Raag Jadav
  2026-01-27  8:29     ` Riana Tauro
  1 sibling, 1 reply; 22+ messages in thread
From: Raag Jadav @ 2026-01-21  7:09 UTC (permalink / raw)
  To: Riana Tauro
  Cc: intel-xe, dri-devel, aravind.iddamsetty, anshuman.gupta,
	rodrigo.vivi, joonas.lahtinen, simona.vetter, airlied,
	pratik.bari, joshua.santosh.ranjan, ashwin.kumar.kulkarni,
	shubham.kumar, ravi.kishore.koppuravuri, Himal Prasad Ghimiray

On Mon, Jan 19, 2026 at 09:30:25AM +0530, Riana Tauro wrote:
> PVC supports GT error reporting via vector registers along with
> error status register. Add support to report these errors and
> update respective counters. Incase of Subslice error reported
> by vector register, process the error status register
> for applicable bits.
> 
> Incorporate the counter inside the driver itself and start
> using the drm_ras generic netlink to report them.
> 
> Co-developed-by: Himal Prasad Ghimiray <himal.prasad.ghimiray@intel.com>
> Signed-off-by: Himal Prasad Ghimiray <himal.prasad.ghimiray@intel.com>
> Signed-off-by: Riana Tauro <riana.tauro@intel.com>
> ---
> v2: Add ID's and names as uAPI (Rodrigo)
> 
> v3: use REG_BIT
>     do not use _ffs
>     use a single function for GT errors
>     remove redundant errors from logs (Raag)
>     use only correctable/uncorrectable error severity (Pratik/Aravind)
> ---
>  drivers/gpu/drm/xe/regs/xe_hw_error_regs.h |  53 +++++-
>  drivers/gpu/drm/xe/xe_hw_error.c           | 182 +++++++++++++++++++--
>  2 files changed, 220 insertions(+), 15 deletions(-)
> 
> diff --git a/drivers/gpu/drm/xe/regs/xe_hw_error_regs.h b/drivers/gpu/drm/xe/regs/xe_hw_error_regs.h
> index c146b9ef44eb..5eeb0be27300 100644
> --- a/drivers/gpu/drm/xe/regs/xe_hw_error_regs.h
> +++ b/drivers/gpu/drm/xe/regs/xe_hw_error_regs.h
> @@ -6,15 +6,60 @@
>  #ifndef _XE_HW_ERROR_REGS_H_
>  #define _XE_HW_ERROR_REGS_H_
>  
> -#define HEC_UNCORR_ERR_STATUS(base)                    XE_REG((base) + 0x118)
> -#define    UNCORR_FW_REPORTED_ERR                      BIT(6)
> +#define HEC_UNCORR_ERR_STATUS(base)		XE_REG((base) + 0x118)
> +#define   UNCORR_FW_REPORTED_ERR		REG_BIT(6)
>  
> -#define HEC_UNCORR_FW_ERR_DW0(base)                    XE_REG((base) + 0x124)
> +#define HEC_UNCORR_FW_ERR_DW0(base)		XE_REG((base) + 0x124)
> +
> +#define ERR_STAT_GT_COR				0x100160
> +#define   EU_GRF_COR_ERR			REG_BIT(15)
> +#define   EU_IC_COR_ERR				REG_BIT(14)
> +#define   SLM_COR_ERR				REG_BIT(13)
> +#define   GUC_COR_ERR				REG_BIT(1)
> +
> +#define ERR_STAT_GT_NONFATAL			0x100164
> +#define ERR_STAT_GT_FATAL			0x100168
> +#define   EU_GRF_FAT_ERR			REG_BIT(15)
> +#define   SLM_FAT_ERR				REG_BIT(13)
> +#define   GUC_FAT_ERR				REG_BIT(6)
> +#define   FPU_FAT_ERR				REG_BIT(3)
> +
> +#define ERR_STAT_GT_REG(x)			XE_REG(_PICK_EVEN((x), \
> +								  ERR_STAT_GT_COR, \
> +								  ERR_STAT_GT_NONFATAL))

Shouldn't this be FATAL?
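
A worked check of the arithmetic may help here. The sketch assumes _PICK_EVEN(index, a, b) expands to a + index * (b - a), as it does in the i915/xe register helpers, and it uses a local PICK_EVEN copy rather than the kernel macro. err_stat_gt_offset() is a hypothetical helper. The offsets are the ones from the quoted hunk.

```c
#include <stdint.h>

/* Assumed definition, mirroring the i915/xe _PICK_EVEN helper. */
#define PICK_EVEN(index, a, b)	((a) + (index) * ((b) - (a)))

/* Offsets as defined in the quoted register header. */
#define ERR_STAT_GT_COR		0x100160
#define ERR_STAT_GT_NONFATAL	0x100164
#define ERR_STAT_GT_FATAL	0x100168

/* Offset selected for hw_err = 0 (correctable), 1 (nonfatal), 2 (fatal). */
static uint32_t err_stat_gt_offset(unsigned int hw_err)
{
	return PICK_EVEN(hw_err, ERR_STAT_GT_COR, ERR_STAT_GT_NONFATAL);
}
```

Under that assumption the macro extrapolates from the first two entries, so index 2 gives 0x100160 + 2 * 4 = 0x100168, which is ERR_STAT_GT_FATAL. Passing ERR_STAT_GT_FATAL as the second entry would instead give a stride of 8.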

> +#define PVC_COR_ERR_MASK			(GUC_COR_ERR | SLM_COR_ERR | EU_IC_COR_ERR | \
> +						 EU_GRF_COR_ERR)
> +
> +#define PVC_FAT_ERR_MASK			(FPU_FAT_ERR | GUC_FAT_ERR | EU_GRF_FAT_ERR | \
> +						 SLM_FAT_ERR)
>  
>  #define DEV_ERR_STAT_NONFATAL			0x100178
>  #define DEV_ERR_STAT_CORRECTABLE		0x10017c
>  #define DEV_ERR_STAT_REG(x)			XE_REG(_PICK_EVEN((x), \
>  								  DEV_ERR_STAT_CORRECTABLE, \
>  								  DEV_ERR_STAT_NONFATAL))
> -#define   XE_CSC_ERROR				BIT(17)
> +
> +#define   XE_CSC_ERROR				17
> +#define   XE_GT_ERROR				0
> +
> +#define ERR_STAT_GT_FATAL_VECTOR_0		0x100260
> +#define ERR_STAT_GT_FATAL_VECTOR_1		0x100264
> +
> +#define ERR_STAT_GT_FATAL_VECTOR_REG(x)		XE_REG(_PICK_EVEN((x), \
> +								  ERR_STAT_GT_FATAL_VECTOR_0, \
> +								  ERR_STAT_GT_FATAL_VECTOR_1))
> +
> +#define ERR_STAT_GT_COR_VECTOR_0		0x1002a0
> +#define ERR_STAT_GT_COR_VECTOR_1		0x1002a4
> +
> +#define ERR_STAT_GT_COR_VECTOR_REG(x)		XE_REG(_PICK_EVEN((x), \
> +								  ERR_STAT_GT_COR_VECTOR_0, \
> +								  ERR_STAT_GT_COR_VECTOR_1))
> +#define ERR_STAT_GT_COR_VECTOR_LEN		4

Now this makes me wonder about FATAL_VECTOR_LEN; perhaps we should add
it? Since we already have enums for the vectors, I'm wondering if we
should reuse them here instead of having separate raw values?

> +#define ERR_STAT_GT_VECTOR_REG(hw_err, x)	(hw_err == HARDWARE_ERROR_CORRECTABLE ? \
> +						ERR_STAT_GT_COR_VECTOR_REG(x) : \
> +						ERR_STAT_GT_FATAL_VECTOR_REG(x))
> +
>  #endif
> diff --git a/drivers/gpu/drm/xe/xe_hw_error.c b/drivers/gpu/drm/xe/xe_hw_error.c
> index b42495d3015a..bd0cf61741ca 100644
> --- a/drivers/gpu/drm/xe/xe_hw_error.c
> +++ b/drivers/gpu/drm/xe/xe_hw_error.c
> @@ -3,6 +3,7 @@
>   * Copyright © 2025 Intel Corporation
>   */
>  
> +#include <linux/bitmap.h>
>  #include <linux/fault-inject.h>
>  
>  #include "regs/xe_gsc_regs.h"
> @@ -15,7 +16,10 @@
>  #include "xe_mmio.h"
>  #include "xe_survivability_mode.h"
>  
> -#define  HEC_UNCORR_FW_ERR_BITS 4
> +#define  GT_HW_ERROR_MAX_ERR_BITS	16
> +#define  HEC_UNCORR_FW_ERR_BITS 	4
> +#define  XE_RAS_REG_SIZE		32

This looks like it can be BITS_PER_TYPE(). Also, why do we need a separate
macro?
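
A userspace sketch of what the suggestion amounts to. The kernel's BITS_PER_TYPE(type) is sizeof(type) * BITS_PER_BYTE; CHAR_BIT stands in for BITS_PER_BYTE here, and ras_reg_bits() is a hypothetical helper.

```c
#include <limits.h>
#include <stddef.h>
#include <stdint.h>

/* Userspace stand-in; the kernel macro multiplies by BITS_PER_BYTE. */
#define BITS_PER_TYPE(type)	(sizeof(type) * CHAR_BIT)

/* Width of a 32-bit register value, derived from its type. */
static size_t ras_reg_bits(void)
{
	return BITS_PER_TYPE(uint32_t);
}
```

Deriving the width from the register's type avoids a separate constant that has to be kept in sync by hand.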

>  extern struct fault_attr inject_csc_hw_error;
>  static const char * const error_severity[] = DRM_XE_RAS_ERROR_SEVERITY_NAMES;
>  
> @@ -26,10 +30,21 @@ static const char * const hec_uncorrected_fw_errors[] = {
>  	"Data Corruption"
>  };
>  
> -static bool fault_inject_csc_hw_error(void)
> -{
> -	return IS_ENABLED(CONFIG_DEBUG_FS) && should_fail(&inject_csc_hw_error, 1);
> -}
> +static const unsigned long xe_hw_error_map[] = {
> +	[XE_GT_ERROR] = DRM_XE_RAS_ERROR_CLASS_GT,
> +};
> +
> +enum gt_vector_regs {
> +	ERR_STAT_GT_VECTOR0 = 0,
> +	ERR_STAT_GT_VECTOR1,
> +	ERR_STAT_GT_VECTOR2,
> +	ERR_STAT_GT_VECTOR3,
> +	ERR_STAT_GT_VECTOR4,
> +	ERR_STAT_GT_VECTOR5,
> +	ERR_STAT_GT_VECTOR6,
> +	ERR_STAT_GT_VECTOR7,
> +	ERR_STAT_GT_VECTOR_MAX,

This is guaranteed to be the last member, so the trailing comma is redundant.

> +};
>  
>  static enum drm_xe_ras_error_severity hw_err_to_severity(enum hardware_error hw_err)
>  {
> @@ -39,6 +54,11 @@ static enum drm_xe_ras_error_severity hw_err_to_severity(enum hardware_error hw_
>  	return DRM_XE_RAS_ERROR_SEVERITY_UNCORRECTABLE;
>  }
>  
> +static bool fault_inject_csc_hw_error(void)
> +{
> +	return IS_ENABLED(CONFIG_DEBUG_FS) && should_fail(&inject_csc_hw_error, 1);
> +}
> +
>  static void csc_hw_error_work(struct work_struct *work)
>  {
>  	struct xe_tile *tile = container_of(work, typeof(*tile), csc_hw_error_work);
> @@ -86,15 +106,121 @@ static void csc_hw_error_handler(struct xe_tile *tile, const enum hardware_error
>  	xe_mmio_write32(mmio, HEC_UNCORR_ERR_STATUS(base), err_src);
>  }
>  
> +static void log_hw_error(struct xe_tile *tile, const char *name,
> +			 const enum drm_xe_ras_error_severity severity)
> +{
> +	const char *severity_str = error_severity[severity];
> +	struct xe_device *xe = tile_to_xe(tile);
> +
> +	if (severity == DRM_XE_RAS_ERROR_SEVERITY_UNCORRECTABLE)

If we have a FATAL case in the future, would we have to come back and
refactor this? Perhaps the reverse logic would be a bit more future-proof.

> +		drm_err_ratelimited(&xe->drm, "%s %s detected\n", name, severity_str);
> +	else
> +		drm_warn(&xe->drm, "%s %s detected\n", name, severity_str);
> +}
> +
> +static void log_gt_err(struct xe_tile *tile, const char *name, int i, u32 err,
> +		       const enum drm_xe_ras_error_severity severity)
> +{
> +	const char *severity_str = error_severity[severity];
> +	struct xe_device *xe = tile_to_xe(tile);
> +
> +	if (severity == DRM_XE_RAS_ERROR_SEVERITY_UNCORRECTABLE)

Ditto.

> +		drm_err_ratelimited(&xe->drm, "%s %s detected, ERROR_STAT_GT_VECTOR%d:0x%08x\n",
> +				    name, severity_str, i, err);
> +	else
> +		drm_warn(&xe->drm, "%s %s detected, ERROR_STAT_GT_VECTOR%d:0x%08x\n",
> +			 name, severity_str, i, err);
> +}
> +
> +static void gt_hw_error_handler(struct xe_tile *tile, const enum hardware_error hw_err,
> +				u32 error_id)
> +{
> +	const enum drm_xe_ras_error_severity severity = hw_err_to_severity(hw_err);
> +	struct xe_device *xe = tile_to_xe(tile);
> +	struct xe_drm_ras *ras = &xe->ras;
> +	struct xe_drm_ras_counter *info = ras->info[severity];
> +	struct xe_mmio *mmio = &tile->mmio;
> +	unsigned long err_stat = 0;
> +	int i, len;
> +
> +	if (xe->info.platform != XE_PVC)
> +		return;
> +
> +	if (hw_err == HARDWARE_ERROR_NONFATAL) {
> +		atomic64_inc(&info[error_id].counter);
> +		log_hw_error(tile, info[error_id].name, severity);
> +		return;
> +	}
> +
> +	len = (hw_err == HARDWARE_ERROR_CORRECTABLE) ? ERR_STAT_GT_COR_VECTOR_LEN
> +						     : ERR_STAT_GT_VECTOR_MAX;
> +
> +	for (i = 0; i < len; i++) {
> +		u32 vector, val;
> +
> +		vector = xe_mmio_read32(mmio, ERR_STAT_GT_VECTOR_REG(hw_err, i));
> +		if (!vector)
> +			continue;
> +
> +		switch (i) {
> +		case ERR_STAT_GT_VECTOR0:
> +		case ERR_STAT_GT_VECTOR1:
> +			u32 errbit;

With this I think you'll need braces to make the compiler happy, so either
add them or move this to the top.
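
A minimal standalone sketch of the braces variant. count_low_bits() is a hypothetical function, not driver code. Older compilers, such as the GCC 8.5.0 in the kernel test robot report, reject a declaration placed directly after a case label because a label must be followed by a statement.

```c
#include <stdint.h>

/* Hypothetical helper: count set bits in the low 16 bits for vectors 0 and 1. */
static unsigned int count_low_bits(int vec, uint32_t status)
{
	unsigned int count = 0;

	switch (vec) {
	case 0:
	case 1: {
		/*
		 * The braces open a compound statement, so the label is
		 * followed by a statement and the declaration is legal.
		 */
		uint32_t bit;

		for (bit = 0; bit < 16; bit++)
			if (status & (UINT32_C(1) << bit))
				count++;
		break;
	}
	default:
		break;
	}

	return count;
}
```

Moving the declaration to the top of the function works equally well and avoids the extra scope.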

> +			val = hweight32(vector);
> +			atomic64_add(val, &info[error_id].counter);
> +			log_gt_err(tile, "Subslice", i, vector, severity);
> +
> +			/* Read Error Status Register once */

Why? Can you please elaborate?

> +			if (err_stat)
> +				break;
> +
> +			err_stat = xe_mmio_read32(mmio, ERR_STAT_GT_REG(hw_err));
> +			for_each_set_bit(errbit, &err_stat, GT_HW_ERROR_MAX_ERR_BITS) {
> +				if (hw_err == HARDWARE_ERROR_CORRECTABLE &&
> +				    (BIT(errbit) & PVC_COR_ERR_MASK))

I'm wondering if there can be a (hw_err ? x) macro for this? Perhaps it'll
help remove the duplication.
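
Something along these lines is what I have in mind. It is only a sketch: the mask values below are placeholders rather than the real PVC_COR_ERR_MASK/PVC_FAT_ERR_MASK, the enum just mirrors the severities used in this patch, and PVC_ERR_MASK() and count_masked_errors() are hypothetical names.

```c
#include <stdint.h>

/* Mirrors the severities handled in the patch; the ordering is assumed. */
enum hardware_error {
	HARDWARE_ERROR_CORRECTABLE,
	HARDWARE_ERROR_NONFATAL,
	HARDWARE_ERROR_FATAL,
};

/* Placeholder values, not the real register masks. */
#define PVC_COR_ERR_MASK	UINT32_C(0x0000000f)
#define PVC_FAT_ERR_MASK	UINT32_C(0x000000f0)

/* Hypothetical helper: pick the mask that matches the severity. */
#define PVC_ERR_MASK(hw_err) \
	((hw_err) == HARDWARE_ERROR_CORRECTABLE ? PVC_COR_ERR_MASK : \
	 (hw_err) == HARDWARE_ERROR_FATAL ? PVC_FAT_ERR_MASK : UINT32_C(0))

/* Count the status bits that are relevant for @hw_err. */
unsigned int count_masked_errors(enum hardware_error hw_err, uint32_t err_stat)
{
	uint32_t masked = err_stat & PVC_ERR_MASK(hw_err);
	unsigned int count = 0;

	/* Clear the lowest set bit on each pass. */
	for (; masked; masked &= masked - 1)
		count++;
	return count;
}
```

With the mask applied up front, the per-bit loop needs only one increment, and it could even collapse into a single hweight32() of the masked value.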

> +					atomic64_inc(&info[error_id].counter);
> +				if (hw_err == HARDWARE_ERROR_FATAL &&
> +				    (BIT(errbit) & PVC_FAT_ERR_MASK))
> +					atomic64_inc(&info[error_id].counter);
> +			}
> +			if (err_stat)
> +				xe_mmio_write32(mmio, ERR_STAT_GT_REG(hw_err), err_stat);
> +			break;
> +		case ERR_STAT_GT_VECTOR2:
> +		case ERR_STAT_GT_VECTOR3:
> +			val = hweight32(vector);
> +			atomic64_add(val, &info[error_id].counter);
> +			log_gt_err(tile, "L3 BANK", i, vector, severity);
> +			break;
> +		case ERR_STAT_GT_VECTOR6:
> +			val = hweight32(vector);
> +			atomic64_add(val, &info[error_id].counter);
> +			log_gt_err(tile, "TLB", i, vector, severity);
> +			break;
> +		case ERR_STAT_GT_VECTOR7:
> +			val = hweight32(vector);
> +			atomic64_add(val, &info[error_id].counter);
> +			break;
> +		default:
> +			log_gt_err(tile, "Undefined", i, vector, severity);
> +		}
> +
> +		xe_mmio_write32(mmio, ERR_STAT_GT_VECTOR_REG(hw_err, i), vector);
> +	}
> +}
> +
>  static void hw_error_source_handler(struct xe_tile *tile, const enum hardware_error hw_err)
>  {
>  	const enum drm_xe_ras_error_severity severity = hw_err_to_severity(hw_err);
>  	const char *severity_str = error_severity[severity];
>  	struct xe_device *xe = tile_to_xe(tile);
> -	unsigned long flags;
> -	u32 err_src;
> +	struct xe_drm_ras *ras = &xe->ras;
> +	struct xe_drm_ras_counter *info = ras->info[severity];
> +	unsigned long flags, err_src;
> +	u32 err_bit;
>  
> -	if (xe->info.platform != XE_BATTLEMAGE)
> +	if (!IS_DGFX(xe))
>  		return;
>  
>  	spin_lock_irqsave(&xe->irq.lock, flags);

I'm wondering if we really need this? We're already inside irq handler so
what are we protecting here?

> @@ -105,11 +231,44 @@ static void hw_error_source_handler(struct xe_tile *tile, const enum hardware_er
>  		goto unlock;
>  	}
>  
> -	if (err_src & XE_CSC_ERROR)
> +	/*
> +	 * On encountering CSC firmware errors, the graphics device is non-recoverable.

... "so bail immediately."

> +	 * The only way to recover from these errors is firmware flash. The device will
> +	 * enter Runtime Survivability mode when such errors are detected.
> +	 */
> +	if (err_src & XE_CSC_ERROR) {
>  		csc_hw_error_handler(tile, hw_err);
> +		goto clear_reg;
> +	}
>  
> -	xe_mmio_write32(&tile->mmio, DEV_ERR_STAT_REG(hw_err), err_src);
> +	if (!info) {
> +		drm_err_ratelimited(&xe->drm, HW_ERR "Errors undefined\n");
> +		goto clear_reg;
> +	}
> +
> +	for_each_set_bit(err_bit, &err_src, XE_RAS_REG_SIZE) {
> +		u32 error_id = xe_hw_error_map[err_bit];

Does this need bounds checking against ARRAY_SIZE()?
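
A minimal sketch of the kind of check I mean, as standalone C. The table contents and hw_error_lookup() are hypothetical; the real xe_hw_error_map is whatever this patch defines, and the check only matters if it has fewer entries than XE_RAS_REG_SIZE.

```c
#include <stddef.h>
#include <stdint.h>

#define ARRAY_SIZE(a) (sizeof(a) / sizeof((a)[0]))

/* Hypothetical bit-to-error-id table, shorter than the register width. */
static const uint32_t hw_error_map[] = { 0, 1, 1, 2 };

/*
 * Look up the error id for @err_bit. Returns 0 and fills @error_id on
 * success, or -1 (leaving @error_id untouched) when the bit has no entry.
 */
int hw_error_lookup(unsigned int err_bit, uint32_t *error_id)
{
	if (err_bit >= ARRAY_SIZE(hw_error_map))
		return -1;

	*error_id = hw_error_map[err_bit];
	return 0;
}
```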

> +		const char *name;
> +
> +		name = info[error_id].name;
> +		if (!name)
> +			goto clear_reg;

Shouldn't we at least give the next id a try?
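
Roughly what I mean, as a standalone sketch rather than driver code. struct counter and handle_error_bits() are made up for illustration, and the table is indexed directly by bit to keep it short.

```c
#include <assert.h>
#include <limits.h>
#include <stddef.h>

struct counter {
	const char *name;	/* NULL when the slot is undefined */
	unsigned long count;
};

/*
 * Bump the counter of every named entry whose bit is set in @err_src and
 * return how many were handled. An unnamed entry only skips its own bit
 * instead of ending the walk.
 */
unsigned int handle_error_bits(struct counter *info, size_t n,
			       unsigned long err_src)
{
	unsigned int handled = 0;
	size_t bit;

	/* Shifting by the type width or more would be undefined. */
	assert(n <= sizeof(err_src) * CHAR_BIT);

	for (bit = 0; bit < n; bit++) {
		if (!(err_src & (1UL << bit)))
			continue;
		if (!info[bit].name)
			continue;	/* undefined here, later bits may be valid */
		info[bit].count++;
		handled++;
	}

	return handled;
}
```

The status register still gets cleared once after the loop, so nothing is lost by carrying on to the remaining bits.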

> +		if (severity == DRM_XE_RAS_ERROR_SEVERITY_UNCORRECTABLE) {

Ditto for logging per severity.

Raag

> +			drm_err_ratelimited(&xe->drm, HW_ERR
> +					    "TILE%d reported %s %s, bit[%d] is set\n",
> +					    tile->id, name, severity_str, err_bit);
> +		} else {
> +			drm_warn(&xe->drm, HW_ERR
> +				 "TILE%d reported %s %s, bit[%d] is set\n",
> +				 tile->id, name, severity_str, err_bit);
> +		}
> +		if (err_bit == XE_GT_ERROR)
> +			gt_hw_error_handler(tile, hw_err, error_id);
> +	}
> +
> +clear_reg:
> +	xe_mmio_write32(&tile->mmio, DEV_ERR_STAT_REG(hw_err), err_src);
>  unlock:
>  	spin_unlock_irqrestore(&xe->irq.lock, flags);
>  }
> @@ -131,9 +290,10 @@ void xe_hw_error_irq_handler(struct xe_tile *tile, const u32 master_ctl)
>  	if (fault_inject_csc_hw_error())
>  		schedule_work(&tile->csc_hw_error_work);
>  
> -	for (hw_err = 0; hw_err < HARDWARE_ERROR_MAX; hw_err++)
> +	for (hw_err = 0; hw_err < HARDWARE_ERROR_MAX; hw_err++) {
>  		if (master_ctl & ERROR_IRQ(hw_err))
>  			hw_error_source_handler(tile, hw_err);
> +	}
>  }
>  
>  static int hw_error_info_init(struct xe_device *xe)
> -- 
> 2.47.1
> 

^ permalink raw reply	[flat|nested] 22+ messages in thread

* Re: [PATCH v4 1/4] drm/ras: Introduce the DRM RAS infrastructure over generic netlink
  2026-01-19  4:00 ` [PATCH v4 1/4] drm/ras: Introduce the DRM RAS infrastructure over generic netlink Riana Tauro
@ 2026-01-22 21:51   ` Zack McKevitt
  2026-02-02  6:20     ` Riana Tauro
  0 siblings, 1 reply; 22+ messages in thread
From: Zack McKevitt @ 2026-01-22 21:51 UTC (permalink / raw)
  To: Riana Tauro, intel-xe, dri-devel
  Cc: aravind.iddamsetty, anshuman.gupta, rodrigo.vivi, joonas.lahtinen,
	simona.vetter, airlied, pratik.bari, joshua.santosh.ranjan,
	ashwin.kumar.kulkarni, shubham.kumar, ravi.kishore.koppuravuri,
	raag.jadav, Lijo Lazar, Hawking Zhang, Jakub Kicinski,
	David S. Miller, Paolo Abeni, Eric Dumazet, netdev

Hi Riana and Rodrigo,

Thanks for incorporating the various pieces of feedback. I think this 
looks good from our end.

Reviewed-by: Zack McKevitt <zachary.mckevitt@oss.qualcomm.com>

Zack

On 1/18/2026 9:00 PM, Riana Tauro wrote:
> From: Rodrigo Vivi <rodrigo.vivi@intel.com>
> 
> Introduces the DRM RAS infrastructure over generic netlink.
> 
> The new interface allows drivers to expose RAS nodes and their
> associated error counters to userspace in a structured and extensible
> way. Each drm_ras node can register its own set of error counters, which
> are then discoverable and queryable through netlink operations. This
> lays the groundwork for reporting and managing hardware error states
> in a unified manner across different DRM drivers.
> 
> Currently it only supports error-counter nodes, but it can be
> extended later.
> 
> The registration is also not tied to any drm node, so it can be
> used by accel devices as well.
> 
> It uses the new and mandatory YAML description format stored in
> Documentation/netlink/specs/. This forces a single generic netlink
> family namespace for the entire drm: "drm-ras".
> But multiple-endpoints are supported within the single family.
> 
> Any modification to this API needs to be applied to
> Documentation/netlink/specs/drm_ras.yaml before regenerating the
> code:
> 
> $ tools/net/ynl/pyynl/ynl_gen_c.py --spec \
>   Documentation/netlink/specs/drm_ras.yaml --mode uapi --header \
>   > include/uapi/drm/drm_ras.h
> 
> $ tools/net/ynl/pyynl/ynl_gen_c.py --spec \
>   Documentation/netlink/specs/drm_ras.yaml --mode kernel --header \
>   > include/drm/drm_ras_nl.h
> 
> $ tools/net/ynl/pyynl/ynl_gen_c.py --spec \
>   Documentation/netlink/specs/drm_ras.yaml --mode kernel --source \
>   > drivers/gpu/drm/drm_ras_nl.c
> 
> Cc: Zack McKevitt <zachary.mckevitt@oss.qualcomm.com>
> Cc: Lijo Lazar <lijo.lazar@amd.com>
> Cc: Hawking Zhang <Hawking.Zhang@amd.com>
> Cc: Jakub Kicinski <kuba@kernel.org>
> Cc: David S. Miller <davem@davemloft.net>
> Cc: Paolo Abeni <pabeni@redhat.com>
> Cc: Eric Dumazet <edumazet@google.com>
> Cc: netdev@vger.kernel.org
> Co-developed-by: Aravind Iddamsetty <aravind.iddamsetty@linux.intel.com>
> Signed-off-by: Aravind Iddamsetty <aravind.iddamsetty@linux.intel.com>
> Signed-off-by: Rodrigo Vivi <rodrigo.vivi@intel.com>
> Signed-off-by: Riana Tauro <riana.tauro@intel.com>
> ---
> v2: fix doc and memory leak
>      use xe_for_each_start
>      use standard genlmsg_iput (Jakub Kicinski)
> 
> v3: add documentation to index
>      modify documentation to mention uAPI requirements (Rodrigo)
> 
> v4: fix typo (Zack)
> ---
>   Documentation/gpu/drm-ras.rst            | 109 +++++++
>   Documentation/gpu/index.rst              |   1 +
>   Documentation/netlink/specs/drm_ras.yaml | 130 +++++++++
>   drivers/gpu/drm/Kconfig                  |   9 +
>   drivers/gpu/drm/Makefile                 |   1 +
>   drivers/gpu/drm/drm_drv.c                |   6 +
>   drivers/gpu/drm/drm_ras.c                | 351 +++++++++++++++++++++++
>   drivers/gpu/drm/drm_ras_genl_family.c    |  42 +++
>   drivers/gpu/drm/drm_ras_nl.c             |  54 ++++
>   include/drm/drm_ras.h                    |  76 +++++
>   include/drm/drm_ras_genl_family.h        |  17 ++
>   include/drm/drm_ras_nl.h                 |  24 ++
>   include/uapi/drm/drm_ras.h               |  49 ++++
>   13 files changed, 869 insertions(+)
>   create mode 100644 Documentation/gpu/drm-ras.rst
>   create mode 100644 Documentation/netlink/specs/drm_ras.yaml
>   create mode 100644 drivers/gpu/drm/drm_ras.c
>   create mode 100644 drivers/gpu/drm/drm_ras_genl_family.c
>   create mode 100644 drivers/gpu/drm/drm_ras_nl.c
>   create mode 100644 include/drm/drm_ras.h
>   create mode 100644 include/drm/drm_ras_genl_family.h
>   create mode 100644 include/drm/drm_ras_nl.h
>   create mode 100644 include/uapi/drm/drm_ras.h
> 
> diff --git a/Documentation/gpu/drm-ras.rst b/Documentation/gpu/drm-ras.rst
> new file mode 100644
> index 000000000000..cec60cf5d17d
> --- /dev/null
> +++ b/Documentation/gpu/drm-ras.rst
> @@ -0,0 +1,109 @@
> +.. SPDX-License-Identifier: GPL-2.0+
> +
> +============================
> +DRM RAS over Generic Netlink
> +============================
> +
> +The DRM RAS (Reliability, Availability, Serviceability) interface provides a
> +standardized way for GPU/accelerator drivers to expose error counters and
> +other reliability nodes to user space via Generic Netlink. This allows
> +diagnostic tools, monitoring daemons, or test infrastructure to query hardware
> +health in a uniform way across different DRM drivers.
> +
> +Key Goals:
> +
> +* Provide a standardized RAS solution for GPU and accelerator drivers, enabling
> +  data center monitoring and reliability operations.
> +* Implement a single drm-ras Generic Netlink family to meet modern Netlink YAML
> +  specifications and centralize all RAS-related communication in one namespace.
> +* Support a basic error counter interface, addressing the immediate, essential
> +  monitoring needs.
> +* Offer a flexible, future-proof interface that can be extended to support
> +  additional types of RAS data in the future.
> +* Allow multiple nodes per driver, enabling drivers to register separate
> +  nodes for different IP blocks, sub-blocks, or other logical subdivisions
> +  as applicable.
> +
> +Nodes
> +=====
> +
> +Nodes are logical abstractions representing an error source or block within
> +the device. Currently, only error counter nodes are supported.
> +
> +Drivers are responsible for registering and unregistering nodes via the
> +`drm_ras_node_register()` and `drm_ras_node_unregister()` APIs.
> +
> +Node Management
> +-------------------
> +
> +.. kernel-doc:: drivers/gpu/drm/drm_ras.c
> +   :doc: DRM RAS Node Management
> +.. kernel-doc:: drivers/gpu/drm/drm_ras.c
> +   :internal:
> +
> +Generic Netlink Usage
> +=====================
> +
> +The interface is implemented as a Generic Netlink family named ``drm-ras``.
> +User space tools can:
> +
> +* List registered nodes with the ``list-nodes`` command.
> +* List all error counters in a node with the ``get-error-counters`` command.
> +* Query error counters using the ``query-error-counter`` command.
> +
> +YAML-based Interface
> +--------------------
> +
> +The interface is described in a YAML specification:
> +
> +:ref:`Documentation/netlink/specs/drm_ras.yaml`
> +
> +This YAML is used to auto-generate user space bindings via
> +``tools/net/ynl/pyynl/ynl_gen_c.py``, and drives the structure of netlink
> +attributes and operations.
> +
> +Usage Notes
> +-----------
> +
> +* User space must first enumerate nodes to obtain their IDs.
> +* Node IDs or Node names can be used for all further queries, such as error counters.
> +* Error counters can be queried by either the Error ID or Error name.
> +* Query Parameters should be defined as part of the uAPI to ensure user interface stability.
> +* The interface supports future extension by adding new node types and
> +  additional attributes.
> +
> +Example: List nodes using ynl
> +
> +.. code-block:: bash
> +
> +    sudo ynl --family drm_ras  --dump list-nodes
> +    [{'device-name': '0000:03:00.0',
> +    'node-id': 0,
> +    'node-name': 'correctable-errors',
> +    'node-type': 'error-counter'},
> +    {'device-name': '0000:03:00.0',
> +     'node-id': 1,
> +    'node-name': 'nonfatal-errors',
> +    'node-type': 'error-counter'},
> +    {'device-name': '0000:03:00.0',
> +    'node-id': 2,
> +    'node-name': 'fatal-errors',
> +    'node-type': 'error-counter'}]
> +
> +Example: List all error counters using ynl
> +
> +.. code-block:: bash
> +
> +
> +   sudo ynl --family drm_ras  --dump get-error-counters --json '{"node-id":1}'
> +   [{'error-id': 1, 'error-name': 'error_name_1', 'error-value': 0},
> +   {'error-id': 2, 'error-name': 'error_name_2', 'error-value': 0}]
> +
> +
> +Example: Query an error counter for a given node
> +
> +.. code-block:: bash
> +
> +   sudo ynl --family drm_ras --do query-error-counter  --json '{"node-id":2, "error-id":1}'
> +   {'error-id': 1, 'error-name': 'error_name_1', 'error-value': 0}
> +
> diff --git a/Documentation/gpu/index.rst b/Documentation/gpu/index.rst
> index 7dcb15850afd..60c73fdcfeed 100644
> --- a/Documentation/gpu/index.rst
> +++ b/Documentation/gpu/index.rst
> @@ -9,6 +9,7 @@ GPU Driver Developer's Guide
>      drm-mm
>      drm-kms
>      drm-kms-helpers
> +   drm-ras
>      drm-uapi
>      drm-usage-stats
>      driver-uapi
> diff --git a/Documentation/netlink/specs/drm_ras.yaml b/Documentation/netlink/specs/drm_ras.yaml
> new file mode 100644
> index 000000000000..be0e379c5bc9
> --- /dev/null
> +++ b/Documentation/netlink/specs/drm_ras.yaml
> @@ -0,0 +1,130 @@
> +# SPDX-License-Identifier: ((GPL-2.0 WITH Linux-syscall-note) OR BSD-3-Clause)
> +---
> +name: drm-ras
> +protocol: genetlink
> +uapi-header: drm/drm_ras.h
> +
> +doc: >-
> +  DRM RAS (Reliability, Availability, Serviceability) over Generic Netlink.
> +  Provides a standardized mechanism for DRM drivers to register "nodes"
> +  representing hardware/software components capable of reporting error counters.
> +  Userspace tools can query the list of nodes or individual error counters
> +  via the Generic Netlink interface.
> +
> +definitions:
> +  -
> +    type: enum
> +    name: node-type
> +    value-start: 1
> +    entries: [error-counter]
> +    doc: >-
> +         Type of the node. Currently, only error-counter nodes are
> +         supported, which expose reliability counters for a hardware/software
> +         component.
> +
> +attribute-sets:
> +  -
> +    name: node-attrs
> +    attributes:
> +      -
> +        name: node-id
> +        type: u32
> +        doc: >-
> +             Unique identifier for the node.
> +             Assigned dynamically by the DRM RAS core upon registration.
> +      -
> +        name: device-name
> +        type: string
> +        doc: >-
> +             Device name chosen by the driver at registration.
> +             Can be a PCI BDF, UUID, or module name if unique.
> +      -
> +        name: node-name
> +        type: string
> +        doc: >-
> +             Node name chosen by the driver at registration.
> +             Can be an IP block name, or any name that identifies the
> +             RAS node inside the device.
> +      -
> +        name: node-type
> +        type: u32
> +        doc: Type of this node, identifying its function.
> +        enum: node-type
> +  -
> +    name: error-counter-attrs
> +    attributes:
> +      -
> +        name: node-id
> +        type: u32
> +        doc:  Node ID targeted by this error counter operation.
> +      -
> +        name: error-id
> +        type: u32
> +        doc: Unique identifier for a specific error counter within a node.
> +      -
> +        name: error-name
> +        type: string
> +        doc: Name of the error.
> +      -
> +        name: error-value
> +        type: u32
> +        doc: Current value of the requested error counter.
> +
> +operations:
> +  list:
> +    -
> +      name: list-nodes
> +      doc: >-
> +           Retrieve the full list of currently registered DRM RAS nodes.
> +           Each node includes its dynamically assigned ID, name, and type.
> +           **Important:** User space must call this operation first to obtain
> +           the node IDs. These IDs are required for all subsequent
> +           operations on nodes, such as querying error counters.
> +      attribute-set: node-attrs
> +      flags: [admin-perm]
> +      dump:
> +        reply:
> +          attributes:
> +            - node-id
> +            - device-name
> +            - node-name
> +            - node-type
> +    -
> +      name: get-error-counters
> +      doc: >-
> +           Retrieve the full list of error counters for a given node.
> +           The response includes the id, the name, and the current
> +           value of each counter.
> +      attribute-set: error-counter-attrs
> +      flags: [admin-perm]
> +      dump:
> +        request:
> +          attributes:
> +            - node-id
> +        reply:
> +          attributes:
> +            - error-id
> +            - error-name
> +            - error-value
> +    -
> +      name: query-error-counter
> +      doc: >-
> +           Query the information of a specific error counter for a given node.
> +           Users must provide the node ID and the error counter ID.
> +           The response contains the id, the name, and the current value
> +           of the counter.
> +      attribute-set: error-counter-attrs
> +      flags: [admin-perm]
> +      do:
> +        request:
> +          attributes:
> +            - node-id
> +            - error-id
> +        reply:
> +          attributes:
> +            - error-id
> +            - error-name
> +            - error-value
> +
> +kernel-family:
> +  headers: ["drm/drm_ras_nl.h"]
> diff --git a/drivers/gpu/drm/Kconfig b/drivers/gpu/drm/Kconfig
> index a33b90251530..f378e77048c8 100644
> --- a/drivers/gpu/drm/Kconfig
> +++ b/drivers/gpu/drm/Kconfig
> @@ -130,6 +130,15 @@ config DRM_PANIC_SCREEN_QR_VERSION
>   	  Smaller QR code are easier to read, but will contain less debugging
>   	  data. Default is 40.
>   
> +config DRM_RAS
> +	bool "DRM RAS support"
> +	depends on DRM
> +	help
> +	  Enables the DRM RAS (Reliability, Availability and Serviceability)
> +	  support for DRM drivers. This provides a Generic Netlink interface
> +	  for error reporting and queries.
> +	  If in doubt, say "N".
> +
>   config DRM_DEBUG_DP_MST_TOPOLOGY_REFS
>           bool "Enable refcount backtrace history in the DP MST helpers"
>   	depends on STACKTRACE_SUPPORT
> diff --git a/drivers/gpu/drm/Makefile b/drivers/gpu/drm/Makefile
> index 0deee72ef935..2eea3f54db53 100644
> --- a/drivers/gpu/drm/Makefile
> +++ b/drivers/gpu/drm/Makefile
> @@ -95,6 +95,7 @@ drm-$(CONFIG_DRM_ACCEL) += ../../accel/drm_accel.o
>   drm-$(CONFIG_DRM_PANIC) += drm_panic.o
>   drm-$(CONFIG_DRM_DRAW) += drm_draw.o
>   drm-$(CONFIG_DRM_PANIC_SCREEN_QR_CODE) += drm_panic_qr.o
> +drm-$(CONFIG_DRM_RAS) += drm_ras.o drm_ras_nl.o drm_ras_genl_family.o
>   obj-$(CONFIG_DRM)	+= drm.o
>   
>   obj-$(CONFIG_DRM_PANEL_ORIENTATION_QUIRKS) += drm_panel_orientation_quirks.o
> diff --git a/drivers/gpu/drm/drm_drv.c b/drivers/gpu/drm/drm_drv.c
> index 2915118436ce..6b965c3d3307 100644
> --- a/drivers/gpu/drm/drm_drv.c
> +++ b/drivers/gpu/drm/drm_drv.c
> @@ -53,6 +53,7 @@
>   #include <drm/drm_panic.h>
>   #include <drm/drm_print.h>
>   #include <drm/drm_privacy_screen_machine.h>
> +#include <drm/drm_ras_genl_family.h>
>   
>   #include "drm_crtc_internal.h"
>   #include "drm_internal.h"
> @@ -1223,6 +1224,7 @@ static const struct file_operations drm_stub_fops = {
>   
>   static void drm_core_exit(void)
>   {
> +	drm_ras_genl_family_unregister();
>   	drm_privacy_screen_lookup_exit();
>   	drm_panic_exit();
>   	accel_core_exit();
> @@ -1261,6 +1263,10 @@ static int __init drm_core_init(void)
>   
>   	drm_privacy_screen_lookup_init();
>   
> +	ret = drm_ras_genl_family_register();
> +	if (ret < 0)
> +		goto error;
> +
>   	drm_core_init_complete = true;
>   
>   	DRM_DEBUG("Initialized\n");
> diff --git a/drivers/gpu/drm/drm_ras.c b/drivers/gpu/drm/drm_ras.c
> new file mode 100644
> index 000000000000..7bc77ea24fe2
> --- /dev/null
> +++ b/drivers/gpu/drm/drm_ras.c
> @@ -0,0 +1,351 @@
> +// SPDX-License-Identifier: MIT
> +/*
> + * Copyright © 2025 Intel Corporation
> + */
> +
> +#include <linux/module.h>
> +#include <linux/kernel.h>
> +#include <linux/netdevice.h>
> +#include <linux/xarray.h>
> +#include <net/genetlink.h>
> +
> +#include <drm/drm_ras.h>
> +
> +/**
> + * DOC: DRM RAS Node Management
> + *
> + * This module provides the infrastructure to manage RAS (Reliability,
> + * Availability, and Serviceability) nodes for DRM drivers. Each
> + * DRM driver may register one or more RAS nodes, which represent
> + * logical components capable of reporting error counters and other
> + * reliability metrics.
> + *
> + * The nodes are stored in a global xarray `drm_ras_xa` to allow
> + * efficient lookup by ID. Nodes can be registered or unregistered
> + * dynamically at runtime.
> + *
> + * A Generic Netlink family `drm_ras` exposes three main operations to
> + * userspace:
> + *
> + * 1. LIST_NODES: Dump all currently registered RAS nodes.
> + *    The user receives an array of node IDs, names, and types.
> + *
> + * 2. GET_ERROR_COUNTERS: Dump all error counters of a given node.
> + *    The user receives an array of error IDs, names, and current value.
> + *
> + * 3. QUERY_ERROR_COUNTER: Query a specific error counter for a given node.
> + *    Userspace must provide the node ID and the counter ID, and
> + *    receives the ID, the error name, and its current value.
> + *
> + * Node registration:
> + * - drm_ras_node_register(): Registers a new node and assigns
> + *   it a unique ID in the xarray.
> + * - drm_ras_node_unregister(): Removes a previously registered
> + *   node from the xarray.
> + *
> + * Node type:
> + * - ERROR_COUNTER:
> + *     + Currently, only error counters are supported.
> + *     + The driver must implement the query_error_counter() callback to provide
> + *       the name and the value of the error counter.
> + *     + The driver must provide an error_counter_range.last value indicating the
> + *       last valid error ID.
> + *     + The driver can provide an error_counter_range.first value indicating the
> + *       first valid error ID.
> + *     + The error counters in the driver don't need to be contiguous, but the
> + *       driver must return -ENOENT to the query_error_counter as an indication
> + *       that the ID should be skipped and not listed in the netlink API.
> + *
> + * Netlink handlers:
> + * - drm_ras_nl_list_nodes_dumpit(): Implements the LIST_NODES
> + *   operation, iterating over the xarray.
> + * - drm_ras_nl_get_error_counters_dumpit(): Implements the GET_ERROR_COUNTERS
> + *   operation, iterating over the known valid error_counter_range.
> + * - drm_ras_nl_query_error_counter_doit(): Implements the QUERY_ERROR_COUNTER
> + *   operation, fetching a counter value from a specific node.
> + */
> +
> +static DEFINE_XARRAY_ALLOC(drm_ras_xa);
> +
> +/*
> + * The netlink callback context carries dump state across multiple dumpit calls
> + */
> +struct drm_ras_ctx {
> +	/* Which xarray id to restart the dump from */
> +	unsigned long restart;
> +};
> +
> +/**
> + * drm_ras_nl_list_nodes_dumpit() - Dump all registered RAS nodes
> + * @skb: Netlink message buffer
> + * @cb: Callback context for multi-part dumps
> + *
> + * Iterates over all registered RAS nodes in the global xarray and appends
> + * their attributes (ID, name, type) to the given netlink message buffer.
> + * Uses @cb->ctx to track progress in case the message buffer fills up, allowing
> + * multi-part dump support. On buffer overflow, updates the context to resume
> + * from the last node on the next invocation.
> + *
> + * Return: 0 if all nodes fit in @skb, number of bytes added to @skb if
> + *          the buffer filled up (requires multi-part continuation), or
> + *          a negative error code on failure.
> + */
> +int drm_ras_nl_list_nodes_dumpit(struct sk_buff *skb,
> +				 struct netlink_callback *cb)
> +{
> +	const struct genl_info *info = genl_info_dump(cb);
> +	struct drm_ras_ctx *ctx = (void *)cb->ctx;
> +	struct drm_ras_node *node;
> +	struct nlattr *hdr;
> +	unsigned long id;
> +	int ret = 0;
> +
> +	xa_for_each_start(&drm_ras_xa, id, node, ctx->restart) {
> +		hdr = genlmsg_iput(skb, info);
> +		if (!hdr) {
> +			ret = -EMSGSIZE;
> +			break;
> +		}
> +
> +		ret = nla_put_u32(skb, DRM_RAS_A_NODE_ATTRS_NODE_ID, node->id);
> +		if (ret) {
> +			genlmsg_cancel(skb, hdr);
> +			break;
> +		}
> +
> +		ret = nla_put_string(skb, DRM_RAS_A_NODE_ATTRS_DEVICE_NAME,
> +				     node->device_name);
> +		if (ret) {
> +			genlmsg_cancel(skb, hdr);
> +			break;
> +		}
> +
> +		ret = nla_put_string(skb, DRM_RAS_A_NODE_ATTRS_NODE_NAME,
> +				     node->node_name);
> +		if (ret) {
> +			genlmsg_cancel(skb, hdr);
> +			break;
> +		}
> +
> +		ret = nla_put_u32(skb, DRM_RAS_A_NODE_ATTRS_NODE_TYPE,
> +				  node->type);
> +		if (ret) {
> +			genlmsg_cancel(skb, hdr);
> +			break;
> +		}
> +
> +		genlmsg_end(skb, hdr);
> +	}
> +
> +	if (ret == -EMSGSIZE)
> +		ctx->restart = id;
> +
> +	return ret;
> +}
> +
> +static int get_node_error_counter(u32 node_id, u32 error_id,
> +				  const char **name, u32 *value)
> +{
> +	struct drm_ras_node *node;
> +
> +	node = xa_load(&drm_ras_xa, node_id);
> +	if (!node || !node->query_error_counter)
> +		return -ENOENT;
> +
> +	if (error_id < node->error_counter_range.first ||
> +	    error_id > node->error_counter_range.last)
> +		return -EINVAL;
> +
> +	return node->query_error_counter(node, error_id, name, value);
> +}
> +
> +static int msg_reply_value(struct sk_buff *msg, u32 error_id,
> +			   const char *error_name, u32 value)
> +{
> +	int ret;
> +
> +	ret = nla_put_u32(msg, DRM_RAS_A_ERROR_COUNTER_ATTRS_ERROR_ID, error_id);
> +	if (ret)
> +		return ret;
> +
> +	ret = nla_put_string(msg, DRM_RAS_A_ERROR_COUNTER_ATTRS_ERROR_NAME,
> +			     error_name);
> +	if (ret)
> +		return ret;
> +
> +	return nla_put_u32(msg, DRM_RAS_A_ERROR_COUNTER_ATTRS_ERROR_VALUE,
> +			   value);
> +}
> +
> +static int doit_reply_value(struct genl_info *info, u32 node_id,
> +			    u32 error_id)
> +{
> +	struct sk_buff *msg;
> +	struct nlattr *hdr;
> +	const char *error_name;
> +	u32 value;
> +	int ret;
> +
> +	msg = genlmsg_new(NLMSG_GOODSIZE, GFP_KERNEL);
> +	if (!msg)
> +		return -ENOMEM;
> +
> +	hdr = genlmsg_iput(msg, info);
> +	if (!hdr) {
> +		nlmsg_free(msg);
> +		return -EMSGSIZE;
> +	}
> +
> +	ret = get_node_error_counter(node_id, error_id,
> +				     &error_name, &value);
> +	if (ret) {
> +		nlmsg_free(msg);
> +		return ret;
> +	}
> +
> +	ret = msg_reply_value(msg, error_id, error_name, value);
> +	if (ret) {
> +		genlmsg_cancel(msg, hdr);
> +		nlmsg_free(msg);
> +		return ret;
> +	}
> +
> +	genlmsg_end(msg, hdr);
> +
> +	return genlmsg_reply(msg, info);
> +}
> +
> +/**
> + * drm_ras_nl_get_error_counters_dumpit() - Dump all Error Counters
> + * @skb: Netlink message buffer
> + * @cb: Callback context for multi-part dumps
> + *
> + * Iterates over all error counters in a given Node and appends
> + * their attributes (ID, name, value) to the given netlink message buffer.
> + * Uses @cb->ctx to track progress in case the message buffer fills up, allowing
> + * multi-part dump support. On buffer overflow, updates the context to resume
> + * from the last error counter on the next invocation.
> + *
> + * Return: 0 if all errors fit in @skb, number of bytes added to @skb if
> + *          the buffer filled up (requires multi-part continuation), or
> + *          a negative error code on failure.
> + */
> +int drm_ras_nl_get_error_counters_dumpit(struct sk_buff *skb,
> +					 struct netlink_callback *cb)
> +{
> +	const struct genl_info *info = genl_info_dump(cb);
> +	struct drm_ras_ctx *ctx = (void *)cb->ctx;
> +	struct drm_ras_node *node;
> +	struct nlattr *hdr;
> +	const char *error_name;
> +	u32 node_id, error_id, value;
> +	int ret;
> +
> +	if (!info->attrs || !info->attrs[DRM_RAS_A_ERROR_COUNTER_ATTRS_NODE_ID])
> +		return -EINVAL;
> +
> +	node_id = nla_get_u32(info->attrs[DRM_RAS_A_ERROR_COUNTER_ATTRS_NODE_ID]);
> +
> +	node = xa_load(&drm_ras_xa, node_id);
> +	if (!node)
> +		return -ENOENT;
> +
> +	for (error_id = max(node->error_counter_range.first, ctx->restart);
> +	     error_id <= node->error_counter_range.last;
> +	     error_id++) {
> +		ret = get_node_error_counter(node_id, error_id,
> +					     &error_name, &value);
> +		/*
> +		 * For non-contiguous range, driver return -ENOENT as indication
> +		 * to skip this ID when listing all errors.
> +		 */
> +		if (ret == -ENOENT)
> +			continue;
> +		if (ret)
> +			return ret;
> +
> +		hdr = genlmsg_iput(skb, info);
> +
> +		if (!hdr) {
> +			ret = -EMSGSIZE;
> +			break;
> +		}
> +
> +		ret = msg_reply_value(skb, error_id, error_name, value);
> +		if (ret) {
> +			genlmsg_cancel(skb, hdr);
> +			break;
> +		}
> +
> +		genlmsg_end(skb, hdr);
> +	}
> +
> +	if (ret == -EMSGSIZE)
> +		ctx->restart = error_id;
> +
> +	return ret;
> +}
> +
> +/**
> + * drm_ras_nl_query_error_counter_doit() - Query an error counter of a node
> + * @skb: Netlink message buffer
> + * @info: Generic Netlink info containing attributes of the request
> + *
> + * Extracts the node ID and error ID from the netlink attributes and
> + * retrieves the current value of the corresponding error counter. Sends the
> + * result back to the requesting user via the standard Genl reply.
> + *
> + * Return: 0 on success, or negative errno on failure.
> + */
> +int drm_ras_nl_query_error_counter_doit(struct sk_buff *skb,
> +					struct genl_info *info)
> +{
> +	u32 node_id, error_id;
> +
> +	if (!info->attrs ||
> +	    !info->attrs[DRM_RAS_A_ERROR_COUNTER_ATTRS_NODE_ID] ||
> +	    !info->attrs[DRM_RAS_A_ERROR_COUNTER_ATTRS_ERROR_ID])
> +		return -EINVAL;
> +
> +	node_id = nla_get_u32(info->attrs[DRM_RAS_A_ERROR_COUNTER_ATTRS_NODE_ID]);
> +	error_id = nla_get_u32(info->attrs[DRM_RAS_A_ERROR_COUNTER_ATTRS_ERROR_ID]);
> +
> +	return doit_reply_value(info, node_id, error_id);
> +}
> +
> +/**
> + * drm_ras_node_register() - Register a new RAS node
> + * @node: Node structure to register
> + *
> + * Adds the given RAS node to the global node xarray and assigns it
> + * a unique ID. @node->device_name, @node->node_name and @node->type must be valid.
> + *
> + * Return: 0 on success, or negative errno on failure.
> + */
> +int drm_ras_node_register(struct drm_ras_node *node)
> +{
> +	if (!node->device_name || !node->node_name)
> +		return -EINVAL;
> +
> +	/* Currently, only Error Counter Endpoints are supported */
> +	if (node->type != DRM_RAS_NODE_TYPE_ERROR_COUNTER)
> +		return -EINVAL;
> +
> +	/* Mandatory entries for Error Counter Node */
> +	if (node->type == DRM_RAS_NODE_TYPE_ERROR_COUNTER &&
> +	    (!node->error_counter_range.last || !node->query_error_counter))
> +		return -EINVAL;
> +
> +	return xa_alloc(&drm_ras_xa, &node->id, node, xa_limit_32b, GFP_KERNEL);
> +}
> +EXPORT_SYMBOL(drm_ras_node_register);
> +
> +/**
> + * drm_ras_node_unregister() - Unregister a previously registered node
> + * @node: Node structure to unregister
> + *
> + * Removes the given node from the global node xarray using its ID.
> + */
> +void drm_ras_node_unregister(struct drm_ras_node *node)
> +{
> +	xa_erase(&drm_ras_xa, node->id);
> +}
> +EXPORT_SYMBOL(drm_ras_node_unregister);
> diff --git a/drivers/gpu/drm/drm_ras_genl_family.c b/drivers/gpu/drm/drm_ras_genl_family.c
> new file mode 100644
> index 000000000000..2d818b8c3808
> --- /dev/null
> +++ b/drivers/gpu/drm/drm_ras_genl_family.c
> @@ -0,0 +1,42 @@
> +// SPDX-License-Identifier: MIT
> +/*
> + * Copyright © 2025 Intel Corporation
> + */
> +
> +#include <drm/drm_ras_genl_family.h>
> +#include <drm/drm_ras_nl.h>
> +
> +/* Track family registration so that drm_core_exit() can be called at any time */
> +static bool registered;
> +
> +/**
> + * drm_ras_genl_family_register() - Register drm-ras genl family
> + *
> + * Only to be called once at drm_drv_init()
> + */
> +int drm_ras_genl_family_register(void)
> +{
> +	int ret;
> +
> +	registered = false;
> +
> +	ret = genl_register_family(&drm_ras_nl_family);
> +	if (ret)
> +		return ret;
> +
> +	registered = true;
> +	return 0;
> +}
> +
> +/**
> + * drm_ras_genl_family_unregister() - Unregister drm-ras genl family
> + *
> + * To be called at drm_drv_exit() at any moment, but only once.
> + */
> +void drm_ras_genl_family_unregister(void)
> +{
> +	if (registered) {
> +		genl_unregister_family(&drm_ras_nl_family);
> +		registered = false;
> +	}
> +}
> diff --git a/drivers/gpu/drm/drm_ras_nl.c b/drivers/gpu/drm/drm_ras_nl.c
> new file mode 100644
> index 000000000000..fcd1392410e4
> --- /dev/null
> +++ b/drivers/gpu/drm/drm_ras_nl.c
> @@ -0,0 +1,54 @@
> +// SPDX-License-Identifier: ((GPL-2.0 WITH Linux-syscall-note) OR BSD-3-Clause)
> +/* Do not edit directly, auto-generated from: */
> +/*	Documentation/netlink/specs/drm_ras.yaml */
> +/* YNL-GEN kernel source */
> +
> +#include <net/netlink.h>
> +#include <net/genetlink.h>
> +
> +#include <uapi/drm/drm_ras.h>
> +#include <drm/drm_ras_nl.h>
> +
> +/* DRM_RAS_CMD_GET_ERROR_COUNTERS - dump */
> +static const struct nla_policy drm_ras_get_error_counters_nl_policy[DRM_RAS_A_ERROR_COUNTER_ATTRS_NODE_ID + 1] = {
> +	[DRM_RAS_A_ERROR_COUNTER_ATTRS_NODE_ID] = { .type = NLA_U32, },
> +};
> +
> +/* DRM_RAS_CMD_QUERY_ERROR_COUNTER - do */
> +static const struct nla_policy drm_ras_query_error_counter_nl_policy[DRM_RAS_A_ERROR_COUNTER_ATTRS_ERROR_ID + 1] = {
> +	[DRM_RAS_A_ERROR_COUNTER_ATTRS_NODE_ID] = { .type = NLA_U32, },
> +	[DRM_RAS_A_ERROR_COUNTER_ATTRS_ERROR_ID] = { .type = NLA_U32, },
> +};
> +
> +/* Ops table for drm_ras */
> +static const struct genl_split_ops drm_ras_nl_ops[] = {
> +	{
> +		.cmd	= DRM_RAS_CMD_LIST_NODES,
> +		.dumpit	= drm_ras_nl_list_nodes_dumpit,
> +		.flags	= GENL_ADMIN_PERM | GENL_CMD_CAP_DUMP,
> +	},
> +	{
> +		.cmd		= DRM_RAS_CMD_GET_ERROR_COUNTERS,
> +		.dumpit		= drm_ras_nl_get_error_counters_dumpit,
> +		.policy		= drm_ras_get_error_counters_nl_policy,
> +		.maxattr	= DRM_RAS_A_ERROR_COUNTER_ATTRS_NODE_ID,
> +		.flags		= GENL_ADMIN_PERM | GENL_CMD_CAP_DUMP,
> +	},
> +	{
> +		.cmd		= DRM_RAS_CMD_QUERY_ERROR_COUNTER,
> +		.doit		= drm_ras_nl_query_error_counter_doit,
> +		.policy		= drm_ras_query_error_counter_nl_policy,
> +		.maxattr	= DRM_RAS_A_ERROR_COUNTER_ATTRS_ERROR_ID,
> +		.flags		= GENL_ADMIN_PERM | GENL_CMD_CAP_DO,
> +	},
> +};
> +
> +struct genl_family drm_ras_nl_family __ro_after_init = {
> +	.name		= DRM_RAS_FAMILY_NAME,
> +	.version	= DRM_RAS_FAMILY_VERSION,
> +	.netnsok	= true,
> +	.parallel_ops	= true,
> +	.module		= THIS_MODULE,
> +	.split_ops	= drm_ras_nl_ops,
> +	.n_split_ops	= ARRAY_SIZE(drm_ras_nl_ops),
> +};
> diff --git a/include/drm/drm_ras.h b/include/drm/drm_ras.h
> new file mode 100644
> index 000000000000..bba47a282ef8
> --- /dev/null
> +++ b/include/drm/drm_ras.h
> @@ -0,0 +1,76 @@
> +/* SPDX-License-Identifier: MIT */
> +/*
> + * Copyright © 2025 Intel Corporation
> + */
> +
> +#ifndef __DRM_RAS_H__
> +#define __DRM_RAS_H__
> +
> +#include "drm_ras_nl.h"
> +
> +/**
> + * struct drm_ras_node - A DRM RAS Node
> + */
> +struct drm_ras_node {
> +	/** @id: Unique identifier for the node. Dynamically assigned. */
> +	u32 id;
> +	/**
> +	 * @device_name: Human-readable name of the device. Given by the driver.
> +	 */
> +	const char *device_name;
> +	/** @node_name: Human-readable name of the node. Given by the driver. */
> +	const char *node_name;
> +	/** @type: Type of the node (enum drm_ras_node_type). */
> +	enum drm_ras_node_type type;
> +
> +	/* Error-Counter Related Callback and Variables */
> +
> +	/** @error_counter_range: Range of valid Error IDs for this node. */
> +	struct {
> +		/** @first: First valid Error ID. */
> +		u32 first;
> +		/** @last: Last valid Error ID. Mandatory entry. */
> +		u32 last;
> +	} error_counter_range;
> +
> +	/**
> +	 * @query_error_counter:
> +	 *
> +	 * This callback is used by drm-ras to query a specific error counter.
> +	 * counters supported by this node. Used for input check and to
> +	 * iterate in all counters.
> +	 *
> +	 * Driver should expect query_error_counters() to be called with
> +	 * error_id from `error_counter_range.first` to
> +	 * `error_counter_range.last`.
> +	 *
> +	 * The @query_error_counter is a mandatory callback for
> +	 * error_counter_node.
> +	 *
> +	 * Returns: 0 on success,
> +	 *          -ENOENT when error_id is not supported as an indication that
> +	 *                  drm_ras should silently skip this entry. Used for
> +	 *                  supporting non-contiguous error ranges.
> +	 *                  Driver is responsible for maintaining the list of
> +	 *                  supported error IDs in the range of first to last.
> +	 *          Other negative values on errors that should terminate the
> +	 *          netlink query.
> +	 */
> +	int (*query_error_counter)(struct drm_ras_node *ep, u32 error_id,
> +				   const char **name, u32 *val);
> +
> +	/** @priv: Driver private data */
> +	void *priv;
> +};
> +
> +struct drm_device;
> +
> +#if IS_ENABLED(CONFIG_DRM_RAS)
> +int drm_ras_node_register(struct drm_ras_node *ep);
> +void drm_ras_node_unregister(struct drm_ras_node *ep);
> +#else
> +static inline int drm_ras_node_register(struct drm_ras_node *ep) { return 0; }
> +static inline void drm_ras_node_unregister(struct drm_ras_node *ep) { }
> +#endif
> +
> +#endif
> diff --git a/include/drm/drm_ras_genl_family.h b/include/drm/drm_ras_genl_family.h
> new file mode 100644
> index 000000000000..5931b53429f1
> --- /dev/null
> +++ b/include/drm/drm_ras_genl_family.h
> @@ -0,0 +1,17 @@
> +/* SPDX-License-Identifier: MIT */
> +/*
> + * Copyright © 2025 Intel Corporation
> + */
> +
> +#ifndef __DRM_RAS_GENL_FAMILY_H__
> +#define __DRM_RAS_GENL_FAMILY_H__
> +
> +#if IS_ENABLED(CONFIG_DRM_RAS)
> +int drm_ras_genl_family_register(void);
> +void drm_ras_genl_family_unregister(void);
> +#else
> +static inline int drm_ras_genl_family_register(void) { return 0; }
> +static inline void drm_ras_genl_family_unregister(void) { }
> +#endif
> +
> +#endif
> diff --git a/include/drm/drm_ras_nl.h b/include/drm/drm_ras_nl.h
> new file mode 100644
> index 000000000000..9613b7d9ffdb
> --- /dev/null
> +++ b/include/drm/drm_ras_nl.h
> @@ -0,0 +1,24 @@
> +/* SPDX-License-Identifier: ((GPL-2.0 WITH Linux-syscall-note) OR BSD-3-Clause) */
> +/* Do not edit directly, auto-generated from: */
> +/*	Documentation/netlink/specs/drm_ras.yaml */
> +/* YNL-GEN kernel header */
> +
> +#ifndef _LINUX_DRM_RAS_GEN_H
> +#define _LINUX_DRM_RAS_GEN_H
> +
> +#include <net/netlink.h>
> +#include <net/genetlink.h>
> +
> +#include <uapi/drm/drm_ras.h>
> +#include <drm/drm_ras_nl.h>
> +
> +int drm_ras_nl_list_nodes_dumpit(struct sk_buff *skb,
> +				 struct netlink_callback *cb);
> +int drm_ras_nl_get_error_counters_dumpit(struct sk_buff *skb,
> +					 struct netlink_callback *cb);
> +int drm_ras_nl_query_error_counter_doit(struct sk_buff *skb,
> +					struct genl_info *info);
> +
> +extern struct genl_family drm_ras_nl_family;
> +
> +#endif /* _LINUX_DRM_RAS_GEN_H */
> diff --git a/include/uapi/drm/drm_ras.h b/include/uapi/drm/drm_ras.h
> new file mode 100644
> index 000000000000..3415ba345ac8
> --- /dev/null
> +++ b/include/uapi/drm/drm_ras.h
> @@ -0,0 +1,49 @@
> +/* SPDX-License-Identifier: ((GPL-2.0 WITH Linux-syscall-note) OR BSD-3-Clause) */
> +/* Do not edit directly, auto-generated from: */
> +/*	Documentation/netlink/specs/drm_ras.yaml */
> +/* YNL-GEN uapi header */
> +
> +#ifndef _UAPI_LINUX_DRM_RAS_H
> +#define _UAPI_LINUX_DRM_RAS_H
> +
> +#define DRM_RAS_FAMILY_NAME	"drm-ras"
> +#define DRM_RAS_FAMILY_VERSION	1
> +
> +/*
> + * Type of the node. Currently, only error-counter nodes are supported, which
> + * expose reliability counters for a hardware/software component.
> + */
> +enum drm_ras_node_type {
> +	DRM_RAS_NODE_TYPE_ERROR_COUNTER = 1,
> +};
> +
> +enum {
> +	DRM_RAS_A_NODE_ATTRS_NODE_ID = 1,
> +	DRM_RAS_A_NODE_ATTRS_DEVICE_NAME,
> +	DRM_RAS_A_NODE_ATTRS_NODE_NAME,
> +	DRM_RAS_A_NODE_ATTRS_NODE_TYPE,
> +
> +	__DRM_RAS_A_NODE_ATTRS_MAX,
> +	DRM_RAS_A_NODE_ATTRS_MAX = (__DRM_RAS_A_NODE_ATTRS_MAX - 1)
> +};
> +
> +enum {
> +	DRM_RAS_A_ERROR_COUNTER_ATTRS_NODE_ID = 1,
> +	DRM_RAS_A_ERROR_COUNTER_ATTRS_ERROR_ID,
> +	DRM_RAS_A_ERROR_COUNTER_ATTRS_ERROR_NAME,
> +	DRM_RAS_A_ERROR_COUNTER_ATTRS_ERROR_VALUE,
> +
> +	__DRM_RAS_A_ERROR_COUNTER_ATTRS_MAX,
> +	DRM_RAS_A_ERROR_COUNTER_ATTRS_MAX = (__DRM_RAS_A_ERROR_COUNTER_ATTRS_MAX - 1)
> +};
> +
> +enum {
> +	DRM_RAS_CMD_LIST_NODES = 1,
> +	DRM_RAS_CMD_GET_ERROR_COUNTERS,
> +	DRM_RAS_CMD_QUERY_ERROR_COUNTER,
> +
> +	__DRM_RAS_CMD_MAX,
> +	DRM_RAS_CMD_MAX = (__DRM_RAS_CMD_MAX - 1)
> +};
> +
> +#endif /* _UAPI_LINUX_DRM_RAS_H */



* Re: [PATCH v4 4/4] drm/xe/xe_hw_error: Add support for PVC SOC errors
  2026-01-19  4:00 ` [PATCH v4 4/4] drm/xe/xe_hw_error: Add support for PVC SOC errors Riana Tauro
@ 2026-01-23 10:33   ` Raag Jadav
  2026-01-27  9:43     ` Riana Tauro
  0 siblings, 1 reply; 22+ messages in thread
From: Raag Jadav @ 2026-01-23 10:33 UTC (permalink / raw)
  To: Riana Tauro
  Cc: intel-xe, dri-devel, aravind.iddamsetty, anshuman.gupta,
	rodrigo.vivi, joonas.lahtinen, simona.vetter, airlied,
	pratik.bari, joshua.santosh.ranjan, ashwin.kumar.kulkarni,
	shubham.kumar, ravi.kishore.koppuravuri, Himal Prasad Ghimiray

On Mon, Jan 19, 2026 at 09:30:26AM +0530, Riana Tauro wrote:
> Report the SOC nonfatal/fatal hardware error and update the counters.
> 
> Co-developed-by: Himal Prasad Ghimiray <himal.prasad.ghimiray@intel.com>
> Signed-off-by: Himal Prasad Ghimiray <himal.prasad.ghimiray@intel.com>
> Signed-off-by: Riana Tauro <riana.tauro@intel.com>
> ---
> v2: Add ID's and names as uAPI (Rodrigo)
> 
> v3: reorder and align arrays
>     remove redundant string err
>     use REG_BIT
>     fix aesthetic review comments (Raag)
>     use only correctable/uncorrectable error severity (Aravind)
> ---
>  drivers/gpu/drm/xe/regs/xe_hw_error_regs.h |  24 +++
>  drivers/gpu/drm/xe/xe_hw_error.c           | 200 ++++++++++++++++++++-
>  2 files changed, 223 insertions(+), 1 deletion(-)
> 
> diff --git a/drivers/gpu/drm/xe/regs/xe_hw_error_regs.h b/drivers/gpu/drm/xe/regs/xe_hw_error_regs.h
> index 5eeb0be27300..b9e072f9e56c 100644
> --- a/drivers/gpu/drm/xe/regs/xe_hw_error_regs.h
> +++ b/drivers/gpu/drm/xe/regs/xe_hw_error_regs.h
> @@ -41,6 +41,7 @@
>  								  DEV_ERR_STAT_NONFATAL))
>  
>  #define   XE_CSC_ERROR				17
> +#define   XE_SOC_ERROR				16
>  #define   XE_GT_ERROR				0
>  
>  #define ERR_STAT_GT_FATAL_VECTOR_0		0x100260
> @@ -62,4 +63,27 @@
>  						ERR_STAT_GT_COR_VECTOR_REG(x) : \
>  						ERR_STAT_GT_FATAL_VECTOR_REG(x))
>  
> +#define SOC_PVC_MASTER_BASE			0x282000
> +#define SOC_PVC_SLAVE_BASE			0x283000
> +
> +#define SOC_GCOERRSTS				0x200
> +#define SOC_GNFERRSTS				0x210
> +#define SOC_GLOBAL_ERR_STAT_REG(base, x)	XE_REG(_PICK_EVEN((x), \
> +								  (base) + SOC_GCOERRSTS, \
> +								  (base) + SOC_GNFERRSTS))
> +#define   SOC_SLAVE_IEH				REG_BIT(1)
> +#define   SOC_IEH0_LOCAL_ERR_STATUS		REG_BIT(0)
> +#define   SOC_IEH1_LOCAL_ERR_STATUS		REG_BIT(0)
> +
> +#define SOC_GSYSEVTCTL				0x264
> +#define SOC_GSYSEVTCTL_REG(base, slave_base, x)	XE_REG(_PICK_EVEN((x), \

Can we add 'master' for consistency? This gets me confused with
other macros where 'base' can mean either one.

> +								  (base) + SOC_GSYSEVTCTL, \
> +								  (slave_base) + SOC_GSYSEVTCTL))
> +
> +#define SOC_LERRUNCSTS				0x280
> +#define SOC_LERRCORSTS				0x294
> +#define SOC_LOCAL_ERR_STAT_REG(base, hw_err)	XE_REG(hw_err == HARDWARE_ERROR_CORRECTABLE ? \
> +						      (base) + SOC_LERRCORSTS : \
> +						      (base) + SOC_LERRUNCSTS)

Nit: Perhaps an additional whitespace is needed? ;)

>  #endif
> diff --git a/drivers/gpu/drm/xe/xe_hw_error.c b/drivers/gpu/drm/xe/xe_hw_error.c
> index bd0cf61741ca..d1c30bb199d3 100644
> --- a/drivers/gpu/drm/xe/xe_hw_error.c
> +++ b/drivers/gpu/drm/xe/xe_hw_error.c
> @@ -19,6 +19,7 @@
>  #define  GT_HW_ERROR_MAX_ERR_BITS	16
>  #define  HEC_UNCORR_FW_ERR_BITS 	4
>  #define  XE_RAS_REG_SIZE		32
> +#define  XE_SOC_NUM_IEH 		2
>  
>  extern struct fault_attr inject_csc_hw_error;
>  static const char * const error_severity[] = DRM_XE_RAS_ERROR_SEVERITY_NAMES;
> @@ -31,7 +32,8 @@ static const char * const hec_uncorrected_fw_errors[] = {
>  };
>  
>  static const unsigned long xe_hw_error_map[] = {
> -	[XE_GT_ERROR] = DRM_XE_RAS_ERROR_CLASS_GT,
> +	[XE_GT_ERROR]	= DRM_XE_RAS_ERROR_CLASS_GT,
> +	[XE_SOC_ERROR]	= DRM_XE_RAS_ERROR_CLASS_SOC,
>  };
>  
>  enum gt_vector_regs {
> @@ -54,6 +56,92 @@ static enum drm_xe_ras_error_severity hw_err_to_severity(enum hardware_error hw_
>  	return DRM_XE_RAS_ERROR_SEVERITY_UNCORRECTABLE;
>  }
>  
> +static const char * const pvc_master_global_err_reg[] = {
> +	[0 ... 1]	= "Undefined",
> +	[2]		= "HBM SS0: Channel0",
> +	[3]		= "HBM SS0: Channel1",
> +	[4]		= "HBM SS0: Channel2",
> +	[5]		= "HBM SS0: Channel3",
> +	[6]		= "HBM SS0: Channel4",
> +	[7]		= "HBM SS0: Channel5",
> +	[8]		= "HBM SS0: Channel6",
> +	[9]		= "HBM SS0: Channel7",
> +	[10]		= "HBM SS1: Channel0",
> +	[11]		= "HBM SS1: Channel1",
> +	[12]		= "HBM SS1: Channel2",
> +	[13]		= "HBM SS1: Channel3",
> +	[14]		= "HBM SS1: Channel4",
> +	[15]		= "HBM SS1: Channel5",
> +	[16]		= "HBM SS1: Channel6",
> +	[17]		= "HBM SS1: Channel7",
> +	[18 ... 31]	= "Undefined",
> +};

I'd add static_assert() against register size here.
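
Roughly along these lines, as a minimal standalone sketch rather than the
actual patch: example_err_reg is a hypothetical table, ARRAY_SIZE() is
redefined locally so the snippet builds outside the kernel, and
XE_RAS_REG_SIZE is assumed to be the 32 defined in xe_hw_error.c.

```c
#include <assert.h>

#define XE_RAS_REG_SIZE 32
#define ARRAY_SIZE(a) (sizeof(a) / sizeof((a)[0]))

/* Hypothetical name table: one entry per bit of a 32-bit register. */
static const char * const example_err_reg[] = {
	[0]  = "Undefined",
	[2]  = "HBM SS0: Channel0",
	[31] = "Undefined",
};

/* Build fails if the table no longer covers exactly 32 bits. */
static_assert(ARRAY_SIZE(example_err_reg) == XE_RAS_REG_SIZE,
	      "error name table must match the register width");
```

The point of the check is that log_soc_error() indexes the table with a
bit number up to 31, so a short table would be read out of bounds.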

> +static const char * const pvc_slave_global_err_reg[] = {
> +	[0]		= "Undefined",
> +	[1]		= "HBM SS2: Channel0",
> +	[2]		= "HBM SS2: Channel1",
> +	[3]		= "HBM SS2: Channel2",
> +	[4]		= "HBM SS2: Channel3",
> +	[5]		= "HBM SS2: Channel4",
> +	[6]		= "HBM SS2: Channel5",
> +	[7]		= "HBM SS2: Channel6",
> +	[8]		= "HBM SS2: Channel7",
> +	[9]		= "HBM SS3: Channel0",
> +	[10]		= "HBM SS3: Channel1",
> +	[11]		= "HBM SS3: Channel2",
> +	[12]		= "HBM SS3: Channel3",
> +	[13]		= "HBM SS3: Channel4",
> +	[14]		= "HBM SS3: Channel5",
> +	[15]		= "HBM SS3: Channel6",
> +	[16]		= "HBM SS3: Channel7",
> +	[17]		= "Undefined",
> +	[18]		= "ANR MDFI",
> +	[19 ... 31]	= "Undefined",
> +};

Ditto.

> +static const char * const pvc_slave_local_fatal_err_reg[] = {
> +	[0]		= "Local IEH: Malformed PCIe AER",
> +	[1]		= "Local IEH: Malformed PCIe ERR",
> +	[2]		= "Local IEH: UR conditions in IEH",
> +	[3]		= "Local IEH: From SERR Sources",
> +	[4 ... 19]	= "Undefined",
> +	[20]		= "Malformed MCA error packet (HBM/Punit)",
> +	[21 ... 31]	= "Undefined",
> +};

Ditto.

> +static const char * const pvc_master_local_fatal_err_reg[] = {
> +	[0]		= "Local IEH: Malformed IOSF PCIe AER",
> +	[1]		= "Local IEH: Malformed IOSF PCIe ERR",
> +	[2]		= "Local IEH: UR RESPONSE",
> +	[3]		= "Local IEH: From SERR SPI controller",
> +	[4]		= "Base Die MDFI T2T",
> +	[5]		= "Undefined",
> +	[6]		= "Base Die MDFI T2C",
> +	[7]		= "Undefined",
> +	[8]		= "Invalid CSC PSF Command Parity",
> +	[9]		= "Invalid CSC PSF Unexpected Completion",
> +	[10]		= "Invalid CSC PSF Unsupported Request",
> +	[11]		= "Invalid PCIe PSF Command Parity",
> +	[12]		= "PCIe PSF Unexpected Completion",
> +	[13]		= "PCIe PSF Unsupported Request",
> +	[14 ... 19]	= "Undefined",
> +	[20]		= "Malformed MCA error packet (HBM/Punit)",
> +	[21 ... 31]	= "Undefined",
> +};

Ditto.

> +static const char * const pvc_master_local_nonfatal_err_reg[] = {
> +	[0 ... 3]	= "Undefined",
> +	[4]		= "Base Die MDFI T2T",
> +	[5]		= "Undefined",
> +	[6]		= "Base Die MDFI T2C",
> +	[7]		= "Undefined",
> +	[8]		= "Invalid CSC PSF Command Parity",
> +	[9]		= "Invalid CSC PSF Unexpected Completion",
> +	[10]		= "Invalid PCIe PSF Command Parity",
> +	[11 ... 31]	= "Undefined",
> +};

Ditto.

>  static bool fault_inject_csc_hw_error(void)
>  {
>  	return IS_ENABLED(CONFIG_DEBUG_FS) && should_fail(&inject_csc_hw_error, 1);
> @@ -132,6 +220,26 @@ static void log_gt_err(struct xe_tile *tile, const char *name, int i, u32 err,
>  			 name, severity_str, i, err);
>  }
>  
> +static void log_soc_error(struct xe_tile *tile, const char * const *reg_info,
> +			  const enum drm_xe_ras_error_severity severity, u32 err_bit, u32 index)
> +{
> +	const char *severity_str = error_severity[severity];
> +	struct xe_device *xe = tile_to_xe(tile);
> +	struct xe_drm_ras *ras = &xe->ras;
> +	struct xe_drm_ras_counter *info = ras->info[severity];
> +	const char *name;
> +
> +	name = reg_info[err_bit];
> +
> +	if (strcmp(name, "Undefined")) {
> +		if (severity == DRM_XE_RAS_ERROR_SEVERITY_UNCORRECTABLE)

Same comment as last patch.

> +			drm_err_ratelimited(&xe->drm, "%s SOC %s detected", name, severity_str);
> +		else
> +			drm_warn(&xe->drm, "%s SOC %s detected", name, severity_str);
> +		atomic64_inc(&info[index].counter);
> +	}
> +}
> +
>  static void gt_hw_error_handler(struct xe_tile *tile, const enum hardware_error hw_err,
>  				u32 error_id)
>  {
> @@ -210,6 +318,93 @@ static void gt_hw_error_handler(struct xe_tile *tile, const enum hardware_error
>  	}
>  }
>  
> +static void soc_hw_error_handler(struct xe_tile *tile, const enum hardware_error hw_err,
> +				 u32 error_id)
> +{
> +	const enum drm_xe_ras_error_severity severity = hw_err_to_severity(hw_err);
> +	struct xe_device *xe = tile_to_xe(tile);
> +	struct xe_mmio *mmio = &tile->mmio;
> +	unsigned long master_global_errstat, slave_global_errstat;
> +	unsigned long master_local_errstat, slave_local_errstat;
> +	u32 base, slave_base, regbit;
> +	int i;
> +
> +	if (xe->info.platform != XE_PVC)
> +		return;
> +
> +	base = SOC_PVC_MASTER_BASE;

'master'?

> +	slave_base = SOC_PVC_SLAVE_BASE;
> +
> +	/* Mask error type in GSYSEVTCTL so that no new errors of the type will be reported */
> +	for (i = 0; i < XE_SOC_NUM_IEH; i++)
> +		xe_mmio_write32(mmio, SOC_GSYSEVTCTL_REG(base, slave_base, i), ~REG_BIT(hw_err));
> +
> +	if (hw_err == HARDWARE_ERROR_CORRECTABLE) {
> +		xe_mmio_write32(mmio, SOC_GLOBAL_ERR_STAT_REG(base, hw_err), REG_GENMASK(31, 0));
> +		xe_mmio_write32(mmio, SOC_LOCAL_ERR_STAT_REG(base, hw_err), REG_GENMASK(31, 0));
> +		xe_mmio_write32(mmio, SOC_GLOBAL_ERR_STAT_REG(slave_base, hw_err),
> +				REG_GENMASK(31, 0));
> +		xe_mmio_write32(mmio, SOC_LOCAL_ERR_STAT_REG(slave_base, hw_err),
> +				REG_GENMASK(31, 0));
> +		goto unmask_gsysevtctl;
> +	}
> +
> +	/*
> +	 * Read the master global IEH error register if BIT 1 is set then process

BIT(1)

> +	 * the slave IEH first. If BIT 0 in global error register is set then process

BIT(0)

> +	 * the corresponding local error registers

Punctuation, please!

> +	 */
> +	master_global_errstat = xe_mmio_read32(mmio, SOC_GLOBAL_ERR_STAT_REG(base, hw_err));
> +	if (master_global_errstat & SOC_SLAVE_IEH) {
> +		slave_global_errstat = xe_mmio_read32(mmio,
> +						      SOC_GLOBAL_ERR_STAT_REG(slave_base, hw_err));
> +		if (slave_global_errstat & SOC_IEH1_LOCAL_ERR_STATUS) {
> +			slave_local_errstat = xe_mmio_read32(mmio,
> +							     SOC_LOCAL_ERR_STAT_REG(slave_base,
> +										    hw_err));

With long names usually comes the ugly wrapping :(
So let's either try to shorten some of them here or split the condition
into another function for readability.

> +			if (hw_err == HARDWARE_ERROR_FATAL) {

So we don't log for other severities?

> +				for_each_set_bit(regbit, &slave_local_errstat, XE_RAS_REG_SIZE)
> +					log_soc_error(tile, pvc_slave_local_fatal_err_reg,
> +						      severity, regbit, error_id);
> +			}
> +
> +			xe_mmio_write32(mmio, SOC_LOCAL_ERR_STAT_REG(slave_base, hw_err),
> +					slave_local_errstat);
> +		}
> +
> +		for_each_set_bit(regbit, &slave_global_errstat, XE_RAS_REG_SIZE)
> +			log_soc_error(tile, pvc_slave_global_err_reg, severity, regbit, error_id);
> +
> +		xe_mmio_write32(mmio, SOC_GLOBAL_ERR_STAT_REG(slave_base, hw_err),
> +				slave_global_errstat);
> +	}
> +
> +	if (master_global_errstat & SOC_IEH0_LOCAL_ERR_STATUS) {

Ditto for split.

Raag

> +		master_local_errstat = xe_mmio_read32(mmio, SOC_LOCAL_ERR_STAT_REG(base, hw_err));
> +
> +		for_each_set_bit(regbit, &master_local_errstat, XE_RAS_REG_SIZE) {
> +			const char * const *reg_info = (hw_err == HARDWARE_ERROR_FATAL) ?
> +						       pvc_master_local_fatal_err_reg :
> +						       pvc_master_local_nonfatal_err_reg;
> +
> +			log_soc_error(tile, reg_info, severity, regbit, error_id);
> +		}
> +
> +		xe_mmio_write32(mmio, SOC_LOCAL_ERR_STAT_REG(base, hw_err), master_local_errstat);
> +	}
> +
> +	for_each_set_bit(regbit, &master_global_errstat, XE_RAS_REG_SIZE)
> +		log_soc_error(tile, pvc_master_global_err_reg, severity, regbit, error_id);
> +
> +	xe_mmio_write32(mmio, SOC_GLOBAL_ERR_STAT_REG(base, hw_err), master_global_errstat);
> +
> +unmask_gsysevtctl:
> +	for (i = 0; i < XE_SOC_NUM_IEH; i++)
> +		xe_mmio_write32(mmio, SOC_GSYSEVTCTL_REG(base, slave_base, i),
> +				(DRM_XE_RAS_ERROR_SEVERITY_MAX << 1) + 1);
> +}
> +
>  static void hw_error_source_handler(struct xe_tile *tile, const enum hardware_error hw_err)
>  {
>  	const enum drm_xe_ras_error_severity severity = hw_err_to_severity(hw_err);
> @@ -263,8 +458,11 @@ static void hw_error_source_handler(struct xe_tile *tile, const enum hardware_er
>  				 "TILE%d reported %s %s, bit[%d] is set\n",
>  				 tile->id, name, severity_str, err_bit);
>  		}
> +
>  		if (err_bit == XE_GT_ERROR)
>  			gt_hw_error_handler(tile, hw_err, error_id);
> +		if (err_bit == XE_SOC_ERROR)
> +			soc_hw_error_handler(tile, hw_err, error_id);
>  	}
>  
>  clear_reg:
> -- 
> 2.47.1
> 


* Re: [PATCH v4 3/4] drm/xe/xe_hw_error: Add support for GT hardware errors
  2026-01-21  7:09   ` Raag Jadav
@ 2026-01-27  8:29     ` Riana Tauro
  2026-01-27 10:12       ` Raag Jadav
  0 siblings, 1 reply; 22+ messages in thread
From: Riana Tauro @ 2026-01-27  8:29 UTC (permalink / raw)
  To: Raag Jadav
  Cc: intel-xe, dri-devel, aravind.iddamsetty, anshuman.gupta,
	rodrigo.vivi, joonas.lahtinen, simona.vetter, airlied,
	pratik.bari, joshua.santosh.ranjan, ashwin.kumar.kulkarni,
	shubham.kumar, ravi.kishore.koppuravuri, Himal Prasad Ghimiray

Hi Raag

On 1/21/2026 12:39 PM, Raag Jadav wrote:
> On Mon, Jan 19, 2026 at 09:30:25AM +0530, Riana Tauro wrote:
>> PVC supports GT error reporting via vector registers along with
>> error status register. Add support to report these errors and
>> update respective counters. In case of Subslice error reported
>> by vector register, process the error status register
>> for applicable bits.
>>
>> Incorporate the counter inside the driver itself and start
>> using the drm_ras generic netlink to report them.
>>
>> Co-developed-by: Himal Prasad Ghimiray <himal.prasad.ghimiray@intel.com>
>> Signed-off-by: Himal Prasad Ghimiray <himal.prasad.ghimiray@intel.com>
>> Signed-off-by: Riana Tauro <riana.tauro@intel.com>
>> ---
>> v2: Add ID's and names as uAPI (Rodrigo)
>>
>> v3: use REG_BIT
>>      do not use _ffs
>>      use a single function for GT errors
>>      remove redundant errors from logs (Raag)
>>      use only correctable/uncorrectable error severity (Pratik/Aravind)
>> ---
>>   drivers/gpu/drm/xe/regs/xe_hw_error_regs.h |  53 +++++-
>>   drivers/gpu/drm/xe/xe_hw_error.c           | 182 +++++++++++++++++++--
>>   2 files changed, 220 insertions(+), 15 deletions(-)
>>
>> diff --git a/drivers/gpu/drm/xe/regs/xe_hw_error_regs.h b/drivers/gpu/drm/xe/regs/xe_hw_error_regs.h
>> index c146b9ef44eb..5eeb0be27300 100644
>> --- a/drivers/gpu/drm/xe/regs/xe_hw_error_regs.h
>> +++ b/drivers/gpu/drm/xe/regs/xe_hw_error_regs.h
>> @@ -6,15 +6,60 @@
>>   #ifndef _XE_HW_ERROR_REGS_H_
>>   #define _XE_HW_ERROR_REGS_H_
>>   
>> -#define HEC_UNCORR_ERR_STATUS(base)                    XE_REG((base) + 0x118)
>> -#define    UNCORR_FW_REPORTED_ERR                      BIT(6)
>> +#define HEC_UNCORR_ERR_STATUS(base)		XE_REG((base) + 0x118)
>> +#define   UNCORR_FW_REPORTED_ERR		REG_BIT(6)
>>   
>> -#define HEC_UNCORR_FW_ERR_DW0(base)                    XE_REG((base) + 0x124)
>> +#define HEC_UNCORR_FW_ERR_DW0(base)		XE_REG((base) + 0x124)
>> +
>> +#define ERR_STAT_GT_COR				0x100160
>> +#define   EU_GRF_COR_ERR			REG_BIT(15)
>> +#define   EU_IC_COR_ERR				REG_BIT(14)
>> +#define   SLM_COR_ERR				REG_BIT(13)
>> +#define   GUC_COR_ERR				REG_BIT(1)
>> +
>> +#define ERR_STAT_GT_NONFATAL			0x100164
>> +#define ERR_STAT_GT_FATAL			0x100168
>> +#define   EU_GRF_FAT_ERR			REG_BIT(15)
>> +#define   SLM_FAT_ERR				REG_BIT(13)
>> +#define   GUC_FAT_ERR				REG_BIT(6)
>> +#define   FPU_FAT_ERR				REG_BIT(3)
>> +
>> +#define ERR_STAT_GT_REG(x)			XE_REG(_PICK_EVEN((x), \
>> +								  ERR_STAT_GT_COR, \
>> +								  ERR_STAT_GT_NONFATAL))
> 
> Shouldn't this be FATAL?

No, it is correct.

#define _PICK_EVEN(__index, __a, __b) ((__a) + (__index) * ((__b) - (__a)))

index=0 val=0x100160
index=1 val=0x100164
index=2 val=0x100168
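
The arithmetic above can be checked with a small standalone sketch. It
assumes the _PICK_EVEN definition quoted here and the ERR_STAT_GT_* offsets
from the patch; err_stat_gt_offset() is a hypothetical helper that drops
the XE_REG() wrapper so it builds outside the kernel.

```c
#include <stdint.h>

/* Same formula as the _PICK_EVEN definition quoted above. */
#define _PICK_EVEN(__index, __a, __b) ((__a) + (__index) * ((__b) - (__a)))

#define ERR_STAT_GT_COR      0x100160
#define ERR_STAT_GT_NONFATAL 0x100164
#define ERR_STAT_GT_FATAL    0x100168

/* Hypothetical helper: raw register offset for a given index. */
static uint32_t err_stat_gt_offset(unsigned int index)
{
	return _PICK_EVEN(index, ERR_STAT_GT_COR, ERR_STAT_GT_NONFATAL);
}
```

Since the three registers are evenly spaced by 4 bytes, passing the first
two is enough for index 2 to land on ERR_STAT_GT_FATAL.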

> 
>> +#define PVC_COR_ERR_MASK			(GUC_COR_ERR | SLM_COR_ERR | EU_IC_COR_ERR | \
>> +						 EU_GRF_COR_ERR)
>> +
>> +#define PVC_FAT_ERR_MASK			(FPU_FAT_ERR | GUC_FAT_ERR | EU_GRF_FAT_ERR | \
>> +						 SLM_FAT_ERR)
>>   
>>   #define DEV_ERR_STAT_NONFATAL			0x100178
>>   #define DEV_ERR_STAT_CORRECTABLE		0x10017c
>>   #define DEV_ERR_STAT_REG(x)			XE_REG(_PICK_EVEN((x), \
>>   								  DEV_ERR_STAT_CORRECTABLE, \
>>   								  DEV_ERR_STAT_NONFATAL))
>> -#define   XE_CSC_ERROR				BIT(17)
>> +
>> +#define   XE_CSC_ERROR				17
>> +#define   XE_GT_ERROR				0
>> +
>> +#define ERR_STAT_GT_FATAL_VECTOR_0		0x100260
>> +#define ERR_STAT_GT_FATAL_VECTOR_1		0x100264
>> +
>> +#define ERR_STAT_GT_FATAL_VECTOR_REG(x)		XE_REG(_PICK_EVEN((x), \
>> +								  ERR_STAT_GT_FATAL_VECTOR_0, \
>> +								  ERR_STAT_GT_FATAL_VECTOR_1))
>> +
>> +#define ERR_STAT_GT_COR_VECTOR_0		0x1002a0
>> +#define ERR_STAT_GT_COR_VECTOR_1		0x1002a4
>> +
>> +#define ERR_STAT_GT_COR_VECTOR_REG(x)		XE_REG(_PICK_EVEN((x), \
>> +								  ERR_STAT_GT_COR_VECTOR_0, \
>> +								  ERR_STAT_GT_COR_VECTOR_1))
>> +#define ERR_STAT_GT_COR_VECTOR_LEN		4
> 
> Now this makes me question about FATAL_VECTOR_LEN, perhaps we should add
> it? Since we already have enums for it, I'm wondering if we should reuse
> them here instead of having separate raw values?

Hmm, let me check.

> 
>> +#define ERR_STAT_GT_VECTOR_REG(hw_err, x)	(hw_err == HARDWARE_ERROR_CORRECTABLE ? \
>> +						ERR_STAT_GT_COR_VECTOR_REG(x) : \
>> +						ERR_STAT_GT_FATAL_VECTOR_REG(x))
>> +
>>   #endif
>> diff --git a/drivers/gpu/drm/xe/xe_hw_error.c b/drivers/gpu/drm/xe/xe_hw_error.c
>> index b42495d3015a..bd0cf61741ca 100644
>> --- a/drivers/gpu/drm/xe/xe_hw_error.c
>> +++ b/drivers/gpu/drm/xe/xe_hw_error.c
>> @@ -3,6 +3,7 @@
>>    * Copyright © 2025 Intel Corporation
>>    */
>>   
>> +#include <linux/bitmap.h>
>>   #include <linux/fault-inject.h>
>>   
>>   #include "regs/xe_gsc_regs.h"
>> @@ -15,7 +16,10 @@
>>   #include "xe_mmio.h"
>>   #include "xe_survivability_mode.h"
>>   
>> -#define  HEC_UNCORR_FW_ERR_BITS 4
>> +#define  GT_HW_ERROR_MAX_ERR_BITS	16
>> +#define  HEC_UNCORR_FW_ERR_BITS 	4
>> +#define  XE_RAS_REG_SIZE		32
> 
> This looks like it can be BITS_PER_TYPE(). Also, why do we need a separate
> macro?

The reason I kept a separate macro is that for_each_set_bit() requires an
unsigned long, but the register size is 32 bits.
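
A minimal userspace sketch of the idea, not the kernel code itself: the
value read from a 32-bit register is held in an unsigned long, and the scan
is bounded by the register width rather than by the width of the variable.
count_reg_bits() is a hypothetical helper, and the plain loop only stands
in for for_each_set_bit().

```c
#define XE_RAS_REG_SIZE 32

/* Count set bits, looking only at the low XE_RAS_REG_SIZE bits. */
static unsigned int count_reg_bits(unsigned long val)
{
	unsigned int bit, n = 0;

	for (bit = 0; bit < XE_RAS_REG_SIZE; bit++)
		if (val & (1UL << bit))
			n++;

	return n;
}
```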


> 
>>   extern struct fault_attr inject_csc_hw_error;
>>   static const char * const error_severity[] = DRM_XE_RAS_ERROR_SEVERITY_NAMES;
>>   
>> @@ -26,10 +30,21 @@ static const char * const hec_uncorrected_fw_errors[] = {
>>   	"Data Corruption"
>>   };
>>   
>> -static bool fault_inject_csc_hw_error(void)
>> -{
>> -	return IS_ENABLED(CONFIG_DEBUG_FS) && should_fail(&inject_csc_hw_error, 1);
>> -}
>> +static const unsigned long xe_hw_error_map[] = {
>> +	[XE_GT_ERROR] = DRM_XE_RAS_ERROR_CLASS_GT,
>> +};
>> +
>> +enum gt_vector_regs {
>> +	ERR_STAT_GT_VECTOR0 = 0,
>> +	ERR_STAT_GT_VECTOR1,
>> +	ERR_STAT_GT_VECTOR2,
>> +	ERR_STAT_GT_VECTOR3,
>> +	ERR_STAT_GT_VECTOR4,
>> +	ERR_STAT_GT_VECTOR5,
>> +	ERR_STAT_GT_VECTOR6,
>> +	ERR_STAT_GT_VECTOR7,
>> +	ERR_STAT_GT_VECTOR_MAX,
> 
> This is guaranteed last member, so redundant comma.

Will fix.

> 
>> +};
>>   
>>   static enum drm_xe_ras_error_severity hw_err_to_severity(enum hardware_error hw_err)
>>   {
>> @@ -39,6 +54,11 @@ static enum drm_xe_ras_error_severity hw_err_to_severity(enum hardware_error hw_
>>   	return DRM_XE_RAS_ERROR_SEVERITY_UNCORRECTABLE;
>>   }
>>   
>> +static bool fault_inject_csc_hw_error(void)
>> +{
>> +	return IS_ENABLED(CONFIG_DEBUG_FS) && should_fail(&inject_csc_hw_error, 1);
>> +}
>> +
>>   static void csc_hw_error_work(struct work_struct *work)
>>   {
>>   	struct xe_tile *tile = container_of(work, typeof(*tile), csc_hw_error_work);
>> @@ -86,15 +106,121 @@ static void csc_hw_error_handler(struct xe_tile *tile, const enum hardware_error
>>   	xe_mmio_write32(mmio, HEC_UNCORR_ERR_STATUS(base), err_src);
>>   }
>>   
>> +static void log_hw_error(struct xe_tile *tile, const char *name,
>> +			 const enum drm_xe_ras_error_severity severity)
>> +{
>> +	const char *severity_str = error_severity[severity];
>> +	struct xe_device *xe = tile_to_xe(tile);
>> +
>> +	if (severity == DRM_XE_RAS_ERROR_SEVERITY_UNCORRECTABLE)
> 
> If we have FATAL case in the future, should we come back refactoring this?
> Perhaps the reverse logic would be a bit more future proof.


There will be only two severity levels, correctable and uncorrectable, and
that is confirmed for the Xe KMD.

Sure, I can reverse it.

> 
>> +		drm_err_ratelimited(&xe->drm, "%s %s detected\n", name, severity_str);
>> +	else
>> +		drm_warn(&xe->drm, "%s %s detected\n", name, severity_str);
>> +}
>> +
>> +static void log_gt_err(struct xe_tile *tile, const char *name, int i, u32 err,
>> +		       const enum drm_xe_ras_error_severity severity)
>> +{
>> +	const char *severity_str = error_severity[severity];
>> +	struct xe_device *xe = tile_to_xe(tile);
>> +
>> +	if (severity == DRM_XE_RAS_ERROR_SEVERITY_UNCORRECTABLE)
> 
> Ditto.
> 
>> +		drm_err_ratelimited(&xe->drm, "%s %s detected, ERROR_STAT_GT_VECTOR%d:0x%08x\n",
>> +				    name, severity_str, i, err);
>> +	else
>> +		drm_warn(&xe->drm, "%s %s detected, ERROR_STAT_GT_VECTOR%d:0x%08x\n",
>> +			 name, severity_str, i, err);
>> +}
>> +
>> +static void gt_hw_error_handler(struct xe_tile *tile, const enum hardware_error hw_err,
>> +				u32 error_id)
>> +{
>> +	const enum drm_xe_ras_error_severity severity = hw_err_to_severity(hw_err);
>> +	struct xe_device *xe = tile_to_xe(tile);
>> +	struct xe_drm_ras *ras = &xe->ras;
>> +	struct xe_drm_ras_counter *info = ras->info[severity];
>> +	struct xe_mmio *mmio = &tile->mmio;
>> +	unsigned long err_stat = 0;
>> +	int i, len;
>> +
>> +	if (xe->info.platform != XE_PVC)
>> +		return;
>> +
>> +	if (hw_err == HARDWARE_ERROR_NONFATAL) {
>> +		atomic64_inc(&info[error_id].counter);
>> +		log_hw_error(tile, info[error_id].name, severity);
>> +		return;
>> +	}
>> +
>> +	len = (hw_err == HARDWARE_ERROR_CORRECTABLE) ? ERR_STAT_GT_COR_VECTOR_LEN
>> +						     : ERR_STAT_GT_VECTOR_MAX;
>> +
>> +	for (i = 0; i < len; i++) {
>> +		u32 vector, val;
>> +
>> +		vector = xe_mmio_read32(mmio, ERR_STAT_GT_VECTOR_REG(hw_err, i));
>> +		if (!vector)
>> +			continue;
>> +
>> +		switch (i) {
>> +		case ERR_STAT_GT_VECTOR0:
>> +		case ERR_STAT_GT_VECTOR1:
>> +			u32 errbit;
> 
> With this I think you'll need braces to make the compiler happy, so either
> add them or move this to the top.
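
For reference, a standalone sketch of the braces variant suggested above.
count_low_bits() is a hypothetical stand-in for the switch in
gt_hw_error_handler(), and the 16-bit bound mirrors
GT_HW_ERROR_MAX_ERR_BITS. Before C23, a declaration cannot directly follow
a case label, so the braces open a block in which it is valid.

```c
/* Hypothetical stand-in for the switch in gt_hw_error_handler(). */
static unsigned int count_low_bits(int vector_idx, unsigned int vector)
{
	unsigned int n = 0;

	switch (vector_idx) {
	case 0:
	case 1: {
		/* Declared inside a block, so this is valid before C23. */
		unsigned int errbit;

		for (errbit = 0; errbit < 16; errbit++)
			if (vector & (1u << errbit))
				n++;
		break;
	}
	default:
		break;
	}

	return n;
}
```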
> 
>> +			val = hweight32(vector);
>> +			atomic64_add(val, &info[error_id].counter);
>> +			log_gt_err(tile, "Subslice", i, vector, severity);
>> +
>> +			/* Read Error Status Register once */
> 
> Why? Can you please elaborate?

The register will be populated only once. Even though there are multiple 
vectors reported, the causes for the subslice error will be read and 
cleared once.

Will add it in a comment.

> 
>> +			if (err_stat)
>> +				break;
>> +
>> +			err_stat = xe_mmio_read32(mmio, ERR_STAT_GT_REG(hw_err));
>> +			for_each_set_bit(errbit, &err_stat, GT_HW_ERROR_MAX_ERR_BITS) {
>> +				if (hw_err == HARDWARE_ERROR_CORRECTABLE &&
>> +				    (BIT(errbit) & PVC_COR_ERR_MASK))
> 
> I'm wondering if this can be a (hw_err ? x) macro for this? Perhaps it'll
> help remove the duplication.

It is used only once, but I will check.

> 
>> +					atomic64_inc(&info[error_id].counter);
>> +				if (hw_err == HARDWARE_ERROR_FATAL &&
>> +				    (BIT(errbit) & PVC_FAT_ERR_MASK))
>> +					atomic64_inc(&info[error_id].counter);
>> +			}
>> +			if (err_stat)
>> +				xe_mmio_write32(mmio, ERR_STAT_GT_REG(hw_err), err_stat);
>> +			break;
>> +		case ERR_STAT_GT_VECTOR2:
>> +		case ERR_STAT_GT_VECTOR3:
>> +			val = hweight32(vector);
>> +			atomic64_add(val, &info[error_id].counter);
>> +			log_gt_err(tile, "L3 BANK", i, vector, severity);
>> +			break;
>> +		case ERR_STAT_GT_VECTOR6:
>> +			val = hweight32(vector);
>> +			atomic64_add(val, &info[error_id].counter);
>> +			log_gt_err(tile, "TLB", i, vector, severity);
>> +			break;
>> +		case ERR_STAT_GT_VECTOR7:
>> +			val = hweight32(vector);
>> +			atomic64_add(val, &info[error_id].counter);
>> +			break;
>> +		default:
>> +			log_gt_err(tile, "Undefined", i, vector, severity);
>> +		}
>> +
>> +		xe_mmio_write32(mmio, ERR_STAT_GT_VECTOR_REG(hw_err, i), vector);
>> +	}
>> +}
>> +
>>   static void hw_error_source_handler(struct xe_tile *tile, const enum hardware_error hw_err)
>>   {
>>   	const enum drm_xe_ras_error_severity severity = hw_err_to_severity(hw_err);
>>   	const char *severity_str = error_severity[severity];
>>   	struct xe_device *xe = tile_to_xe(tile);
>> -	unsigned long flags;
>> -	u32 err_src;
>> +	struct xe_drm_ras *ras = &xe->ras;
>> +	struct xe_drm_ras_counter *info = ras->info[severity];
>> +	unsigned long flags, err_src;
>> +	u32 err_bit;
>>   
>> -	if (xe->info.platform != XE_BATTLEMAGE)
>> +	if (!IS_DGFX(xe))
>>   		return;
>>   
>>   	spin_lock_irqsave(&xe->irq.lock, flags);
> 
> I'm wondering if we really need this? We're already inside irq handler so
> what are we protecting here?

This is not related to the series. I will have to check.
> 
>> @@ -105,11 +231,44 @@ static void hw_error_source_handler(struct xe_tile *tile, const enum hardware_er
>>   		goto unlock;
>>   	}
>>   
>> -	if (err_src & XE_CSC_ERROR)
>> +	/*
>> +	 * On encountering CSC firmware errors, the graphics device is non-recoverable.
> 
> ... "so bail immediately."

The code is quite intuitive, but I will add it for additional clarity.

> 
>> +	 * The only way to recover from these errors is firmware flash. The device will
>> +	 * enter Runtime Survivability mode when such errors are detected.
>> +	 */
>> +	if (err_src & XE_CSC_ERROR) {
>>   		csc_hw_error_handler(tile, hw_err);
>> +		goto clear_reg;
>> +	}
>>   
>> -	xe_mmio_write32(&tile->mmio, DEV_ERR_STAT_REG(hw_err), err_src);
>> +	if (!info) {
>> +		drm_err_ratelimited(&xe->drm, HW_ERR "Errors undefined\n");
>> +		goto clear_reg;
>> +	}
>> +
>> +	for_each_set_bit(err_bit, &err_src, XE_RAS_REG_SIZE) {
>> +		u32 error_id = xe_hw_error_map[err_bit];
> 
> Does this need bounds checking against ARRAY_SIZE()?
> 
>> +		const char *name;
>> +
>> +		name = info[error_id].name;
>> +		if (!name)
>> +			goto clear_reg;
> 
> Shouldn't we at least give the next id a try?

Yeah, makes sense. Will add it.

Thanks
Riana
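
The two review points on this loop (bounds checking the map lookup, and trying the next bit instead of bailing out) can be sketched together in userspace. This is only an illustration under stated assumptions: the table, the bit positions and `count_known_errors()` are hypothetical stand-ins for `xe_hw_error_map` and the counter names.

```c
#include <assert.h>
#include <stddef.h>

#define ARRAY_SIZE(a)	(sizeof(a) / sizeof((a)[0]))

/* Hypothetical stand-ins for the error bit positions used in the series. */
enum { GT_ERROR_BIT = 0, SOC_ERROR_BIT = 16 };

/* Sparse table: slots without an initializer are NULL. */
static const char * const error_names[] = {
	[GT_ERROR_BIT]	= "GT Error",
	[SOC_ERROR_BIT]	= "SoC Error",
};

static_assert(ARRAY_SIZE(error_names) <= 32,
	      "table must not exceed the 32-bit source register");

/* Count the set bits in err_src that map to a named error, skipping the rest. */
static unsigned int count_known_errors(unsigned long err_src)
{
	unsigned int known = 0;
	size_t bit;

	for (bit = 0; bit < 32; bit++) {
		if (!(err_src & (1UL << bit)))
			continue;
		/* Bits beyond the table would otherwise index out of bounds. */
		if (bit >= ARRAY_SIZE(error_names))
			continue;
		/* An unnamed entry skips this bit only; later bits are still tried. */
		if (!error_names[bit])
			continue;
		known++;
	}
	return known;
}
```

In this sketch an unnamed bit 3 does not prevent bit 16 from being handled, which is the behaviour asked for in the review.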

> 
>> +		if (severity == DRM_XE_RAS_ERROR_SEVERITY_UNCORRECTABLE) {
> 
> Ditto for logging per severity.
> 
> Raag
> 
>> +			drm_err_ratelimited(&xe->drm, HW_ERR
>> +					    "TILE%d reported %s %s, bit[%d] is set\n",
>> +					    tile->id, name, severity_str, err_bit);
>> +		} else {
>> +			drm_warn(&xe->drm, HW_ERR
>> +				 "TILE%d reported %s %s, bit[%d] is set\n",
>> +				 tile->id, name, severity_str, err_bit);
>> +		}
>> +		if (err_bit == XE_GT_ERROR)
>> +			gt_hw_error_handler(tile, hw_err, error_id);
>> +	}
>> +
>> +clear_reg:
>> +	xe_mmio_write32(&tile->mmio, DEV_ERR_STAT_REG(hw_err), err_src);
>>   unlock:
>>   	spin_unlock_irqrestore(&xe->irq.lock, flags);
>>   }
>> @@ -131,9 +290,10 @@ void xe_hw_error_irq_handler(struct xe_tile *tile, const u32 master_ctl)
>>   	if (fault_inject_csc_hw_error())
>>   		schedule_work(&tile->csc_hw_error_work);
>>   
>> -	for (hw_err = 0; hw_err < HARDWARE_ERROR_MAX; hw_err++)
>> +	for (hw_err = 0; hw_err < HARDWARE_ERROR_MAX; hw_err++) {
>>   		if (master_ctl & ERROR_IRQ(hw_err))
>>   			hw_error_source_handler(tile, hw_err);
>> +	}
>>   }
>>   
>>   static int hw_error_info_init(struct xe_device *xe)
>> -- 
>> 2.47.1
>>



* Re: [PATCH v4 4/4] drm/xe/xe_hw_error: Add support for PVC SOC errors
  2026-01-23 10:33   ` Raag Jadav
@ 2026-01-27  9:43     ` Riana Tauro
  0 siblings, 0 replies; 22+ messages in thread
From: Riana Tauro @ 2026-01-27  9:43 UTC (permalink / raw)
  To: Raag Jadav
  Cc: intel-xe, dri-devel, aravind.iddamsetty, anshuman.gupta,
	rodrigo.vivi, joonas.lahtinen, simona.vetter, airlied,
	pratik.bari, joshua.santosh.ranjan, ashwin.kumar.kulkarni,
	shubham.kumar, ravi.kishore.koppuravuri, Himal Prasad Ghimiray



On 1/23/2026 4:03 PM, Raag Jadav wrote:
> On Mon, Jan 19, 2026 at 09:30:26AM +0530, Riana Tauro wrote:
>> Report the SOC nonfatal/fatal hardware error and update the counters.
>>
>> Co-developed-by: Himal Prasad Ghimiray <himal.prasad.ghimiray@intel.com>
>> Signed-off-by: Himal Prasad Ghimiray <himal.prasad.ghimiray@intel.com>
>> Signed-off-by: Riana Tauro <riana.tauro@intel.com>
>> ---
>> v2: Add ID's and names as uAPI (Rodrigo)
>>
>> v3: reorder and align arrays
>>      remove redundant string err
>>      use REG_BIT
>>      fix aesthetic review comments (Raag)
>>      use only correctable/uncorrectable error severity (Aravind)
>> ---
>>   drivers/gpu/drm/xe/regs/xe_hw_error_regs.h |  24 +++
>>   drivers/gpu/drm/xe/xe_hw_error.c           | 200 ++++++++++++++++++++-
>>   2 files changed, 223 insertions(+), 1 deletion(-)
>>
>> diff --git a/drivers/gpu/drm/xe/regs/xe_hw_error_regs.h b/drivers/gpu/drm/xe/regs/xe_hw_error_regs.h
>> index 5eeb0be27300..b9e072f9e56c 100644
>> --- a/drivers/gpu/drm/xe/regs/xe_hw_error_regs.h
>> +++ b/drivers/gpu/drm/xe/regs/xe_hw_error_regs.h
>> @@ -41,6 +41,7 @@
>>   								  DEV_ERR_STAT_NONFATAL))
>>   
>>   #define   XE_CSC_ERROR				17
>> +#define   XE_SOC_ERROR				16
>>   #define   XE_GT_ERROR				0
>>   
>>   #define ERR_STAT_GT_FATAL_VECTOR_0		0x100260
>> @@ -62,4 +63,27 @@
>>   						ERR_STAT_GT_COR_VECTOR_REG(x) : \
>>   						ERR_STAT_GT_FATAL_VECTOR_REG(x))
>>   
>> +#define SOC_PVC_MASTER_BASE			0x282000
>> +#define SOC_PVC_SLAVE_BASE			0x283000
>> +
>> +#define SOC_GCOERRSTS				0x200
>> +#define SOC_GNFERRSTS				0x210
>> +#define SOC_GLOBAL_ERR_STAT_REG(base, x)	XE_REG(_PICK_EVEN((x), \
>> +								  (base) + SOC_GCOERRSTS, \
>> +								  (base) + SOC_GNFERRSTS))
>> +#define   SOC_SLAVE_IEH				REG_BIT(1)
>> +#define   SOC_IEH0_LOCAL_ERR_STATUS		REG_BIT(0)
>> +#define   SOC_IEH1_LOCAL_ERR_STATUS		REG_BIT(0)
>> +
>> +#define SOC_GSYSEVTCTL				0x264
>> +#define SOC_GSYSEVTCTL_REG(base, slave_base, x)	XE_REG(_PICK_EVEN((x), \
> 
> Can we add 'master' for consistency? This gets me confused with
> other macros where 'base' can mean either one.

sure..

> 
>> +								  (base) + SOC_GSYSEVTCTL, \
>> +								  (slave_base) + SOC_GSYSEVTCTL))
>> +
>> +#define SOC_LERRUNCSTS				0x280
>> +#define SOC_LERRCORSTS				0x294
>> +#define SOC_LOCAL_ERR_STAT_REG(base, hw_err)	XE_REG(hw_err == HARDWARE_ERROR_CORRECTABLE ? \
>> +						      (base) + SOC_LERRCORSTS : \
>> +						      (base) + SOC_LERRUNCSTS)
> 
> Nit: Perhaps an additional whitespace is needed? ;)

Ahh, yeah, it looks run together. If I add a space only there, it will
break the indentation.

I will add an extra tab across the entire file.

> 
>>   #endif
>> diff --git a/drivers/gpu/drm/xe/xe_hw_error.c b/drivers/gpu/drm/xe/xe_hw_error.c
>> index bd0cf61741ca..d1c30bb199d3 100644
>> --- a/drivers/gpu/drm/xe/xe_hw_error.c
>> +++ b/drivers/gpu/drm/xe/xe_hw_error.c
>> @@ -19,6 +19,7 @@
>>   #define  GT_HW_ERROR_MAX_ERR_BITS	16
>>   #define  HEC_UNCORR_FW_ERR_BITS 	4
>>   #define  XE_RAS_REG_SIZE		32
>> +#define  XE_SOC_NUM_IEH 		2
>>   
>>   extern struct fault_attr inject_csc_hw_error;
>>   static const char * const error_severity[] = DRM_XE_RAS_ERROR_SEVERITY_NAMES;
>> @@ -31,7 +32,8 @@ static const char * const hec_uncorrected_fw_errors[] = {
>>   };
>>   
>>   static const unsigned long xe_hw_error_map[] = {
>> -	[XE_GT_ERROR] = DRM_XE_RAS_ERROR_CLASS_GT,
>> +	[XE_GT_ERROR]	= DRM_XE_RAS_ERROR_CLASS_GT,
>> +	[XE_SOC_ERROR]	= DRM_XE_RAS_ERROR_CLASS_SOC,
>>   };
>>   
>>   enum gt_vector_regs {
>> @@ -54,6 +56,92 @@ static enum drm_xe_ras_error_severity hw_err_to_severity(enum hardware_error hw_
>>   	return DRM_XE_RAS_ERROR_SEVERITY_UNCORRECTABLE;
>>   }
>>   
>> +static const char * const pvc_master_global_err_reg[] = {
>> +	[0 ... 1]	= "Undefined",
>> +	[2]		= "HBM SS0: Channel0",
>> +	[3]		= "HBM SS0: Channel1",
>> +	[4]		= "HBM SS0: Channel2",
>> +	[5]		= "HBM SS0: Channel3",
>> +	[6]		= "HBM SS0: Channel4",
>> +	[7]		= "HBM SS0: Channel5",
>> +	[8]		= "HBM SS0: Channel6",
>> +	[9]		= "HBM SS0: Channel7",
>> +	[10]		= "HBM SS1: Channel0",
>> +	[11]		= "HBM SS1: Channel1",
>> +	[12]		= "HBM SS1: Channel2",
>> +	[13]		= "HBM SS1: Channel3",
>> +	[14]		= "HBM SS1: Channel4",
>> +	[15]		= "HBM SS1: Channel5",
>> +	[16]		= "HBM SS1: Channel6",
>> +	[17]		= "HBM SS1: Channel7",
>> +	[18 ... 31]	= "Undefined",
>> +};
> 
> I'd add static_assert() against register size here.

These are register values anyway, so I don't think they will grow beyond
32 bits. But sure, I can add a static_assert.
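
For reference, a build-time size check on a designated-initializer table could look roughly like the sketch below. It is a userspace illustration only: `RAS_REG_BITS`, the shortened table and `soc_err_name()` are hypothetical, and C11 `static_assert` from `<assert.h>` is assumed to be available.

```c
#include <assert.h>
#include <stddef.h>

#define ARRAY_SIZE(a)	(sizeof(a) / sizeof((a)[0]))
#define RAS_REG_BITS	32	/* assumed width of the error status register */

/* Hypothetical, shortened name table; the highest index fixes the array size. */
static const char * const soc_err_names[] = {
	[2]	= "HBM SS0: Channel0",
	[3]	= "HBM SS0: Channel1",
	[31]	= "Undefined",
};

/* Fails the build if the table ever stops matching the register width. */
static_assert(ARRAY_SIZE(soc_err_names) == RAS_REG_BITS,
	      "name table must have one slot per register bit");

/* Hypothetical lookup: NULL for bits outside the register or without a name. */
static const char *soc_err_name(size_t bit)
{
	return bit < ARRAY_SIZE(soc_err_names) ? soc_err_names[bit] : NULL;
}
```

The assertion costs nothing at runtime and catches an entry added at index 32 or a table whose last slot was dropped.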


> 
>> +static const char * const pvc_slave_global_err_reg[] = {
>> +	[0]		= "Undefined",
>> +	[1]		= "HBM SS2: Channel0",
>> +	[2]		= "HBM SS2: Channel1",
>> +	[3]		= "HBM SS2: Channel2",
>> +	[4]		= "HBM SS2: Channel3",
>> +	[5]		= "HBM SS2: Channel4",
>> +	[6]		= "HBM SS2: Channel5",
>> +	[7]		= "HBM SS2: Channel6",
>> +	[8]		= "HBM SS2: Channel7",
>> +	[9]		= "HBM SS3: Channel0",
>> +	[10]		= "HBM SS3: Channel1",
>> +	[11]		= "HBM SS3: Channel2",
>> +	[12]		= "HBM SS3: Channel3",
>> +	[13]		= "HBM SS3: Channel4",
>> +	[14]		= "HBM SS3: Channel5",
>> +	[15]		= "HBM SS3: Channel6",
>> +	[16]		= "HBM SS3: Channel7",
>> +	[17]		= "Undefined",
>> +	[18]		= "ANR MDFI",
>> +	[19 ... 31]	= "Undefined",
>> +};
> 
> Ditto.
> 
>> +static const char * const pvc_slave_local_fatal_err_reg[] = {
>> +	[0]		= "Local IEH: Malformed PCIe AER",
>> +	[1]		= "Local IEH: Malformed PCIe ERR",
>> +	[2]		= "Local IEH: UR conditions in IEH",
>> +	[3]		= "Local IEH: From SERR Sources",
>> +	[4 ... 19]	= "Undefined",
>> +	[20]		= "Malformed MCA error packet (HBM/Punit)",
>> +	[21 ... 31]	= "Undefined",
>> +};
> 
> Ditto.
> 
>> +static const char * const pvc_master_local_fatal_err_reg[] = {
>> +	[0]		= "Local IEH: Malformed IOSF PCIe AER",
>> +	[1]		= "Local IEH: Malformed IOSF PCIe ERR",
>> +	[2]		= "Local IEH: UR RESPONSE",
>> +	[3]		= "Local IEH: From SERR SPI controller",
>> +	[4]		= "Base Die MDFI T2T",
>> +	[5]		= "Undefined",
>> +	[6]		= "Base Die MDFI T2C",
>> +	[7]		= "Undefined",
>> +	[8]		= "Invalid CSC PSF Command Parity",
>> +	[9]		= "Invalid CSC PSF Unexpected Completion",
>> +	[10]		= "Invalid CSC PSF Unsupported Request",
>> +	[11]		= "Invalid PCIe PSF Command Parity",
>> +	[12]		= "PCIe PSF Unexpected Completion",
>> +	[13]		= "PCIe PSF Unsupported Request",
>> +	[14 ... 19]	= "Undefined",
>> +	[20]		= "Malformed MCA error packet (HBM/Punit)",
>> +	[21 ... 31]	= "Undefined",
>> +};
> 
> Ditto.
> 
>> +static const char * const pvc_master_local_nonfatal_err_reg[] = {
>> +	[0 ... 3]	= "Undefined",
>> +	[4]		= "Base Die MDFI T2T",
>> +	[5]		= "Undefined",
>> +	[6]		= "Base Die MDFI T2C",
>> +	[7]		= "Undefined",
>> +	[8]		= "Invalid CSC PSF Command Parity",
>> +	[9]		= "Invalid CSC PSF Unexpected Completion",
>> +	[10]		= "Invalid PCIe PSF Command Parity",
>> +	[11 ... 31]	= "Undefined",
>> +};
> 
> Ditto.
> 
>>   static bool fault_inject_csc_hw_error(void)
>>   {
>>   	return IS_ENABLED(CONFIG_DEBUG_FS) && should_fail(&inject_csc_hw_error, 1);
>> @@ -132,6 +220,26 @@ static void log_gt_err(struct xe_tile *tile, const char *name, int i, u32 err,
>>   			 name, severity_str, i, err);
>>   }
>>   
>> +static void log_soc_error(struct xe_tile *tile, const char * const *reg_info,
>> +			  const enum drm_xe_ras_error_severity severity, u32 err_bit, u32 index)
>> +{
>> +	const char *severity_str = error_severity[severity];
>> +	struct xe_device *xe = tile_to_xe(tile);
>> +	struct xe_drm_ras *ras = &xe->ras;
>> +	struct xe_drm_ras_counter *info = ras->info[severity];
>> +	const char *name;
>> +
>> +	name = reg_info[err_bit];
>> +
>> +	if (strcmp(name, "Undefined")) {
>> +		if (severity == DRM_XE_RAS_ERROR_SEVERITY_UNCORRECTABLE)
> 
> Same comment as last patch.

Will reverse

> 
>> +			drm_err_ratelimited(&xe->drm, "%s SOC %s detected", name, severity_str);
>> +		else
>> +			drm_warn(&xe->drm, "%s SOC %s detected", name, severity_str);
>> +		atomic64_inc(&info[index].counter);
>> +	}
>> +}
>> +
>>   static void gt_hw_error_handler(struct xe_tile *tile, const enum hardware_error hw_err,
>>   				u32 error_id)
>>   {
>> @@ -210,6 +318,93 @@ static void gt_hw_error_handler(struct xe_tile *tile, const enum hardware_error
>>   	}
>>   }
>>   
>> +static void soc_hw_error_handler(struct xe_tile *tile, const enum hardware_error hw_err,
>> +				 u32 error_id)
>> +{
>> +	const enum drm_xe_ras_error_severity severity = hw_err_to_severity(hw_err);
>> +	struct xe_device *xe = tile_to_xe(tile);
>> +	struct xe_mmio *mmio = &tile->mmio;
>> +	unsigned long master_global_errstat, slave_global_errstat;
>> +	unsigned long master_local_errstat, slave_local_errstat;
>> +	u32 base, slave_base, regbit;
>> +	int i;
>> +
>> +	if (xe->info.platform != XE_PVC)
>> +		return;
>> +
>> +	base = SOC_PVC_MASTER_BASE;
> 
> 'master'?

sure.

> 
>> +	slave_base = SOC_PVC_SLAVE_BASE;
>> +
>> +	/* Mask error type in GSYSEVTCTL so that no new errors of the type will be reported */
>> +	for (i = 0; i < XE_SOC_NUM_IEH; i++)
>> +		xe_mmio_write32(mmio, SOC_GSYSEVTCTL_REG(base, slave_base, i), ~REG_BIT(hw_err));
>> +
>> +	if (hw_err == HARDWARE_ERROR_CORRECTABLE) {
>> +		xe_mmio_write32(mmio, SOC_GLOBAL_ERR_STAT_REG(base, hw_err), REG_GENMASK(31, 0));
>> +		xe_mmio_write32(mmio, SOC_LOCAL_ERR_STAT_REG(base, hw_err), REG_GENMASK(31, 0));
>> +		xe_mmio_write32(mmio, SOC_GLOBAL_ERR_STAT_REG(slave_base, hw_err),
>> +				REG_GENMASK(31, 0));
>> +		xe_mmio_write32(mmio, SOC_LOCAL_ERR_STAT_REG(slave_base, hw_err),
>> +				REG_GENMASK(31, 0));
>> +		goto unmask_gsysevtctl;
>> +	}
>> +
>> +	/*
>> +	 * Read the master global IEH error register if BIT 1 is set then process
> 
> BIT(1)
> 
>> +	 * the slave IEH first. If BIT 0 in global error register is set then process
> 
> BIT(0)
> 
>> +	 * the corresponding local error registers
> 
> Punctuations please!
> 
>> +	 */
>> +	master_global_errstat = xe_mmio_read32(mmio, SOC_GLOBAL_ERR_STAT_REG(base, hw_err));
>> +	if (master_global_errstat & SOC_SLAVE_IEH) {
>> +		slave_global_errstat = xe_mmio_read32(mmio,
>> +						      SOC_GLOBAL_ERR_STAT_REG(slave_base, hw_err));
>> +		if (slave_global_errstat & SOC_IEH1_LOCAL_ERR_STATUS) {
>> +			slave_local_errstat = xe_mmio_read32(mmio,
>> +							     SOC_LOCAL_ERR_STAT_REG(slave_base,
>> +										    hw_err));
> 
> With long names usually comes the ugly wrapping :(
> So let's either try to shorten some of them here or split the condition
> into another function for readability.
> 
>> +			if (hw_err == HARDWARE_ERROR_FATAL) {
> 
> So we don't log for other severities?

The requirement covers only fatal errors. There are no errors defined for non-fatal.

> 
>> +				for_each_set_bit(regbit, &slave_local_errstat, XE_RAS_REG_SIZE)
>> +					log_soc_error(tile, pvc_slave_local_fatal_err_reg,
>> +						      severity, regbit, error_id);
>> +			}
>> +
>> +			xe_mmio_write32(mmio, SOC_LOCAL_ERR_STAT_REG(slave_base, hw_err),
>> +					slave_local_errstat);
>> +		}
>> +
>> +		for_each_set_bit(regbit, &slave_global_errstat, XE_RAS_REG_SIZE)
>> +			log_soc_error(tile, pvc_slave_global_err_reg, severity, regbit, error_id);
>> +
>> +		xe_mmio_write32(mmio, SOC_GLOBAL_ERR_STAT_REG(slave_base, hw_err),
>> +				slave_global_errstat);
>> +	}
>> +
>> +	if (master_global_errstat & SOC_IEH0_LOCAL_ERR_STATUS) {
> 
> Ditto for split.

This should be fine. I will split the slave handling into a function and
retain the master handling here.

Riana

> 
> Raag
> 
>> +		master_local_errstat = xe_mmio_read32(mmio, SOC_LOCAL_ERR_STAT_REG(base, hw_err));
>> +
>> +		for_each_set_bit(regbit, &master_local_errstat, XE_RAS_REG_SIZE) {
>> +			const char * const *reg_info = (hw_err == HARDWARE_ERROR_FATAL) ?
>> +						       pvc_master_local_fatal_err_reg :
>> +						       pvc_master_local_nonfatal_err_reg;
>> +
>> +			log_soc_error(tile, reg_info, severity, regbit, error_id);
>> +		}
>> +
>> +		xe_mmio_write32(mmio, SOC_LOCAL_ERR_STAT_REG(base, hw_err), master_local_errstat);
>> +	}
>> +
>> +	for_each_set_bit(regbit, &master_global_errstat, XE_RAS_REG_SIZE)
>> +		log_soc_error(tile, pvc_master_global_err_reg, severity, regbit, error_id);
>> +
>> +	xe_mmio_write32(mmio, SOC_GLOBAL_ERR_STAT_REG(base, hw_err), master_global_errstat);
>> +
>> +unmask_gsysevtctl:
>> +	for (i = 0; i < XE_SOC_NUM_IEH; i++)
>> +		xe_mmio_write32(mmio, SOC_GSYSEVTCTL_REG(base, slave_base, i),
>> +				(DRM_XE_RAS_ERROR_SEVERITY_MAX << 1) + 1);
>> +}
>> +
>>   static void hw_error_source_handler(struct xe_tile *tile, const enum hardware_error hw_err)
>>   {
>>   	const enum drm_xe_ras_error_severity severity = hw_err_to_severity(hw_err);
>> @@ -263,8 +458,11 @@ static void hw_error_source_handler(struct xe_tile *tile, const enum hardware_er
>>   				 "TILE%d reported %s %s, bit[%d] is set\n",
>>   				 tile->id, name, severity_str, err_bit);
>>   		}
>> +
>>   		if (err_bit == XE_GT_ERROR)
>>   			gt_hw_error_handler(tile, hw_err, error_id);
>> +		if (err_bit == XE_SOC_ERROR)
>> +			soc_hw_error_handler(tile, hw_err, error_id);
>>   	}
>>   
>>   clear_reg:
>> -- 
>> 2.47.1
>>



* Re: [PATCH v4 3/4] drm/xe/xe_hw_error: Add support for GT hardware errors
  2026-01-27  8:29     ` Riana Tauro
@ 2026-01-27 10:12       ` Raag Jadav
  0 siblings, 0 replies; 22+ messages in thread
From: Raag Jadav @ 2026-01-27 10:12 UTC (permalink / raw)
  To: Riana Tauro
  Cc: intel-xe, dri-devel, aravind.iddamsetty, anshuman.gupta,
	rodrigo.vivi, joonas.lahtinen, simona.vetter, airlied,
	pratik.bari, joshua.santosh.ranjan, ashwin.kumar.kulkarni,
	shubham.kumar, ravi.kishore.koppuravuri, Himal Prasad Ghimiray

On Tue, Jan 27, 2026 at 01:59:12PM +0530, Riana Tauro wrote:
> On 1/21/2026 12:39 PM, Raag Jadav wrote:
> > On Mon, Jan 19, 2026 at 09:30:25AM +0530, Riana Tauro wrote:
> > > PVC supports GT error reporting via vector registers along with
> > > error status register. Add support to report these errors and
> > > update respective counters. In case of a Subslice error reported
> > > by vector register, process the error status register
> > > for applicable bits.
> > > 
> > > Incorporate the counter inside the driver itself and start
> > > using the drm_ras generic netlink to report them.
> > > 
> > > Co-developed-by: Himal Prasad Ghimiray <himal.prasad.ghimiray@intel.com>
> > > Signed-off-by: Himal Prasad Ghimiray <himal.prasad.ghimiray@intel.com>
> > > Signed-off-by: Riana Tauro <riana.tauro@intel.com>
> > > ---
> > > v2: Add ID's and names as uAPI (Rodrigo)
> > > 
> > > v3: use REG_BIT
> > >      do not use _ffs
> > >      use a single function for GT errors
> > >      remove redundant errors from logs (Raag)
> > >      use only correctable/uncorrectable error severity (Pratik/Aravind)
> > > ---
> > >   drivers/gpu/drm/xe/regs/xe_hw_error_regs.h |  53 +++++-
> > >   drivers/gpu/drm/xe/xe_hw_error.c           | 182 +++++++++++++++++++--
> > >   2 files changed, 220 insertions(+), 15 deletions(-)
> > > 
> > > diff --git a/drivers/gpu/drm/xe/regs/xe_hw_error_regs.h b/drivers/gpu/drm/xe/regs/xe_hw_error_regs.h
> > > index c146b9ef44eb..5eeb0be27300 100644
> > > --- a/drivers/gpu/drm/xe/regs/xe_hw_error_regs.h
> > > +++ b/drivers/gpu/drm/xe/regs/xe_hw_error_regs.h
> > > @@ -6,15 +6,60 @@
> > >   #ifndef _XE_HW_ERROR_REGS_H_
> > >   #define _XE_HW_ERROR_REGS_H_
> > > -#define HEC_UNCORR_ERR_STATUS(base)                    XE_REG((base) + 0x118)
> > > -#define    UNCORR_FW_REPORTED_ERR                      BIT(6)
> > > +#define HEC_UNCORR_ERR_STATUS(base)		XE_REG((base) + 0x118)
> > > +#define   UNCORR_FW_REPORTED_ERR		REG_BIT(6)
> > > -#define HEC_UNCORR_FW_ERR_DW0(base)                    XE_REG((base) + 0x124)
> > > +#define HEC_UNCORR_FW_ERR_DW0(base)		XE_REG((base) + 0x124)
> > > +
> > > +#define ERR_STAT_GT_COR				0x100160
> > > +#define   EU_GRF_COR_ERR			REG_BIT(15)
> > > +#define   EU_IC_COR_ERR				REG_BIT(14)
> > > +#define   SLM_COR_ERR				REG_BIT(13)
> > > +#define   GUC_COR_ERR				REG_BIT(1)
> > > +
> > > +#define ERR_STAT_GT_NONFATAL			0x100164
> > > +#define ERR_STAT_GT_FATAL			0x100168
> > > +#define   EU_GRF_FAT_ERR			REG_BIT(15)
> > > +#define   SLM_FAT_ERR				REG_BIT(13)
> > > +#define   GUC_FAT_ERR				REG_BIT(6)
> > > +#define   FPU_FAT_ERR				REG_BIT(3)
> > > +
> > > +#define ERR_STAT_GT_REG(x)			XE_REG(_PICK_EVEN((x), \
> > > +								  ERR_STAT_GT_COR, \
> > > +								  ERR_STAT_GT_NONFATAL))
> > 
> > Shouldn't this be FATAL?
> 
> No it is correct
> 
> #define _PICK_EVEN(__index, __a, __b) ((__a) + (__index) * ((__b) - (__a)))
> 
> index=0	val=0x100160
> index=1 val=0x100164
> index=2 val=0x100168

Yep, I confused this with our lack of NONFATAL_VECTOR, or perhaps
I missed the coffee last time around :D
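
The arithmetic above can be double-checked in userspace. In this minimal sketch the `_PICK_EVEN` definition and the three offsets are copied from the thread; `err_stat_gt_offset()` is a hypothetical helper, and the index order (0 = correctable, 1 = nonfatal, 2 = fatal) is assumed from the values quoted above.

```c
#include <assert.h>
#include <stdint.h>

/* Copied from the thread: evenly spaced registers derived from two anchors. */
#define _PICK_EVEN(__index, __a, __b) ((__a) + (__index) * ((__b) - (__a)))

#define ERR_STAT_GT_COR		0x100160
#define ERR_STAT_GT_NONFATAL	0x100164
#define ERR_STAT_GT_FATAL	0x100168

/* Hypothetical helper: offset of the GT error status register for an index. */
static uint32_t err_stat_gt_offset(unsigned int index)
{
	assert(index <= 2);
	return _PICK_EVEN(index, ERR_STAT_GT_COR, ERR_STAT_GT_NONFATAL);
}
```

Because the three registers are spaced 4 bytes apart, the first two anchors are enough to reach the FATAL offset at index 2.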

> > > +#define PVC_COR_ERR_MASK			(GUC_COR_ERR | SLM_COR_ERR | EU_IC_COR_ERR | \
> > > +						 EU_GRF_COR_ERR)
> > > +
> > > +#define PVC_FAT_ERR_MASK			(FPU_FAT_ERR | GUC_FAT_ERR | EU_GRF_FAT_ERR | \
> > > +						 SLM_FAT_ERR)
> > >   #define DEV_ERR_STAT_NONFATAL			0x100178
> > >   #define DEV_ERR_STAT_CORRECTABLE		0x10017c
> > >   #define DEV_ERR_STAT_REG(x)			XE_REG(_PICK_EVEN((x), \
> > >   								  DEV_ERR_STAT_CORRECTABLE, \
> > >   								  DEV_ERR_STAT_NONFATAL))
> > > -#define   XE_CSC_ERROR				BIT(17)
> > > +
> > > +#define   XE_CSC_ERROR				17
> > > +#define   XE_GT_ERROR				0
> > > +
> > > +#define ERR_STAT_GT_FATAL_VECTOR_0		0x100260
> > > +#define ERR_STAT_GT_FATAL_VECTOR_1		0x100264
> > > +
> > > +#define ERR_STAT_GT_FATAL_VECTOR_REG(x)		XE_REG(_PICK_EVEN((x), \
> > > +								  ERR_STAT_GT_FATAL_VECTOR_0, \
> > > +								  ERR_STAT_GT_FATAL_VECTOR_1))
> > > +
> > > +#define ERR_STAT_GT_COR_VECTOR_0		0x1002a0
> > > +#define ERR_STAT_GT_COR_VECTOR_1		0x1002a4
> > > +
> > > +#define ERR_STAT_GT_COR_VECTOR_REG(x)		XE_REG(_PICK_EVEN((x), \
> > > +								  ERR_STAT_GT_COR_VECTOR_0, \
> > > +								  ERR_STAT_GT_COR_VECTOR_1))
> > > +#define ERR_STAT_GT_COR_VECTOR_LEN		4
> > 
> > Now this makes me question about FATAL_VECTOR_LEN, perhaps we should add
> > it? Since we already have enums for it, I'm wondering if we should reuse
> > them here instead of having separate raw values?
> 
> Hmm let me check.

Also, they look like vector indexes and are unrelated to register defs.
So perhaps move them to xe_hw_error.c where the enums are? Perhaps it'll
make the enum reuse easier.

> > > +#define ERR_STAT_GT_VECTOR_REG(hw_err, x)	(hw_err == HARDWARE_ERROR_CORRECTABLE ? \
> > > +						ERR_STAT_GT_COR_VECTOR_REG(x) : \
> > > +						ERR_STAT_GT_FATAL_VECTOR_REG(x))
> > > +
> > >   #endif
> > > diff --git a/drivers/gpu/drm/xe/xe_hw_error.c b/drivers/gpu/drm/xe/xe_hw_error.c
> > > index b42495d3015a..bd0cf61741ca 100644
> > > --- a/drivers/gpu/drm/xe/xe_hw_error.c
> > > +++ b/drivers/gpu/drm/xe/xe_hw_error.c
> > > @@ -3,6 +3,7 @@
> > >    * Copyright © 2025 Intel Corporation
> > >    */
> > > +#include <linux/bitmap.h>
> > >   #include <linux/fault-inject.h>
> > >   #include "regs/xe_gsc_regs.h"
> > > @@ -15,7 +16,10 @@
> > >   #include "xe_mmio.h"
> > >   #include "xe_survivability_mode.h"
> > > -#define  HEC_UNCORR_FW_ERR_BITS 4
> > > +#define  GT_HW_ERROR_MAX_ERR_BITS	16
> > > +#define  HEC_UNCORR_FW_ERR_BITS 	4
> > > +#define  XE_RAS_REG_SIZE		32
> > 
> > This looks like it can be BITS_PER_TYPE(). Also, why do we need a separate
> > macro?
> 
> The reason I kept a separate macro is that for_each_set_bit requires an
> unsigned long, but the register size is 32.

Would something like this work?

	for_each_set_bit(err_bit, &err_src, BITS_PER_TYPE(err_src)) {
		...
	}
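
A userspace sketch of the idea follows, with one thing worth double-checking: since err_src is declared unsigned long in the patch, BITS_PER_TYPE(err_src) would be the width of unsigned long (64 on LP64 targets) rather than the 32-bit register width. The upper bits of a zero-extended 32-bit read are never set, so the bits visited should be the same either way. `BITS_PER_TYPE` below is a stand-in for the kernel macro and `count_set_bits()` is hypothetical.

```c
#include <assert.h>
#include <limits.h>
#include <stddef.h>
#include <stdint.h>

/* Userspace stand-in for the kernel macro of the same name. */
#define BITS_PER_TYPE(type)	(sizeof(type) * CHAR_BIT)

/* Hypothetical helper: count the set bits among the lowest nbits of word. */
static unsigned int count_set_bits(unsigned long word, size_t nbits)
{
	unsigned int count = 0;
	size_t bit;

	assert(nbits <= BITS_PER_TYPE(word));
	for (bit = 0; bit < nbits; bit++)
		if (word & (1UL << bit))
			count++;
	return count;
}
```

Passing BITS_PER_TYPE(uint32_t) bounds the walk to the register width, while BITS_PER_TYPE(unsigned long) walks the whole variable.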

> > >   extern struct fault_attr inject_csc_hw_error;
> > >   static const char * const error_severity[] = DRM_XE_RAS_ERROR_SEVERITY_NAMES;
> > > @@ -26,10 +30,21 @@ static const char * const hec_uncorrected_fw_errors[] = {
> > >   	"Data Corruption"
> > >   };
> > > -static bool fault_inject_csc_hw_error(void)
> > > -{
> > > -	return IS_ENABLED(CONFIG_DEBUG_FS) && should_fail(&inject_csc_hw_error, 1);
> > > -}
> > > +static const unsigned long xe_hw_error_map[] = {
> > > +	[XE_GT_ERROR] = DRM_XE_RAS_ERROR_CLASS_GT,
> > > +};
> > > +
> > > +enum gt_vector_regs {
> > > +	ERR_STAT_GT_VECTOR0 = 0,
> > > +	ERR_STAT_GT_VECTOR1,
> > > +	ERR_STAT_GT_VECTOR2,
> > > +	ERR_STAT_GT_VECTOR3,
> > > +	ERR_STAT_GT_VECTOR4,
> > > +	ERR_STAT_GT_VECTOR5,
> > > +	ERR_STAT_GT_VECTOR6,
> > > +	ERR_STAT_GT_VECTOR7,
> > > +	ERR_STAT_GT_VECTOR_MAX,
> > 
> > This is guaranteed last member, so redundant comma.
> 
> will fix
> 
> > 
> > > +};
> > >   static enum drm_xe_ras_error_severity hw_err_to_severity(enum hardware_error hw_err)
> > >   {
> > > @@ -39,6 +54,11 @@ static enum drm_xe_ras_error_severity hw_err_to_severity(enum hardware_error hw_
> > >   	return DRM_XE_RAS_ERROR_SEVERITY_UNCORRECTABLE;
> > >   }
> > > +static bool fault_inject_csc_hw_error(void)
> > > +{
> > > +	return IS_ENABLED(CONFIG_DEBUG_FS) && should_fail(&inject_csc_hw_error, 1);
> > > +}
> > > +
> > >   static void csc_hw_error_work(struct work_struct *work)
> > >   {
> > >   	struct xe_tile *tile = container_of(work, typeof(*tile), csc_hw_error_work);
> > > @@ -86,15 +106,121 @@ static void csc_hw_error_handler(struct xe_tile *tile, const enum hardware_error
> > >   	xe_mmio_write32(mmio, HEC_UNCORR_ERR_STATUS(base), err_src);
> > >   }
> > > +static void log_hw_error(struct xe_tile *tile, const char *name,
> > > +			 const enum drm_xe_ras_error_severity severity)
> > > +{
> > > +	const char *severity_str = error_severity[severity];
> > > +	struct xe_device *xe = tile_to_xe(tile);
> > > +
> > > +	if (severity == DRM_XE_RAS_ERROR_SEVERITY_UNCORRECTABLE)
> > 
> > If we have FATAL case in the future, should we come back refactoring this?
> > Perhaps the reverse logic would be a bit more future proof.
> 
> 
> There will be only two severity levels correctable and uncorrectable and
> that is confirmed for XE KMD

For now, yes.

> sure i can reverse it.

I had something like this in mind

	if (severity == CORRECTABLE)
		drm_warn();
	else
		drm_err();

so we don't have to come back and refactor for a FATAL case in the future,
but up to you.
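
Spelled out as a compilable userspace sketch, the reversed check might look like this. The enums and helper names are hypothetical; `SEV_FATAL` exists only to show what happens if a third severity were ever added.

```c
#include <assert.h>

/* Hypothetical severity and log-level enums, for illustration only. */
enum ras_severity { SEV_CORRECTABLE, SEV_UNCORRECTABLE, SEV_FATAL };
enum log_level { LOG_LEVEL_WARN, LOG_LEVEL_ERR };

/* Only the known-benign case warns; anything else is treated as an error. */
static enum log_level severity_to_log_level(enum ras_severity severity)
{
	if (severity == SEV_CORRECTABLE)
		return LOG_LEVEL_WARN;
	return LOG_LEVEL_ERR;
}

/* Small self-check covering the current and a possible future severity. */
static void check_log_levels(void)
{
	assert(severity_to_log_level(SEV_CORRECTABLE) == LOG_LEVEL_WARN);
	assert(severity_to_log_level(SEV_UNCORRECTABLE) == LOG_LEVEL_ERR);
	assert(severity_to_log_level(SEV_FATAL) == LOG_LEVEL_ERR);
}
```

A new severity falls through to the error path without touching the helper, which is the future-proofing argued for above.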

> > > +		drm_err_ratelimited(&xe->drm, "%s %s detected\n", name, severity_str);
> > > +	else
> > > +		drm_warn(&xe->drm, "%s %s detected\n", name, severity_str);
> > > +}
> > > +
> > > +static void log_gt_err(struct xe_tile *tile, const char *name, int i, u32 err,
> > > +		       const enum drm_xe_ras_error_severity severity)
> > > +{
> > > +	const char *severity_str = error_severity[severity];
> > > +	struct xe_device *xe = tile_to_xe(tile);
> > > +
> > > +	if (severity == DRM_XE_RAS_ERROR_SEVERITY_UNCORRECTABLE)
> > 
> > Ditto.
> > 
> > > +		drm_err_ratelimited(&xe->drm, "%s %s detected, ERROR_STAT_GT_VECTOR%d:0x%08x\n",
> > > +				    name, severity_str, i, err);
> > > +	else
> > > +		drm_warn(&xe->drm, "%s %s detected, ERROR_STAT_GT_VECTOR%d:0x%08x\n",
> > > +			 name, severity_str, i, err);
> > > +}
> > > +
> > > +static void gt_hw_error_handler(struct xe_tile *tile, const enum hardware_error hw_err,
> > > +				u32 error_id)
> > > +{
> > > +	const enum drm_xe_ras_error_severity severity = hw_err_to_severity(hw_err);
> > > +	struct xe_device *xe = tile_to_xe(tile);
> > > +	struct xe_drm_ras *ras = &xe->ras;
> > > +	struct xe_drm_ras_counter *info = ras->info[severity];
> > > +	struct xe_mmio *mmio = &tile->mmio;
> > > +	unsigned long err_stat = 0;
> > > +	int i, len;
> > > +
> > > +	if (xe->info.platform != XE_PVC)
> > > +		return;
> > > +
> > > +	if (hw_err == HARDWARE_ERROR_NONFATAL) {
> > > +		atomic64_inc(&info[error_id].counter);
> > > +		log_hw_error(tile, info[error_id].name, severity);
> > > +		return;
> > > +	}
> > > +
> > > +	len = (hw_err == HARDWARE_ERROR_CORRECTABLE) ? ERR_STAT_GT_COR_VECTOR_LEN
> > > +						     : ERR_STAT_GT_VECTOR_MAX;
> > > +
> > > +	for (i = 0; i < len; i++) {
> > > +		u32 vector, val;
> > > +
> > > +		vector = xe_mmio_read32(mmio, ERR_STAT_GT_VECTOR_REG(hw_err, i));
> > > +		if (!vector)
> > > +			continue;
> > > +
> > > +		switch (i) {
> > > +		case ERR_STAT_GT_VECTOR0:
> > > +		case ERR_STAT_GT_VECTOR1:
> > > +			u32 errbit;
> > 
> > With this I think you'll need braces to make the compiler happy, so either
> > add them or move this to the top.
> > > 
> > > > +			val = hweight32(vector);
> > > +			atomic64_add(val, &info[error_id].counter);
> > > +			log_gt_err(tile, "Subslice", i, vector, severity);
> > > +
> > > +			/* Read Error Status Register once */
> > 
> > Why? Can you please elaborate?
> 
> The register will be populated only once. Even though there are multiple
> vectors reported, the causes for the subslice error will be read and cleared
> once.
> 
> Will add it in comment.

Yes please!

> > > +			if (err_stat)
> > > +				break;
> > > +
> > > +			err_stat = xe_mmio_read32(mmio, ERR_STAT_GT_REG(hw_err));
> > > +			for_each_set_bit(errbit, &err_stat, GT_HW_ERROR_MAX_ERR_BITS) {
> > > +				if (hw_err == HARDWARE_ERROR_CORRECTABLE &&
> > > +				    (BIT(errbit) & PVC_COR_ERR_MASK))
> > 
> > I'm wondering if this can be a (hw_err ? x) macro for this? Perhaps it'll
> > help remove the duplication.
> 
> It is used once. Will check

Sure, I don't mind, but perhaps it'll help abstract the rather busy
logic here.
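
One possible shape for that abstraction, sketched in userspace: pick the mask per error type once, then count the masked bits. The bit positions are copied from the quoted register definitions; the enum ordering, `gt_err_mask()` and `count_gt_errors()` are assumptions made for illustration.

```c
#include <assert.h>
#include <stdint.h>

#define BIT(n)			(1U << (n))

/* Bit positions copied from the quoted register definitions. */
#define EU_GRF_COR_ERR		BIT(15)
#define EU_IC_COR_ERR		BIT(14)
#define SLM_COR_ERR		BIT(13)
#define GUC_COR_ERR		BIT(1)
#define EU_GRF_FAT_ERR		BIT(15)
#define SLM_FAT_ERR		BIT(13)
#define GUC_FAT_ERR		BIT(6)
#define FPU_FAT_ERR		BIT(3)

#define PVC_COR_ERR_MASK	(GUC_COR_ERR | SLM_COR_ERR | EU_IC_COR_ERR | EU_GRF_COR_ERR)
#define PVC_FAT_ERR_MASK	(FPU_FAT_ERR | GUC_FAT_ERR | EU_GRF_FAT_ERR | SLM_FAT_ERR)

static_assert((PVC_COR_ERR_MASK | PVC_FAT_ERR_MASK) <= 0xFFFF,
	      "masks must fit in the 16 status bits the handler scans");

/* Assumed ordering of the error types. */
enum hardware_error {
	HARDWARE_ERROR_CORRECTABLE,
	HARDWARE_ERROR_NONFATAL,
	HARDWARE_ERROR_FATAL,
};

/* Hypothetical helper: mask of countable status bits for an error type. */
static uint32_t gt_err_mask(enum hardware_error hw_err)
{
	switch (hw_err) {
	case HARDWARE_ERROR_CORRECTABLE:
		return PVC_COR_ERR_MASK;
	case HARDWARE_ERROR_FATAL:
		return PVC_FAT_ERR_MASK;
	default:
		return 0;
	}
}

/* Count the status bits that are relevant for this error type. */
static unsigned int count_gt_errors(enum hardware_error hw_err, uint32_t err_stat)
{
	uint32_t masked = err_stat & gt_err_mask(hw_err);
	unsigned int n = 0;

	for (; masked; masked &= masked - 1)
		n++;
	return n;
}
```

With the mask chosen up front, the two nearly identical branches in the loop collapse into one masked count.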

> > > +					atomic64_inc(&info[error_id].counter);
> > > +				if (hw_err == HARDWARE_ERROR_FATAL &&
> > > +				    (BIT(errbit) & PVC_FAT_ERR_MASK))
> > > +					atomic64_inc(&info[error_id].counter);
> > > +			}
> > > +			if (err_stat)
> > > +				xe_mmio_write32(mmio, ERR_STAT_GT_REG(hw_err), err_stat);
> > > +			break;
> > > +		case ERR_STAT_GT_VECTOR2:
> > > +		case ERR_STAT_GT_VECTOR3:
> > > +			val = hweight32(vector);
> > > +			atomic64_add(val, &info[error_id].counter);
> > > +			log_gt_err(tile, "L3 BANK", i, vector, severity);
> > > +			break;
> > > +		case ERR_STAT_GT_VECTOR6:
> > > +			val = hweight32(vector);
> > > +			atomic64_add(val, &info[error_id].counter);
> > > +			log_gt_err(tile, "TLB", i, vector, severity);
> > > +			break;
> > > +		case ERR_STAT_GT_VECTOR7:
> > > +			val = hweight32(vector);
> > > +			atomic64_add(val, &info[error_id].counter);
> > > +			break;
> > > +		default:
> > > +			log_gt_err(tile, "Undefined", i, vector, severity);
> > > +		}
> > > +
> > > +		xe_mmio_write32(mmio, ERR_STAT_GT_VECTOR_REG(hw_err, i), vector);
> > > +	}
> > > +}
> > > +
> > >   static void hw_error_source_handler(struct xe_tile *tile, const enum hardware_error hw_err)
> > >   {
> > >   	const enum drm_xe_ras_error_severity severity = hw_err_to_severity(hw_err);
> > >   	const char *severity_str = error_severity[severity];
> > >   	struct xe_device *xe = tile_to_xe(tile);
> > > -	unsigned long flags;
> > > -	u32 err_src;
> > > +	struct xe_drm_ras *ras = &xe->ras;
> > > +	struct xe_drm_ras_counter *info = ras->info[severity];
> > > +	unsigned long flags, err_src;
> > > +	u32 err_bit;
> > > -	if (xe->info.platform != XE_BATTLEMAGE)
> > > +	if (!IS_DGFX(xe))
> > >   		return;
> > >   	spin_lock_irqsave(&xe->irq.lock, flags);
> > 
> > I'm wondering if we really need this? We're already inside irq handler so
> > what are we protecting here?
> 
> This is not related to the series. Will have to check

Then let's address it separately. Perhaps a standalone fix?

> > > @@ -105,11 +231,44 @@ static void hw_error_source_handler(struct xe_tile *tile, const enum hardware_er
> > >   		goto unlock;
> > >   	}
> > > -	if (err_src & XE_CSC_ERROR)
> > > +	/*
> > > +	 * On encountering CSC firmware errors, the graphics device is non-recoverable.
> > 
> > ... "so bail immediately."
> 
> The code is quite intuitive but will add it for additional clarity.

Thank you! We always assume the lack of coffee ;)
Also, nit: s/non-recoverable/unrecoverable

Raag

> > > +	 * The only way to recover from these errors is firmware flash. The device will
> > > +	 * enter Runtime Survivability mode when such errors are detected.
> > > +	 */
> > > +	if (err_src & XE_CSC_ERROR) {
> > >   		csc_hw_error_handler(tile, hw_err);
> > > +		goto clear_reg;
> > > +	}
> > > -	xe_mmio_write32(&tile->mmio, DEV_ERR_STAT_REG(hw_err), err_src);
> > > +	if (!info) {
> > > +		drm_err_ratelimited(&xe->drm, HW_ERR "Errors undefined\n");
> > > +		goto clear_reg;
> > > +	}
> > > +
> > > +	for_each_set_bit(err_bit, &err_src, XE_RAS_REG_SIZE) {
> > > +		u32 error_id = xe_hw_error_map[err_bit];
> > 
> > Does this need bounds checking against ARRAY_SIZE()?
> > 
> > > +		const char *name;
> > > +
> > > +		name = info[error_id].name;
> > > +		if (!name)
> > > +			goto clear_reg;
> > 
> > Shouldn't we at least give the next id a try?
> 
> yeah makes sense. will add it.
> 
> Thanks
> Riana
> 
> > 
> > > +		if (severity == DRM_XE_RAS_ERROR_SEVERITY_UNCORRECTABLE) {
> > 
> > Ditto for logging per severity.
> > 
> > Raag
> > 
> > > +			drm_err_ratelimited(&xe->drm, HW_ERR
> > > +					    "TILE%d reported %s %s, bit[%d] is set\n",
> > > +					    tile->id, name, severity_str, err_bit);
> > > +		} else {
> > > +			drm_warn(&xe->drm, HW_ERR
> > > +				 "TILE%d reported %s %s, bit[%d] is set\n",
> > > +				 tile->id, name, severity_str, err_bit);
> > > +		}
> > > +		if (err_bit == XE_GT_ERROR)
> > > +			gt_hw_error_handler(tile, hw_err, error_id);
> > > +	}
> > > +
> > > +clear_reg:
> > > +	xe_mmio_write32(&tile->mmio, DEV_ERR_STAT_REG(hw_err), err_src);
> > >   unlock:
> > >   	spin_unlock_irqrestore(&xe->irq.lock, flags);
> > >   }
> > > @@ -131,9 +290,10 @@ void xe_hw_error_irq_handler(struct xe_tile *tile, const u32 master_ctl)
> > >   	if (fault_inject_csc_hw_error())
> > >   		schedule_work(&tile->csc_hw_error_work);
> > > -	for (hw_err = 0; hw_err < HARDWARE_ERROR_MAX; hw_err++)
> > > +	for (hw_err = 0; hw_err < HARDWARE_ERROR_MAX; hw_err++) {
> > >   		if (master_ctl & ERROR_IRQ(hw_err))
> > >   			hw_error_source_handler(tile, hw_err);
> > > +	}
> > >   }
> > >   static int hw_error_info_init(struct xe_device *xe)
> > > -- 
> > > 2.47.1
> > > 
> 

^ permalink raw reply	[flat|nested] 22+ messages in thread

* Re: [PATCH v4 2/4] drm/xe/xe_drm_ras: Add support for drm ras
  2026-01-20 17:01   ` Raag Jadav
@ 2026-01-28  6:51     ` Riana Tauro
  2026-01-28  7:15       ` Raag Jadav
  0 siblings, 1 reply; 22+ messages in thread
From: Riana Tauro @ 2026-01-28  6:51 UTC (permalink / raw)
  To: Raag Jadav
  Cc: intel-xe, dri-devel, aravind.iddamsetty, anshuman.gupta,
	rodrigo.vivi, joonas.lahtinen, simona.vetter, airlied,
	pratik.bari, joshua.santosh.ranjan, ashwin.kumar.kulkarni,
	shubham.kumar, ravi.kishore.koppuravuri



On 1/20/2026 10:31 PM, Raag Jadav wrote:
> On Mon, Jan 19, 2026 at 09:30:24AM +0530, Riana Tauro wrote:
>> Allocate correctable, uncorrectable nodes for every xe device
>> Each node contains error classes, counters and respective
>> query counter functions.
> 
> ...
> 
>> +static int hw_query_error_counter(struct xe_drm_ras_counter *info,
>> +				  u32 error_id, const char **name, u32 *val)
>> +{
>> +	if (error_id < DRM_XE_RAS_ERROR_CLASS_GT || error_id >= DRM_XE_RAS_ERROR_CLASS_MAX)
> 
> This looks like it can be in_range().

in_range() takes a start and a length, so I would have to compute a count
here again. The explicit comparison seems simpler.
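
A small userspace sketch of the two forms being compared. example_in_range() is a stand-in that assumes the kernel helper's (val, start, len) convention; the EXAMPLE_* enum only mirrors the shape of drm_xe_ras_error_class and is not the uAPI definition.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Same shape as the error class enum: first valid id is 1. */
enum example_error_class {
	EXAMPLE_CLASS_GT = 1,
	EXAMPLE_CLASS_SOC,
	EXAMPLE_CLASS_MAX,
};

/* Userspace stand-in for in_range(val, start, len): start <= val < start + len. */
static bool example_in_range(uint32_t val, uint32_t start, uint32_t len)
{
	/* Unsigned wrap-around makes val < start fail this test too. */
	return (val - start) < len;
}

/* The explicit comparison used in the patch. */
static bool example_valid_explicit(uint32_t id)
{
	return id >= EXAMPLE_CLASS_GT && id < EXAMPLE_CLASS_MAX;
}

/* The in_range() form needs a length, so a count has to be derived. */
static bool example_valid_in_range(uint32_t id)
{
	return example_in_range(id, EXAMPLE_CLASS_GT,
				EXAMPLE_CLASS_MAX - EXAMPLE_CLASS_GT);
}
```

Both forms accept the same ids. The in_range() variant trades two comparisons for a derived length, which is the extra step mentioned in the reply.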

> 
>> +		return -EINVAL;
>> +
>> +	if (!info[error_id].name)
>> +		return -ENOENT;
>> +
>> +	*name = info[error_id].name;
>> +	*val = atomic64_read(&info[error_id].counter);
>> +
>> +	return 0;
>> +}
>> +
>> +static int query_uncorrectable_error_counters(struct drm_ras_node *ep,
> 
> This is named as 'counters' but I only see a single call here. What am
> I missing?

makes sense will fix it

> 
>> +					      u32 error_id, const char **name,
>> +					      u32 *val)
> 
> Can this be less lines?

will check

> 
>> +{
>> +	struct xe_device *xe = ep->priv;
>> +	struct xe_drm_ras *ras = &xe->ras;
>> +	struct xe_drm_ras_counter *info = ras->info[DRM_XE_RAS_ERROR_SEVERITY_UNCORRECTABLE];
>> +
>> +	return hw_query_error_counter(info, error_id, name, val);
>> +}
>> +
>> +static int query_correctable_error_counters(struct drm_ras_node *ep,
> 
> Same as above.
> 
>> +					    u32 error_id, const char **name,
>> +					    u32 *val)
> 
> Same as above.
> 
>> +{
>> +	struct xe_device *xe = ep->priv;
>> +	struct xe_drm_ras *ras = &xe->ras;
>> +	struct xe_drm_ras_counter *info = ras->info[DRM_XE_RAS_ERROR_SEVERITY_CORRECTABLE];
>> +
>> +	return hw_query_error_counter(info, error_id, name, val);
>> +}
>> +
>> +static struct xe_drm_ras_counter *allocate_and_copy_counters(struct xe_device *xe)
>> +{
>> +	struct xe_drm_ras_counter *counter;
>> +	int i;
>> +
>> +	counter = drmm_kcalloc(&xe->drm, DRM_XE_RAS_ERROR_CLASS_MAX,
>> +			       sizeof(struct xe_drm_ras_counter), GFP_KERNEL);
> 
> I'd make this robust against type changes, i.e. sizeof(*counter).
> 
>> +	if (!counter)
>> +		return ERR_PTR(-ENOMEM);
>> +
>> +	for (i = 0; i < DRM_XE_RAS_ERROR_CLASS_MAX; i++) {
>> +		if (!errors[i])
>> +			continue;
>> +
>> +		counter[i].name = errors[i];
>> +		atomic64_set(&counter[i].counter, 0);
> 
> Doesn't drmm_kcalloc() already take care of this?
> 
>> +	}
>> +
>> +	return counter;
>> +}
>> +
>> +static int assign_node_params(struct xe_device *xe, struct drm_ras_node *node,
>> +			      const enum drm_xe_ras_error_severity severity)
>> +{
>> +	struct pci_dev *pdev = to_pci_dev(xe->drm.dev);
>> +	struct xe_drm_ras *ras = &xe->ras;
>> +	const char *device_name;
>> +
>> +	device_name = kasprintf(GFP_KERNEL, "%04x:%02x:%02x.%d",
>> +				pci_domain_nr(pdev->bus), pdev->bus->number,
>> +				PCI_SLOT(pdev->devfn), PCI_FUNC(pdev->devfn));
>> +
>> +	node->device_name = device_name;
>> +	node->node_name = error_severity[severity];
>> +	node->type = DRM_RAS_NODE_TYPE_ERROR_COUNTER;
>> +	node->error_counter_range.first = DRM_XE_RAS_ERROR_CLASS_GT;
>> +	node->error_counter_range.last = DRM_XE_RAS_ERROR_CLASS_MAX - 1;
>> +	node->priv = xe;
>> +
>> +	ras->info[severity] = allocate_and_copy_counters(xe);
>> +	if (IS_ERR(ras->info[severity]))
>> +		return PTR_ERR(ras->info[severity]);
>> +
>> +	if (severity == DRM_XE_RAS_ERROR_SEVERITY_CORRECTABLE)
>> +		node->query_error_counter = query_correctable_error_counters;
>> +	else
>> +		node->query_error_counter = query_uncorrectable_error_counters;
> 
> Shouldn't this have explicit severity check, at least for future proofing?

There are only two severity types right now. If another one is added,
the else branch can be modified accordingly.
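
For comparison, a userspace sketch of what an explicit check could look like. All example_* names are hypothetical and the callbacks are stubs; only the switch with a rejecting default is the point.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>

enum example_severity {
	EXAMPLE_SEVERITY_CORRECTABLE = 0,
	EXAMPLE_SEVERITY_UNCORRECTABLE,
	EXAMPLE_SEVERITY_MAX,
};

typedef int (*example_query_fn)(unsigned int error_id);

/* Stub callbacks standing in for the per-severity query functions. */
static int example_query_correctable(unsigned int error_id)
{
	(void)error_id;
	return 0;
}

static int example_query_uncorrectable(unsigned int error_id)
{
	(void)error_id;
	return 0;
}

/* Select a callback; an unknown severity is rejected instead of
 * silently falling into the uncorrectable case. */
static int example_pick_query(int severity, example_query_fn *out)
{
	switch (severity) {
	case EXAMPLE_SEVERITY_CORRECTABLE:
		*out = example_query_correctable;
		return 0;
	case EXAMPLE_SEVERITY_UNCORRECTABLE:
		*out = example_query_uncorrectable;
		return 0;
	default:
		*out = NULL;
		return -EINVAL;
	}
}
```

The cost is a few more lines. The benefit is that a third severity added later fails loudly rather than inheriting the uncorrectable callback.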

> 
>> +
>> +	return 0;
>> +}
>> +
>> +static int register_nodes(struct xe_device *xe)
>> +{
>> +	struct xe_drm_ras *ras = &xe->ras;
>> +	int i;
>> +
>> +	for_each_error_severity(i) {
>> +		struct drm_ras_node *node = &ras->node[i];
>> +		int ret;
>> +
>> +		ret = assign_node_params(xe, node, i);
>> +		if (ret)
>> +			return ret;
>> +
>> +		ret = drm_ras_node_register(node);
>> +		if (ret)
>> +			return ret;
>> +	}
>> +
>> +	return 0;
>> +}
>> +
>> +static void xe_drm_ras_unregister_nodes(void *arg)
>> +{
>> +	struct xe_device *xe = arg;
>> +	struct xe_drm_ras *ras = &xe->ras;
>> +	int i;
>> +
>> +	for_each_error_severity(i) {
>> +		struct drm_ras_node *node = &ras->node[i];
>> +
>> +		drm_ras_node_unregister(node);
>> +
>> +		if (i == 0)
>> +			kfree(node->device_name);
> 
> Aren't we allocating this for each node?

Thanks for catching this. The rev2 had this once per node.
I moved the entire node params to a different function.

Will fix this
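
A userspace sketch of the ownership rule at issue: if each node gets its own name allocation, each node must also free it. malloc()/snprintf()/free() stand in for kasprintf()/kfree(), and the example_* names and the fixed 0000:03:00.0 address are illustrative only.

```c
#include <assert.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>

#define EXAMPLE_NODE_COUNT 2

struct example_node {
	char *device_name;	/* owned by the node, one allocation each */
};

/* Format a PCI-style address; returns NULL on allocation failure. */
static char *example_format_bdf(unsigned int domain, unsigned int bus,
				unsigned int slot, unsigned int func)
{
	char *name = malloc(32);

	if (name)
		snprintf(name, 32, "%04x:%02x:%02x.%u", domain, bus, slot, func);
	return name;
}

/* Allocate one name per node. The array must start zeroed so that a
 * partial failure can still be cleaned up by the unregister path. */
static int example_register_nodes(struct example_node *nodes)
{
	for (int i = 0; i < EXAMPLE_NODE_COUNT; i++) {
		nodes[i].device_name = example_format_bdf(0, 3, 0, 0);
		if (!nodes[i].device_name)
			return -1;
	}
	return 0;
}

/* Free every node's name, not just the first one. */
static void example_unregister_nodes(struct example_node *nodes)
{
	for (int i = 0; i < EXAMPLE_NODE_COUNT; i++) {
		free(nodes[i].device_name);	/* free(NULL) is a no-op */
		nodes[i].device_name = NULL;
	}
}
```

Freeing only when i == 0, as in the quoted hunk, would leak the second allocation under this scheme.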


> 
>> +	}
>> +}
>> +
>> +/**
>> + * xe_drm_ras_allocate_nodes - Allocate DRM RAS nodes
>> + * @xe: xe device instance
>> + *
>> + * Allocate and register DRM RAS nodes per device
>> + *
>> + * Return: 0 on success, error code on failure
>> + */
>> +int xe_drm_ras_allocate_nodes(struct xe_device *xe)
>> +{
>> +	struct xe_drm_ras *ras = &xe->ras;
>> +	struct drm_ras_node *node;
>> +	int err;
>> +
>> +	node = drmm_kcalloc(&xe->drm, DRM_XE_RAS_ERROR_SEVERITY_MAX, sizeof(struct drm_ras_node),
> 
> Ditto for robust against type changes.

okay
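
A short userspace sketch of the two allocation remarks in this review: sizing by sizeof(*ptr), and relying on a zeroing allocator. calloc() stands in for drmm_kcalloc() here, and struct example_counter is a simplified placeholder with a plain integer instead of atomic64_t.

```c
#include <assert.h>
#include <stdint.h>
#include <stdlib.h>

/* Simplified placeholder for the counter structure. */
struct example_counter {
	const char *name;
	uint64_t counter;
};

static struct example_counter *example_alloc_counters(size_t n)
{
	struct example_counter *counter;

	/* sizeof(*counter) follows the pointer's type if it ever changes. */
	counter = calloc(n, sizeof(*counter));

	/* NULL on failure; otherwise the memory is already zero-filled,
	 * so no per-element reset of the count is needed. */
	return counter;
}
```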

> 
>> +			    GFP_KERNEL);
>> +	if (!node)
>> +		return -ENOMEM;
>> +
>> +	ras->node = node;
>> +
>> +	err = register_nodes(xe);
>> +	if (err) {
>> +		drm_err(&xe->drm, "Failed to register drm ras node\n");
>> +		return err;
>> +	}
>> +
>> +	err = devm_add_action_or_reset(xe->drm.dev, xe_drm_ras_unregister_nodes, xe);
>> +	if (err) {
>> +		drm_err(&xe->drm, "Failed to add action for xe drm_ras\n");
>> +		return err;
>> +	}
>> +
>> +	return 0;
>> +}
>> diff --git a/drivers/gpu/drm/xe/xe_drm_ras.h b/drivers/gpu/drm/xe/xe_drm_ras.h
>> new file mode 100644
>> index 000000000000..2d714342e4e5
>> --- /dev/null
>> +++ b/drivers/gpu/drm/xe/xe_drm_ras.h
>> @@ -0,0 +1,15 @@
>> +/* SPDX-License-Identifier: MIT */
>> +/*
>> + * Copyright © 2026 Intel Corporation
>> + */
>> +#ifndef XE_DRM_RAS_H_
>> +#define XE_DRM_RAS_H_
>> +
>> +struct xe_device;
>> +
>> +#define for_each_error_severity(i)	\
>> +	for (i = 0; i < DRM_XE_RAS_ERROR_SEVERITY_MAX; i++)
>> +
>> +int xe_drm_ras_allocate_nodes(struct xe_device *xe);
>> +
>> +#endif
>> diff --git a/drivers/gpu/drm/xe/xe_drm_ras_types.h b/drivers/gpu/drm/xe/xe_drm_ras_types.h
>> new file mode 100644
>> index 000000000000..528c708e57da
>> --- /dev/null
>> +++ b/drivers/gpu/drm/xe/xe_drm_ras_types.h
>> @@ -0,0 +1,49 @@
>> +/* SPDX-License-Identifier: MIT */
>> +/*
>> + * Copyright © 2026 Intel Corporation
>> + */
>> +
>> +#ifndef _XE_DRM_RAS_TYPES_H_
>> +#define _XE_DRM_RAS_TYPES_H_
>> +
>> +#include <drm/xe_drm.h>
>> +#include <linux/atomic.h>
>> +
>> +struct drm_ras_node;
>> +
>> +/* Error categories reported by hardware */
>> +enum hardware_error {
>> +	HARDWARE_ERROR_CORRECTABLE = 0,
>> +	HARDWARE_ERROR_NONFATAL = 1,
>> +	HARDWARE_ERROR_FATAL = 2,
>> +	HARDWARE_ERROR_MAX,
>> +};
>> +
>> +/**
>> + * struct xe_drm_ras_counter - XE RAS counter
>> + *
>> + * This structure contains error class and counter information
>> + */
>> +struct xe_drm_ras_counter {
>> +	/** @name: error class name */
>> +	const char *name;
>> +
>> +	/** @counter: count of error */
>> +	atomic64_t counter;
>> +};
>> +
>> +/**
>> + * struct xe_drm_ras - XE DRM RAS structure
>> + *
>> + * This structure has details of error counters
>> + */
>> +struct xe_drm_ras {
>> +	/** @node: DRM RAS node */
>> +	struct drm_ras_node *node;
>> +
>> +	/** @info: info array for all types of errors */
>> +	struct xe_drm_ras_counter *info[DRM_XE_RAS_ERROR_SEVERITY_MAX];
>> +
> 
> Nit: Redundant blank line.
> 
>> +};
>> +
>> +#endif
>> diff --git a/drivers/gpu/drm/xe/xe_hw_error.c b/drivers/gpu/drm/xe/xe_hw_error.c
>> index 8c65291f36fc..b42495d3015a 100644
>> --- a/drivers/gpu/drm/xe/xe_hw_error.c
>> +++ b/drivers/gpu/drm/xe/xe_hw_error.c
>> @@ -10,20 +10,14 @@
>>   #include "regs/xe_irq_regs.h"
>>   
>>   #include "xe_device.h"
>> +#include "xe_drm_ras.h"
>>   #include "xe_hw_error.h"
>>   #include "xe_mmio.h"
>>   #include "xe_survivability_mode.h"
>>   
>>   #define  HEC_UNCORR_FW_ERR_BITS 4
>>   extern struct fault_attr inject_csc_hw_error;
>> -
>> -/* Error categories reported by hardware */
>> -enum hardware_error {
>> -	HARDWARE_ERROR_CORRECTABLE = 0,
>> -	HARDWARE_ERROR_NONFATAL = 1,
>> -	HARDWARE_ERROR_FATAL = 2,
>> -	HARDWARE_ERROR_MAX,
>> -};
>> +static const char * const error_severity[] = DRM_XE_RAS_ERROR_SEVERITY_NAMES;
> 
> This is unrelated to uapi changes, shouldn't we split this into a separate
> patch?

Let me check if I can split this.

> 
> ...
> 
>> +/**
>> + * enum drm_xe_ras_error_severity - DRM RAS error severity.
>> + */
>> +enum drm_xe_ras_error_severity {
>> +	/** @DRM_XE_RAS_ERROR_SEVERITY_CORRECTABLE: Correctable Error */
>> +	DRM_XE_RAS_ERROR_SEVERITY_CORRECTABLE = 0,
> 
> DRM_XE_RAS_ERR_SEV_*? (and same for this entire file)

ERROR_SEVERITY is more verbose

> 
>> +	/** @DRM_XE_RAS_ERROR_UNCORRECTABLE: Uncorrectable Error */
> 
> Match with actual name.

will fix this.

> 
>> +	DRM_XE_RAS_ERROR_SEVERITY_UNCORRECTABLE,
>> +	/** @DRM_XE_RAS_ERROR_SEVERITY_MAX: Max severity */
>> +	DRM_XE_RAS_ERROR_SEVERITY_MAX /* non-ABI */
>> +};
>> +
>> +/**
>> + * enum drm_xe_ras_error_class - DRM RAS error classes.
>> + */
>> +enum drm_xe_ras_error_class {
>> +	/** @DRM_XE_RAS_ERROR_CLASS_GT: GT Error */
>> +	DRM_XE_RAS_ERROR_CLASS_GT = 1,
>> +	/** @DRM_XE_RAS_ERROR_CLASS_SOC: SoC Error */
>> +	DRM_XE_RAS_ERROR_CLASS_SOC,
>> +	/** @DRM_XE_RAS_ERROR_CLASS_MAX: Max Error */
>> +	DRM_XE_RAS_ERROR_CLASS_MAX	/* non-ABI */
> 
> I don't find 'CLASS' to be much translatable since it can inherently mean
> anything, but I'm not sure if this to match with spec naming.
> 
> PS: I've used 'COMP' for component in my series[1], but upto you.
> Also, please help review it in case I've missed anything.

It's an aggregated error class.

Yeah, component does match the spec. The rest of the errors will also be
renamed accordingly, to

core-compute and soc-internal

Thanks
Riana

> 
> [1] https://lore.kernel.org/intel-xe/20260116093432.914040-1-raag.jadav@intel.com/
> 
> Raag
> 
>> +};
>> +
>> +/*
>> + * Error severity to name mapping.
>> + */
>> +#define DRM_XE_RAS_ERROR_SEVERITY_NAMES {					\
>> +	[DRM_XE_RAS_ERROR_SEVERITY_CORRECTABLE] = "correctable-errors",		\
>> +	[DRM_XE_RAS_ERROR_SEVERITY_UNCORRECTABLE] = "uncorrectable-errors",	\
>> +}
>> +
>> +/*
>> + * Error class to name mapping.
>> + */
>> +#define DRM_XE_RAS_ERROR_CLASS_NAMES {					\
>> +	[DRM_XE_RAS_ERROR_CLASS_GT] = "GT",				\
>> +	[DRM_XE_RAS_ERROR_CLASS_SOC] = "SoC"				\
>> +}
>> +
>>   #if defined(__cplusplus)
>>   }
>>   #endif
>> -- 
>> 2.47.1
>>


^ permalink raw reply	[flat|nested] 22+ messages in thread

* Re: [PATCH v4 2/4] drm/xe/xe_drm_ras: Add support for drm ras
  2026-01-28  6:51     ` Riana Tauro
@ 2026-01-28  7:15       ` Raag Jadav
  2026-01-28  7:34         ` Riana Tauro
  0 siblings, 1 reply; 22+ messages in thread
From: Raag Jadav @ 2026-01-28  7:15 UTC (permalink / raw)
  To: Riana Tauro
  Cc: intel-xe, dri-devel, aravind.iddamsetty, anshuman.gupta,
	rodrigo.vivi, joonas.lahtinen, simona.vetter, airlied,
	pratik.bari, joshua.santosh.ranjan, ashwin.kumar.kulkarni,
	shubham.kumar, ravi.kishore.koppuravuri

On Wed, Jan 28, 2026 at 12:21:18PM +0530, Riana Tauro wrote:
> On 1/20/2026 10:31 PM, Raag Jadav wrote:
> > On Mon, Jan 19, 2026 at 09:30:24AM +0530, Riana Tauro wrote:
> > > Allocate correctable, uncorrectable nodes for every xe device
> > > Each node contains error classes, counters and respective
> > > query counter functions.
> > 
> > ...
> > 
> > > +static int hw_query_error_counter(struct xe_drm_ras_counter *info,
> > > +				  u32 error_id, const char **name, u32 *val)
> > > +{
> > > +	if (error_id < DRM_XE_RAS_ERROR_CLASS_GT || error_id >= DRM_XE_RAS_ERROR_CLASS_MAX)
> > 
> > This looks like it can be in_range().
> 
> in_range has start+len. Should again use count here.
> This seems simpler

I just had another look at this and wondering if we really need lower
bound check? error_id is already unsigned right?

Raag

> > > +		return -EINVAL;
> > > +
> > > +	if (!info[error_id].name)
> > > +		return -ENOENT;
> > > +
> > > +	*name = info[error_id].name;
> > > +	*val = atomic64_read(&info[error_id].counter);
> > > +
> > > +	return 0;
> > > +}

^ permalink raw reply	[flat|nested] 22+ messages in thread

* Re: [PATCH v4 2/4] drm/xe/xe_drm_ras: Add support for drm ras
  2026-01-28  7:15       ` Raag Jadav
@ 2026-01-28  7:34         ` Riana Tauro
  0 siblings, 0 replies; 22+ messages in thread
From: Riana Tauro @ 2026-01-28  7:34 UTC (permalink / raw)
  To: Raag Jadav
  Cc: intel-xe, dri-devel, aravind.iddamsetty, anshuman.gupta,
	rodrigo.vivi, joonas.lahtinen, simona.vetter, airlied,
	pratik.bari, joshua.santosh.ranjan, ashwin.kumar.kulkarni,
	shubham.kumar, ravi.kishore.koppuravuri



On 1/28/2026 12:45 PM, Raag Jadav wrote:
> On Wed, Jan 28, 2026 at 12:21:18PM +0530, Riana Tauro wrote:
>> On 1/20/2026 10:31 PM, Raag Jadav wrote:
>>> On Mon, Jan 19, 2026 at 09:30:24AM +0530, Riana Tauro wrote:
>>>> Allocate correctable, uncorrectable nodes for every xe device
>>>> Each node contains error classes, counters and respective
>>>> query counter functions.
>>>
>>> ...
>>>
>>>> +static int hw_query_error_counter(struct xe_drm_ras_counter *info,
>>>> +				  u32 error_id, const char **name, u32 *val)
>>>> +{
>>>> +	if (error_id < DRM_XE_RAS_ERROR_CLASS_GT || error_id >= DRM_XE_RAS_ERROR_CLASS_MAX)
>>>
>>> This looks like it can be in_range().
>>
>> in_range has start+len. Should again use count here.
>> This seems simpler
> 
> I just had another look at this and wondering if we really need lower
> bound check? error_id is already unsigned right?


I added this because error_id starts from 1 and not 0.

But the entire condition can be removed. This is already checked in
the first patch:

+	if (error_id < node->error_counter_range.first ||
+	    error_id > node->error_counter_range.last)
+		return -EINVAL;

Will remove this in next rev
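
A userspace sketch of why the driver-side test becomes redundant. It assumes the core applies the first/last check quoted above before calling into the driver; struct example_range and example_core_check() are illustrative stand-ins, not the drm_ras API.

```c
#include <assert.h>
#include <errno.h>
#include <stdint.h>

/* Stand-in for node->error_counter_range. */
struct example_range {
	uint32_t first;
	uint32_t last;
};

/* Same comparison as the core check quoted from the first patch. */
static int example_core_check(const struct example_range *r,
			      uint32_t error_id)
{
	if (error_id < r->first || error_id > r->last)
		return -EINVAL;
	return 0;
}
```

With first set to DRM_XE_RAS_ERROR_CLASS_GT (1) and last set to DRM_XE_RAS_ERROR_CLASS_MAX - 1 (2), as assign_node_params() does, ids 0 and 3 are rejected before the driver callback runs. Those are the ids the driver-side condition guarded against.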

Thanks
Riana

> 
> Raag
> 
>>>> +		return -EINVAL;
>>>> +
>>>> +	if (!info[error_id].name)
>>>> +		return -ENOENT;
>>>> +
>>>> +	*name = info[error_id].name;
>>>> +	*val = atomic64_read(&info[error_id].counter);
>>>> +
>>>> +	return 0;
>>>> +}


^ permalink raw reply	[flat|nested] 22+ messages in thread

* Re: [PATCH v4 1/4] drm/ras: Introduce the DRM RAS infrastructure over generic netlink
  2026-01-22 21:51   ` Zack McKevitt
@ 2026-02-02  6:20     ` Riana Tauro
  0 siblings, 0 replies; 22+ messages in thread
From: Riana Tauro @ 2026-02-02  6:20 UTC (permalink / raw)
  To: Zack McKevitt, intel-xe, dri-devel
  Cc: aravind.iddamsetty, anshuman.gupta, rodrigo.vivi, joonas.lahtinen,
	simona.vetter, airlied, pratik.bari, joshua.santosh.ranjan,
	ashwin.kumar.kulkarni, shubham.kumar, ravi.kishore.koppuravuri,
	raag.jadav, Lijo Lazar, Hawking Zhang, Jakub Kicinski,
	David S. Miller, Paolo Abeni, Eric Dumazet, netdev



On 1/23/2026 3:21 AM, Zack McKevitt wrote:
> Hi Riana and Rodrigo,
> 
> Thanks for incorporating the various pieces of feedback. I think this 
> looks good from our end.
> 

Hi Zack

Thank you for the RB.

Riana

> Reviewed-by: Zack McKevitt <zachary.mckevitt@oss.qualcomm.com>
> 
> Zack
> 
> On 1/18/2026 9:00 PM, Riana Tauro wrote:
>> From: Rodrigo Vivi <rodrigo.vivi@intel.com>
>>
>> Introduces the DRM RAS infrastructure over generic netlink.
>>
>> The new interface allows drivers to expose RAS nodes and their
>> associated error counters to userspace in a structured and extensible
>> way. Each drm_ras node can register its own set of error counters, which
>> are then discoverable and queryable through netlink operations. This
>> lays the groundwork for reporting and managing hardware error states
>> in a unified manner across different DRM drivers.
>>
>> Currently it only supports error-counter nodes, but it can be
>> extended later.
>>
>> The registration is also not tied to any drm node, so it can be
>> used by accel devices as well.
>>
>> It uses the new and mandatory YAML description format stored in
>> Documentation/netlink/specs/. This forces a single generic netlink
>> family namespace for the entire drm: "drm-ras".
>> But multiple-endpoints are supported within the single family.
>>
>> Any modification to this API needs to be applied to
>> Documentation/netlink/specs/drm_ras.yaml before regenerating the
>> code:
>>
>> $ tools/net/ynl/pyynl/ynl_gen_c.py --spec \
>>   Documentation/netlink/specs/drm_ras.yaml --mode uapi --header \
>>   > include/uapi/drm/drm_ras.h
>>
>> $ tools/net/ynl/pyynl/ynl_gen_c.py --spec \
>>   Documentation/netlink/specs/drm_ras.yaml --mode kernel --header \
>>   > include/drm/drm_ras_nl.h
>>
>> $ tools/net/ynl/pyynl/ynl_gen_c.py --spec \
>>   Documentation/netlink/specs/drm_ras.yaml --mode kernel --source \
>>   > drivers/gpu/drm/drm_ras_nl.c
>>
>> Cc: Zack McKevitt <zachary.mckevitt@oss.qualcomm.com>
>> Cc: Lijo Lazar <lijo.lazar@amd.com>
>> Cc: Hawking Zhang <Hawking.Zhang@amd.com>
>> Cc: Jakub Kicinski <kuba@kernel.org>
>> Cc: David S. Miller <davem@davemloft.net>
>> Cc: Paolo Abeni <pabeni@redhat.com>
>> Cc: Eric Dumazet <edumazet@google.com>
>> Cc: netdev@vger.kernel.org
>> Co-developed-by: Aravind Iddamsetty <aravind.iddamsetty@linux.intel.com>
>> Signed-off-by: Aravind Iddamsetty <aravind.iddamsetty@linux.intel.com>
>> Signed-off-by: Rodrigo Vivi <rodrigo.vivi@intel.com>
>> Signed-off-by: Riana Tauro <riana.tauro@intel.com>
>> ---
>> v2: fix doc and memory leak
>>      use xe_for_each_start
>>      use standard genlmsg_iput (Jakub Kicinski)
>>
>> v3: add documentation to index
>>      modify documentation to mention uAPI requirements (Rodrigo)
>>
>> v4: fix typo (Zack)
>> ---
>>   Documentation/gpu/drm-ras.rst            | 109 +++++++
>>   Documentation/gpu/index.rst              |   1 +
>>   Documentation/netlink/specs/drm_ras.yaml | 130 +++++++++
>>   drivers/gpu/drm/Kconfig                  |   9 +
>>   drivers/gpu/drm/Makefile                 |   1 +
>>   drivers/gpu/drm/drm_drv.c                |   6 +
>>   drivers/gpu/drm/drm_ras.c                | 351 +++++++++++++++++++++++
>>   drivers/gpu/drm/drm_ras_genl_family.c    |  42 +++
>>   drivers/gpu/drm/drm_ras_nl.c             |  54 ++++
>>   include/drm/drm_ras.h                    |  76 +++++
>>   include/drm/drm_ras_genl_family.h        |  17 ++
>>   include/drm/drm_ras_nl.h                 |  24 ++
>>   include/uapi/drm/drm_ras.h               |  49 ++++
>>   13 files changed, 869 insertions(+)
>>   create mode 100644 Documentation/gpu/drm-ras.rst
>>   create mode 100644 Documentation/netlink/specs/drm_ras.yaml
>>   create mode 100644 drivers/gpu/drm/drm_ras.c
>>   create mode 100644 drivers/gpu/drm/drm_ras_genl_family.c
>>   create mode 100644 drivers/gpu/drm/drm_ras_nl.c
>>   create mode 100644 include/drm/drm_ras.h
>>   create mode 100644 include/drm/drm_ras_genl_family.h
>>   create mode 100644 include/drm/drm_ras_nl.h
>>   create mode 100644 include/uapi/drm/drm_ras.h
>>
>> diff --git a/Documentation/gpu/drm-ras.rst b/Documentation/gpu/drm-ras.rst
>> new file mode 100644
>> index 000000000000..cec60cf5d17d
>> --- /dev/null
>> +++ b/Documentation/gpu/drm-ras.rst
>> @@ -0,0 +1,109 @@
>> +.. SPDX-License-Identifier: GPL-2.0+
>> +
>> +============================
>> +DRM RAS over Generic Netlink
>> +============================
>> +
>> +The DRM RAS (Reliability, Availability, Serviceability) interface provides a
>> +standardized way for GPU/accelerator drivers to expose error counters and
>> +other reliability nodes to user space via Generic Netlink. This allows
>> +diagnostic tools, monitoring daemons, or test infrastructure to query hardware
>> +health in a uniform way across different DRM drivers.
>> +
>> +Key Goals:
>> +
>> +* Provide a standardized RAS solution for GPU and accelerator drivers, enabling
>> +  data center monitoring and reliability operations.
>> +* Implement a single drm-ras Generic Netlink family to meet modern Netlink YAML
>> +  specifications and centralize all RAS-related communication in one namespace.
>> +* Support a basic error counter interface, addressing the immediate, essential
>> +  monitoring needs.
>> +* Offer a flexible, future-proof interface that can be extended to support
>> +  additional types of RAS data in the future.
>> +* Allow multiple nodes per driver, enabling drivers to register separate
>> +  nodes for different IP blocks, sub-blocks, or other logical subdivisions
>> +  as applicable.
>> +
>> +Nodes
>> +=====
>> +
>> +Nodes are logical abstractions representing an error source or block within
>> +the device. Currently, only error counter nodes are supported.
>> +
>> +Drivers are responsible for registering and unregistering nodes via the
>> +`drm_ras_node_register()` and `drm_ras_node_unregister()` APIs.
>> +
>> +Node Management
>> +-------------------
>> +
>> +.. kernel-doc:: drivers/gpu/drm/drm_ras.c
>> +   :doc: DRM RAS Node Management
>> +.. kernel-doc:: drivers/gpu/drm/drm_ras.c
>> +   :internal:
>> +
>> +Generic Netlink Usage
>> +=====================
>> +
>> +The interface is implemented as a Generic Netlink family named ``drm-ras``.
>> +User space tools can:
>> +
>> +* List registered nodes with the ``list-nodes`` command.
>> +* List all error counters in a node with the ``get-error-counters`` command.
>> +* Query error counters using the ``query-error-counter`` command.
>> +
>> +YAML-based Interface
>> +--------------------
>> +
>> +The interface is described in a YAML specification:
>> +
>> +:ref:`Documentation/netlink/specs/drm_ras.yaml`
>> +
>> +This YAML is used to auto-generate user space bindings via
>> +``tools/net/ynl/pyynl/ynl_gen_c.py``, and drives the structure of netlink
>> +attributes and operations.
>> +
>> +Usage Notes
>> +-----------
>> +
>> +* User space must first enumerate nodes to obtain their IDs.
>> +* Node IDs or Node names can be used for all further queries, such as error counters.
>> +* Error counters can be queried by either the Error ID or Error name.
>> +* Query Parameters should be defined as part of the uAPI to ensure user interface stability.
>> +* The interface supports future extension by adding new node types and
>> +  additional attributes.
>> +
>> +Example: List nodes using ynl
>> +
>> +.. code-block:: bash
>> +
>> +    sudo ynl --family drm_ras  --dump list-nodes
>> +    [{'device-name': '0000:03:00.0',
>> +    'node-id': 0,
>> +    'node-name': 'correctable-errors',
>> +    'node-type': 'error-counter'},
>> +    {'device-name': '0000:03:00.0',
>> +     'node-id': 1,
>> +    'node-name': 'nonfatal-errors',
>> +    'node-type': 'error-counter'},
>> +    {'device-name': '0000:03:00.0',
>> +    'node-id': 2,
>> +    'node-name': 'fatal-errors',
>> +    'node-type': 'error-counter'}]
>> +
>> +Example: List all error counters using ynl
>> +
>> +.. code-block:: bash
>> +
>> +
>> +   sudo ynl --family drm_ras  --dump get-error-counters --json '{"node-id":1}'
>> +   [{'error-id': 1, 'error-name': 'error_name_1', 'error-value': 0},
>> +   {'error-id': 2, 'error-name': 'error_name_2', 'error-value': 0}]
>> +
>> +
>> +Example: Query an error counter for a given node
>> +
>> +.. code-block:: bash
>> +
>> +   sudo ynl --family drm_ras --do query-error-counter  --json '{"node-id":2, "error-id":1}'
>> +   {'error-id': 1, 'error-name': 'error_name_1', 'error-value': 0}
>> +
>> diff --git a/Documentation/gpu/index.rst b/Documentation/gpu/index.rst
>> index 7dcb15850afd..60c73fdcfeed 100644
>> --- a/Documentation/gpu/index.rst
>> +++ b/Documentation/gpu/index.rst
>> @@ -9,6 +9,7 @@ GPU Driver Developer's Guide
>>      drm-mm
>>      drm-kms
>>      drm-kms-helpers
>> +   drm-ras
>>      drm-uapi
>>      drm-usage-stats
>>      driver-uapi
>> diff --git a/Documentation/netlink/specs/drm_ras.yaml b/Documentation/netlink/specs/drm_ras.yaml
>> new file mode 100644
>> index 000000000000..be0e379c5bc9
>> --- /dev/null
>> +++ b/Documentation/netlink/specs/drm_ras.yaml
>> @@ -0,0 +1,130 @@
>> +# SPDX-License-Identifier: ((GPL-2.0 WITH Linux-syscall-note) OR BSD-3-Clause)
>> +---
>> +name: drm-ras
>> +protocol: genetlink
>> +uapi-header: drm/drm_ras.h
>> +
>> +doc: >-
>> +  DRM RAS (Reliability, Availability, Serviceability) over Generic Netlink.
>> +  Provides a standardized mechanism for DRM drivers to register "nodes"
>> +  representing hardware/software components capable of reporting error counters.
>> +  Userspace tools can query the list of nodes or individual error counters
>> +  via the Generic Netlink interface.
>> +
>> +definitions:
>> +  -
>> +    type: enum
>> +    name: node-type
>> +    value-start: 1
>> +    entries: [error-counter]
>> +    doc: >-
>> +         Type of the node. Currently, only error-counter nodes are
>> +         supported, which expose reliability counters for a hardware/software
>> +         component.
>> +
>> +attribute-sets:
>> +  -
>> +    name: node-attrs
>> +    attributes:
>> +      -
>> +        name: node-id
>> +        type: u32
>> +        doc: >-
>> +             Unique identifier for the node.
>> +             Assigned dynamically by the DRM RAS core upon registration.
>> +      -
>> +        name: device-name
>> +        type: string
>> +        doc: >-
>> +             Device name chosen by the driver at registration.
>> +             Can be a PCI BDF, UUID, or module name if unique.
>> +      -
>> +        name: node-name
>> +        type: string
>> +        doc: >-
>> +             Node name chosen by the driver at registration.
>> +             Can be an IP block name, or any name that identifies the
>> +             RAS node inside the device.
>> +      -
>> +        name: node-type
>> +        type: u32
>> +        doc: Type of this node, identifying its function.
>> +        enum: node-type
>> +  -
>> +    name: error-counter-attrs
>> +    attributes:
>> +      -
>> +        name: node-id
>> +        type: u32
>> +        doc:  Node ID targeted by this error counter operation.
>> +      -
>> +        name: error-id
>> +        type: u32
>> +        doc: Unique identifier for a specific error counter within a node.
>> +      -
>> +        name: error-name
>> +        type: string
>> +        doc: Name of the error.
>> +      -
>> +        name: error-value
>> +        type: u32
>> +        doc: Current value of the requested error counter.
>> +
>> +operations:
>> +  list:
>> +    -
>> +      name: list-nodes
>> +      doc: >-
>> +           Retrieve the full list of currently registered DRM RAS nodes.
>> +           Each node includes its dynamically assigned ID, name, and type.
>> +           **Important:** User space must call this operation first to obtain
>> +           the node IDs. These IDs are required for all subsequent
>> +           operations on nodes, such as querying error counters.
>> +      attribute-set: node-attrs
>> +      flags: [admin-perm]
>> +      dump:
>> +        reply:
>> +          attributes:
>> +            - node-id
>> +            - device-name
>> +            - node-name
>> +            - node-type
>> +    -
>> +      name: get-error-counters
>> +      doc: >-
>> +           Retrieve the full list of error counters for a given node.
>> +           The response includes the id, the name, and even the current
>> +           value of each counter.
>> +      attribute-set: error-counter-attrs
>> +      flags: [admin-perm]
>> +      dump:
>> +        request:
>> +          attributes:
>> +            - node-id
>> +        reply:
>> +          attributes:
>> +            - error-id
>> +            - error-name
>> +            - error-value
>> +    -
>> +      name: query-error-counter
>> +      doc: >-
>> +           Query the information of a specific error counter for a given node.
>> +           Users must provide the node ID and the error counter ID.
>> +           The response contains the id, the name, and the current value
>> +           of the counter.
>> +      attribute-set: error-counter-attrs
>> +      flags: [admin-perm]
>> +      do:
>> +        request:
>> +          attributes:
>> +            - node-id
>> +            - error-id
>> +        reply:
>> +          attributes:
>> +            - error-id
>> +            - error-name
>> +            - error-value
>> +
>> +kernel-family:
>> +  headers: ["drm/drm_ras_nl.h"]
>> diff --git a/drivers/gpu/drm/Kconfig b/drivers/gpu/drm/Kconfig
>> index a33b90251530..f378e77048c8 100644
>> --- a/drivers/gpu/drm/Kconfig
>> +++ b/drivers/gpu/drm/Kconfig
>> @@ -130,6 +130,15 @@ config DRM_PANIC_SCREEN_QR_VERSION
>>         Smaller QR code are easier to read, but will contain less debugging
>>         data. Default is 40.
>> +config DRM_RAS
>> +    bool "DRM RAS support"
>> +    depends on DRM
>> +    help
>> +      Enables the DRM RAS (Reliability, Availability and Serviceability)
>> +      support for DRM drivers. This provides a Generic Netlink interface
>> +      for error reporting and queries.
>> +      If in doubt, say "N".
>> +
>>   config DRM_DEBUG_DP_MST_TOPOLOGY_REFS
>>           bool "Enable refcount backtrace history in the DP MST helpers"
>>       depends on STACKTRACE_SUPPORT
>> diff --git a/drivers/gpu/drm/Makefile b/drivers/gpu/drm/Makefile
>> index 0deee72ef935..2eea3f54db53 100644
>> --- a/drivers/gpu/drm/Makefile
>> +++ b/drivers/gpu/drm/Makefile
>> @@ -95,6 +95,7 @@ drm-$(CONFIG_DRM_ACCEL) += ../../accel/drm_accel.o
>>   drm-$(CONFIG_DRM_PANIC) += drm_panic.o
>>   drm-$(CONFIG_DRM_DRAW) += drm_draw.o
>>   drm-$(CONFIG_DRM_PANIC_SCREEN_QR_CODE) += drm_panic_qr.o
>> +drm-$(CONFIG_DRM_RAS) += drm_ras.o drm_ras_nl.o drm_ras_genl_family.o
>>   obj-$(CONFIG_DRM)    += drm.o
>>   obj-$(CONFIG_DRM_PANEL_ORIENTATION_QUIRKS) += drm_panel_orientation_quirks.o
>> diff --git a/drivers/gpu/drm/drm_drv.c b/drivers/gpu/drm/drm_drv.c
>> index 2915118436ce..6b965c3d3307 100644
>> --- a/drivers/gpu/drm/drm_drv.c
>> +++ b/drivers/gpu/drm/drm_drv.c
>> @@ -53,6 +53,7 @@
>>   #include <drm/drm_panic.h>
>>   #include <drm/drm_print.h>
>>   #include <drm/drm_privacy_screen_machine.h>
>> +#include <drm/drm_ras_genl_family.h>
>>   #include "drm_crtc_internal.h"
>>   #include "drm_internal.h"
>> @@ -1223,6 +1224,7 @@ static const struct file_operations drm_stub_fops = {
>>   static void drm_core_exit(void)
>>   {
>> +    drm_ras_genl_family_unregister();
>>       drm_privacy_screen_lookup_exit();
>>       drm_panic_exit();
>>       accel_core_exit();
>> @@ -1261,6 +1263,10 @@ static int __init drm_core_init(void)
>>       drm_privacy_screen_lookup_init();
>> +    ret = drm_ras_genl_family_register();
>> +    if (ret < 0)
>> +        goto error;
>> +
>>       drm_core_init_complete = true;
>>       DRM_DEBUG("Initialized\n");
>> diff --git a/drivers/gpu/drm/drm_ras.c b/drivers/gpu/drm/drm_ras.c
>> new file mode 100644
>> index 000000000000..7bc77ea24fe2
>> --- /dev/null
>> +++ b/drivers/gpu/drm/drm_ras.c
>> @@ -0,0 +1,351 @@
>> +// SPDX-License-Identifier: MIT
>> +/*
>> + * Copyright © 2025 Intel Corporation
>> + */
>> +
>> +#include <linux/module.h>
>> +#include <linux/kernel.h>
>> +#include <linux/netdevice.h>
>> +#include <linux/xarray.h>
>> +#include <net/genetlink.h>
>> +
>> +#include <drm/drm_ras.h>
>> +
>> +/**
>> + * DOC: DRM RAS Node Management
>> + *
>> + * This module provides the infrastructure to manage RAS (Reliability,
>> + * Availability, and Serviceability) nodes for DRM drivers. Each
>> + * DRM driver may register one or more RAS nodes, which represent
>> + * logical components capable of reporting error counters and other
>> + * reliability metrics.
>> + *
>> + * The nodes are stored in a global xarray `drm_ras_xa` to allow
>> + * efficient lookup by ID. Nodes can be registered or unregistered
>> + * dynamically at runtime.
>> + *
>> + * A Generic Netlink family `drm_ras` exposes three main operations to
>> + * userspace:
>> + *
>> + * 1. LIST_NODES: Dump all currently registered RAS nodes.
>> + *    The user receives an array of node IDs, names, and types.
>> + *
>> + * 2. GET_ERROR_COUNTERS: Dump all error counters of a given node.
>> + *    The user receives an array of error IDs, names, and current value.
>> + *
>> + * 3. QUERY_ERROR_COUNTER: Query a specific error counter for a given node.
>> + *    Userspace must provide the node ID and the counter ID, and
>> + *    receives the ID, the error name, and its current value.
>> + *
>> + * Node registration:
>> + * - drm_ras_node_register(): Registers a new node and assigns
>> + *   it a unique ID in the xarray.
>> + * - drm_ras_node_unregister(): Removes a previously registered
>> + *   node from the xarray.
>> + *
>> + * Node type:
>> + * - ERROR_COUNTER:
>> + *     + Currently, only error counters are supported.
>> + *     + The driver must implement the query_error_counter() callback to provide
>> + *       the name and the value of the error counter.
>> + *     + The driver must provide an error_counter_range.last value informing the
>> + *       last valid error ID.
>> + *     + The driver can provide an error_counter_range.first value informing the
>> + *       first valid error ID.
>> + *     + The error counters in the driver don't need to be contiguous, but the
>> + *       driver must return -ENOENT from query_error_counter() as an indication
>> + *       that the ID should be skipped and not listed in the netlink API.
>> + *
>> + * Netlink handlers:
>> + * - drm_ras_nl_list_nodes_dumpit(): Implements the LIST_NODES
>> + *   operation, iterating over the xarray.
>> + * - drm_ras_nl_get_error_counters_dumpit(): Implements the GET_ERROR_COUNTERS
>> + *   operation, iterating over the known valid error_counter_range.
>> + * - drm_ras_nl_query_error_counter_doit(): Implements the QUERY_ERROR_COUNTER
>> + *   operation, fetching a counter value from a specific node.
>> + */
>> +
>> +static DEFINE_XARRAY_ALLOC(drm_ras_xa);
>> +
>> +/*
>> + * The netlink callback context carries dump state across multiple dumpit calls
>> + */
>> +struct drm_ras_ctx {
>> +    /* Which xarray id to restart the dump from */
>> +    unsigned long restart;
>> +};
>> +
>> +/**
>> + * drm_ras_nl_list_nodes_dumpit() - Dump all registered RAS nodes
>> + * @skb: Netlink message buffer
>> + * @cb: Callback context for multi-part dumps
>> + *
>> + * Iterates over all registered RAS nodes in the global xarray and appends
>> + * their attributes (ID, name, type) to the given netlink message buffer.
>> + * Uses @cb->ctx to track progress in case the message buffer fills up, allowing
>> + * multi-part dump support. On buffer overflow, updates the context to resume
>> + * from the last node on the next invocation.
>> + *
>> + * Return: 0 if all nodes fit in @skb, number of bytes added to @skb if
>> + *          the buffer filled up (requires multi-part continuation), or
>> + *          a negative error code on failure.
>> + */
>> +int drm_ras_nl_list_nodes_dumpit(struct sk_buff *skb,
>> +                 struct netlink_callback *cb)
>> +{
>> +    const struct genl_info *info = genl_info_dump(cb);
>> +    struct drm_ras_ctx *ctx = (void *)cb->ctx;
>> +    struct drm_ras_node *node;
>> +    struct nlattr *hdr;
>> +    unsigned long id;
>> +    int ret;
>> +
>> +    xa_for_each_start(&drm_ras_xa, id, node, ctx->restart) {
>> +        hdr = genlmsg_iput(skb, info);
>> +        if (!hdr) {
>> +            ret = -EMSGSIZE;
>> +            break;
>> +        }
>> +
>> +        ret = nla_put_u32(skb, DRM_RAS_A_NODE_ATTRS_NODE_ID, node->id);
>> +        if (ret) {
>> +            genlmsg_cancel(skb, hdr);
>> +            break;
>> +        }
>> +
>> +        ret = nla_put_string(skb, DRM_RAS_A_NODE_ATTRS_DEVICE_NAME,
>> +                     node->device_name);
>> +        if (ret) {
>> +            genlmsg_cancel(skb, hdr);
>> +            break;
>> +        }
>> +
>> +        ret = nla_put_string(skb, DRM_RAS_A_NODE_ATTRS_NODE_NAME,
>> +                     node->node_name);
>> +        if (ret) {
>> +            genlmsg_cancel(skb, hdr);
>> +            break;
>> +        }
>> +
>> +        ret = nla_put_u32(skb, DRM_RAS_A_NODE_ATTRS_NODE_TYPE,
>> +                  node->type);
>> +        if (ret) {
>> +            genlmsg_cancel(skb, hdr);
>> +            break;
>> +        }
>> +
>> +        genlmsg_end(skb, hdr);
>> +    }
>> +
>> +    if (ret == -EMSGSIZE)
>> +        ctx->restart = id;
>> +
>> +    return ret;
>> +}
>> +
>> +static int get_node_error_counter(u32 node_id, u32 error_id,
>> +                  const char **name, u32 *value)
>> +{
>> +    struct drm_ras_node *node;
>> +
>> +    node = xa_load(&drm_ras_xa, node_id);
>> +    if (!node || !node->query_error_counter)
>> +        return -ENOENT;
>> +
>> +    if (error_id < node->error_counter_range.first ||
>> +        error_id > node->error_counter_range.last)
>> +        return -EINVAL;
>> +
>> +    return node->query_error_counter(node, error_id, name, value);
>> +}
>> +
>> +static int msg_reply_value(struct sk_buff *msg, u32 error_id,
>> +               const char *error_name, u32 value)
>> +{
>> +    int ret;
>> +
>> +    ret = nla_put_u32(msg, DRM_RAS_A_ERROR_COUNTER_ATTRS_ERROR_ID, error_id);
>> +    if (ret)
>> +        return ret;
>> +
>> +    ret = nla_put_string(msg, DRM_RAS_A_ERROR_COUNTER_ATTRS_ERROR_NAME,
>> +                 error_name);
>> +    if (ret)
>> +        return ret;
>> +
>> +    return nla_put_u32(msg, DRM_RAS_A_ERROR_COUNTER_ATTRS_ERROR_VALUE,
>> +               value);
>> +}
>> +
>> +static int doit_reply_value(struct genl_info *info, u32 node_id,
>> +                u32 error_id)
>> +{
>> +    struct sk_buff *msg;
>> +    struct nlattr *hdr;
>> +    const char *error_name;
>> +    u32 value;
>> +    int ret;
>> +
>> +    msg = genlmsg_new(NLMSG_GOODSIZE, GFP_KERNEL);
>> +    if (!msg)
>> +        return -ENOMEM;
>> +
>> +    hdr = genlmsg_iput(msg, info);
>> +    if (!hdr) {
>> +        nlmsg_free(msg);
>> +        return -EMSGSIZE;
>> +    }
>> +
>> +    ret = get_node_error_counter(node_id, error_id,
>> +                     &error_name, &value);
>> +    if (ret)
>> +        return ret;
>> +
>> +    ret = msg_reply_value(msg, error_id, error_name, value);
>> +    if (ret) {
>> +        genlmsg_cancel(msg, hdr);
>> +        nlmsg_free(msg);
>> +        return ret;
>> +    }
>> +
>> +    genlmsg_end(msg, hdr);
>> +
>> +    return genlmsg_reply(msg, info);
>> +}
>> +
>> +/**
>> + * drm_ras_nl_get_error_counters_dumpit() - Dump all Error Counters
>> + * @skb: Netlink message buffer
>> + * @cb: Callback context for multi-part dumps
>> + *
>> + * Iterates over all error counters in a given Node and appends
>> + * their attributes (ID, name, value) to the given netlink message buffer.
>> + * Uses @cb->ctx to track progress in case the message buffer fills up, allowing
>> + * multi-part dump support. On buffer overflow, updates the context to resume
>> + * from the last error counter on the next invocation.
>> + *
>> + * Return: 0 if all errors fit in @skb, number of bytes added to @skb if
>> + *          the buffer filled up (requires multi-part continuation), or
>> + *          a negative error code on failure.
>> + */
>> +int drm_ras_nl_get_error_counters_dumpit(struct sk_buff *skb,
>> +                     struct netlink_callback *cb)
>> +{
>> +    const struct genl_info *info = genl_info_dump(cb);
>> +    struct drm_ras_ctx *ctx = (void *)cb->ctx;
>> +    struct drm_ras_node *node;
>> +    struct nlattr *hdr;
>> +    const char *error_name;
>> +    u32 node_id, error_id, value;
>> +    int ret;
>> +
>> +    if (!info->attrs || !info->attrs[DRM_RAS_A_ERROR_COUNTER_ATTRS_NODE_ID])
>> +        return -EINVAL;
>> +
>> +    node_id = nla_get_u32(info->attrs[DRM_RAS_A_ERROR_COUNTER_ATTRS_NODE_ID]);
>> +
>> +    node = xa_load(&drm_ras_xa, node_id);
>> +    if (!node)
>> +        return -ENOENT;
>> +
>> +    for (error_id = max(node->error_counter_range.first, ctx->restart);
>> +         error_id <= node->error_counter_range.last;
>> +         error_id++) {
>> +        ret = get_node_error_counter(node_id, error_id,
>> +                         &error_name, &value);
>> +        /*
>> +         * For non-contiguous range, driver return -ENOENT as indication
>> +         * to skip this ID when listing all errors.
>> +         */
>> +        if (ret == -ENOENT)
>> +            continue;
>> +        if (ret)
>> +            return ret;
>> +
>> +        hdr = genlmsg_iput(skb, info);
>> +
>> +        if (!hdr) {
>> +            ret = -EMSGSIZE;
>> +            break;
>> +        }
>> +
>> +        ret = msg_reply_value(skb, error_id, error_name, value);
>> +        if (ret) {
>> +            genlmsg_cancel(skb, hdr);
>> +            break;
>> +        }
>> +
>> +        genlmsg_end(skb, hdr);
>> +    }
>> +
>> +    if (ret == -EMSGSIZE)
>> +        ctx->restart = error_id;
>> +
>> +    return ret;
>> +}
>> +
>> +/**
>> + * drm_ras_nl_query_error_counter_doit() - Query an error counter of a node
>> + * @skb: Netlink message buffer
>> + * @info: Generic Netlink info containing attributes of the request
>> + *
>> + * Extracts the node ID and error ID from the netlink attributes and
>> + * retrieves the current value of the corresponding error counter. Sends the
>> + * result back to the requesting user via the standard Genl reply.
>> + *
>> + * Return: 0 on success, or negative errno on failure.
>> + */
>> +int drm_ras_nl_query_error_counter_doit(struct sk_buff *skb,
>> +                    struct genl_info *info)
>> +{
>> +    u32 node_id, error_id;
>> +
>> +    if (!info->attrs ||
>> +        !info->attrs[DRM_RAS_A_ERROR_COUNTER_ATTRS_NODE_ID] ||
>> +        !info->attrs[DRM_RAS_A_ERROR_COUNTER_ATTRS_ERROR_ID])
>> +        return -EINVAL;
>> +
>> +    node_id = nla_get_u32(info->attrs[DRM_RAS_A_ERROR_COUNTER_ATTRS_NODE_ID]);
>> +    error_id = nla_get_u32(info->attrs[DRM_RAS_A_ERROR_COUNTER_ATTRS_ERROR_ID]);
>> +
>> +    return doit_reply_value(info, node_id, error_id);
>> +}
>> +
>> +/**
>> + * drm_ras_node_register() - Register a new RAS node
>> + * @node: Node structure to register
>> + *
>> + * Adds the given RAS node to the global node xarray and assigns it a unique
>> + * ID. @node->device_name, @node->node_name and @node->type must be valid.
>> + *
>> + * Return: 0 on success, or negative errno on failure.
>> + */
>> +int drm_ras_node_register(struct drm_ras_node *node)
>> +{
>> +    if (!node->device_name || !node->node_name)
>> +        return -EINVAL;
>> +
>> +    /* Currently, only Error Counter Endpoints are supported */
>> +    if (node->type != DRM_RAS_NODE_TYPE_ERROR_COUNTER)
>> +        return -EINVAL;
>> +
>> +    /* Mandatory entries for Error Counter Node */
>> +    if (node->type == DRM_RAS_NODE_TYPE_ERROR_COUNTER &&
>> +        (!node->error_counter_range.last || !node->query_error_counter))
>> +        return -EINVAL;
>> +
>> +    return xa_alloc(&drm_ras_xa, &node->id, node, xa_limit_32b, GFP_KERNEL);
>> +}
>> +EXPORT_SYMBOL(drm_ras_node_register);
>> +
>> +/**
>> + * drm_ras_node_unregister() - Unregister a previously registered node
>> + * @node: Node structure to unregister
>> + *
>> + * Removes the given node from the global node xarray using its ID.
>> + */
>> +void drm_ras_node_unregister(struct drm_ras_node *node)
>> +{
>> +    xa_erase(&drm_ras_xa, node->id);
>> +}
>> +EXPORT_SYMBOL(drm_ras_node_unregister);
>> diff --git a/drivers/gpu/drm/drm_ras_genl_family.c b/drivers/gpu/drm/drm_ras_genl_family.c
>> new file mode 100644
>> index 000000000000..2d818b8c3808
>> --- /dev/null
>> +++ b/drivers/gpu/drm/drm_ras_genl_family.c
>> @@ -0,0 +1,42 @@
>> +// SPDX-License-Identifier: MIT
>> +/*
>> + * Copyright © 2025 Intel Corporation
>> + */
>> +
>> +#include <drm/drm_ras_genl_family.h>
>> +#include <drm/drm_ras_nl.h>
>> +
>> +/* Track family registration so the drm_exit can be called at any time */
>> +static bool registered;
>> +
>> +/**
>> + * drm_ras_genl_family_register() - Register drm-ras genl family
>> + *
>> + * Only to be called once, from drm_core_init()
>> + */
>> +int drm_ras_genl_family_register(void)
>> +{
>> +    int ret;
>> +
>> +    registered = false;
>> +
>> +    ret = genl_register_family(&drm_ras_nl_family);
>> +    if (ret)
>> +        return ret;
>> +
>> +    registered = true;
>> +    return 0;
>> +}
>> +
>> +/**
>> + * drm_ras_genl_family_unregister() - Unregister drm-ras genl family
>> + *
>> + * To be called from drm_core_exit() at any moment, but only once.
>> + */
>> +void drm_ras_genl_family_unregister(void)
>> +{
>> +    if (registered) {
>> +        genl_unregister_family(&drm_ras_nl_family);
>> +        registered = false;
>> +    }
>> +}
>> diff --git a/drivers/gpu/drm/drm_ras_nl.c b/drivers/gpu/drm/drm_ras_nl.c
>> new file mode 100644
>> index 000000000000..fcd1392410e4
>> --- /dev/null
>> +++ b/drivers/gpu/drm/drm_ras_nl.c
>> @@ -0,0 +1,54 @@
>> +// SPDX-License-Identifier: ((GPL-2.0 WITH Linux-syscall-note) OR BSD-3-Clause)
>> +/* Do not edit directly, auto-generated from: */
>> +/*    Documentation/netlink/specs/drm_ras.yaml */
>> +/* YNL-GEN kernel source */
>> +
>> +#include <net/netlink.h>
>> +#include <net/genetlink.h>
>> +
>> +#include <uapi/drm/drm_ras.h>
>> +#include <drm/drm_ras_nl.h>
>> +
>> +/* DRM_RAS_CMD_GET_ERROR_COUNTERS - dump */
>> +static const struct nla_policy drm_ras_get_error_counters_nl_policy[DRM_RAS_A_ERROR_COUNTER_ATTRS_NODE_ID + 1] = {
>> +    [DRM_RAS_A_ERROR_COUNTER_ATTRS_NODE_ID] = { .type = NLA_U32, },
>> +};
>> +
>> +/* DRM_RAS_CMD_QUERY_ERROR_COUNTER - do */
>> +static const struct nla_policy drm_ras_query_error_counter_nl_policy[DRM_RAS_A_ERROR_COUNTER_ATTRS_ERROR_ID + 1] = {
>> +    [DRM_RAS_A_ERROR_COUNTER_ATTRS_NODE_ID] = { .type = NLA_U32, },
>> +    [DRM_RAS_A_ERROR_COUNTER_ATTRS_ERROR_ID] = { .type = NLA_U32, },
>> +};
>> +
>> +/* Ops table for drm_ras */
>> +static const struct genl_split_ops drm_ras_nl_ops[] = {
>> +    {
>> +        .cmd    = DRM_RAS_CMD_LIST_NODES,
>> +        .dumpit    = drm_ras_nl_list_nodes_dumpit,
>> +        .flags    = GENL_ADMIN_PERM | GENL_CMD_CAP_DUMP,
>> +    },
>> +    {
>> +        .cmd        = DRM_RAS_CMD_GET_ERROR_COUNTERS,
>> +        .dumpit        = drm_ras_nl_get_error_counters_dumpit,
>> +        .policy        = drm_ras_get_error_counters_nl_policy,
>> +        .maxattr    = DRM_RAS_A_ERROR_COUNTER_ATTRS_NODE_ID,
>> +        .flags        = GENL_ADMIN_PERM | GENL_CMD_CAP_DUMP,
>> +    },
>> +    {
>> +        .cmd        = DRM_RAS_CMD_QUERY_ERROR_COUNTER,
>> +        .doit        = drm_ras_nl_query_error_counter_doit,
>> +        .policy        = drm_ras_query_error_counter_nl_policy,
>> +        .maxattr    = DRM_RAS_A_ERROR_COUNTER_ATTRS_ERROR_ID,
>> +        .flags        = GENL_ADMIN_PERM | GENL_CMD_CAP_DO,
>> +    },
>> +};
>> +
>> +struct genl_family drm_ras_nl_family __ro_after_init = {
>> +    .name        = DRM_RAS_FAMILY_NAME,
>> +    .version    = DRM_RAS_FAMILY_VERSION,
>> +    .netnsok    = true,
>> +    .parallel_ops    = true,
>> +    .module        = THIS_MODULE,
>> +    .split_ops    = drm_ras_nl_ops,
>> +    .n_split_ops    = ARRAY_SIZE(drm_ras_nl_ops),
>> +};
>> diff --git a/include/drm/drm_ras.h b/include/drm/drm_ras.h
>> new file mode 100644
>> index 000000000000..bba47a282ef8
>> --- /dev/null
>> +++ b/include/drm/drm_ras.h
>> @@ -0,0 +1,76 @@
>> +/* SPDX-License-Identifier: MIT */
>> +/*
>> + * Copyright © 2025 Intel Corporation
>> + */
>> +
>> +#ifndef __DRM_RAS_H__
>> +#define __DRM_RAS_H__
>> +
>> +#include "drm_ras_nl.h"
>> +
>> +/**
>> + * struct drm_ras_node - A DRM RAS Node
>> + */
>> +struct drm_ras_node {
>> +    /** @id: Unique identifier for the node. Dynamically assigned. */
>> +    u32 id;
>> +    /**
>> +     * @device_name: Human-readable name of the device. Given by the driver.
>> +     */
>> +    const char *device_name;
>> +    /** @node_name: Human-readable name of the node. Given by the driver. */
>> +    const char *node_name;
>> +    /** @type: Type of the node (enum drm_ras_node_type). */
>> +    enum drm_ras_node_type type;
>> +
>> +    /* Error-Counter Related Callback and Variables */
>> +
>> +    /** @error_counter_range: Range of valid Error IDs for this node. */
>> +    struct {
>> +        /** @first: First valid Error ID. */
>> +        u32 first;
>> +        /** @last: Last valid Error ID. Mandatory entry. */
>> +        u32 last;
>> +    } error_counter_range;
>> +
>> +    /**
>> +     * @query_error_counter:
>> +     *
>> +     * This callback is used by drm-ras to query a specific error
>> +     * counter supported by this node. It is also used to iterate
>> +     * over all counters.
>> +     *
>> +     * Driver should expect query_error_counter() to be called with
>> +     * error_id from `error_counter_range.first` to
>> +     * `error_counter_range.last`.
>> +     *
>> +     * The @query_error_counter is a mandatory callback for
>> +     * error_counter_node.
>> +     *
>> +     * Returns: 0 on success,
>> +     *          -ENOENT when error_id is not supported as an indication that
>> +     *                  drm_ras should silently skip this entry. Used for
>> +     *                  supporting non-contiguous error ranges.
>> +     *                  Driver is responsible for maintaining the list of
>> +     *                  supported error IDs in the range of first to last.
>> +     *          Other negative values on errors that should terminate the
>> +     *          netlink query.
>> +     */
>> +    int (*query_error_counter)(struct drm_ras_node *ep, u32 error_id,
>> +                   const char **name, u32 *val);
>> +
>> +    /** @priv: Driver private data */
>> +    void *priv;
>> +};
>> +
>> +struct drm_device;
>> +
>> +#if IS_ENABLED(CONFIG_DRM_RAS)
>> +int drm_ras_node_register(struct drm_ras_node *ep);
>> +void drm_ras_node_unregister(struct drm_ras_node *ep);
>> +#else
>> +static inline int drm_ras_node_register(struct drm_ras_node *ep) { return 0; }
>> +static inline void drm_ras_node_unregister(struct drm_ras_node *ep) { }
>> +#endif
>> +
>> +#endif
>> diff --git a/include/drm/drm_ras_genl_family.h b/include/drm/drm_ras_genl_family.h
>> new file mode 100644
>> index 000000000000..5931b53429f1
>> --- /dev/null
>> +++ b/include/drm/drm_ras_genl_family.h
>> @@ -0,0 +1,17 @@
>> +/* SPDX-License-Identifier: MIT */
>> +/*
>> + * Copyright © 2025 Intel Corporation
>> + */
>> +
>> +#ifndef __DRM_RAS_GENL_FAMILY_H__
>> +#define __DRM_RAS_GENL_FAMILY_H__
>> +
>> +#if IS_ENABLED(CONFIG_DRM_RAS)
>> +int drm_ras_genl_family_register(void);
>> +void drm_ras_genl_family_unregister(void);
>> +#else
>> +static inline int drm_ras_genl_family_register(void) { return 0; }
>> +static inline void drm_ras_genl_family_unregister(void) { }
>> +#endif
>> +
>> +#endif
>> diff --git a/include/drm/drm_ras_nl.h b/include/drm/drm_ras_nl.h
>> new file mode 100644
>> index 000000000000..9613b7d9ffdb
>> --- /dev/null
>> +++ b/include/drm/drm_ras_nl.h
>> @@ -0,0 +1,24 @@
>> +/* SPDX-License-Identifier: ((GPL-2.0 WITH Linux-syscall-note) OR BSD-3-Clause) */
>> +/* Do not edit directly, auto-generated from: */
>> +/*    Documentation/netlink/specs/drm_ras.yaml */
>> +/* YNL-GEN kernel header */
>> +
>> +#ifndef _LINUX_DRM_RAS_GEN_H
>> +#define _LINUX_DRM_RAS_GEN_H
>> +
>> +#include <net/netlink.h>
>> +#include <net/genetlink.h>
>> +
>> +#include <uapi/drm/drm_ras.h>
>> +#include <drm/drm_ras_nl.h>
>> +
>> +int drm_ras_nl_list_nodes_dumpit(struct sk_buff *skb,
>> +                 struct netlink_callback *cb);
>> +int drm_ras_nl_get_error_counters_dumpit(struct sk_buff *skb,
>> +                     struct netlink_callback *cb);
>> +int drm_ras_nl_query_error_counter_doit(struct sk_buff *skb,
>> +                    struct genl_info *info);
>> +
>> +extern struct genl_family drm_ras_nl_family;
>> +
>> +#endif /* _LINUX_DRM_RAS_GEN_H */
>> diff --git a/include/uapi/drm/drm_ras.h b/include/uapi/drm/drm_ras.h
>> new file mode 100644
>> index 000000000000..3415ba345ac8
>> --- /dev/null
>> +++ b/include/uapi/drm/drm_ras.h
>> @@ -0,0 +1,49 @@
>> +/* SPDX-License-Identifier: ((GPL-2.0 WITH Linux-syscall-note) OR BSD-3-Clause) */
>> +/* Do not edit directly, auto-generated from: */
>> +/*    Documentation/netlink/specs/drm_ras.yaml */
>> +/* YNL-GEN uapi header */
>> +
>> +#ifndef _UAPI_LINUX_DRM_RAS_H
>> +#define _UAPI_LINUX_DRM_RAS_H
>> +
>> +#define DRM_RAS_FAMILY_NAME    "drm-ras"
>> +#define DRM_RAS_FAMILY_VERSION    1
>> +
>> +/*
>> + * Type of the node. Currently, only error-counter nodes are supported, which
>> + * expose reliability counters for a hardware/software component.
>> + */
>> +enum drm_ras_node_type {
>> +    DRM_RAS_NODE_TYPE_ERROR_COUNTER = 1,
>> +};
>> +
>> +enum {
>> +    DRM_RAS_A_NODE_ATTRS_NODE_ID = 1,
>> +    DRM_RAS_A_NODE_ATTRS_DEVICE_NAME,
>> +    DRM_RAS_A_NODE_ATTRS_NODE_NAME,
>> +    DRM_RAS_A_NODE_ATTRS_NODE_TYPE,
>> +
>> +    __DRM_RAS_A_NODE_ATTRS_MAX,
>> +    DRM_RAS_A_NODE_ATTRS_MAX = (__DRM_RAS_A_NODE_ATTRS_MAX - 1)
>> +};
>> +
>> +enum {
>> +    DRM_RAS_A_ERROR_COUNTER_ATTRS_NODE_ID = 1,
>> +    DRM_RAS_A_ERROR_COUNTER_ATTRS_ERROR_ID,
>> +    DRM_RAS_A_ERROR_COUNTER_ATTRS_ERROR_NAME,
>> +    DRM_RAS_A_ERROR_COUNTER_ATTRS_ERROR_VALUE,
>> +
>> +    __DRM_RAS_A_ERROR_COUNTER_ATTRS_MAX,
>> +    DRM_RAS_A_ERROR_COUNTER_ATTRS_MAX = (__DRM_RAS_A_ERROR_COUNTER_ATTRS_MAX - 1)
>> +};
>> +
>> +enum {
>> +    DRM_RAS_CMD_LIST_NODES = 1,
>> +    DRM_RAS_CMD_GET_ERROR_COUNTERS,
>> +    DRM_RAS_CMD_QUERY_ERROR_COUNTER,
>> +
>> +    __DRM_RAS_CMD_MAX,
>> +    DRM_RAS_CMD_MAX = (__DRM_RAS_CMD_MAX - 1)
>> +};
>> +
>> +#endif /* _UAPI_LINUX_DRM_RAS_H */
> 
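
For anyone following the GET_ERROR_COUNTERS semantics in the quoted patch, here is a minimal userspace sketch of the range walk: IDs for which the callback returns -ENOENT are skipped, and the ID to resume from is saved when the reply runs out of room. This is only a model under stated assumptions, not the kernel code. struct model_node, model_query(), model_dump_pass(), model_dump_all() and the counter names are all hypothetical, and the "room" counter merely stands in for the skb filling up.

```c
#include <assert.h>
#include <errno.h>
#include <inttypes.h>
#include <stdint.h>
#include <stdio.h>

/* Hypothetical userspace stand-in for the error-counter part of a RAS node. */
struct model_node {
	uint32_t first;	/* first valid error ID */
	uint32_t last;	/* last valid error ID */
	int (*query)(uint32_t error_id, const char **name, uint32_t *value);
};

/* Hypothetical driver callback: IDs 1, 2 and 5 exist; 3 and 4 are gaps. */
static int model_query(uint32_t error_id, const char **name, uint32_t *value)
{
	switch (error_id) {
	case 1: *name = "example-a"; *value = 7; return 0;
	case 2: *name = "example-b"; *value = 0; return 0;
	case 5: *name = "example-c"; *value = 3; return 0;
	default: return -ENOENT;
	}
}

/*
 * One modelled dump pass. Emits at most @room counters starting at *@restart.
 * Returns 0 once the range is exhausted, or -EMSGSIZE after saving the ID to
 * resume from when @room runs out. Other callback errors are passed through.
 */
static int model_dump_pass(const struct model_node *node, uint32_t *restart,
			   int room, int *emitted)
{
	uint32_t id = *restart > node->first ? *restart : node->first;
	const char *name;
	uint32_t value;
	int ret;

	assert(node->query);

	for (; id <= node->last; id++) {
		ret = node->query(id, &name, &value);
		if (ret == -ENOENT)	/* gap in the range: skip silently */
			continue;
		if (ret)
			return ret;
		if (room == 0) {	/* "buffer" full: remember where to resume */
			*restart = id;
			return -EMSGSIZE;
		}
		printf("id=%" PRIu32 " name=%s value=%" PRIu32 "\n",
		       id, name, value);
		room--;
		(*emitted)++;
	}
	return 0;
}

/* Repeat passes until the range is exhausted; returns the counters emitted. */
static int model_dump_all(const struct model_node *node, int room_per_pass)
{
	uint32_t restart = 0;
	int emitted = 0;
	int ret;

	assert(room_per_pass > 0);	/* a pass must be able to make progress */

	do {
		ret = model_dump_pass(node, &restart, room_per_pass, &emitted);
	} while (ret == -EMSGSIZE);

	return ret ? ret : emitted;
}
```

With a node covering IDs 1 to 5 and room for two entries per pass, the first pass emits IDs 1 and 2 and saves 5 as the restart point, because 3 and 4 are gaps. The model keeps the range small on purpose: a `last` of UINT32_MAX would make the `id <= last` loop condition always true.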



end of thread, other threads:[~2026-02-02  6:21 UTC | newest]

Thread overview: 22+ messages
2026-01-19  4:00 [PATCH v4 0/4] Introduce DRM_RAS using generic netlink for RAS Riana Tauro
2026-01-19  3:36 ` ✗ CI.checkpatch: warning for Introduce DRM_RAS using generic netlink for RAS (rev4) Patchwork
2026-01-19  3:37 ` ✓ CI.KUnit: success " Patchwork
2026-01-19  3:52 ` ✗ CI.checksparse: warning " Patchwork
2026-01-19  4:00 ` [PATCH v4 1/4] drm/ras: Introduce the DRM RAS infrastructure over generic netlink Riana Tauro
2026-01-22 21:51   ` Zack McKevitt
2026-02-02  6:20     ` Riana Tauro
2026-01-19  4:00 ` [PATCH v4 2/4] drm/xe/xe_drm_ras: Add support for drm ras Riana Tauro
2026-01-20 17:01   ` Raag Jadav
2026-01-28  6:51     ` Riana Tauro
2026-01-28  7:15       ` Raag Jadav
2026-01-28  7:34         ` Riana Tauro
2026-01-19  4:00 ` [PATCH v4 3/4] drm/xe/xe_hw_error: Add support for GT hardware errors Riana Tauro
2026-01-19  9:06   ` kernel test robot
2026-01-21  7:09   ` Raag Jadav
2026-01-27  8:29     ` Riana Tauro
2026-01-27 10:12       ` Raag Jadav
2026-01-19  4:00 ` [PATCH v4 4/4] drm/xe/xe_hw_error: Add support for PVC SOC errors Riana Tauro
2026-01-23 10:33   ` Raag Jadav
2026-01-27  9:43     ` Riana Tauro
2026-01-19  4:11 ` ✗ Xe.CI.BAT: failure for Introduce DRM_RAS using generic netlink for RAS (rev4) Patchwork
2026-01-19  5:33 ` ✗ Xe.CI.Full: " Patchwork
