* [RFC i-g-t v3 0/1] A tool to demonstrate use of netlink sockets to read RAS error counters
@ 2025-07-30 6:13 Aravind Iddamsetty
2025-07-30 6:13 ` [RFC i-g-t v3 1/1] tools/RAS: A tool to read " Aravind Iddamsetty
` (2 more replies)
0 siblings, 3 replies; 7+ messages in thread
From: Aravind Iddamsetty @ 2025-07-30 6:13 UTC (permalink / raw)
To: igt-dev
Cc: Alex Deucher, Simona Vetter, David Airlie, Joonas Lahtinen,
Rodrigo Vivi, Hawking Zhang, Lijo Lazar, Riana Tauro,
Anshuman Gupta
This tool demonstrates the use of netlink sockets to read RAS error
counters, as proposed in the series
"[RFC v5 0/5] Proposal to use netlink for RAS and Telemetry across drm subsystem".
v2: update uapi header.
v3: Add DRM_RAS_CMD_READ_BLOCK command to read errors from an IP Block.
The tool supports the following commands:
READ_ONE, READ_BLOCK, READ_ALL, WAIT_ON_EVENT, LIST_ERRORS
read a single error counter:
$ ./drm_ras READ_ONE --device=drm:/dev/dri/card1 --error_id=0x0000000000000005
counter value 0
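The --error_id argument above is a hexadecimal config-id. A minimal sketch of how such a value could be validated before it is sent is shown below; parse_error_id() is a hypothetical helper, not part of the posted tool, and it assumes that only a plain hex string (with an optional 0x prefix) should be accepted:

```c
#include <assert.h>
#include <ctype.h>
#include <errno.h>
#include <stdbool.h>
#include <stdint.h>
#include <stdlib.h>

/*
 * Hypothetical helper (not in the posted patch): parse a hexadecimal
 * config-id such as "0x0000000000000005" into *id.
 * Returns true on success, false on empty, malformed or out-of-range input.
 */
static bool parse_error_id(const char *arg, uint64_t *id)
{
	char *end;
	unsigned long long val;

	assert(id != NULL);

	/* Reject NULL, empty strings, leading whitespace and signs. */
	if (!arg || !isxdigit((unsigned char)arg[0]))
		return false;

	errno = 0;
	val = strtoull(arg, &end, 16);	/* base 16 accepts an optional "0x" */
	if (errno == ERANGE || end == arg || *end != '\0')
		return false;

	*id = (uint64_t)val;
	return true;
}
```

Rejecting a leading sign matters because strtoull() would otherwise accept "-1" and silently wrap it to a very large id.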
read all error counters:
$ ./drm_ras READ_ALL --device=drm:/dev/dri/card1
name config-id counter
error-gt0-correctable-guc 0x0000000000000001 0
error-gt0-correctable-slm 0x0000000000000003 0
error-gt0-correctable-eu-ic 0x0000000000000004 0
error-gt0-correctable-eu-grf 0x0000000000000005 0
error-gt0-fatal-guc 0x0000000000000009 0
error-gt0-fatal-slm 0x000000000000000d 0
error-gt0-fatal-eu-grf 0x000000000000000f 0
error-gt0-fatal-fpu 0x0000000000000010 0
error-gt0-fatal-tlb 0x0000000000000011 0
error-gt0-fatal-l3-fabric 0x0000000000000012 0
error-gt0-correctable-subslice 0x0000000000000013 0
error-gt0-correctable-l3bank 0x0000000000000014 0
error-gt0-fatal-subslice 0x0000000000000015 0
error-gt0-fatal-l3bank 0x0000000000000016 0
error-gt0-sgunit-correctable 0x0000000000000017 0
error-gt0-sgunit-nonfatal 0x0000000000000018 0
error-gt0-sgunit-fatal 0x0000000000000019 0
error-gt0-soc-fatal-psf-csc-0 0x000000000000001a 0
error-gt0-soc-fatal-psf-csc-1 0x000000000000001b 0
error-gt0-soc-fatal-psf-csc-2 0x000000000000001c 0
error-gt0-soc-fatal-punit 0x000000000000001d 0
error-gt0-soc-fatal-psf-0 0x000000000000001e 0
error-gt0-soc-fatal-psf-1 0x000000000000001f 0
error-gt0-soc-fatal-psf-2 0x0000000000000020 0
error-gt0-soc-fatal-cd0 0x0000000000000021 0
error-gt0-soc-fatal-cd0-mdfi 0x0000000000000022 0
error-gt0-soc-fatal-mdfi-east 0x0000000000000023 0
error-gt0-soc-fatal-mdfi-south 0x0000000000000024 0
error-gt0-soc-fatal-hbm-ss0-0 0x0000000000000025 0
error-gt0-soc-fatal-hbm-ss0-1 0x0000000000000026 0
error-gt0-soc-fatal-hbm-ss0-2 0x0000000000000027 0
error-gt0-soc-fatal-hbm-ss0-3 0x0000000000000028 0
error-gt0-soc-fatal-hbm-ss0-4 0x0000000000000029 0
error-gt0-soc-fatal-hbm-ss0-5 0x000000000000002a 0
error-gt0-soc-fatal-hbm-ss0-6 0x000000000000002b 0
error-gt0-soc-fatal-hbm-ss0-7 0x000000000000002c 0
error-gt0-soc-fatal-hbm-ss1-0 0x000000000000002d 0
error-gt0-soc-fatal-hbm-ss1-1 0x000000000000002e 0
error-gt0-soc-fatal-hbm-ss1-2 0x000000000000002f 0
error-gt0-soc-fatal-hbm-ss1-3 0x0000000000000030 0
error-gt0-soc-fatal-hbm-ss1-4 0x0000000000000031 0
error-gt0-soc-fatal-hbm-ss1-5 0x0000000000000032 0
error-gt0-soc-fatal-hbm-ss1-6 0x0000000000000033 0
error-gt0-soc-fatal-hbm-ss1-7 0x0000000000000034 0
error-gt0-soc-fatal-hbm-ss2-0 0x0000000000000035 0
error-gt0-soc-fatal-hbm-ss2-1 0x0000000000000036 0
error-gt0-soc-fatal-hbm-ss2-2 0x0000000000000037 0
error-gt0-soc-fatal-hbm-ss2-3 0x0000000000000038 0
error-gt0-soc-fatal-hbm-ss2-4 0x0000000000000039 0
error-gt0-soc-fatal-hbm-ss2-5 0x000000000000003a 0
error-gt0-soc-fatal-hbm-ss2-6 0x000000000000003b 0
error-gt0-soc-fatal-hbm-ss2-7 0x000000000000003c 0
error-gt0-soc-fatal-hbm-ss3-0 0x000000000000003d 0
error-gt0-soc-fatal-hbm-ss3-1 0x000000000000003e 0
error-gt0-soc-fatal-hbm-ss3-2 0x000000000000003f 0
error-gt0-soc-fatal-hbm-ss3-3 0x0000000000000040 0
error-gt0-soc-fatal-hbm-ss3-4 0x0000000000000041 0
error-gt0-soc-fatal-hbm-ss3-5 0x0000000000000042 0
error-gt0-soc-fatal-hbm-ss3-6 0x0000000000000043 0
error-gt0-soc-fatal-hbm-ss3-7 0x0000000000000044 0
error-gt0-gsc-correctable-sram-ecc 0x0000000000000045 0
error-gt0-gsc-nonfatal-mia-shutdown 0x0000000000000046 0
error-gt0-gsc-nonfatal-mia-int 0x0000000000000047 0
error-gt0-gsc-nonfatal-sram-ecc 0x0000000000000048 0
error-gt0-gsc-nonfatal-wdg-timeout 0x0000000000000049 0
error-gt0-gsc-nonfatal-rom-parity 0x000000000000004a 0
error-gt0-gsc-nonfatal-ucode-parity 0x000000000000004b 0
error-gt0-gsc-nonfatal-glitch-det 0x000000000000004c 0
error-gt0-gsc-nonfatal-fuse-pull 0x000000000000004d 0
error-gt0-gsc-nonfatal-fuse-crc-check 0x000000000000004e 0
error-gt0-gsc-nonfatal-selfmbist 0x000000000000004f 0
error-gt0-gsc-nonfatal-aon-parity 0x0000000000000050 0
error-gt1-correctable-guc 0x1000000000000001 0
error-gt1-correctable-slm 0x1000000000000003 0
error-gt1-correctable-eu-ic 0x1000000000000004 0
error-gt1-correctable-eu-grf 0x1000000000000005 0
error-gt1-fatal-guc 0x1000000000000009 0
error-gt1-fatal-slm 0x100000000000000d 0
error-gt1-fatal-eu-grf 0x100000000000000f 0
error-gt1-fatal-fpu 0x1000000000000010 0
error-gt1-fatal-tlb 0x1000000000000011 0
error-gt1-fatal-l3-fabric 0x1000000000000012 0
error-gt1-correctable-subslice 0x1000000000000013 0
error-gt1-correctable-l3bank 0x1000000000000014 0
error-gt1-fatal-subslice 0x1000000000000015 0
error-gt1-fatal-l3bank 0x1000000000000016 0
error-gt1-sgunit-correctable 0x1000000000000017 0
error-gt1-sgunit-nonfatal 0x1000000000000018 0
error-gt1-sgunit-fatal 0x1000000000000019 0
error-gt1-soc-fatal-psf-csc-0 0x100000000000001a 0
error-gt1-soc-fatal-psf-csc-1 0x100000000000001b 0
error-gt1-soc-fatal-psf-csc-2 0x100000000000001c 0
error-gt1-soc-fatal-punit 0x100000000000001d 0
error-gt1-soc-fatal-psf-0 0x100000000000001e 0
error-gt1-soc-fatal-psf-1 0x100000000000001f 0
error-gt1-soc-fatal-psf-2 0x1000000000000020 0
error-gt1-soc-fatal-cd0 0x1000000000000021 0
error-gt1-soc-fatal-cd0-mdfi 0x1000000000000022 0
error-gt1-soc-fatal-mdfi-east 0x1000000000000023 0
error-gt1-soc-fatal-mdfi-south 0x1000000000000024 0
error-gt1-soc-fatal-hbm-ss0-0 0x1000000000000025 0
error-gt1-soc-fatal-hbm-ss0-1 0x1000000000000026 0
error-gt1-soc-fatal-hbm-ss0-2 0x1000000000000027 0
error-gt1-soc-fatal-hbm-ss0-3 0x1000000000000028 0
error-gt1-soc-fatal-hbm-ss0-4 0x1000000000000029 0
error-gt1-soc-fatal-hbm-ss0-5 0x100000000000002a 0
error-gt1-soc-fatal-hbm-ss0-6 0x100000000000002b 0
error-gt1-soc-fatal-hbm-ss0-7 0x100000000000002c 0
error-gt1-soc-fatal-hbm-ss1-0 0x100000000000002d 0
error-gt1-soc-fatal-hbm-ss1-1 0x100000000000002e 0
error-gt1-soc-fatal-hbm-ss1-2 0x100000000000002f 0
error-gt1-soc-fatal-hbm-ss1-3 0x1000000000000030 0
error-gt1-soc-fatal-hbm-ss1-4 0x1000000000000031 0
error-gt1-soc-fatal-hbm-ss1-5 0x1000000000000032 0
error-gt1-soc-fatal-hbm-ss1-6 0x1000000000000033 0
error-gt1-soc-fatal-hbm-ss1-7 0x1000000000000034 0
error-gt1-soc-fatal-hbm-ss2-0 0x1000000000000035 0
error-gt1-soc-fatal-hbm-ss2-1 0x1000000000000036 0
error-gt1-soc-fatal-hbm-ss2-2 0x1000000000000037 0
error-gt1-soc-fatal-hbm-ss2-3 0x1000000000000038 0
error-gt1-soc-fatal-hbm-ss2-4 0x1000000000000039 0
error-gt1-soc-fatal-hbm-ss2-5 0x100000000000003a 0
error-gt1-soc-fatal-hbm-ss2-6 0x100000000000003b 0
error-gt1-soc-fatal-hbm-ss2-7 0x100000000000003c 0
error-gt1-soc-fatal-hbm-ss3-0 0x100000000000003d 0
error-gt1-soc-fatal-hbm-ss3-1 0x100000000000003e 0
error-gt1-soc-fatal-hbm-ss3-2 0x100000000000003f 0
error-gt1-soc-fatal-hbm-ss3-3 0x1000000000000040 0
error-gt1-soc-fatal-hbm-ss3-4 0x1000000000000041 0
error-gt1-soc-fatal-hbm-ss3-5 0x1000000000000042 0
error-gt1-soc-fatal-hbm-ss3-6 0x1000000000000043 0
error-gt1-soc-fatal-hbm-ss3-7 0x1000000000000044 0
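In the listing above, every gt1 config-id equals the matching gt0 id with bit 60 set. The sketch below splits an id on that basis. The layout (GT index in the top four bits, error index in the rest) is only inferred from this output and is an assumption, not something the uapi header states; all names are hypothetical:

```c
#include <assert.h>
#include <stdint.h>

/* Assumed layout, inferred from the READ_ALL listing only. */
#define RAS_GT_SHIFT	60
#define RAS_ERR_MASK	((UINT64_C(1) << RAS_GT_SHIFT) - 1)

/* GT index taken from the top four bits of a config-id. */
static inline unsigned int ras_config_gt(uint64_t config)
{
	return (unsigned int)(config >> RAS_GT_SHIFT);
}

/* Error index taken from the low 60 bits of a config-id. */
static inline uint64_t ras_config_error(uint64_t config)
{
	return config & RAS_ERR_MASK;
}

/* Build a config-id from a GT index and an error index. */
static inline uint64_t ras_config_make(unsigned int gt, uint64_t err)
{
	assert(gt < 16 && err <= RAS_ERR_MASK);
	return ((uint64_t)gt << RAS_GT_SHIFT) | err;
}
```

Under that assumption, error-gt1-correctable-eu-grf (0x1000000000000005) decodes to GT 1, error index 0x5.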
wait on an error event:
$ ./drm_ras WAIT_ON_EVENT --device=drm:/dev/dri/card1
waiting for error event
error event received
counter value 0
list all errors:
$ ./drm_ras LIST_ERRORS --device=drm:/dev/dri/card1
name config-id
error-gt0-correctable-guc 0x0000000000000001
error-gt0-correctable-slm 0x0000000000000003
error-gt0-correctable-eu-ic 0x0000000000000004
error-gt0-correctable-eu-grf 0x0000000000000005
error-gt0-fatal-guc 0x0000000000000009
error-gt0-fatal-slm 0x000000000000000d
error-gt0-fatal-eu-grf 0x000000000000000f
error-gt0-fatal-fpu 0x0000000000000010
error-gt0-fatal-tlb 0x0000000000000011
error-gt0-fatal-l3-fabric 0x0000000000000012
error-gt0-correctable-subslice 0x0000000000000013
error-gt0-correctable-l3bank 0x0000000000000014
error-gt0-fatal-subslice 0x0000000000000015
error-gt0-fatal-l3bank 0x0000000000000016
error-gt0-sgunit-correctable 0x0000000000000017
error-gt0-sgunit-nonfatal 0x0000000000000018
error-gt0-sgunit-fatal 0x0000000000000019
error-gt0-soc-fatal-psf-csc-0 0x000000000000001a
error-gt0-soc-fatal-psf-csc-1 0x000000000000001b
error-gt0-soc-fatal-psf-csc-2 0x000000000000001c
error-gt0-soc-fatal-punit 0x000000000000001d
error-gt0-soc-fatal-psf-0 0x000000000000001e
error-gt0-soc-fatal-psf-1 0x000000000000001f
error-gt0-soc-fatal-psf-2 0x0000000000000020
error-gt0-soc-fatal-cd0 0x0000000000000021
error-gt0-soc-fatal-cd0-mdfi 0x0000000000000022
error-gt0-soc-fatal-mdfi-east 0x0000000000000023
error-gt0-soc-fatal-mdfi-south 0x0000000000000024
error-gt0-soc-fatal-hbm-ss0-0 0x0000000000000025
error-gt0-soc-fatal-hbm-ss0-1 0x0000000000000026
error-gt0-soc-fatal-hbm-ss0-2 0x0000000000000027
error-gt0-soc-fatal-hbm-ss0-3 0x0000000000000028
error-gt0-soc-fatal-hbm-ss0-4 0x0000000000000029
error-gt0-soc-fatal-hbm-ss0-5 0x000000000000002a
error-gt0-soc-fatal-hbm-ss0-6 0x000000000000002b
error-gt0-soc-fatal-hbm-ss0-7 0x000000000000002c
error-gt0-soc-fatal-hbm-ss1-0 0x000000000000002d
error-gt0-soc-fatal-hbm-ss1-1 0x000000000000002e
error-gt0-soc-fatal-hbm-ss1-2 0x000000000000002f
error-gt0-soc-fatal-hbm-ss1-3 0x0000000000000030
error-gt0-soc-fatal-hbm-ss1-4 0x0000000000000031
error-gt0-soc-fatal-hbm-ss1-5 0x0000000000000032
error-gt0-soc-fatal-hbm-ss1-6 0x0000000000000033
error-gt0-soc-fatal-hbm-ss1-7 0x0000000000000034
error-gt0-soc-fatal-hbm-ss2-0 0x0000000000000035
error-gt0-soc-fatal-hbm-ss2-1 0x0000000000000036
error-gt0-soc-fatal-hbm-ss2-2 0x0000000000000037
error-gt0-soc-fatal-hbm-ss2-3 0x0000000000000038
error-gt0-soc-fatal-hbm-ss2-4 0x0000000000000039
error-gt0-soc-fatal-hbm-ss2-5 0x000000000000003a
error-gt0-soc-fatal-hbm-ss2-6 0x000000000000003b
error-gt0-soc-fatal-hbm-ss2-7 0x000000000000003c
error-gt0-soc-fatal-hbm-ss3-0 0x000000000000003d
error-gt0-soc-fatal-hbm-ss3-1 0x000000000000003e
error-gt0-soc-fatal-hbm-ss3-2 0x000000000000003f
error-gt0-soc-fatal-hbm-ss3-3 0x0000000000000040
error-gt0-soc-fatal-hbm-ss3-4 0x0000000000000041
error-gt0-soc-fatal-hbm-ss3-5 0x0000000000000042
error-gt0-soc-fatal-hbm-ss3-6 0x0000000000000043
error-gt0-soc-fatal-hbm-ss3-7 0x0000000000000044
error-gt0-gsc-correctable-sram-ecc 0x0000000000000045
error-gt0-gsc-nonfatal-mia-shutdown 0x0000000000000046
error-gt0-gsc-nonfatal-mia-int 0x0000000000000047
error-gt0-gsc-nonfatal-sram-ecc 0x0000000000000048
error-gt0-gsc-nonfatal-wdg-timeout 0x0000000000000049
error-gt0-gsc-nonfatal-rom-parity 0x000000000000004a
error-gt0-gsc-nonfatal-ucode-parity 0x000000000000004b
error-gt0-gsc-nonfatal-glitch-det 0x000000000000004c
error-gt0-gsc-nonfatal-fuse-pull 0x000000000000004d
error-gt0-gsc-nonfatal-fuse-crc-check 0x000000000000004e
error-gt0-gsc-nonfatal-selfmbist 0x000000000000004f
error-gt0-gsc-nonfatal-aon-parity 0x0000000000000050
error-gt1-correctable-guc 0x1000000000000001
error-gt1-correctable-slm 0x1000000000000003
error-gt1-correctable-eu-ic 0x1000000000000004
error-gt1-correctable-eu-grf 0x1000000000000005
error-gt1-fatal-guc 0x1000000000000009
error-gt1-fatal-slm 0x100000000000000d
error-gt1-fatal-eu-grf 0x100000000000000f
error-gt1-fatal-fpu 0x1000000000000010
error-gt1-fatal-tlb 0x1000000000000011
error-gt1-fatal-l3-fabric 0x1000000000000012
error-gt1-correctable-subslice 0x1000000000000013
error-gt1-correctable-l3bank 0x1000000000000014
error-gt1-fatal-subslice 0x1000000000000015
error-gt1-fatal-l3bank 0x1000000000000016
error-gt1-sgunit-correctable 0x1000000000000017
error-gt1-sgunit-nonfatal 0x1000000000000018
error-gt1-sgunit-fatal 0x1000000000000019
error-gt1-soc-fatal-psf-csc-0 0x100000000000001a
error-gt1-soc-fatal-psf-csc-1 0x100000000000001b
error-gt1-soc-fatal-psf-csc-2 0x100000000000001c
error-gt1-soc-fatal-punit 0x100000000000001d
error-gt1-soc-fatal-psf-0 0x100000000000001e
error-gt1-soc-fatal-psf-1 0x100000000000001f
error-gt1-soc-fatal-psf-2 0x1000000000000020
error-gt1-soc-fatal-cd0 0x1000000000000021
error-gt1-soc-fatal-cd0-mdfi 0x1000000000000022
error-gt1-soc-fatal-mdfi-east 0x1000000000000023
error-gt1-soc-fatal-mdfi-south 0x1000000000000024
error-gt1-soc-fatal-hbm-ss0-0 0x1000000000000025
error-gt1-soc-fatal-hbm-ss0-1 0x1000000000000026
error-gt1-soc-fatal-hbm-ss0-2 0x1000000000000027
error-gt1-soc-fatal-hbm-ss0-3 0x1000000000000028
error-gt1-soc-fatal-hbm-ss0-4 0x1000000000000029
error-gt1-soc-fatal-hbm-ss0-5 0x100000000000002a
error-gt1-soc-fatal-hbm-ss0-6 0x100000000000002b
error-gt1-soc-fatal-hbm-ss0-7 0x100000000000002c
error-gt1-soc-fatal-hbm-ss1-0 0x100000000000002d
error-gt1-soc-fatal-hbm-ss1-1 0x100000000000002e
error-gt1-soc-fatal-hbm-ss1-2 0x100000000000002f
error-gt1-soc-fatal-hbm-ss1-3 0x1000000000000030
error-gt1-soc-fatal-hbm-ss1-4 0x1000000000000031
error-gt1-soc-fatal-hbm-ss1-5 0x1000000000000032
error-gt1-soc-fatal-hbm-ss1-6 0x1000000000000033
error-gt1-soc-fatal-hbm-ss1-7 0x1000000000000034
error-gt1-soc-fatal-hbm-ss2-0 0x1000000000000035
error-gt1-soc-fatal-hbm-ss2-1 0x1000000000000036
error-gt1-soc-fatal-hbm-ss2-2 0x1000000000000037
error-gt1-soc-fatal-hbm-ss2-3 0x1000000000000038
error-gt1-soc-fatal-hbm-ss2-4 0x1000000000000039
error-gt1-soc-fatal-hbm-ss2-5 0x100000000000003a
error-gt1-soc-fatal-hbm-ss2-6 0x100000000000003b
error-gt1-soc-fatal-hbm-ss2-7 0x100000000000003c
error-gt1-soc-fatal-hbm-ss3-0 0x100000000000003d
error-gt1-soc-fatal-hbm-ss3-1 0x100000000000003e
error-gt1-soc-fatal-hbm-ss3-2 0x100000000000003f
error-gt1-soc-fatal-hbm-ss3-3 0x1000000000000040
error-gt1-soc-fatal-hbm-ss3-4 0x1000000000000041
error-gt1-soc-fatal-hbm-ss3-5 0x1000000000000042
error-gt1-soc-fatal-hbm-ss3-6 0x1000000000000043
error-gt1-soc-fatal-hbm-ss3-7 0x1000000000000044
Cc: Alex Deucher <alexander.deucher@amd.com>
Cc: Simona Vetter <simona@ffwll.ch>
Cc: David Airlie <airlied@gmail.com>
Cc: Joonas Lahtinen <joonas.lahtinen@linux.intel.com>
Cc: Rodrigo Vivi <rodrigo.vivi@intel.com>
Cc: Hawking Zhang <Hawking.Zhang@amd.com>
Cc: Lijo Lazar <lijo.lazar@amd.com>
Cc: Riana Tauro <riana.tauro@intel.com>
Cc: Anshuman Gupta <anshuman.gupta@intel.com>
Aravind Iddamsetty (1):
tools/RAS: A tool to read error counters
include/drm-uapi/drm_netlink.h | 105 ++++++++
meson.build | 4 +
tools/drm_ras.c | 428 +++++++++++++++++++++++++++++++++
tools/meson.build | 5 +
4 files changed, 542 insertions(+)
create mode 100644 include/drm-uapi/drm_netlink.h
create mode 100644 tools/drm_ras.c
--
2.25.1
^ permalink raw reply [flat|nested] 7+ messages in thread
* [RFC i-g-t v3 1/1] tools/RAS: A tool to read error counters
2025-07-30 6:13 [RFC i-g-t v3 0/1] A tool to demonstrate use of netlink sockets to read RAS error counters Aravind Iddamsetty
@ 2025-07-30 6:13 ` Aravind Iddamsetty
2025-08-13 12:46 ` Kamil Konieczny
2025-08-15 22:13 ` Rodrigo Vivi
2025-07-30 20:05 ` [RFC i-g-t v3 0/1] A tool to demonstrate use of netlink sockets to read RAS " Rodrigo Vivi
2025-08-13 12:42 ` Kamil Konieczny
2 siblings, 2 replies; 7+ messages in thread
From: Aravind Iddamsetty @ 2025-07-30 6:13 UTC (permalink / raw)
To: igt-dev
Cc: Alex Deucher, Simona Vetter, David Airlie, Joonas Lahtinen,
Rodrigo Vivi, Hawking Zhang, Lijo Lazar, Riana Tauro,
Anshuman Gupta
This tool demonstrates the use of netlink sockets to query and read the
error counters on hardware. It provides the following commands: LIST_ERRORS,
READ_ONE and READ_ALL to read counters, and WAIT_ON_EVENT to wait for the
occurrence of a particular event. WAIT_ON_EVENT is presently hardcoded to
wait for a correctable error event and then read an error counter.
v2: update uapi header.
v3: Add DRM_RAS_CMD_READ_BLOCK command to read errors from an IP Block.
Signed-off-by: Aravind Iddamsetty <aravind.iddamsetty@linux.intel.com> --- include/drm-uapi/drm_netlink.h | 105 ++++++++ meson.build | 4 + tools/drm_ras.c | 428 +++++++++++++++++++++++++++++++++ tools/meson.build | 5 + 4 files changed, 542 insertions(+) create mode 100644 include/drm-uapi/drm_netlink.h create mode 100644 tools/drm_ras.c diff --git a/include/drm-uapi/drm_netlink.h b/include/drm-uapi/drm_netlink.h new file mode 100644 index 000000000..c978efaab --- /dev/null +++ b/include/drm-uapi/drm_netlink.h @@ -0,0 +1,105 @@ +/* SPDX-License-Identifier: MIT */ +/* + * Copyright 2023 Intel Corporation + * + * Permission is hereby granted, free of charge, to any person obtaining a + * copy of this software and associated documentation files (the "Software"), + * to deal in the Software without restriction, including without limitation + * the rights to use, copy, modify, merge, publish, distribute, sublicense, + * and/or sell copies of the Software, and to permit persons to whom the + * Software is furnished to do so, subject to the following conditions: + * + * The above copyright notice and this permission notice (including the next + * paragraph) shall be included in all copies or substantial portions of the + * Software. + * + * THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR + * IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY, + * FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL + * VA LINUX SYSTEMS AND/OR ITS SUPPLIERS BE LIABLE FOR ANY CLAIM, DAMAGES OR + * OTHER LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, + * ARISING FROM, OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR + * OTHER DEALINGS IN THE SOFTWARE. 
+ */ + +#ifndef _DRM_NETLINK_H_ +#define _DRM_NETLINK_H_ + +#define DRM_GENL_VERSION 1 +#define DRM_GENL_MCAST_GROUP_NAME_CORR_ERR "drm_corr_err" +#define DRM_GENL_MCAST_GROUP_NAME_UNCORR_ERR "drm_uncorr_err" + +#if defined(__cplusplus) +extern "C" { +#endif + +/** + * enum drm_genl_error_cmds - Supported error commands + * + */ +enum drm_genl_error_cmds { + DRM_CMD_UNSPEC, + /** + * @DRM_RAS_CMD_QUERY: Command to list all errors names with config-id in verbose mode. + * In normal mode will list IP blocks, total instances available and error types supported + */ + DRM_RAS_CMD_QUERY, + /** @DRM_RAS_CMD_READ_ONE: Command to get a counter for a specific error */ + DRM_RAS_CMD_READ_ONE, + /** @DRM_RAS_CMD_READ_BLOCK: Command to get a counter of specific error type from an IP + * block + */ + DRM_RAS_CMD_READ_BLOCK, + /** @DRM_RAS_CMD_READ_ALL: Command to get counters of all errors */ + DRM_RAS_CMD_READ_ALL, + /** @DRM_RAS_CMD_ERROR_EVENT: Command sent as part of multicast event */ + DRM_RAS_CMD_ERROR_EVENT, + + __DRM_CMD_MAX, + DRM_CMD_MAX = __DRM_CMD_MAX - 1, +}; + +enum drm_cmd_request_type { + DRM_RAS_CMD_QUERY_VERBOSE = 1, + DRM_RAS_CMD_QUERY_NORMAL = 2, +}; + +/** + * enum drm_error_attr - Attributes to use with drm_genl_error_cmds + * + */ +enum drm_error_attr { + DRM_ATTR_UNSPEC, + DRM_ATTR_PAD = DRM_ATTR_UNSPEC, + /** + * @DRM_RAS_ATTR_QUERY: Should be used with DRM_RAS_CMD_QUERY, + * DRM_RAS_CMD_READ_ALL + */ + DRM_RAS_ATTR_QUERY, /* NLA_U8 */ + /** + * @DRM_RAS_ATTR_READ_ALL: Should be used with DRM_RAS_CMD_READ_ALL + */ + DRM_RAS_ATTR_READ_ALL, /* NLA_U8 */ + /** + * @DRM_RAS_ATTR_QUERY_REPLY: First Nested attributed sent as a + * response to DRM_RAS_CMD_QUERY, DRM_RAS_CMD_READ_ALL commands. 
+ */ + DRM_RAS_ATTR_QUERY_REPLY, /* NLA_NESTED */ + /** @DRM_RAS_ATTR_ERROR_NAME: Used to pass error name */ + DRM_RAS_ATTR_ERROR_NAME, /* NLA_NUL_STRING */ + /** @DRM_RAS_ATTR_ERROR_ID: Used to pass error id, should be used with + * DRM_RAS_CMD_READ_ONE, DRM_RAS_CMD_READ_BLOCK + */ + DRM_RAS_ATTR_ERROR_ID, /* NLA_U64 */ + /** @DRM_RAS_ATTR_ERROR_VALUE: Used to pass error value */ + DRM_RAS_ATTR_ERROR_VALUE, /* NLA_U64 */ + + __DRM_ATTR_MAX, + DRM_ATTR_MAX = __DRM_ATTR_MAX - 1, +}; + +#if defined(__cplusplus) +} +#endif + +#endif diff --git a/meson.build b/meson.build index 4efad72cf..c3e5c95d5 100644 --- a/meson.build +++ b/meson.build @@ -165,6 +165,10 @@ cairo = dependency('cairo', version : '>1.12.0', required : true) libudev = dependency('libudev', required : true) glib = dependency('glib-2.0', required : true) +libnl = dependency('libnl-3.0', required: false) +libnl_genl = dependency('libnl-genl-3.0', required: false) +libnl_cli = dependency('libnl-cli-3.0', required:false) + xmlrpc = dependency('xmlrpc', required : false) xmlrpc_util = dependency('xmlrpc_util', required : false) xmlrpc_client = dependency('xmlrpc_client', required : false) diff --git a/tools/drm_ras.c b/tools/drm_ras.c new file mode 100644 index 000000000..68946ff6a --- /dev/null +++ b/tools/drm_ras.c @@ -0,0 +1,428 @@ +// SPDX-License-Identifier: MIT +/* + * Copyright © 2021 Intel Corporation + */ + +#include <stdio.h> +#include <sys/types.h> +#include <unistd.h> +#include <getopt.h> +#include <linux/genetlink.h> +#include <netlink/cli/utils.h> + +#include "drm_netlink.h" +#include "igt_device_scan.h" + +#define ARRAY_SIZE(array) (sizeof(array) / sizeof((array)[0])) + +struct nl_sock *sock, *mcsock; +int family_id; + +enum opt_val { + OPT_UNKNOWN = '?', + OPT_END = -1, + OPT_DEVICE, + OPT_CONFIG, + OPT_VERBOSE, + OPT_HELP, +}; + +enum cmd_ids { + INVALID_CMD = -1, + LIST_ERRORS = 0, + READ_ONE, + READ_BLOCK, + READ_ALL, + WAIT_ON_EVENT, + + __MAX_CMDS, +}; + +static const char * const 
cmd_names[] = { + "LIST_ERRORS", + "READ_ONE", + "READ_BLOCK", + "READ_ALL", + "WAIT_ON_EVENT", +}; + +static void help(char **argv) +{ + int i; + + printf("Usage: %s command [<command options>]\n", argv[0]); + printf("commands:\n"); + + for (i = 0; i < __MAX_CMDS; i++) { + switch (i) { + case LIST_ERRORS: + printf("%s %s --device=<device filter> --verbose [default normal]\n", argv[0], cmd_names[i]); + break; + case READ_ALL: + case WAIT_ON_EVENT: + printf("%s %s --device=<device filter>\n", argv[0], cmd_names[i]); + break; + case READ_ONE: + case READ_BLOCK: + printf("%s %s --device=<device filter> --error_id=<id returned from query>\n", argv[0], cmd_names[i]); + break; + } + } + + igt_device_print_filter_types(); +} + +static int list_errors(struct nl_cache_ops *ops, struct genl_cmd *cmd, + struct genl_info *info, void *arg) +{ + const struct nlmsghdr *nlh = info->nlh; + struct nlattr *nla; + int len, remain; + + len = GENL_HDRLEN; + + nlmsg_for_each_attr(nla, nlh, len, remain) { + if ((nla_type(nla) == DRM_RAS_ATTR_QUERY_REPLY) && nla_is_nested(nla)) { + struct nlattr *cur; + int rem; + + if (cmd->c_id == DRM_RAS_CMD_READ_ALL) + printf("%-50s\t%-18s\t%s\n", "name", "config-id", "counter"); + else + printf("%-50s\t%-18s\n", "name", "config-id"); + + nla_for_each_nested(cur, nla, rem) { + switch (nla_type(cur)) { + case DRM_RAS_ATTR_ERROR_NAME: + printf("\n%-50s", nla_get_string(cur)); + break; + case DRM_RAS_ATTR_ERROR_ID: + printf("\t0x%016lx", nla_get_u64(cur)); + break; + case DRM_RAS_ATTR_ERROR_VALUE: + printf("\t%lu", nla_get_u64(cur)); + break; + default: + break; + } + } + printf("\n"); + } + } + + return NL_OK; +} + +static int read_single(struct nl_cache_ops *ops, struct genl_cmd *cmd, + struct genl_info *info, void *arg) +{ + if (!info->attrs[DRM_RAS_ATTR_ERROR_VALUE]) + nl_cli_fatal(NLE_FAILURE, "DRM_RAS_ATTR_ERROR_VALUE attribute is missing"); + + printf("counter value %lu\n", nla_get_u64(info->attrs[DRM_RAS_ATTR_ERROR_VALUE])); + + return NL_OK; +} + 
+static int mcast_event_handler(struct nl_cache_ops *ops, struct genl_cmd *cmd, + struct genl_info *info, void *arg) +{ + struct nl_msg *msg; + uint64_t config = 0x0000000000000005; /* error-gt0-correctable-eu-grf */ + void *msg_head; + int ret; + + printf("error event received\n"); + + msg = nlmsg_alloc(); + if (!msg) + nl_cli_fatal(NLE_INVAL, "nlmsg_alloc failed\n"); + + msg_head = genlmsg_put(msg, NL_AUTO_PORT, NL_AUTO_SEQ, family_id, 0, 0, + DRM_RAS_CMD_READ_ONE, 1); + if (!msg_head) + nl_cli_fatal(ENOMEM, "genlmsg_put failed\n"); + + nla_put_u64(msg, DRM_RAS_ATTR_ERROR_ID, config); + + ret = nl_send_auto(sock, msg); + if (ret < 0) + nl_cli_fatal(ret, "Unable to send message: %s", nl_geterror(ret)); + + ret = nl_recvmsgs_default(sock); + if (ret < 0) + nl_cli_fatal(ret, "Unable to receive message: %s", nl_geterror(ret)); + + nlmsg_free(msg); + + return NL_OK; +} + +static struct nla_policy drm_genl_policy[DRM_ATTR_MAX + 1] = { + [DRM_RAS_ATTR_QUERY] = { .type = NLA_U8 }, + [DRM_RAS_ATTR_READ_ALL] = { .type = NLA_U8 }, + [DRM_RAS_ATTR_QUERY_REPLY] = { .type = NLA_NESTED }, + [DRM_RAS_ATTR_ERROR_NAME] = { .type = NLA_NUL_STRING }, + [DRM_RAS_ATTR_ERROR_ID] = { .type = NLA_U64 }, + [DRM_RAS_ATTR_ERROR_VALUE] = { .type = NLA_U64 }, +}; + +static struct genl_cmd drm_genl_cmds[] = { + { + .c_id = DRM_RAS_CMD_QUERY, + .c_name = "QUERY", + .c_maxattr = DRM_ATTR_MAX, + .c_attr_policy = drm_genl_policy, + .c_msg_parser = list_errors, + }, + { + .c_id = DRM_RAS_CMD_READ_ONE, + .c_name = "READ_1", + .c_maxattr = DRM_ATTR_MAX, + .c_attr_policy = drm_genl_policy, + .c_msg_parser = read_single, + }, + { + .c_id = DRM_RAS_CMD_READ_BLOCK, + .c_name = "READ_BLOCK", + .c_maxattr = DRM_ATTR_MAX, + .c_attr_policy = drm_genl_policy, + .c_msg_parser = read_single, + }, + { + .c_id = DRM_RAS_CMD_READ_ALL, + .c_name = "READ_ALL", + .c_maxattr = DRM_ATTR_MAX, + .c_attr_policy = drm_genl_policy, + .c_msg_parser = list_errors, + }, + { + .c_id = DRM_RAS_CMD_ERROR_EVENT, + .c_name = 
"ERROR_EVENT", + .c_maxattr = DRM_ATTR_MAX, + .c_attr_policy = drm_genl_policy, + .c_msg_parser = mcast_event_handler, + }, +}; + +static struct genl_ops drm_genl_ops = { + .o_hdrsize = 0, + .o_cmds = drm_genl_cmds, + .o_ncmds = ARRAY_SIZE(drm_genl_cmds), +}; + +static void send_cmd(int cmd, uint64_t config) +{ + struct nl_msg *msg; + void *msg_head; + int ret; + + msg = nlmsg_alloc(); + if (!msg) + nl_cli_fatal(NLE_INVAL, "nlmsg_alloc failed\n"); + + msg_head = genlmsg_put(msg, NL_AUTO_PORT, NL_AUTO_SEQ, family_id, 0, 0, cmd, 1); + if (!msg_head) + nl_cli_fatal(ENOMEM, "genlmsg_put failed\n"); + switch (cmd) { + case DRM_RAS_CMD_QUERY: + nla_put_u8(msg, DRM_RAS_ATTR_QUERY, config ? DRM_RAS_CMD_QUERY_VERBOSE : + DRM_RAS_CMD_QUERY_NORMAL); + break; + case DRM_RAS_CMD_READ_ONE: + case DRM_RAS_CMD_READ_BLOCK: + nla_put_u64(msg, DRM_RAS_ATTR_ERROR_ID, config); + break; + case DRM_RAS_CMD_READ_ALL: + nla_put_u8(msg, DRM_RAS_ATTR_READ_ALL, 1); + break; + default: + break; + } + + ret = nl_send_auto(sock, msg); + if (ret < 0) + nl_cli_fatal(ret, "Unable to send message: %s", nl_geterror(ret)); + + ret = nl_recvmsgs_default(sock); + if (ret < 0) + nl_cli_fatal(ret, "Unable to receive message: %s", nl_geterror(ret)); + + nlmsg_free(msg); +} + +static int get_cmd(char *cmd_name) +{ + int i; + + if (!cmd_name) + return -1; + + for (i = 0; i < __MAX_CMDS; i++) { + if (strcasecmp(cmd_name, cmd_names[i]) == 0) + return i; + } + + return -1; +} + +int main(int argc, char **argv) +{ + char *endptr; + enum opt_val val; + enum cmd_ids cmd; + char *device = NULL; + uint64_t error_config_id; + bool verbose = false; + int ret, mcgrp, index; + struct igt_device_card card; + char *dev_name, *dup; + + static struct option options[] = { + {"device", required_argument, NULL, OPT_DEVICE}, + {"error_id", required_argument, NULL, OPT_CONFIG}, + {"verbose", no_argument, NULL, OPT_VERBOSE}, + {"help", no_argument, NULL, OPT_HELP}, + { 0 } + }; + + cmd = get_cmd(argv[1]); + if (cmd < 0) { + 
fprintf(stderr, "invalid command\n"); + help(argv); + exit(EXIT_FAILURE); + } + + for (val = 0; val != OPT_END; ) { + val = getopt_long(argc, argv, "", options, &index); + + switch (val) { + case OPT_DEVICE: + device = strdup(optarg); + break; + case OPT_CONFIG: + error_config_id = strtoull(optarg, &endptr, 16); + if (*endptr) { + fprintf(stderr, "invalid config id %s\n", optarg); + exit(EXIT_FAILURE); + } + break; + case OPT_VERBOSE: + verbose = true; + break; + case OPT_HELP: + help(argv); + exit(EXIT_FAILURE); + case OPT_END: + break; + case OPT_UNKNOWN: + exit(EXIT_FAILURE); + } + } + + if (!device) { + fprintf(stderr, "missing device option\n"); + help(argv); + exit(EXIT_FAILURE); + } else { + ret = igt_device_card_match_pci(device, &card); + if (!ret) { + fprintf(stderr, "device %s not found!\n", device); + exit(EXIT_FAILURE); + } + free(device); + } + + /* get card name */ + dup = strdup(card.card); + + while (dup) + dev_name = strsep(&dup, "/"); + free(dup); + + drm_genl_ops.o_name = strdup(dev_name); + + sock = nl_cli_alloc_socket(); + if (!sock) + nl_cli_fatal(NLE_NOMEM, "Cannot allocate nl_sock"); + + ret = nl_cli_connect(sock, NETLINK_GENERIC); + if (ret < 0) + nl_cli_fatal(ret, "Cannot connect handle"); + + ret = genl_register_family(&drm_genl_ops); + if (ret < 0) + nl_cli_fatal(ret, "Cannot register xe family"); + + ret = genl_ops_resolve(sock, &drm_genl_ops); + if (ret < 0) + nl_cli_fatal(ret, "Unable to resolve family name"); + + family_id = genl_ctrl_resolve(sock, drm_genl_ops.o_name); + if (family_id < 0) + nl_cli_fatal(NLE_INVAL, "Resolving of \"%s\" failed", drm_genl_ops.o_name); + + ret = nl_socket_modify_cb(sock, NL_CB_VALID, NL_CB_CUSTOM, genl_handle_msg, NULL); + if (ret < 0) + nl_cli_fatal(ret, "Unable to modify valid message callback"); + + switch (cmd) { + case LIST_ERRORS: + send_cmd(DRM_RAS_CMD_QUERY, verbose); + break; + case READ_ONE: + send_cmd(DRM_RAS_CMD_READ_ONE, error_config_id); + break; + case READ_BLOCK: + 
send_cmd(DRM_RAS_CMD_READ_BLOCK, error_config_id); + break; + case READ_ALL: + send_cmd(DRM_RAS_CMD_READ_ALL, 0); + break; + case WAIT_ON_EVENT: + mcsock = nl_cli_alloc_socket(); + if (!mcsock) + nl_cli_fatal(NLE_NOMEM, "Cannot allocate nl_sock"); + + ret = nl_cli_connect(mcsock, NETLINK_GENERIC); + if (ret < 0) + nl_cli_fatal(ret, "Cannot connect handle"); + + ret = genl_ops_resolve(mcsock, &drm_genl_ops); + if (ret < 0) + nl_cli_fatal(ret, "Unable to resolve family name"); + + nl_socket_disable_seq_check(mcsock); + + mcgrp = genl_ctrl_resolve_grp(mcsock, drm_genl_ops.o_name, + DRM_GENL_MCAST_GROUP_NAME_CORR_ERR); + if (mcgrp < 0) + nl_cli_fatal(mcgrp, "failed to resolve generic netlink multicast group"); + + /* Join the multicast group. */ + ret = nl_socket_add_membership(mcsock, mcgrp); + if (ret < 0) + nl_cli_fatal(ret, "failed to join multicast group"); + + ret = nl_socket_modify_cb(mcsock, NL_CB_VALID, NL_CB_CUSTOM, genl_handle_msg, NULL); + if (ret < 0) + nl_cli_fatal(ret, "Unable to modify valid message callback"); + + printf("waiting for error event\n"); + ret = nl_recvmsgs_default(mcsock); + if (ret < 0) + nl_cli_fatal(ret, "Unable to receive message: %s", nl_geterror(ret)); + + nl_close(mcsock); + nl_socket_free(mcsock); + break; + default: + break; + } + + nl_close(sock); + nl_socket_free(sock); + + return 0; +} + diff --git a/tools/meson.build b/tools/meson.build index 99a732942..5195d1f62 100644 --- a/tools/meson.build +++ b/tools/meson.build @@ -115,6 +115,11 @@ if build_vmtb install_subdir('vmtb', install_dir: libexecdir) endif +executable('drm_ras', 'drm_ras.c', + dependencies : [tool_deps, libnl, libnl_cli, libnl_genl], + install_rpath : bindir_rpathdir, + install : true) + subdir('i915-perf') subdir('xe-perf') subdir('null_state_gen') -- 2.25.1 ^ permalink raw reply related [flat|nested] 7+ messages in thread
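In main() of the patch above, the card name is taken from card.card with a strsep() loop that runs until dup is NULL, so the following free(dup) appears to free nothing and the strdup() buffer is lost. A minimal sketch of an allocation-free alternative follows; card_basename() is a hypothetical helper, not part of the posted tool, and it assumes the path looks like /dev/dri/card1:

```c
#include <assert.h>
#include <string.h>

/*
 * Hypothetical helper (not in the posted patch): return a pointer to the
 * last component of a device path, e.g. "card1" for "/dev/dri/card1".
 * The result points into the caller's string, so nothing needs freeing.
 */
static const char *card_basename(const char *path)
{
	const char *slash;

	assert(path != NULL);

	slash = strrchr(path, '/');
	return slash ? slash + 1 : path;
}
```

With such a helper the family name could be set as drm_genl_ops.o_name = strdup(card_basename(card.card)), leaving a single allocation to track.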
* Re: [RFC i-g-t v3 1/1] tools/RAS: A tool to read error counters
2025-07-30 6:13 ` [RFC i-g-t v3 1/1] tools/RAS: A tool to read " Aravind Iddamsetty
@ 2025-08-13 12:46 ` Kamil Konieczny
2025-08-15 22:13 ` Rodrigo Vivi
1 sibling, 0 replies; 7+ messages in thread
From: Kamil Konieczny @ 2025-08-13 12:46 UTC (permalink / raw)
To: Aravind Iddamsetty
Cc: igt-dev, Alex Deucher, Simona Vetter, David Airlie,
Joonas Lahtinen, Rodrigo Vivi, Hawking Zhang, Lijo Lazar,
Riana Tauro, Anshuman Gupta
Hi Aravind,
On 2025-07-30 at 11:43:42 +0530, Aravind Iddamsetty wrote:
> This tool demonstrates the use of netlink sockets to query and read the
> error counters on hardware. It provides the following commands: LIST_ERRORS,
> READ_ONE and READ_ALL to read counters, and WAIT_ON_EVENT to wait for the
> occurrence of a particular event. WAIT_ON_EVENT is presently hardcoded to
> wait for a correctable error event and then read an error counter.
>
> v2: update uapi header.
> v3: Add DRM_RAS_CMD_READ_BLOCK command to read errors from an IP Block.
>
> Signed-off-by: Aravind Iddamsetty <aravind.iddamsetty@linux.intel.com>
> ---
> include/drm-uapi/drm_netlink.h | 105 ++++++++
> meson.build | 4 +
> tools/drm_ras.c | 428 +++++++++++++++++++++++++++++++++
> tools/meson.build | 5 +
> 4 files changed, 542 insertions(+)
> create mode 100644 include/drm-uapi/drm_netlink.h
> create mode 100644 tools/drm_ras.c
>
> diff --git a/include/drm-uapi/drm_netlink.h b/include/drm-uapi/drm_netlink.h
> new file mode 100644
> index 000000000..c978efaab
> --- /dev/null
> +++ b/include/drm-uapi/drm_netlink.h
> @@ -0,0 +1,105 @@
> +/* SPDX-License-Identifier: MIT */
> +/*
> + * Copyright 2023 Intel Corporation
This should be 2025, unless it was published elsewhere earlier, in which case 2023-2025.
> + *
> + * Permission is hereby granted, free of charge, to any person obtaining a
Remove this, it was replaced by SPDX.
> + * copy of this software and associated documentation files (the "Software"), > + * to deal in the Software without restriction, including without limitation > + * the rights to use, copy, modify, merge, publish, distribute, sublicense, > + * and/or sell copies of the Software, and to permit persons to whom the > + * Software is furnished to do so, subject to the following conditions: > + * > + * The above copyright notice and this permission notice (including the next > + * paragraph) shall be included in all copies or substantial portions of the > + * Software. > + * > + * THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR > + * IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY, > + * FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL > + * VA LINUX SYSTEMS AND/OR ITS SUPPLIERS BE LIABLE FOR ANY CLAIM, DAMAGES OR > + * OTHER LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, > + * ARISING FROM, OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR > + * OTHER DEALINGS IN THE SOFTWARE. > + */ > + > +#ifndef _DRM_NETLINK_H_ > +#define _DRM_NETLINK_H_ > + > +#define DRM_GENL_VERSION 1 > +#define DRM_GENL_MCAST_GROUP_NAME_CORR_ERR "drm_corr_err" > +#define DRM_GENL_MCAST_GROUP_NAME_UNCORR_ERR "drm_uncorr_err" > + > +#if defined(__cplusplus) > +extern "C" { > +#endif > + > +/** > + * enum drm_genl_error_cmds - Supported error commands > + * > + */ > +enum drm_genl_error_cmds { > + DRM_CMD_UNSPEC, > + /** > + * @DRM_RAS_CMD_QUERY: Command to list all errors names with config-id in verbose mode. 
> + * In normal mode will list IP blocks, total instances available and error types supported > + */ > + DRM_RAS_CMD_QUERY, > + /** @DRM_RAS_CMD_READ_ONE: Command to get a counter for a specific error */ > + DRM_RAS_CMD_READ_ONE, > + /** @DRM_RAS_CMD_READ_BLOCK: Command to get a counter of specific error type from an IP > + * block > + */ > + DRM_RAS_CMD_READ_BLOCK, > + /** @DRM_RAS_CMD_READ_ALL: Command to get counters of all errors */ > + DRM_RAS_CMD_READ_ALL, > + /** @DRM_RAS_CMD_ERROR_EVENT: Command sent as part of multicast event */ > + DRM_RAS_CMD_ERROR_EVENT, > + > + __DRM_CMD_MAX, > + DRM_CMD_MAX = __DRM_CMD_MAX - 1, > +}; > + > +enum drm_cmd_request_type { > + DRM_RAS_CMD_QUERY_VERBOSE = 1, > + DRM_RAS_CMD_QUERY_NORMAL = 2, > +}; > + > +/** > + * enum drm_error_attr - Attributes to use with drm_genl_error_cmds > + * > + */ > +enum drm_error_attr { > + DRM_ATTR_UNSPEC, > + DRM_ATTR_PAD = DRM_ATTR_UNSPEC, > + /** > + * @DRM_RAS_ATTR_QUERY: Should be used with DRM_RAS_CMD_QUERY, > + * DRM_RAS_CMD_READ_ALL > + */ > + DRM_RAS_ATTR_QUERY, /* NLA_U8 */ > + /** > + * @DRM_RAS_ATTR_READ_ALL: Should be used with DRM_RAS_CMD_READ_ALL > + */ > + DRM_RAS_ATTR_READ_ALL, /* NLA_U8 */ > + /** > + * @DRM_RAS_ATTR_QUERY_REPLY: First Nested attributed sent as a > + * response to DRM_RAS_CMD_QUERY, DRM_RAS_CMD_READ_ALL commands. 
> + */
> + DRM_RAS_ATTR_QUERY_REPLY, /* NLA_NESTED */
> + /** @DRM_RAS_ATTR_ERROR_NAME: Used to pass error name */
> + DRM_RAS_ATTR_ERROR_NAME, /* NLA_NUL_STRING */
> + /** @DRM_RAS_ATTR_ERROR_ID: Used to pass error id, should be used with
> + * DRM_RAS_CMD_READ_ONE, DRM_RAS_CMD_READ_BLOCK
> + */
> + DRM_RAS_ATTR_ERROR_ID, /* NLA_U64 */
> + /** @DRM_RAS_ATTR_ERROR_VALUE: Used to pass error value */
> + DRM_RAS_ATTR_ERROR_VALUE, /* NLA_U64 */
> +
> + __DRM_ATTR_MAX,
> + DRM_ATTR_MAX = __DRM_ATTR_MAX - 1,
> +};
> +
> +#if defined(__cplusplus)
> +}
> +#endif
> +
> +#endif
> diff --git a/meson.build b/meson.build
> index 4efad72cf..c3e5c95d5 100644
> --- a/meson.build
> +++ b/meson.build
> @@ -165,6 +165,10 @@ cairo = dependency('cairo', version : '>1.12.0', required : true)
> libudev = dependency('libudev', required : true)
> glib = dependency('glib-2.0', required : true)
>
> +libnl = dependency('libnl-3.0', required: false)
> +libnl_genl = dependency('libnl-genl-3.0', required: false)
> +libnl_cli = dependency('libnl-cli-3.0', required:false)
> +
> xmlrpc = dependency('xmlrpc', required : false)
> xmlrpc_util = dependency('xmlrpc_util', required : false)
> xmlrpc_client = dependency('xmlrpc_client', required : false)
> diff --git a/tools/drm_ras.c b/tools/drm_ras.c
> new file mode 100644
> index 000000000..68946ff6a
> --- /dev/null
> +++ b/tools/drm_ras.c
> @@ -0,0 +1,428 @@
> +// SPDX-License-Identifier: MIT
> +/*
> + * Copyright © 2021 Intel Corporation

The year should be 2025.

> + */
> +
> +#include <stdio.h>
> +#include <sys/types.h>
> +#include <unistd.h>
> +#include <getopt.h>
> +#include <linux/genetlink.h>
> +#include <netlink/cli/utils.h>

Sort the system headers alphabetically; the one exception is unistd.h, which may come first.
> + > +#include "drm_netlink.h" > +#include "igt_device_scan.h" > + > +#define ARRAY_SIZE(array) (sizeof(array) / sizeof((array)[0])) > + > +struct nl_sock *sock, *mcsock; > +int family_id; > + > +enum opt_val { > + OPT_UNKNOWN = '?', > + OPT_END = -1, > + OPT_DEVICE, > + OPT_CONFIG, > + OPT_VERBOSE, > + OPT_HELP, > +}; > + > +enum cmd_ids { > + INVALID_CMD = -1, > + LIST_ERRORS = 0, > + READ_ONE, > + READ_BLOCK, > + READ_ALL, > + WAIT_ON_EVENT, > + > + __MAX_CMDS, > +}; > + > +static const char * const cmd_names[] = { > + "LIST_ERRORS", > + "READ_ONE", > + "READ_BLOCK", > + "READ_ALL", > + "WAIT_ON_EVENT", > +}; > + > +static void help(char **argv) > +{ > + int i; > + > + printf("Usage: %s command [<command options>]\n", argv[0]); > + printf("commands:\n"); > + > + for (i = 0; i < __MAX_CMDS; i++) { > + switch (i) { > + case LIST_ERRORS: > + printf("%s %s --device=<device filter> --verbose [default normal]\n", argv[0], cmd_names[i]); > + break; > + case READ_ALL: > + case WAIT_ON_EVENT: > + printf("%s %s --device=<device filter>\n", argv[0], cmd_names[i]); > + break; > + case READ_ONE: > + case READ_BLOCK: > + printf("%s %s --device=<device filter> --error_id=<id returned from query>\n", argv[0], cmd_names[i]); > + break; > + } > + } > + > + igt_device_print_filter_types(); > +} > + > +static int list_errors(struct nl_cache_ops *ops, struct genl_cmd *cmd, > + struct genl_info *info, void *arg) > +{ > + const struct nlmsghdr *nlh = info->nlh; > + struct nlattr *nla; > + int len, remain; > + > + len = GENL_HDRLEN; > + > + nlmsg_for_each_attr(nla, nlh, len, remain) { > + if ((nla_type(nla) == DRM_RAS_ATTR_QUERY_REPLY) && nla_is_nested(nla)) { > + struct nlattr *cur; > + int rem; > + > + if (cmd->c_id == DRM_RAS_CMD_READ_ALL) > + printf("%-50s\t%-18s\t%s\n", "name", "config-id", "counter"); > + else > + printf("%-50s\t%-18s\n", "name", "config-id"); > + > + nla_for_each_nested(cur, nla, rem) { > + switch (nla_type(cur)) { > + case DRM_RAS_ATTR_ERROR_NAME: > + 
printf("\n%-50s", nla_get_string(cur)); > + break; > + case DRM_RAS_ATTR_ERROR_ID: > + printf("\t0x%016lx", nla_get_u64(cur)); > + break; > + case DRM_RAS_ATTR_ERROR_VALUE: > + printf("\t%lu", nla_get_u64(cur)); > + break; > + default: > + break; > + } > + } > + printf("\n"); > + } > + } > + > + return NL_OK; > +} > + > +static int read_single(struct nl_cache_ops *ops, struct genl_cmd *cmd, > + struct genl_info *info, void *arg) > +{ > + if (!info->attrs[DRM_RAS_ATTR_ERROR_VALUE]) > + nl_cli_fatal(NLE_FAILURE, "DRM_RAS_ATTR_ERROR_VALUE attribute is missing"); > + > + printf("counter value %lu\n", nla_get_u64(info->attrs[DRM_RAS_ATTR_ERROR_VALUE])); > + > + return NL_OK; > +} > + > +static int mcast_event_handler(struct nl_cache_ops *ops, struct genl_cmd *cmd, > + struct genl_info *info, void *arg) > +{ > + struct nl_msg *msg; > + uint64_t config = 0x0000000000000005; /* error-gt0-correctable-eu-grf */ > + void *msg_head; > + int ret; > + > + printf("error event received\n"); > + > + msg = nlmsg_alloc(); > + if (!msg) > + nl_cli_fatal(NLE_INVAL, "nlmsg_alloc failed\n"); > + > + msg_head = genlmsg_put(msg, NL_AUTO_PORT, NL_AUTO_SEQ, family_id, 0, 0, > + DRM_RAS_CMD_READ_ONE, 1); > + if (!msg_head) > + nl_cli_fatal(ENOMEM, "genlmsg_put failed\n"); > + > + nla_put_u64(msg, DRM_RAS_ATTR_ERROR_ID, config); > + > + ret = nl_send_auto(sock, msg); > + if (ret < 0) > + nl_cli_fatal(ret, "Unable to send message: %s", nl_geterror(ret)); > + > + ret = nl_recvmsgs_default(sock); > + if (ret < 0) > + nl_cli_fatal(ret, "Unable to receive message: %s", nl_geterror(ret)); > + > + nlmsg_free(msg); > + > + return NL_OK; > +} > + > +static struct nla_policy drm_genl_policy[DRM_ATTR_MAX + 1] = { > + [DRM_RAS_ATTR_QUERY] = { .type = NLA_U8 }, > + [DRM_RAS_ATTR_READ_ALL] = { .type = NLA_U8 }, > + [DRM_RAS_ATTR_QUERY_REPLY] = { .type = NLA_NESTED }, > + [DRM_RAS_ATTR_ERROR_NAME] = { .type = NLA_NUL_STRING }, > + [DRM_RAS_ATTR_ERROR_ID] = { .type = NLA_U64 }, > + [DRM_RAS_ATTR_ERROR_VALUE] 
= { .type = NLA_U64 }, > +}; > + > +static struct genl_cmd drm_genl_cmds[] = { > + { > + .c_id = DRM_RAS_CMD_QUERY, > + .c_name = "QUERY", > + .c_maxattr = DRM_ATTR_MAX, > + .c_attr_policy = drm_genl_policy, > + .c_msg_parser = list_errors, > + }, > + { > + .c_id = DRM_RAS_CMD_READ_ONE, > + .c_name = "READ_1", > + .c_maxattr = DRM_ATTR_MAX, > + .c_attr_policy = drm_genl_policy, > + .c_msg_parser = read_single, > + }, > + { > + .c_id = DRM_RAS_CMD_READ_BLOCK, > + .c_name = "READ_BLOCK", > + .c_maxattr = DRM_ATTR_MAX, > + .c_attr_policy = drm_genl_policy, > + .c_msg_parser = read_single, > + }, > + { > + .c_id = DRM_RAS_CMD_READ_ALL, > + .c_name = "READ_ALL", > + .c_maxattr = DRM_ATTR_MAX, > + .c_attr_policy = drm_genl_policy, > + .c_msg_parser = list_errors, > + }, > + { > + .c_id = DRM_RAS_CMD_ERROR_EVENT, > + .c_name = "ERROR_EVENT", > + .c_maxattr = DRM_ATTR_MAX, > + .c_attr_policy = drm_genl_policy, > + .c_msg_parser = mcast_event_handler, > + }, > +}; > + > +static struct genl_ops drm_genl_ops = { > + .o_hdrsize = 0, > + .o_cmds = drm_genl_cmds, > + .o_ncmds = ARRAY_SIZE(drm_genl_cmds), > +}; > + > +static void send_cmd(int cmd, uint64_t config) > +{ > + struct nl_msg *msg; > + void *msg_head; > + int ret; > + > + msg = nlmsg_alloc(); > + if (!msg) > + nl_cli_fatal(NLE_INVAL, "nlmsg_alloc failed\n"); > + > + msg_head = genlmsg_put(msg, NL_AUTO_PORT, NL_AUTO_SEQ, family_id, 0, 0, cmd, 1); > + if (!msg_head) > + nl_cli_fatal(ENOMEM, "genlmsg_put failed\n"); > + switch (cmd) { > + case DRM_RAS_CMD_QUERY: > + nla_put_u8(msg, DRM_RAS_ATTR_QUERY, config ? 
DRM_RAS_CMD_QUERY_VERBOSE : > + DRM_RAS_CMD_QUERY_NORMAL); > + break; > + case DRM_RAS_CMD_READ_ONE: > + case DRM_RAS_CMD_READ_BLOCK: > + nla_put_u64(msg, DRM_RAS_ATTR_ERROR_ID, config); > + break; > + case DRM_RAS_CMD_READ_ALL: > + nla_put_u8(msg, DRM_RAS_ATTR_READ_ALL, 1); > + break; > + default: > + break; > + } > + > + ret = nl_send_auto(sock, msg); > + if (ret < 0) > + nl_cli_fatal(ret, "Unable to send message: %s", nl_geterror(ret)); > + > + ret = nl_recvmsgs_default(sock); > + if (ret < 0) > + nl_cli_fatal(ret, "Unable to receive message: %s", nl_geterror(ret)); > + > + nlmsg_free(msg); > +} > + > +static int get_cmd(char *cmd_name) > +{ > + int i; > + > + if (!cmd_name) > + return -1; > + > + for (i = 0; i < __MAX_CMDS; i++) { > + if (strcasecmp(cmd_name, cmd_names[i]) == 0) > + return i; > + } > + > + return -1; > +} > + > +int main(int argc, char **argv) > +{ > + char *endptr; > + enum opt_val val; > + enum cmd_ids cmd; > + char *device = NULL; > + uint64_t error_config_id; > + bool verbose = false; > + int ret, mcgrp, index; > + struct igt_device_card card; > + char *dev_name, *dup; > + > + static struct option options[] = { > + {"device", required_argument, NULL, OPT_DEVICE}, > + {"error_id", required_argument, NULL, OPT_CONFIG}, > + {"verbose", no_argument, NULL, OPT_VERBOSE}, > + {"help", no_argument, NULL, OPT_HELP}, > + { 0 } > + }; > + > + cmd = get_cmd(argv[1]); > + if (cmd < 0) { > + fprintf(stderr, "invalid command\n"); > + help(argv); > + exit(EXIT_FAILURE); > + } > + > + for (val = 0; val != OPT_END; ) { > + val = getopt_long(argc, argv, "", options, &index); > + > + switch (val) { > + case OPT_DEVICE: > + device = strdup(optarg); > + break; > + case OPT_CONFIG: > + error_config_id = strtoull(optarg, &endptr, 16); > + if (*endptr) { > + fprintf(stderr, "invalid config id %s\n", optarg); > + exit(EXIT_FAILURE); > + } > + break; > + case OPT_VERBOSE: > + verbose = true; > + break; > + case OPT_HELP: > + help(argv); > + exit(EXIT_FAILURE); > + 
case OPT_END: > + break; > + case OPT_UNKNOWN: > + exit(EXIT_FAILURE); > + } > + } > + > + if (!device) { > + fprintf(stderr, "missing device option\n"); > + help(argv); > + exit(EXIT_FAILURE); > + } else { > + ret = igt_device_card_match_pci(device, &card); > + if (!ret) { > + fprintf(stderr, "device %s not found!\n", device); > + exit(EXIT_FAILURE); > + } > + free(device); > + } > + > + /* get card name */ > + dup = strdup(card.card); > + > + while (dup) > + dev_name = strsep(&dup, "/"); > + free(dup); > + > + drm_genl_ops.o_name = strdup(dev_name); > + > + sock = nl_cli_alloc_socket(); > + if (!sock) > + nl_cli_fatal(NLE_NOMEM, "Cannot allocate nl_sock"); > + > + ret = nl_cli_connect(sock, NETLINK_GENERIC); > + if (ret < 0) > + nl_cli_fatal(ret, "Cannot connect handle"); > + > + ret = genl_register_family(&drm_genl_ops); > + if (ret < 0) > + nl_cli_fatal(ret, "Cannot register xe family"); > + > + ret = genl_ops_resolve(sock, &drm_genl_ops); > + if (ret < 0) > + nl_cli_fatal(ret, "Unable to resolve family name"); > + > + family_id = genl_ctrl_resolve(sock, drm_genl_ops.o_name); > + if (family_id < 0) > + nl_cli_fatal(NLE_INVAL, "Resolving of \"%s\" failed", drm_genl_ops.o_name); > + > + ret = nl_socket_modify_cb(sock, NL_CB_VALID, NL_CB_CUSTOM, genl_handle_msg, NULL); > + if (ret < 0) > + nl_cli_fatal(ret, "Unable to modify valid message callback"); > + > + switch (cmd) { > + case LIST_ERRORS: > + send_cmd(DRM_RAS_CMD_QUERY, verbose); > + break; > + case READ_ONE: > + send_cmd(DRM_RAS_CMD_READ_ONE, error_config_id); > + break; > + case READ_BLOCK: > + send_cmd(DRM_RAS_CMD_READ_BLOCK, error_config_id); > + break; > + case READ_ALL: > + send_cmd(DRM_RAS_CMD_READ_ALL, 0); > + break; > + case WAIT_ON_EVENT: > + mcsock = nl_cli_alloc_socket(); > + if (!mcsock) > + nl_cli_fatal(NLE_NOMEM, "Cannot allocate nl_sock"); > + > + ret = nl_cli_connect(mcsock, NETLINK_GENERIC); > + if (ret < 0) > + nl_cli_fatal(ret, "Cannot connect handle"); > + > + ret = 
genl_ops_resolve(mcsock, &drm_genl_ops);
> + if (ret < 0)
> + nl_cli_fatal(ret, "Unable to resolve family name");
> +
> + nl_socket_disable_seq_check(mcsock);
> +
> + mcgrp = genl_ctrl_resolve_grp(mcsock, drm_genl_ops.o_name,
> + DRM_GENL_MCAST_GROUP_NAME_CORR_ERR);
> + if (mcgrp < 0)
> + nl_cli_fatal(mcgrp, "failed to resolve generic netlink multicast group");
> +
> + /* Join the multicast group. */
> + ret = nl_socket_add_membership(mcsock, mcgrp);
> + if (ret < 0)
> + nl_cli_fatal(ret, "failed to join multicast group");
> +
> + ret = nl_socket_modify_cb(mcsock, NL_CB_VALID, NL_CB_CUSTOM, genl_handle_msg, NULL);
> + if (ret < 0)
> + nl_cli_fatal(ret, "Unable to modify valid message callback");
> +
> + printf("waiting for error event\n");
> + ret = nl_recvmsgs_default(mcsock);
> + if (ret < 0)
> + nl_cli_fatal(ret, "Unable to receive message: %s", nl_geterror(ret));
> +
> + nl_close(mcsock);
> + nl_socket_free(mcsock);
> + break;
> + default:
> + break;
> + }
> +
> + nl_close(sock);
> + nl_socket_free(sock);
> +
> + return 0;
> +}
> +
> diff --git a/tools/meson.build b/tools/meson.build
> index 99a732942..5195d1f62 100644
> --- a/tools/meson.build
> +++ b/tools/meson.build
> @@ -115,6 +115,11 @@ if build_vmtb
> install_subdir('vmtb', install_dir: libexecdir)
> endif
>
> +executable('drm_ras', 'drm_ras.c',
> + dependencies : [tool_deps, libnl, libnl_cli, libnl_genl],
> + install_rpath : bindir_rpathdir,
> + install : true)
> +

Build drm_ras only if the libnl dependencies are present.

Regards,
Kamil

> subdir('i915-perf')
> subdir('xe-perf')
> subdir('null_state_gen')
> --
> 2.25.1
>

^ permalink raw reply [flat|nested] 7+ messages in thread
* Re: [RFC i-g-t v3 1/1] tools/RAS: A tool to read error counters 2025-07-30 6:13 ` [RFC i-g-t v3 1/1] tools/RAS: A tool to read " Aravind Iddamsetty 2025-08-13 12:46 ` Kamil Konieczny @ 2025-08-15 22:13 ` Rodrigo Vivi 1 sibling, 0 replies; 7+ messages in thread From: Rodrigo Vivi @ 2025-08-15 22:13 UTC (permalink / raw) To: Aravind Iddamsetty Cc: igt-dev, Alex Deucher, Simona Vetter, David Airlie, Joonas Lahtinen, Hawking Zhang, Lijo Lazar, Riana Tauro, Anshuman Gupta On Wed, Jul 30, 2025 at 11:43:42AM +0530, Aravind Iddamsetty wrote: > This tool demonstrates the use of netlink sockets to query and read the > error counters on a hardware. It provides following commands LIST_ERRORS, > READ_ONE, READ_ALL to read counters and WAIT_ON_EVENT to wait for > occurrence on a particular event, presently hardcoded to wait on > occurrence of correctable error event and read a error counter. > > v2: update uapi header. > v3: Add DRM_RAS_CMD_READ_BLOCK command to read errors from an IP Block. > > Signed-off-by: Aravind Iddamsetty <aravind.iddamsetty@linux.intel.com> > --- > include/drm-uapi/drm_netlink.h | 105 ++++++++ > meson.build | 4 + > tools/drm_ras.c | 428 +++++++++++++++++++++++++++++++++ > tools/meson.build | 5 + > 4 files changed, 542 insertions(+) > create mode 100644 include/drm-uapi/drm_netlink.h > create mode 100644 tools/drm_ras.c > > diff --git a/include/drm-uapi/drm_netlink.h b/include/drm-uapi/drm_netlink.h > new file mode 100644 > index 000000000..c978efaab > --- /dev/null > +++ b/include/drm-uapi/drm_netlink.h > @@ -0,0 +1,105 @@ > +/* SPDX-License-Identifier: MIT */ > +/* > + * Copyright 2023 Intel Corporation > + * > + * Permission is hereby granted, free of charge, to any person obtaining a > + * copy of this software and associated documentation files (the "Software"), > + * to deal in the Software without restriction, including without limitation > + * the rights to use, copy, modify, merge, publish, distribute, sublicense, > + * and/or sell copies of 
the Software, and to permit persons to whom the > + * Software is furnished to do so, subject to the following conditions: > + * > + * The above copyright notice and this permission notice (including the next > + * paragraph) shall be included in all copies or substantial portions of the > + * Software. > + * > + * THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR > + * IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY, > + * FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL > + * VA LINUX SYSTEMS AND/OR ITS SUPPLIERS BE LIABLE FOR ANY CLAIM, DAMAGES OR > + * OTHER LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, > + * ARISING FROM, OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR > + * OTHER DEALINGS IN THE SOFTWARE. > + */ > + > +#ifndef _DRM_NETLINK_H_ > +#define _DRM_NETLINK_H_ > + > +#define DRM_GENL_VERSION 1 > +#define DRM_GENL_MCAST_GROUP_NAME_CORR_ERR "drm_corr_err" > +#define DRM_GENL_MCAST_GROUP_NAME_UNCORR_ERR "drm_uncorr_err" > + > +#if defined(__cplusplus) > +extern "C" { > +#endif > + > +/** > + * enum drm_genl_error_cmds - Supported error commands > + * > + */ > +enum drm_genl_error_cmds { > + DRM_CMD_UNSPEC, > + /** > + * @DRM_RAS_CMD_QUERY: Command to list all errors names with config-id in verbose mode. 
> + * In normal mode will list IP blocks, total instances available and error types supported > + */ > + DRM_RAS_CMD_QUERY, > + /** @DRM_RAS_CMD_READ_ONE: Command to get a counter for a specific error */ > + DRM_RAS_CMD_READ_ONE, > + /** @DRM_RAS_CMD_READ_BLOCK: Command to get a counter of specific error type from an IP > + * block > + */ > + DRM_RAS_CMD_READ_BLOCK, > + /** @DRM_RAS_CMD_READ_ALL: Command to get counters of all errors */ > + DRM_RAS_CMD_READ_ALL, > + /** @DRM_RAS_CMD_ERROR_EVENT: Command sent as part of multicast event */ > + DRM_RAS_CMD_ERROR_EVENT, > + > + __DRM_CMD_MAX, > + DRM_CMD_MAX = __DRM_CMD_MAX - 1, > +}; > + > +enum drm_cmd_request_type { > + DRM_RAS_CMD_QUERY_VERBOSE = 1, > + DRM_RAS_CMD_QUERY_NORMAL = 2, > +}; > + > +/** > + * enum drm_error_attr - Attributes to use with drm_genl_error_cmds > + * > + */ > +enum drm_error_attr { > + DRM_ATTR_UNSPEC, > + DRM_ATTR_PAD = DRM_ATTR_UNSPEC, > + /** > + * @DRM_RAS_ATTR_QUERY: Should be used with DRM_RAS_CMD_QUERY, > + * DRM_RAS_CMD_READ_ALL > + */ > + DRM_RAS_ATTR_QUERY, /* NLA_U8 */ > + /** > + * @DRM_RAS_ATTR_READ_ALL: Should be used with DRM_RAS_CMD_READ_ALL > + */ > + DRM_RAS_ATTR_READ_ALL, /* NLA_U8 */ > + /** > + * @DRM_RAS_ATTR_QUERY_REPLY: First Nested attributed sent as a > + * response to DRM_RAS_CMD_QUERY, DRM_RAS_CMD_READ_ALL commands. 
> + */
> + DRM_RAS_ATTR_QUERY_REPLY, /* NLA_NESTED */
> + /** @DRM_RAS_ATTR_ERROR_NAME: Used to pass error name */
> + DRM_RAS_ATTR_ERROR_NAME, /* NLA_NUL_STRING */
> + /** @DRM_RAS_ATTR_ERROR_ID: Used to pass error id, should be used with
> + * DRM_RAS_CMD_READ_ONE, DRM_RAS_CMD_READ_BLOCK
> + */
> + DRM_RAS_ATTR_ERROR_ID, /* NLA_U64 */
> + /** @DRM_RAS_ATTR_ERROR_VALUE: Used to pass error value */
> + DRM_RAS_ATTR_ERROR_VALUE, /* NLA_U64 */
> +
> + __DRM_ATTR_MAX,
> + DRM_ATTR_MAX = __DRM_ATTR_MAX - 1,
> +};
> +
> +#if defined(__cplusplus)
> +}
> +#endif
> +
> +#endif
> diff --git a/meson.build b/meson.build
> index 4efad72cf..c3e5c95d5 100644
> --- a/meson.build
> +++ b/meson.build
> @@ -165,6 +165,10 @@ cairo = dependency('cairo', version : '>1.12.0', required : true)
> libudev = dependency('libudev', required : true)
> glib = dependency('glib-2.0', required : true)
>
> +libnl = dependency('libnl-3.0', required: false)
> +libnl_genl = dependency('libnl-genl-3.0', required: false)
> +libnl_cli = dependency('libnl-cli-3.0', required:false)

Oh! I just noticed these are declared here, but I'm afraid it doesn't work
from here... or perhaps it is 'required' that needs to be set to true?
Take a look at my comment further down, in the other meson.build file.

> +
> xmlrpc = dependency('xmlrpc', required : false)
> xmlrpc_util = dependency('xmlrpc_util', required : false)
> xmlrpc_client = dependency('xmlrpc_client', required : false)
> diff --git a/tools/drm_ras.c b/tools/drm_ras.c
> new file mode 100644
> index 000000000..68946ff6a
> --- /dev/null
> +++ b/tools/drm_ras.c
> @@ -0,0 +1,428 @@
> +// SPDX-License-Identifier: MIT
> +/*
> + * Copyright © 2021 Intel Corporation
> + */
> +
> +#include <stdio.h>
> +#include <sys/types.h>
> +#include <unistd.h>
> +#include <getopt.h>
> +#include <linux/genetlink.h>
> +#include <netlink/cli/utils.h>
> +
> +#include "drm_netlink.h"
> +#include "igt_device_scan.h"
> +
> +#define ARRAY_SIZE(array) (sizeof(array) / sizeof((array)[0]))
> +
> +struct nl_sock *sock, *mcsock;
> +int family_id;
> +
> +enum opt_val {
> + OPT_UNKNOWN = '?',
> + OPT_END = -1,
> + OPT_DEVICE,
> + OPT_CONFIG,
> + OPT_VERBOSE,
> + OPT_HELP,
> +};
> +
> +enum cmd_ids {
> + INVALID_CMD = -1,
> + LIST_ERRORS = 0,
> + READ_ONE,
> + READ_BLOCK,
> + READ_ALL,
> + WAIT_ON_EVENT,
> +
> + __MAX_CMDS,
> +};
> +
> +static const char * const cmd_names[] = {
> + "LIST_ERRORS",
> + "READ_ONE",
> + "READ_BLOCK",
> + "READ_ALL",
> + "WAIT_ON_EVENT",
> +};

Please, let's avoid caps in the command names.

> +
> +static void help(char **argv)
> +{
> + int i;
> +
> + printf("Usage: %s command [<command options>]\n", argv[0]);
> + printf("commands:\n");
> +
> + for (i = 0; i < __MAX_CMDS; i++) {
> + switch (i) {
> + case LIST_ERRORS:
> + printf("%s %s --device=<device filter> --verbose [default normal]\n", argv[0], cmd_names[i]);
> + break;
> + case READ_ALL:
> + case WAIT_ON_EVENT:
> + printf("%s %s --device=<device filter>\n", argv[0], cmd_names[i]);
> + break;
> + case READ_ONE:
> + case READ_BLOCK:
> + printf("%s %s --device=<device filter> --error_id=<id returned from query>\n", argv[0], cmd_names[i]);
> + break;
> + }
> + }
> +
> + igt_device_print_filter_types();
> +}
> + > +static int list_errors(struct nl_cache_ops *ops, struct genl_cmd *cmd, > + struct genl_info *info, void *arg) > +{ > + const struct nlmsghdr *nlh = info->nlh; > + struct nlattr *nla; > + int len, remain; > + > + len = GENL_HDRLEN; > + > + nlmsg_for_each_attr(nla, nlh, len, remain) { > + if ((nla_type(nla) == DRM_RAS_ATTR_QUERY_REPLY) && nla_is_nested(nla)) { > + struct nlattr *cur; > + int rem; > + > + if (cmd->c_id == DRM_RAS_CMD_READ_ALL) > + printf("%-50s\t%-18s\t%s\n", "name", "config-id", "counter"); > + else > + printf("%-50s\t%-18s\n", "name", "config-id"); > + > + nla_for_each_nested(cur, nla, rem) { > + switch (nla_type(cur)) { > + case DRM_RAS_ATTR_ERROR_NAME: > + printf("\n%-50s", nla_get_string(cur)); > + break; > + case DRM_RAS_ATTR_ERROR_ID: > + printf("\t0x%016lx", nla_get_u64(cur)); > + break; > + case DRM_RAS_ATTR_ERROR_VALUE: > + printf("\t%lu", nla_get_u64(cur)); > + break; > + default: > + break; > + } > + } > + printf("\n"); > + } > + } > + > + return NL_OK; > +} > + > +static int read_single(struct nl_cache_ops *ops, struct genl_cmd *cmd, > + struct genl_info *info, void *arg) > +{ > + if (!info->attrs[DRM_RAS_ATTR_ERROR_VALUE]) > + nl_cli_fatal(NLE_FAILURE, "DRM_RAS_ATTR_ERROR_VALUE attribute is missing"); > + > + printf("counter value %lu\n", nla_get_u64(info->attrs[DRM_RAS_ATTR_ERROR_VALUE])); > + > + return NL_OK; > +} > + > +static int mcast_event_handler(struct nl_cache_ops *ops, struct genl_cmd *cmd, > + struct genl_info *info, void *arg) > +{ > + struct nl_msg *msg; > + uint64_t config = 0x0000000000000005; /* error-gt0-correctable-eu-grf */ > + void *msg_head; > + int ret; > + > + printf("error event received\n"); > + > + msg = nlmsg_alloc(); > + if (!msg) > + nl_cli_fatal(NLE_INVAL, "nlmsg_alloc failed\n"); > + > + msg_head = genlmsg_put(msg, NL_AUTO_PORT, NL_AUTO_SEQ, family_id, 0, 0, > + DRM_RAS_CMD_READ_ONE, 1); > + if (!msg_head) > + nl_cli_fatal(ENOMEM, "genlmsg_put failed\n"); > + > + nla_put_u64(msg, 
DRM_RAS_ATTR_ERROR_ID, config); > + > + ret = nl_send_auto(sock, msg); > + if (ret < 0) > + nl_cli_fatal(ret, "Unable to send message: %s", nl_geterror(ret)); > + > + ret = nl_recvmsgs_default(sock); > + if (ret < 0) > + nl_cli_fatal(ret, "Unable to receive message: %s", nl_geterror(ret)); > + > + nlmsg_free(msg); > + > + return NL_OK; > +} > + > +static struct nla_policy drm_genl_policy[DRM_ATTR_MAX + 1] = { > + [DRM_RAS_ATTR_QUERY] = { .type = NLA_U8 }, > + [DRM_RAS_ATTR_READ_ALL] = { .type = NLA_U8 }, > + [DRM_RAS_ATTR_QUERY_REPLY] = { .type = NLA_NESTED }, > + [DRM_RAS_ATTR_ERROR_NAME] = { .type = NLA_NUL_STRING }, > + [DRM_RAS_ATTR_ERROR_ID] = { .type = NLA_U64 }, > + [DRM_RAS_ATTR_ERROR_VALUE] = { .type = NLA_U64 }, > +}; > + > +static struct genl_cmd drm_genl_cmds[] = { > + { > + .c_id = DRM_RAS_CMD_QUERY, > + .c_name = "QUERY", > + .c_maxattr = DRM_ATTR_MAX, > + .c_attr_policy = drm_genl_policy, > + .c_msg_parser = list_errors, > + }, > + { > + .c_id = DRM_RAS_CMD_READ_ONE, > + .c_name = "READ_1", > + .c_maxattr = DRM_ATTR_MAX, > + .c_attr_policy = drm_genl_policy, > + .c_msg_parser = read_single, > + }, > + { > + .c_id = DRM_RAS_CMD_READ_BLOCK, > + .c_name = "READ_BLOCK", > + .c_maxattr = DRM_ATTR_MAX, > + .c_attr_policy = drm_genl_policy, > + .c_msg_parser = read_single, > + }, > + { > + .c_id = DRM_RAS_CMD_READ_ALL, > + .c_name = "READ_ALL", > + .c_maxattr = DRM_ATTR_MAX, > + .c_attr_policy = drm_genl_policy, > + .c_msg_parser = list_errors, > + }, > + { > + .c_id = DRM_RAS_CMD_ERROR_EVENT, > + .c_name = "ERROR_EVENT", > + .c_maxattr = DRM_ATTR_MAX, > + .c_attr_policy = drm_genl_policy, > + .c_msg_parser = mcast_event_handler, > + }, > +}; > + > +static struct genl_ops drm_genl_ops = { > + .o_hdrsize = 0, > + .o_cmds = drm_genl_cmds, > + .o_ncmds = ARRAY_SIZE(drm_genl_cmds), > +}; > + > +static void send_cmd(int cmd, uint64_t config) > +{ > + struct nl_msg *msg; > + void *msg_head; > + int ret; > + > + msg = nlmsg_alloc(); > + if (!msg) > + 
nl_cli_fatal(NLE_INVAL, "nlmsg_alloc failed\n");
> +
> + msg_head = genlmsg_put(msg, NL_AUTO_PORT, NL_AUTO_SEQ, family_id, 0, 0, cmd, 1);
> + if (!msg_head)
> + nl_cli_fatal(ENOMEM, "genlmsg_put failed\n");
> + switch (cmd) {
> + case DRM_RAS_CMD_QUERY:
> + nla_put_u8(msg, DRM_RAS_ATTR_QUERY, config ? DRM_RAS_CMD_QUERY_VERBOSE :
> + DRM_RAS_CMD_QUERY_NORMAL);
> + break;
> + case DRM_RAS_CMD_READ_ONE:
> + case DRM_RAS_CMD_READ_BLOCK:
> + nla_put_u64(msg, DRM_RAS_ATTR_ERROR_ID, config);
> + break;
> + case DRM_RAS_CMD_READ_ALL:
> + nla_put_u8(msg, DRM_RAS_ATTR_READ_ALL, 1);
> + break;
> + default:
> + break;
> + }
> +
> + ret = nl_send_auto(sock, msg);
> + if (ret < 0)
> + nl_cli_fatal(ret, "Unable to send message: %s", nl_geterror(ret));
> +
> + ret = nl_recvmsgs_default(sock);
> + if (ret < 0)
> + nl_cli_fatal(ret, "Unable to receive message: %s", nl_geterror(ret));
> +
> + nlmsg_free(msg);
> +}
> +
> +static int get_cmd(char *cmd_name)
> +{
> + int i;
> +
> + if (!cmd_name)
> + return -1;
> +
> + for (i = 0; i < __MAX_CMDS; i++) {
> + if (strcasecmp(cmd_name, cmd_names[i]) == 0)
> + return i;
> + }
> +
> + return -1;
> +}
> +
> +int main(int argc, char **argv)
> +{
> + char *endptr;
> + enum opt_val val;
> + enum cmd_ids cmd;
> + char *device = NULL;
> + uint64_t error_config_id;
> + bool verbose = false;
> + int ret, mcgrp, index;
> + struct igt_device_card card;
> + char *dev_name, *dup;
> +
> + static struct option options[] = {
> + {"device", required_argument, NULL, OPT_DEVICE},

Indeed, device is required, so let's make it a positional argument
instead of part of this list and avoid the '--device='...

Also, 'drm' should be part of the family name and not required as an extra
argument here.
> +		{"error_id", required_argument, NULL, OPT_CONFIG},
> +		{"verbose", no_argument, NULL, OPT_VERBOSE},
> +		{"help", no_argument, NULL, OPT_HELP},
> +		{ 0 }
> +	};
> +
> +	cmd = get_cmd(argv[1]);
> +	if (cmd < 0) {
> +		fprintf(stderr, "invalid command\n");
> +		help(argv);
> +		exit(EXIT_FAILURE);
> +	}
> +
> +	for (val = 0; val != OPT_END; ) {
> +		val = getopt_long(argc, argv, "", options, &index);
> +
> +		switch (val) {
> +		case OPT_DEVICE:
> +			device = strdup(optarg);
> +			break;
> +		case OPT_CONFIG:
> +			error_config_id = strtoull(optarg, &endptr, 16);
> +			if (*endptr) {
> +				fprintf(stderr, "invalid config id %s\n", optarg);
> +				exit(EXIT_FAILURE);
> +			}
> +			break;
> +		case OPT_VERBOSE:
> +			verbose = true;
> +			break;
> +		case OPT_HELP:
> +			help(argv);
> +			exit(EXIT_FAILURE);
> +		case OPT_END:
> +			break;
> +		case OPT_UNKNOWN:
> +			exit(EXIT_FAILURE);
> +		}
> +	}
> +
> +	if (!device) {
> +		fprintf(stderr, "missing device option\n");
> +		help(argv);
> +		exit(EXIT_FAILURE);
> +	} else {
> +		ret = igt_device_card_match_pci(device, &card);
> +		if (!ret) {
> +			fprintf(stderr, "device %s not found!\n", device);
> +			exit(EXIT_FAILURE);
> +		}
> +		free(device);
> +	}
> +
> +	/* get card name */
> +	dup = strdup(card.card);
> +
> +	while (dup)
> +		dev_name = strsep(&dup, "/");
> +	free(dup);
> +
> +	drm_genl_ops.o_name = strdup(dev_name);
> +
> +	sock = nl_cli_alloc_socket();
> +	if (!sock)
> +		nl_cli_fatal(NLE_NOMEM, "Cannot allocate nl_sock");
> +
> +	ret = nl_cli_connect(sock, NETLINK_GENERIC);
> +	if (ret < 0)
> +		nl_cli_fatal(ret, "Cannot connect handle");
> +
> +	ret = genl_register_family(&drm_genl_ops);
> +	if (ret < 0)
> +		nl_cli_fatal(ret, "Cannot register xe family");
> +
> +	ret = genl_ops_resolve(sock, &drm_genl_ops);
> +	if (ret < 0)
> +		nl_cli_fatal(ret, "Unable to resolve family name");
> +
> +	family_id = genl_ctrl_resolve(sock, drm_genl_ops.o_name);
> +	if (family_id < 0)
> +		nl_cli_fatal(NLE_INVAL, "Resolving of \"%s\" failed", drm_genl_ops.o_name);
> +
> +	ret = nl_socket_modify_cb(sock, NL_CB_VALID, NL_CB_CUSTOM, genl_handle_msg, NULL);
> +	if (ret < 0)
> +		nl_cli_fatal(ret, "Unable to modify valid message callback");
> +
> +	switch (cmd) {
> +	case LIST_ERRORS:
> +		send_cmd(DRM_RAS_CMD_QUERY, verbose);
> +		break;
> +	case READ_ONE:
> +		send_cmd(DRM_RAS_CMD_READ_ONE, error_config_id);
> +		break;
> +	case READ_BLOCK:
> +		send_cmd(DRM_RAS_CMD_READ_BLOCK, error_config_id);
> +		break;
> +	case READ_ALL:
> +		send_cmd(DRM_RAS_CMD_READ_ALL, 0);
> +		break;
> +	case WAIT_ON_EVENT:
> +		mcsock = nl_cli_alloc_socket();
> +		if (!mcsock)
> +			nl_cli_fatal(NLE_NOMEM, "Cannot allocate nl_sock");
> +
> +		ret = nl_cli_connect(mcsock, NETLINK_GENERIC);
> +		if (ret < 0)
> +			nl_cli_fatal(ret, "Cannot connect handle");
> +
> +		ret = genl_ops_resolve(mcsock, &drm_genl_ops);
> +		if (ret < 0)
> +			nl_cli_fatal(ret, "Unable to resolve family name");
> +
> +		nl_socket_disable_seq_check(mcsock);
> +
> +		mcgrp = genl_ctrl_resolve_grp(mcsock, drm_genl_ops.o_name,
> +					      DRM_GENL_MCAST_GROUP_NAME_CORR_ERR);
> +		if (mcgrp < 0)
> +			nl_cli_fatal(mcgrp, "failed to resolve generic netlink multicast group");
> +
> +		/* Join the multicast group. */
> +		ret = nl_socket_add_membership(mcsock, mcgrp);
> +		if (ret < 0)
> +			nl_cli_fatal(ret, "failed to join multicast group");
> +
> +		ret = nl_socket_modify_cb(mcsock, NL_CB_VALID, NL_CB_CUSTOM, genl_handle_msg, NULL);
> +		if (ret < 0)
> +			nl_cli_fatal(ret, "Unable to modify valid message callback");
> +
> +		printf("waiting for error event\n");
> +		ret = nl_recvmsgs_default(mcsock);
> +		if (ret < 0)
> +			nl_cli_fatal(ret, "Unable to receive message: %s", nl_geterror(ret));
> +
> +		nl_close(mcsock);
> +		nl_socket_free(mcsock);
> +		break;
> +	default:
> +		break;
> +	}
> +
> +	nl_close(sock);
> +	nl_socket_free(sock);
> +
> +	return 0;
> +}
> +
> diff --git a/tools/meson.build b/tools/meson.build
> index 99a732942..5195d1f62 100644
> --- a/tools/meson.build
> +++ b/tools/meson.build
> @@ -115,6 +115,11 @@ if build_vmtb
>  	install_subdir('vmtb', install_dir: libexecdir)
>  endif
>

Please add this here:

libnl = dependency('libnl-3.0', required: true)
libnl_cli = dependency('libnl-cli-3.0', required: true)
libnl_genl = dependency('libnl-genl-3.0', required: true)

it took me a very long time to understand what was going on here on my fedora build.

> +executable('drm_ras', 'drm_ras.c',

executable('drmras',

to make user's life easier when typing the command...

> +	   dependencies : [tool_deps, libnl, libnl_cli, libnl_genl],
> +	   install_rpath : bindir_rpathdir,
> +	   install : true)
> +
>  subdir('i915-perf')
>  subdir('xe-perf')
>  subdir('null_state_gen')
> --
> 2.25.1
>
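A note on two argument-handling steps in the quoted main(): the --error_id parsing only checks *endptr, so an empty string or an out-of-range value is accepted, and the "get card name" loop leaves dup set to NULL, so free(dup) is a no-op and the strdup() buffer is never released. The sketch below is one possible way to tighten both; it is not part of the posted patch. The helper names parse_error_id() and card_basename() are hypothetical, and it assumes card.card holds a path such as /dev/dri/card1, as the strsep() loop suggests.

```c
#include <errno.h>
#include <stdbool.h>
#include <stdint.h>
#include <stdlib.h>
#include <string.h>

/*
 * Hypothetical helper: parse a hexadecimal error id such as
 * "0x0000000000000005". Returns true and stores the value on success.
 * Rejects NULL or empty input, input with no hex digits, trailing
 * garbage, and values that overflow unsigned long long.
 */
static bool parse_error_id(const char *arg, uint64_t *out)
{
	char *end = NULL;
	unsigned long long val;

	if (!arg || !*arg)
		return false;

	errno = 0;
	val = strtoull(arg, &end, 16);
	if (errno == ERANGE || end == arg || *end != '\0')
		return false;

	*out = val;
	return true;
}

/*
 * Hypothetical helper: return the last component of a path such as
 * "/dev/dri/card1". The result points into the caller's string, so no
 * temporary copy has to be allocated or freed.
 */
static const char *card_basename(const char *path)
{
	const char *slash = strrchr(path, '/');

	return slash ? slash + 1 : path;
}
```

With helpers along these lines, the OPT_CONFIG case could call parse_error_id(optarg, &error_config_id), and the family name could be taken from strdup(card_basename(card.card)) without the intermediate dup buffer.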
* Re: [RFC i-g-t v3 0/1] A tool to demonstrate use of netlink sockets to read RAS error counters
2025-07-30 6:13 [RFC i-g-t v3 0/1] A tool to demonstrate use of netlink sockets to read RAS error counters Aravind Iddamsetty
2025-07-30 6:13 ` [RFC i-g-t v3 1/1] tools/RAS: A tool to read " Aravind Iddamsetty
@ 2025-07-30 20:05 ` Rodrigo Vivi
2025-07-30 21:50 ` Deucher, Alexander
2025-08-13 12:42 ` Kamil Konieczny
2 siblings, 1 reply; 7+ messages in thread
From: Rodrigo Vivi @ 2025-07-30 20:05 UTC (permalink / raw)
To: Aravind Iddamsetty, Alex Deucher
Cc: igt-dev, Alex Deucher, Simona Vetter, David Airlie, Joonas Lahtinen,
Hawking Zhang, Lijo Lazar, Riana Tauro, Anshuman Gupta

On Wed, Jul 30, 2025 at 11:43:41AM +0530, Aravind Iddamsetty wrote:
> This tool is to demonstrate the use of netlink sockets to read RAS error
> counters, which is being proposed via series

Alex, what tools are in use for RAS on the AMD side?
I noticed something in the mesa repo recently, but perhaps you have other
high-level tools as well?

I'm wondering whether we should try to consolidate on this tool here in IGT
as the official one to drive the RAS netlink APIs in a unified way, besides
converting any other tools to this API, of course.

> "[RFC v5 0/5] Proposal to use netlink for RAS and Telemetry across drm subsystem".
>
> v2: update uapi header.
> v3: Add DRM_RAS_CMD_READ_BLOCK command to read errors from an IP Block.
>
> The tool supports the following commands:
> READ_ONE, READ_BLOCK, READ_ALL, WAIT_ON_EVENT, LIST_ERRORS
>
> [...]
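A note on reading the config-id column in the cover letter's listings: every gt1 id equals the matching gt0 id with 0x1 in the most significant hex digit (for example 0x0000000000000005 and 0x1000000000000005 for correctable-eu-grf). The sketch below splits an id on that basis. The layout (GT index in bits 63:60, error index in the remaining bits) is only inferred from that output, and the macro and function names are hypothetical; the authoritative definition would be in the uapi header of the kernel series, which is not reproduced here.

```c
#include <stdint.h>

/* Assumption inferred from the listing: GT index sits in bits 63:60. */
#define RAS_GT_SHIFT	60				/* hypothetical name */
#define RAS_ERROR_MASK	((UINT64_C(1) << RAS_GT_SHIFT) - 1)	/* hypothetical name */

/* Extract the GT index from a config id. */
static unsigned int ras_config_gt(uint64_t config)
{
	return (unsigned int)(config >> RAS_GT_SHIFT);
}

/* Extract the per-GT error index from a config id. */
static uint64_t ras_config_error(uint64_t config)
{
	return config & RAS_ERROR_MASK;
}

/* Build a config id from a GT index and an error index. */
static uint64_t ras_config_make(unsigned int gt, uint64_t error)
{
	return ((uint64_t)gt << RAS_GT_SHIFT) | (error & RAS_ERROR_MASK);
}
```

Under that assumption, ras_config_make(1, 0x5) yields the id listed for error-gt1-correctable-eu-grf, which could then be passed to READ_ONE via --error_id.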
* RE: [RFC i-g-t v3 0/1] A tool to demonstrate use of netlink sockets to read RAS error counters
2025-07-30 20:05 ` [RFC i-g-t v3 0/1] A tool to demonstrate use of netlink sockets to read RAS " Rodrigo Vivi
@ 2025-07-30 21:50 ` Deucher, Alexander
0 siblings, 0 replies; 7+ messages in thread
From: Deucher, Alexander @ 2025-07-30 21:50 UTC (permalink / raw)
To: Rodrigo Vivi, Aravind Iddamsetty, Kasiviswanathan, Harish
Cc: igt-dev@lists.freedesktop.org, Simona Vetter, David Airlie,
Joonas Lahtinen, Zhang, Hawking, Lazar, Lijo, Riana Tauro,
Anshuman Gupta

[Public]

> -----Original Message-----
> From: Rodrigo Vivi <rodrigo.vivi@intel.com>
> Sent: Wednesday, July 30, 2025 4:06 PM
> To: Aravind Iddamsetty <aravind.iddamsetty@linux.intel.com>; Deucher, Alexander
> <Alexander.Deucher@amd.com>
> Cc: igt-dev@lists.freedesktop.org; Deucher, Alexander
> <Alexander.Deucher@amd.com>; Simona Vetter <simona@ffwll.ch>; David Airlie
> <airlied@gmail.com>; Joonas Lahtinen <joonas.lahtinen@linux.intel.com>; Zhang,
> Hawking <Hawking.Zhang@amd.com>; Lazar, Lijo <Lijo.Lazar@amd.com>; Riana
> Tauro <riana.tauro@intel.com>; Anshuman Gupta <anshuman.gupta@intel.com>
> Subject: Re: [RFC i-g-t v3 0/1] A tool to demonstrate use of netlink sockets to read
> RAS error counters
>
> On Wed, Jul 30, 2025 at 11:43:41AM +0530, Aravind Iddamsetty wrote:
> > This tool is to demonstrate the use of netlink sockets to read RAS
> > error counters, which is being proposed via series
>
> Alex, what tools are in use for RAS on the AMD side?
> I noticed something in the mesa repo recently, but perhaps you have other
> high-level tools as well?
>
> I'm wondering whether we should try to consolidate on this tool here in IGT
> as the official one to drive the RAS netlink APIs in a unified way, besides
> converting any other tools to this API, of course.

+ Harish

Hawking, Harish, and Lijo can probably provide better feedback since they have
been closer to the RAS work more recently.
That said, we have amdsmi and rdc which are used to handle RAS stuff among other things:
https://github.com/ROCm/amdsmi
https://github.com/ROCm/rdc

We have some documentation on our kernel interface as well:
https://docs.kernel.org/gpu/amdgpu/ras.html

Alex

> > "[RFC v5 0/5] Proposal to use netlink for RAS and Telemetry across drm
> > subsystem".
> >
> > v2: update uapi header.
> > v3: Add DRM_RAS_CMD_READ_BLOCK command to read errors from an IP Block.
> >
> > The tool supports the following commands:
> > READ_ONE, READ_BLOCK, READ_ALL, WAIT_ON_EVENT, LIST_ERRORS
> >
> > [...]
* Re: [RFC i-g-t v3 0/1] A tool to demonstrate use of netlink sockets to read RAS error counters
2025-07-30 6:13 [RFC i-g-t v3 0/1] A tool to demonstrate use of netlink sockets to read RAS error counters Aravind Iddamsetty
2025-07-30 6:13 ` [RFC i-g-t v3 1/1] tools/RAS: A tool to read " Aravind Iddamsetty
2025-07-30 20:05 ` [RFC i-g-t v3 0/1] A tool to demonstrate use of netlink sockets to read RAS " Rodrigo Vivi
@ 2025-08-13 12:42 ` Kamil Konieczny
2 siblings, 0 replies; 7+ messages in thread
From: Kamil Konieczny @ 2025-08-13 12:42 UTC (permalink / raw)
To: Aravind Iddamsetty
Cc: igt-dev, Alex Deucher, Simona Vetter, David Airlie,
Joonas Lahtinen, Rodrigo Vivi, Hawking Zhang, Lijo Lazar,
Riana Tauro, Anshuman Gupta
Hi Aravind,
On 2025-07-30 at 11:43:41 +0530, Aravind Iddamsetty wrote:
> This tool is to demonstrate the use of netlink sockets to read RAS error
> counters, which is being proposed via series
> "[RFC v5 0/5] Proposal to use netlink for RAS and Telemetry across drm subsystem".
>
> v2: update uapi header.
> v3: Add DRM_RAS_CMD_READ_BLOCK command to read errors from an IP Block.
>
> The tool supports the following commands:
> READ_ONE, READ_BLOCK, READ_ALL, WAIT_ON_EVENT, LIST_ERRORS
>
> read single error counter:
>
> $ ./drm_ras READ_ONE --device=drm:/dev/dri/card1 --error_id=0x0000000000000005
> counter value 0
>
> read all error counters:
>
> $ ./drm_ras READ_ALL --device=drm:/dev/dri/card1
> name config-id counter
> [cut]
Please add conditional compilation in meson depending on presence of
libnl-dev files (on Ubuntu - libnl-3-dev and/or libnl-cli-3-dev)
Regards,
Kamil
>
> Cc: Alex Deucher <alexander.deucher@amd.com>
> Cc: Simona Vetter <simona@ffwll.ch>
> Cc: David Airlie <airlied@gmail.com>
> Cc: Joonas Lahtinen <joonas.lahtinen@linux.intel.com>
> Cc: Rodrigo Vivi <rodrigo.vivi@intel.com>
> Cc: Hawking Zhang <Hawking.Zhang@amd.com>
> Cc: Lijo Lazar <lijo.lazar@amd.com>
> Cc: Riana Tauro <riana.tauro@intel.com>
> Cc: Anshuman Gupta <anshuman.gupta@intel.com>
>
>
> Aravind Iddamsetty (1):
> tools/RAS: A tool to read error counters
>
> include/drm-uapi/drm_netlink.h | 105 ++++++++
> meson.build | 4 +
> tools/drm_ras.c | 428 +++++++++++++++++++++++++++++++++
> tools/meson.build | 5 +
> 4 files changed, 542 insertions(+)
> create mode 100644 include/drm-uapi/drm_netlink.h
> create mode 100644 tools/drm_ras.c
>
> --
> 2.25.1
>
^ permalink raw reply [flat|nested] 7+ messages in thread
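Editorial sketch of the conditional build requested in the review above. This is only an illustration under stated assumptions, not the change from the patch: the pkg-config names `libnl-3.0` and `libnl-genl-3.0` are the ones libnl normally installs, the tool's use of generic netlink is assumed, and the variable names and the bare `executable()` call are hypothetical rather than taken from IGT's existing `tools/meson.build`.

```meson
# Sketch only: look the libraries up without failing the build when absent.
# 'libnl' and 'libnl_genl' are illustrative variable names.
libnl = dependency('libnl-3.0', required : false)
libnl_genl = dependency('libnl-genl-3.0', required : false)

# Build the tool only when both development packages were found.
if libnl.found() and libnl_genl.found()
	executable('drm_ras', 'drm_ras.c',
		   dependencies : [libnl, libnl_genl],
		   install : true)
else
	message('libnl not found, skipping drm_ras')
endif
```

Since the diffstat touches both `meson.build` and `tools/meson.build`, the lookups might sit in the top-level file and the `if` block in the tools one; that split is a guess. With `required : false`, a system without the libnl development files still configures and simply skips the tool.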
end of thread, other threads:[~2025-08-15 22:13 UTC | newest]
Thread overview: 7+ messages
2025-07-30 6:13 [RFC i-g-t v3 0/1] A tool to demonstrate use of netlink sockets to read RAS error counters Aravind Iddamsetty
2025-07-30 6:13 ` [RFC i-g-t v3 1/1] tools/RAS: A tool to read " Aravind Iddamsetty
2025-08-13 12:46 ` Kamil Konieczny
2025-08-15 22:13 ` Rodrigo Vivi
2025-07-30 20:05 ` [RFC i-g-t v3 0/1] A tool to demonstrate use of netlink sockets to read RAS " Rodrigo Vivi
2025-07-30 21:50 ` Deucher, Alexander
2025-08-13 12:42 ` Kamil Konieczny