intel-xe.lists.freedesktop.org archive mirror
* [RFC v5 0/5] Proposal to use netlink for RAS and Telemetry across drm subsystem
@ 2025-07-30  6:49 Aravind Iddamsetty
  2025-07-30  6:49 ` [RFC v5 1/5] drm/netlink: Add netlink infrastructure Aravind Iddamsetty
                   ` (6 more replies)
  0 siblings, 7 replies; 24+ messages in thread
From: Aravind Iddamsetty @ 2025-07-30  6:49 UTC (permalink / raw)
  To: intel-xe, dri-devel
  Cc: Alex Deucher, David Airlie, Simona Vetter, Joonas Lahtinen,
	Rodrigo Vivi, Hawking Zhang, Lijo Lazar, Ruhl, Michael J,
	Riana Tauro, Anshuman Gupta

Revisiting this patch series to address pending feedback and help move
the discussion towards a conclusion. This revision includes updates
based on previous comments[1] and aims to clarify outstanding concerns.
Specifically, it adds a command to facilitate reporting errors from IP
blocks, to support the AMDGPU driver's model of RAS.
[1]: https://lore.kernel.org/all/4cbdfcc5-5020-a942-740e-a602d4c00cc2@linux.intel.com/

I sincerely appreciate everyone's patience and thoughtful reviews so
far, and I hope this refreshed series facilitates the final evaluation
and acceptance.

Please feel free to share any further suggestions or questions.

Thank you for your continued consideration.
----------------------------------------------------------------------

Our hardware supports RAS (Reliability, Availability, Serviceability) by
reporting errors to the host. The KMD processes them and exposes a set
of error counters that observability tools can use to take corrective
action or trigger repairs. Traditionally, these were exposed via PMU
(for relative counters) and a sysfs interface (for absolute values) in
our internal branch. However, that approach needs two interfaces and
supports neither event-based reporting nor configurability, so the
community suggested trying netlink as a drm-subsystem-wide UAPI for RAS
and telemetry, as discussed in [2].

This series is inspired by [2]. It uses the generic netlink (genl)
family subsystem and exposes a set of commands that every drm driver can
use; the framework also provides a means to add custom commands. Each
drm driver instance (in this example, an xe driver instance) registers a
family and operations with the genl subsystem, through which it
enumerates and reports the error counters. Event-based notification is
also supported: userspace can subscribe and be notified when an error
occurs, then read the error counter, which avoids continuous polling of
the counters. This can also be extended to threshold-based notification.
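
As a rough illustration of the threshold-based extension mentioned
above, the sketch below decides when an error should trigger a
notification. It is not part of this series: struct ras_counter,
ras_counter_record() and the "notify every N new errors" policy are all
hypothetical names and choices made up for this example.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical per-counter state; nothing like this exists in the series. */
struct ras_counter {
	uint64_t value;         /* absolute error count */
	uint64_t threshold;     /* notify after this many new errors; 0 = every error */
	uint64_t last_notified; /* counter value at the last notification */
};

/*
 * Record one error and report whether userspace should be notified.
 * With threshold == 0 this degenerates to plain event-based reporting.
 */
static bool ras_counter_record(struct ras_counter *c)
{
	assert(c != NULL);

	c->value++;
	if (c->threshold == 0 ||
	    c->value - c->last_notified >= c->threshold) {
		c->last_notified = c->value;
		return true;
	}

	return false;
}
```

In a driver, a true return value would be the point at which the
multicast event is sent; the counter itself stays readable at any time
through the read commands.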

[2]: https://airlied.blogspot.com/2022/09/accelerators-bof-outcomes-summary.html

This series is based on top of series [3], which introduces error counting
infrastructure in the Xe driver.
[3]: https://lore.kernel.org/all/20250730054814.1376770-1-aravind.iddamsetty@linux.intel.com/

V5:
Add support to read errors corresponding to an IP block

v4:
1. Rebase
2. rename drm_genl_send to drm_genl_reply
3. catch error from xa_store and handle appropriately
4. presently xe_list_errors fills blank data for IGFX, prevent it by
having an early check of IS_DGFX (Michael J. Ruhl)

v3:
1. Rebase on latest RAS series for XE
2. drop DRIVER_NETLINK flag and use the driver_genl_ops structure to
register to netlink subsystem

v2: define common interfaces to genl netlink subsystem that all drm drivers
can leverage.

Below is an example tool drm_ras which demonstrates the use of the
supported commands. The tool will be sent to ML with the subject
"[RFC i-g-t v3 0/1] A tool to demonstrate use of netlink sockets to read RAS error counters"
https://lore.kernel.org/all/20250730061342.1380217-2-aravind.iddamsetty@linux.intel.com/

read single error counter:

$ ./drm_ras READ_ONE --device=drm:/dev/dri/card1 --error_id=0x0000000000000005
counter value 0
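
For readers unfamiliar with generic netlink, here is a minimal sketch of
how a READ_ONE request like the one above might be laid out on the wire:
a netlink header, a genl header, and one u64 attribute carrying the
error id (patch 1/5 declares that attribute as NLA_U64 and looks the
device up by nlmsg_type). The EX_* command and attribute numbers and the
version value are placeholders, not the values from
include/uapi/drm/drm_netlink.h, and ex_build_read_one() is a
hypothetical helper, not code from the drm_ras tool.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>
#include <linux/genetlink.h>
#include <linux/netlink.h>

/* Placeholder numbers; the real ones live in the series' uapi header. */
#define EX_DRM_RAS_CMD_READ_ONE		2
#define EX_DRM_RAS_ATTR_ERROR_ID	1

/*
 * Fill buf with a genl request carrying a single u64 attribute.
 * family_id is the id resolved for the per-device genl family.
 * buf must be at least 4-byte aligned.
 * Returns the total message length, or 0 if buf is too small.
 */
static size_t ex_build_read_one(void *buf, size_t buf_len,
				uint16_t family_id, uint64_t error_id)
{
	size_t attr_len = NLA_HDRLEN + sizeof(error_id);
	size_t total = NLMSG_HDRLEN + GENL_HDRLEN + NLA_ALIGN(attr_len);
	struct nlmsghdr *nlh = buf;
	struct genlmsghdr *gh;
	struct nlattr *nla;

	assert(buf != NULL);
	if (buf_len < total)
		return 0;

	memset(buf, 0, total);
	nlh->nlmsg_len = total;
	nlh->nlmsg_type = family_id;	/* selects the genl family */
	nlh->nlmsg_flags = NLM_F_REQUEST;

	gh = (struct genlmsghdr *)((char *)buf + NLMSG_HDRLEN);
	gh->cmd = EX_DRM_RAS_CMD_READ_ONE;
	gh->version = 1;		/* placeholder family version */

	nla = (struct nlattr *)((char *)gh + GENL_HDRLEN);
	nla->nla_type = EX_DRM_RAS_ATTR_ERROR_ID;
	nla->nla_len = attr_len;	/* header plus payload, unpadded */
	memcpy((char *)nla + NLA_HDRLEN, &error_id, sizeof(error_id));

	return total;
}
```

A real client would first resolve the family id through the genl
controller and then send this buffer on a NETLINK_GENERIC socket; a
library such as libnl hides most of this layout work.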

read all error counters:

$ ./drm_ras READ_ALL --device=drm:/dev/dri/card1
name                                                    config-id               counter

error-gt0-correctable-guc                               0x0000000000000001      0
error-gt0-correctable-slm                               0x0000000000000003      0
error-gt0-correctable-eu-ic                             0x0000000000000004      0
error-gt0-correctable-eu-grf                            0x0000000000000005      0
error-gt0-fatal-guc                                     0x0000000000000009      0
error-gt0-fatal-slm                                     0x000000000000000d      0
error-gt0-fatal-eu-grf                                  0x000000000000000f      0
error-gt0-fatal-fpu                                     0x0000000000000010      0
error-gt0-fatal-tlb                                     0x0000000000000011      0
error-gt0-fatal-l3-fabric                               0x0000000000000012      0
error-gt0-correctable-subslice                          0x0000000000000013      0
error-gt0-correctable-l3bank                            0x0000000000000014      0
error-gt0-fatal-subslice                                0x0000000000000015      0
error-gt0-fatal-l3bank                                  0x0000000000000016      0
error-gt0-sgunit-correctable                            0x0000000000000017      0
error-gt0-sgunit-nonfatal                               0x0000000000000018      0
error-gt0-sgunit-fatal                                  0x0000000000000019      0
error-gt0-soc-fatal-psf-csc-0                           0x000000000000001a      0
error-gt0-soc-fatal-psf-csc-1                           0x000000000000001b      0
error-gt0-soc-fatal-psf-csc-2                           0x000000000000001c      0
error-gt0-soc-fatal-punit                               0x000000000000001d      0
error-gt0-soc-fatal-psf-0                               0x000000000000001e      0
error-gt0-soc-fatal-psf-1                               0x000000000000001f      0
error-gt0-soc-fatal-psf-2                               0x0000000000000020      0
error-gt0-soc-fatal-cd0                                 0x0000000000000021      0
error-gt0-soc-fatal-cd0-mdfi                            0x0000000000000022      0
error-gt0-soc-fatal-mdfi-east                           0x0000000000000023      0
error-gt0-soc-fatal-mdfi-south                          0x0000000000000024      0
error-gt0-soc-fatal-hbm-ss0-0                           0x0000000000000025      0
error-gt0-soc-fatal-hbm-ss0-1                           0x0000000000000026      0
error-gt0-soc-fatal-hbm-ss0-2                           0x0000000000000027      0
error-gt0-soc-fatal-hbm-ss0-3                           0x0000000000000028      0
error-gt0-soc-fatal-hbm-ss0-4                           0x0000000000000029      0
error-gt0-soc-fatal-hbm-ss0-5                           0x000000000000002a      0
error-gt0-soc-fatal-hbm-ss0-6                           0x000000000000002b      0
error-gt0-soc-fatal-hbm-ss0-7                           0x000000000000002c      0
error-gt0-soc-fatal-hbm-ss1-0                           0x000000000000002d      0
error-gt0-soc-fatal-hbm-ss1-1                           0x000000000000002e      0
error-gt0-soc-fatal-hbm-ss1-2                           0x000000000000002f      0
error-gt0-soc-fatal-hbm-ss1-3                           0x0000000000000030      0
error-gt0-soc-fatal-hbm-ss1-4                           0x0000000000000031      0
error-gt0-soc-fatal-hbm-ss1-5                           0x0000000000000032      0
error-gt0-soc-fatal-hbm-ss1-6                           0x0000000000000033      0
error-gt0-soc-fatal-hbm-ss1-7                           0x0000000000000034      0
error-gt0-soc-fatal-hbm-ss2-0                           0x0000000000000035      0
error-gt0-soc-fatal-hbm-ss2-1                           0x0000000000000036      0
error-gt0-soc-fatal-hbm-ss2-2                           0x0000000000000037      0
error-gt0-soc-fatal-hbm-ss2-3                           0x0000000000000038      0
error-gt0-soc-fatal-hbm-ss2-4                           0x0000000000000039      0
error-gt0-soc-fatal-hbm-ss2-5                           0x000000000000003a      0
error-gt0-soc-fatal-hbm-ss2-6                           0x000000000000003b      0
error-gt0-soc-fatal-hbm-ss2-7                           0x000000000000003c      0
error-gt0-soc-fatal-hbm-ss3-0                           0x000000000000003d      0
error-gt0-soc-fatal-hbm-ss3-1                           0x000000000000003e      0
error-gt0-soc-fatal-hbm-ss3-2                           0x000000000000003f      0
error-gt0-soc-fatal-hbm-ss3-3                           0x0000000000000040      0
error-gt0-soc-fatal-hbm-ss3-4                           0x0000000000000041      0
error-gt0-soc-fatal-hbm-ss3-5                           0x0000000000000042      0
error-gt0-soc-fatal-hbm-ss3-6                           0x0000000000000043      0
error-gt0-soc-fatal-hbm-ss3-7                           0x0000000000000044      0
error-gt0-gsc-correctable-sram-ecc                      0x0000000000000045      0
error-gt0-gsc-nonfatal-mia-shutdown                     0x0000000000000046      0
error-gt0-gsc-nonfatal-mia-int                          0x0000000000000047      0
error-gt0-gsc-nonfatal-sram-ecc                         0x0000000000000048      0
error-gt0-gsc-nonfatal-wdg-timeout                      0x0000000000000049      0
error-gt0-gsc-nonfatal-rom-parity                       0x000000000000004a      0
error-gt0-gsc-nonfatal-ucode-parity                     0x000000000000004b      0
error-gt0-gsc-nonfatal-glitch-det                       0x000000000000004c      0
error-gt0-gsc-nonfatal-fuse-pull                        0x000000000000004d      0
error-gt0-gsc-nonfatal-fuse-crc-check                   0x000000000000004e      0
error-gt0-gsc-nonfatal-selfmbist                        0x000000000000004f      0
error-gt0-gsc-nonfatal-aon-parity                       0x0000000000000050      0
error-gt1-correctable-guc                               0x1000000000000001      0
error-gt1-correctable-slm                               0x1000000000000003      0
error-gt1-correctable-eu-ic                             0x1000000000000004      0
error-gt1-correctable-eu-grf                            0x1000000000000005      0
error-gt1-fatal-guc                                     0x1000000000000009      0
error-gt1-fatal-slm                                     0x100000000000000d      0
error-gt1-fatal-eu-grf                                  0x100000000000000f      0
error-gt1-fatal-fpu                                     0x1000000000000010      0
error-gt1-fatal-tlb                                     0x1000000000000011      0
error-gt1-fatal-l3-fabric                               0x1000000000000012      0
error-gt1-correctable-subslice                          0x1000000000000013      0
error-gt1-correctable-l3bank                            0x1000000000000014      0
error-gt1-fatal-subslice                                0x1000000000000015      0
error-gt1-fatal-l3bank                                  0x1000000000000016      0
error-gt1-sgunit-correctable                            0x1000000000000017      0
error-gt1-sgunit-nonfatal                               0x1000000000000018      0
error-gt1-sgunit-fatal                                  0x1000000000000019      0
error-gt1-soc-fatal-psf-csc-0                           0x100000000000001a      0
error-gt1-soc-fatal-psf-csc-1                           0x100000000000001b      0
error-gt1-soc-fatal-psf-csc-2                           0x100000000000001c      0
error-gt1-soc-fatal-punit                               0x100000000000001d      0
error-gt1-soc-fatal-psf-0                               0x100000000000001e      0
error-gt1-soc-fatal-psf-1                               0x100000000000001f      0
error-gt1-soc-fatal-psf-2                               0x1000000000000020      0
error-gt1-soc-fatal-cd0                                 0x1000000000000021      0
error-gt1-soc-fatal-cd0-mdfi                            0x1000000000000022      0
error-gt1-soc-fatal-mdfi-east                           0x1000000000000023      0
error-gt1-soc-fatal-mdfi-south                          0x1000000000000024      0
error-gt1-soc-fatal-hbm-ss0-0                           0x1000000000000025      0
error-gt1-soc-fatal-hbm-ss0-1                           0x1000000000000026      0
error-gt1-soc-fatal-hbm-ss0-2                           0x1000000000000027      0
error-gt1-soc-fatal-hbm-ss0-3                           0x1000000000000028      0
error-gt1-soc-fatal-hbm-ss0-4                           0x1000000000000029      0
error-gt1-soc-fatal-hbm-ss0-5                           0x100000000000002a      0
error-gt1-soc-fatal-hbm-ss0-6                           0x100000000000002b      0
error-gt1-soc-fatal-hbm-ss0-7                           0x100000000000002c      0
error-gt1-soc-fatal-hbm-ss1-0                           0x100000000000002d      0
error-gt1-soc-fatal-hbm-ss1-1                           0x100000000000002e      0
error-gt1-soc-fatal-hbm-ss1-2                           0x100000000000002f      0
error-gt1-soc-fatal-hbm-ss1-3                           0x1000000000000030      0
error-gt1-soc-fatal-hbm-ss1-4                           0x1000000000000031      0
error-gt1-soc-fatal-hbm-ss1-5                           0x1000000000000032      0
error-gt1-soc-fatal-hbm-ss1-6                           0x1000000000000033      0
error-gt1-soc-fatal-hbm-ss1-7                           0x1000000000000034      0
error-gt1-soc-fatal-hbm-ss2-0                           0x1000000000000035      0
error-gt1-soc-fatal-hbm-ss2-1                           0x1000000000000036      0
error-gt1-soc-fatal-hbm-ss2-2                           0x1000000000000037      0
error-gt1-soc-fatal-hbm-ss2-3                           0x1000000000000038      0
error-gt1-soc-fatal-hbm-ss2-4                           0x1000000000000039      0
error-gt1-soc-fatal-hbm-ss2-5                           0x100000000000003a      0
error-gt1-soc-fatal-hbm-ss2-6                           0x100000000000003b      0
error-gt1-soc-fatal-hbm-ss2-7                           0x100000000000003c      0
error-gt1-soc-fatal-hbm-ss3-0                           0x100000000000003d      0
error-gt1-soc-fatal-hbm-ss3-1                           0x100000000000003e      0
error-gt1-soc-fatal-hbm-ss3-2                           0x100000000000003f      0
error-gt1-soc-fatal-hbm-ss3-3                           0x1000000000000040      0
error-gt1-soc-fatal-hbm-ss3-4                           0x1000000000000041      0
error-gt1-soc-fatal-hbm-ss3-5                           0x1000000000000042      0
error-gt1-soc-fatal-hbm-ss3-6                           0x1000000000000043      0
error-gt1-soc-fatal-hbm-ss3-7                           0x1000000000000044      0
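
Judging only from the listing above, the config-id appears to carry the
GT index in its topmost bits and a per-GT error index in the low bits
(gt1 entries differ from gt0 entries by 0x1000000000000000). The sketch
below splits an id on that assumption. The 4-bit GT field, the EX_*
macros and the helper names are guesses made for illustration; the
authoritative layout is whatever the xe uapi header in this series
defines.

```c
#include <assert.h>
#include <stdint.h>

/* Assumed layout, inferred from the sample output: bits 63:60 = GT. */
#define EX_GT_SHIFT	60
#define EX_ERR_MASK	((UINT64_C(1) << EX_GT_SHIFT) - 1)

struct ex_error_id {
	unsigned int gt;	/* GT the counter belongs to */
	uint64_t index;		/* error index within that GT */
};

/* Split a config-id into GT and per-GT error index. */
static struct ex_error_id ex_decode_config_id(uint64_t config_id)
{
	struct ex_error_id id = {
		.gt = (unsigned int)(config_id >> EX_GT_SHIFT),
		.index = config_id & EX_ERR_MASK,
	};

	return id;
}

/* Inverse of ex_decode_config_id(), under the same assumed layout. */
static uint64_t ex_encode_config_id(unsigned int gt, uint64_t index)
{
	assert(gt < 16 && index <= EX_ERR_MASK);

	return ((uint64_t)gt << EX_GT_SHIFT) | index;
}
```

Under this assumption, 0x1000000000000005 decodes to GT 1, index 5,
which matches the error-gt1-correctable-eu-grf row above.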

wait on an error event:

$ ./drm_ras WAIT_ON_EVENT --device=drm:/dev/dri/card1
waiting for error event
error event received
counter value 0

list all errors:

$ ./drm_ras LIST_ERRORS --device=drm:/dev/dri/card1
name                                                    config-id

error-gt0-correctable-guc                               0x0000000000000001
error-gt0-correctable-slm                               0x0000000000000003
error-gt0-correctable-eu-ic                             0x0000000000000004
error-gt0-correctable-eu-grf                            0x0000000000000005
error-gt0-fatal-guc                                     0x0000000000000009
error-gt0-fatal-slm                                     0x000000000000000d
error-gt0-fatal-eu-grf                                  0x000000000000000f
error-gt0-fatal-fpu                                     0x0000000000000010
error-gt0-fatal-tlb                                     0x0000000000000011
error-gt0-fatal-l3-fabric                               0x0000000000000012
error-gt0-correctable-subslice                          0x0000000000000013
error-gt0-correctable-l3bank                            0x0000000000000014
error-gt0-fatal-subslice                                0x0000000000000015
error-gt0-fatal-l3bank                                  0x0000000000000016
error-gt0-sgunit-correctable                            0x0000000000000017
error-gt0-sgunit-nonfatal                               0x0000000000000018
error-gt0-sgunit-fatal                                  0x0000000000000019
error-gt0-soc-fatal-psf-csc-0                           0x000000000000001a
error-gt0-soc-fatal-psf-csc-1                           0x000000000000001b
error-gt0-soc-fatal-psf-csc-2                           0x000000000000001c
error-gt0-soc-fatal-punit                               0x000000000000001d
error-gt0-soc-fatal-psf-0                               0x000000000000001e
error-gt0-soc-fatal-psf-1                               0x000000000000001f
error-gt0-soc-fatal-psf-2                               0x0000000000000020
error-gt0-soc-fatal-cd0                                 0x0000000000000021
error-gt0-soc-fatal-cd0-mdfi                            0x0000000000000022
error-gt0-soc-fatal-mdfi-east                           0x0000000000000023
error-gt0-soc-fatal-mdfi-south                          0x0000000000000024
error-gt0-soc-fatal-hbm-ss0-0                           0x0000000000000025
error-gt0-soc-fatal-hbm-ss0-1                           0x0000000000000026
error-gt0-soc-fatal-hbm-ss0-2                           0x0000000000000027
error-gt0-soc-fatal-hbm-ss0-3                           0x0000000000000028
error-gt0-soc-fatal-hbm-ss0-4                           0x0000000000000029
error-gt0-soc-fatal-hbm-ss0-5                           0x000000000000002a
error-gt0-soc-fatal-hbm-ss0-6                           0x000000000000002b
error-gt0-soc-fatal-hbm-ss0-7                           0x000000000000002c
error-gt0-soc-fatal-hbm-ss1-0                           0x000000000000002d
error-gt0-soc-fatal-hbm-ss1-1                           0x000000000000002e
error-gt0-soc-fatal-hbm-ss1-2                           0x000000000000002f
error-gt0-soc-fatal-hbm-ss1-3                           0x0000000000000030
error-gt0-soc-fatal-hbm-ss1-4                           0x0000000000000031
error-gt0-soc-fatal-hbm-ss1-5                           0x0000000000000032
error-gt0-soc-fatal-hbm-ss1-6                           0x0000000000000033
error-gt0-soc-fatal-hbm-ss1-7                           0x0000000000000034
error-gt0-soc-fatal-hbm-ss2-0                           0x0000000000000035
error-gt0-soc-fatal-hbm-ss2-1                           0x0000000000000036
error-gt0-soc-fatal-hbm-ss2-2                           0x0000000000000037
error-gt0-soc-fatal-hbm-ss2-3                           0x0000000000000038
error-gt0-soc-fatal-hbm-ss2-4                           0x0000000000000039
error-gt0-soc-fatal-hbm-ss2-5                           0x000000000000003a
error-gt0-soc-fatal-hbm-ss2-6                           0x000000000000003b
error-gt0-soc-fatal-hbm-ss2-7                           0x000000000000003c
error-gt0-soc-fatal-hbm-ss3-0                           0x000000000000003d
error-gt0-soc-fatal-hbm-ss3-1                           0x000000000000003e
error-gt0-soc-fatal-hbm-ss3-2                           0x000000000000003f
error-gt0-soc-fatal-hbm-ss3-3                           0x0000000000000040
error-gt0-soc-fatal-hbm-ss3-4                           0x0000000000000041
error-gt0-soc-fatal-hbm-ss3-5                           0x0000000000000042
error-gt0-soc-fatal-hbm-ss3-6                           0x0000000000000043
error-gt0-soc-fatal-hbm-ss3-7                           0x0000000000000044
error-gt0-gsc-correctable-sram-ecc                      0x0000000000000045
error-gt0-gsc-nonfatal-mia-shutdown                     0x0000000000000046
error-gt0-gsc-nonfatal-mia-int                          0x0000000000000047
error-gt0-gsc-nonfatal-sram-ecc                         0x0000000000000048
error-gt0-gsc-nonfatal-wdg-timeout                      0x0000000000000049
error-gt0-gsc-nonfatal-rom-parity                       0x000000000000004a
error-gt0-gsc-nonfatal-ucode-parity                     0x000000000000004b
error-gt0-gsc-nonfatal-glitch-det                       0x000000000000004c
error-gt0-gsc-nonfatal-fuse-pull                        0x000000000000004d
error-gt0-gsc-nonfatal-fuse-crc-check                   0x000000000000004e
error-gt0-gsc-nonfatal-selfmbist                        0x000000000000004f
error-gt0-gsc-nonfatal-aon-parity                       0x0000000000000050
error-gt1-correctable-guc                               0x1000000000000001
error-gt1-correctable-slm                               0x1000000000000003
error-gt1-correctable-eu-ic                             0x1000000000000004
error-gt1-correctable-eu-grf                            0x1000000000000005
error-gt1-fatal-guc                                     0x1000000000000009
error-gt1-fatal-slm                                     0x100000000000000d
error-gt1-fatal-eu-grf                                  0x100000000000000f
error-gt1-fatal-fpu                                     0x1000000000000010
error-gt1-fatal-tlb                                     0x1000000000000011
error-gt1-fatal-l3-fabric                               0x1000000000000012
error-gt1-correctable-subslice                          0x1000000000000013
error-gt1-correctable-l3bank                            0x1000000000000014
error-gt1-fatal-subslice                                0x1000000000000015
error-gt1-fatal-l3bank                                  0x1000000000000016
error-gt1-sgunit-correctable                            0x1000000000000017
error-gt1-sgunit-nonfatal                               0x1000000000000018
error-gt1-sgunit-fatal                                  0x1000000000000019
error-gt1-soc-fatal-psf-csc-0                           0x100000000000001a
error-gt1-soc-fatal-psf-csc-1                           0x100000000000001b
error-gt1-soc-fatal-psf-csc-2                           0x100000000000001c
error-gt1-soc-fatal-punit                               0x100000000000001d
error-gt1-soc-fatal-psf-0                               0x100000000000001e
error-gt1-soc-fatal-psf-1                               0x100000000000001f
error-gt1-soc-fatal-psf-2                               0x1000000000000020
error-gt1-soc-fatal-cd0                                 0x1000000000000021
error-gt1-soc-fatal-cd0-mdfi                            0x1000000000000022
error-gt1-soc-fatal-mdfi-east                           0x1000000000000023
error-gt1-soc-fatal-mdfi-south                          0x1000000000000024
error-gt1-soc-fatal-hbm-ss0-0                           0x1000000000000025
error-gt1-soc-fatal-hbm-ss0-1                           0x1000000000000026
error-gt1-soc-fatal-hbm-ss0-2                           0x1000000000000027
error-gt1-soc-fatal-hbm-ss0-3                           0x1000000000000028
error-gt1-soc-fatal-hbm-ss0-4                           0x1000000000000029
error-gt1-soc-fatal-hbm-ss0-5                           0x100000000000002a
error-gt1-soc-fatal-hbm-ss0-6                           0x100000000000002b
error-gt1-soc-fatal-hbm-ss0-7                           0x100000000000002c
error-gt1-soc-fatal-hbm-ss1-0                           0x100000000000002d
error-gt1-soc-fatal-hbm-ss1-1                           0x100000000000002e
error-gt1-soc-fatal-hbm-ss1-2                           0x100000000000002f
error-gt1-soc-fatal-hbm-ss1-3                           0x1000000000000030
error-gt1-soc-fatal-hbm-ss1-4                           0x1000000000000031
error-gt1-soc-fatal-hbm-ss1-5                           0x1000000000000032
error-gt1-soc-fatal-hbm-ss1-6                           0x1000000000000033
error-gt1-soc-fatal-hbm-ss1-7                           0x1000000000000034
error-gt1-soc-fatal-hbm-ss2-0                           0x1000000000000035
error-gt1-soc-fatal-hbm-ss2-1                           0x1000000000000036
error-gt1-soc-fatal-hbm-ss2-2                           0x1000000000000037
error-gt1-soc-fatal-hbm-ss2-3                           0x1000000000000038
error-gt1-soc-fatal-hbm-ss2-4                           0x1000000000000039
error-gt1-soc-fatal-hbm-ss2-5                           0x100000000000003a
error-gt1-soc-fatal-hbm-ss2-6                           0x100000000000003b
error-gt1-soc-fatal-hbm-ss2-7                           0x100000000000003c
error-gt1-soc-fatal-hbm-ss3-0                           0x100000000000003d
error-gt1-soc-fatal-hbm-ss3-1                           0x100000000000003e
error-gt1-soc-fatal-hbm-ss3-2                           0x100000000000003f
error-gt1-soc-fatal-hbm-ss3-3                           0x1000000000000040
error-gt1-soc-fatal-hbm-ss3-4                           0x1000000000000041
error-gt1-soc-fatal-hbm-ss3-5                           0x1000000000000042
error-gt1-soc-fatal-hbm-ss3-6                           0x1000000000000043
error-gt1-soc-fatal-hbm-ss3-7                           0x1000000000000044

Cc: Alex Deucher <alexander.deucher@amd.com>
Cc: David Airlie <airlied@gmail.com>
Cc: Simona Vetter <simona@ffwll.ch>
Cc: Joonas Lahtinen <joonas.lahtinen@linux.intel.com>
Cc: Rodrigo Vivi <rodrigo.vivi@intel.com>
Cc: Hawking Zhang <Hawking.Zhang@amd.com>
Cc: Lijo Lazar <lijo.lazar@amd.com>
Cc: Ruhl, Michael J <michael.j.ruhl@intel.com>
Cc: Riana Tauro <riana.tauro@intel.com>
Cc: Anshuman Gupta <anshuman.gupta@intel.com>


Aravind Iddamsetty (5):
  drm/netlink: Add netlink infrastructure
  drm/xe/RAS: Register netlink capability
  drm/xe/RAS: Expose the error counters
  drm/netlink: Define multicast groups
  drm/xe/RAS: send multicast event on occurrence of an error

 drivers/gpu/drm/Makefile             |   1 +
 drivers/gpu/drm/drm_drv.c            |   7 +
 drivers/gpu/drm/drm_netlink.c        | 219 +++++++++++
 drivers/gpu/drm/xe/Makefile          |   2 +
 drivers/gpu/drm/xe/xe_device.c       |   6 +
 drivers/gpu/drm/xe/xe_device_types.h |   1 +
 drivers/gpu/drm/xe/xe_hw_error.c     |  56 ++-
 drivers/gpu/drm/xe/xe_netlink.c      | 531 +++++++++++++++++++++++++++
 include/drm/drm_device.h             |  10 +
 include/drm/drm_drv.h                |   7 +
 include/drm/drm_netlink.h            |  46 +++
 include/uapi/drm/drm_netlink.h       | 105 ++++++
 include/uapi/drm/xe_drm.h            |  85 +++++
 13 files changed, 1071 insertions(+), 5 deletions(-)
 create mode 100644 drivers/gpu/drm/drm_netlink.c
 create mode 100644 drivers/gpu/drm/xe/xe_netlink.c
 create mode 100644 include/drm/drm_netlink.h
 create mode 100644 include/uapi/drm/drm_netlink.h

-- 
2.25.1



* [RFC v5 1/5] drm/netlink: Add netlink infrastructure
  2025-07-30  6:49 [RFC v5 0/5] Proposal to use netlink for RAS and Telemetry across drm subsystem Aravind Iddamsetty
@ 2025-07-30  6:49 ` Aravind Iddamsetty
  2025-08-15 17:07   ` Zack McKevitt
  2025-08-15 21:48   ` Rodrigo Vivi
  2025-07-30  6:49 ` [RFC v5 2/5] drm/xe/RAS: Register netlink capability Aravind Iddamsetty
                   ` (5 subsequent siblings)
  6 siblings, 2 replies; 24+ messages in thread
From: Aravind Iddamsetty @ 2025-07-30  6:49 UTC (permalink / raw)
  To: intel-xe, dri-devel
  Cc: Alex Deucher, David Airlie, Simona Vetter, Joonas Lahtinen,
	Rodrigo Vivi, Hawking Zhang, Lijo Lazar, Ruhl, Michael J,
	Riana Tauro, Anshuman Gupta

Define the netlink registration interface, along with commands and
attributes that can be used in common across drm drivers. This patch
intends to use the generic netlink family to expose various device
statistics. At present it defines commands for exposing RAS error counters.

v2:
define common interfaces to genl netlink subsystem that all drm drivers
can leverage.(Tomer Tayar)

v3: drop DRIVER_NETLINK flag and use the driver_genl_ops structure to
register to netlink subsystem (Daniel Vetter)

v4:(Michael J. Ruhl)
1. rename drm_genl_send to drm_genl_reply
2. catch error from xa_store and handle appropriately

v5:
1. compile only if CONFIG_NET is enabled

V6: Add support for reading IP block errors

Reviewed-by: Michael J. Ruhl <michael.j.ruhl@intel.com> #v4
Signed-off-by: Aravind Iddamsetty <aravind.iddamsetty@linux.intel.com>
---
 drivers/gpu/drm/Makefile       |   1 +
 drivers/gpu/drm/drm_drv.c      |   7 ++
 drivers/gpu/drm/drm_netlink.c  | 212 +++++++++++++++++++++++++++++++++
 include/drm/drm_device.h       |  10 ++
 include/drm/drm_drv.h          |   7 ++
 include/drm/drm_netlink.h      |  41 +++++++
 include/uapi/drm/drm_netlink.h | 101 ++++++++++++++++
 7 files changed, 379 insertions(+)
 create mode 100644 drivers/gpu/drm/drm_netlink.c
 create mode 100644 include/drm/drm_netlink.h
 create mode 100644 include/uapi/drm/drm_netlink.h

diff --git a/drivers/gpu/drm/Makefile b/drivers/gpu/drm/Makefile
index 4dafbdc8f86a..39d5183ab35c 100644
--- a/drivers/gpu/drm/Makefile
+++ b/drivers/gpu/drm/Makefile
@@ -77,6 +77,7 @@ drm-$(CONFIG_DRM_CLIENT) += \
 	drm_client.o \
 	drm_client_event.o \
 	drm_client_modeset.o
+drm-$(CONFIG_NET) += drm_netlink.o
 drm-$(CONFIG_DRM_LIB_RANDOM) += lib/drm_random.o
 drm-$(CONFIG_COMPAT) += drm_ioc32.o
 drm-$(CONFIG_DRM_PANEL) += drm_panel.o
diff --git a/drivers/gpu/drm/drm_drv.c b/drivers/gpu/drm/drm_drv.c
index 02556363e918..cce55423141c 100644
--- a/drivers/gpu/drm/drm_drv.c
+++ b/drivers/gpu/drm/drm_drv.c
@@ -1088,6 +1088,12 @@ int drm_dev_register(struct drm_device *dev, unsigned long flags)
 	if (ret)
 		goto err_minors;
 
+	if (driver->genl_ops) {
+		ret = drm_genl_register(dev);
+		if (ret)
+			goto err_minors;
+	}
+
 	ret = create_compat_control_link(dev);
 	if (ret)
 		goto err_minors;
@@ -1229,6 +1235,7 @@ static void drm_core_exit(void)
 	drm_privacy_screen_lookup_exit();
 	drm_panic_exit();
 	accel_core_exit();
+	drm_genl_exit();
 	unregister_chrdev(DRM_MAJOR, "drm");
 	debugfs_remove(drm_debugfs_root);
 	drm_sysfs_destroy();
diff --git a/drivers/gpu/drm/drm_netlink.c b/drivers/gpu/drm/drm_netlink.c
new file mode 100644
index 000000000000..da4bfde32a22
--- /dev/null
+++ b/drivers/gpu/drm/drm_netlink.c
@@ -0,0 +1,212 @@
+// SPDX-License-Identifier: MIT
+/*
+ * Copyright © 2023 Intel Corporation
+ */
+
+#include <net/genetlink.h>
+#include <uapi/drm/drm_netlink.h>
+
+#include <drm/drm_device.h>
+#include <drm/drm_drv.h>
+#include <drm/drm_file.h>
+#include <drm/drm_managed.h>
+#include <drm/drm_netlink.h>
+#include <drm/drm_print.h>
+
+DEFINE_XARRAY(drm_dev_xarray);
+
+/**
+ * drm_genl_reply - response to a request
+ * @msg: socket buffer
+ * @info: receiver information
+ * @usrhdr: pointer to user specific header in the message buffer
+ *
+ * RETURNS:
+ * 0 on success and negative error code on failure
+ */
+int drm_genl_reply(struct sk_buff *msg, struct genl_info *info, void *usrhdr)
+{
+	int ret;
+
+	genlmsg_end(msg, usrhdr);
+
+	ret = genlmsg_reply(msg, info);
+	if (ret)
+		nlmsg_free(msg);
+
+	return ret;
+}
+EXPORT_SYMBOL(drm_genl_reply);
+
+/**
+ * drm_genl_alloc_msg - allocate genl message buffer
+ * @dev: drm_device for which the message is being allocated
+ * @info: receiver information
+ * @msg_size: size of the msg buffer that needs to be allocated
+ * @usrhdr: pointer to user specific header in the message buffer
+ *
+ * RETURNS:
+ * pointer to new allocated buffer on success, NULL on failure
+ */
+struct sk_buff *
+drm_genl_alloc_msg(struct drm_device *dev,
+		   struct genl_info *info,
+		   size_t msg_size, void **usrhdr)
+{
+	struct sk_buff *new_msg;
+
+	new_msg = genlmsg_new(msg_size, GFP_KERNEL);
+	if (!new_msg)
+		return new_msg;
+
+	*usrhdr = genlmsg_put_reply(new_msg, info, dev->drm_genl_family, 0, info->genlhdr->cmd);
+	if (!*usrhdr) {
+		nlmsg_free(new_msg);
+		new_msg = NULL;
+	}
+
+	return new_msg;
+}
+EXPORT_SYMBOL(drm_genl_alloc_msg);
+
+static struct drm_device *genl_to_dev(struct genl_info *info)
+{
+	return xa_load(&drm_dev_xarray, info->nlhdr->nlmsg_type);
+}
+
+static int drm_genl_list_errors(struct sk_buff *msg, struct genl_info *info)
+{
+	struct drm_device *dev = genl_to_dev(info);
+
+	if (info->genlhdr->cmd == DRM_RAS_CMD_READ_ALL) {
+		if (GENL_REQ_ATTR_CHECK(info, DRM_RAS_ATTR_READ_ALL))
+			return -EINVAL;
+	} else {
+		if (GENL_REQ_ATTR_CHECK(info, DRM_RAS_ATTR_QUERY))
+			return -EINVAL;
+	}
+
+	if (WARN_ON(!dev->driver->genl_ops[info->genlhdr->cmd].doit))
+		return -EOPNOTSUPP;
+
+	return dev->driver->genl_ops[info->genlhdr->cmd].doit(dev, msg, info);
+}
+
+static int drm_genl_read_error(struct sk_buff *msg, struct genl_info *info)
+{
+	struct drm_device *dev = genl_to_dev(info);
+
+	if (GENL_REQ_ATTR_CHECK(info, DRM_RAS_ATTR_ERROR_ID))
+		return -EINVAL;
+
+	if (WARN_ON(!dev->driver->genl_ops[info->genlhdr->cmd].doit))
+		return -EOPNOTSUPP;
+
+	return dev->driver->genl_ops[info->genlhdr->cmd].doit(dev, msg, info);
+}
+
+/* attribute policies */
+static const struct nla_policy drm_attr_policy_query[DRM_ATTR_MAX + 1] = {
+	[DRM_RAS_ATTR_QUERY] = { .type = NLA_U8 },
+};
+
+static const struct nla_policy drm_attr_policy_read_one[DRM_ATTR_MAX + 1] = {
+	[DRM_RAS_ATTR_ERROR_ID] = { .type = NLA_U64 },
+};
+
+static const struct nla_policy drm_attr_policy_read_all[DRM_ATTR_MAX + 1] = {
+	[DRM_RAS_ATTR_READ_ALL] = { .type = NLA_U8 },
+};
+
+/* drm genl operations definition */
+const struct genl_ops drm_genl_ops[] = {
+	{
+		.cmd = DRM_RAS_CMD_QUERY,
+		.doit = drm_genl_list_errors,
+		.policy = drm_attr_policy_query,
+	},
+	{
+		.cmd = DRM_RAS_CMD_READ_ONE,
+		.doit = drm_genl_read_error,
+		.policy = drm_attr_policy_read_one,
+	},
+	{
+		.cmd = DRM_RAS_CMD_READ_ALL,
+		.doit = drm_genl_list_errors,
+		.policy = drm_attr_policy_read_all,
+	},
+	{
+		.cmd = DRM_RAS_CMD_READ_BLOCK,
+		.doit = drm_genl_read_error,
+		.policy = drm_attr_policy_read_one,
+	},
+
+};
+
+static void drm_genl_family_init(struct drm_device *dev)
+{
+	dev->drm_genl_family = drmm_kzalloc(dev, sizeof(struct genl_family),
+					    GFP_KERNEL);
+
+	/* Use the drm primary node name, e.g. card0, to name the genl family */
+	snprintf(dev->drm_genl_family->name, sizeof(dev->drm_genl_family->name),
+		 "%s", dev->primary->kdev->kobj.name);
+	dev->drm_genl_family->version = DRM_GENL_VERSION;
+	dev->drm_genl_family->parallel_ops = true;
+	dev->drm_genl_family->ops = drm_genl_ops;
+	dev->drm_genl_family->n_ops = ARRAY_SIZE(drm_genl_ops);
+	dev->drm_genl_family->maxattr = DRM_ATTR_MAX;
+	dev->drm_genl_family->module = dev->dev->driver->owner;
+}
+
+static void drm_genl_deregister(struct drm_device *dev, void *arg)
+{
+	drm_dbg_driver(dev, "unregistering genl family %s\n", dev->drm_genl_family->name);
+
+	xa_erase(&drm_dev_xarray, dev->drm_genl_family->id);
+
+	genl_unregister_family(dev->drm_genl_family);
+}
+
+/**
+ * drm_genl_register - Register genl family
+ * @dev: drm_device for which genl family needs to be registered
+ *
+ * RETURNS:
+ * 0 on success and negative error code on failure
+ */
+int drm_genl_register(struct drm_device *dev)
+{
+	int ret;
+
+	drm_genl_family_init(dev);
+
+	ret = genl_register_family(dev->drm_genl_family);
+	if (ret < 0) {
+		drm_warn(dev, "genl family registration failed\n");
+		return ret;
+	}
+
+	drm_dbg_driver(dev, "genl family id %d and name %s\n", dev->drm_genl_family->id,
+		       dev->drm_genl_family->name);
+
+	ret = xa_err(xa_store(&drm_dev_xarray, dev->drm_genl_family->id, dev, GFP_KERNEL));
+	if (ret)
+		goto genl_unregister;
+
+	ret = drmm_add_action_or_reset(dev, drm_genl_deregister, NULL);
+
+	return ret;
+
+genl_unregister:
+	genl_unregister_family(dev->drm_genl_family);
+	return ret;
+}
+
+/**
+ * drm_genl_exit: destroy drm_dev_xarray
+ */
+void drm_genl_exit(void)
+{
+	xa_destroy(&drm_dev_xarray);
+}
diff --git a/include/drm/drm_device.h b/include/drm/drm_device.h
index 08b3b2467c4c..8b60a17e4156 100644
--- a/include/drm/drm_device.h
+++ b/include/drm/drm_device.h
@@ -8,6 +8,7 @@
 #include <linux/sched.h>
 
 #include <drm/drm_mode_config.h>
+#include <drm/drm_netlink.h>
 
 struct drm_driver;
 struct drm_minor;
@@ -22,6 +23,8 @@ struct inode;
 struct pci_dev;
 struct pci_controller;
 
+struct genl_family;
+
 /*
  * Recovery methods for wedged device in order of less to more side-effects.
  * To be used with drm_dev_wedged_event() as recovery @method. Callers can
@@ -356,6 +359,13 @@ struct drm_device {
 	 * Root directory for debugfs files.
 	 */
 	struct dentry *debugfs_root;
+
+	/**
+	 * @drm_genl_family:
+	 *
+	 * Generic netlink family registration structure.
+	 */
+	struct genl_family *drm_genl_family;
 };
 
 void drm_dev_set_dma_dev(struct drm_device *dev, struct device *dma_dev);
diff --git a/include/drm/drm_drv.h b/include/drm/drm_drv.h
index 3f76a32d6b84..908888ac0db2 100644
--- a/include/drm/drm_drv.h
+++ b/include/drm/drm_drv.h
@@ -431,6 +431,13 @@ struct drm_driver {
 	 * some examples.
 	 */
 	const struct file_operations *fops;
+
+	/**
+	 * @genl_ops:
+	 *
+	 * Driver's private callbacks for genl commands
+	 */
+	const struct driver_genl_ops *genl_ops;
 };
 
 void *__devm_drm_dev_alloc(struct device *parent,
diff --git a/include/drm/drm_netlink.h b/include/drm/drm_netlink.h
new file mode 100644
index 000000000000..4a746222337a
--- /dev/null
+++ b/include/drm/drm_netlink.h
@@ -0,0 +1,41 @@
+/* SPDX-License-Identifier: MIT */
+/*
+ * Copyright © 2023 Intel Corporation
+ */
+
+#ifndef __DRM_NETLINK_H__
+#define __DRM_NETLINK_H__
+
+#include <linux/types.h>
+
+struct drm_device;
+struct genl_info;
+struct sk_buff;
+
+struct driver_genl_ops {
+	int		       (*doit)(struct drm_device *dev,
+				       struct sk_buff *skb,
+				       struct genl_info *info);
+};
+
+#if IS_ENABLED(CONFIG_NET)
+int drm_genl_register(struct drm_device *dev);
+void drm_genl_exit(void);
+int drm_genl_reply(struct sk_buff *msg, struct genl_info *info, void *usrhdr);
+struct sk_buff *
+drm_genl_alloc_msg(struct drm_device *dev,
+		   struct genl_info *info,
+		   size_t msg_size, void **usrhdr);
+#else
+static inline int drm_genl_register(struct drm_device *dev) { return 0; }
+static inline void drm_genl_exit(void) {}
+static inline int drm_genl_reply(struct sk_buff *msg,
+				 struct genl_info *info,
+				 void *usrhdr) { return 0; }
+static inline struct sk_buff *
+drm_genl_alloc_msg(struct drm_device *dev,
+		   struct genl_info *info,
+		   size_t msg_size, void **usrhdr) { return NULL; }
+#endif
+
+#endif
diff --git a/include/uapi/drm/drm_netlink.h b/include/uapi/drm/drm_netlink.h
new file mode 100644
index 000000000000..58afb6e8d84a
--- /dev/null
+++ b/include/uapi/drm/drm_netlink.h
@@ -0,0 +1,101 @@
+/* SPDX-License-Identifier: MIT */
+/*
+ * Copyright 2023 Intel Corporation
+ *
+ * Permission is hereby granted, free of charge, to any person obtaining a
+ * copy of this software and associated documentation files (the "Software"),
+ * to deal in the Software without restriction, including without limitation
+ * the rights to use, copy, modify, merge, publish, distribute, sublicense,
+ * and/or sell copies of the Software, and to permit persons to whom the
+ * Software is furnished to do so, subject to the following conditions:
+ *
+ * The above copyright notice and this permission notice (including the next
+ * paragraph) shall be included in all copies or substantial portions of the
+ * Software.
+ *
+ * THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR
+ * IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY,
+ * FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT.  IN NO EVENT SHALL
+ * VA LINUX SYSTEMS AND/OR ITS SUPPLIERS BE LIABLE FOR ANY CLAIM, DAMAGES OR
+ * OTHER LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE,
+ * ARISING FROM, OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR
+ * OTHER DEALINGS IN THE SOFTWARE.
+ */
+
+#ifndef _DRM_NETLINK_H_
+#define _DRM_NETLINK_H_
+
+#define DRM_GENL_VERSION 1
+
+#if defined(__cplusplus)
+extern "C" {
+#endif
+
+/**
+ * enum drm_genl_error_cmds - Supported error commands
+ *
+ */
+enum drm_genl_error_cmds {
+	DRM_CMD_UNSPEC,
+	/**
+	 * @DRM_RAS_CMD_QUERY: Command to list all error names with config-id in verbose mode.
+	 * In normal mode it lists IP blocks, total instances available and error types supported.
+	 */
+	DRM_RAS_CMD_QUERY,
+	/** @DRM_RAS_CMD_READ_ONE: Command to get a counter for a specific error */
+	DRM_RAS_CMD_READ_ONE,
+	/** @DRM_RAS_CMD_READ_BLOCK: Command to get a counter of specific error type from an IP
+	 * block
+	 */
+	DRM_RAS_CMD_READ_BLOCK,
+	/** @DRM_RAS_CMD_READ_ALL: Command to get counters of all errors */
+	DRM_RAS_CMD_READ_ALL,
+
+	__DRM_CMD_MAX,
+	DRM_CMD_MAX = __DRM_CMD_MAX - 1,
+};
+
+enum drm_cmd_request_type {
+	DRM_RAS_CMD_QUERY_VERBOSE = 1,
+	DRM_RAS_CMD_QUERY_NORMAL = 2,
+};
+
+/**
+ * enum drm_error_attr - Attributes to use with drm_genl_error_cmds
+ *
+ */
+enum drm_error_attr {
+	DRM_ATTR_UNSPEC,
+	DRM_ATTR_PAD = DRM_ATTR_UNSPEC,
+	/**
+	 * @DRM_RAS_ATTR_QUERY: Should be used with DRM_RAS_CMD_QUERY
+	 * (DRM_RAS_CMD_READ_ALL uses DRM_RAS_ATTR_READ_ALL instead)
+	 */
+	DRM_RAS_ATTR_QUERY, /* NLA_U8 */
+	/**
+	 * @DRM_RAS_ATTR_READ_ALL: Should be used with DRM_RAS_CMD_READ_ALL
+	 */
+	DRM_RAS_ATTR_READ_ALL, /* NLA_U8 */
+	/**
+	 * @DRM_RAS_ATTR_QUERY_REPLY: First nested attribute sent as a
+	 * response to DRM_RAS_CMD_QUERY, DRM_RAS_CMD_READ_ALL commands.
+	 */
+	DRM_RAS_ATTR_QUERY_REPLY, /* NLA_NESTED */
+	/** @DRM_RAS_ATTR_ERROR_NAME: Used to pass error name */
+	DRM_RAS_ATTR_ERROR_NAME, /* NLA_NUL_STRING */
+	/** @DRM_RAS_ATTR_ERROR_ID: Used to pass error id, should be used with
+	 * DRM_RAS_CMD_READ_ONE, DRM_RAS_CMD_READ_BLOCK
+	 */
+	DRM_RAS_ATTR_ERROR_ID, /* NLA_U64 */
+	/** @DRM_RAS_ATTR_ERROR_VALUE: Used to pass error value */
+	DRM_RAS_ATTR_ERROR_VALUE, /* NLA_U64 */
+
+	__DRM_ATTR_MAX,
+	DRM_ATTR_MAX = __DRM_ATTR_MAX - 1,
+};
+
+#if defined(__cplusplus)
+}
+#endif
+
+#endif
-- 
2.25.1



* [RFC v5 2/5] drm/xe/RAS: Register netlink capability
  2025-07-30  6:49 [RFC v5 0/5] Proposal to use netlink for RAS and Telemetry across drm subsystem Aravind Iddamsetty
  2025-07-30  6:49 ` [RFC v5 1/5] drm/netlink: Add netlink infrastructure Aravind Iddamsetty
@ 2025-07-30  6:49 ` Aravind Iddamsetty
  2025-08-15 21:52   ` Rodrigo Vivi
  2025-07-30  6:49 ` [RFC v5 3/5] drm/xe/RAS: Expose the error counters Aravind Iddamsetty
                   ` (4 subsequent siblings)
  6 siblings, 1 reply; 24+ messages in thread
From: Aravind Iddamsetty @ 2025-07-30  6:49 UTC (permalink / raw)
  To: intel-xe, dri-devel
  Cc: Alex Deucher, David Airlie, Simona Vetter, Joonas Lahtinen,
	Rodrigo Vivi, Hawking Zhang, Lijo Lazar, Ruhl, Michael J,
	Riana Tauro, Anshuman Gupta

Register netlink capability with DRM and register the driver
callbacks for the DRM RAS netlink commands.
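
For illustration only, the dispatch pattern this patch wires up can be
modelled in userspace C. The sketch below is not kernel code: struct
fake_dev, fake_dispatch() and the CMD_* enum are hypothetical stand-ins
that mirror the shape of struct driver_genl_ops and the DRM_RAS_CMD_*
ordering from patch 1/5. It shows that a command with no entry in the
driver table (READ_BLOCK here) resolves to a NULL .doit and is rejected
with -EOPNOTSUPP.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>

/* Hypothetical stand-ins mirroring the DRM_RAS_CMD_* order in patch 1/5. */
enum fake_cmd {
	CMD_UNSPEC,
	CMD_QUERY,
	CMD_READ_ONE,
	CMD_READ_BLOCK,
	CMD_READ_ALL,
	CMD_COUNT,
};

struct fake_dev;

/* Same shape as struct driver_genl_ops, minus the netlink arguments. */
struct fake_genl_ops {
	int (*doit)(struct fake_dev *dev);
};

struct fake_dev {
	const struct fake_genl_ops *genl_ops;
	int calls;	/* counts how many callbacks actually ran */
};

static int fake_list_errors(struct fake_dev *dev) { dev->calls++; return 0; }
static int fake_read_error(struct fake_dev *dev) { dev->calls++; return 0; }

/* Sized explicitly so every command index is valid; gaps stay NULL. */
static const struct fake_genl_ops fake_ops[CMD_COUNT] = {
	[CMD_QUERY]    = { .doit = fake_list_errors },
	[CMD_READ_ONE] = { .doit = fake_read_error },
	[CMD_READ_ALL] = { .doit = fake_list_errors },
};

/* Core-side dispatch: reject unknown commands and missing callbacks. */
static int fake_dispatch(struct fake_dev *dev, unsigned int cmd)
{
	if (cmd >= CMD_COUNT || !dev->genl_ops || !dev->genl_ops[cmd].doit)
		return -EOPNOTSUPP;
	return dev->genl_ops[cmd].doit(dev);
}
```

Sizing the table by the command count, rather than by its last
designated initializer, is a choice made in this sketch so that lookups
for any later-added command stay in bounds.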

v2:
Move the netlink registration parts to DRM subsystem (Tomer Tayar)

v3: compile only if CONFIG_NET is enabled

Reviewed-by: Michael J. Ruhl <michael.j.ruhl@intel.com> #v2
Signed-off-by: Aravind Iddamsetty <aravind.iddamsetty@linux.intel.com>
---
 drivers/gpu/drm/xe/Makefile          |  2 ++
 drivers/gpu/drm/xe/xe_device.c       |  6 ++++++
 drivers/gpu/drm/xe/xe_device_types.h |  1 +
 drivers/gpu/drm/xe/xe_netlink.c      | 26 ++++++++++++++++++++++++++
 4 files changed, 35 insertions(+)
 create mode 100644 drivers/gpu/drm/xe/xe_netlink.c

diff --git a/drivers/gpu/drm/xe/Makefile b/drivers/gpu/drm/xe/Makefile
index 80eecd35e807..e960c2dbe658 100644
--- a/drivers/gpu/drm/xe/Makefile
+++ b/drivers/gpu/drm/xe/Makefile
@@ -304,6 +304,8 @@ xe-$(CONFIG_DRM_XE_DISPLAY) += \
 	i915-display/skl_universal_plane.o \
 	i915-display/skl_watermark.o
 
+xe-$(CONFIG_NET) += xe_netlink.o
+
 ifeq ($(CONFIG_ACPI),y)
 	xe-$(CONFIG_DRM_XE_DISPLAY) += \
 		i915-display/intel_acpi.o \
diff --git a/drivers/gpu/drm/xe/xe_device.c b/drivers/gpu/drm/xe/xe_device.c
index 806dbdf8118c..ca7a17c16aa5 100644
--- a/drivers/gpu/drm/xe/xe_device.c
+++ b/drivers/gpu/drm/xe/xe_device.c
@@ -363,6 +363,8 @@ static const struct file_operations xe_driver_fops = {
 	.fop_flags = FOP_UNSIGNED_OFFSET,
 };
 
+extern const struct driver_genl_ops xe_genl_ops[];
+
 static struct drm_driver driver = {
 	/* Don't use MTRRs here; the Xserver or userspace app should
 	 * deal with them for Intel hardware.
@@ -381,6 +383,10 @@ static struct drm_driver driver = {
 #ifdef CONFIG_PROC_FS
 	.show_fdinfo = xe_drm_client_fdinfo,
 #endif
+#ifdef CONFIG_NET
+	.genl_ops = xe_genl_ops,
+#endif
+
 	.ioctls = xe_ioctls,
 	.num_ioctls = ARRAY_SIZE(xe_ioctls),
 	.fops = &xe_driver_fops,
diff --git a/drivers/gpu/drm/xe/xe_device_types.h b/drivers/gpu/drm/xe/xe_device_types.h
index 3a851c7a55dd..08d3e53e4b37 100644
--- a/drivers/gpu/drm/xe/xe_device_types.h
+++ b/drivers/gpu/drm/xe/xe_device_types.h
@@ -10,6 +10,7 @@
 
 #include <drm/drm_device.h>
 #include <drm/drm_file.h>
+#include <drm/drm_netlink.h>
 #include <drm/ttm/ttm_device.h>
 
 #include "xe_devcoredump_types.h"
diff --git a/drivers/gpu/drm/xe/xe_netlink.c b/drivers/gpu/drm/xe/xe_netlink.c
new file mode 100644
index 000000000000..9e588fb19631
--- /dev/null
+++ b/drivers/gpu/drm/xe/xe_netlink.c
@@ -0,0 +1,26 @@
+// SPDX-License-Identifier: MIT
+/*
+ * Copyright © 2023 Intel Corporation
+ */
+
+#include <net/genetlink.h>
+#include <uapi/drm/drm_netlink.h>
+
+#include "xe_device.h"
+
+static int xe_genl_list_errors(struct drm_device *drm, struct sk_buff *msg, struct genl_info *info)
+{
+	return 0;
+}
+
+static int xe_genl_read_error(struct drm_device *drm, struct sk_buff *msg, struct genl_info *info)
+{
+	return 0;
+}
+
+/* driver callbacks for DRM netlink commands */
+const struct driver_genl_ops xe_genl_ops[] = {
+	[DRM_RAS_CMD_QUERY] =		{ .doit = xe_genl_list_errors },
+	[DRM_RAS_CMD_READ_ONE] =	{ .doit = xe_genl_read_error },
+	[DRM_RAS_CMD_READ_ALL] =	{ .doit = xe_genl_list_errors, },
+};
-- 
2.25.1



* [RFC v5 3/5] drm/xe/RAS: Expose the error counters
  2025-07-30  6:49 [RFC v5 0/5] Proposal to use netlink for RAS and Telemetry across drm subsystem Aravind Iddamsetty
  2025-07-30  6:49 ` [RFC v5 1/5] drm/netlink: Add netlink infrastructure Aravind Iddamsetty
  2025-07-30  6:49 ` [RFC v5 2/5] drm/xe/RAS: Register netlink capability Aravind Iddamsetty
@ 2025-07-30  6:49 ` Aravind Iddamsetty
  2025-08-15 21:58   ` Rodrigo Vivi
  2025-07-30  6:49 ` [RFC v5 4/5] drm/netlink: Define multicast groups Aravind Iddamsetty
                   ` (3 subsequent siblings)
  6 siblings, 1 reply; 24+ messages in thread
From: Aravind Iddamsetty @ 2025-07-30  6:49 UTC (permalink / raw)
  To: intel-xe, dri-devel
  Cc: Alex Deucher, David Airlie, Simona Vetter, Joonas Lahtinen,
	Rodrigo Vivi, Hawking Zhang, Lijo Lazar, Ruhl, Michael J,
	Riana Tauro, Anshuman Gupta

We expose the various error counters supported by the hardware to
userspace via the genl subsystem, through the registered commands.
DRM_RAS_CMD_QUERY lists the error names with their config ids,
DRM_RAS_CMD_READ_ONE returns the counter value for the requested config
id, and DRM_RAS_CMD_READ_ALL lists the counters for all errors along
with their names and config ids.
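
As a rough userspace model of the three commands described above, and
not the driver implementation, the sketch below keeps a small counter
table and answers each request from it. Everything here is an
assumption made for illustration: struct fake_counter and the fake_*
helpers are hypothetical, the names are borrowed from
xe_hw_error_events, and the ids and values are made up rather than real
config ids.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical counter table; ids are illustrative, not real config ids. */
struct fake_counter {
	uint64_t id;
	const char *name;
	uint64_t value;
};

static const struct fake_counter fake_counters[] = {
	{ 1, "correctable-guc", 3 },
	{ 2, "fatal-guc", 0 },
	{ 3, "sgunit-fatal", 1 },
};

#define FAKE_NUM_COUNTERS (sizeof(fake_counters) / sizeof(fake_counters[0]))

/* QUERY model: report the name of entry idx, or NULL past the end. */
static const char *fake_query_name(size_t idx)
{
	return idx < FAKE_NUM_COUNTERS ? fake_counters[idx].name : NULL;
}

/* READ_ONE model: look up one counter by id; 0 on success, -1 if unknown. */
static int fake_read_one(uint64_t id, uint64_t *value)
{
	for (size_t i = 0; i < FAKE_NUM_COUNTERS; i++) {
		if (fake_counters[i].id == id) {
			*value = fake_counters[i].value;
			return 0;
		}
	}
	return -1;
}

/* READ_ALL model: copy up to max full entries, return how many were copied. */
static size_t fake_read_all(struct fake_counter *out, size_t max)
{
	size_t n = FAKE_NUM_COUNTERS < max ? FAKE_NUM_COUNTERS : max;

	for (size_t i = 0; i < n; i++)
		out[i] = fake_counters[i];
	return n;
}
```

In this model READ_ALL returns whole entries (id, name and value),
matching the description above, while READ_ONE returns only the value
for a caller that already knows the id from an earlier QUERY.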

v2: Rebase

v3:
1. Presently xe_list_errors fills blank data for IGFX; prevent it with
an early IS_DGFX check (Michael J. Ruhl)
2. update errors from all sources

v4: Check the QUERY param; if it is normal, return not supported.

Reviewed-by: Michael J. Ruhl <michael.j.ruhl@intel.com>
Signed-off-by: Aravind Iddamsetty <aravind.iddamsetty@linux.intel.com>
---
 drivers/gpu/drm/xe/xe_hw_error.c |  15 +-
 drivers/gpu/drm/xe/xe_netlink.c  | 509 ++++++++++++++++++++++++++++++-
 include/uapi/drm/xe_drm.h        |  85 ++++++
 3 files changed, 602 insertions(+), 7 deletions(-)

diff --git a/drivers/gpu/drm/xe/xe_hw_error.c b/drivers/gpu/drm/xe/xe_hw_error.c
index 6a7cd59caac1..bdd9c88674b2 100644
--- a/drivers/gpu/drm/xe/xe_hw_error.c
+++ b/drivers/gpu/drm/xe/xe_hw_error.c
@@ -531,16 +531,21 @@ static void xe_clear_all_soc_errors(struct xe_device *xe)
 
 		while (hw_err < HARDWARE_ERROR_MAX) {
 			for (i = 0; i < XE_SOC_NUM_IEH; i++)
-				xe_mmio_write32(&gt->tile->mmio, SOC_GSYSEVTCTL_REG(base, slave_base, i),
+				xe_mmio_write32(&gt->tile->mmio,
+						SOC_GSYSEVTCTL_REG(base, slave_base, i),
 						~REG_BIT(hw_err));
 
-			xe_mmio_write32(&gt->tile->mmio, SOC_GLOBAL_ERR_STAT_MASTER_REG(base, hw_err),
+			xe_mmio_write32(&gt->tile->mmio,
+					SOC_GLOBAL_ERR_STAT_MASTER_REG(base, hw_err),
 					REG_GENMASK(31, 0));
-			xe_mmio_write32(&gt->tile->mmio, SOC_LOCAL_ERR_STAT_MASTER_REG(base, hw_err),
+			xe_mmio_write32(&gt->tile->mmio,
+					SOC_LOCAL_ERR_STAT_MASTER_REG(base, hw_err),
 					REG_GENMASK(31, 0));
-			xe_mmio_write32(&gt->tile->mmio, SOC_GLOBAL_ERR_STAT_SLAVE_REG(slave_base, hw_err),
+			xe_mmio_write32(&gt->tile->mmio,
+					SOC_GLOBAL_ERR_STAT_SLAVE_REG(slave_base, hw_err),
 					REG_GENMASK(31, 0));
-			xe_mmio_write32(&gt->tile->mmio, SOC_LOCAL_ERR_STAT_SLAVE_REG(slave_base, hw_err),
+			xe_mmio_write32(&gt->tile->mmio,
+					SOC_LOCAL_ERR_STAT_SLAVE_REG(slave_base, hw_err),
 					REG_GENMASK(31, 0));
 			hw_err++;
 		}
diff --git a/drivers/gpu/drm/xe/xe_netlink.c b/drivers/gpu/drm/xe/xe_netlink.c
index 9e588fb19631..20240875284a 100644
--- a/drivers/gpu/drm/xe/xe_netlink.c
+++ b/drivers/gpu/drm/xe/xe_netlink.c
@@ -6,16 +6,521 @@
 #include <net/genetlink.h>
 #include <uapi/drm/drm_netlink.h>
 
+#include <drm/xe_drm.h>
+
 #include "xe_device.h"
 
-static int xe_genl_list_errors(struct drm_device *drm, struct sk_buff *msg, struct genl_info *info)
+#define MAX_ERROR_NAME	100
+
+static const char * const xe_hw_error_events[] = {
+		[DRM_XE_GENL_GT_ERROR_CORRECTABLE_L3_SNG] = "correctable-l3-sng",
+		[DRM_XE_GENL_GT_ERROR_CORRECTABLE_GUC] = "correctable-guc",
+		[DRM_XE_GENL_GT_ERROR_CORRECTABLE_SAMPLER] = "correctable-sampler",
+		[DRM_XE_GENL_GT_ERROR_CORRECTABLE_SLM] = "correctable-slm",
+		[DRM_XE_GENL_GT_ERROR_CORRECTABLE_EU_IC] = "correctable-eu-ic",
+		[DRM_XE_GENL_GT_ERROR_CORRECTABLE_EU_GRF] = "correctable-eu-grf",
+		[DRM_XE_GENL_GT_ERROR_FATAL_ARR_BIST] = "fatal-array-bist",
+		[DRM_XE_GENL_GT_ERROR_FATAL_L3_DOUB] = "fatal-l3-double",
+		[DRM_XE_GENL_GT_ERROR_FATAL_L3_ECC_CHK] = "fatal-l3-ecc-checker",
+		[DRM_XE_GENL_GT_ERROR_FATAL_GUC] = "fatal-guc",
+		[DRM_XE_GENL_GT_ERROR_FATAL_IDI_PAR] = "fatal-idi-parity",
+		[DRM_XE_GENL_GT_ERROR_FATAL_SQIDI] = "fatal-sqidi",
+		[DRM_XE_GENL_GT_ERROR_FATAL_SAMPLER] = "fatal-sampler",
+		[DRM_XE_GENL_GT_ERROR_FATAL_SLM] = "fatal-slm",
+		[DRM_XE_GENL_GT_ERROR_FATAL_EU_IC] = "fatal-eu-ic",
+		[DRM_XE_GENL_GT_ERROR_FATAL_EU_GRF] = "fatal-eu-grf",
+		[DRM_XE_GENL_GT_ERROR_FATAL_FPU] = "fatal-fpu",
+		[DRM_XE_GENL_GT_ERROR_FATAL_TLB] = "fatal-tlb",
+		[DRM_XE_GENL_GT_ERROR_FATAL_L3_FABRIC] = "fatal-l3-fabric",
+		[DRM_XE_GENL_GT_ERROR_CORRECTABLE_SUBSLICE] = "correctable-subslice",
+		[DRM_XE_GENL_GT_ERROR_CORRECTABLE_L3BANK] = "correctable-l3bank",
+		[DRM_XE_GENL_GT_ERROR_FATAL_SUBSLICE] = "fatal-subslice",
+		[DRM_XE_GENL_GT_ERROR_FATAL_L3BANK] = "fatal-l3bank",
+		[DRM_XE_GENL_SGUNIT_ERROR_CORRECTABLE] = "sgunit-correctable",
+		[DRM_XE_GENL_SGUNIT_ERROR_NONFATAL] = "sgunit-nonfatal",
+		[DRM_XE_GENL_SGUNIT_ERROR_FATAL] = "sgunit-fatal",
+		[DRM_XE_GENL_SOC_ERROR_NONFATAL_CSC_PSF_CMD] = "soc-nonfatal-csc-psf-cmd-parity",
+		[DRM_XE_GENL_SOC_ERROR_NONFATAL_CSC_PSF_CMP] = "soc-nonfatal-csc-psf-unexpected-completion",
+		[DRM_XE_GENL_SOC_ERROR_NONFATAL_CSC_PSF_REQ] = "soc-nonfatal-csc-psf-unsupported-request",
+		[DRM_XE_GENL_SOC_ERROR_NONFATAL_ANR_MDFI] = "soc-nonfatal-anr-mdfi",
+		[DRM_XE_GENL_SOC_ERROR_NONFATAL_MDFI_T2T] = "soc-nonfatal-mdfi-t2t",
+		[DRM_XE_GENL_SOC_ERROR_NONFATAL_MDFI_T2C] = "soc-nonfatal-mdfi-t2c",
+		[DRM_XE_GENL_SOC_ERROR_NONFATAL_HBM(0, 0)] = "soc-nonfatal-hbm-ss0-0",
+		[DRM_XE_GENL_SOC_ERROR_NONFATAL_HBM(0, 1)] = "soc-nonfatal-hbm-ss0-1",
+		[DRM_XE_GENL_SOC_ERROR_NONFATAL_HBM(0, 2)] = "soc-nonfatal-hbm-ss0-2",
+		[DRM_XE_GENL_SOC_ERROR_NONFATAL_HBM(0, 3)] = "soc-nonfatal-hbm-ss0-3",
+		[DRM_XE_GENL_SOC_ERROR_NONFATAL_HBM(0, 4)] = "soc-nonfatal-hbm-ss0-4",
+		[DRM_XE_GENL_SOC_ERROR_NONFATAL_HBM(0, 5)] = "soc-nonfatal-hbm-ss0-5",
+		[DRM_XE_GENL_SOC_ERROR_NONFATAL_HBM(0, 6)] = "soc-nonfatal-hbm-ss0-6",
+		[DRM_XE_GENL_SOC_ERROR_NONFATAL_HBM(0, 7)] = "soc-nonfatal-hbm-ss0-7",
+		[DRM_XE_GENL_SOC_ERROR_NONFATAL_HBM(0, 8)] = "soc-nonfatal-hbm-ss1-0",
+		[DRM_XE_GENL_SOC_ERROR_NONFATAL_HBM(0, 9)] = "soc-nonfatal-hbm-ss1-1",
+		[DRM_XE_GENL_SOC_ERROR_NONFATAL_HBM(0, 10)] = "soc-nonfatal-hbm-ss1-2",
+		[DRM_XE_GENL_SOC_ERROR_NONFATAL_HBM(0, 11)] = "soc-nonfatal-hbm-ss1-3",
+		[DRM_XE_GENL_SOC_ERROR_NONFATAL_HBM(0, 12)] = "soc-nonfatal-hbm-ss1-4",
+		[DRM_XE_GENL_SOC_ERROR_NONFATAL_HBM(0, 13)] = "soc-nonfatal-hbm-ss1-5",
+		[DRM_XE_GENL_SOC_ERROR_NONFATAL_HBM(0, 14)] = "soc-nonfatal-hbm-ss1-6",
+		[DRM_XE_GENL_SOC_ERROR_NONFATAL_HBM(0, 15)] = "soc-nonfatal-hbm-ss1-7",
+		[DRM_XE_GENL_SOC_ERROR_NONFATAL_HBM(1, 0)] = "soc-nonfatal-hbm-ss2-0",
+		[DRM_XE_GENL_SOC_ERROR_NONFATAL_HBM(1, 1)] = "soc-nonfatal-hbm-ss2-1",
+		[DRM_XE_GENL_SOC_ERROR_NONFATAL_HBM(1, 2)] = "soc-nonfatal-hbm-ss2-2",
+		[DRM_XE_GENL_SOC_ERROR_NONFATAL_HBM(1, 3)] = "soc-nonfatal-hbm-ss2-3",
+		[DRM_XE_GENL_SOC_ERROR_NONFATAL_HBM(1, 4)] = "soc-nonfatal-hbm-ss2-4",
+		[DRM_XE_GENL_SOC_ERROR_NONFATAL_HBM(1, 5)] = "soc-nonfatal-hbm-ss2-5",
+		[DRM_XE_GENL_SOC_ERROR_NONFATAL_HBM(1, 6)] = "soc-nonfatal-hbm-ss2-6",
+		[DRM_XE_GENL_SOC_ERROR_NONFATAL_HBM(1, 7)] = "soc-nonfatal-hbm-ss2-7",
+		[DRM_XE_GENL_SOC_ERROR_NONFATAL_HBM(1, 8)] = "soc-nonfatal-hbm-ss3-0",
+		[DRM_XE_GENL_SOC_ERROR_NONFATAL_HBM(1, 9)] = "soc-nonfatal-hbm-ss3-1",
+		[DRM_XE_GENL_SOC_ERROR_NONFATAL_HBM(1, 10)] = "soc-nonfatal-hbm-ss3-2",
+		[DRM_XE_GENL_SOC_ERROR_NONFATAL_HBM(1, 11)] = "soc-nonfatal-hbm-ss3-3",
+		[DRM_XE_GENL_SOC_ERROR_NONFATAL_HBM(1, 12)] = "soc-nonfatal-hbm-ss3-4",
+		[DRM_XE_GENL_SOC_ERROR_NONFATAL_HBM(1, 13)] = "soc-nonfatal-hbm-ss3-5",
+		[DRM_XE_GENL_SOC_ERROR_NONFATAL_HBM(1, 14)] = "soc-nonfatal-hbm-ss3-6",
+		[DRM_XE_GENL_SOC_ERROR_NONFATAL_HBM(1, 15)] = "soc-nonfatal-hbm-ss3-7",
+		[DRM_XE_GENL_SOC_ERROR_FATAL_CSC_PSF_CMD] = "soc-fatal-csc-psf-cmd-parity",
+		[DRM_XE_GENL_SOC_ERROR_FATAL_CSC_PSF_CMP] = "soc-fatal-csc-psf-unexpected-completion",
+		[DRM_XE_GENL_SOC_ERROR_FATAL_CSC_PSF_REQ] = "soc-fatal-csc-psf-unsupported-request",
+		[DRM_XE_GENL_SOC_ERROR_FATAL_PUNIT] = "soc-fatal-punit",
+		[DRM_XE_GENL_SOC_ERROR_FATAL_PCIE_PSF_CMD] = "soc-fatal-pcie-psf-command-parity",
+		[DRM_XE_GENL_SOC_ERROR_FATAL_PCIE_PSF_CMP] = "soc-fatal-pcie-psf-unexpected-completion",
+		[DRM_XE_GENL_SOC_ERROR_FATAL_PCIE_PSF_REQ] = "soc-fatal-pcie-psf-unsupported-request",
+		[DRM_XE_GENL_SOC_ERROR_FATAL_ANR_MDFI] = "soc-fatal-anr-mdfi",
+		[DRM_XE_GENL_SOC_ERROR_FATAL_MDFI_T2T] = "soc-fatal-mdfi-t2t",
+		[DRM_XE_GENL_SOC_ERROR_FATAL_MDFI_T2C] = "soc-fatal-mdfi-t2c",
+		[DRM_XE_GENL_SOC_ERROR_FATAL_PCIE_AER] = "soc-fatal-malformed-pcie-aer",
+		[DRM_XE_GENL_SOC_ERROR_FATAL_PCIE_ERR] = "soc-fatal-malformed-pcie-err",
+		[DRM_XE_GENL_SOC_ERROR_FATAL_UR_COND] = "soc-fatal-ur-condition-ieh",
+		[DRM_XE_GENL_SOC_ERROR_FATAL_SERR_SRCS] = "soc-fatal-from-serr-sources",
+		[DRM_XE_GENL_SOC_ERROR_FATAL_HBM(0, 0)] = "soc-fatal-hbm-ss0-0",
+		[DRM_XE_GENL_SOC_ERROR_FATAL_HBM(0, 1)] = "soc-fatal-hbm-ss0-1",
+		[DRM_XE_GENL_SOC_ERROR_FATAL_HBM(0, 2)] = "soc-fatal-hbm-ss0-2",
+		[DRM_XE_GENL_SOC_ERROR_FATAL_HBM(0, 3)] = "soc-fatal-hbm-ss0-3",
+		[DRM_XE_GENL_SOC_ERROR_FATAL_HBM(0, 4)] = "soc-fatal-hbm-ss0-4",
+		[DRM_XE_GENL_SOC_ERROR_FATAL_HBM(0, 5)] = "soc-fatal-hbm-ss0-5",
+		[DRM_XE_GENL_SOC_ERROR_FATAL_HBM(0, 6)] = "soc-fatal-hbm-ss0-6",
+		[DRM_XE_GENL_SOC_ERROR_FATAL_HBM(0, 7)] = "soc-fatal-hbm-ss0-7",
+		[DRM_XE_GENL_SOC_ERROR_FATAL_HBM(0, 8)] = "soc-fatal-hbm-ss1-0",
+		[DRM_XE_GENL_SOC_ERROR_FATAL_HBM(0, 9)] = "soc-fatal-hbm-ss1-1",
+		[DRM_XE_GENL_SOC_ERROR_FATAL_HBM(0, 10)] = "soc-fatal-hbm-ss1-2",
+		[DRM_XE_GENL_SOC_ERROR_FATAL_HBM(0, 11)] = "soc-fatal-hbm-ss1-3",
+		[DRM_XE_GENL_SOC_ERROR_FATAL_HBM(0, 12)] = "soc-fatal-hbm-ss1-4",
+		[DRM_XE_GENL_SOC_ERROR_FATAL_HBM(0, 13)] = "soc-fatal-hbm-ss1-5",
+		[DRM_XE_GENL_SOC_ERROR_FATAL_HBM(0, 14)] = "soc-fatal-hbm-ss1-6",
+		[DRM_XE_GENL_SOC_ERROR_FATAL_HBM(0, 15)] = "soc-fatal-hbm-ss1-7",
+		[DRM_XE_GENL_SOC_ERROR_FATAL_HBM(1, 0)] = "soc-fatal-hbm-ss2-0",
+		[DRM_XE_GENL_SOC_ERROR_FATAL_HBM(1, 1)] = "soc-fatal-hbm-ss2-1",
+		[DRM_XE_GENL_SOC_ERROR_FATAL_HBM(1, 2)] = "soc-fatal-hbm-ss2-2",
+		[DRM_XE_GENL_SOC_ERROR_FATAL_HBM(1, 3)] = "soc-fatal-hbm-ss2-3",
+		[DRM_XE_GENL_SOC_ERROR_FATAL_HBM(1, 4)] = "soc-fatal-hbm-ss2-4",
+		[DRM_XE_GENL_SOC_ERROR_FATAL_HBM(1, 5)] = "soc-fatal-hbm-ss2-5",
+		[DRM_XE_GENL_SOC_ERROR_FATAL_HBM(1, 6)] = "soc-fatal-hbm-ss2-6",
+		[DRM_XE_GENL_SOC_ERROR_FATAL_HBM(1, 7)] = "soc-fatal-hbm-ss2-7",
+		[DRM_XE_GENL_SOC_ERROR_FATAL_HBM(1, 8)] = "soc-fatal-hbm-ss3-0",
+		[DRM_XE_GENL_SOC_ERROR_FATAL_HBM(1, 9)] = "soc-fatal-hbm-ss3-1",
+		[DRM_XE_GENL_SOC_ERROR_FATAL_HBM(1, 10)] = "soc-fatal-hbm-ss3-2",
+		[DRM_XE_GENL_SOC_ERROR_FATAL_HBM(1, 11)] = "soc-fatal-hbm-ss3-3",
+		[DRM_XE_GENL_SOC_ERROR_FATAL_HBM(1, 12)] = "soc-fatal-hbm-ss3-4",
+		[DRM_XE_GENL_SOC_ERROR_FATAL_HBM(1, 13)] = "soc-fatal-hbm-ss3-5",
+		[DRM_XE_GENL_SOC_ERROR_FATAL_HBM(1, 14)] = "soc-fatal-hbm-ss3-6",
+		[DRM_XE_GENL_SOC_ERROR_FATAL_HBM(1, 15)] = "soc-fatal-hbm-ss3-7",
+		[DRM_XE_GENL_GSC_ERROR_CORRECTABLE_SRAM_ECC] = "gsc-correctable-sram-ecc",
+		[DRM_XE_GENL_GSC_ERROR_NONFATAL_MIA_SHUTDOWN] = "gsc-nonfatal-mia-shutdown",
+		[DRM_XE_GENL_GSC_ERROR_NONFATAL_MIA_INTERNAL] = "gsc-nonfatal-mia-internal",
+		[DRM_XE_GENL_GSC_ERROR_NONFATAL_SRAM_ECC] = "gsc-nonfatal-sram-ecc",
+		[DRM_XE_GENL_GSC_ERROR_NONFATAL_WDG_TIMEOUT] = "gsc-nonfatal-wdg-timeout",
+		[DRM_XE_GENL_GSC_ERROR_NONFATAL_ROM_PARITY] = "gsc-nonfatal-rom-parity",
+		[DRM_XE_GENL_GSC_ERROR_NONFATAL_UCODE_PARITY] = "gsc-nonfatal-ucode-parity",
+		[DRM_XE_GENL_GSC_ERROR_NONFATAL_VLT_GLITCH] = "gsc-nonfatal-vlt-glitch",
+		[DRM_XE_GENL_GSC_ERROR_NONFATAL_FUSE_PULL] = "gsc-nonfatal-fuse-pull",
+		[DRM_XE_GENL_GSC_ERROR_NONFATAL_FUSE_CRC_CHECK] = "gsc-nonfatal-fuse-crc-check",
+		[DRM_XE_GENL_GSC_ERROR_NONFATAL_SELF_MBIST] = "gsc-nonfatal-self-mbist",
+		[DRM_XE_GENL_GSC_ERROR_NONFATAL_AON_RF_PARITY] = "gsc-nonfatal-aon-parity",
+		[DRM_XE_GENL_SGGI_ERROR_NONFATAL] = "sggi-nonfatal-data-parity",
+		[DRM_XE_GENL_SGLI_ERROR_NONFATAL] = "sgli-nonfatal-data-parity",
+		[DRM_XE_GENL_SGCI_ERROR_NONFATAL] = "sgci-nonfatal-data-parity",
+		[DRM_XE_GENL_MERT_ERROR_NONFATAL] = "mert-nonfatal-data-parity",
+		[DRM_XE_GENL_SGGI_ERROR_FATAL] = "sggi-fatal-data-parity",
+		[DRM_XE_GENL_SGLI_ERROR_FATAL] = "sgli-fatal-data-parity",
+		[DRM_XE_GENL_SGCI_ERROR_FATAL] = "sgci-fatal-data-parity",
+		[DRM_XE_GENL_MERT_ERROR_FATAL] = "mert-fatal-data-parity",
+};
+
+static const unsigned long xe_hw_error_map[] = {
+	[DRM_XE_GENL_GT_ERROR_CORRECTABLE_L3_SNG] = XE_HW_ERR_GT_CORR_L3_SNG,
+	[DRM_XE_GENL_GT_ERROR_CORRECTABLE_GUC] = XE_HW_ERR_GT_CORR_GUC,
+	[DRM_XE_GENL_GT_ERROR_CORRECTABLE_SAMPLER] = XE_HW_ERR_GT_CORR_SAMPLER,
+	[DRM_XE_GENL_GT_ERROR_CORRECTABLE_SLM] = XE_HW_ERR_GT_CORR_SLM,
+	[DRM_XE_GENL_GT_ERROR_CORRECTABLE_EU_IC] = XE_HW_ERR_GT_CORR_EU_IC,
+	[DRM_XE_GENL_GT_ERROR_CORRECTABLE_EU_GRF] = XE_HW_ERR_GT_CORR_EU_GRF,
+	[DRM_XE_GENL_GT_ERROR_FATAL_ARR_BIST] = XE_HW_ERR_GT_FATAL_ARR_BIST,
+	[DRM_XE_GENL_GT_ERROR_FATAL_L3_DOUB] = XE_HW_ERR_GT_FATAL_L3_DOUB,
+	[DRM_XE_GENL_GT_ERROR_FATAL_L3_ECC_CHK] = XE_HW_ERR_GT_FATAL_L3_ECC_CHK,
+	[DRM_XE_GENL_GT_ERROR_FATAL_GUC] = XE_HW_ERR_GT_FATAL_GUC,
+	[DRM_XE_GENL_GT_ERROR_FATAL_IDI_PAR] = XE_HW_ERR_GT_FATAL_IDI_PAR,
+	[DRM_XE_GENL_GT_ERROR_FATAL_SQIDI] = XE_HW_ERR_GT_FATAL_SQIDI,
+	[DRM_XE_GENL_GT_ERROR_FATAL_SAMPLER] = XE_HW_ERR_GT_FATAL_SAMPLER,
+	[DRM_XE_GENL_GT_ERROR_FATAL_SLM] = XE_HW_ERR_GT_FATAL_SLM,
+	[DRM_XE_GENL_GT_ERROR_FATAL_EU_IC] = XE_HW_ERR_GT_FATAL_EU_IC,
+	[DRM_XE_GENL_GT_ERROR_FATAL_EU_GRF] = XE_HW_ERR_GT_FATAL_EU_GRF,
+	[DRM_XE_GENL_GT_ERROR_FATAL_FPU] = XE_HW_ERR_GT_FATAL_FPU,
+	[DRM_XE_GENL_GT_ERROR_FATAL_TLB] = XE_HW_ERR_GT_FATAL_TLB,
+	[DRM_XE_GENL_GT_ERROR_FATAL_L3_FABRIC] = XE_HW_ERR_GT_FATAL_L3_FABRIC,
+	[DRM_XE_GENL_GT_ERROR_CORRECTABLE_SUBSLICE] = XE_HW_ERR_GT_CORR_SUBSLICE,
+	[DRM_XE_GENL_GT_ERROR_CORRECTABLE_L3BANK] = XE_HW_ERR_GT_CORR_L3BANK,
+	[DRM_XE_GENL_GT_ERROR_FATAL_SUBSLICE] = XE_HW_ERR_GT_FATAL_SUBSLICE,
+	[DRM_XE_GENL_GT_ERROR_FATAL_L3BANK] = XE_HW_ERR_GT_FATAL_L3BANK,
+	[DRM_XE_GENL_SGUNIT_ERROR_CORRECTABLE] = XE_HW_ERR_TILE_CORR_SGUNIT,
+	[DRM_XE_GENL_SGUNIT_ERROR_NONFATAL] = XE_HW_ERR_TILE_NONFATAL_SGUNIT,
+	[DRM_XE_GENL_SGUNIT_ERROR_FATAL] = XE_HW_ERR_TILE_FATAL_SGUNIT,
+	[DRM_XE_GENL_SOC_ERROR_NONFATAL_CSC_PSF_CMD] = XE_HW_ERR_SOC_NONFATAL_CSC_PSF_CMD,
+	[DRM_XE_GENL_SOC_ERROR_NONFATAL_CSC_PSF_CMP] = XE_HW_ERR_SOC_NONFATAL_CSC_PSF_CMP,
+	[DRM_XE_GENL_SOC_ERROR_NONFATAL_CSC_PSF_REQ] = XE_HW_ERR_SOC_NONFATAL_CSC_PSF_REQ,
+	[DRM_XE_GENL_SOC_ERROR_NONFATAL_ANR_MDFI] = XE_HW_ERR_SOC_NONFATAL_ANR_MDFI,
+	[DRM_XE_GENL_SOC_ERROR_NONFATAL_MDFI_T2T] = XE_HW_ERR_SOC_NONFATAL_MDFI_T2T,
+	[DRM_XE_GENL_SOC_ERROR_NONFATAL_MDFI_T2C] = XE_HW_ERR_SOC_NONFATAL_MDFI_T2C,
+	[DRM_XE_GENL_SOC_ERROR_NONFATAL_HBM(0, 0)] = XE_HW_ERR_SOC_NONFATAL_HBM0_CHNL0,
+	[DRM_XE_GENL_SOC_ERROR_NONFATAL_HBM(0, 1)] = XE_HW_ERR_SOC_NONFATAL_HBM0_CHNL1,
+	[DRM_XE_GENL_SOC_ERROR_NONFATAL_HBM(0, 2)] = XE_HW_ERR_SOC_NONFATAL_HBM0_CHNL2,
+	[DRM_XE_GENL_SOC_ERROR_NONFATAL_HBM(0, 3)] = XE_HW_ERR_SOC_NONFATAL_HBM0_CHNL3,
+	[DRM_XE_GENL_SOC_ERROR_NONFATAL_HBM(0, 4)] = XE_HW_ERR_SOC_NONFATAL_HBM0_CHNL4,
+	[DRM_XE_GENL_SOC_ERROR_NONFATAL_HBM(0, 5)] = XE_HW_ERR_SOC_NONFATAL_HBM0_CHNL5,
+	[DRM_XE_GENL_SOC_ERROR_NONFATAL_HBM(0, 6)] = XE_HW_ERR_SOC_NONFATAL_HBM0_CHNL6,
+	[DRM_XE_GENL_SOC_ERROR_NONFATAL_HBM(0, 7)] = XE_HW_ERR_SOC_NONFATAL_HBM0_CHNL7,
+	[DRM_XE_GENL_SOC_ERROR_NONFATAL_HBM(0, 8)] = XE_HW_ERR_SOC_NONFATAL_HBM1_CHNL0,
+	[DRM_XE_GENL_SOC_ERROR_NONFATAL_HBM(0, 9)] = XE_HW_ERR_SOC_NONFATAL_HBM1_CHNL1,
+	[DRM_XE_GENL_SOC_ERROR_NONFATAL_HBM(0, 10)] = XE_HW_ERR_SOC_NONFATAL_HBM1_CHNL2,
+	[DRM_XE_GENL_SOC_ERROR_NONFATAL_HBM(0, 11)] = XE_HW_ERR_SOC_NONFATAL_HBM1_CHNL3,
+	[DRM_XE_GENL_SOC_ERROR_NONFATAL_HBM(0, 12)] = XE_HW_ERR_SOC_NONFATAL_HBM1_CHNL4,
+	[DRM_XE_GENL_SOC_ERROR_NONFATAL_HBM(0, 13)] = XE_HW_ERR_SOC_NONFATAL_HBM1_CHNL5,
+	[DRM_XE_GENL_SOC_ERROR_NONFATAL_HBM(0, 14)] = XE_HW_ERR_SOC_NONFATAL_HBM1_CHNL6,
+	[DRM_XE_GENL_SOC_ERROR_NONFATAL_HBM(0, 15)] = XE_HW_ERR_SOC_NONFATAL_HBM1_CHNL7,
+	[DRM_XE_GENL_SOC_ERROR_NONFATAL_HBM(1, 0)] = XE_HW_ERR_SOC_NONFATAL_HBM2_CHNL0,
+	[DRM_XE_GENL_SOC_ERROR_NONFATAL_HBM(1, 1)] = XE_HW_ERR_SOC_NONFATAL_HBM2_CHNL1,
+	[DRM_XE_GENL_SOC_ERROR_NONFATAL_HBM(1, 2)] = XE_HW_ERR_SOC_NONFATAL_HBM2_CHNL2,
+	[DRM_XE_GENL_SOC_ERROR_NONFATAL_HBM(1, 3)] = XE_HW_ERR_SOC_NONFATAL_HBM2_CHNL3,
+	[DRM_XE_GENL_SOC_ERROR_NONFATAL_HBM(1, 4)] = XE_HW_ERR_SOC_NONFATAL_HBM2_CHNL4,
+	[DRM_XE_GENL_SOC_ERROR_NONFATAL_HBM(1, 5)] = XE_HW_ERR_SOC_NONFATAL_HBM2_CHNL5,
+	[DRM_XE_GENL_SOC_ERROR_NONFATAL_HBM(1, 6)] = XE_HW_ERR_SOC_NONFATAL_HBM2_CHNL6,
+	[DRM_XE_GENL_SOC_ERROR_NONFATAL_HBM(1, 7)] = XE_HW_ERR_SOC_NONFATAL_HBM2_CHNL7,
+	[DRM_XE_GENL_SOC_ERROR_NONFATAL_HBM(1, 8)] = XE_HW_ERR_SOC_NONFATAL_HBM3_CHNL0,
+	[DRM_XE_GENL_SOC_ERROR_NONFATAL_HBM(1, 9)] = XE_HW_ERR_SOC_NONFATAL_HBM3_CHNL1,
+	[DRM_XE_GENL_SOC_ERROR_NONFATAL_HBM(1, 10)] = XE_HW_ERR_SOC_NONFATAL_HBM3_CHNL2,
+	[DRM_XE_GENL_SOC_ERROR_NONFATAL_HBM(1, 11)] = XE_HW_ERR_SOC_NONFATAL_HBM3_CHNL3,
+	[DRM_XE_GENL_SOC_ERROR_NONFATAL_HBM(1, 12)] = XE_HW_ERR_SOC_NONFATAL_HBM3_CHNL4,
+	[DRM_XE_GENL_SOC_ERROR_NONFATAL_HBM(1, 13)] = XE_HW_ERR_SOC_NONFATAL_HBM3_CHNL5,
+	[DRM_XE_GENL_SOC_ERROR_NONFATAL_HBM(1, 14)] = XE_HW_ERR_SOC_NONFATAL_HBM3_CHNL6,
+	[DRM_XE_GENL_SOC_ERROR_NONFATAL_HBM(1, 15)] = XE_HW_ERR_SOC_NONFATAL_HBM3_CHNL7,
+	[DRM_XE_GENL_SOC_ERROR_FATAL_CSC_PSF_CMD] = XE_HW_ERR_SOC_FATAL_CSC_PSF_CMD,
+	[DRM_XE_GENL_SOC_ERROR_FATAL_CSC_PSF_CMP] = XE_HW_ERR_SOC_FATAL_CSC_PSF_CMP,
+	[DRM_XE_GENL_SOC_ERROR_FATAL_CSC_PSF_REQ] = XE_HW_ERR_SOC_FATAL_CSC_PSF_REQ,
+	[DRM_XE_GENL_SOC_ERROR_FATAL_PUNIT] = XE_HW_ERR_SOC_FATAL_PUNIT,
+	[DRM_XE_GENL_SOC_ERROR_FATAL_PCIE_PSF_CMD] = XE_HW_ERR_SOC_FATAL_PCIE_PSF_CMD,
+	[DRM_XE_GENL_SOC_ERROR_FATAL_PCIE_PSF_CMP] = XE_HW_ERR_SOC_FATAL_PCIE_PSF_CMP,
+	[DRM_XE_GENL_SOC_ERROR_FATAL_PCIE_PSF_REQ] = XE_HW_ERR_SOC_FATAL_PCIE_PSF_REQ,
+	[DRM_XE_GENL_SOC_ERROR_FATAL_ANR_MDFI] = XE_HW_ERR_SOC_FATAL_ANR_MDFI,
+	[DRM_XE_GENL_SOC_ERROR_FATAL_MDFI_T2T] = XE_HW_ERR_SOC_FATAL_MDFI_T2T,
+	[DRM_XE_GENL_SOC_ERROR_FATAL_MDFI_T2C] = XE_HW_ERR_SOC_FATAL_MDFI_T2C,
+	[DRM_XE_GENL_SOC_ERROR_FATAL_PCIE_AER] = XE_HW_ERR_SOC_FATAL_PCIE_AER,
+	[DRM_XE_GENL_SOC_ERROR_FATAL_PCIE_ERR] = XE_HW_ERR_SOC_FATAL_PCIE_ERR,
+	[DRM_XE_GENL_SOC_ERROR_FATAL_UR_COND] = XE_HW_ERR_SOC_FATAL_UR_COND,
+	[DRM_XE_GENL_SOC_ERROR_FATAL_SERR_SRCS] = XE_HW_ERR_SOC_FATAL_SERR_SRCS,
+	[DRM_XE_GENL_SOC_ERROR_FATAL_HBM(0, 0)] = XE_HW_ERR_SOC_FATAL_HBM0_CHNL0,
+	[DRM_XE_GENL_SOC_ERROR_FATAL_HBM(0, 1)] = XE_HW_ERR_SOC_FATAL_HBM0_CHNL1,
+	[DRM_XE_GENL_SOC_ERROR_FATAL_HBM(0, 2)] = XE_HW_ERR_SOC_FATAL_HBM0_CHNL2,
+	[DRM_XE_GENL_SOC_ERROR_FATAL_HBM(0, 3)] = XE_HW_ERR_SOC_FATAL_HBM0_CHNL3,
+	[DRM_XE_GENL_SOC_ERROR_FATAL_HBM(0, 4)] = XE_HW_ERR_SOC_FATAL_HBM0_CHNL4,
+	[DRM_XE_GENL_SOC_ERROR_FATAL_HBM(0, 5)] = XE_HW_ERR_SOC_FATAL_HBM0_CHNL5,
+	[DRM_XE_GENL_SOC_ERROR_FATAL_HBM(0, 6)] = XE_HW_ERR_SOC_FATAL_HBM0_CHNL6,
+	[DRM_XE_GENL_SOC_ERROR_FATAL_HBM(0, 7)] = XE_HW_ERR_SOC_FATAL_HBM0_CHNL7,
+	[DRM_XE_GENL_SOC_ERROR_FATAL_HBM(0, 8)] = XE_HW_ERR_SOC_FATAL_HBM1_CHNL0,
+	[DRM_XE_GENL_SOC_ERROR_FATAL_HBM(0, 9)] = XE_HW_ERR_SOC_FATAL_HBM1_CHNL1,
+	[DRM_XE_GENL_SOC_ERROR_FATAL_HBM(0, 10)] = XE_HW_ERR_SOC_FATAL_HBM1_CHNL2,
+	[DRM_XE_GENL_SOC_ERROR_FATAL_HBM(0, 11)] = XE_HW_ERR_SOC_FATAL_HBM1_CHNL3,
+	[DRM_XE_GENL_SOC_ERROR_FATAL_HBM(0, 12)] = XE_HW_ERR_SOC_FATAL_HBM1_CHNL4,
+	[DRM_XE_GENL_SOC_ERROR_FATAL_HBM(0, 13)] = XE_HW_ERR_SOC_FATAL_HBM1_CHNL5,
+	[DRM_XE_GENL_SOC_ERROR_FATAL_HBM(0, 14)] = XE_HW_ERR_SOC_FATAL_HBM1_CHNL6,
+	[DRM_XE_GENL_SOC_ERROR_FATAL_HBM(0, 15)] = XE_HW_ERR_SOC_FATAL_HBM1_CHNL7,
+	[DRM_XE_GENL_SOC_ERROR_FATAL_HBM(1, 0)] = XE_HW_ERR_SOC_FATAL_HBM2_CHNL0,
+	[DRM_XE_GENL_SOC_ERROR_FATAL_HBM(1, 1)] = XE_HW_ERR_SOC_FATAL_HBM2_CHNL1,
+	[DRM_XE_GENL_SOC_ERROR_FATAL_HBM(1, 2)] = XE_HW_ERR_SOC_FATAL_HBM2_CHNL2,
+	[DRM_XE_GENL_SOC_ERROR_FATAL_HBM(1, 3)] = XE_HW_ERR_SOC_FATAL_HBM2_CHNL3,
+	[DRM_XE_GENL_SOC_ERROR_FATAL_HBM(1, 4)] = XE_HW_ERR_SOC_FATAL_HBM2_CHNL4,
+	[DRM_XE_GENL_SOC_ERROR_FATAL_HBM(1, 5)] = XE_HW_ERR_SOC_FATAL_HBM2_CHNL5,
+	[DRM_XE_GENL_SOC_ERROR_FATAL_HBM(1, 6)] = XE_HW_ERR_SOC_FATAL_HBM2_CHNL6,
+	[DRM_XE_GENL_SOC_ERROR_FATAL_HBM(1, 7)] = XE_HW_ERR_SOC_FATAL_HBM2_CHNL7,
+	[DRM_XE_GENL_SOC_ERROR_FATAL_HBM(1, 8)] = XE_HW_ERR_SOC_FATAL_HBM3_CHNL0,
+	[DRM_XE_GENL_SOC_ERROR_FATAL_HBM(1, 9)] = XE_HW_ERR_SOC_FATAL_HBM3_CHNL1,
+	[DRM_XE_GENL_SOC_ERROR_FATAL_HBM(1, 10)] = XE_HW_ERR_SOC_FATAL_HBM3_CHNL2,
+	[DRM_XE_GENL_SOC_ERROR_FATAL_HBM(1, 11)] = XE_HW_ERR_SOC_FATAL_HBM3_CHNL3,
+	[DRM_XE_GENL_SOC_ERROR_FATAL_HBM(1, 12)] = XE_HW_ERR_SOC_FATAL_HBM3_CHNL4,
+	[DRM_XE_GENL_SOC_ERROR_FATAL_HBM(1, 13)] = XE_HW_ERR_SOC_FATAL_HBM3_CHNL5,
+	[DRM_XE_GENL_SOC_ERROR_FATAL_HBM(1, 14)] = XE_HW_ERR_SOC_FATAL_HBM3_CHNL6,
+	[DRM_XE_GENL_SOC_ERROR_FATAL_HBM(1, 15)] = XE_HW_ERR_SOC_FATAL_HBM3_CHNL7,
+	[DRM_XE_GENL_GSC_ERROR_CORRECTABLE_SRAM_ECC] = XE_HW_ERR_GSC_CORR_SRAM,
+	[DRM_XE_GENL_GSC_ERROR_NONFATAL_MIA_SHUTDOWN] = XE_HW_ERR_GSC_NONFATAL_MIA_SHUTDOWN,
+	[DRM_XE_GENL_GSC_ERROR_NONFATAL_MIA_INTERNAL] = XE_HW_ERR_GSC_NONFATAL_MIA_INTERNAL,
+	[DRM_XE_GENL_GSC_ERROR_NONFATAL_SRAM_ECC] = XE_HW_ERR_GSC_NONFATAL_SRAM,
+	[DRM_XE_GENL_GSC_ERROR_NONFATAL_WDG_TIMEOUT] = XE_HW_ERR_GSC_NONFATAL_WDG,
+	[DRM_XE_GENL_GSC_ERROR_NONFATAL_ROM_PARITY] = XE_HW_ERR_GSC_NONFATAL_ROM_PARITY,
+	[DRM_XE_GENL_GSC_ERROR_NONFATAL_UCODE_PARITY] = XE_HW_ERR_GSC_NONFATAL_UCODE_PARITY,
+	[DRM_XE_GENL_GSC_ERROR_NONFATAL_VLT_GLITCH] = XE_HW_ERR_GSC_NONFATAL_VLT_GLITCH,
+	[DRM_XE_GENL_GSC_ERROR_NONFATAL_FUSE_PULL] = XE_HW_ERR_GSC_NONFATAL_FUSE_PULL,
+	[DRM_XE_GENL_GSC_ERROR_NONFATAL_FUSE_CRC_CHECK] = XE_HW_ERR_GSC_NONFATAL_FUSE_CRC,
+	[DRM_XE_GENL_GSC_ERROR_NONFATAL_SELF_MBIST] = XE_HW_ERR_GSC_NONFATAL_SELF_MBIST,
+	[DRM_XE_GENL_GSC_ERROR_NONFATAL_AON_RF_PARITY] = XE_HW_ERR_GSC_NONFATAL_AON_RF_PARITY,
+	[DRM_XE_GENL_SGGI_ERROR_NONFATAL] = XE_HW_ERR_TILE_NONFATAL_SGGI,
+	[DRM_XE_GENL_SGLI_ERROR_NONFATAL] = XE_HW_ERR_TILE_NONFATAL_SGLI,
+	[DRM_XE_GENL_SGCI_ERROR_NONFATAL] = XE_HW_ERR_TILE_NONFATAL_SGCI,
+	[DRM_XE_GENL_MERT_ERROR_NONFATAL] = XE_HW_ERR_TILE_NONFATAL_MERT,
+	[DRM_XE_GENL_SGGI_ERROR_FATAL] = XE_HW_ERR_TILE_FATAL_SGGI,
+	[DRM_XE_GENL_SGLI_ERROR_FATAL] = XE_HW_ERR_TILE_FATAL_SGLI,
+	[DRM_XE_GENL_SGCI_ERROR_FATAL] = XE_HW_ERR_TILE_FATAL_SGCI,
+	[DRM_XE_GENL_MERT_ERROR_FATAL] = XE_HW_ERR_TILE_FATAL_MERT,
+};
+
+static unsigned int config_gt_id(const u64 config)
+{
+	return config >> __XE_GENL_GT_SHIFT;
+}
+
+static u64 config_counter(const u64 config)
+{
+	return config & ~(~0ULL << __XE_GENL_GT_SHIFT);
+}
+
+static bool is_gt_error(const u64 config)
+{
+	unsigned int error;
+
+	error = config_counter(config);
+	if (error <= DRM_XE_GENL_GT_ERROR_FATAL_FPU)
+		return true;
+
+	return false;
+}
+
+static bool is_gt_vector_error(const u64 config)
+{
+	unsigned int error;
+
+	error = config_counter(config);
+	if (error >= DRM_XE_GENL_GT_ERROR_FATAL_TLB &&
+	    error <= DRM_XE_GENL_GT_ERROR_FATAL_L3BANK)
+		return true;
+
+	return false;
+}
+
+static bool is_pvc_invalid_gt_errors(const u64 config)
+{
+	switch (config_counter(config)) {
+	case DRM_XE_GENL_GT_ERROR_CORRECTABLE_L3_SNG:
+	case DRM_XE_GENL_GT_ERROR_CORRECTABLE_SAMPLER:
+	case DRM_XE_GENL_GT_ERROR_FATAL_ARR_BIST:
+	case DRM_XE_GENL_GT_ERROR_FATAL_L3_DOUB:
+	case DRM_XE_GENL_GT_ERROR_FATAL_L3_ECC_CHK:
+	case DRM_XE_GENL_GT_ERROR_FATAL_IDI_PAR:
+	case DRM_XE_GENL_GT_ERROR_FATAL_SQIDI:
+	case DRM_XE_GENL_GT_ERROR_FATAL_SAMPLER:
+	case DRM_XE_GENL_GT_ERROR_FATAL_EU_IC:
+		return true;
+	default:
+		return false;
+	}
+}
+
+static bool is_gsc_hw_error(const u64 config)
+{
+	if (config_counter(config) >= DRM_XE_GENL_GSC_ERROR_CORRECTABLE_SRAM_ECC &&
+	    config_counter(config) <= DRM_XE_GENL_GSC_ERROR_NONFATAL_AON_RF_PARITY)
+		return true;
+
+	return false;
+}
+
+static bool is_soc_error(const u64 config)
 {
+	if (config_counter(config) >= DRM_XE_GENL_SOC_ERROR_NONFATAL_CSC_PSF_CMD &&
+	    config_counter(config) <= DRM_XE_GENL_SOC_ERROR_FATAL_HBM(1, 15))
+		return true;
+
+	return false;
+}
+
+static int
+config_status(struct xe_device *xe, u64 config)
+{
+	unsigned int gt_id = config_gt_id(config);
+	struct xe_gt *gt = xe_device_get_gt(xe, gt_id);
+
+	if (!IS_DGFX(xe))
+		return -ENODEV;
+
+	if (gt->info.type == XE_GT_TYPE_UNINITIALIZED)
+		return -ENOENT;
+
+	/* GSC HW ERRORS are present on root tile of
+	 * platform supporting MEMORY SPARING only
+	 */
+	if (is_gsc_hw_error(config) && !(xe->info.platform == XE_PVC && !gt_id))
+		return -ENODEV;
+
+	/* GT vector errors are valid only on platforms supporting error vectors */
+	if (is_gt_vector_error(config) && xe->info.platform != XE_PVC)
+		return -ENODEV;
+
+	/* Skip gt errors not supported on pvc */
+	if (is_pvc_invalid_gt_errors(config) && xe->info.platform == XE_PVC)
+		return -ENODEV;
+
+	/* FATAL FPU error is valid on PVC only */
+	if (config_counter(config) == DRM_XE_GENL_GT_ERROR_FATAL_FPU &&
+	    !(xe->info.platform == XE_PVC))
+		return -ENODEV;
+
+	if (is_soc_error(config) && !(xe->info.platform == XE_PVC))
+		return -ENODEV;
+
+	return (config_counter(config) >=
+			ARRAY_SIZE(xe_hw_error_map)) ? -ENOENT : 0;
+}
+
+static u64 get_counter_value(struct xe_device *xe, u64 config)
+{
+	const unsigned int gt_id = config_gt_id(config);
+	struct xe_gt *gt = xe_device_get_gt(xe, gt_id);
+	unsigned int id = config_counter(config);
+
+	if (is_gt_error(config) || is_gt_vector_error(config))
+		return xa_to_value(xa_load(&gt->errors.hw_error, xe_hw_error_map[id]));
+
+	return xa_to_value(xa_load(&gt->tile->errors.hw_error, xe_hw_error_map[id]));
+}
+
+static int fill_error_details(struct xe_device *xe, struct genl_info *info, struct sk_buff *new_msg)
+{
+	struct nlattr *entry_attr;
+	bool counter = false;
+	struct xe_gt *gt;
+	int i, j;
+
+	BUILD_BUG_ON(ARRAY_SIZE(xe_hw_error_events) !=
+		     ARRAY_SIZE(xe_hw_error_map));
+
+	if (info->genlhdr->cmd == DRM_RAS_CMD_READ_ALL)
+		counter = true;
+
+	entry_attr = nla_nest_start(new_msg, DRM_RAS_ATTR_QUERY_REPLY);
+	if (!entry_attr)
+		return -EMSGSIZE;
+
+	for_each_gt(gt, xe, j) {
+		char str[MAX_ERROR_NAME];
+		u64 val;
+
+		for (i = 0; i < ARRAY_SIZE(xe_hw_error_events); i++) {
+			u64 config = DRM_XE_HW_ERROR(j, i);
+
+			if (config_status(xe, config))
+				continue;
+
+			/* should this be cleared every time? */
+			snprintf(str, sizeof(str), "error-gt%d-%s", j, xe_hw_error_events[i]);
+
+			if (nla_put_string(new_msg, DRM_RAS_ATTR_ERROR_NAME, str))
+				goto err;
+			if (nla_put_u64_64bit(new_msg, DRM_RAS_ATTR_ERROR_ID, config, DRM_ATTR_PAD))
+				goto err;
+			if (counter) {
+				val = get_counter_value(xe, config);
+				if (nla_put_u64_64bit(new_msg, DRM_RAS_ATTR_ERROR_VALUE, val,
+						      DRM_ATTR_PAD))
+					goto err;
+			}
+		}
+	}
+
+	nla_nest_end(new_msg, entry_attr);
+
 	return 0;
+err:
+	drm_dbg_driver(&xe->drm, "msg buffer is too small\n");
+	nla_nest_cancel(new_msg, entry_attr);
+	nlmsg_free(new_msg);
+
+	return -EMSGSIZE;
+}
+
+static int xe_genl_list_errors(struct drm_device *drm, struct sk_buff *msg, struct genl_info *info)
+{
+	struct xe_device *xe = to_xe_device(drm);
+	size_t msg_size = NLMSG_DEFAULT_SIZE;
+	enum drm_cmd_request_type query_type;
+	struct sk_buff *new_msg;
+	int retries = 2;
+	void *usrhdr;
+	int ret = 0;
+
+	if (!IS_DGFX(xe))
+		return -ENODEV;
+
+	/* Only verbose queries are supported */
+	query_type = nla_get_u8(info->attrs[DRM_RAS_ATTR_QUERY]);
+	if (query_type == DRM_RAS_CMD_QUERY_NORMAL)
+		return -EOPNOTSUPP;
+
+	do {
+		new_msg = drm_genl_alloc_msg(drm, info, msg_size, &usrhdr);
+		if (!new_msg)
+			return -ENOMEM;
+
+		ret = fill_error_details(xe, info, new_msg);
+		if (!ret)
+			break;
+
+		msg_size += NLMSG_DEFAULT_SIZE;
+	} while (retries--);
+
+	if (!ret)
+		ret = drm_genl_reply(new_msg, info, usrhdr);
+
+	return ret;
 }
 
 static int xe_genl_read_error(struct drm_device *drm, struct sk_buff *msg, struct genl_info *info)
 {
-	return 0;
+	struct xe_device *xe = to_xe_device(drm);
+	size_t msg_size = NLMSG_DEFAULT_SIZE;
+	struct sk_buff *new_msg;
+	void *usrhdr;
+	int ret = 0;
+	int retries = 2;
+	u64 config, val;
+
+	if (info->genlhdr->cmd == DRM_RAS_CMD_READ_BLOCK)
+		return -EOPNOTSUPP;
+
+	config = nla_get_u64(info->attrs[DRM_RAS_ATTR_ERROR_ID]);
+	ret = config_status(xe, config);
+	if (ret)
+		return ret;
+	do {
+		new_msg = drm_genl_alloc_msg(drm, info, msg_size, &usrhdr);
+		if (!new_msg)
+			return -ENOMEM;
+
+		val = get_counter_value(xe, config);
+		if (nla_put_u64_64bit(new_msg, DRM_RAS_ATTR_ERROR_VALUE, val, DRM_ATTR_PAD)) {
+			msg_size += NLMSG_DEFAULT_SIZE;
+			continue;
+		}
+
+		break;
+	} while (retries--);
+
+	ret = drm_genl_reply(new_msg, info, usrhdr);
+
+	return ret;
 }
 
 /* driver callbacks to DRM netlink commands*/
diff --git a/include/uapi/drm/xe_drm.h b/include/uapi/drm/xe_drm.h
index e2426413488f..d352a96e4826 100644
--- a/include/uapi/drm/xe_drm.h
+++ b/include/uapi/drm/xe_drm.h
@@ -1974,6 +1974,91 @@ struct drm_xe_query_eu_stall {
 	__u64 sampling_rates[];
 };
 
+/*
+ * Top bits of every counter are GT id.
+ */
+#define __XE_GENL_GT_SHIFT	(56)
+/**
+ * DOC: XE GENL netlink event IDs
+ * TODO: Add more details
+ */
+#define DRM_XE_HW_ERROR(gt, id) \
+	((id) | ((__u64)(gt) << __XE_GENL_GT_SHIFT))
+
+#define DRM_XE_GENL_GT_ERROR_CORRECTABLE_L3_SNG		(0)
+#define DRM_XE_GENL_GT_ERROR_CORRECTABLE_GUC			(1)
+#define DRM_XE_GENL_GT_ERROR_CORRECTABLE_SAMPLER		(2)
+#define DRM_XE_GENL_GT_ERROR_CORRECTABLE_SLM			(3)
+#define DRM_XE_GENL_GT_ERROR_CORRECTABLE_EU_IC		(4)
+#define DRM_XE_GENL_GT_ERROR_CORRECTABLE_EU_GRF		(5)
+#define DRM_XE_GENL_GT_ERROR_FATAL_ARR_BIST			(6)
+#define DRM_XE_GENL_GT_ERROR_FATAL_L3_DOUB			(7)
+#define DRM_XE_GENL_GT_ERROR_FATAL_L3_ECC_CHK		(8)
+#define DRM_XE_GENL_GT_ERROR_FATAL_GUC			(9)
+#define DRM_XE_GENL_GT_ERROR_FATAL_IDI_PAR			(10)
+#define DRM_XE_GENL_GT_ERROR_FATAL_SQIDI			(11)
+#define DRM_XE_GENL_GT_ERROR_FATAL_SAMPLER			(12)
+#define DRM_XE_GENL_GT_ERROR_FATAL_SLM			(13)
+#define DRM_XE_GENL_GT_ERROR_FATAL_EU_IC			(14)
+#define DRM_XE_GENL_GT_ERROR_FATAL_EU_GRF			(15)
+#define DRM_XE_GENL_GT_ERROR_FATAL_FPU			(16)
+#define DRM_XE_GENL_GT_ERROR_FATAL_TLB			(17)
+#define DRM_XE_GENL_GT_ERROR_FATAL_L3_FABRIC			(18)
+#define DRM_XE_GENL_GT_ERROR_CORRECTABLE_SUBSLICE		(19)
+#define DRM_XE_GENL_GT_ERROR_CORRECTABLE_L3BANK		(20)
+#define DRM_XE_GENL_GT_ERROR_FATAL_SUBSLICE			(21)
+#define DRM_XE_GENL_GT_ERROR_FATAL_L3BANK			(22)
+#define DRM_XE_GENL_SGUNIT_ERROR_CORRECTABLE			(23)
+#define DRM_XE_GENL_SGUNIT_ERROR_NONFATAL			(24)
+#define DRM_XE_GENL_SGUNIT_ERROR_FATAL			(25)
+#define DRM_XE_GENL_SOC_ERROR_NONFATAL_CSC_PSF_CMD		(26)
+#define DRM_XE_GENL_SOC_ERROR_NONFATAL_CSC_PSF_CMP		(27)
+#define DRM_XE_GENL_SOC_ERROR_NONFATAL_CSC_PSF_REQ		(28)
+#define DRM_XE_GENL_SOC_ERROR_NONFATAL_ANR_MDFI		(29)
+#define DRM_XE_GENL_SOC_ERROR_NONFATAL_MDFI_T2T		(30)
+#define DRM_XE_GENL_SOC_ERROR_NONFATAL_MDFI_T2C		(31)
+#define DRM_XE_GENL_SOC_ERROR_FATAL_CSC_PSF_CMD		(32)
+#define DRM_XE_GENL_SOC_ERROR_FATAL_CSC_PSF_CMP		(33)
+#define DRM_XE_GENL_SOC_ERROR_FATAL_CSC_PSF_REQ		(34)
+#define DRM_XE_GENL_SOC_ERROR_FATAL_PUNIT			(35)
+#define DRM_XE_GENL_SOC_ERROR_FATAL_PCIE_PSF_CMD		(36)
+#define DRM_XE_GENL_SOC_ERROR_FATAL_PCIE_PSF_CMP		(37)
+#define DRM_XE_GENL_SOC_ERROR_FATAL_PCIE_PSF_REQ		(38)
+#define DRM_XE_GENL_SOC_ERROR_FATAL_ANR_MDFI			(39)
+#define DRM_XE_GENL_SOC_ERROR_FATAL_MDFI_T2T			(40)
+#define DRM_XE_GENL_SOC_ERROR_FATAL_MDFI_T2C			(41)
+#define DRM_XE_GENL_SOC_ERROR_FATAL_PCIE_AER			(42)
+#define DRM_XE_GENL_SOC_ERROR_FATAL_PCIE_ERR			(43)
+#define DRM_XE_GENL_SOC_ERROR_FATAL_UR_COND			(44)
+#define DRM_XE_GENL_SOC_ERROR_FATAL_SERR_SRCS		(45)
+
+#define DRM_XE_GENL_SOC_ERROR_NONFATAL_HBM(ss, n)\
+		(DRM_XE_GENL_SOC_ERROR_FATAL_SERR_SRCS + 0x1 + (ss) * 0x10 + (n))
+#define DRM_XE_GENL_SOC_ERROR_FATAL_HBM(ss, n)\
+		(DRM_XE_GENL_SOC_ERROR_NONFATAL_HBM(1, 15) + 0x1 + (ss) * 0x10 + (n))
+
+/* 109 is the last ID used by SOC errors */
+#define DRM_XE_GENL_GSC_ERROR_CORRECTABLE_SRAM_ECC		(110)
+#define DRM_XE_GENL_GSC_ERROR_NONFATAL_MIA_SHUTDOWN		(111)
+#define DRM_XE_GENL_GSC_ERROR_NONFATAL_MIA_INTERNAL		(112)
+#define DRM_XE_GENL_GSC_ERROR_NONFATAL_SRAM_ECC		(113)
+#define DRM_XE_GENL_GSC_ERROR_NONFATAL_WDG_TIMEOUT		(114)
+#define DRM_XE_GENL_GSC_ERROR_NONFATAL_ROM_PARITY		(115)
+#define DRM_XE_GENL_GSC_ERROR_NONFATAL_UCODE_PARITY		(116)
+#define DRM_XE_GENL_GSC_ERROR_NONFATAL_VLT_GLITCH		(117)
+#define DRM_XE_GENL_GSC_ERROR_NONFATAL_FUSE_PULL		(118)
+#define DRM_XE_GENL_GSC_ERROR_NONFATAL_FUSE_CRC_CHECK	(119)
+#define DRM_XE_GENL_GSC_ERROR_NONFATAL_SELF_MBIST		(120)
+#define DRM_XE_GENL_GSC_ERROR_NONFATAL_AON_RF_PARITY		(121)
+#define DRM_XE_GENL_SGGI_ERROR_NONFATAL			(122)
+#define DRM_XE_GENL_SGLI_ERROR_NONFATAL			(123)
+#define DRM_XE_GENL_SGCI_ERROR_NONFATAL			(124)
+#define DRM_XE_GENL_MERT_ERROR_NONFATAL			(125)
+#define DRM_XE_GENL_SGGI_ERROR_FATAL				(126)
+#define DRM_XE_GENL_SGLI_ERROR_FATAL				(127)
+#define DRM_XE_GENL_SGCI_ERROR_FATAL				(128)
+#define DRM_XE_GENL_MERT_ERROR_FATAL				(129)
+
 #if defined(__cplusplus)
 }
 #endif
-- 
2.25.1
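
For readers following the ID layout in the uapi header above, here is a
minimal userspace sketch of how a 64-bit error ID is packed and unpacked,
and how the HBM macros land on IDs 46..109. The shift, the macro bodies and
the value 45 are copied from the patch; the helper names hw_error_config(),
hbm_stack() and hbm_channel() are hypothetical, and the stack/channel split
is only inferred from the xe_hw_error_map entries (for example HBM(0, 12)
maps to HBM1_CHNL4).

```c
#include <assert.h>
#include <stdint.h>

/* Same value as __XE_GENL_GT_SHIFT in the patch: GT id sits in the top 8 bits. */
#define XE_GENL_GT_SHIFT 56

/* Same value as DRM_XE_GENL_SOC_ERROR_FATAL_SERR_SRCS in the patch. */
#define SOC_ERROR_FATAL_SERR_SRCS 45

/* Same arithmetic as the DRM_XE_GENL_SOC_ERROR_*_HBM(ss, n) macros. */
#define SOC_ERROR_NONFATAL_HBM(ss, n) \
	(SOC_ERROR_FATAL_SERR_SRCS + 0x1 + (ss) * 0x10 + (n))
#define SOC_ERROR_FATAL_HBM(ss, n) \
	(SOC_ERROR_NONFATAL_HBM(1, 15) + 0x1 + (ss) * 0x10 + (n))

/* The two HBM ranges are contiguous and end at 109, as the header comment says. */
_Static_assert(SOC_ERROR_NONFATAL_HBM(0, 0) == 46, "first non-fatal HBM id");
_Static_assert(SOC_ERROR_NONFATAL_HBM(1, 15) == 77, "last non-fatal HBM id");
_Static_assert(SOC_ERROR_FATAL_HBM(0, 0) == 78, "first fatal HBM id");
_Static_assert(SOC_ERROR_FATAL_HBM(1, 15) == 109, "last fatal HBM id");

/* Hypothetical helper with the same layout as DRM_XE_HW_ERROR(gt, id). */
static inline uint64_t hw_error_config(unsigned int gt, uint64_t id)
{
	return id | ((uint64_t)gt << XE_GENL_GT_SHIFT);
}

/* Same logic as config_gt_id() in the patch. */
static inline unsigned int config_gt_id(uint64_t config)
{
	return (unsigned int)(config >> XE_GENL_GT_SHIFT);
}

/* Same logic as config_counter() in the patch: mask off the GT bits. */
static inline uint64_t config_counter(uint64_t config)
{
	return config & ~(~0ULL << XE_GENL_GT_SHIFT);
}

/* Hypothetical: HBM stack number inferred from the map (ss selects a pair). */
static inline unsigned int hbm_stack(unsigned int ss, unsigned int n)
{
	assert(ss < 2 && n < 16);
	return ss * 2 + n / 8;
}

/* Hypothetical: channel within the stack, inferred from the map. */
static inline unsigned int hbm_channel(unsigned int n)
{
	assert(n < 16);
	return n % 8;
}
```

With this layout, hw_error_config(1, 16) identifies the FATAL_FPU counter
(ID 16) on GT 1, and decoding it with config_gt_id() and config_counter()
gives back 1 and 16.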



* [RFC v5 4/5] drm/netlink: Define multicast groups
  2025-07-30  6:49 [RFC v5 0/5] Proposal to use netlink for RAS and Telemetry across drm subsystem Aravind Iddamsetty
                   ` (2 preceding siblings ...)
  2025-07-30  6:49 ` [RFC v5 3/5] drm/xe/RAS: Expose the error counters Aravind Iddamsetty
@ 2025-07-30  6:49 ` Aravind Iddamsetty
  2025-08-15 22:00   ` Rodrigo Vivi
  2025-07-30  6:49 ` [RFC v5 5/5] drm/xe/RAS: send multicast event on occurrence of an error Aravind Iddamsetty
                   ` (2 subsequent siblings)
  6 siblings, 1 reply; 24+ messages in thread
From: Aravind Iddamsetty @ 2025-07-30  6:49 UTC (permalink / raw)
  To: intel-xe, dri-devel
  Cc: Alex Deucher, David Airlie, Simona Vetter, Joonas Lahtinen,
	Rodrigo Vivi, Hawking Zhang, Lijo Lazar, Ruhl, Michael J,
	Riana Tauro, Anshuman Gupta

The netlink subsystem supports event notifications to userspace. We define
two multicast groups for correctable and uncorrectable errors to which
userspace can subscribe and be notified when any of those errors happen.
The group names are local to the driver's genl netlink family.

Reviewed-by: Michael J. Ruhl <michael.j.ruhl@intel.com>
Signed-off-by: Aravind Iddamsetty <aravind.iddamsetty@linux.intel.com>
---
 drivers/gpu/drm/drm_netlink.c  | 7 +++++++
 include/drm/drm_netlink.h      | 5 +++++
 include/uapi/drm/drm_netlink.h | 4 ++++
 3 files changed, 16 insertions(+)

diff --git a/drivers/gpu/drm/drm_netlink.c b/drivers/gpu/drm/drm_netlink.c
index da4bfde32a22..a7c0a4401ca9 100644
--- a/drivers/gpu/drm/drm_netlink.c
+++ b/drivers/gpu/drm/drm_netlink.c
@@ -15,6 +15,11 @@
 
 DEFINE_XARRAY(drm_dev_xarray);
 
+static const struct genl_multicast_group drm_event_mcgrps[] = {
+	[DRM_GENL_MCAST_CORR_ERR] = { .name = DRM_GENL_MCAST_GROUP_NAME_CORR_ERR, },
+	[DRM_GENL_MCAST_UNCORR_ERR] = { .name = DRM_GENL_MCAST_GROUP_NAME_UNCORR_ERR, },
+};
+
 /**
  * drm_genl_reply - response to a request
  * @msg: socket buffer
@@ -156,6 +161,8 @@ static void drm_genl_family_init(struct drm_device *dev)
 	dev->drm_genl_family->ops = drm_genl_ops;
 	dev->drm_genl_family->n_ops = ARRAY_SIZE(drm_genl_ops);
 	dev->drm_genl_family->maxattr = DRM_ATTR_MAX;
+	dev->drm_genl_family->mcgrps = drm_event_mcgrps;
+	dev->drm_genl_family->n_mcgrps = ARRAY_SIZE(drm_event_mcgrps);
 	dev->drm_genl_family->module = dev->dev->driver->owner;
 }
 
diff --git a/include/drm/drm_netlink.h b/include/drm/drm_netlink.h
index 4a746222337a..9e48147d0d36 100644
--- a/include/drm/drm_netlink.h
+++ b/include/drm/drm_netlink.h
@@ -12,6 +12,11 @@ struct drm_device;
 struct genl_info;
 struct sk_buff;
 
+enum mcgrps_events {
+	DRM_GENL_MCAST_CORR_ERR,
+	DRM_GENL_MCAST_UNCORR_ERR,
+};
+
 struct driver_genl_ops {
 	int		       (*doit)(struct drm_device *dev,
 				       struct sk_buff *skb,
diff --git a/include/uapi/drm/drm_netlink.h b/include/uapi/drm/drm_netlink.h
index 58afb6e8d84a..c978efaab124 100644
--- a/include/uapi/drm/drm_netlink.h
+++ b/include/uapi/drm/drm_netlink.h
@@ -26,6 +26,8 @@
 #define _DRM_NETLINK_H_
 
 #define DRM_GENL_VERSION 1
+#define DRM_GENL_MCAST_GROUP_NAME_CORR_ERR	"drm_corr_err"
+#define DRM_GENL_MCAST_GROUP_NAME_UNCORR_ERR	"drm_uncorr_err"
 
 #if defined(__cplusplus)
 extern "C" {
@@ -50,6 +52,8 @@ enum drm_genl_error_cmds {
 	DRM_RAS_CMD_READ_BLOCK,
 	/** @DRM_RAS_CMD_READ_ALL: Command to get counters of all errors */
 	DRM_RAS_CMD_READ_ALL,
+	/** @DRM_RAS_CMD_ERROR_EVENT: Command sent as part of multicast event */
+	DRM_RAS_CMD_ERROR_EVENT,
 
 	__DRM_CMD_MAX,
 	DRM_CMD_MAX = __DRM_CMD_MAX - 1,
-- 
2.25.1
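
Since the group names are local to the family, a listener first has to ask
the generic netlink controller for the family and its multicast groups. Below
is a rough sketch, using only the uapi headers, of building that
CTRL_CMD_GETFAMILY request. build_getfamily_req() is a hypothetical helper
and the family name passed to it is up to the caller; this chunk of the
series does not show what name a device registers. Sending the request,
parsing CTRL_ATTR_MCAST_GROUPS from the reply and joining a group with
setsockopt(NETLINK_ADD_MEMBERSHIP) are left out.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>
#include <sys/socket.h>
#include <linux/netlink.h>
#include <linux/genetlink.h>

/*
 * Hypothetical helper: write a CTRL_CMD_GETFAMILY request for family_name
 * into buf. Returns the total message length, or 0 if buf is too small.
 * buf must be suitably aligned for struct nlmsghdr.
 */
static inline size_t build_getfamily_req(void *buf, size_t buflen,
					 const char *family_name)
{
	size_t name_len, attr_len, total;
	struct nlmsghdr *nlh = buf;
	struct genlmsghdr *genl;
	struct nlattr *nla;

	assert(buf && family_name);

	name_len = strlen(family_name) + 1;	/* string attrs carry the NUL */
	attr_len = NLA_HDRLEN + name_len;
	total = NLMSG_LENGTH(GENL_HDRLEN) + NLA_ALIGN(attr_len);
	if (buflen < total)
		return 0;

	memset(buf, 0, total);

	/* netlink header: addressed to the generic netlink controller */
	nlh->nlmsg_len = (__u32)total;
	nlh->nlmsg_type = GENL_ID_CTRL;
	nlh->nlmsg_flags = NLM_F_REQUEST;

	/* generic netlink header */
	genl = (struct genlmsghdr *)((char *)nlh + NLMSG_HDRLEN);
	genl->cmd = CTRL_CMD_GETFAMILY;
	genl->version = 1;

	/* single attribute: the family name to look up */
	nla = (struct nlattr *)((char *)genl + GENL_HDRLEN);
	nla->nla_type = CTRL_ATTR_FAMILY_NAME;
	nla->nla_len = (__u16)attr_len;
	memcpy((char *)nla + NLA_HDRLEN, family_name, name_len);

	return total;
}
```

The reply to this request lists each group name ("drm_corr_err",
"drm_uncorr_err") together with the numeric group ID that the kernel
assigned, which is what the socket option expects.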



* [RFC v5 5/5] drm/xe/RAS: send multicast event on occurrence of an error
  2025-07-30  6:49 [RFC v5 0/5] Proposal to use netlink for RAS and Telemetry across drm subsystem Aravind Iddamsetty
                   ` (3 preceding siblings ...)
  2025-07-30  6:49 ` [RFC v5 4/5] drm/netlink: Define multicast groups Aravind Iddamsetty
@ 2025-07-30  6:49 ` Aravind Iddamsetty
  2025-08-15 22:01   ` Rodrigo Vivi
  2025-07-30 21:00 ` [RFC v5 0/5] Proposal to use netlink for RAS and Telemetry across drm subsystem Lukas Wunner
  2025-08-13 20:21 ` Rodrigo Vivi
  6 siblings, 1 reply; 24+ messages in thread
From: Aravind Iddamsetty @ 2025-07-30  6:49 UTC (permalink / raw)
  To: intel-xe, dri-devel
  Cc: Alex Deucher, David Airlie, Simona Vetter, Joonas Lahtinen,
	Rodrigo Vivi, Hawking Zhang, Lijo Lazar, Ruhl, Michael J,
	Riana Tauro, Anshuman Gupta

Whenever a correctable or uncorrectable error occurs, an event is sent
to the listeners of the corresponding multicast group.

v2: Rebase
v3: protect with CONFIG_NET define.

Reviewed-by: Michael J. Ruhl <michael.j.ruhl@intel.com> #v2
Signed-off-by: Aravind Iddamsetty <aravind.iddamsetty@linux.intel.com>
---
 drivers/gpu/drm/xe/xe_hw_error.c | 41 ++++++++++++++++++++++++++++++++
 1 file changed, 41 insertions(+)

diff --git a/drivers/gpu/drm/xe/xe_hw_error.c b/drivers/gpu/drm/xe/xe_hw_error.c
index bdd9c88674b2..e6e2e6250b70 100644
--- a/drivers/gpu/drm/xe/xe_hw_error.c
+++ b/drivers/gpu/drm/xe/xe_hw_error.c
@@ -2,6 +2,8 @@
 /*
  * Copyright © 2023 Intel Corporation
  */
+#include <net/genetlink.h>
+#include <uapi/drm/drm_netlink.h>
 
 #include "xe_gt_printk.h"
 #include "xe_hw_error.h"
@@ -776,6 +778,43 @@ xe_soc_hw_error_handler(struct xe_tile *tile, const enum hardware_error hw_err)
 				(HARDWARE_ERROR_MAX << 1) + 1);
 }
 
+#ifdef CONFIG_NET
+static void
+generate_netlink_event(struct xe_device *xe, const enum hardware_error hw_err)
+{
+	struct sk_buff *msg;
+	void *hdr;
+
+	if (!xe->drm.drm_genl_family)
+		return;
+
+	msg = nlmsg_new(NLMSG_DEFAULT_SIZE, GFP_ATOMIC);
+	if (!msg) {
+		drm_dbg_driver(&xe->drm, "couldn't allocate memory for error multicast event\n");
+		return;
+	}
+
+	hdr = genlmsg_put(msg, 0, 0, xe->drm.drm_genl_family, 0, DRM_RAS_CMD_ERROR_EVENT);
+	if (!hdr) {
+		drm_dbg_driver(&xe->drm, "multicast msg buffer is small\n");
+		nlmsg_free(msg);
+		return;
+	}
+
+	genlmsg_end(msg, hdr);
+
+	genlmsg_multicast(xe->drm.drm_genl_family, msg, 0,
+			  hw_err ?
+			  DRM_GENL_MCAST_UNCORR_ERR
+			  : DRM_GENL_MCAST_CORR_ERR,
+			  GFP_ATOMIC);
+}
+#else
+static void
+generate_netlink_event(struct xe_device *xe, const enum hardware_error hw_err)
+{}
+#endif
+
 static void
 xe_hw_error_source_handler(struct xe_tile *tile, const enum hardware_error hw_err)
 {
@@ -837,6 +876,8 @@ xe_hw_error_source_handler(struct xe_tile *tile, const enum hardware_error hw_er
 	}
 
 	xe_mmio_write32(&tile->mmio, DEV_ERR_STAT_REG(hw_err), errsrc);
+
+	generate_netlink_event(tile_to_xe(tile), hw_err);
 unlock:
 	spin_unlock_irqrestore(&tile_to_xe(tile)->irq.lock, flags);
 }
-- 
2.25.1



* Re: [RFC v5 0/5] Proposal to use netlink for RAS and Telemetry across drm subsystem
  2025-07-30  6:49 [RFC v5 0/5] Proposal to use netlink for RAS and Telemetry across drm subsystem Aravind Iddamsetty
                   ` (4 preceding siblings ...)
  2025-07-30  6:49 ` [RFC v5 5/5] drm/xe/RAS: send multicast event on occurrence of an error Aravind Iddamsetty
@ 2025-07-30 21:00 ` Lukas Wunner
  2025-07-31 15:30   ` Aravind Iddamsetty
  2025-08-13 20:21 ` Rodrigo Vivi
  6 siblings, 1 reply; 24+ messages in thread
From: Lukas Wunner @ 2025-07-30 21:00 UTC (permalink / raw)
  To: Aravind Iddamsetty
  Cc: intel-xe, dri-devel, Alex Deucher, David Airlie, Simona Vetter,
	Joonas Lahtinen, Rodrigo Vivi, Hawking Zhang, Lijo Lazar, Ruhl,
	Michael J, Riana Tauro, Anshuman Gupta

On Wed, Jul 30, 2025 at 12:19:51PM +0530, Aravind Iddamsetty wrote:
> Our hardware supports RAS(Reliability, Availability, Serviceability) by
> reporting the errors to the host, which the KMD processes and exposes a
> set of error counters which can be used by observability tools to take 
> corrective actions or repairs. Traditionally there were being exposed 
> via PMU (for relative counters) and sysfs interface (for absolute 
> value) in our internal branch. But, due to the limitations in this 
> approach to use two interfaces and also not able to have an event based 
> reporting or configurability, an alternative approach to try netlink 
> was suggested by community for drm subsystem wide UAPI for RAS and 
> telemetry as discussed in [2]. 
> 
> This [2] is the inspiration to this series. It uses the generic
> netlink(genl) family subsystem and exposes a set of commands that can
> be used by every drm driver, the framework provides a means to have
> custom commands too.

It seems this series was originally conceived in 2023.  In the meantime,
tooling has been introduced to auto-generate all the netlink boilerplate
code from a YAML description in Documentation/netlink/specs/.  I *think*
using it is mandatory for all newly introduced Netlink protocols.

Basically you create the uapi and kernel header files plus kernel source
like this:

tools/net/ynl/pyynl/ynl_gen_c.py --spec Documentation/netlink/specs/drm.yaml \
  --mode uapi --header
tools/net/ynl/pyynl/ynl_gen_c.py --spec Documentation/netlink/specs/drm.yaml \
  --mode kernel --header
tools/net/ynl/pyynl/ynl_gen_c.py --spec Documentation/netlink/specs/drm.yaml \
  --mode kernel --source

And then you add both the YAML file as well as the generated files to
the commit.  The reason you have to do that is because Python is
optional for building the kernel per Documentation/process/changes.rst,
so the files cannot be generated at compile time.  It is possible though
to regenerate them with tools/net/ynl/ynl-regen.sh whenever the YAML file
is changed.

ynl_gen_c.py is capable of auto-generating code for user space applications
as well.  And there's tools/net/ynl/pyynl/cli.py to listen to events or
send requests without having to write any code.

Thanks,

Lukas


* Re: [RFC v5 0/5] Proposal to use netlink for RAS and Telemetry across drm subsystem
  2025-07-30 21:00 ` [RFC v5 0/5] Proposal to use netlink for RAS and Telemetry across drm subsystem Lukas Wunner
@ 2025-07-31 15:30   ` Aravind Iddamsetty
  0 siblings, 0 replies; 24+ messages in thread
From: Aravind Iddamsetty @ 2025-07-31 15:30 UTC (permalink / raw)
  To: Lukas Wunner
  Cc: intel-xe, dri-devel, Alex Deucher, David Airlie, Simona Vetter,
	Joonas Lahtinen, Rodrigo Vivi, Hawking Zhang, Lijo Lazar,
	Michael J, Riana Tauro, Anshuman Gupta


On 31-07-2025 02:30, Lukas Wunner wrote:
> On Wed, Jul 30, 2025 at 12:19:51PM +0530, Aravind Iddamsetty wrote:
>> Our hardware supports RAS(Reliability, Availability, Serviceability) by
>> reporting the errors to the host, which the KMD processes and exposes a
>> set of error counters which can be used by observability tools to take 
>> corrective actions or repairs. Traditionally there were being exposed 
>> via PMU (for relative counters) and sysfs interface (for absolute 
>> value) in our internal branch. But, due to the limitations in this 
>> approach to use two interfaces and also not able to have an event based 
>> reporting or configurability, an alternative approach to try netlink 
>> was suggested by community for drm subsystem wide UAPI for RAS and 
>> telemetry as discussed in [2]. 
>>
>> This [2] is the inspiration to this series. It uses the generic
>> netlink(genl) family subsystem and exposes a set of commands that can
>> be used by every drm driver, the framework provides a means to have
>> custom commands too.
> It seems this series was originally conceived in 2023.  In the meantime,
> tooling has been introduced to auto-generate all the netlink boilerplate
> code from a YAML description in Documentation/netlink/specs/.  I *think*
> using it is mandatory for all newly introduced Netlink protocols.

Thanks, Lukas, for letting me know. I will make the necessary changes.

Regards,
Aravind.
>
> Basically you create the uapi and kernel header files plus kernel source
> like this:
>
> tools/net/ynl/pyynl/ynl_gen_c.py --spec Documentation/netlink/specs/drm.yaml \
>   --mode uapi --header
> tools/net/ynl/pyynl/ynl_gen_c.py --spec Documentation/netlink/specs/drm.yaml \
>   --mode kernel --header
> tools/net/ynl/pyynl/ynl_gen_c.py --spec Documentation/netlink/specs/drm.yaml \
>   --mode kernel --source
>
> And then you add both the YAML file as well as the generated files to
> the commit.  The reason you have to do that is because Python is
> optional for building the kernel per Documentation/process/changes.rst,
> so the files cannot be generated at compile time.  It is possible though
> to regenerate them with tools/net/ynl/ynl-regen.sh whenever the YAML file
> is changed.
>
> ynl_gen_c.py is capable of auto-generating code for user space applications
> as well.  And there's tools/net/ynl/pyynl/cli.py to listen to events or
> send requests without having to write any code.
>
> Thanks,
>
> Lukas


* Re: [RFC v5 0/5] Proposal to use netlink for RAS and Telemetry across drm subsystem
  2025-07-30  6:49 [RFC v5 0/5] Proposal to use netlink for RAS and Telemetry across drm subsystem Aravind Iddamsetty
                   ` (5 preceding siblings ...)
  2025-07-30 21:00 ` [RFC v5 0/5] Proposal to use netlink for RAS and Telemetry across drm subsystem Lukas Wunner
@ 2025-08-13 20:21 ` Rodrigo Vivi
  2025-08-15 21:24   ` Rodrigo Vivi
  2025-08-25  9:38   ` Aravind Iddamsetty
  6 siblings, 2 replies; 24+ messages in thread
From: Rodrigo Vivi @ 2025-08-13 20:21 UTC (permalink / raw)
  To: Aravind Iddamsetty, Dave Airlie, Joonas Lahtinen
  Cc: intel-xe, dri-devel, Alex Deucher, David Airlie, Simona Vetter,
	Joonas Lahtinen, Hawking Zhang, Lijo Lazar, Ruhl, Michael J,
	Riana Tauro, Anshuman Gupta

On Wed, Jul 30, 2025 at 12:19:51PM +0530, Aravind Iddamsetty wrote:
> Revisiting this patch series to address pending feedback and help move
> the discussion towards a conclusion. This revision includes updates
> based on previous comments[1] and aims to clarify outstanding concerns.
> Specifically added command to facility reporting errors from IP blocks
> to support AMDGPU driver model of RAS.
> [1]: https://lore.kernel.org/all/4cbdfcc5-5020-a942-740e-a602d4c00cc2@linux.intel.com/
> 
> I sincerely appreciate everyones patience and thoughtful reviews so
> far, and I hope this refreshed series facilitates the final evaluation
> and acceptance.
> 
> Please feel free to share any further suggestions or questions.
> 
> Thank you for your continued consideration.
> ----------------------------------------------------------------------
> 
> Our hardware supports RAS(Reliability, Availability, Serviceability) by
> reporting the errors to the host, which the KMD processes and exposes a
> set of error counters which can be used by observability tools to take 
> corrective actions or repairs. Traditionally there were being exposed 
> via PMU (for relative counters) and sysfs interface (for absolute 
> value) in our internal branch. But, due to the limitations in this 
> approach to use two interfaces and also not able to have an event based 
> reporting or configurability, an alternative approach to try netlink 
> was suggested by community for drm subsystem wide UAPI for RAS and 
> telemetry as discussed in [2]. 
> 
> This [2] is the inspiration to this series. It uses the generic
> netlink(genl) family subsystem and exposes a set of commands that can
> be used by every drm driver, the framework provides a means to have
> custom commands too. Each drm driver instance in this example xe driver
> instance registers a family and operations to the genl subsystem through
> which it enumerates and reports the error counters. An event based
> notification is also supported to which userpace can subscribe to and
> be notified when any error occurs and read the error counter this avoids
> continuous polling on error counter. This can also be extended to
> threshold based notification.
> 
> [2]: https://airlied.blogspot.com/2022/09/accelerators-bof-outcomes-summary.html

I'm sharing some thoughts below, and I'd like to get input from the folks
involved in the original discussions, please.
Any thoughts are welcome so we can move faster towards a real, standard GPU RAS
solution.

> 
> This series is on top [3] series which introduces error counting infra in Xe
> driver.
> [3]: https://lore.kernel.org/all/20250730054814.1376770-1-aravind.iddamsetty@linux.intel.com/
> 
> V5:
> Add support to read error corresponding to an IP BLOCK

I honestly don't believe that this version addresses all the concerns raised by
the AMD folks in the previous reviews. It is true that it adds a way of
reading errors per IP block, but if I understood them correctly, they would
like better (and separate) ways to declare and handle the errors coming from
different IP blocks, rather than simply reading or querying a filtered list.

So, I have some grouping ideas below.

> 
> v4:
> 1. Rebase
> 2. rename drm_genl_send to drm_genl_reply

But before getting to those ideas, I'd also like to raise the naming issue
that I see with this proposal.

I was recently running some experiments with devlink for this and similar
cases. I don't believe that devlink is a good fit for our drm-ras. It is
far too centered on network devices, and any addition there for our
GPU RAS would be a heavy lift. But there are some good things in it
that we could perhaps take inspiration from.

Starting with the name: devlink is the name of both the tool and the
framework. It uses netlink underneath, but abstracts it away entirely.
In this version we can see:
drm_ras: the tool
drm_netlink: the abstraction
drm_genl_*: the wrapper?

So, I believe that, like devlink, we should have a single name for everything
and avoid wrappers, providing instead real module registration with
groups and functions, entirely abstracting netlink and focusing
on the RAS functionality.

I'm terrible at naming, but after playing a bit with AI for some suggestions,
I'd say that my favorites are:
drmras - no '_', like most tools; the name would be used not only for the tool
but also for the files and functions.
drmlink - more link, but less ras :/
grill - GPU RAS Interface Link Layer

For the rest of the examples below I'm going with grill, but let me know your
preferences.

> 3. catch error from xa_store and handle appropriately
> 4. presently xe_list_errors fills blank data for IGFX, prevent it by
> having an early check of IS_DGFX (Michael J. Ruhl)
> 
> v3:
> 1. Rebase on latest RAS series for XE
> 2. drop DRIVER_NETLINK flag and use the driver_genl_ops structure to
> register to netlink subsystem
> 
> v2: define common interfaces to genl netlink subsystem that all drm drivers
> can leverage.
> 
> Below is an example tool drm_ras which demonstrates the use of the
> supported commands. The tool will be sent to ML with the subject
> "[RFC i-g-t v3 0/1] A tool to demonstrate use of netlink sockets to read RAS error counters"
> https://lore.kernel.org/all/20250730061342.1380217-2-aravind.iddamsetty@linux.intel.com/
> 
> read single error counter:
> 
> $ ./drm_ras READ_ONE --device=drm:/dev/dri/card1 --error_id=0x0000000000000005
> counter value 0

No need for --device; the device should be a mandatory argument.
And we could accept a BDF or a card identifier.

$ grill list
00:03:00.0 - card0 - xe
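
To make the "BDF or card" idea a bit more concrete, here is a minimal
userspace sketch of how the tool could classify its device argument. Nothing
here exists in the posted series: the type and function names are made up,
and the accepted spellings ("cardN", "bus:dev.fn", "domain:bus:dev.fn", with
hex fields) are only an assumption about what we would want.

```c
#include <assert.h>
#include <stdio.h>

/* Hypothetical selector type: not part of the posted series. */
enum grill_dev_kind { GRILL_DEV_INVALID, GRILL_DEV_BDF, GRILL_DEV_CARD };

struct grill_dev_sel {
	enum grill_dev_kind kind;
	unsigned int domain, bus, dev, fn;	/* valid for GRILL_DEV_BDF */
	unsigned int card;			/* valid for GRILL_DEV_CARD */
};

/*
 * Classify the device argument. The "%n" conversion records how many
 * characters were consumed, so trailing junk makes the parse fail.
 */
static struct grill_dev_sel grill_parse_dev(const char *arg)
{
	struct grill_dev_sel sel = { .kind = GRILL_DEV_INVALID };
	unsigned int a, b, c, d;
	int end;

	assert(arg != NULL);

	/* "cardN" */
	if (sscanf(arg, "card%u%n", &a, &end) == 1 && arg[end] == '\0') {
		sel.kind = GRILL_DEV_CARD;
		sel.card = a;
		return sel;
	}

	/* "domain:bus:dev.fn", or "bus:dev.fn" with an implied domain 0 */
	if (sscanf(arg, "%x:%x:%x.%x%n", &a, &b, &c, &d, &end) == 4 &&
	    arg[end] == '\0') {
		/* all four fields given */
	} else if (sscanf(arg, "%x:%x.%x%n", &b, &c, &d, &end) == 3 &&
		   arg[end] == '\0') {
		a = 0;
	} else {
		return sel;
	}

	/* PCI limits: 16-bit domain, 8-bit bus, 5-bit device, 3-bit function */
	if (a > 0xffff || b > 0xff || c > 0x1f || d > 0x7)
		return sel;

	sel.kind = GRILL_DEV_BDF;
	sel.domain = a;
	sel.bus = b;
	sel.dev = c;
	sel.fn = d;
	return sel;
}
```

With this, grill_parse_dev("00:03:00.0") and grill_parse_dev("card0") would
both resolve, and mapping the result to an actual drm device would be a
separate step.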

$ grill 00:03:00.0 list # Querying available modules.
monitor - global
error - gt
error - soc

Yes, my idea is that a driver should be able to register modules, and groups per module.

GRILL would be designed to accommodate multiple kinds of RAS modules, each module
with its own groups, categories and operations.

Modules: monitor, error, flash?!, etc?!
Groups: Global or per IP block depending on the HW underneath
Categories: Sub-groups like correctable-error vs uncorrectable-error for instance if/where
	    it makes sense.
Operations: Monitor: set-threshold / listen (listen is just a tool operation, but every monitor
	    needs to provide events over netlink)
	    Error: read, clear, logs
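
As a rough illustration of the hierarchy above, this is how the registration
data could look. It is a userspace sketch only: every struct, enum and
function name is hypothetical, and the example table simply mirrors the three
module/group pairs from the "list" output earlier in this mail.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Every name here is hypothetical; nothing below exists in the series. */
enum grill_op {
	GRILL_OP_READ		= 1u << 0,
	GRILL_OP_CLEAR		= 1u << 1,
	GRILL_OP_LOG		= 1u << 2,
	GRILL_OP_SET_THRESHOLD	= 1u << 3,
	GRILL_OP_EVENT		= 1u << 4,
};

struct grill_group {
	const char *name;		/* "global", "gt", "soc", ... */
	const char *const *categories;	/* NULL-terminated list, or NULL */
};

struct grill_module {
	const char *name;		/* "monitor", "error", ... */
	unsigned int ops;		/* bitmask of enum grill_op */
	const struct grill_group *groups;
	size_t n_groups;
};

static const char *const ex_error_categories[] = {
	"correctable-error", "uncorrectable-error", NULL,
};

static const struct grill_group ex_monitor_groups[] = {
	{ .name = "global", .categories = NULL },
};

static const struct grill_group ex_error_groups[] = {
	{ .name = "gt",  .categories = ex_error_categories },
	{ .name = "soc", .categories = ex_error_categories },
};

/* What one driver instance might register. */
static const struct grill_module ex_modules[] = {
	{
		.name = "monitor",
		.ops = GRILL_OP_SET_THRESHOLD | GRILL_OP_EVENT,
		.groups = ex_monitor_groups,
		.n_groups = sizeof(ex_monitor_groups) / sizeof(ex_monitor_groups[0]),
	},
	{
		.name = "error",
		.ops = GRILL_OP_READ | GRILL_OP_CLEAR | GRILL_OP_LOG,
		.groups = ex_error_groups,
		.n_groups = sizeof(ex_error_groups) / sizeof(ex_error_groups[0]),
	},
};

#define EX_N_MODULES (sizeof(ex_modules) / sizeof(ex_modules[0]))

/* Resolve a "<module> <group>" pair from the command line, or NULL. */
static const struct grill_group *
grill_find_group(const struct grill_module *mods, size_t n_mods,
		 const char *module, const char *group)
{
	assert(mods && module && group);

	for (size_t i = 0; i < n_mods; i++) {
		if (strcmp(mods[i].name, module))
			continue;
		for (size_t j = 0; j < mods[i].n_groups; j++)
			if (!strcmp(mods[i].groups[j].name, group))
				return &mods[i].groups[j];
	}
	return NULL;
}
```

The point of keeping ops per module and categories per group is that the tool
can reject something like "monitor gt" up front, without the driver having to
implement an error path for it.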


$ grill 00:03:00.0 error global counter list
# List all available counters in this gpu

$ grill 00:03:00.0 error global counter show soc_fatal_hbm2_chnl0
# Show a specific counter.

$ grill 00:03:00.0 error global log
# Print all the stashed CPER logs (stash can be hw/fw/sw or a combination -
# in AMD's case it is a dump of their debugfs ring)


So, I'm sure the next question is: what if the log is global but the counters
are not? Well, perhaps we should have separate modules, with error-counter
split from error-logging?!

So yes, my proposal still has some open questions, but I'd like to hear your thoughts
and opinions on the overall idea here.

Thanks in advance,
Rodrigo.

> 
> read all error counters:
> 
> $ ./drm_ras READ_ALL --device=drm:/dev/dri/card1
> name                                                    config-id               counter
> 
> error-gt0-correctable-guc                               0x0000000000000001      0
> error-gt0-correctable-slm                               0x0000000000000003      0
> error-gt0-correctable-eu-ic                             0x0000000000000004      0
> error-gt0-correctable-eu-grf                            0x0000000000000005      0
> error-gt0-fatal-guc                                     0x0000000000000009      0
> error-gt0-fatal-slm                                     0x000000000000000d      0
> error-gt0-fatal-eu-grf                                  0x000000000000000f      0
> error-gt0-fatal-fpu                                     0x0000000000000010      0
> error-gt0-fatal-tlb                                     0x0000000000000011      0
> error-gt0-fatal-l3-fabric                               0x0000000000000012      0
> error-gt0-correctable-subslice                          0x0000000000000013      0
> error-gt0-correctable-l3bank                            0x0000000000000014      0
> error-gt0-fatal-subslice                                0x0000000000000015      0
> error-gt0-fatal-l3bank                                  0x0000000000000016      0
> error-gt0-sgunit-correctable                            0x0000000000000017      0
> error-gt0-sgunit-nonfatal                               0x0000000000000018      0
> error-gt0-sgunit-fatal                                  0x0000000000000019      0
> error-gt0-soc-fatal-psf-csc-0                           0x000000000000001a      0
> error-gt0-soc-fatal-psf-csc-1                           0x000000000000001b      0
> error-gt0-soc-fatal-psf-csc-2                           0x000000000000001c      0
> error-gt0-soc-fatal-punit                               0x000000000000001d      0
> error-gt0-soc-fatal-psf-0                               0x000000000000001e      0
> error-gt0-soc-fatal-psf-1                               0x000000000000001f      0
> error-gt0-soc-fatal-psf-2                               0x0000000000000020      0
> error-gt0-soc-fatal-cd0                                 0x0000000000000021      0
> error-gt0-soc-fatal-cd0-mdfi                            0x0000000000000022      0
> error-gt0-soc-fatal-mdfi-east                           0x0000000000000023      0
> error-gt0-soc-fatal-mdfi-south                          0x0000000000000024      0
> error-gt0-soc-fatal-hbm-ss0-0                           0x0000000000000025      0
> error-gt0-soc-fatal-hbm-ss0-1                           0x0000000000000026      0
> error-gt0-soc-fatal-hbm-ss0-2                           0x0000000000000027      0
> error-gt0-soc-fatal-hbm-ss0-3                           0x0000000000000028      0
> error-gt0-soc-fatal-hbm-ss0-4                           0x0000000000000029      0
> error-gt0-soc-fatal-hbm-ss0-5                           0x000000000000002a      0
> error-gt0-soc-fatal-hbm-ss0-6                           0x000000000000002b      0
> error-gt0-soc-fatal-hbm-ss0-7                           0x000000000000002c      0
> error-gt0-soc-fatal-hbm-ss1-0                           0x000000000000002d      0
> error-gt0-soc-fatal-hbm-ss1-1                           0x000000000000002e      0
> error-gt0-soc-fatal-hbm-ss1-2                           0x000000000000002f      0
> error-gt0-soc-fatal-hbm-ss1-3                           0x0000000000000030      0
> error-gt0-soc-fatal-hbm-ss1-4                           0x0000000000000031      0
> error-gt0-soc-fatal-hbm-ss1-5                           0x0000000000000032      0
> error-gt0-soc-fatal-hbm-ss1-6                           0x0000000000000033      0
> error-gt0-soc-fatal-hbm-ss1-7                           0x0000000000000034      0
> error-gt0-soc-fatal-hbm-ss2-0                           0x0000000000000035      0
> error-gt0-soc-fatal-hbm-ss2-1                           0x0000000000000036      0
> error-gt0-soc-fatal-hbm-ss2-2                           0x0000000000000037      0
> error-gt0-soc-fatal-hbm-ss2-3                           0x0000000000000038      0
> error-gt0-soc-fatal-hbm-ss2-4                           0x0000000000000039      0
> error-gt0-soc-fatal-hbm-ss2-5                           0x000000000000003a      0
> error-gt0-soc-fatal-hbm-ss2-6                           0x000000000000003b      0
> error-gt0-soc-fatal-hbm-ss2-7                           0x000000000000003c      0
> error-gt0-soc-fatal-hbm-ss3-0                           0x000000000000003d      0
> error-gt0-soc-fatal-hbm-ss3-1                           0x000000000000003e      0
> error-gt0-soc-fatal-hbm-ss3-2                           0x000000000000003f      0
> error-gt0-soc-fatal-hbm-ss3-3                           0x0000000000000040      0
> error-gt0-soc-fatal-hbm-ss3-4                           0x0000000000000041      0
> error-gt0-soc-fatal-hbm-ss3-5                           0x0000000000000042      0
> error-gt0-soc-fatal-hbm-ss3-6                           0x0000000000000043      0
> error-gt0-soc-fatal-hbm-ss3-7                           0x0000000000000044      0
> error-gt0-gsc-correctable-sram-ecc                      0x0000000000000045      0
> error-gt0-gsc-nonfatal-mia-shutdown                     0x0000000000000046      0
> error-gt0-gsc-nonfatal-mia-int                          0x0000000000000047      0
> error-gt0-gsc-nonfatal-sram-ecc                         0x0000000000000048      0
> error-gt0-gsc-nonfatal-wdg-timeout                      0x0000000000000049      0
> error-gt0-gsc-nonfatal-rom-parity                       0x000000000000004a      0
> error-gt0-gsc-nonfatal-ucode-parity                     0x000000000000004b      0
> error-gt0-gsc-nonfatal-glitch-det                       0x000000000000004c      0
> error-gt0-gsc-nonfatal-fuse-pull                        0x000000000000004d      0
> error-gt0-gsc-nonfatal-fuse-crc-check                   0x000000000000004e      0
> error-gt0-gsc-nonfatal-selfmbist                        0x000000000000004f      0
> error-gt0-gsc-nonfatal-aon-parity                       0x0000000000000050      0
> error-gt1-correctable-guc                               0x1000000000000001      0
> error-gt1-correctable-slm                               0x1000000000000003      0
> error-gt1-correctable-eu-ic                             0x1000000000000004      0
> error-gt1-correctable-eu-grf                            0x1000000000000005      0
> error-gt1-fatal-guc                                     0x1000000000000009      0
> error-gt1-fatal-slm                                     0x100000000000000d      0
> error-gt1-fatal-eu-grf                                  0x100000000000000f      0
> error-gt1-fatal-fpu                                     0x1000000000000010      0
> error-gt1-fatal-tlb                                     0x1000000000000011      0
> error-gt1-fatal-l3-fabric                               0x1000000000000012      0
> error-gt1-correctable-subslice                          0x1000000000000013      0
> error-gt1-correctable-l3bank                            0x1000000000000014      0
> error-gt1-fatal-subslice                                0x1000000000000015      0
> error-gt1-fatal-l3bank                                  0x1000000000000016      0
> error-gt1-sgunit-correctable                            0x1000000000000017      0
> error-gt1-sgunit-nonfatal                               0x1000000000000018      0
> error-gt1-sgunit-fatal                                  0x1000000000000019      0
> error-gt1-soc-fatal-psf-csc-0                           0x100000000000001a      0
> error-gt1-soc-fatal-psf-csc-1                           0x100000000000001b      0
> error-gt1-soc-fatal-psf-csc-2                           0x100000000000001c      0
> error-gt1-soc-fatal-punit                               0x100000000000001d      0
> error-gt1-soc-fatal-psf-0                               0x100000000000001e      0
> error-gt1-soc-fatal-psf-1                               0x100000000000001f      0
> error-gt1-soc-fatal-psf-2                               0x1000000000000020      0
> error-gt1-soc-fatal-cd0                                 0x1000000000000021      0
> error-gt1-soc-fatal-cd0-mdfi                            0x1000000000000022      0
> error-gt1-soc-fatal-mdfi-east                           0x1000000000000023      0
> error-gt1-soc-fatal-mdfi-south                          0x1000000000000024      0
> error-gt1-soc-fatal-hbm-ss0-0                           0x1000000000000025      0
> error-gt1-soc-fatal-hbm-ss0-1                           0x1000000000000026      0
> error-gt1-soc-fatal-hbm-ss0-2                           0x1000000000000027      0
> error-gt1-soc-fatal-hbm-ss0-3                           0x1000000000000028      0
> error-gt1-soc-fatal-hbm-ss0-4                           0x1000000000000029      0
> error-gt1-soc-fatal-hbm-ss0-5                           0x100000000000002a      0
> error-gt1-soc-fatal-hbm-ss0-6                           0x100000000000002b      0
> error-gt1-soc-fatal-hbm-ss0-7                           0x100000000000002c      0
> error-gt1-soc-fatal-hbm-ss1-0                           0x100000000000002d      0
> error-gt1-soc-fatal-hbm-ss1-1                           0x100000000000002e      0
> error-gt1-soc-fatal-hbm-ss1-2                           0x100000000000002f      0
> error-gt1-soc-fatal-hbm-ss1-3                           0x1000000000000030      0
> error-gt1-soc-fatal-hbm-ss1-4                           0x1000000000000031      0
> error-gt1-soc-fatal-hbm-ss1-5                           0x1000000000000032      0
> error-gt1-soc-fatal-hbm-ss1-6                           0x1000000000000033      0
> error-gt1-soc-fatal-hbm-ss1-7                           0x1000000000000034      0
> error-gt1-soc-fatal-hbm-ss2-0                           0x1000000000000035      0
> error-gt1-soc-fatal-hbm-ss2-1                           0x1000000000000036      0
> error-gt1-soc-fatal-hbm-ss2-2                           0x1000000000000037      0
> error-gt1-soc-fatal-hbm-ss2-3                           0x1000000000000038      0
> error-gt1-soc-fatal-hbm-ss2-4                           0x1000000000000039      0
> error-gt1-soc-fatal-hbm-ss2-5                           0x100000000000003a      0
> error-gt1-soc-fatal-hbm-ss2-6                           0x100000000000003b      0
> error-gt1-soc-fatal-hbm-ss2-7                           0x100000000000003c      0
> error-gt1-soc-fatal-hbm-ss3-0                           0x100000000000003d      0
> error-gt1-soc-fatal-hbm-ss3-1                           0x100000000000003e      0
> error-gt1-soc-fatal-hbm-ss3-2                           0x100000000000003f      0
> error-gt1-soc-fatal-hbm-ss3-3                           0x1000000000000040      0
> error-gt1-soc-fatal-hbm-ss3-4                           0x1000000000000041      0
> error-gt1-soc-fatal-hbm-ss3-5                           0x1000000000000042      0
> error-gt1-soc-fatal-hbm-ss3-6                           0x1000000000000043      0
> error-gt1-soc-fatal-hbm-ss3-7                           0x1000000000000044      0
> 
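
A note for anyone reading the listing above: the gt1 entries differ from the
gt0 ones only in the top hex digit of the config-id. A minimal sketch of how a
tool could split an id, assuming (from this listing alone, not from the uAPI
header) that the top 4 bits carry the GT index and the rest the per-GT error
index. The macro and function names are hypothetical.

```c
#include <assert.h>
#include <stdint.h>

/*
 * Assumption inferred from the listing only: bits 63:60 hold the GT index,
 * bits 59:0 the per-GT error index. All names are hypothetical.
 */
#define EX_GT_SHIFT	60
#define EX_ERR_MASK	((UINT64_C(1) << EX_GT_SHIFT) - 1)

/* GT index encoded in a config-id. */
static inline unsigned int ex_config_gt(uint64_t config)
{
	return (unsigned int)(config >> EX_GT_SHIFT);
}

/* Per-GT error index encoded in a config-id. */
static inline uint64_t ex_config_error(uint64_t config)
{
	return config & EX_ERR_MASK;
}

/* Build a config-id from a GT index and an error index. */
static inline uint64_t ex_config_make(unsigned int gt, uint64_t error)
{
	assert(gt < 16 && error <= EX_ERR_MASK);
	return ((uint64_t)gt << EX_GT_SHIFT) | error;
}
```

Under that assumption, error-gt1-correctable-eu-grf (0x1000000000000005)
splits into GT 1 and error index 0x5, the same index that
error-gt0-correctable-eu-grf carries on GT 0.
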
> wait on an error event:
> 
> $ ./drm_ras WAIT_ON_EVENT --device=drm:/dev/dri/card1
> waiting for error event
> error event received
> counter value 0
> 
> list all errors:
> 
> $ ./drm_ras LIST_ERRORS --device=drm:/dev/dri/card1
> name                                                    config-id
> 
> error-gt0-correctable-guc                               0x0000000000000001
> error-gt0-correctable-slm                               0x0000000000000003
> error-gt0-correctable-eu-ic                             0x0000000000000004
> error-gt0-correctable-eu-grf                            0x0000000000000005
> error-gt0-fatal-guc                                     0x0000000000000009
> error-gt0-fatal-slm                                     0x000000000000000d
> error-gt0-fatal-eu-grf                                  0x000000000000000f
> error-gt0-fatal-fpu                                     0x0000000000000010
> error-gt0-fatal-tlb                                     0x0000000000000011
> error-gt0-fatal-l3-fabric                               0x0000000000000012
> error-gt0-correctable-subslice                          0x0000000000000013
> error-gt0-correctable-l3bank                            0x0000000000000014
> error-gt0-fatal-subslice                                0x0000000000000015
> error-gt0-fatal-l3bank                                  0x0000000000000016
> error-gt0-sgunit-correctable                            0x0000000000000017
> error-gt0-sgunit-nonfatal                               0x0000000000000018
> error-gt0-sgunit-fatal                                  0x0000000000000019
> error-gt0-soc-fatal-psf-csc-0                           0x000000000000001a
> error-gt0-soc-fatal-psf-csc-1                           0x000000000000001b
> error-gt0-soc-fatal-psf-csc-2                           0x000000000000001c
> error-gt0-soc-fatal-punit                               0x000000000000001d
> error-gt0-soc-fatal-psf-0                               0x000000000000001e
> error-gt0-soc-fatal-psf-1                               0x000000000000001f
> error-gt0-soc-fatal-psf-2                               0x0000000000000020
> error-gt0-soc-fatal-cd0                                 0x0000000000000021
> error-gt0-soc-fatal-cd0-mdfi                            0x0000000000000022
> error-gt0-soc-fatal-mdfi-east                           0x0000000000000023
> error-gt0-soc-fatal-mdfi-south                          0x0000000000000024
> error-gt0-soc-fatal-hbm-ss0-0                           0x0000000000000025
> error-gt0-soc-fatal-hbm-ss0-1                           0x0000000000000026
> error-gt0-soc-fatal-hbm-ss0-2                           0x0000000000000027
> error-gt0-soc-fatal-hbm-ss0-3                           0x0000000000000028
> error-gt0-soc-fatal-hbm-ss0-4                           0x0000000000000029
> error-gt0-soc-fatal-hbm-ss0-5                           0x000000000000002a
> error-gt0-soc-fatal-hbm-ss0-6                           0x000000000000002b
> error-gt0-soc-fatal-hbm-ss0-7                           0x000000000000002c
> error-gt0-soc-fatal-hbm-ss1-0                           0x000000000000002d
> error-gt0-soc-fatal-hbm-ss1-1                           0x000000000000002e
> error-gt0-soc-fatal-hbm-ss1-2                           0x000000000000002f
> error-gt0-soc-fatal-hbm-ss1-3                           0x0000000000000030
> error-gt0-soc-fatal-hbm-ss1-4                           0x0000000000000031
> error-gt0-soc-fatal-hbm-ss1-5                           0x0000000000000032
> error-gt0-soc-fatal-hbm-ss1-6                           0x0000000000000033
> error-gt0-soc-fatal-hbm-ss1-7                           0x0000000000000034
> error-gt0-soc-fatal-hbm-ss2-0                           0x0000000000000035
> error-gt0-soc-fatal-hbm-ss2-1                           0x0000000000000036
> error-gt0-soc-fatal-hbm-ss2-2                           0x0000000000000037
> error-gt0-soc-fatal-hbm-ss2-3                           0x0000000000000038
> error-gt0-soc-fatal-hbm-ss2-4                           0x0000000000000039
> error-gt0-soc-fatal-hbm-ss2-5                           0x000000000000003a
> error-gt0-soc-fatal-hbm-ss2-6                           0x000000000000003b
> error-gt0-soc-fatal-hbm-ss2-7                           0x000000000000003c
> error-gt0-soc-fatal-hbm-ss3-0                           0x000000000000003d
> error-gt0-soc-fatal-hbm-ss3-1                           0x000000000000003e
> error-gt0-soc-fatal-hbm-ss3-2                           0x000000000000003f
> error-gt0-soc-fatal-hbm-ss3-3                           0x0000000000000040
> error-gt0-soc-fatal-hbm-ss3-4                           0x0000000000000041
> error-gt0-soc-fatal-hbm-ss3-5                           0x0000000000000042
> error-gt0-soc-fatal-hbm-ss3-6                           0x0000000000000043
> error-gt0-soc-fatal-hbm-ss3-7                           0x0000000000000044
> error-gt0-gsc-correctable-sram-ecc                      0x0000000000000045
> error-gt0-gsc-nonfatal-mia-shutdown                     0x0000000000000046
> error-gt0-gsc-nonfatal-mia-int                          0x0000000000000047
> error-gt0-gsc-nonfatal-sram-ecc                         0x0000000000000048
> error-gt0-gsc-nonfatal-wdg-timeout                      0x0000000000000049
> error-gt0-gsc-nonfatal-rom-parity                       0x000000000000004a
> error-gt0-gsc-nonfatal-ucode-parity                     0x000000000000004b
> error-gt0-gsc-nonfatal-glitch-det                       0x000000000000004c
> error-gt0-gsc-nonfatal-fuse-pull                        0x000000000000004d
> error-gt0-gsc-nonfatal-fuse-crc-check                   0x000000000000004e
> error-gt0-gsc-nonfatal-selfmbist                        0x000000000000004f
> error-gt0-gsc-nonfatal-aon-parity                       0x0000000000000050
> error-gt1-correctable-guc                               0x1000000000000001
> error-gt1-correctable-slm                               0x1000000000000003
> error-gt1-correctable-eu-ic                             0x1000000000000004
> error-gt1-correctable-eu-grf                            0x1000000000000005
> error-gt1-fatal-guc                                     0x1000000000000009
> error-gt1-fatal-slm                                     0x100000000000000d
> error-gt1-fatal-eu-grf                                  0x100000000000000f
> error-gt1-fatal-fpu                                     0x1000000000000010
> error-gt1-fatal-tlb                                     0x1000000000000011
> error-gt1-fatal-l3-fabric                               0x1000000000000012
> error-gt1-correctable-subslice                          0x1000000000000013
> error-gt1-correctable-l3bank                            0x1000000000000014
> error-gt1-fatal-subslice                                0x1000000000000015
> error-gt1-fatal-l3bank                                  0x1000000000000016
> error-gt1-sgunit-correctable                            0x1000000000000017
> error-gt1-sgunit-nonfatal                               0x1000000000000018
> error-gt1-sgunit-fatal                                  0x1000000000000019
> error-gt1-soc-fatal-psf-csc-0                           0x100000000000001a
> error-gt1-soc-fatal-psf-csc-1                           0x100000000000001b
> error-gt1-soc-fatal-psf-csc-2                           0x100000000000001c
> error-gt1-soc-fatal-punit                               0x100000000000001d
> error-gt1-soc-fatal-psf-0                               0x100000000000001e
> error-gt1-soc-fatal-psf-1                               0x100000000000001f
> error-gt1-soc-fatal-psf-2                               0x1000000000000020
> error-gt1-soc-fatal-cd0                                 0x1000000000000021
> error-gt1-soc-fatal-cd0-mdfi                            0x1000000000000022
> error-gt1-soc-fatal-mdfi-east                           0x1000000000000023
> error-gt1-soc-fatal-mdfi-south                          0x1000000000000024
> error-gt1-soc-fatal-hbm-ss0-0                           0x1000000000000025
> error-gt1-soc-fatal-hbm-ss0-1                           0x1000000000000026
> error-gt1-soc-fatal-hbm-ss0-2                           0x1000000000000027
> error-gt1-soc-fatal-hbm-ss0-3                           0x1000000000000028
> error-gt1-soc-fatal-hbm-ss0-4                           0x1000000000000029
> error-gt1-soc-fatal-hbm-ss0-5                           0x100000000000002a
> error-gt1-soc-fatal-hbm-ss0-6                           0x100000000000002b
> error-gt1-soc-fatal-hbm-ss0-7                           0x100000000000002c
> error-gt1-soc-fatal-hbm-ss1-0                           0x100000000000002d
> error-gt1-soc-fatal-hbm-ss1-1                           0x100000000000002e
> error-gt1-soc-fatal-hbm-ss1-2                           0x100000000000002f
> error-gt1-soc-fatal-hbm-ss1-3                           0x1000000000000030
> error-gt1-soc-fatal-hbm-ss1-4                           0x1000000000000031
> error-gt1-soc-fatal-hbm-ss1-5                           0x1000000000000032
> error-gt1-soc-fatal-hbm-ss1-6                           0x1000000000000033
> error-gt1-soc-fatal-hbm-ss1-7                           0x1000000000000034
> error-gt1-soc-fatal-hbm-ss2-0                           0x1000000000000035
> error-gt1-soc-fatal-hbm-ss2-1                           0x1000000000000036
> error-gt1-soc-fatal-hbm-ss2-2                           0x1000000000000037
> error-gt1-soc-fatal-hbm-ss2-3                           0x1000000000000038
> error-gt1-soc-fatal-hbm-ss2-4                           0x1000000000000039
> error-gt1-soc-fatal-hbm-ss2-5                           0x100000000000003a
> error-gt1-soc-fatal-hbm-ss2-6                           0x100000000000003b
> error-gt1-soc-fatal-hbm-ss2-7                           0x100000000000003c
> error-gt1-soc-fatal-hbm-ss3-0                           0x100000000000003d
> error-gt1-soc-fatal-hbm-ss3-1                           0x100000000000003e
> error-gt1-soc-fatal-hbm-ss3-2                           0x100000000000003f
> error-gt1-soc-fatal-hbm-ss3-3                           0x1000000000000040
> error-gt1-soc-fatal-hbm-ss3-4                           0x1000000000000041
> error-gt1-soc-fatal-hbm-ss3-5                           0x1000000000000042
> error-gt1-soc-fatal-hbm-ss3-6                           0x1000000000000043
> error-gt1-soc-fatal-hbm-ss3-7                           0x1000000000000044
> 
> Cc: Alex Deucher <alexander.deucher@amd.com>
> Cc: David Airlie <airlied@gmail.com>
> Cc: Simona Vetter <simona@ffwll.ch>
> Cc: Joonas Lahtinen <joonas.lahtinen@linux.intel.com>
> Cc: Rodrigo Vivi <rodrigo.vivi@intel.com>
> Cc: Hawking Zhang <Hawking.Zhang@amd.com>
> Cc: Lijo Lazar <lijo.lazar@amd.com>
> Cc: Ruhl, Michael J <michael.j.ruhl@intel.com>
> Cc: Riana Tauro <riana.tauro@intel.com>
> Cc: Anshuman Gupta <anshuman.gupta@intel.com>
> 
> 
> Aravind Iddamsetty (5):
>   drm/netlink: Add netlink infrastructure
>   drm/xe/RAS: Register netlink capability
>   drm/xe/RAS: Expose the error counters
>   drm/netlink: Define multicast groups
>   drm/xe/RAS: send multicast event on occurrence of an error
> 
>  drivers/gpu/drm/Makefile             |   1 +
>  drivers/gpu/drm/drm_drv.c            |   7 +
>  drivers/gpu/drm/drm_netlink.c        | 219 +++++++++++
>  drivers/gpu/drm/xe/Makefile          |   2 +
>  drivers/gpu/drm/xe/xe_device.c       |   6 +
>  drivers/gpu/drm/xe/xe_device_types.h |   1 +
>  drivers/gpu/drm/xe/xe_hw_error.c     |  56 ++-
>  drivers/gpu/drm/xe/xe_netlink.c      | 531 +++++++++++++++++++++++++++
>  include/drm/drm_device.h             |  10 +
>  include/drm/drm_drv.h                |   7 +
>  include/drm/drm_netlink.h            |  46 +++
>  include/uapi/drm/drm_netlink.h       | 105 ++++++
>  include/uapi/drm/xe_drm.h            |  85 +++++
>  13 files changed, 1071 insertions(+), 5 deletions(-)
>  create mode 100644 drivers/gpu/drm/drm_netlink.c
>  create mode 100644 drivers/gpu/drm/xe/xe_netlink.c
>  create mode 100644 include/drm/drm_netlink.h
>  create mode 100644 include/uapi/drm/drm_netlink.h
> 
> -- 
> 2.25.1
> 


* Re: [RFC v5 1/5] drm/netlink: Add netlink infrastructure
  2025-07-30  6:49 ` [RFC v5 1/5] drm/netlink: Add netlink infrastructure Aravind Iddamsetty
@ 2025-08-15 17:07   ` Zack McKevitt
  2025-08-21  9:45     ` Aravind Iddamsetty
  2025-08-15 21:48   ` Rodrigo Vivi
  1 sibling, 1 reply; 24+ messages in thread
From: Zack McKevitt @ 2025-08-15 17:07 UTC
  To: Aravind Iddamsetty, intel-xe, dri-devel
  Cc: Alex Deucher, David Airlie, Simona Vetter, Joonas Lahtinen,
	Rodrigo Vivi, Hawking Zhang, Lijo Lazar, Ruhl, Michael J,
	Riana Tauro, Anshuman Gupta

On 7/30/2025 12:49 AM, Aravind Iddamsetty wrote:
> +static void drm_genl_family_init(struct drm_device *dev)
> +{
> +	dev->drm_genl_family = drmm_kzalloc(dev, sizeof(struct genl_family),
> +					    GFP_KERNEL);
> +
> +	/* Use drm primary node name eg: card0 to name the genl family */
> +	snprintf(dev->drm_genl_family->name, sizeof(dev->drm_genl_family->name),
> +		 "%s", dev->primary->kdev->kobj.name);
> +	dev->drm_genl_family->version = DRM_GENL_VERSION;
> +	dev->drm_genl_family->parallel_ops = true;
> +	dev->drm_genl_family->ops = drm_genl_ops;
> +	dev->drm_genl_family->n_ops = ARRAY_SIZE(drm_genl_ops);
> +	dev->drm_genl_family->maxattr = DRM_ATTR_MAX;
> +	dev->drm_genl_family->module = dev->dev->driver->owner;
> +}

We are interested in using this infrastructure at Qualcomm to 
communicate telemetry information for the AI100 accelerators. It would 
be nice if this function could support drm_minor accel nodes 
(dev->accel) as well.
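
To illustrate what I mean, here is a minimal userspace mock of the name
selection, not a tested kernel change. The mock_* structs and the helper are
made-up stand-ins for struct drm_minor and struct drm_device, and it assumes
that an accel-only device has no primary minor, so the accel minor is
preferred whenever it is present. GENL_NAMSIZ has the same value as in the
kernel uapi header.

```c
#include <assert.h>
#include <stddef.h>
#include <stdio.h>
#include <string.h>

#define GENL_NAMSIZ 16	/* same value as in include/uapi/linux/genetlink.h */

/* Made-up stand-ins for struct drm_minor and struct drm_device. */
struct mock_minor {
	const char *name;	/* e.g. "card0" or "accel0" */
};

struct mock_drm_device {
	const struct mock_minor *primary;	/* NULL if there is no primary node */
	const struct mock_minor *accel;		/* NULL if there is no accel node */
};

/*
 * Pick the node name used as the genl family name: the accel minor if the
 * device has one, else the primary minor. Returns 0 on success and -1 if
 * there is no usable minor or the name would be truncated.
 */
static int mock_genl_family_name(const struct mock_drm_device *dev,
				 char *buf, size_t len)
{
	const struct mock_minor *minor;
	int n;

	assert(dev && buf && len > 0);

	minor = dev->accel ? dev->accel : dev->primary;
	if (!minor || !minor->name)
		return -1;

	n = snprintf(buf, len, "%s", minor->name);
	return (n < 0 || (size_t)n >= len) ? -1 : 0;
}
```

In the real function this would only change which minor the snprintf() call
reads the name from.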

Thanks,

Zack


* Re: [RFC v5 0/5] Proposal to use netlink for RAS and Telemetry across drm subsystem
  2025-08-13 20:21 ` Rodrigo Vivi
@ 2025-08-15 21:24   ` Rodrigo Vivi
  2025-08-26  4:42     ` Aravind Iddamsetty
  2025-08-25  9:38   ` Aravind Iddamsetty
  1 sibling, 1 reply; 24+ messages in thread
From: Rodrigo Vivi @ 2025-08-15 21:24 UTC
  To: Aravind Iddamsetty, Dave Airlie, Joonas Lahtinen
  Cc: intel-xe, dri-devel, Alex Deucher, Simona Vetter, Hawking Zhang,
	Lijo Lazar, Ruhl, Michael J, Riana Tauro, Anshuman Gupta

On Wed, Aug 13, 2025 at 04:21:03PM -0400, Rodrigo Vivi wrote:
> On Wed, Jul 30, 2025 at 12:19:51PM +0530, Aravind Iddamsetty wrote:
> > Revisiting this patch series to address pending feedback and help move
> > the discussion towards a conclusion. This revision includes updates
> > based on previous comments[1] and aims to clarify outstanding concerns.
> > Specifically, a command was added to facilitate reporting errors from
> > IP blocks, to support the AMDGPU driver's model of RAS.
> > [1]: https://lore.kernel.org/all/4cbdfcc5-5020-a942-740e-a602d4c00cc2@linux.intel.com/
> > 
> > I sincerely appreciate everyone's patience and thoughtful reviews so
> > far, and I hope this refreshed series facilitates the final evaluation
> > and acceptance.
> > 
> > Please feel free to share any further suggestions or questions.
> > 
> > Thank you for your continued consideration.
> > ----------------------------------------------------------------------
> > 
> > Our hardware supports RAS (Reliability, Availability, Serviceability) by
> > reporting errors to the host. The KMD processes them and exposes a set
> > of error counters that observability tools can use to take corrective
> > actions or trigger repairs. Traditionally these were exposed via PMU
> > (for relative counters) and a sysfs interface (for absolute values) in
> > our internal branch. That approach needs two interfaces and offers
> > neither event-based reporting nor configurability, so the community
> > suggested trying netlink as a drm-subsystem-wide UAPI for RAS and
> > telemetry, as discussed in [2].
> > 
> > [2] is the inspiration for this series. It uses the generic netlink
> > (genl) family subsystem and exposes a set of commands that every drm
> > driver can use; the framework also provides a means to add custom
> > commands. Each drm driver instance (in this example, an xe driver
> > instance) registers a family and operations with the genl subsystem,
> > through which it enumerates and reports the error counters.
> > Event-based notification is also supported: userspace can subscribe,
> > be notified when any error occurs, and then read the error counter,
> > which avoids continuous polling of the counters. This can also be
> > extended to threshold-based notification.
> > 
> > [2]: https://airlied.blogspot.com/2022/09/accelerators-bof-outcomes-summary.html
> 
> I'm bringing some thoughts below and I'd like to get inputs from folks involved
> in the original discussions here please.
> Any thought is welcomed so we can move faster towards a real GPU standard RAS
> solution.
> 
> > 
> > This series is on top of the [3] series, which introduces error-counting
> > infra in the Xe driver.
> > [3]: https://lore.kernel.org/all/20250730054814.1376770-1-aravind.iddamsetty@linux.intel.com/
> > 
> > V5:
> > Add support to read error corresponding to an IP BLOCK
> 
> I honestly don't believe that this version solves all the concerns raised by
> AMD folks in the previous reviews. It is true that this is bringing ways of
> reading errors per IP block, but if I understood them correctly, they would
> like better (and separate) ways to declare and handle the errors coming from
> different IP blocks, rather than simply reading/querying for them filtered out.
> 
> So, I have some grouping ideas below.
> 
> > 
> > v4:
> > 1. Rebase
> > 2. rename drm_genl_send to drm_genl_reply
> 
> But before going to the ideas below I'd like to also raise the naming issue
> that I see with this proposal.
> 
> I was recently running some experiments with devlink for this and similar
> cases. I don't believe that devlink is a good fit for our drm-ras. It is
> far too centered on network devices, and any addition there for our
> GPU RAS would be a heavy lift. But there are some good things there
> that we could perhaps take inspiration from.
> 
> Starting with the name: devlink is the name of both the tool and
> the framework. It uses netlink underneath, but abstracts it
> entirely. Here in this version we can see:
> drm_ras: the tool
> drm_netlink: the abstraction
> drm_genl_*: the wrapper?
> 
> So, I believe that, like devlink, we should have a single name for everything
> and avoid wrappers, providing instead the real module registration, with
> groups and functions, entirely abstracting netlink and focusing
> on the RAS functionality.
> 
> I'm terrible with naming, but playing a bit with AI for some suggestions,
> I'd say that my favorites are:
> drmras - no '_' like most of the tools, but not only for the tool, but also for
> the files and functions.
> drmlink - more link, but less ras :/
> grill - GPU RAS Interface Link Layer
> 
> For the rest of the examples below I'm going with grill, but let me know your
> preferences.
> 
> > 3. catch error from xa_store and handle appropriately
> > 4. presently xe_list_errors fills blank data for IGFX, prevent it by
> > having an early check of IS_DGFX (Michael J. Ruhl)
> > 
> > v3:
> > 1. Rebase on latest RAS series for XE
> > 2. drop DRIVER_NETLINK flag and use the driver_genl_ops structure to
> > register to netlink subsystem
> > 
> > v2: define common interfaces to genl netlink subsystem that all drm drivers
> > can leverage.
> > 
> > Below is an example tool drm_ras which demonstrates the use of the
> > supported commands. The tool will be sent to ML with the subject
> > "[RFC i-g-t v3 0/1] A tool to demonstrate use of netlink sockets to read RAS error counters"
> > https://lore.kernel.org/all/20250730061342.1380217-2-aravind.iddamsetty@linux.intel.com/
> > 
> > read single error counter:
> > 
> > $ ./drm_ras READ_ONE --device=drm:/dev/dri/card1 --error_id=0x0000000000000005
> > counter value 0
> 
> No need for --device; that should be a mandatory argument.
> And we could accept BDF or card identification
> 
> $ grill list
> 00:03:00.0 - card0 - xe
> 
> $ grill 00:03:00.0 list # Querying available modules.
> monitor - global
> errors - gt
> errors - soc
> 
> Yes, my idea is that driver should be able to register modules and group per module

Please allow me to emphasize here that the group registration
is just to make this extensible and also to accommodate the AMD case,
but not change the original essence of the goal which is to
create the drm-ras solution.
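
To make the group registration idea concrete, here is a minimal userspace
sketch, assuming a driver registers (module, group) pairs such as
"error"/"gt" or "monitor"/"global". Every name in it (ras_device,
ras_register_group, ras_count_groups) is hypothetical and is not part of
this series or of any existing kernel API; a real implementation would sit
on top of genl.

```c
#include <stddef.h>
#include <string.h>

/* Hypothetical names throughout; none of this is an existing kernel API. */
#define RAS_MAX_GROUPS 8

struct ras_group {
	const char *module;	/* e.g. "error", "monitor" */
	const char *name;	/* e.g. "gt", "soc", "global" */
};

struct ras_device {
	struct ras_group groups[RAS_MAX_GROUPS];
	size_t ngroups;
};

/* Register one (module, group) pair; 0 on success, -1 if full or duplicate. */
static int ras_register_group(struct ras_device *dev, const char *module,
			      const char *name)
{
	size_t i;

	for (i = 0; i < dev->ngroups; i++)
		if (!strcmp(dev->groups[i].module, module) &&
		    !strcmp(dev->groups[i].name, name))
			return -1;
	if (dev->ngroups == RAS_MAX_GROUPS)
		return -1;
	dev->groups[dev->ngroups].module = module;
	dev->groups[dev->ngroups].name = name;
	dev->ngroups++;
	return 0;
}

/* Count the groups a driver registered under a given module. */
static size_t ras_count_groups(const struct ras_device *dev, const char *module)
{
	size_t i, n = 0;

	for (i = 0; i < dev->ngroups; i++)
		if (!strcmp(dev->groups[i].module, module))
			n++;
	return n;
}
```

A driver with only global counters would register a single group, while one
with per-IP-block reporting would register one group per block under the
same module.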

> 
> GRILL would be designed to accommodate multiple kinds of RAS modules, each module,
> with groups, categories and operations.

Also, let me backtrack on the naming here.

Please let's go with the obvious drm_ras.

Perhaps drm_ras for the code here and drmras (one-word) for the IGT tool.

> 
> Modules: monitor, error, flash?!, etc?!

Here, please let me also tone it down a bit. The overall goal continues
to be the creation of our drm-ras framework using netlink to report
error counters and events.

Any addition is a good-to-have, but shouldn't delay the main goal.
Also, any addition should be carefully reviewed individually in the
future. My only wish at this moment is that we design, from the very
beginning, something that is extensible.

Thanks,
Rodrigo.

> Groups: Global or per IP block depending on the HW underneath
> Categories: Sub-groups like correctable-error vs uncorrectable-error for instance if/where
> 	    it makes sense.
> Operations: Monitor: set-threshold / listen (listen is just a tool operation, but every monitor
> 	    needs to provide events over netlink)
> 	    Error: read, clear, logs
> 
> 
> $ grill 00:03:00.0 error global counter list
> # List all available counters in this gpu
> 
> $ grill 00:03:00.0 error global counter show soc_fatal_hbm2_chnl0
> # Show a specific counter.
> 
> $ grill 00:03:00.0 error global log
> # Print all the stashed CPER logs (stash can be hw/fw/sw or a combination;
> # in the AMD case it is a dump of their debugfs ring)
> 
> 
> So, I'm sure the next question is what if the log is global, but the counters
> are not? Well, perhaps we should have different Modules for error-counter
> split from error-logging ?!
> 
> So yes, my thoughts still have some open questions, but I'd like to hear your thoughts
> and opinions on the overall idea here.
> 
> Thanks in advance,
> Rodrigo.
> 
> > 
> > read all error counters:
> > 
> > $ ./drm_ras READ_ALL --device=drm:/dev/dri/card1
> > name                                                    config-id               counter
> > 
> > error-gt0-correctable-guc                               0x0000000000000001      0
> > error-gt0-correctable-slm                               0x0000000000000003      0
> > error-gt0-correctable-eu-ic                             0x0000000000000004      0
> > error-gt0-correctable-eu-grf                            0x0000000000000005      0
> > error-gt0-fatal-guc                                     0x0000000000000009      0
> > error-gt0-fatal-slm                                     0x000000000000000d      0
> > error-gt0-fatal-eu-grf                                  0x000000000000000f      0
> > error-gt0-fatal-fpu                                     0x0000000000000010      0
> > error-gt0-fatal-tlb                                     0x0000000000000011      0
> > error-gt0-fatal-l3-fabric                               0x0000000000000012      0
> > error-gt0-correctable-subslice                          0x0000000000000013      0
> > error-gt0-correctable-l3bank                            0x0000000000000014      0
> > error-gt0-fatal-subslice                                0x0000000000000015      0
> > error-gt0-fatal-l3bank                                  0x0000000000000016      0
> > error-gt0-sgunit-correctable                            0x0000000000000017      0
> > error-gt0-sgunit-nonfatal                               0x0000000000000018      0
> > error-gt0-sgunit-fatal                                  0x0000000000000019      0
> > error-gt0-soc-fatal-psf-csc-0                           0x000000000000001a      0
> > error-gt0-soc-fatal-psf-csc-1                           0x000000000000001b      0
> > error-gt0-soc-fatal-psf-csc-2                           0x000000000000001c      0
> > error-gt0-soc-fatal-punit                               0x000000000000001d      0
> > error-gt0-soc-fatal-psf-0                               0x000000000000001e      0
> > error-gt0-soc-fatal-psf-1                               0x000000000000001f      0
> > error-gt0-soc-fatal-psf-2                               0x0000000000000020      0
> > error-gt0-soc-fatal-cd0                                 0x0000000000000021      0
> > error-gt0-soc-fatal-cd0-mdfi                            0x0000000000000022      0
> > error-gt0-soc-fatal-mdfi-east                           0x0000000000000023      0
> > error-gt0-soc-fatal-mdfi-south                          0x0000000000000024      0
> > error-gt0-soc-fatal-hbm-ss0-0                           0x0000000000000025      0
> > error-gt0-soc-fatal-hbm-ss0-1                           0x0000000000000026      0
> > error-gt0-soc-fatal-hbm-ss0-2                           0x0000000000000027      0
> > error-gt0-soc-fatal-hbm-ss0-3                           0x0000000000000028      0
> > error-gt0-soc-fatal-hbm-ss0-4                           0x0000000000000029      0
> > error-gt0-soc-fatal-hbm-ss0-5                           0x000000000000002a      0
> > error-gt0-soc-fatal-hbm-ss0-6                           0x000000000000002b      0
> > error-gt0-soc-fatal-hbm-ss0-7                           0x000000000000002c      0
> > error-gt0-soc-fatal-hbm-ss1-0                           0x000000000000002d      0
> > error-gt0-soc-fatal-hbm-ss1-1                           0x000000000000002e      0
> > error-gt0-soc-fatal-hbm-ss1-2                           0x000000000000002f      0
> > error-gt0-soc-fatal-hbm-ss1-3                           0x0000000000000030      0
> > error-gt0-soc-fatal-hbm-ss1-4                           0x0000000000000031      0
> > error-gt0-soc-fatal-hbm-ss1-5                           0x0000000000000032      0
> > error-gt0-soc-fatal-hbm-ss1-6                           0x0000000000000033      0
> > error-gt0-soc-fatal-hbm-ss1-7                           0x0000000000000034      0
> > error-gt0-soc-fatal-hbm-ss2-0                           0x0000000000000035      0
> > error-gt0-soc-fatal-hbm-ss2-1                           0x0000000000000036      0
> > error-gt0-soc-fatal-hbm-ss2-2                           0x0000000000000037      0
> > error-gt0-soc-fatal-hbm-ss2-3                           0x0000000000000038      0
> > error-gt0-soc-fatal-hbm-ss2-4                           0x0000000000000039      0
> > error-gt0-soc-fatal-hbm-ss2-5                           0x000000000000003a      0
> > error-gt0-soc-fatal-hbm-ss2-6                           0x000000000000003b      0
> > error-gt0-soc-fatal-hbm-ss2-7                           0x000000000000003c      0
> > error-gt0-soc-fatal-hbm-ss3-0                           0x000000000000003d      0
> > error-gt0-soc-fatal-hbm-ss3-1                           0x000000000000003e      0
> > error-gt0-soc-fatal-hbm-ss3-2                           0x000000000000003f      0
> > error-gt0-soc-fatal-hbm-ss3-3                           0x0000000000000040      0
> > error-gt0-soc-fatal-hbm-ss3-4                           0x0000000000000041      0
> > error-gt0-soc-fatal-hbm-ss3-5                           0x0000000000000042      0
> > error-gt0-soc-fatal-hbm-ss3-6                           0x0000000000000043      0
> > error-gt0-soc-fatal-hbm-ss3-7                           0x0000000000000044      0
> > error-gt0-gsc-correctable-sram-ecc                      0x0000000000000045      0
> > error-gt0-gsc-nonfatal-mia-shutdown                     0x0000000000000046      0
> > error-gt0-gsc-nonfatal-mia-int                          0x0000000000000047      0
> > error-gt0-gsc-nonfatal-sram-ecc                         0x0000000000000048      0
> > error-gt0-gsc-nonfatal-wdg-timeout                      0x0000000000000049      0
> > error-gt0-gsc-nonfatal-rom-parity                       0x000000000000004a      0
> > error-gt0-gsc-nonfatal-ucode-parity                     0x000000000000004b      0
> > error-gt0-gsc-nonfatal-glitch-det                       0x000000000000004c      0
> > error-gt0-gsc-nonfatal-fuse-pull                        0x000000000000004d      0
> > error-gt0-gsc-nonfatal-fuse-crc-check                   0x000000000000004e      0
> > error-gt0-gsc-nonfatal-selfmbist                        0x000000000000004f      0
> > error-gt0-gsc-nonfatal-aon-parity                       0x0000000000000050      0
> > error-gt1-correctable-guc                               0x1000000000000001      0
> > error-gt1-correctable-slm                               0x1000000000000003      0
> > error-gt1-correctable-eu-ic                             0x1000000000000004      0
> > error-gt1-correctable-eu-grf                            0x1000000000000005      0
> > error-gt1-fatal-guc                                     0x1000000000000009      0
> > error-gt1-fatal-slm                                     0x100000000000000d      0
> > error-gt1-fatal-eu-grf                                  0x100000000000000f      0
> > error-gt1-fatal-fpu                                     0x1000000000000010      0
> > error-gt1-fatal-tlb                                     0x1000000000000011      0
> > error-gt1-fatal-l3-fabric                               0x1000000000000012      0
> > error-gt1-correctable-subslice                          0x1000000000000013      0
> > error-gt1-correctable-l3bank                            0x1000000000000014      0
> > error-gt1-fatal-subslice                                0x1000000000000015      0
> > error-gt1-fatal-l3bank                                  0x1000000000000016      0
> > error-gt1-sgunit-correctable                            0x1000000000000017      0
> > error-gt1-sgunit-nonfatal                               0x1000000000000018      0
> > error-gt1-sgunit-fatal                                  0x1000000000000019      0
> > error-gt1-soc-fatal-psf-csc-0                           0x100000000000001a      0
> > error-gt1-soc-fatal-psf-csc-1                           0x100000000000001b      0
> > error-gt1-soc-fatal-psf-csc-2                           0x100000000000001c      0
> > error-gt1-soc-fatal-punit                               0x100000000000001d      0
> > error-gt1-soc-fatal-psf-0                               0x100000000000001e      0
> > error-gt1-soc-fatal-psf-1                               0x100000000000001f      0
> > error-gt1-soc-fatal-psf-2                               0x1000000000000020      0
> > error-gt1-soc-fatal-cd0                                 0x1000000000000021      0
> > error-gt1-soc-fatal-cd0-mdfi                            0x1000000000000022      0
> > error-gt1-soc-fatal-mdfi-east                           0x1000000000000023      0
> > error-gt1-soc-fatal-mdfi-south                          0x1000000000000024      0
> > error-gt1-soc-fatal-hbm-ss0-0                           0x1000000000000025      0
> > error-gt1-soc-fatal-hbm-ss0-1                           0x1000000000000026      0
> > error-gt1-soc-fatal-hbm-ss0-2                           0x1000000000000027      0
> > error-gt1-soc-fatal-hbm-ss0-3                           0x1000000000000028      0
> > error-gt1-soc-fatal-hbm-ss0-4                           0x1000000000000029      0
> > error-gt1-soc-fatal-hbm-ss0-5                           0x100000000000002a      0
> > error-gt1-soc-fatal-hbm-ss0-6                           0x100000000000002b      0
> > error-gt1-soc-fatal-hbm-ss0-7                           0x100000000000002c      0
> > error-gt1-soc-fatal-hbm-ss1-0                           0x100000000000002d      0
> > error-gt1-soc-fatal-hbm-ss1-1                           0x100000000000002e      0
> > error-gt1-soc-fatal-hbm-ss1-2                           0x100000000000002f      0
> > error-gt1-soc-fatal-hbm-ss1-3                           0x1000000000000030      0
> > error-gt1-soc-fatal-hbm-ss1-4                           0x1000000000000031      0
> > error-gt1-soc-fatal-hbm-ss1-5                           0x1000000000000032      0
> > error-gt1-soc-fatal-hbm-ss1-6                           0x1000000000000033      0
> > error-gt1-soc-fatal-hbm-ss1-7                           0x1000000000000034      0
> > error-gt1-soc-fatal-hbm-ss2-0                           0x1000000000000035      0
> > error-gt1-soc-fatal-hbm-ss2-1                           0x1000000000000036      0
> > error-gt1-soc-fatal-hbm-ss2-2                           0x1000000000000037      0
> > error-gt1-soc-fatal-hbm-ss2-3                           0x1000000000000038      0
> > error-gt1-soc-fatal-hbm-ss2-4                           0x1000000000000039      0
> > error-gt1-soc-fatal-hbm-ss2-5                           0x100000000000003a      0
> > error-gt1-soc-fatal-hbm-ss2-6                           0x100000000000003b      0
> > error-gt1-soc-fatal-hbm-ss2-7                           0x100000000000003c      0
> > error-gt1-soc-fatal-hbm-ss3-0                           0x100000000000003d      0
> > error-gt1-soc-fatal-hbm-ss3-1                           0x100000000000003e      0
> > error-gt1-soc-fatal-hbm-ss3-2                           0x100000000000003f      0
> > error-gt1-soc-fatal-hbm-ss3-3                           0x1000000000000040      0
> > error-gt1-soc-fatal-hbm-ss3-4                           0x1000000000000041      0
> > error-gt1-soc-fatal-hbm-ss3-5                           0x1000000000000042      0
> > error-gt1-soc-fatal-hbm-ss3-6                           0x1000000000000043      0
> > error-gt1-soc-fatal-hbm-ss3-7                           0x1000000000000044      0
> > 
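
The listing above suggests that a config-id packs the GT index into the top
bits and the error index into the low bits (gt0 ids start with 0x0, gt1 ids
with 0x1). A small sketch of helpers under that assumption follows. The
exact field width is a guess (bits 63:60 here), and the macro and function
names are hypothetical rather than taken from the patch.

```c
#include <stdint.h>

/*
 * Assumption inferred from the listing, not taken from the patch:
 * the GT index sits in bits 63:60 and the error index in the low bits.
 */
#define CFG_GT_SHIFT	60
#define CFG_ERR_MASK	((UINT64_C(1) << CFG_GT_SHIFT) - 1)

/* Extract the GT index from a config-id. */
static inline unsigned int cfg_gt(uint64_t config_id)
{
	return (unsigned int)(config_id >> CFG_GT_SHIFT);
}

/* Extract the per-GT error index from a config-id. */
static inline uint64_t cfg_error(uint64_t config_id)
{
	return config_id & CFG_ERR_MASK;
}

/* Build a config-id from a GT index and an error index. */
static inline uint64_t cfg_make(unsigned int gt, uint64_t error)
{
	return ((uint64_t)gt << CFG_GT_SHIFT) | (error & CFG_ERR_MASK);
}
```

With these helpers, 0x1000000000000005 splits into GT 1 and error index 5,
which matches the error-gt1-correctable-eu-grf row.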
> > wait on an error event:
> > 
> > $ ./drm_ras WAIT_ON_EVENT --device=drm:/dev/dri/card1
> > waiting for error event
> > error event received
> > counter value 0
> > 
> > list all errors:
> > 
> > $ ./drm_ras LIST_ERRORS --device=drm:/dev/dri/card1
> > name                                                    config-id
> > 
> > error-gt0-correctable-guc                               0x0000000000000001
> > error-gt0-correctable-slm                               0x0000000000000003
> > error-gt0-correctable-eu-ic                             0x0000000000000004
> > error-gt0-correctable-eu-grf                            0x0000000000000005
> > error-gt0-fatal-guc                                     0x0000000000000009
> > error-gt0-fatal-slm                                     0x000000000000000d
> > error-gt0-fatal-eu-grf                                  0x000000000000000f
> > error-gt0-fatal-fpu                                     0x0000000000000010
> > error-gt0-fatal-tlb                                     0x0000000000000011
> > error-gt0-fatal-l3-fabric                               0x0000000000000012
> > error-gt0-correctable-subslice                          0x0000000000000013
> > error-gt0-correctable-l3bank                            0x0000000000000014
> > error-gt0-fatal-subslice                                0x0000000000000015
> > error-gt0-fatal-l3bank                                  0x0000000000000016
> > error-gt0-sgunit-correctable                            0x0000000000000017
> > error-gt0-sgunit-nonfatal                               0x0000000000000018
> > error-gt0-sgunit-fatal                                  0x0000000000000019
> > error-gt0-soc-fatal-psf-csc-0                           0x000000000000001a
> > error-gt0-soc-fatal-psf-csc-1                           0x000000000000001b
> > error-gt0-soc-fatal-psf-csc-2                           0x000000000000001c
> > error-gt0-soc-fatal-punit                               0x000000000000001d
> > error-gt0-soc-fatal-psf-0                               0x000000000000001e
> > error-gt0-soc-fatal-psf-1                               0x000000000000001f
> > error-gt0-soc-fatal-psf-2                               0x0000000000000020
> > error-gt0-soc-fatal-cd0                                 0x0000000000000021
> > error-gt0-soc-fatal-cd0-mdfi                            0x0000000000000022
> > error-gt0-soc-fatal-mdfi-east                           0x0000000000000023
> > error-gt0-soc-fatal-mdfi-south                          0x0000000000000024
> > error-gt0-soc-fatal-hbm-ss0-0                           0x0000000000000025
> > error-gt0-soc-fatal-hbm-ss0-1                           0x0000000000000026
> > error-gt0-soc-fatal-hbm-ss0-2                           0x0000000000000027
> > error-gt0-soc-fatal-hbm-ss0-3                           0x0000000000000028
> > error-gt0-soc-fatal-hbm-ss0-4                           0x0000000000000029
> > error-gt0-soc-fatal-hbm-ss0-5                           0x000000000000002a
> > error-gt0-soc-fatal-hbm-ss0-6                           0x000000000000002b
> > error-gt0-soc-fatal-hbm-ss0-7                           0x000000000000002c
> > error-gt0-soc-fatal-hbm-ss1-0                           0x000000000000002d
> > error-gt0-soc-fatal-hbm-ss1-1                           0x000000000000002e
> > error-gt0-soc-fatal-hbm-ss1-2                           0x000000000000002f
> > error-gt0-soc-fatal-hbm-ss1-3                           0x0000000000000030
> > error-gt0-soc-fatal-hbm-ss1-4                           0x0000000000000031
> > error-gt0-soc-fatal-hbm-ss1-5                           0x0000000000000032
> > error-gt0-soc-fatal-hbm-ss1-6                           0x0000000000000033
> > error-gt0-soc-fatal-hbm-ss1-7                           0x0000000000000034
> > error-gt0-soc-fatal-hbm-ss2-0                           0x0000000000000035
> > error-gt0-soc-fatal-hbm-ss2-1                           0x0000000000000036
> > error-gt0-soc-fatal-hbm-ss2-2                           0x0000000000000037
> > error-gt0-soc-fatal-hbm-ss2-3                           0x0000000000000038
> > error-gt0-soc-fatal-hbm-ss2-4                           0x0000000000000039
> > error-gt0-soc-fatal-hbm-ss2-5                           0x000000000000003a
> > error-gt0-soc-fatal-hbm-ss2-6                           0x000000000000003b
> > error-gt0-soc-fatal-hbm-ss2-7                           0x000000000000003c
> > error-gt0-soc-fatal-hbm-ss3-0                           0x000000000000003d
> > error-gt0-soc-fatal-hbm-ss3-1                           0x000000000000003e
> > error-gt0-soc-fatal-hbm-ss3-2                           0x000000000000003f
> > error-gt0-soc-fatal-hbm-ss3-3                           0x0000000000000040
> > error-gt0-soc-fatal-hbm-ss3-4                           0x0000000000000041
> > error-gt0-soc-fatal-hbm-ss3-5                           0x0000000000000042
> > error-gt0-soc-fatal-hbm-ss3-6                           0x0000000000000043
> > error-gt0-soc-fatal-hbm-ss3-7                           0x0000000000000044
> > error-gt0-gsc-correctable-sram-ecc                      0x0000000000000045
> > error-gt0-gsc-nonfatal-mia-shutdown                     0x0000000000000046
> > error-gt0-gsc-nonfatal-mia-int                          0x0000000000000047
> > error-gt0-gsc-nonfatal-sram-ecc                         0x0000000000000048
> > error-gt0-gsc-nonfatal-wdg-timeout                      0x0000000000000049
> > error-gt0-gsc-nonfatal-rom-parity                       0x000000000000004a
> > error-gt0-gsc-nonfatal-ucode-parity                     0x000000000000004b
> > error-gt0-gsc-nonfatal-glitch-det                       0x000000000000004c
> > error-gt0-gsc-nonfatal-fuse-pull                        0x000000000000004d
> > error-gt0-gsc-nonfatal-fuse-crc-check                   0x000000000000004e
> > error-gt0-gsc-nonfatal-selfmbist                        0x000000000000004f
> > error-gt0-gsc-nonfatal-aon-parity                       0x0000000000000050
> > error-gt1-correctable-guc                               0x1000000000000001
> > error-gt1-correctable-slm                               0x1000000000000003
> > error-gt1-correctable-eu-ic                             0x1000000000000004
> > error-gt1-correctable-eu-grf                            0x1000000000000005
> > error-gt1-fatal-guc                                     0x1000000000000009
> > error-gt1-fatal-slm                                     0x100000000000000d
> > error-gt1-fatal-eu-grf                                  0x100000000000000f
> > error-gt1-fatal-fpu                                     0x1000000000000010
> > error-gt1-fatal-tlb                                     0x1000000000000011
> > error-gt1-fatal-l3-fabric                               0x1000000000000012
> > error-gt1-correctable-subslice                          0x1000000000000013
> > error-gt1-correctable-l3bank                            0x1000000000000014
> > error-gt1-fatal-subslice                                0x1000000000000015
> > error-gt1-fatal-l3bank                                  0x1000000000000016
> > error-gt1-sgunit-correctable                            0x1000000000000017
> > error-gt1-sgunit-nonfatal                               0x1000000000000018
> > error-gt1-sgunit-fatal                                  0x1000000000000019
> > error-gt1-soc-fatal-psf-csc-0                           0x100000000000001a
> > error-gt1-soc-fatal-psf-csc-1                           0x100000000000001b
> > error-gt1-soc-fatal-psf-csc-2                           0x100000000000001c
> > error-gt1-soc-fatal-punit                               0x100000000000001d
> > error-gt1-soc-fatal-psf-0                               0x100000000000001e
> > error-gt1-soc-fatal-psf-1                               0x100000000000001f
> > error-gt1-soc-fatal-psf-2                               0x1000000000000020
> > error-gt1-soc-fatal-cd0                                 0x1000000000000021
> > error-gt1-soc-fatal-cd0-mdfi                            0x1000000000000022
> > error-gt1-soc-fatal-mdfi-east                           0x1000000000000023
> > error-gt1-soc-fatal-mdfi-south                          0x1000000000000024
> > error-gt1-soc-fatal-hbm-ss0-0                           0x1000000000000025
> > error-gt1-soc-fatal-hbm-ss0-1                           0x1000000000000026
> > error-gt1-soc-fatal-hbm-ss0-2                           0x1000000000000027
> > error-gt1-soc-fatal-hbm-ss0-3                           0x1000000000000028
> > error-gt1-soc-fatal-hbm-ss0-4                           0x1000000000000029
> > error-gt1-soc-fatal-hbm-ss0-5                           0x100000000000002a
> > error-gt1-soc-fatal-hbm-ss0-6                           0x100000000000002b
> > error-gt1-soc-fatal-hbm-ss0-7                           0x100000000000002c
> > error-gt1-soc-fatal-hbm-ss1-0                           0x100000000000002d
> > error-gt1-soc-fatal-hbm-ss1-1                           0x100000000000002e
> > error-gt1-soc-fatal-hbm-ss1-2                           0x100000000000002f
> > error-gt1-soc-fatal-hbm-ss1-3                           0x1000000000000030
> > error-gt1-soc-fatal-hbm-ss1-4                           0x1000000000000031
> > error-gt1-soc-fatal-hbm-ss1-5                           0x1000000000000032
> > error-gt1-soc-fatal-hbm-ss1-6                           0x1000000000000033
> > error-gt1-soc-fatal-hbm-ss1-7                           0x1000000000000034
> > error-gt1-soc-fatal-hbm-ss2-0                           0x1000000000000035
> > error-gt1-soc-fatal-hbm-ss2-1                           0x1000000000000036
> > error-gt1-soc-fatal-hbm-ss2-2                           0x1000000000000037
> > error-gt1-soc-fatal-hbm-ss2-3                           0x1000000000000038
> > error-gt1-soc-fatal-hbm-ss2-4                           0x1000000000000039
> > error-gt1-soc-fatal-hbm-ss2-5                           0x100000000000003a
> > error-gt1-soc-fatal-hbm-ss2-6                           0x100000000000003b
> > error-gt1-soc-fatal-hbm-ss2-7                           0x100000000000003c
> > error-gt1-soc-fatal-hbm-ss3-0                           0x100000000000003d
> > error-gt1-soc-fatal-hbm-ss3-1                           0x100000000000003e
> > error-gt1-soc-fatal-hbm-ss3-2                           0x100000000000003f
> > error-gt1-soc-fatal-hbm-ss3-3                           0x1000000000000040
> > error-gt1-soc-fatal-hbm-ss3-4                           0x1000000000000041
> > error-gt1-soc-fatal-hbm-ss3-5                           0x1000000000000042
> > error-gt1-soc-fatal-hbm-ss3-6                           0x1000000000000043
> > error-gt1-soc-fatal-hbm-ss3-7                           0x1000000000000044
> > 
> > Cc: Alex Deucher <alexander.deucher@amd.com>
> > Cc: David Airlie <airlied@gmail.com>
> > Cc: Simona Vetter <simona@ffwll.ch>
> > Cc: Joonas Lahtinen <joonas.lahtinen@linux.intel.com>
> > Cc: Rodrigo Vivi <rodrigo.vivi@intel.com>
> > Cc: Hawking Zhang <Hawking.Zhang@amd.com>
> > Cc: Lijo Lazar <lijo.lazar@amd.com>
> > Cc: Ruhl, Michael J <michael.j.ruhl@intel.com>
> > Cc: Riana Tauro <riana.tauro@intel.com>
> > Cc: Anshuman Gupta <anshuman.gupta@intel.com>
> > 
> > 
> > Aravind Iddamsetty (5):
> >   drm/netlink: Add netlink infrastructure
> >   drm/xe/RAS: Register netlink capability
> >   drm/xe/RAS: Expose the error counters
> >   drm/netlink: Define multicast groups
> >   drm/xe/RAS: send multicast event on occurrence of an error
> > 
> >  drivers/gpu/drm/Makefile             |   1 +
> >  drivers/gpu/drm/drm_drv.c            |   7 +
> >  drivers/gpu/drm/drm_netlink.c        | 219 +++++++++++
> >  drivers/gpu/drm/xe/Makefile          |   2 +
> >  drivers/gpu/drm/xe/xe_device.c       |   6 +
> >  drivers/gpu/drm/xe/xe_device_types.h |   1 +
> >  drivers/gpu/drm/xe/xe_hw_error.c     |  56 ++-
> >  drivers/gpu/drm/xe/xe_netlink.c      | 531 +++++++++++++++++++++++++++
> >  include/drm/drm_device.h             |  10 +
> >  include/drm/drm_drv.h                |   7 +
> >  include/drm/drm_netlink.h            |  46 +++
> >  include/uapi/drm/drm_netlink.h       | 105 ++++++
> >  include/uapi/drm/xe_drm.h            |  85 +++++
> >  13 files changed, 1071 insertions(+), 5 deletions(-)
> >  create mode 100644 drivers/gpu/drm/drm_netlink.c
> >  create mode 100644 drivers/gpu/drm/xe/xe_netlink.c
> >  create mode 100644 include/drm/drm_netlink.h
> >  create mode 100644 include/uapi/drm/drm_netlink.h
> > 
> > -- 
> > 2.25.1
> > 

* Re: [RFC v5 1/5] drm/netlink: Add netlink infrastructure
  2025-07-30  6:49 ` [RFC v5 1/5] drm/netlink: Add netlink infrastructure Aravind Iddamsetty
  2025-08-15 17:07   ` Zack McKevitt
@ 2025-08-15 21:48   ` Rodrigo Vivi
  2025-08-26  5:58     ` Aravind Iddamsetty
  1 sibling, 1 reply; 24+ messages in thread
From: Rodrigo Vivi @ 2025-08-15 21:48 UTC (permalink / raw)
  To: Aravind Iddamsetty
  Cc: intel-xe, dri-devel, Alex Deucher, David Airlie, Simona Vetter,
	Joonas Lahtinen, Hawking Zhang, Lijo Lazar, Ruhl, Michael J,
	Riana Tauro, Anshuman Gupta

On Wed, Jul 30, 2025 at 12:19:52PM +0530, Aravind Iddamsetty wrote:
> Define the netlink registration interface, commands, and attributes that
> can be used in common across drm drivers. This patch intends to use
> the generic netlink family to expose various stats of the device. At present
> it defines some commands that shall be used to expose RAS error counters.
> 
> v2:
> define common interfaces to genl netlink subsystem that all drm drivers
> can leverage.(Tomer Tayar)
> 
> v3: drop DRIVER_NETLINK flag and use the driver_genl_ops structure to
> register to netlink subsystem (Daniel Vetter)
> 
> v4:(Michael J. Ruhl)
> 1. rename drm_genl_send to drm_genl_reply
> 2. catch error from xa_store and handle appropriately
> 
> v5:
> 1. compile only if CONFIG_NET is enabled
> 
> V6: Add support for reading an IP block's errors
> 
> Reviewed-by: Michael J. Ruhl <michael.j.ruhl@intel.com> #v4
> Signed-off-by: Aravind Iddamsetty <aravind.iddamsetty@linux.intel.com>
> ---
>  drivers/gpu/drm/Makefile       |   1 +
>  drivers/gpu/drm/drm_drv.c      |   7 ++
>  drivers/gpu/drm/drm_netlink.c  | 212 +++++++++++++++++++++++++++++++++
>  include/drm/drm_device.h       |  10 ++
>  include/drm/drm_drv.h          |   7 ++
>  include/drm/drm_netlink.h      |  41 +++++++
>  include/uapi/drm/drm_netlink.h | 101 ++++++++++++++++
>  7 files changed, 379 insertions(+)
>  create mode 100644 drivers/gpu/drm/drm_netlink.c
>  create mode 100644 include/drm/drm_netlink.h
>  create mode 100644 include/uapi/drm/drm_netlink.h
> 
> diff --git a/drivers/gpu/drm/Makefile b/drivers/gpu/drm/Makefile
> index 4dafbdc8f86a..39d5183ab35c 100644
> --- a/drivers/gpu/drm/Makefile
> +++ b/drivers/gpu/drm/Makefile
> @@ -77,6 +77,7 @@ drm-$(CONFIG_DRM_CLIENT) += \
>  	drm_client.o \
>  	drm_client_event.o \
>  	drm_client_modeset.o
> +drm-$(CONFIG_NET) += drm_netlink.o
>  drm-$(CONFIG_DRM_LIB_RANDOM) += lib/drm_random.o
>  drm-$(CONFIG_COMPAT) += drm_ioc32.o
>  drm-$(CONFIG_DRM_PANEL) += drm_panel.o
> diff --git a/drivers/gpu/drm/drm_drv.c b/drivers/gpu/drm/drm_drv.c
> index 02556363e918..cce55423141c 100644
> --- a/drivers/gpu/drm/drm_drv.c
> +++ b/drivers/gpu/drm/drm_drv.c
> @@ -1088,6 +1088,12 @@ int drm_dev_register(struct drm_device *dev, unsigned long flags)
>  	if (ret)
>  		goto err_minors;
>  
> +	if (driver->genl_ops) {
> +		ret = drm_genl_register(dev);
> +		if (ret)
> +			goto err_minors;
> +	}

Even if we don't go with multiple 'groups' I believe that the driver should
explicitly call the netlink registration.
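
To make the suggestion concrete, here is a rough userspace-only sketch of the
opt-in flow I have in mind. Everything in it is hypothetical: `mock_device`,
`mock_dev_register()`, `mock_ras_register()` and the `has_ras` flag are
stand-ins, not existing DRM API. The only point is that device registration
itself never touches netlink, and the driver registers RAS explicitly after
checking that the platform actually has it.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical stand-in for a drm_device; not the real structure. */
struct mock_device {
	bool has_ras;        /* platform capability, decided by the driver */
	bool ras_registered; /* set only by an explicit registration call */
};

/* Core registration: deliberately knows nothing about RAS or netlink. */
static int mock_dev_register(struct mock_device *dev)
{
	(void)dev;
	return 0;
}

/* Explicit call a driver makes when, and only when, it wants RAS. */
static int mock_ras_register(struct mock_device *dev)
{
	if (!dev)
		return -1;
	dev->ras_registered = true;
	return 0;
}

/* Driver probe path: the opt-in is visible at the call site. */
static int mock_driver_probe(struct mock_device *dev)
{
	int ret = mock_dev_register(dev);

	if (ret)
		return ret;
	if (dev->has_ras)
		ret = mock_ras_register(dev);
	return ret;
}
```

With this shape a platform without RAS simply never calls the registration
function, and nothing is created on its behalf.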

> +
>  	ret = create_compat_control_link(dev);
>  	if (ret)
>  		goto err_minors;
> @@ -1229,6 +1235,7 @@ static void drm_core_exit(void)
>  	drm_privacy_screen_lookup_exit();
>  	drm_panic_exit();
>  	accel_core_exit();
> +	drm_genl_exit();
>  	unregister_chrdev(DRM_MAJOR, "drm");
>  	debugfs_remove(drm_debugfs_root);
>  	drm_sysfs_destroy();
> diff --git a/drivers/gpu/drm/drm_netlink.c b/drivers/gpu/drm/drm_netlink.c
> new file mode 100644
> index 000000000000..da4bfde32a22
> --- /dev/null
> +++ b/drivers/gpu/drm/drm_netlink.c

drm_ras.c ?

> @@ -0,0 +1,212 @@
> +// SPDX-License-Identifier: MIT
> +/*
> + * Copyright © 2023 Intel Corporation

2025 here and in any other file

> + */
> +
> +#include <net/genetlink.h>
> +#include <uapi/drm/drm_netlink.h>

uapi/drm/drm_ras.h ?!

like we don't have a drm_ioctl.h but drm_mode.h

> +
> +#include <drm/drm_device.h>
> +#include <drm/drm_drv.h>
> +#include <drm/drm_file.h>
> +#include <drm/drm_managed.h>
> +#include <drm/drm_netlink.h>
> +#include <drm/drm_print.h>
> +
> +DEFINE_XARRAY(drm_dev_xarray);
> +
> +/**
> + * drm_genl_reply - response to a request
> + * @msg: socket buffer
> + * @info: receiver information
> + * @usrhdr: pointer to user specific header in the message buffer
> + *
> + * RETURNS:
> + * 0 on success and negative error code on failure
> + */
> +int drm_genl_reply(struct sk_buff *msg, struct genl_info *info, void *usrhdr)

drm_ras_reply so we standardize in a single namespace everywhere ?!

and same for all other functions and structs, except for things
that are declared outside drm

> +{
> +	int ret;
> +
> +	genlmsg_end(msg, usrhdr);
> +
> +	ret = genlmsg_reply(msg, info);
> +	if (ret)
> +		nlmsg_free(msg);
> +
> +	return ret;
> +}
> +EXPORT_SYMBOL(drm_genl_reply);
> +
> +/**
> + * drm_genl_alloc_msg - allocate genl message buffer
> + * @dev: drm_device for which the message is being allocated
> + * @info: receiver information
> + * @msg_size: size of the msg buffer that needs to be allocated
> + * @usrhdr: pointer to user specific header in the message buffer
> + *
> + * RETURNS:
> + * pointer to new allocated buffer on success, NULL on failure
> + */
> +struct sk_buff *
> +drm_genl_alloc_msg(struct drm_device *dev,
> +		   struct genl_info *info,
> +		   size_t msg_size, void **usrhdr)
> +{
> +	struct sk_buff *new_msg;
> +
> +	new_msg = genlmsg_new(msg_size, GFP_KERNEL);
> +	if (!new_msg)
> +		return new_msg;
> +
> +	*usrhdr = genlmsg_put_reply(new_msg, info, dev->drm_genl_family, 0, info->genlhdr->cmd);
> +	if (!*usrhdr) {
> +		nlmsg_free(new_msg);
> +		new_msg = NULL;
> +	}
> +
> +	return new_msg;
> +}
> +EXPORT_SYMBOL(drm_genl_alloc_msg);
> +
> +static struct drm_device *genl_to_dev(struct genl_info *info)
> +{
> +	return xa_load(&drm_dev_xarray, info->nlhdr->nlmsg_type);
> +}
> +
> +static int drm_genl_list_errors(struct sk_buff *msg, struct genl_info *info)
> +{
> +	struct drm_device *dev = genl_to_dev(info);
> +
> +	if (info->genlhdr->cmd == DRM_RAS_CMD_READ_ALL) {
> +		if (GENL_REQ_ATTR_CHECK(info, DRM_RAS_ATTR_READ_ALL))
> +			return -EINVAL;
> +	} else {
> +		if (GENL_REQ_ATTR_CHECK(info, DRM_RAS_ATTR_QUERY))
> +			return -EINVAL;
> +	}
> +
> +	if (WARN_ON(!dev->driver->genl_ops[info->genlhdr->cmd].doit))
> +		return -EOPNOTSUPP;
> +
> +	return dev->driver->genl_ops[info->genlhdr->cmd].doit(dev, msg, info);
> +}
> +
> +static int drm_genl_read_error(struct sk_buff *msg, struct genl_info *info)
> +{
> +	struct drm_device *dev = genl_to_dev(info);
> +
> +	if (GENL_REQ_ATTR_CHECK(info, DRM_RAS_ATTR_ERROR_ID))
> +		return -EINVAL;
> +
> +	if (WARN_ON(!dev->driver->genl_ops[info->genlhdr->cmd].doit))
> +		return -EOPNOTSUPP;
> +
> +	return dev->driver->genl_ops[info->genlhdr->cmd].doit(dev, msg, info);
> +}
> +
> +/* attribute policies */
> +static const struct nla_policy drm_attr_policy_query[DRM_ATTR_MAX + 1] = {
> +	[DRM_RAS_ATTR_QUERY] = { .type = NLA_U8 },
> +};
> +
> +static const struct nla_policy drm_attr_policy_read_one[DRM_ATTR_MAX + 1] = {
> +	[DRM_RAS_ATTR_ERROR_ID] = { .type = NLA_U64 },
> +};
> +
> +static const struct nla_policy drm_attr_policy_read_all[DRM_ATTR_MAX + 1] = {
> +	[DRM_RAS_ATTR_READ_ALL] = { .type = NLA_U8 },
> +};
> +
> +/* drm genl operations definition */
> +const struct genl_ops drm_genl_ops[] = {
> +	{
> +		.cmd = DRM_RAS_CMD_QUERY,
> +		.doit = drm_genl_list_errors,
> +		.policy = drm_attr_policy_query,
> +	},
> +	{
> +		.cmd = DRM_RAS_CMD_READ_ONE,
> +		.doit = drm_genl_read_error,
> +		.policy = drm_attr_policy_read_one,
> +	},
> +	{
> +		.cmd = DRM_RAS_CMD_READ_ALL,
> +		.doit = drm_genl_list_errors,
> +		.policy = drm_attr_policy_read_all,
> +	},
> +	{
> +		.cmd = DRM_RAS_CMD_READ_BLOCK,
> +		.doit = drm_genl_read_error,
> +		.policy = drm_attr_policy_read_one,
> +	},
> +
> +};
> +
> +static void drm_genl_family_init(struct drm_device *dev)
> +{
> +	dev->drm_genl_family = drmm_kzalloc(dev, sizeof(struct genl_family),
> +					    GFP_KERNEL);
> +
> +	/* Use drm primary node name eg: card0 to name the genl family */
> +	snprintf(dev->drm_genl_family->name, sizeof(dev->drm_genl_family->name),
> +		 "%s", dev->primary->kdev->kobj.name);

For the family name I believe we should use a 'drmras' prefix, then
the card minor number, then the group name.

For instance, but not necessarily suggesting xe to do it:

drmras-0-gt
drmras-0-soc

.....

driver can select their own name...
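
One thing to keep in mind with this scheme is that generic netlink family
names are limited to GENL_NAMSIZ (16 bytes including the terminating NUL).
A minimal userspace sketch of building and length-checking such a name
follows; `ras_family_name()` and `FAMILY_NAMSIZ` are hypothetical names made
up for this illustration, only the 16-byte limit comes from the genetlink
uapi header.

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

/* Mirrors GENL_NAMSIZ from <linux/genetlink.h>: 16 bytes including NUL. */
#define FAMILY_NAMSIZ 16

/*
 * Build "drmras-<minor>-<group>" into @buf.
 * Returns 0 on success, -1 if the name would not fit in @len bytes.
 */
static int ras_family_name(char *buf, size_t len, unsigned int minor,
			   const char *group)
{
	int n = snprintf(buf, len, "drmras-%u-%s", minor, group);

	/* snprintf reports the untruncated length; reject anything cut off. */
	return (n < 0 || (size_t)n >= len) ? -1 : 0;
}
```

"drmras-0-" already uses 9 of the 15 usable characters, so group names have
to stay short (6 characters with a one-digit minor, 4 with a three-digit
one), and registration should fail cleanly rather than silently truncate.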


> +	dev->drm_genl_family->version = DRM_GENL_VERSION;

I believe driver could control their own version so if something changes in
the group names for instance or supported commands they can change it.

> +	dev->drm_genl_family->parallel_ops = true;
> +	dev->drm_genl_family->ops = drm_genl_ops;
> +	dev->drm_genl_family->n_ops = ARRAY_SIZE(drm_genl_ops);
> +	dev->drm_genl_family->maxattr = DRM_ATTR_MAX;
> +	dev->drm_genl_family->module = dev->dev->driver->owner;
> +}
> +
> +static void drm_genl_deregister(struct drm_device *dev, void *arg)
> +{
> +	drm_dbg_driver(dev, "unregistering genl family %s\n", dev->drm_genl_family->name);
> +
> +	xa_erase(&drm_dev_xarray, dev->drm_genl_family->id);
> +
> +	genl_unregister_family(dev->drm_genl_family);
> +}
> +
> +/**
> + * drm_genl_register - Register genl family
> + * @dev: drm_device for which genl family needs to be registered
> + *
> + * RETURNS:
> + * 0 on success and negative error code on failure
> + */
> +int drm_genl_register(struct drm_device *dev)
> +{
> +	int ret;
> +
> +	drm_genl_family_init(dev);
> +
> +	ret = genl_register_family(dev->drm_genl_family);
> +	if (ret < 0) {
> +		drm_warn(dev, "genl family registration failed\n");
> +		return ret;
> +	}
> +
> +	drm_dbg_driver(dev, "genl family id %d and name %s\n", dev->drm_genl_family->id,
> +		       dev->drm_genl_family->name);
> +
> +	ret = xa_err(xa_store(&drm_dev_xarray, dev->drm_genl_family->id, dev, GFP_KERNEL));
> +	if (ret)
> +		goto genl_unregister;
> +
> +	ret = drmm_add_action_or_reset(dev, drm_genl_deregister, NULL);
> +
> +	return ret;
> +
> +genl_unregister:
> +	genl_unregister_family(dev->drm_genl_family);
> +	return ret;
> +}
> +
> +/**
> + * drm_genl_exit: destroy drm_dev_xarray
> + */
> +void drm_genl_exit(void)
> +{
> +	xa_destroy(&drm_dev_xarray);
> +}
> diff --git a/include/drm/drm_device.h b/include/drm/drm_device.h
> index 08b3b2467c4c..8b60a17e4156 100644
> --- a/include/drm/drm_device.h
> +++ b/include/drm/drm_device.h
> @@ -8,6 +8,7 @@
>  #include <linux/sched.h>
>  
>  #include <drm/drm_mode_config.h>
> +#include <drm/drm_netlink.h>
>  
>  struct drm_driver;
>  struct drm_minor;
> @@ -22,6 +23,8 @@ struct inode;
>  struct pci_dev;
>  struct pci_controller;
>  
> +struct genl_family;
> +
>  /*
>   * Recovery methods for wedged device in order of less to more side-effects.
>   * To be used with drm_dev_wedged_event() as recovery @method. Callers can
> @@ -356,6 +359,13 @@ struct drm_device {
>  	 * Root directory for debugfs files.
>  	 */
>  	struct dentry *debugfs_root;
> +
> +	/**
> +	 * @drm_genl_family:
> +	 *
> +	 * Generic netlink family registration structure.
> +	 */
> +	struct genl_family *drm_genl_family;

we should probably have this inside a struct drm_ras and without the 1-1
tie here


>  };
>  
>  void drm_dev_set_dma_dev(struct drm_device *dev, struct device *dma_dev);
> diff --git a/include/drm/drm_drv.h b/include/drm/drm_drv.h
> index 3f76a32d6b84..908888ac0db2 100644
> --- a/include/drm/drm_drv.h
> +++ b/include/drm/drm_drv.h
> @@ -431,6 +431,13 @@ struct drm_driver {
>  	 * some examples.
>  	 */
>  	const struct file_operations *fops;
> +
> +	/**
> +	 * @genl_ops:
> +	 *
> +	 * Drivers private callback to genl commands
> +	 */
> +	const struct driver_genl_ops *genl_ops;

The ops should be encapsulated in the drm_ras struct as well.
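
As a rough illustration of that encapsulation (a userspace mock, not a
proposal for the exact layout): the group name, version and callbacks travel
together in one node, and a device owns a list of nodes instead of exactly
one family. All names below (`mock_ras_node`, `mock_ras_ops`,
`mock_drm_dev`, `mock_ras_add_node()`, `demo_read_one()`) are hypothetical,
and the real callbacks would take sk_buff/genl_info arguments.

```c
#include <assert.h>
#include <stddef.h>

#define MOCK_RAS_MAX_NODES 4

struct mock_ras_node;

/* Per-node callbacks; simplified signature for the sketch. */
struct mock_ras_ops {
	int (*read_one)(struct mock_ras_node *node, unsigned int id,
			unsigned long long *value);
};

/* Hypothetical "struct drm_ras": name, version and ops travel together. */
struct mock_ras_node {
	const char *group;              /* e.g. "gt", "soc" */
	unsigned int version;           /* controlled by the driver */
	const struct mock_ras_ops *ops;
	void *priv;
};

/* A device owns a list of nodes rather than exactly one family. */
struct mock_drm_dev {
	struct mock_ras_node *nodes[MOCK_RAS_MAX_NODES];
	size_t n_nodes;
};

static int mock_ras_add_node(struct mock_drm_dev *dev,
			     struct mock_ras_node *node)
{
	if (!node || !node->ops || dev->n_nodes == MOCK_RAS_MAX_NODES)
		return -1;
	dev->nodes[dev->n_nodes++] = node;
	return 0;
}

/* Example callback: reports a made-up counter, for illustration only. */
static int demo_read_one(struct mock_ras_node *node, unsigned int id,
			 unsigned long long *value)
{
	(void)node;
	*value = id * 10ULL;
	return 0;
}

static const struct mock_ras_ops demo_ops = { .read_one = demo_read_one };
```

Nothing in struct drm_device or struct drm_driver needs to know about the
ops in this arrangement; they are reached only through the node.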

>  };
>  
>  void *__devm_drm_dev_alloc(struct device *parent,
> diff --git a/include/drm/drm_netlink.h b/include/drm/drm_netlink.h
> new file mode 100644
> index 000000000000..4a746222337a
> --- /dev/null
> +++ b/include/drm/drm_netlink.h
> @@ -0,0 +1,41 @@
> +/* SPDX-License-Identifier: MIT */
> +/*
> + * Copyright © 2023 Intel Corporation
> + */
> +
> +#ifndef __DRM_NETLINK_H__
> +#define __DRM_NETLINK_H__
> +
> +#include <linux/types.h>
> +
> +struct drm_device;
> +struct genl_info;
> +struct sk_buff;
> +
> +struct driver_genl_ops {
> +	int		       (*doit)(struct drm_device *dev,

when I first saw the doit I was going to complain about it,
until I learned this is part of netlink definition :)

> +				       struct sk_buff *skb,
> +				       struct genl_info *info);
> +};
> +
> +#if IS_ENABLED(CONFIG_NET)
> +int drm_genl_register(struct drm_device *dev);
> +void drm_genl_exit(void);
> +int drm_genl_reply(struct sk_buff *msg, struct genl_info *info, void *usrhdr);
> +struct sk_buff *
> +drm_genl_alloc_msg(struct drm_device *dev,
> +		   struct genl_info *info,
> +		   size_t msg_size, void **usrhdr);
> +#else
> +static inline int drm_genl_register(struct drm_device *dev) { return 0; }
> +static inline void drm_genl_exit(void) {}
> +static inline int drm_genl_reply(struct sk_buff *msg,
> +				 struct genl_info *info,
> +				 void *usrhdr) { return 0; }
> +static inline struct sk_buff *
> +drm_genl_alloc_msg(struct drm_device *dev,
> +		   struct genl_info *info,
> +		   size_t msg_size, void **usrhdr) { return NULL; }
> +#endif
> +
> +#endif
> diff --git a/include/uapi/drm/drm_netlink.h b/include/uapi/drm/drm_netlink.h
> new file mode 100644
> index 000000000000..58afb6e8d84a
> --- /dev/null
> +++ b/include/uapi/drm/drm_netlink.h
> @@ -0,0 +1,101 @@
> +/* SPDX-License-Identifier: MIT */
> +/*
> + * Copyright 2023 Intel Corporation
> + *
> + * Permission is hereby granted, free of charge, to any person obtaining a
> + * copy of this software and associated documentation files (the "Software"),
> + * to deal in the Software without restriction, including without limitation
> + * the rights to use, copy, modify, merge, publish, distribute, sublicense,
> + * and/or sell copies of the Software, and to permit persons to whom the
> + * Software is furnished to do so, subject to the following conditions:
> + *
> + * The above copyright notice and this permission notice (including the next
> + * paragraph) shall be included in all copies or substantial portions of the
> + * Software.
> + *
> + * THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR
> + * IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY,
> + * FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT.  IN NO EVENT SHALL
> + * VA LINUX SYSTEMS AND/OR ITS SUPPLIERS BE LIABLE FOR ANY CLAIM, DAMAGES OR
> + * OTHER LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE,
> + * ARISING FROM, OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR
> + * OTHER DEALINGS IN THE SOFTWARE.

This header kind of conflicts/overlaps the MIT SPDX above. We should remove it
and go only with the SPDX imho, unless I'm missing something

> + */
> +
> +#ifndef _DRM_NETLINK_H_
> +#define _DRM_NETLINK_H_
> +
> +#define DRM_GENL_VERSION 1
> +
> +#if defined(__cplusplus)
> +extern "C" {
> +#endif
> +
> +/**
> + * enum drm_genl_error_cmds - Supported error commands
> + *
> + */
> +enum drm_genl_error_cmds {
> +	DRM_CMD_UNSPEC,
> +	/**
> +	 * @DRM_RAS_CMD_QUERY: Command to list all errors names with config-id in verbose mode.
> +	 * In normal mode will list IP blocks, total instances available and error types supported
> +	 */
> +	DRM_RAS_CMD_QUERY,

here is the part where naming inconsistency is more visible, file has one
namespacing, struct has another, and command has even a third one.

drm_ras everywhere to solve this please.

> +	/** @DRM_RAS_CMD_READ_ONE: Command to get a counter for a specific error */
> +	DRM_RAS_CMD_READ_ONE,
> +	/** @DRM_RAS_CMD_READ_BLOCK: Command to get a counter of specific error type from an IP
> +	 * block
> +	 */
> +	DRM_RAS_CMD_READ_BLOCK,

Here is where I believe this API already shows that it is not
extensible: you had to create an argument to filter the type of errors
instead of declaring the errors per IP block, as the AMD folks had asked for.
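
A small userspace sketch of what "declaring the errors per IP block" could
look like, under my own assumptions: the types and helpers
(`mock_ras_error`, `mock_ras_block`, `mock_ras_find_block()`,
`mock_ras_block_total()`) are invented for this illustration and are not
taken from the xe or amdgpu code.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* One counter as a driver would declare it; names are illustrative. */
struct mock_ras_error {
	const char *name;
	unsigned long long count;
};

/* An IP block owns its error list, so no "block filter" command is needed. */
struct mock_ras_block {
	const char *name;
	struct mock_ras_error *errors;
	size_t n_errors;
};

/* Look a block up by name; returns NULL if the driver never declared it. */
static struct mock_ras_block *
mock_ras_find_block(struct mock_ras_block *blocks, size_t n, const char *name)
{
	for (size_t i = 0; i < n; i++)
		if (strcmp(blocks[i].name, name) == 0)
			return &blocks[i];
	return NULL;
}

/* Sum every counter declared under one block. */
static unsigned long long mock_ras_block_total(const struct mock_ras_block *b)
{
	unsigned long long total = 0;

	for (size_t i = 0; i < b->n_errors; i++)
		total += b->errors[i].count;
	return total;
}
```

Supporting a new IP block then means adding one table entry in the driver,
not a new command or a new filter argument in the uAPI.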

> +	/** @DRM_RAS_CMD_READ_ALL: Command to get counters of all errors */
> +	DRM_RAS_CMD_READ_ALL,
> +
> +	__DRM_CMD_MAX,
> +	DRM_CMD_MAX = __DRM_CMD_MAX - 1,
> +};
> +
> +enum drm_cmd_request_type {
> +	DRM_RAS_CMD_QUERY_VERBOSE = 1,
> +	DRM_RAS_CMD_QUERY_NORMAL = 2,
> +};

I don't understand why we need verbose vs normal. Perhaps this should
be a separate path, or could you explain it with examples?

It took me a while to realize that the drm_ras igt tool would only
list my available errors if I was using --verbose; otherwise we would
return at the beginning of the list_error functions in xe...

> +
> +/**
> + * enum drm_error_attr - Attributes to use with drm_genl_error_cmds
> + *
> + */
> +enum drm_error_attr {
> +	DRM_ATTR_UNSPEC,
> +	DRM_ATTR_PAD = DRM_ATTR_UNSPEC,
> +	/**
> +	 * @DRM_RAS_ATTR_QUERY: Should be used with DRM_RAS_CMD_QUERY,
> +	 * DRM_RAS_CMD_READ_ALL
> +	 */
> +	DRM_RAS_ATTR_QUERY, /* NLA_U8 */
> +	/**
> +	 * @DRM_RAS_ATTR_READ_ALL: Should be used with DRM_RAS_CMD_READ_ALL
> +	 */
> +	DRM_RAS_ATTR_READ_ALL, /* NLA_U8 */
> +	/**
> +	 * @DRM_RAS_ATTR_QUERY_REPLY: First Nested attributed sent as a
> +	 * response to DRM_RAS_CMD_QUERY, DRM_RAS_CMD_READ_ALL commands.
> +	 */
> +	DRM_RAS_ATTR_QUERY_REPLY, /* NLA_NESTED */
> +	/** @DRM_RAS_ATTR_ERROR_NAME: Used to pass error name */
> +	DRM_RAS_ATTR_ERROR_NAME, /* NLA_NUL_STRING */
> +	/** @DRM_RAS_ATTR_ERROR_ID: Used to pass error id, should be used with
> +	 * DRM_RAS_CMD_READ_ONE, DRM_RAS_CMD_READ_BLOCK
> +	 */
> +	DRM_RAS_ATTR_ERROR_ID, /* NLA_U64 */
> +	/** @DRM_RAS_ATTR_ERROR_VALUE: Used to pass error value */
> +	DRM_RAS_ATTR_ERROR_VALUE, /* NLA_U64 */


I'm also confused by all of the errors here and why we would need them,
and this also does not look extensible...

> +
> +	__DRM_ATTR_MAX,
> +	DRM_ATTR_MAX = __DRM_ATTR_MAX - 1,
> +};
> +
> +#if defined(__cplusplus)
> +}
> +#endif
> +
> +#endif
> -- 
> 2.25.1
> 

^ permalink raw reply	[flat|nested] 24+ messages in thread

* Re: [RFC v5 2/5] drm/xe/RAS: Register netlink capability
  2025-07-30  6:49 ` [RFC v5 2/5] drm/xe/RAS: Register netlink capability Aravind Iddamsetty
@ 2025-08-15 21:52   ` Rodrigo Vivi
  2025-08-26  9:01     ` Aravind Iddamsetty
  0 siblings, 1 reply; 24+ messages in thread
From: Rodrigo Vivi @ 2025-08-15 21:52 UTC (permalink / raw)
  To: Aravind Iddamsetty
  Cc: intel-xe, dri-devel, Alex Deucher, David Airlie, Simona Vetter,
	Joonas Lahtinen, Hawking Zhang, Lijo Lazar, Ruhl, Michael J,
	Riana Tauro, Anshuman Gupta

On Wed, Jul 30, 2025 at 12:19:53PM +0530, Aravind Iddamsetty wrote:
> Register netlink capability with the DRM and register the driver
> callbacks to DRM RAS netlink commands.
> 
> v2:
> Move the netlink registration parts to DRM subsystem (Tomer Tayar)
> 
> v3: compile only if CONFIG_NET is enabled
> 
> Reviewed-by: Michael J. Ruhl <michael.j.ruhl@intel.com> #v2
> Signed-off-by: Aravind Iddamsetty <aravind.iddamsetty@linux.intel.com>
> ---
>  drivers/gpu/drm/xe/Makefile          |  2 ++
>  drivers/gpu/drm/xe/xe_device.c       |  6 ++++++
>  drivers/gpu/drm/xe/xe_device_types.h |  1 +
>  drivers/gpu/drm/xe/xe_netlink.c      | 26 ++++++++++++++++++++++++++
>  4 files changed, 35 insertions(+)
>  create mode 100644 drivers/gpu/drm/xe/xe_netlink.c
> 
> diff --git a/drivers/gpu/drm/xe/Makefile b/drivers/gpu/drm/xe/Makefile
> index 80eecd35e807..e960c2dbe658 100644
> --- a/drivers/gpu/drm/xe/Makefile
> +++ b/drivers/gpu/drm/xe/Makefile
> @@ -304,6 +304,8 @@ xe-$(CONFIG_DRM_XE_DISPLAY) += \
>  	i915-display/skl_universal_plane.o \
>  	i915-display/skl_watermark.o
>  
> +xe-$(CONFIG_NET) += xe_netlink.o
> +
>  ifeq ($(CONFIG_ACPI),y)
>  	xe-$(CONFIG_DRM_XE_DISPLAY) += \
>  		i915-display/intel_acpi.o \
> diff --git a/drivers/gpu/drm/xe/xe_device.c b/drivers/gpu/drm/xe/xe_device.c
> index 806dbdf8118c..ca7a17c16aa5 100644
> --- a/drivers/gpu/drm/xe/xe_device.c
> +++ b/drivers/gpu/drm/xe/xe_device.c
> @@ -363,6 +363,8 @@ static const struct file_operations xe_driver_fops = {
>  	.fop_flags = FOP_UNSIGNED_OFFSET,
>  };
>  
> +extern const struct driver_genl_ops xe_genl_ops[];
> +
>  static struct drm_driver driver = {
>  	/* Don't use MTRRs here; the Xserver or userspace app should
>  	 * deal with them for Intel hardware.
> @@ -381,6 +383,10 @@ static struct drm_driver driver = {
>  #ifdef CONFIG_PROC_FS
>  	.show_fdinfo = xe_drm_client_fdinfo,
>  #endif
> +#ifdef CONFIG_NET
> +	.genl_ops = xe_genl_ops,
> +#endif
> +

We should definitely have a drm function to register it instead of hard-coding
it here, regardless of whether we go with the group split or not.
It is not okay to force this on every platform, even the ones without any RAS
available, for instance.

>  	.ioctls = xe_ioctls,
>  	.num_ioctls = ARRAY_SIZE(xe_ioctls),
>  	.fops = &xe_driver_fops,
> diff --git a/drivers/gpu/drm/xe/xe_device_types.h b/drivers/gpu/drm/xe/xe_device_types.h
> index 3a851c7a55dd..08d3e53e4b37 100644
> --- a/drivers/gpu/drm/xe/xe_device_types.h
> +++ b/drivers/gpu/drm/xe/xe_device_types.h
> @@ -10,6 +10,7 @@
>  
>  #include <drm/drm_device.h>
>  #include <drm/drm_file.h>
> +#include <drm/drm_netlink.h>
>  #include <drm/ttm/ttm_device.h>
>  
>  #include "xe_devcoredump_types.h"
> diff --git a/drivers/gpu/drm/xe/xe_netlink.c b/drivers/gpu/drm/xe/xe_netlink.c
> new file mode 100644
> index 000000000000..9e588fb19631
> --- /dev/null
> +++ b/drivers/gpu/drm/xe/xe_netlink.c
> @@ -0,0 +1,26 @@
> +// SPDX-License-Identifier: MIT
> +/*
> + * Copyright © 2023 Intel Corporation
> + */
> +
> +#include <net/genetlink.h>
> +#include <uapi/drm/drm_netlink.h>
> +
> +#include "xe_device.h"
> +
> +static int xe_genl_list_errors(struct drm_device *drm, struct sk_buff *msg, struct genl_info *info)
> +{
> +	return 0;
> +}
> +
> +static int xe_genl_read_error(struct drm_device *drm, struct sk_buff *msg, struct genl_info *info)
> +{
> +	return 0;
> +}
> +
> +/* driver callbacks to DRM netlink commands*/
> +const struct driver_genl_ops xe_genl_ops[] = {
> +	[DRM_RAS_CMD_QUERY] =		{ .doit = xe_genl_list_errors },
> +	[DRM_RAS_CMD_READ_ONE] =	{ .doit = xe_genl_read_error },
> +	[DRM_RAS_CMD_READ_ALL] =	{ .doit = xe_genl_list_errors, },
> +};

This is another place that looks strange: you declare it here and drm
magically uses it. Another reason for a more explicit registration,
with a struct drm_ras that carries these commands as well as the
group name, etc.

> -- 
> 2.25.1
> 

^ permalink raw reply	[flat|nested] 24+ messages in thread

* Re: [RFC v5 3/5] drm/xe/RAS: Expose the error counters
  2025-07-30  6:49 ` [RFC v5 3/5] drm/xe/RAS: Expose the error counters Aravind Iddamsetty
@ 2025-08-15 21:58   ` Rodrigo Vivi
  2025-08-26  9:26     ` Aravind Iddamsetty
  0 siblings, 1 reply; 24+ messages in thread
From: Rodrigo Vivi @ 2025-08-15 21:58 UTC (permalink / raw)
  To: Aravind Iddamsetty
  Cc: intel-xe, dri-devel, Alex Deucher, David Airlie, Simona Vetter,
	Joonas Lahtinen, Hawking Zhang, Lijo Lazar, Ruhl, Michael J,
	Riana Tauro, Anshuman Gupta

On Wed, Jul 30, 2025 at 12:19:54PM +0530, Aravind Iddamsetty wrote:
> We expose the various error counters supported on a hardware via genl
> subsystem through the registered commands to userspace. The
> DRM_RAS_CMD_QUERY lists the error names with config id,
> DRM_RAS_CMD_READ_ONE returns the counter value for the requested config
> id and the DRM_RAS_CMD_READ_ALL lists the counters for all errors along
> with their names and config ids.
> 
> v2: Rebase
> 
> v3:
> 1. presently xe_list_errors fills blank data for IGFX, prevent it by
> having an early check of IS_DGFX (Michael J. Ruhl)
> 2. update errors from all sources
> 
> v4: Check QUERY param; if it is normal, return not supported.
> 
> Reviewed-by: Michael J. Ruhl <michael.j.ruhl@intel.com>
> Signed-off-by: Aravind Iddamsetty <aravind.iddamsetty@linux.intel.com>
> ---
>  drivers/gpu/drm/xe/xe_hw_error.c |  15 +-
>  drivers/gpu/drm/xe/xe_netlink.c  | 509 ++++++++++++++++++++++++++++++-
>  include/uapi/drm/xe_drm.h        |  85 ++++++
>  3 files changed, 602 insertions(+), 7 deletions(-)
> 
> diff --git a/drivers/gpu/drm/xe/xe_hw_error.c b/drivers/gpu/drm/xe/xe_hw_error.c
> index 6a7cd59caac1..bdd9c88674b2 100644
> --- a/drivers/gpu/drm/xe/xe_hw_error.c
> +++ b/drivers/gpu/drm/xe/xe_hw_error.c
> @@ -531,16 +531,21 @@ static void xe_clear_all_soc_errors(struct xe_device *xe)
>  
>  		while (hw_err < HARDWARE_ERROR_MAX) {
>  			for (i = 0; i < XE_SOC_NUM_IEH; i++)
> -				xe_mmio_write32(&gt->tile->mmio, SOC_GSYSEVTCTL_REG(base, slave_base, i),
> +				xe_mmio_write32(&gt->tile->mmio,
> +						SOC_GSYSEVTCTL_REG(base, slave_base, i),
>  						~REG_BIT(hw_err));
>  
> -			xe_mmio_write32(&gt->tile->mmio, SOC_GLOBAL_ERR_STAT_MASTER_REG(base, hw_err),
> +			xe_mmio_write32(&gt->tile->mmio,
> +					SOC_GLOBAL_ERR_STAT_MASTER_REG(base, hw_err),
>  					REG_GENMASK(31, 0));
> -			xe_mmio_write32(&gt->tile->mmio, SOC_LOCAL_ERR_STAT_MASTER_REG(base, hw_err),
> +			xe_mmio_write32(&gt->tile->mmio,
> +					SOC_LOCAL_ERR_STAT_MASTER_REG(base, hw_err),
>  					REG_GENMASK(31, 0));
> -			xe_mmio_write32(&gt->tile->mmio, SOC_GLOBAL_ERR_STAT_SLAVE_REG(slave_base, hw_err),
> +			xe_mmio_write32(&gt->tile->mmio,
> +					SOC_GLOBAL_ERR_STAT_SLAVE_REG(slave_base, hw_err),
>  					REG_GENMASK(31, 0));
> -			xe_mmio_write32(&gt->tile->mmio, SOC_LOCAL_ERR_STAT_SLAVE_REG(slave_base, hw_err),
> +			xe_mmio_write32(&gt->tile->mmio,
> +					SOC_LOCAL_ERR_STAT_SLAVE_REG(slave_base, hw_err),
>  					REG_GENMASK(31, 0));

Probably a fixup for the patch in the other series?
Btw, the dependency was hard to understand. We should make this series independent
of the other one.

>  			hw_err++;
>  		}
> diff --git a/drivers/gpu/drm/xe/xe_netlink.c b/drivers/gpu/drm/xe/xe_netlink.c
> index 9e588fb19631..20240875284a 100644
> --- a/drivers/gpu/drm/xe/xe_netlink.c
> +++ b/drivers/gpu/drm/xe/xe_netlink.c
> @@ -6,16 +6,521 @@
>  #include <net/genetlink.h>
>  #include <uapi/drm/drm_netlink.h>
>  
> +#include <drm/xe_drm.h>
> +
>  #include "xe_device.h"
>  
> -static int xe_genl_list_errors(struct drm_device *drm, struct sk_buff *msg, struct genl_info *info)
> +#define MAX_ERROR_NAME	100
> +
> +static const char * const xe_hw_error_events[] = {
> +		[DRM_XE_GENL_GT_ERROR_CORRECTABLE_L3_SNG] = "correctable-l3-sng",
> +		[DRM_XE_GENL_GT_ERROR_CORRECTABLE_GUC] = "correctable-guc",
> +		[DRM_XE_GENL_GT_ERROR_CORRECTABLE_SAMPLER] = "correctable-sampler",
> +		[DRM_XE_GENL_GT_ERROR_CORRECTABLE_SLM] = "correctable-slm",
> +		[DRM_XE_GENL_GT_ERROR_CORRECTABLE_EU_IC] = "correctable-eu-ic",
> +		[DRM_XE_GENL_GT_ERROR_CORRECTABLE_EU_GRF] = "correctable-eu-grf",
> +		[DRM_XE_GENL_GT_ERROR_FATAL_ARR_BIST] = "fatal-array-bist",
> +		[DRM_XE_GENL_GT_ERROR_FATAL_L3_DOUB] = "fatal-l3-double",
> +		[DRM_XE_GENL_GT_ERROR_FATAL_L3_ECC_CHK] = "fatal-l3-ecc-checker",
> +		[DRM_XE_GENL_GT_ERROR_FATAL_GUC] = "fatal-guc",
> +		[DRM_XE_GENL_GT_ERROR_FATAL_IDI_PAR] = "fatal-idi-parity",
> +		[DRM_XE_GENL_GT_ERROR_FATAL_SQIDI] = "fatal-sqidi",
> +		[DRM_XE_GENL_GT_ERROR_FATAL_SAMPLER] = "fatal-sampler",
> +		[DRM_XE_GENL_GT_ERROR_FATAL_SLM] = "fatal-slm",
> +		[DRM_XE_GENL_GT_ERROR_FATAL_EU_IC] = "fatal-eu-ic",
> +		[DRM_XE_GENL_GT_ERROR_FATAL_EU_GRF] = "fatal-eu-grf",
> +		[DRM_XE_GENL_GT_ERROR_FATAL_FPU] = "fatal-fpu",
> +		[DRM_XE_GENL_GT_ERROR_FATAL_TLB] = "fatal-tlb",
> +		[DRM_XE_GENL_GT_ERROR_FATAL_L3_FABRIC] = "fatal-l3-fabric",
> +		[DRM_XE_GENL_GT_ERROR_CORRECTABLE_SUBSLICE] = "correctable-subslice",
> +		[DRM_XE_GENL_GT_ERROR_CORRECTABLE_L3BANK] = "correctable-l3bank",
> +		[DRM_XE_GENL_GT_ERROR_FATAL_SUBSLICE] = "fatal-subslice",
> +		[DRM_XE_GENL_GT_ERROR_FATAL_L3BANK] = "fatal-l3bank",
> +		[DRM_XE_GENL_SGUNIT_ERROR_CORRECTABLE] = "sgunit-correctable",
> +		[DRM_XE_GENL_SGUNIT_ERROR_NONFATAL] = "sgunit-nonfatal",
> +		[DRM_XE_GENL_SGUNIT_ERROR_FATAL] = "sgunit-fatal",
> +		[DRM_XE_GENL_SOC_ERROR_NONFATAL_CSC_PSF_CMD] = "soc-nonfatal-csc-psf-cmd-parity",
> +		[DRM_XE_GENL_SOC_ERROR_NONFATAL_CSC_PSF_CMP] = "soc-nonfatal-csc-psf-unexpected-completion",
> +		[DRM_XE_GENL_SOC_ERROR_NONFATAL_CSC_PSF_REQ] = "soc-nonfatal-csc-psf-unsupported-request",
> +		[DRM_XE_GENL_SOC_ERROR_NONFATAL_ANR_MDFI] = "soc-nonfatal-anr-mdfi",
> +		[DRM_XE_GENL_SOC_ERROR_NONFATAL_MDFI_T2T] = "soc-nonfatal-mdfi-t2t",
> +		[DRM_XE_GENL_SOC_ERROR_NONFATAL_MDFI_T2C] = "soc-nonfatal-mdfi-t2c",
> +		[DRM_XE_GENL_SOC_ERROR_NONFATAL_HBM(0, 0)] = "soc-nonfatal-hbm-ss0-0",
> +		[DRM_XE_GENL_SOC_ERROR_NONFATAL_HBM(0, 1)] = "soc-nonfatal-hbm-ss0-1",
> +		[DRM_XE_GENL_SOC_ERROR_NONFATAL_HBM(0, 2)] = "soc-nonfatal-hbm-ss0-2",
> +		[DRM_XE_GENL_SOC_ERROR_NONFATAL_HBM(0, 3)] = "soc-nonfatal-hbm-ss0-3",
> +		[DRM_XE_GENL_SOC_ERROR_NONFATAL_HBM(0, 4)] = "soc-nonfatal-hbm-ss0-4",
> +		[DRM_XE_GENL_SOC_ERROR_NONFATAL_HBM(0, 5)] = "soc-nonfatal-hbm-ss0-5",
> +		[DRM_XE_GENL_SOC_ERROR_NONFATAL_HBM(0, 6)] = "soc-nonfatal-hbm-ss0-6",
> +		[DRM_XE_GENL_SOC_ERROR_NONFATAL_HBM(0, 7)] = "soc-nonfatal-hbm-ss0-7",
> +		[DRM_XE_GENL_SOC_ERROR_NONFATAL_HBM(0, 8)] = "soc-nonfatal-hbm-ss1-0",
> +		[DRM_XE_GENL_SOC_ERROR_NONFATAL_HBM(0, 9)] = "soc-nonfatal-hbm-ss1-1",
> +		[DRM_XE_GENL_SOC_ERROR_NONFATAL_HBM(0, 10)] = "soc-nonfatal-hbm-ss1-2",
> +		[DRM_XE_GENL_SOC_ERROR_NONFATAL_HBM(0, 11)] = "soc-nonfatal-hbm-ss1-3",
> +		[DRM_XE_GENL_SOC_ERROR_NONFATAL_HBM(0, 12)] = "soc-nonfatal-hbm-ss1-4",
> +		[DRM_XE_GENL_SOC_ERROR_NONFATAL_HBM(0, 13)] = "soc-nonfatal-hbm-ss1-5",
> +		[DRM_XE_GENL_SOC_ERROR_NONFATAL_HBM(0, 14)] = "soc-nonfatal-hbm-ss1-6",
> +		[DRM_XE_GENL_SOC_ERROR_NONFATAL_HBM(0, 15)] = "soc-nonfatal-hbm-ss1-7",
> +		[DRM_XE_GENL_SOC_ERROR_NONFATAL_HBM(1, 0)] = "soc-nonfatal-hbm-ss2-0",
> +		[DRM_XE_GENL_SOC_ERROR_NONFATAL_HBM(1, 1)] = "soc-nonfatal-hbm-ss2-1",
> +		[DRM_XE_GENL_SOC_ERROR_NONFATAL_HBM(1, 2)] = "soc-nonfatal-hbm-ss2-2",
> +		[DRM_XE_GENL_SOC_ERROR_NONFATAL_HBM(1, 3)] = "soc-nonfatal-hbm-ss2-3",
> +		[DRM_XE_GENL_SOC_ERROR_NONFATAL_HBM(1, 4)] = "soc-nonfatal-hbm-ss2-4",
> +		[DRM_XE_GENL_SOC_ERROR_NONFATAL_HBM(1, 5)] = "soc-nonfatal-hbm-ss2-5",
> +		[DRM_XE_GENL_SOC_ERROR_NONFATAL_HBM(1, 6)] = "soc-nonfatal-hbm-ss2-6",
> +		[DRM_XE_GENL_SOC_ERROR_NONFATAL_HBM(1, 7)] = "soc-nonfatal-hbm-ss2-7",
> +		[DRM_XE_GENL_SOC_ERROR_NONFATAL_HBM(1, 8)] = "soc-nonfatal-hbm-ss3-0",
> +		[DRM_XE_GENL_SOC_ERROR_NONFATAL_HBM(1, 9)] = "soc-nonfatal-hbm-ss3-1",
> +		[DRM_XE_GENL_SOC_ERROR_NONFATAL_HBM(1, 10)] = "soc-nonfatal-hbm-ss3-2",
> +		[DRM_XE_GENL_SOC_ERROR_NONFATAL_HBM(1, 11)] = "soc-nonfatal-hbm-ss3-3",
> +		[DRM_XE_GENL_SOC_ERROR_NONFATAL_HBM(1, 12)] = "soc-nonfatal-hbm-ss3-4",
> +		[DRM_XE_GENL_SOC_ERROR_NONFATAL_HBM(1, 13)] = "soc-nonfatal-hbm-ss3-5",
> +		[DRM_XE_GENL_SOC_ERROR_NONFATAL_HBM(1, 14)] = "soc-nonfatal-hbm-ss3-6",
> +		[DRM_XE_GENL_SOC_ERROR_NONFATAL_HBM(1, 15)] = "soc-nonfatal-hbm-ss3-7",
> +		[DRM_XE_GENL_SOC_ERROR_FATAL_CSC_PSF_CMD] = "soc-fatal-csc-psf-cmd-parity",
> +		[DRM_XE_GENL_SOC_ERROR_FATAL_CSC_PSF_CMP] = "soc-fatal-csc-psf-unexpected-completion",
> +		[DRM_XE_GENL_SOC_ERROR_FATAL_CSC_PSF_REQ] = "soc-fatal-csc-psf-unsupported-request",
> +		[DRM_XE_GENL_SOC_ERROR_FATAL_PUNIT] = "soc-fatal-punit",
> +		[DRM_XE_GENL_SOC_ERROR_FATAL_PCIE_PSF_CMD] = "soc-fatal-pcie-psf-command-parity",
> +		[DRM_XE_GENL_SOC_ERROR_FATAL_PCIE_PSF_CMP] = "soc-fatal-pcie-psf-unexpected-completion",
> +		[DRM_XE_GENL_SOC_ERROR_FATAL_PCIE_PSF_REQ] = "soc-fatal-pcie-psf-unsupported-request",
> +		[DRM_XE_GENL_SOC_ERROR_FATAL_ANR_MDFI] = "soc-fatal-anr-mdfi",
> +		[DRM_XE_GENL_SOC_ERROR_FATAL_MDFI_T2T] = "soc-fatal-mdfi-t2t",
> +		[DRM_XE_GENL_SOC_ERROR_FATAL_MDFI_T2C] = "soc-fatal-mdfi-t2c",
> +		[DRM_XE_GENL_SOC_ERROR_FATAL_PCIE_AER] = "soc-fatal-malformed-pcie-aer",
> +		[DRM_XE_GENL_SOC_ERROR_FATAL_PCIE_ERR] = "soc-fatal-malformed-pcie-err",
> +		[DRM_XE_GENL_SOC_ERROR_FATAL_UR_COND] = "soc-fatal-ur-condition-ieh",
> +		[DRM_XE_GENL_SOC_ERROR_FATAL_SERR_SRCS] = "soc-fatal-from-serr-sources",
> +		[DRM_XE_GENL_SOC_ERROR_FATAL_HBM(0, 0)] = "soc-fatal-hbm-ss0-0",
> +		[DRM_XE_GENL_SOC_ERROR_FATAL_HBM(0, 1)] = "soc-fatal-hbm-ss0-1",
> +		[DRM_XE_GENL_SOC_ERROR_FATAL_HBM(0, 2)] = "soc-fatal-hbm-ss0-2",
> +		[DRM_XE_GENL_SOC_ERROR_FATAL_HBM(0, 3)] = "soc-fatal-hbm-ss0-3",
> +		[DRM_XE_GENL_SOC_ERROR_FATAL_HBM(0, 4)] = "soc-fatal-hbm-ss0-4",
> +		[DRM_XE_GENL_SOC_ERROR_FATAL_HBM(0, 5)] = "soc-fatal-hbm-ss0-5",
> +		[DRM_XE_GENL_SOC_ERROR_FATAL_HBM(0, 6)] = "soc-fatal-hbm-ss0-6",
> +		[DRM_XE_GENL_SOC_ERROR_FATAL_HBM(0, 7)] = "soc-fatal-hbm-ss0-7",
> +		[DRM_XE_GENL_SOC_ERROR_FATAL_HBM(0, 8)] = "soc-fatal-hbm-ss1-0",
> +		[DRM_XE_GENL_SOC_ERROR_FATAL_HBM(0, 9)] = "soc-fatal-hbm-ss1-1",
> +		[DRM_XE_GENL_SOC_ERROR_FATAL_HBM(0, 10)] = "soc-fatal-hbm-ss1-2",
> +		[DRM_XE_GENL_SOC_ERROR_FATAL_HBM(0, 11)] = "soc-fatal-hbm-ss1-3",
> +		[DRM_XE_GENL_SOC_ERROR_FATAL_HBM(0, 12)] = "soc-fatal-hbm-ss1-4",
> +		[DRM_XE_GENL_SOC_ERROR_FATAL_HBM(0, 13)] = "soc-fatal-hbm-ss1-5",
> +		[DRM_XE_GENL_SOC_ERROR_FATAL_HBM(0, 14)] = "soc-fatal-hbm-ss1-6",
> +		[DRM_XE_GENL_SOC_ERROR_FATAL_HBM(0, 15)] = "soc-fatal-hbm-ss1-7",
> +		[DRM_XE_GENL_SOC_ERROR_FATAL_HBM(1, 0)] = "soc-fatal-hbm-ss2-0",
> +		[DRM_XE_GENL_SOC_ERROR_FATAL_HBM(1, 1)] = "soc-fatal-hbm-ss2-1",
> +		[DRM_XE_GENL_SOC_ERROR_FATAL_HBM(1, 2)] = "soc-fatal-hbm-ss2-2",
> +		[DRM_XE_GENL_SOC_ERROR_FATAL_HBM(1, 3)] = "soc-fatal-hbm-ss2-3",
> +		[DRM_XE_GENL_SOC_ERROR_FATAL_HBM(1, 4)] = "soc-fatal-hbm-ss2-4",
> +		[DRM_XE_GENL_SOC_ERROR_FATAL_HBM(1, 5)] = "soc-fatal-hbm-ss2-5",
> +		[DRM_XE_GENL_SOC_ERROR_FATAL_HBM(1, 6)] = "soc-fatal-hbm-ss2-6",
> +		[DRM_XE_GENL_SOC_ERROR_FATAL_HBM(1, 7)] = "soc-fatal-hbm-ss2-7",
> +		[DRM_XE_GENL_SOC_ERROR_FATAL_HBM(1, 8)] = "soc-fatal-hbm-ss3-0",
> +		[DRM_XE_GENL_SOC_ERROR_FATAL_HBM(1, 9)] = "soc-fatal-hbm-ss3-1",
> +		[DRM_XE_GENL_SOC_ERROR_FATAL_HBM(1, 10)] = "soc-fatal-hbm-ss3-2",
> +		[DRM_XE_GENL_SOC_ERROR_FATAL_HBM(1, 11)] = "soc-fatal-hbm-ss3-3",
> +		[DRM_XE_GENL_SOC_ERROR_FATAL_HBM(1, 12)] = "soc-fatal-hbm-ss3-4",
> +		[DRM_XE_GENL_SOC_ERROR_FATAL_HBM(1, 13)] = "soc-fatal-hbm-ss3-5",
> +		[DRM_XE_GENL_SOC_ERROR_FATAL_HBM(1, 14)] = "soc-fatal-hbm-ss3-6",
> +		[DRM_XE_GENL_SOC_ERROR_FATAL_HBM(1, 15)] = "soc-fatal-hbm-ss3-7",
> +		[DRM_XE_GENL_GSC_ERROR_CORRECTABLE_SRAM_ECC] = "gsc-correctable-sram-ecc",
> +		[DRM_XE_GENL_GSC_ERROR_NONFATAL_MIA_SHUTDOWN] = "gsc-nonfatal-mia-shutdown",
> +		[DRM_XE_GENL_GSC_ERROR_NONFATAL_MIA_INTERNAL] = "gsc-nonfatal-mia-internal",
> +		[DRM_XE_GENL_GSC_ERROR_NONFATAL_SRAM_ECC] = "gsc-nonfatal-sram-ecc",
> +		[DRM_XE_GENL_GSC_ERROR_NONFATAL_WDG_TIMEOUT] = "gsc-nonfatal-wdg-timeout",
> +		[DRM_XE_GENL_GSC_ERROR_NONFATAL_ROM_PARITY] = "gsc-nonfatal-rom-parity",
> +		[DRM_XE_GENL_GSC_ERROR_NONFATAL_UCODE_PARITY] = "gsc-nonfatal-ucode-parity",
> +		[DRM_XE_GENL_GSC_ERROR_NONFATAL_VLT_GLITCH] = "gsc-nonfatal-vlt-glitch",
> +		[DRM_XE_GENL_GSC_ERROR_NONFATAL_FUSE_PULL] = "gsc-nonfatal-fuse-pull",
> +		[DRM_XE_GENL_GSC_ERROR_NONFATAL_FUSE_CRC_CHECK] = "gsc-nonfatal-fuse-crc-check",
> +		[DRM_XE_GENL_GSC_ERROR_NONFATAL_SELF_MBIST] = "gsc-nonfatal-self-mbist",
> +		[DRM_XE_GENL_GSC_ERROR_NONFATAL_AON_RF_PARITY] = "gsc-nonfatal-aon-parity",
> +		[DRM_XE_GENL_SGGI_ERROR_NONFATAL] = "sggi-nonfatal-data-parity",
> +		[DRM_XE_GENL_SGLI_ERROR_NONFATAL] = "sgli-nonfatal-data-parity",
> +		[DRM_XE_GENL_SGCI_ERROR_NONFATAL] = "sgci-nonfatal-data-parity",
> +		[DRM_XE_GENL_MERT_ERROR_NONFATAL] = "mert-nonfatal-data-parity",
> +		[DRM_XE_GENL_SGGI_ERROR_FATAL] = "sggi-fatal-data-parity",
> +		[DRM_XE_GENL_SGLI_ERROR_FATAL] = "sgli-fatal-data-parity",
> +		[DRM_XE_GENL_SGCI_ERROR_FATAL] = "sgci-fatal-data-parity",
> +		[DRM_XE_GENL_MERT_ERROR_FATAL] = "mert-fatal-data-parity",
> +};
> +
> +static const unsigned long xe_hw_error_map[] = {
> +	[DRM_XE_GENL_GT_ERROR_CORRECTABLE_L3_SNG] = XE_HW_ERR_GT_CORR_L3_SNG,
> +	[DRM_XE_GENL_GT_ERROR_CORRECTABLE_GUC] = XE_HW_ERR_GT_CORR_GUC,
> +	[DRM_XE_GENL_GT_ERROR_CORRECTABLE_SAMPLER] = XE_HW_ERR_GT_CORR_SAMPLER,
> +	[DRM_XE_GENL_GT_ERROR_CORRECTABLE_SLM] = XE_HW_ERR_GT_CORR_SLM,
> +	[DRM_XE_GENL_GT_ERROR_CORRECTABLE_EU_IC] = XE_HW_ERR_GT_CORR_EU_IC,
> +	[DRM_XE_GENL_GT_ERROR_CORRECTABLE_EU_GRF] = XE_HW_ERR_GT_CORR_EU_GRF,
> +	[DRM_XE_GENL_GT_ERROR_FATAL_ARR_BIST] = XE_HW_ERR_GT_FATAL_ARR_BIST,
> +	[DRM_XE_GENL_GT_ERROR_FATAL_L3_DOUB] = XE_HW_ERR_GT_FATAL_L3_DOUB,
> +	[DRM_XE_GENL_GT_ERROR_FATAL_L3_ECC_CHK] = XE_HW_ERR_GT_FATAL_L3_ECC_CHK,
> +	[DRM_XE_GENL_GT_ERROR_FATAL_GUC] = XE_HW_ERR_GT_FATAL_GUC,
> +	[DRM_XE_GENL_GT_ERROR_FATAL_IDI_PAR] = XE_HW_ERR_GT_FATAL_IDI_PAR,
> +	[DRM_XE_GENL_GT_ERROR_FATAL_SQIDI] = XE_HW_ERR_GT_FATAL_SQIDI,
> +	[DRM_XE_GENL_GT_ERROR_FATAL_SAMPLER] = XE_HW_ERR_GT_FATAL_SAMPLER,
> +	[DRM_XE_GENL_GT_ERROR_FATAL_SLM] = XE_HW_ERR_GT_FATAL_SLM,
> +	[DRM_XE_GENL_GT_ERROR_FATAL_EU_IC] = XE_HW_ERR_GT_FATAL_EU_IC,
> +	[DRM_XE_GENL_GT_ERROR_FATAL_EU_GRF] = XE_HW_ERR_GT_FATAL_EU_GRF,
> +	[DRM_XE_GENL_GT_ERROR_FATAL_FPU] = XE_HW_ERR_GT_FATAL_FPU,
> +	[DRM_XE_GENL_GT_ERROR_FATAL_TLB] = XE_HW_ERR_GT_FATAL_TLB,
> +	[DRM_XE_GENL_GT_ERROR_FATAL_L3_FABRIC] = XE_HW_ERR_GT_FATAL_L3_FABRIC,
> +	[DRM_XE_GENL_GT_ERROR_CORRECTABLE_SUBSLICE] = XE_HW_ERR_GT_CORR_SUBSLICE,
> +	[DRM_XE_GENL_GT_ERROR_CORRECTABLE_L3BANK] = XE_HW_ERR_GT_CORR_L3BANK,
> +	[DRM_XE_GENL_GT_ERROR_FATAL_SUBSLICE] = XE_HW_ERR_GT_FATAL_SUBSLICE,
> +	[DRM_XE_GENL_GT_ERROR_FATAL_L3BANK] = XE_HW_ERR_GT_FATAL_L3BANK,
> +	[DRM_XE_GENL_SGUNIT_ERROR_CORRECTABLE] = XE_HW_ERR_TILE_CORR_SGUNIT,
> +	[DRM_XE_GENL_SGUNIT_ERROR_NONFATAL] = XE_HW_ERR_TILE_NONFATAL_SGUNIT,
> +	[DRM_XE_GENL_SGUNIT_ERROR_FATAL] = XE_HW_ERR_TILE_FATAL_SGUNIT,
> +	[DRM_XE_GENL_SOC_ERROR_NONFATAL_CSC_PSF_CMD] = XE_HW_ERR_SOC_NONFATAL_CSC_PSF_CMD,
> +	[DRM_XE_GENL_SOC_ERROR_NONFATAL_CSC_PSF_CMP] = XE_HW_ERR_SOC_NONFATAL_CSC_PSF_CMP,
> +	[DRM_XE_GENL_SOC_ERROR_NONFATAL_CSC_PSF_REQ] = XE_HW_ERR_SOC_NONFATAL_CSC_PSF_REQ,
> +	[DRM_XE_GENL_SOC_ERROR_NONFATAL_ANR_MDFI] = XE_HW_ERR_SOC_NONFATAL_ANR_MDFI,
> +	[DRM_XE_GENL_SOC_ERROR_NONFATAL_MDFI_T2T] = XE_HW_ERR_SOC_NONFATAL_MDFI_T2T,
> +	[DRM_XE_GENL_SOC_ERROR_NONFATAL_MDFI_T2C] = XE_HW_ERR_SOC_NONFATAL_MDFI_T2C,
> +	[DRM_XE_GENL_SOC_ERROR_NONFATAL_HBM(0, 0)] = XE_HW_ERR_SOC_NONFATAL_HBM0_CHNL0,
> +	[DRM_XE_GENL_SOC_ERROR_NONFATAL_HBM(0, 1)] = XE_HW_ERR_SOC_NONFATAL_HBM0_CHNL1,
> +	[DRM_XE_GENL_SOC_ERROR_NONFATAL_HBM(0, 2)] = XE_HW_ERR_SOC_NONFATAL_HBM0_CHNL2,
> +	[DRM_XE_GENL_SOC_ERROR_NONFATAL_HBM(0, 3)] = XE_HW_ERR_SOC_NONFATAL_HBM0_CHNL3,
> +	[DRM_XE_GENL_SOC_ERROR_NONFATAL_HBM(0, 4)] = XE_HW_ERR_SOC_NONFATAL_HBM0_CHNL4,
> +	[DRM_XE_GENL_SOC_ERROR_NONFATAL_HBM(0, 5)] = XE_HW_ERR_SOC_NONFATAL_HBM0_CHNL5,
> +	[DRM_XE_GENL_SOC_ERROR_NONFATAL_HBM(0, 6)] = XE_HW_ERR_SOC_NONFATAL_HBM0_CHNL6,
> +	[DRM_XE_GENL_SOC_ERROR_NONFATAL_HBM(0, 7)] = XE_HW_ERR_SOC_NONFATAL_HBM0_CHNL7,
> +	[DRM_XE_GENL_SOC_ERROR_NONFATAL_HBM(0, 8)] = XE_HW_ERR_SOC_NONFATAL_HBM1_CHNL0,
> +	[DRM_XE_GENL_SOC_ERROR_NONFATAL_HBM(0, 9)] = XE_HW_ERR_SOC_NONFATAL_HBM1_CHNL1,
> +	[DRM_XE_GENL_SOC_ERROR_NONFATAL_HBM(0, 10)] = XE_HW_ERR_SOC_NONFATAL_HBM1_CHNL2,
> +	[DRM_XE_GENL_SOC_ERROR_NONFATAL_HBM(0, 11)] = XE_HW_ERR_SOC_NONFATAL_HBM1_CHNL3,
> +	[DRM_XE_GENL_SOC_ERROR_NONFATAL_HBM(0, 12)] = XE_HW_ERR_SOC_NONFATAL_HBM1_CHNL4,
> +	[DRM_XE_GENL_SOC_ERROR_NONFATAL_HBM(0, 13)] = XE_HW_ERR_SOC_NONFATAL_HBM1_CHNL5,
> +	[DRM_XE_GENL_SOC_ERROR_NONFATAL_HBM(0, 14)] = XE_HW_ERR_SOC_NONFATAL_HBM1_CHNL6,
> +	[DRM_XE_GENL_SOC_ERROR_NONFATAL_HBM(0, 15)] = XE_HW_ERR_SOC_NONFATAL_HBM1_CHNL7,
> +	[DRM_XE_GENL_SOC_ERROR_NONFATAL_HBM(1, 0)] = XE_HW_ERR_SOC_NONFATAL_HBM2_CHNL0,
> +	[DRM_XE_GENL_SOC_ERROR_NONFATAL_HBM(1, 1)] = XE_HW_ERR_SOC_NONFATAL_HBM2_CHNL1,
> +	[DRM_XE_GENL_SOC_ERROR_NONFATAL_HBM(1, 2)] = XE_HW_ERR_SOC_NONFATAL_HBM2_CHNL2,
> +	[DRM_XE_GENL_SOC_ERROR_NONFATAL_HBM(1, 3)] = XE_HW_ERR_SOC_NONFATAL_HBM2_CHNL3,
> +	[DRM_XE_GENL_SOC_ERROR_NONFATAL_HBM(1, 4)] = XE_HW_ERR_SOC_NONFATAL_HBM2_CHNL4,
> +	[DRM_XE_GENL_SOC_ERROR_NONFATAL_HBM(1, 5)] = XE_HW_ERR_SOC_NONFATAL_HBM2_CHNL5,
> +	[DRM_XE_GENL_SOC_ERROR_NONFATAL_HBM(1, 6)] = XE_HW_ERR_SOC_NONFATAL_HBM2_CHNL6,
> +	[DRM_XE_GENL_SOC_ERROR_NONFATAL_HBM(1, 7)] = XE_HW_ERR_SOC_NONFATAL_HBM2_CHNL7,
> +	[DRM_XE_GENL_SOC_ERROR_NONFATAL_HBM(1, 8)] = XE_HW_ERR_SOC_NONFATAL_HBM3_CHNL0,
> +	[DRM_XE_GENL_SOC_ERROR_NONFATAL_HBM(1, 9)] = XE_HW_ERR_SOC_NONFATAL_HBM3_CHNL1,
> +	[DRM_XE_GENL_SOC_ERROR_NONFATAL_HBM(1, 10)] = XE_HW_ERR_SOC_NONFATAL_HBM3_CHNL2,
> +	[DRM_XE_GENL_SOC_ERROR_NONFATAL_HBM(1, 11)] = XE_HW_ERR_SOC_NONFATAL_HBM3_CHNL3,
> +	[DRM_XE_GENL_SOC_ERROR_NONFATAL_HBM(1, 12)] = XE_HW_ERR_SOC_NONFATAL_HBM3_CHNL4,
> +	[DRM_XE_GENL_SOC_ERROR_NONFATAL_HBM(1, 13)] = XE_HW_ERR_SOC_NONFATAL_HBM3_CHNL5,
> +	[DRM_XE_GENL_SOC_ERROR_NONFATAL_HBM(1, 14)] = XE_HW_ERR_SOC_NONFATAL_HBM3_CHNL6,
> +	[DRM_XE_GENL_SOC_ERROR_NONFATAL_HBM(1, 15)] = XE_HW_ERR_SOC_NONFATAL_HBM3_CHNL7,
> +	[DRM_XE_GENL_SOC_ERROR_FATAL_CSC_PSF_CMD] = XE_HW_ERR_SOC_FATAL_CSC_PSF_CMD,
> +	[DRM_XE_GENL_SOC_ERROR_FATAL_CSC_PSF_CMP] = XE_HW_ERR_SOC_FATAL_CSC_PSF_CMP,
> +	[DRM_XE_GENL_SOC_ERROR_FATAL_CSC_PSF_REQ] = XE_HW_ERR_SOC_FATAL_CSC_PSF_REQ,
> +	[DRM_XE_GENL_SOC_ERROR_FATAL_PUNIT] = XE_HW_ERR_SOC_FATAL_PUNIT,
> +	[DRM_XE_GENL_SOC_ERROR_FATAL_PCIE_PSF_CMD] = XE_HW_ERR_SOC_FATAL_PCIE_PSF_CMD,
> +	[DRM_XE_GENL_SOC_ERROR_FATAL_PCIE_PSF_CMP] = XE_HW_ERR_SOC_FATAL_PCIE_PSF_CMP,
> +	[DRM_XE_GENL_SOC_ERROR_FATAL_PCIE_PSF_REQ] = XE_HW_ERR_SOC_FATAL_PCIE_PSF_REQ,
> +	[DRM_XE_GENL_SOC_ERROR_FATAL_ANR_MDFI] = XE_HW_ERR_SOC_FATAL_ANR_MDFI,
> +	[DRM_XE_GENL_SOC_ERROR_FATAL_MDFI_T2T] = XE_HW_ERR_SOC_FATAL_MDFI_T2T,
> +	[DRM_XE_GENL_SOC_ERROR_FATAL_MDFI_T2C] = XE_HW_ERR_SOC_FATAL_MDFI_T2C,
> +	[DRM_XE_GENL_SOC_ERROR_FATAL_PCIE_AER] = XE_HW_ERR_SOC_FATAL_PCIE_AER,
> +	[DRM_XE_GENL_SOC_ERROR_FATAL_PCIE_ERR] = XE_HW_ERR_SOC_FATAL_PCIE_ERR,
> +	[DRM_XE_GENL_SOC_ERROR_FATAL_UR_COND] = XE_HW_ERR_SOC_FATAL_UR_COND,
> +	[DRM_XE_GENL_SOC_ERROR_FATAL_SERR_SRCS] = XE_HW_ERR_SOC_FATAL_SERR_SRCS,
> +	[DRM_XE_GENL_SOC_ERROR_FATAL_HBM(0, 0)] = XE_HW_ERR_SOC_FATAL_HBM0_CHNL0,
> +	[DRM_XE_GENL_SOC_ERROR_FATAL_HBM(0, 1)] = XE_HW_ERR_SOC_FATAL_HBM0_CHNL1,
> +	[DRM_XE_GENL_SOC_ERROR_FATAL_HBM(0, 2)] = XE_HW_ERR_SOC_FATAL_HBM0_CHNL2,
> +	[DRM_XE_GENL_SOC_ERROR_FATAL_HBM(0, 3)] = XE_HW_ERR_SOC_FATAL_HBM0_CHNL3,
> +	[DRM_XE_GENL_SOC_ERROR_FATAL_HBM(0, 4)] = XE_HW_ERR_SOC_FATAL_HBM0_CHNL4,
> +	[DRM_XE_GENL_SOC_ERROR_FATAL_HBM(0, 5)] = XE_HW_ERR_SOC_FATAL_HBM0_CHNL5,
> +	[DRM_XE_GENL_SOC_ERROR_FATAL_HBM(0, 6)] = XE_HW_ERR_SOC_FATAL_HBM0_CHNL6,
> +	[DRM_XE_GENL_SOC_ERROR_FATAL_HBM(0, 7)] = XE_HW_ERR_SOC_FATAL_HBM0_CHNL7,
> +	[DRM_XE_GENL_SOC_ERROR_FATAL_HBM(0, 8)] = XE_HW_ERR_SOC_FATAL_HBM1_CHNL0,
> +	[DRM_XE_GENL_SOC_ERROR_FATAL_HBM(0, 9)] = XE_HW_ERR_SOC_FATAL_HBM1_CHNL1,
> +	[DRM_XE_GENL_SOC_ERROR_FATAL_HBM(0, 10)] = XE_HW_ERR_SOC_FATAL_HBM1_CHNL2,
> +	[DRM_XE_GENL_SOC_ERROR_FATAL_HBM(0, 11)] = XE_HW_ERR_SOC_FATAL_HBM1_CHNL3,
> +	[DRM_XE_GENL_SOC_ERROR_FATAL_HBM(0, 12)] = XE_HW_ERR_SOC_FATAL_HBM1_CHNL4,
> +	[DRM_XE_GENL_SOC_ERROR_FATAL_HBM(0, 13)] = XE_HW_ERR_SOC_FATAL_HBM1_CHNL5,
> +	[DRM_XE_GENL_SOC_ERROR_FATAL_HBM(0, 14)] = XE_HW_ERR_SOC_FATAL_HBM1_CHNL6,
> +	[DRM_XE_GENL_SOC_ERROR_FATAL_HBM(0, 15)] = XE_HW_ERR_SOC_FATAL_HBM1_CHNL7,
> +	[DRM_XE_GENL_SOC_ERROR_FATAL_HBM(1, 0)] = XE_HW_ERR_SOC_FATAL_HBM2_CHNL0,
> +	[DRM_XE_GENL_SOC_ERROR_FATAL_HBM(1, 1)] = XE_HW_ERR_SOC_FATAL_HBM2_CHNL1,
> +	[DRM_XE_GENL_SOC_ERROR_FATAL_HBM(1, 2)] = XE_HW_ERR_SOC_FATAL_HBM2_CHNL2,
> +	[DRM_XE_GENL_SOC_ERROR_FATAL_HBM(1, 3)] = XE_HW_ERR_SOC_FATAL_HBM2_CHNL3,
> +	[DRM_XE_GENL_SOC_ERROR_FATAL_HBM(1, 4)] = XE_HW_ERR_SOC_FATAL_HBM2_CHNL4,
> +	[DRM_XE_GENL_SOC_ERROR_FATAL_HBM(1, 5)] = XE_HW_ERR_SOC_FATAL_HBM2_CHNL5,
> +	[DRM_XE_GENL_SOC_ERROR_FATAL_HBM(1, 6)] = XE_HW_ERR_SOC_FATAL_HBM2_CHNL6,
> +	[DRM_XE_GENL_SOC_ERROR_FATAL_HBM(1, 7)] = XE_HW_ERR_SOC_FATAL_HBM2_CHNL7,
> +	[DRM_XE_GENL_SOC_ERROR_FATAL_HBM(1, 8)] = XE_HW_ERR_SOC_FATAL_HBM3_CHNL0,
> +	[DRM_XE_GENL_SOC_ERROR_FATAL_HBM(1, 9)] = XE_HW_ERR_SOC_FATAL_HBM3_CHNL1,
> +	[DRM_XE_GENL_SOC_ERROR_FATAL_HBM(1, 10)] = XE_HW_ERR_SOC_FATAL_HBM3_CHNL2,
> +	[DRM_XE_GENL_SOC_ERROR_FATAL_HBM(1, 11)] = XE_HW_ERR_SOC_FATAL_HBM3_CHNL3,
> +	[DRM_XE_GENL_SOC_ERROR_FATAL_HBM(1, 12)] = XE_HW_ERR_SOC_FATAL_HBM3_CHNL4,
> +	[DRM_XE_GENL_SOC_ERROR_FATAL_HBM(1, 13)] = XE_HW_ERR_SOC_FATAL_HBM3_CHNL5,
> +	[DRM_XE_GENL_SOC_ERROR_FATAL_HBM(1, 14)] = XE_HW_ERR_SOC_FATAL_HBM3_CHNL6,
> +	[DRM_XE_GENL_SOC_ERROR_FATAL_HBM(1, 15)] = XE_HW_ERR_SOC_FATAL_HBM3_CHNL7,
> +	[DRM_XE_GENL_GSC_ERROR_CORRECTABLE_SRAM_ECC] = XE_HW_ERR_GSC_CORR_SRAM,
> +	[DRM_XE_GENL_GSC_ERROR_NONFATAL_MIA_SHUTDOWN] = XE_HW_ERR_GSC_NONFATAL_MIA_SHUTDOWN,
> +	[DRM_XE_GENL_GSC_ERROR_NONFATAL_MIA_INTERNAL] = XE_HW_ERR_GSC_NONFATAL_MIA_INTERNAL,
> +	[DRM_XE_GENL_GSC_ERROR_NONFATAL_SRAM_ECC] = XE_HW_ERR_GSC_NONFATAL_SRAM,
> +	[DRM_XE_GENL_GSC_ERROR_NONFATAL_WDG_TIMEOUT] = XE_HW_ERR_GSC_NONFATAL_WDG,
> +	[DRM_XE_GENL_GSC_ERROR_NONFATAL_ROM_PARITY] = XE_HW_ERR_GSC_NONFATAL_ROM_PARITY,
> +	[DRM_XE_GENL_GSC_ERROR_NONFATAL_UCODE_PARITY] = XE_HW_ERR_GSC_NONFATAL_UCODE_PARITY,
> +	[DRM_XE_GENL_GSC_ERROR_NONFATAL_VLT_GLITCH] = XE_HW_ERR_GSC_NONFATAL_VLT_GLITCH,
> +	[DRM_XE_GENL_GSC_ERROR_NONFATAL_FUSE_PULL] = XE_HW_ERR_GSC_NONFATAL_FUSE_PULL,
> +	[DRM_XE_GENL_GSC_ERROR_NONFATAL_FUSE_CRC_CHECK] = XE_HW_ERR_GSC_NONFATAL_FUSE_CRC,
> +	[DRM_XE_GENL_GSC_ERROR_NONFATAL_SELF_MBIST] = XE_HW_ERR_GSC_NONFATAL_SELF_MBIST,
> +	[DRM_XE_GENL_GSC_ERROR_NONFATAL_AON_RF_PARITY] = XE_HW_ERR_GSC_NONFATAL_AON_RF_PARITY,
> +	[DRM_XE_GENL_SGGI_ERROR_NONFATAL] = XE_HW_ERR_TILE_NONFATAL_SGGI,
> +	[DRM_XE_GENL_SGLI_ERROR_NONFATAL] = XE_HW_ERR_TILE_NONFATAL_SGLI,
> +	[DRM_XE_GENL_SGCI_ERROR_NONFATAL] = XE_HW_ERR_TILE_NONFATAL_SGCI,
> +	[DRM_XE_GENL_MERT_ERROR_NONFATAL] = XE_HW_ERR_TILE_NONFATAL_MERT,
> +	[DRM_XE_GENL_SGGI_ERROR_FATAL] = XE_HW_ERR_TILE_FATAL_SGGI,
> +	[DRM_XE_GENL_SGLI_ERROR_FATAL] = XE_HW_ERR_TILE_FATAL_SGLI,
> +	[DRM_XE_GENL_SGCI_ERROR_FATAL] = XE_HW_ERR_TILE_FATAL_SGCI,
> +	[DRM_XE_GENL_MERT_ERROR_FATAL] = XE_HW_ERR_TILE_FATAL_MERT,
> +};

probably deserves a separate header file?

> +
> +static unsigned int config_gt_id(const u64 config)
> +{
> +	return config >> __XE_GENL_GT_SHIFT;
> +}
> +
> +static u64 config_counter(const u64 config)
> +{
> +	return config & ~(~0ULL << __XE_GENL_GT_SHIFT);
> +}
> +
> +static bool is_gt_error(const u64 config)
> +{
> +	unsigned int error;
> +
> +	error = config_counter(config);
> +	if (error <= DRM_XE_GENL_GT_ERROR_FATAL_FPU)
> +		return true;
> +
> +	return false;
> +}
> +
> +static bool is_gt_vector_error(const u64 config)
> +{
> +	unsigned int error;
> +
> +	error = config_counter(config);
> +	if (error >= DRM_XE_GENL_GT_ERROR_FATAL_TLB &&
> +	    error <= DRM_XE_GENL_GT_ERROR_FATAL_L3BANK)
> +		return true;
> +
> +	return false;
> +}
> +
> +static bool is_pvc_invalid_gt_errors(const u64 config)
> +{
> +	switch (config_counter(config)) {
> +	case DRM_XE_GENL_GT_ERROR_CORRECTABLE_L3_SNG:
> +	case DRM_XE_GENL_GT_ERROR_CORRECTABLE_SAMPLER:
> +	case DRM_XE_GENL_GT_ERROR_FATAL_ARR_BIST:
> +	case DRM_XE_GENL_GT_ERROR_FATAL_L3_DOUB:
> +	case DRM_XE_GENL_GT_ERROR_FATAL_L3_ECC_CHK:
> +	case DRM_XE_GENL_GT_ERROR_FATAL_IDI_PAR:
> +	case DRM_XE_GENL_GT_ERROR_FATAL_SQIDI:
> +	case DRM_XE_GENL_GT_ERROR_FATAL_SAMPLER:
> +	case DRM_XE_GENL_GT_ERROR_FATAL_EU_IC:
> +		return true;
> +	default:
> +		return false;
> +	}
> +}
> +
> +static bool is_gsc_hw_error(const u64 config)
> +{
> +	if (config_counter(config) >= DRM_XE_GENL_GSC_ERROR_CORRECTABLE_SRAM_ECC &&
> +	    config_counter(config) <= DRM_XE_GENL_GSC_ERROR_NONFATAL_AON_RF_PARITY)
> +		return true;
> +
> +	return false;
> +}
> +
> +static bool is_soc_error(const u64 config)
>  {
> +	if (config_counter(config) >= DRM_XE_GENL_SOC_ERROR_NONFATAL_CSC_PSF_CMD &&
> +	    config_counter(config) <= DRM_XE_GENL_SOC_ERROR_FATAL_HBM(1, 15))
> +		return true;
> +
> +	return false;
> +}
> +
> +static int
> +config_status(struct xe_device *xe, u64 config)
> +{
> +	unsigned int gt_id = config_gt_id(config);
> +	struct xe_gt *gt = xe_device_get_gt(xe, gt_id);
> +
> +	if (!IS_DGFX(xe))
> +		return -ENODEV;
> +
> +	if (gt->info.type == XE_GT_TYPE_UNINITIALIZED)
> +		return -ENOENT;
> +
> +	/* GSC HW ERRORS are present on root tile of
> +	 * platform supporting MEMORY SPARING only
> +	 */
> +	if (is_gsc_hw_error(config) && !(xe->info.platform == XE_PVC && !gt_id))
> +		return -ENODEV;
> +
> +	/* GT vector errors are valid only on platforms supporting error vectors */
> +	if (is_gt_vector_error(config) && xe->info.platform != XE_PVC)
> +		return -ENODEV;
> +
> +	/* Skip gt errors not supported on pvc */
> +	if (is_pvc_invalid_gt_errors(config) && xe->info.platform == XE_PVC)
> +		return  -ENODEV;
> +
> +	/* FATAL FPU error is valid on PVC only */
> +	if (config_counter(config) == DRM_XE_GENL_GT_ERROR_FATAL_FPU &&
> +	    !(xe->info.platform == XE_PVC))
> +		return -ENODEV;
> +
> +	if (is_soc_error(config) && !(xe->info.platform == XE_PVC))
> +		return -ENODEV;
> +
> +	return (config_counter(config) >=
> +			ARRAY_SIZE(xe_hw_error_map)) ? -ENOENT : 0;
> +}
> +
> +static u64 get_counter_value(struct xe_device *xe, u64 config)
> +{
> +	const unsigned int gt_id = config_gt_id(config);
> +	struct xe_gt *gt = xe_device_get_gt(xe, gt_id);
> +	unsigned int id = config_counter(config);
> +
> +	if (is_gt_error(config) || is_gt_vector_error(config))
> +		return xa_to_value(xa_load(&gt->errors.hw_error, xe_hw_error_map[id]));
> +
> +	return xa_to_value(xa_load(&gt->tile->errors.hw_error, xe_hw_error_map[id]));
> +}
> +
> +static int fill_error_details(struct xe_device *xe, struct genl_info *info, struct sk_buff *new_msg)
> +{
> +	struct nlattr *entry_attr;
> +	bool counter = false;
> +	struct xe_gt *gt;
> +	int i, j;
> +
> +	BUILD_BUG_ON(ARRAY_SIZE(xe_hw_error_events) !=
> +		     ARRAY_SIZE(xe_hw_error_map));
> +
> +	if (info->genlhdr->cmd == DRM_RAS_CMD_READ_ALL)
> +		counter = true;
> +
> +	entry_attr = nla_nest_start(new_msg, DRM_RAS_ATTR_QUERY_REPLY);
> +	if (!entry_attr)
> +		return -EMSGSIZE;
> +
> +	for_each_gt(gt, xe, j) {
> +		char str[MAX_ERROR_NAME];
> +		u64 val;
> +
> +		for (i = 0; i < ARRAY_SIZE(xe_hw_error_events); i++) {
> +			u64 config = DRM_XE_HW_ERROR(j, i);
> +
> +			if (config_status(xe, config))
> +				continue;
> +
> +			/* should this be cleared everytime */
> +			snprintf(str, sizeof(str), "error-gt%d-%s", j, xe_hw_error_events[i]);
> +
> +			if (nla_put_string(new_msg, DRM_RAS_ATTR_ERROR_NAME, str))
> +				goto err;
> +			if (nla_put_u64_64bit(new_msg, DRM_RAS_ATTR_ERROR_ID, config, DRM_ATTR_PAD))
> +				goto err;
> +			if (counter) {
> +				val = get_counter_value(xe, config);
> +				if (nla_put_u64_64bit(new_msg, DRM_RAS_ATTR_ERROR_VALUE, val,
> +						      DRM_ATTR_PAD))
> +					goto err;
> +			}
> +		}
> +	}
> +
> +	nla_nest_end(new_msg, entry_attr);
> +
>  	return 0;
> +err:
> +	drm_dbg_driver(&xe->drm, "msg buff is small\n");
> +	nla_nest_cancel(new_msg, entry_attr);
> +	nlmsg_free(new_msg);
> +
> +	return -EMSGSIZE;
> +}
> +
> +static int xe_genl_list_errors(struct drm_device *drm, struct sk_buff *msg, struct genl_info *info)
> +{
> +	struct xe_device *xe = to_xe_device(drm);
> +	size_t msg_size = NLMSG_DEFAULT_SIZE;
> +	enum drm_cmd_request_type query_type;
> +	struct sk_buff *new_msg;
> +	int retries = 2;
> +	void *usrhdr;
> +	int ret = 0;
> +
> +	if (!IS_DGFX(xe))
> +		return -ENODEV;
> +
> +	/* Support verbose only errors */
> +	query_type = nla_get_u8(info->attrs[DRM_RAS_ATTR_QUERY]);
> +	if (query_type == DRM_RAS_CMD_QUERY_NORMAL)
> +		return -EOPNOTSUPP;
> +
> +	do {
> +		new_msg = drm_genl_alloc_msg(drm, info, msg_size, &usrhdr);
> +		if (!new_msg)
> +			return -ENOMEM;
> +
> +		ret = fill_error_details(xe, info, new_msg);
> +		if (!ret)
> +			break;
> +
> +		msg_size += NLMSG_DEFAULT_SIZE;
> +	} while (retries--);
> +
> +	if (!ret)
> +		ret = drm_genl_reply(new_msg, info, usrhdr);
> +
> +	return ret;
>  }
>  
>  static int xe_genl_read_error(struct drm_device *drm, struct sk_buff *msg, struct genl_info *info)
>  {
> -	return 0;
> +	struct xe_device *xe = to_xe_device(drm);
> +	size_t msg_size = NLMSG_DEFAULT_SIZE;
> +	struct sk_buff *new_msg;
> +	void *usrhdr;
> +	int ret = 0;
> +	int retries = 2;
> +	u64 config, val;
> +
> +	if (info->genlhdr->cmd == DRM_RAS_CMD_READ_BLOCK)
> +		return -EOPNOTSUPP;
> +
> +	config = nla_get_u64(info->attrs[DRM_RAS_ATTR_ERROR_ID]);
> +	ret = config_status(xe, config);
> +	if (ret)
> +		return ret;
> +	do {
> +		new_msg = drm_genl_alloc_msg(drm, info, msg_size, &usrhdr);
> +		if (!new_msg)
> +			return -ENOMEM;
> +
> +		val = get_counter_value(xe, config);
> +		if (nla_put_u64_64bit(new_msg, DRM_RAS_ATTR_ERROR_VALUE, val, DRM_ATTR_PAD)) {
> +			msg_size += NLMSG_DEFAULT_SIZE;
> +			continue;
> +		}
> +
> +		break;
> +	} while (retries--);
> +
> +	ret = drm_genl_reply(new_msg, info, usrhdr);
> +
> +	return ret;
>  }

This .c file, with no public functions and no .h, is what drew my attention
to something being wrong with the registration process. We need to have init
and finish per component.
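
One possible shape for this, sketched as a plain userspace mock rather than
driver code. Everything here (struct ras_component, ras_components_init(),
ras_components_fini(), the gt/soc callbacks) is hypothetical and not taken
from the series; it only illustrates each component exposing its own
init/finish pair, with already-initialised components unwound if a later one
fails.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical per-component descriptor: each component owns its setup/teardown. */
struct ras_component {
	const char *name;
	int (*init)(void);
	void (*fini)(void);
};

/* Mock state so the effect of init/fini can be observed. */
int live_components;

int gt_init(void) { live_components++; return 0; }
void gt_fini(void) { live_components--; }
int soc_init(void) { return -1; }	/* mock failure, nothing was set up */
void soc_fini(void) { live_components--; }

/* Initialise components in order; on failure, unwind the ones already up. */
int ras_components_init(const struct ras_component *c, size_t n)
{
	size_t i;

	for (i = 0; i < n; i++) {
		int ret;

		assert(c[i].init && c[i].fini);
		ret = c[i].init();
		if (ret) {
			while (i--)
				c[i].fini();
			return ret;
		}
	}
	return 0;
}

/* Tear components down in reverse order of initialisation. */
void ras_components_fini(const struct ras_component *c, size_t n)
{
	while (n--)
		c[n].fini();
}
```

With this layout the registration code only walks a table, and each component
keeps its own public init/finish entry points in its own header.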

>  
>  /* driver callbacks to DRM netlink commands*/
> diff --git a/include/uapi/drm/xe_drm.h b/include/uapi/drm/xe_drm.h
> index e2426413488f..d352a96e4826 100644
> --- a/include/uapi/drm/xe_drm.h
> +++ b/include/uapi/drm/xe_drm.h
> @@ -1974,6 +1974,91 @@ struct drm_xe_query_eu_stall {
>  	__u64 sampling_rates[];
>  };
>  
> +/*
> + * Top bits of every counter are GT id.
> + */
> +#define __XE_GENL_GT_SHIFT	(56)
> +/**
> + * DOC: XE GENL netlink event IDs
> + * TODO: Add more details

Yes, please. It is hard to understand why this is here.

I also have the feeling that this belongs with the other definitions
above, all together in a separate header, perhaps even per platform?
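
For reference, the ID encoding can be checked in isolation. Below is a minimal
userspace sketch, assuming the macros exactly as posted in this patch, with
__u64 swapped for uint64_t and the leading underscores dropped from the shift
name so it builds outside the kernel. xe_config_selftest() is a hypothetical
helper and not part of the series.

```c
#include <assert.h>
#include <stdint.h>

/* Local copies of the uAPI macros from this patch: top 8 bits hold the GT id. */
#define XE_GENL_GT_SHIFT	56
#define DRM_XE_HW_ERROR(gt, id) \
	((uint64_t)(id) | ((uint64_t)(gt) << XE_GENL_GT_SHIFT))

#define DRM_XE_GENL_GT_ERROR_FATAL_TLB		17
#define DRM_XE_GENL_SOC_ERROR_FATAL_SERR_SRCS	45
#define DRM_XE_GENL_SOC_ERROR_NONFATAL_HBM(ss, n) \
	(DRM_XE_GENL_SOC_ERROR_FATAL_SERR_SRCS + 0x1 + (ss) * 0x10 + (n))
#define DRM_XE_GENL_SOC_ERROR_FATAL_HBM(ss, n) \
	(DRM_XE_GENL_SOC_ERROR_NONFATAL_HBM(1, 15) + 0x1 + (ss) * 0x10 + (n))

/* Same decoding as config_gt_id() and config_counter() in the driver. */
unsigned int config_gt_id(uint64_t config)
{
	return (unsigned int)(config >> XE_GENL_GT_SHIFT);
}

uint64_t config_counter(uint64_t config)
{
	return config & ~(~0ULL << XE_GENL_GT_SHIFT);
}

/* Hypothetical self-check: encode an ID, decode it, and verify the HBM ranges. */
void xe_config_selftest(void)
{
	uint64_t config = DRM_XE_HW_ERROR(1, DRM_XE_GENL_GT_ERROR_FATAL_TLB);

	assert(config_gt_id(config) == 1);
	assert(config_counter(config) == DRM_XE_GENL_GT_ERROR_FATAL_TLB);

	/* 16 IDs per "ss" index: non-fatal HBM is 46..77, fatal HBM is 78..109. */
	assert(DRM_XE_GENL_SOC_ERROR_NONFATAL_HBM(0, 0) == 46);
	assert(DRM_XE_GENL_SOC_ERROR_FATAL_HBM(1, 15) == 109);
}
```

The last assertion matches the "109 is the last ID used by SOC errors" comment
in the header, which is the kind of detail the DOC section could spell out.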

> + */
> +#define DRM_XE_HW_ERROR(gt, id) \
> +	((id) | ((__u64)(gt) << __XE_GENL_GT_SHIFT))
> +
> +#define DRM_XE_GENL_GT_ERROR_CORRECTABLE_L3_SNG		(0)
> +#define DRM_XE_GENL_GT_ERROR_CORRECTABLE_GUC			(1)
> +#define DRM_XE_GENL_GT_ERROR_CORRECTABLE_SAMPLER		(2)
> +#define DRM_XE_GENL_GT_ERROR_CORRECTABLE_SLM			(3)
> +#define DRM_XE_GENL_GT_ERROR_CORRECTABLE_EU_IC		(4)
> +#define DRM_XE_GENL_GT_ERROR_CORRECTABLE_EU_GRF		(5)
> +#define DRM_XE_GENL_GT_ERROR_FATAL_ARR_BIST			(6)
> +#define DRM_XE_GENL_GT_ERROR_FATAL_L3_DOUB			(7)
> +#define DRM_XE_GENL_GT_ERROR_FATAL_L3_ECC_CHK		(8)
> +#define DRM_XE_GENL_GT_ERROR_FATAL_GUC			(9)
> +#define DRM_XE_GENL_GT_ERROR_FATAL_IDI_PAR			(10)
> +#define DRM_XE_GENL_GT_ERROR_FATAL_SQIDI			(11)
> +#define DRM_XE_GENL_GT_ERROR_FATAL_SAMPLER			(12)
> +#define DRM_XE_GENL_GT_ERROR_FATAL_SLM			(13)
> +#define DRM_XE_GENL_GT_ERROR_FATAL_EU_IC			(14)
> +#define DRM_XE_GENL_GT_ERROR_FATAL_EU_GRF			(15)
> +#define DRM_XE_GENL_GT_ERROR_FATAL_FPU			(16)
> +#define DRM_XE_GENL_GT_ERROR_FATAL_TLB			(17)
> +#define DRM_XE_GENL_GT_ERROR_FATAL_L3_FABRIC			(18)
> +#define DRM_XE_GENL_GT_ERROR_CORRECTABLE_SUBSLICE		(19)
> +#define DRM_XE_GENL_GT_ERROR_CORRECTABLE_L3BANK		(20)
> +#define DRM_XE_GENL_GT_ERROR_FATAL_SUBSLICE			(21)
> +#define DRM_XE_GENL_GT_ERROR_FATAL_L3BANK			(22)
> +#define DRM_XE_GENL_SGUNIT_ERROR_CORRECTABLE			(23)
> +#define DRM_XE_GENL_SGUNIT_ERROR_NONFATAL			(24)
> +#define DRM_XE_GENL_SGUNIT_ERROR_FATAL			(25)
> +#define DRM_XE_GENL_SOC_ERROR_NONFATAL_CSC_PSF_CMD		(26)
> +#define DRM_XE_GENL_SOC_ERROR_NONFATAL_CSC_PSF_CMP		(27)
> +#define DRM_XE_GENL_SOC_ERROR_NONFATAL_CSC_PSF_REQ		(28)
> +#define DRM_XE_GENL_SOC_ERROR_NONFATAL_ANR_MDFI		(29)
> +#define DRM_XE_GENL_SOC_ERROR_NONFATAL_MDFI_T2T		(30)
> +#define DRM_XE_GENL_SOC_ERROR_NONFATAL_MDFI_T2C		(31)
> +#define DRM_XE_GENL_SOC_ERROR_FATAL_CSC_PSF_CMD		(32)
> +#define DRM_XE_GENL_SOC_ERROR_FATAL_CSC_PSF_CMP		(33)
> +#define DRM_XE_GENL_SOC_ERROR_FATAL_CSC_PSF_REQ		(34)
> +#define DRM_XE_GENL_SOC_ERROR_FATAL_PUNIT			(35)
> +#define DRM_XE_GENL_SOC_ERROR_FATAL_PCIE_PSF_CMD		(36)
> +#define DRM_XE_GENL_SOC_ERROR_FATAL_PCIE_PSF_CMP		(37)
> +#define DRM_XE_GENL_SOC_ERROR_FATAL_PCIE_PSF_REQ		(38)
> +#define DRM_XE_GENL_SOC_ERROR_FATAL_ANR_MDFI			(39)
> +#define DRM_XE_GENL_SOC_ERROR_FATAL_MDFI_T2T			(40)
> +#define DRM_XE_GENL_SOC_ERROR_FATAL_MDFI_T2C			(41)
> +#define DRM_XE_GENL_SOC_ERROR_FATAL_PCIE_AER			(42)
> +#define DRM_XE_GENL_SOC_ERROR_FATAL_PCIE_ERR			(43)
> +#define DRM_XE_GENL_SOC_ERROR_FATAL_UR_COND			(44)
> +#define DRM_XE_GENL_SOC_ERROR_FATAL_SERR_SRCS		(45)
> +
> +#define DRM_XE_GENL_SOC_ERROR_NONFATAL_HBM(ss, n)\
> +		(DRM_XE_GENL_SOC_ERROR_FATAL_SERR_SRCS + 0x1 + (ss) * 0x10 + (n))
> +#define DRM_XE_GENL_SOC_ERROR_FATAL_HBM(ss, n)\
> +		(DRM_XE_GENL_SOC_ERROR_NONFATAL_HBM(1, 15) + 0x1 + (ss) * 0x10 + (n))
> +
> +/* 109 is the last ID used by SOC errors */
> +#define DRM_XE_GENL_GSC_ERROR_CORRECTABLE_SRAM_ECC		(110)
> +#define DRM_XE_GENL_GSC_ERROR_NONFATAL_MIA_SHUTDOWN		(111)
> +#define DRM_XE_GENL_GSC_ERROR_NONFATAL_MIA_INTERNAL		(112)
> +#define DRM_XE_GENL_GSC_ERROR_NONFATAL_SRAM_ECC		(113)
> +#define DRM_XE_GENL_GSC_ERROR_NONFATAL_WDG_TIMEOUT		(114)
> +#define DRM_XE_GENL_GSC_ERROR_NONFATAL_ROM_PARITY		(115)
> +#define DRM_XE_GENL_GSC_ERROR_NONFATAL_UCODE_PARITY		(116)
> +#define DRM_XE_GENL_GSC_ERROR_NONFATAL_VLT_GLITCH		(117)
> +#define DRM_XE_GENL_GSC_ERROR_NONFATAL_FUSE_PULL		(118)
> +#define DRM_XE_GENL_GSC_ERROR_NONFATAL_FUSE_CRC_CHECK	(119)
> +#define DRM_XE_GENL_GSC_ERROR_NONFATAL_SELF_MBIST		(120)
> +#define DRM_XE_GENL_GSC_ERROR_NONFATAL_AON_RF_PARITY		(121)
> +#define DRM_XE_GENL_SGGI_ERROR_NONFATAL			(122)
> +#define DRM_XE_GENL_SGLI_ERROR_NONFATAL			(123)
> +#define DRM_XE_GENL_SGCI_ERROR_NONFATAL			(124)
> +#define DRM_XE_GENL_MERT_ERROR_NONFATAL			(125)
> +#define DRM_XE_GENL_SGGI_ERROR_FATAL				(126)
> +#define DRM_XE_GENL_SGLI_ERROR_FATAL				(127)
> +#define DRM_XE_GENL_SGCI_ERROR_FATAL				(128)
> +#define DRM_XE_GENL_MERT_ERROR_FATAL				(129)
> +
>  #if defined(__cplusplus)
>  }
>  #endif
> -- 
> 2.25.1
> 


* Re: [RFC v5 4/5] drm/netlink: Define multicast groups
  2025-07-30  6:49 ` [RFC v5 4/5] drm/netlink: Define multicast groups Aravind Iddamsetty
@ 2025-08-15 22:00   ` Rodrigo Vivi
  0 siblings, 0 replies; 24+ messages in thread
From: Rodrigo Vivi @ 2025-08-15 22:00 UTC (permalink / raw)
  To: Aravind Iddamsetty
  Cc: intel-xe, dri-devel, Alex Deucher, David Airlie, Simona Vetter,
	Joonas Lahtinen, Hawking Zhang, Lijo Lazar, Ruhl, Michael J,
	Riana Tauro, Anshuman Gupta

On Wed, Jul 30, 2025 at 12:19:55PM +0530, Aravind Iddamsetty wrote:
> Netlink subsystem supports event notifications to userspace. we define
> two multicast groups for correctable and uncorrectable errors to which
> userspace can subscribe and be notified when any of those errors happen.
> The group names are local to the driver's genl netlink family.
> 
> Reviewed-by: Michael J. Ruhl <michael.j.ruhl@intel.com>
> Signed-off-by: Aravind Iddamsetty <aravind.iddamsetty@linux.intel.com>
> ---
>  drivers/gpu/drm/drm_netlink.c  | 7 +++++++
>  include/drm/drm_netlink.h      | 5 +++++
>  include/uapi/drm/drm_netlink.h | 4 ++++
>  3 files changed, 16 insertions(+)
> 
> diff --git a/drivers/gpu/drm/drm_netlink.c b/drivers/gpu/drm/drm_netlink.c
> index da4bfde32a22..a7c0a4401ca9 100644
> --- a/drivers/gpu/drm/drm_netlink.c
> +++ b/drivers/gpu/drm/drm_netlink.c
> @@ -15,6 +15,11 @@
>  
>  DEFINE_XARRAY(drm_dev_xarray);
>  
> +static const struct genl_multicast_group drm_event_mcgrps[] = {
> +	[DRM_GENL_MCAST_CORR_ERR] = { .name = DRM_GENL_MCAST_GROUP_NAME_CORR_ERR, },
> +	[DRM_GENL_MCAST_UNCORR_ERR] = { .name = DRM_GENL_MCAST_GROUP_NAME_UNCORR_ERR, },
> +};

This is what I had in mind for that 'monitor', but that can be ignored and
we can indeed leave this per error component. I also don't like it being
forced, though. It should be a driver definition and driver adoption.

> +
>  /**
>   * drm_genl_reply - response to a request
>   * @msg: socket buffer
> @@ -156,6 +161,8 @@ static void drm_genl_family_init(struct drm_device *dev)
>  	dev->drm_genl_family->ops = drm_genl_ops;
>  	dev->drm_genl_family->n_ops = ARRAY_SIZE(drm_genl_ops);
>  	dev->drm_genl_family->maxattr = DRM_ATTR_MAX;
> +	dev->drm_genl_family->mcgrps = drm_event_mcgrps;
> +	dev->drm_genl_family->n_mcgrps = ARRAY_SIZE(drm_event_mcgrps);
>  	dev->drm_genl_family->module = dev->dev->driver->owner;
>  }
>  
> diff --git a/include/drm/drm_netlink.h b/include/drm/drm_netlink.h
> index 4a746222337a..9e48147d0d36 100644
> --- a/include/drm/drm_netlink.h
> +++ b/include/drm/drm_netlink.h
> @@ -12,6 +12,11 @@ struct drm_device;
>  struct genl_info;
>  struct sk_buff;
>  
> +enum mcgrps_events {
> +	DRM_GENL_MCAST_CORR_ERR,
> +	DRM_GENL_MCAST_UNCORR_ERR,
> +};
> +
>  struct driver_genl_ops {
>  	int		       (*doit)(struct drm_device *dev,
>  				       struct sk_buff *skb,
> diff --git a/include/uapi/drm/drm_netlink.h b/include/uapi/drm/drm_netlink.h
> index 58afb6e8d84a..c978efaab124 100644
> --- a/include/uapi/drm/drm_netlink.h
> +++ b/include/uapi/drm/drm_netlink.h
> @@ -26,6 +26,8 @@
>  #define _DRM_NETLINK_H_
>  
>  #define DRM_GENL_VERSION 1
> +#define DRM_GENL_MCAST_GROUP_NAME_CORR_ERR	"drm_corr_err"
> +#define DRM_GENL_MCAST_GROUP_NAME_UNCORR_ERR	"drm_uncorr_err"
>  
>  #if defined(__cplusplus)
>  extern "C" {
> @@ -50,6 +52,8 @@ enum drm_genl_error_cmds {
>  	DRM_RAS_CMD_READ_BLOCK,
>  	/** @DRM_RAS_CMD_READ_ALL: Command to get counters of all errors */
>  	DRM_RAS_CMD_READ_ALL,
> +	/** @DRM_RAS_CMD_ERROR_EVENT: Command sent as part of multicast event */
> +	DRM_RAS_CMD_ERROR_EVENT,
>  
>  	__DRM_CMD_MAX,
>  	DRM_CMD_MAX = __DRM_CMD_MAX - 1,
> -- 
> 2.25.1
> 


* Re: [RFC v5 5/5] drm/xe/RAS: send multicast event on occurrence of an error
  2025-07-30  6:49 ` [RFC v5 5/5] drm/xe/RAS: send multicast event on occurrence of an error Aravind Iddamsetty
@ 2025-08-15 22:01   ` Rodrigo Vivi
  2025-08-26  9:34     ` Aravind Iddamsetty
  0 siblings, 1 reply; 24+ messages in thread
From: Rodrigo Vivi @ 2025-08-15 22:01 UTC (permalink / raw)
  To: Aravind Iddamsetty
  Cc: intel-xe, dri-devel, Alex Deucher, David Airlie, Simona Vetter,
	Joonas Lahtinen, Hawking Zhang, Lijo Lazar, Ruhl, Michael J,
	Riana Tauro, Anshuman Gupta

On Wed, Jul 30, 2025 at 12:19:56PM +0530, Aravind Iddamsetty wrote:
> Whenever a correctable or an uncorrectable error happens an event is sent
> to the corresponding listeners of these groups.
> 
> v2: Rebase
> v3: protect with CONFIG_NET define.
> 
> Reviewed-by: Michael J. Ruhl <michael.j.ruhl@intel.com> #v2
> Signed-off-by: Aravind Iddamsetty <aravind.iddamsetty@linux.intel.com>
> ---
>  drivers/gpu/drm/xe/xe_hw_error.c | 41 ++++++++++++++++++++++++++++++++
>  1 file changed, 41 insertions(+)
> 
> diff --git a/drivers/gpu/drm/xe/xe_hw_error.c b/drivers/gpu/drm/xe/xe_hw_error.c
> index bdd9c88674b2..e6e2e6250b70 100644
> --- a/drivers/gpu/drm/xe/xe_hw_error.c
> +++ b/drivers/gpu/drm/xe/xe_hw_error.c
> @@ -2,6 +2,8 @@
>  /*
>   * Copyright © 2023 Intel Corporation
>   */
> +#include <net/genetlink.h>
> +#include <uapi/drm/drm_netlink.h>
>  
>  #include "xe_gt_printk.h"
>  #include "xe_hw_error.h"
> @@ -776,6 +778,43 @@ xe_soc_hw_error_handler(struct xe_tile *tile, const enum hardware_error hw_err)
>  				(HARDWARE_ERROR_MAX << 1) + 1);
>  }
>  
> +#ifdef CONFIG_NET
> +static void
> +generate_netlink_event(struct xe_device *xe, const enum hardware_error hw_err)
> +{
> +	struct sk_buff *msg;
> +	void *hdr;
> +
> +	if (!xe->drm.drm_genl_family)
> +		return;
> +
> +	msg = nlmsg_new(NLMSG_DEFAULT_SIZE, GFP_ATOMIC);
> +	if (!msg) {
> +		drm_dbg_driver(&xe->drm, "couldn't allocate memory for error multicast event\n");
> +		return;
> +	}
> +
> +	hdr = genlmsg_put(msg, 0, 0, xe->drm.drm_genl_family, 0, DRM_RAS_CMD_ERROR_EVENT);

this is something that could be wrapped up in the drm_ras

> +	if (!hdr) {
> +		drm_dbg_driver(&xe->drm, "multicast msg buffer is small\n");
> +		nlmsg_free(msg);
> +		return;
> +	}
> +
> +	genlmsg_end(msg, hdr);
> +
> +	genlmsg_multicast(xe->drm.drm_genl_family, msg, 0,
> +			  hw_err ?
> +			  DRM_GENL_MCAST_UNCORR_ERR
> +			  : DRM_GENL_MCAST_CORR_ERR,
> +			  GFP_ATOMIC);
> +}
> +#else
> +static void
> +generate_netlink_event(struct xe_device *xe, const enum hardware_error hw_err)
> +{}
> +#endif
> +
>  static void
>  xe_hw_error_source_handler(struct xe_tile *tile, const enum hardware_error hw_err)
>  {
> @@ -837,6 +876,8 @@ xe_hw_error_source_handler(struct xe_tile *tile, const enum hardware_error hw_er
>  	}
>  
>  	xe_mmio_write32(&tile->mmio, DEV_ERR_STAT_REG(hw_err), errsrc);
> +
> +	generate_netlink_event(tile_to_xe(tile), hw_err);
>  unlock:
>  	spin_unlock_irqrestore(&tile_to_xe(tile)->irq.lock, flags);
>  }
> -- 
> 2.25.1
> 


* Re: [RFC v5 1/5] drm/netlink: Add netlink infrastructure
  2025-08-15 17:07   ` Zack McKevitt
@ 2025-08-21  9:45     ` Aravind Iddamsetty
  2025-08-25 17:31       ` Zack McKevitt
  0 siblings, 1 reply; 24+ messages in thread
From: Aravind Iddamsetty @ 2025-08-21  9:45 UTC (permalink / raw)
  To: Zack McKevitt, intel-xe, dri-devel
  Cc: Alex Deucher, David Airlie, Simona Vetter, Joonas Lahtinen,
	Rodrigo Vivi, Hawking Zhang, Lijo Lazar, Michael J, Riana Tauro,
	Anshuman Gupta


On 15-08-2025 22:37, Zack McKevitt wrote:
> On 7/30/2025 12:49 AM, Aravind Iddamsetty wrote:
>> +static void drm_genl_family_init(struct drm_device *dev)
>> +{
>> +    dev->drm_genl_family = drmm_kzalloc(dev, sizeof(struct
>> genl_family),
>> +                        GFP_KERNEL);
>> +
>> +    /* Use drm primary node name eg: card0 to name the genl family */
>> +    snprintf(dev->drm_genl_family->name,
>> sizeof(dev->drm_genl_family->name),
>> +         "%s", dev->primary->kdev->kobj.name);
>> +    dev->drm_genl_family->version = DRM_GENL_VERSION;
>> +    dev->drm_genl_family->parallel_ops = true;
>> +    dev->drm_genl_family->ops = drm_genl_ops;
>> +    dev->drm_genl_family->n_ops = ARRAY_SIZE(drm_genl_ops);
>> +    dev->drm_genl_family->maxattr = DRM_ATTR_MAX;
>> +    dev->drm_genl_family->module = dev->dev->driver->owner;
>> +}
>
> We are interested in using this infrastructure at Qualcomm to
> communicate telemetry information for the AI100 accelerators. It would
> be nice if this function could support drm_minor accel nodes
> (dev->accel) as well. 

Glad to know of your interest. At present the code does create a drm
netlink family for accel devices as well, but it tries to register with
the drm primary node name, which might not be present for dev->accel.
Checking for "DRIVER_COMPUTE_ACCEL" and registering with the accel node
name will fix that.

Please also note that the current series focuses on reporting RAS
errors, so the commands are oriented towards errors as well; anything
specific to telemetry would need to be added. Do you have any thoughts
on what kind of data and what kind of command support you might need?

Thanks,
Aravind.
>
> Thanks,
>
> Zack


* Re: [RFC v5 0/5] Proposal to use netlink for RAS and Telemetry across drm subsystem
  2025-08-13 20:21 ` Rodrigo Vivi
  2025-08-15 21:24   ` Rodrigo Vivi
@ 2025-08-25  9:38   ` Aravind Iddamsetty
  1 sibling, 0 replies; 24+ messages in thread
From: Aravind Iddamsetty @ 2025-08-25  9:38 UTC (permalink / raw)
  To: Rodrigo Vivi, Dave Airlie, Joonas Lahtinen
  Cc: intel-xe, dri-devel, Alex Deucher, Simona Vetter, Hawking Zhang,
	Lijo Lazar, Michael J, Riana Tauro, Anshuman Gupta



On 14-08-2025 01:51, Rodrigo Vivi wrote:
> On Wed, Jul 30, 2025 at 12:19:51PM +0530, Aravind Iddamsetty wrote:
>> Revisiting this patch series to address pending feedback and help move
>> the discussion towards a conclusion. This revision includes updates
>> based on previous comments[1] and aims to clarify outstanding concerns.
>> Specifically added command to facility reporting errors from IP blocks
>> to support AMDGPU driver model of RAS.
>> [1]: https://lore.kernel.org/all/4cbdfcc5-5020-a942-740e-a602d4c00cc2@linux.intel.com/
>>
>> I sincerely appreciate everyones patience and thoughtful reviews so
>> far, and I hope this refreshed series facilitates the final evaluation
>> and acceptance.
>>
>> Please feel free to share any further suggestions or questions.
>>
>> Thank you for your continued consideration.
>> ----------------------------------------------------------------------
>>
>> Our hardware supports RAS(Reliability, Availability, Serviceability) by
>> reporting the errors to the host, which the KMD processes and exposes a
>> set of error counters which can be used by observability tools to take 
>> corrective actions or repairs. Traditionally there were being exposed 
>> via PMU (for relative counters) and sysfs interface (for absolute 
>> value) in our internal branch. But, due to the limitations in this 
>> approach to use two interfaces and also not able to have an event based 
>> reporting or configurability, an alternative approach to try netlink 
>> was suggested by community for drm subsystem wide UAPI for RAS and 
>> telemetry as discussed in [2]. 
>>
>> This [2] is the inspiration to this series. It uses the generic
>> netlink(genl) family subsystem and exposes a set of commands that can
>> be used by every drm driver, the framework provides a means to have
>> custom commands too. Each drm driver instance in this example xe driver
>> instance registers a family and operations to the genl subsystem through
>> which it enumerates and reports the error counters. An event based
>> notification is also supported to which userpace can subscribe to and
>> be notified when any error occurs and read the error counter this avoids
>> continuous polling on error counter. This can also be extended to
>> threshold based notification.
>>
>> [2]: https://airlied.blogspot.com/2022/09/accelerators-bof-outcomes-summary.html
> I'm bringing some thoughts below and I'd like to get inputs from folks involved
> in the original discussions here please.
> Any thought is welcomed so we can move faster towards a real GPU standard RAS
> solution.
>
>> This series is on top [3] series which introduces error counting infra in Xe
>> driver.
>> [3]: https://lore.kernel.org/all/20250730054814.1376770-1-aravind.iddamsetty@linux.intel.com/
>>
>> V5:
>> Add support to read error corresponding to an IP BLOCK
> I honestly don't believe that this version solves all the concerns raised by
> AMD folks in the previous reviews. It is true that this is bringing ways of
> reading errors per IP block, but if I understood them correctly, they would
> like better (and separate) ways to declare and handle the errors coming from
> different IP block, rather than simply reading/querying for them filtered out.
>
> So, I have som grouping ideas below.

As per the comment from Lijo,
https://lore.kernel.org/all/aa23f0ef-a4ab-ca73-5ab3-ef23d6e36e89@amd.com/

the errors are grouped by a bitmask; they are not expecting a separation
at the netlink level.

<31:24> - block id
<23:16> - subblock id
<15:8>  - instance of interest
<7:0>   - error type

The interface should support errors per IP block and instance, which
the current series supports via DRM_RAS_CMD_READ_BLOCK.
When the driver receives the DRM_RAS_CMD_READ_BLOCK command, it is
supposed to decode the bits based on the above bitmask.
The query command is expected to list the available blocks and
instances, whose counters are then read via DRM_RAS_CMD_READ_BLOCK.
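
To make the layout concrete, here is a minimal userspace sketch of how
such a 32-bit id could be packed and decoded. Only the bit positions come
from the list above; the struct, macro, and function names are
hypothetical and are not part of the posted series.

```c
#include <assert.h>
#include <stdint.h>

/* Bit positions follow the layout listed above; all names are illustrative. */
#define RAS_BLOCK_SHIFT     24
#define RAS_SUBBLOCK_SHIFT  16
#define RAS_INSTANCE_SHIFT  8
#define RAS_FIELD_MASK      0xffu

struct ras_error_id {
	uint8_t block;       /* <31:24> */
	uint8_t subblock;    /* <23:16> */
	uint8_t instance;    /* <15:8>  */
	uint8_t error_type;  /* <7:0>   */
};

/* The four 8-bit fields are expected to cover the 32-bit id exactly. */
static_assert(sizeof(struct ras_error_id) == sizeof(uint32_t),
	      "ras_error_id must be 32 bits wide");

/* Combine the four fields into the 32-bit value sent with the command. */
static uint32_t ras_error_id_pack(struct ras_error_id id)
{
	return ((uint32_t)id.block << RAS_BLOCK_SHIFT) |
	       ((uint32_t)id.subblock << RAS_SUBBLOCK_SHIFT) |
	       ((uint32_t)id.instance << RAS_INSTANCE_SHIFT) |
	       (uint32_t)id.error_type;
}

/* Split a received 32-bit value back into its four fields. */
static struct ras_error_id ras_error_id_unpack(uint32_t raw)
{
	struct ras_error_id id = {
		.block      = (raw >> RAS_BLOCK_SHIFT) & RAS_FIELD_MASK,
		.subblock   = (raw >> RAS_SUBBLOCK_SHIFT) & RAS_FIELD_MASK,
		.instance   = (raw >> RAS_INSTANCE_SHIFT) & RAS_FIELD_MASK,
		.error_type = raw & RAS_FIELD_MASK,
	};

	return id;
}
```

A driver handling DRM_RAS_CMD_READ_BLOCK would presumably do something
like the unpack step before looking up the matching counter.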
>
>> v4:
>> 1. Rebase
>> 2. rename drm_genl_send to drm_genl_reply
> But before going to the ideas below I'd like to also raise the naming issue
> that I see with this proposal.
>
> I was recently running some experiments to devlink with this and similar
> cases. I don't believe that devlink is a good fit for our drm-ras. It is
> way too much centric on network devices and any addition there to our
> GPU RAS would be a heavy lift. But, there are some good things from there
> that we could perhaps get inspiration from.
>
> Starting from the name. devlink is the name of the tool and the name
> of the framework. It uses netlink on the back, but totally abstracting
> that. Here in this version we can see:
> drm_ras: the tool
> drm_netlink: the abstraction
> drm_genl_*: the wrapper?
>
> So, I believe that as devlink we should have a single name for everything
> and avoid wrappers but providing the real module registration, with
> groups, and functions. Entirely abstracting the netlink and focusing
> on the RAS functionalities.
That sounds interesting, and I feel it looks clean too. But it does mean
we would handle the netlink framework completely inside the drm layer,
not in the driver, and expose callback ops to drm drivers.

Thanks,
Aravind.



* Re: [RFC v5 1/5] drm/netlink: Add netlink infrastructure
  2025-08-21  9:45     ` Aravind Iddamsetty
@ 2025-08-25 17:31       ` Zack McKevitt
  0 siblings, 0 replies; 24+ messages in thread
From: Zack McKevitt @ 2025-08-25 17:31 UTC (permalink / raw)
  To: Aravind Iddamsetty, intel-xe, dri-devel
  Cc: Alex Deucher, David Airlie, Simona Vetter, Joonas Lahtinen,
	Rodrigo Vivi, Hawking Zhang, Lijo Lazar, Michael J, Riana Tauro,
	Anshuman Gupta



On 8/21/2025 3:45 AM, Aravind Iddamsetty wrote:
> Glad to know the interest,  at present the code does create drm netlink
> family for accel device as well, but it is tries to register with the
> drm primary node name which might not be present for dev->accel,
> checking for the "DRIVER_COMPUTE_ACCEL" and registering with the name
> will fix that.

This is correct, trying to access dev->primary->kdev->kobj.name through 
an accel device will cause a fault. I believe your solution will work, 
and the node name can instead be retrieved via dev->accel->kdev->kobj.name.
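
A minimal userspace sketch of that selection follows. The mock_* structs
are simplified stand-ins for the kernel's struct drm_device and struct
drm_minor, MOCK_DRIVER_COMPUTE_ACCEL stands in for the real feature flag,
and drm_genl_node_name() is a hypothetical helper that does not exist in
the posted series. In the kernel, the feature test would go through
drm_core_check_feature(dev, DRIVER_COMPUTE_ACCEL).

```c
#include <assert.h>
#include <stddef.h>

/* Simplified stand-ins for the kernel structures; only the fields this
 * sketch needs are modelled. */
struct mock_kobject { const char *name; };
struct mock_device  { struct mock_kobject kobj; };
struct mock_minor   { struct mock_device *kdev; };

#define MOCK_DRIVER_COMPUTE_ACCEL (1u << 0)  /* stand-in feature bit */

struct mock_drm_device {
	unsigned int driver_features;
	struct mock_minor *primary;  /* may be NULL for accel devices */
	struct mock_minor *accel;    /* NULL for regular DRM devices */
};

/*
 * Hypothetical helper: pick the node name used for the genl family.
 * Returns NULL instead of dereferencing a missing minor.
 */
static const char *drm_genl_node_name(const struct mock_drm_device *dev)
{
	const struct mock_minor *minor;

	assert(dev != NULL);

	if (dev->driver_features & MOCK_DRIVER_COMPUTE_ACCEL)
		minor = dev->accel;
	else
		minor = dev->primary;

	if (!minor || !minor->kdev)
		return NULL;

	return minor->kdev->kobj.name;
}
```

Returning NULL lets the caller skip family registration rather than
fault when the expected minor is absent.
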
> But also to bring to your attention the current series focuses on
> reporting RAS errors and hence the commands are as well oriented towards
> errors, anything specific to telemetry needs to be added. Do you have
> any thought as to what kind of data and what kind of command support you
> might need.

Understood. We will likely be interested in RAS functionality in the 
future but thought this would be a good avenue for telemetry as well 
since our device currently has a functional RAS implementation.

An early prototype for this might add a new command for telemetry and a 
new policy with 4 new attributes: the type of telemetry to read/write 
(e.g., SoC temp), whether the request is a read or write from/to the 
device, the status of the request, and the telemetry value read or 
written. As actual telemetry fields are likely device specific, these 
can be defined in the driver's uapi header and passed opaquely through 
the netlink interface.

The above description was sufficient for an initial prototype on top of 
our driver. Mostly, however, we want to reiterate our interest in these 
changes and will keep an eye out for a future patchset that incorporates 
generated boilerplate from a YAML description.

Thanks again,

Zack

> Thanks,
> Aravind.


* Re: [RFC v5 0/5] Proposal to use netlink for RAS and Telemetry across drm subsystem
  2025-08-15 21:24   ` Rodrigo Vivi
@ 2025-08-26  4:42     ` Aravind Iddamsetty
  0 siblings, 0 replies; 24+ messages in thread
From: Aravind Iddamsetty @ 2025-08-26  4:42 UTC (permalink / raw)
  To: Rodrigo Vivi, Dave Airlie, Joonas Lahtinen
  Cc: intel-xe, dri-devel, Alex Deucher, Simona Vetter, Hawking Zhang,
	Lijo Lazar, Michael J, Riana Tauro, Anshuman Gupta


On 16-08-2025 02:54, Rodrigo Vivi wrote:
> On Wed, Aug 13, 2025 at 04:21:03PM -0400, Rodrigo Vivi wrote:
>> On Wed, Jul 30, 2025 at 12:19:51PM +0530, Aravind Iddamsetty wrote:
>>> Revisiting this patch series to address pending feedback and help move
>>> the discussion towards a conclusion. This revision includes updates
>>> based on previous comments[1] and aims to clarify outstanding concerns.
>>> Specifically added command to facility reporting errors from IP blocks
>>> to support AMDGPU driver model of RAS.
>>> [1]: https://lore.kernel.org/all/4cbdfcc5-5020-a942-740e-a602d4c00cc2@linux.intel.com/
>>>
>>> I sincerely appreciate everyones patience and thoughtful reviews so
>>> far, and I hope this refreshed series facilitates the final evaluation
>>> and acceptance.
>>>
>>> Please feel free to share any further suggestions or questions.
>>>
>>> Thank you for your continued consideration.
>>> ----------------------------------------------------------------------
>>>
>>> Our hardware supports RAS(Reliability, Availability, Serviceability) by
>>> reporting the errors to the host, which the KMD processes and exposes a
>>> set of error counters which can be used by observability tools to take 
>>> corrective actions or repairs. Traditionally there were being exposed 
>>> via PMU (for relative counters) and sysfs interface (for absolute 
>>> value) in our internal branch. But, due to the limitations in this 
>>> approach to use two interfaces and also not able to have an event based 
>>> reporting or configurability, an alternative approach to try netlink 
>>> was suggested by community for drm subsystem wide UAPI for RAS and 
>>> telemetry as discussed in [2]. 
>>>
>>> This [2] is the inspiration to this series. It uses the generic
>>> netlink(genl) family subsystem and exposes a set of commands that can
>>> be used by every drm driver, the framework provides a means to have
>>> custom commands too. Each drm driver instance in this example xe driver
>>> instance registers a family and operations to the genl subsystem through
>>> which it enumerates and reports the error counters. An event based
>>> notification is also supported to which userpace can subscribe to and
>>> be notified when any error occurs and read the error counter this avoids
>>> continuous polling on error counter. This can also be extended to
>>> threshold based notification.
>>>
>>> [2]: https://airlied.blogspot.com/2022/09/accelerators-bof-outcomes-summary.html
>> I'm bringing some thoughts below and I'd like to get inputs from folks involved
>> in the original discussions here please.
>> Any thought is welcomed so we can move faster towards a real GPU standard RAS
>> solution.
>>
>>> This series is on top [3] series which introduces error counting infra in Xe
>>> driver.
>>> [3]: https://lore.kernel.org/all/20250730054814.1376770-1-aravind.iddamsetty@linux.intel.com/
>>>
>>> V5:
>>> Add support to read error corresponding to an IP BLOCK
>> I honestly don't believe that this version solves all the concerns raised by
>> AMD folks in the previous reviews. It is true that this is bringing ways of
>> reading errors per IP block, but if I understood them correctly, they would
>> like better (and separate) ways to declare and handle the errors coming from
>> different IP block, rather than simply reading/querying for them filtered out.
>>
>> So, I have som grouping ideas below.
>>
>>> v4:
>>> 1. Rebase
>>> 2. rename drm_genl_send to drm_genl_reply
>> But before going to the ideas below I'd like to also raise the naming issue
>> that I see with this proposal.
>>
>> I was recently running some experiments to devlink with this and similar
>> cases. I don't believe that devlink is a good fit for our drm-ras. It is
>> way too much centric on network devices and any addition there to our
>> GPU RAS would be a heavy lift. But, there are some good things from there
>> that we could perhaps get inspiration from.
>>
>> Starting from the name. devlink is the name of the tool and the name
>> of the framework. It uses netlink on the back, but totally abstracting
>> that. Here in this version we can see:
>> drm_ras: the tool
>> drm_netlink: the abstraction
>> drm_genl_*: the wrapper?
>>
>> So, I believe that as devlink we should have a single name for everything
>> and avoid wrappers but providing the real module registration, with
>> groups, and functions. Entirely abstracting the netlink and focusing
>> on the RAS functionalities.
>>
>> I'm terrible with naming, but playing a bit with AI for some suggestions,
>> I'd say that my favorites are:
>> drmras - no '_' like most of the tools, but not only for the tool, but also for
>> the files and functions.
>> drmlink - more link, but less ras :/
>> grill - GPU RAS Interface Link Layer
>>
>> For the rest of the examples below I'm going with grill, but let me know your
>> preferences.
>>
>>> 3. catch error from xa_store and handle appropriately
>>> 4. presently xe_list_errors fills blank data for IGFX, prevent it by
>>> having an early check of IS_DGFX (Michael J. Ruhl)
>>>
>>> v3:
>>> 1. Rebase on latest RAS series for XE
>>> 2. drop DRIVER_NETLINK flag and use the driver_genl_ops structure to
>>> register to netlink subsystem
>>>
>>> v2: define common interfaces to genl netlink subsystem that all drm drivers
>>> can leverage.
>>>
>>> Below is an example tool drm_ras which demonstrates the use of the
>>> supported commands. The tool will be sent to ML with the subject
>>> "[RFC i-g-t v3 0/1] A tool to demonstrate use of netlink sockets to read RAS error counters"
>>> https://lore.kernel.org/all/20250730061342.1380217-2-aravind.iddamsetty@linux.intel.com/
>>>
>>> read single error counter:
>>>
>>> $ ./drm_ras READ_ONE --device=drm:/dev/dri/card1 --error_id=0x0000000000000005
>>> counter value 0
>> no need for --device, that should be mandatory argument.
>> And we could accept BDF or card identification
>>
>> $ grill list
>> 00:03:00.0 - card0 - xe
>>
>> $ grill 00:03:00.0 list # Querying available modules.
>> monitor - global
>> erros - gt
>> erros - soc
>>
>> Yes, my idea is that driver should be able to register modules and group per module
> Please allow me to emphasize here that the group registration
> is just to make this extensible and also to accommodate the AMD case,
> but not change the original essence of the goal which is to
> create the drm-ras solution.
I'm not sure I correctly understood the group registration you are
referring to.
As I mentioned earlier, AMD's request is not to register a group of IPs
as a separate netlink family, but rather to have the ability to present
the errors at IP block level with the commands this series proposes.

IIUC the modules have different functionalities. I believe having them
as separate netlink families might be overkill; I believe we can support
them as commands. Looking at devlink too, flash update and health are
all commands; they are not registered as separate families.
>
>> GRILL would be designed to accommodate multiple kinds of RAS modules, each module,
>> with groups, categories and operations.
> also let me take back on the naming here.
>
> Please let's go with the obvious drm_ras.
>
> Perhaps drm_ras for the code here and drmras (one-word) for the IGT tool.
>
>> Modules: monitor, error, flash?!, etc?!
> Here, please let me also tune it down a bit. The overall goal continue
> to be the creation of our drm-ras framework using netlink to report
> error counters and events.
>
> Any addition is a good-to-have, but shouldn't delay the main goal.
> Also, any addition should be carefully reviewed individually in the
> future. My only wish here at this moment is to think from the very
> beginning in something that is expansible.
Agreed. As you mentioned, let's focus on error reporting and get it
going; as we design it, we can plan to make it extensible to any future
requirements.

Thanks,
Aravind.
>
> Thanks,
> Rodrigo.
>
>> Groups: Global or per IP block depending on the HW underneath
>> Categories: Sub-groups like correctable-error vs uncorrectable-error for instance if/where
>> 	    it makes sense.
>> Operations: Monitor: set-threshold / listen (listen is just a tool operation, but every monitor
>> 	    needs to provide events over netlink)
>> 	    Error: read, clear, logs
>>
>>
>> $ grill 00:03:00.0 error global counter list
>> # List all available counters in this gpu
>>
>> $ grill 00:03:00.0 error global counter show soc_fatal_hbm2_chnl0
>> # Show a specific counter.
>>
>> $ grill 00:03:00.0 error global log
>> # Print all the stashed CPER logs (stash can be hw/fw/sw or a combination -
>>   	    		     	   in AMD case it is a dump of their debugfs ring)
>>
>>
>> So, I'm sure the next question is what if the log is global, but the counters
>> are not? Well, perhaps we should have different Modules for error-counter
>> split from error-logging ?!
>>
>> So yes, my thoughts still have some opens, but I'd like to hear your thoughts
>> and opinions on the overall idea here.
>>
>> Thanks in advance,
>> Rodrigo.
>>
>>> read all error counters:
>>>
>>> $ ./drm_ras READ_ALL --device=drm:/dev/dri/card1
>>> name                                                    config-id               counter
>>>
>>> error-gt0-correctable-guc                               0x0000000000000001      0
>>> error-gt0-correctable-slm                               0x0000000000000003      0
>>> error-gt0-correctable-eu-ic                             0x0000000000000004      0
>>> error-gt0-correctable-eu-grf                            0x0000000000000005      0
>>> error-gt0-fatal-guc                                     0x0000000000000009      0
>>> error-gt0-fatal-slm                                     0x000000000000000d      0
>>> error-gt0-fatal-eu-grf                                  0x000000000000000f      0
>>> error-gt0-fatal-fpu                                     0x0000000000000010      0
>>> error-gt0-fatal-tlb                                     0x0000000000000011      0
>>> error-gt0-fatal-l3-fabric                               0x0000000000000012      0
>>> error-gt0-correctable-subslice                          0x0000000000000013      0
>>> error-gt0-correctable-l3bank                            0x0000000000000014      0
>>> error-gt0-fatal-subslice                                0x0000000000000015      0
>>> error-gt0-fatal-l3bank                                  0x0000000000000016      0
>>> error-gt0-sgunit-correctable                            0x0000000000000017      0
>>> error-gt0-sgunit-nonfatal                               0x0000000000000018      0
>>> error-gt0-sgunit-fatal                                  0x0000000000000019      0
>>> error-gt0-soc-fatal-psf-csc-0                           0x000000000000001a      0
>>> error-gt0-soc-fatal-psf-csc-1                           0x000000000000001b      0
>>> error-gt0-soc-fatal-psf-csc-2                           0x000000000000001c      0
>>> error-gt0-soc-fatal-punit                               0x000000000000001d      0
>>> error-gt0-soc-fatal-psf-0                               0x000000000000001e      0
>>> error-gt0-soc-fatal-psf-1                               0x000000000000001f      0
>>> error-gt0-soc-fatal-psf-2                               0x0000000000000020      0
>>> error-gt0-soc-fatal-cd0                                 0x0000000000000021      0
>>> error-gt0-soc-fatal-cd0-mdfi                            0x0000000000000022      0
>>> error-gt0-soc-fatal-mdfi-east                           0x0000000000000023      0
>>> error-gt0-soc-fatal-mdfi-south                          0x0000000000000024      0
>>> error-gt0-soc-fatal-hbm-ss0-0                           0x0000000000000025      0
>>> error-gt0-soc-fatal-hbm-ss0-1                           0x0000000000000026      0
>>> error-gt0-soc-fatal-hbm-ss0-2                           0x0000000000000027      0
>>> error-gt0-soc-fatal-hbm-ss0-3                           0x0000000000000028      0
>>> error-gt0-soc-fatal-hbm-ss0-4                           0x0000000000000029      0
>>> error-gt0-soc-fatal-hbm-ss0-5                           0x000000000000002a      0
>>> error-gt0-soc-fatal-hbm-ss0-6                           0x000000000000002b      0
>>> error-gt0-soc-fatal-hbm-ss0-7                           0x000000000000002c      0
>>> error-gt0-soc-fatal-hbm-ss1-0                           0x000000000000002d      0
>>> error-gt0-soc-fatal-hbm-ss1-1                           0x000000000000002e      0
>>> error-gt0-soc-fatal-hbm-ss1-2                           0x000000000000002f      0
>>> error-gt0-soc-fatal-hbm-ss1-3                           0x0000000000000030      0
>>> error-gt0-soc-fatal-hbm-ss1-4                           0x0000000000000031      0
>>> error-gt0-soc-fatal-hbm-ss1-5                           0x0000000000000032      0
>>> error-gt0-soc-fatal-hbm-ss1-6                           0x0000000000000033      0
>>> error-gt0-soc-fatal-hbm-ss1-7                           0x0000000000000034      0
>>> error-gt0-soc-fatal-hbm-ss2-0                           0x0000000000000035      0
>>> error-gt0-soc-fatal-hbm-ss2-1                           0x0000000000000036      0
>>> error-gt0-soc-fatal-hbm-ss2-2                           0x0000000000000037      0
>>> error-gt0-soc-fatal-hbm-ss2-3                           0x0000000000000038      0
>>> error-gt0-soc-fatal-hbm-ss2-4                           0x0000000000000039      0
>>> error-gt0-soc-fatal-hbm-ss2-5                           0x000000000000003a      0
>>> error-gt0-soc-fatal-hbm-ss2-6                           0x000000000000003b      0
>>> error-gt0-soc-fatal-hbm-ss2-7                           0x000000000000003c      0
>>> error-gt0-soc-fatal-hbm-ss3-0                           0x000000000000003d      0
>>> error-gt0-soc-fatal-hbm-ss3-1                           0x000000000000003e      0
>>> error-gt0-soc-fatal-hbm-ss3-2                           0x000000000000003f      0
>>> error-gt0-soc-fatal-hbm-ss3-3                           0x0000000000000040      0
>>> error-gt0-soc-fatal-hbm-ss3-4                           0x0000000000000041      0
>>> error-gt0-soc-fatal-hbm-ss3-5                           0x0000000000000042      0
>>> error-gt0-soc-fatal-hbm-ss3-6                           0x0000000000000043      0
>>> error-gt0-soc-fatal-hbm-ss3-7                           0x0000000000000044      0
>>> error-gt0-gsc-correctable-sram-ecc                      0x0000000000000045      0
>>> error-gt0-gsc-nonfatal-mia-shutdown                     0x0000000000000046      0
>>> error-gt0-gsc-nonfatal-mia-int                          0x0000000000000047      0
>>> error-gt0-gsc-nonfatal-sram-ecc                         0x0000000000000048      0
>>> error-gt0-gsc-nonfatal-wdg-timeout                      0x0000000000000049      0
>>> error-gt0-gsc-nonfatal-rom-parity                       0x000000000000004a      0
>>> error-gt0-gsc-nonfatal-ucode-parity                     0x000000000000004b      0
>>> error-gt0-gsc-nonfatal-glitch-det                       0x000000000000004c      0
>>> error-gt0-gsc-nonfatal-fuse-pull                        0x000000000000004d      0
>>> error-gt0-gsc-nonfatal-fuse-crc-check                   0x000000000000004e      0
>>> error-gt0-gsc-nonfatal-selfmbist                        0x000000000000004f      0
>>> error-gt0-gsc-nonfatal-aon-parity                       0x0000000000000050      0
>>> error-gt1-correctable-guc                               0x1000000000000001      0
>>> error-gt1-correctable-slm                               0x1000000000000003      0
>>> error-gt1-correctable-eu-ic                             0x1000000000000004      0
>>> error-gt1-correctable-eu-grf                            0x1000000000000005      0
>>> error-gt1-fatal-guc                                     0x1000000000000009      0
>>> error-gt1-fatal-slm                                     0x100000000000000d      0
>>> error-gt1-fatal-eu-grf                                  0x100000000000000f      0
>>> error-gt1-fatal-fpu                                     0x1000000000000010      0
>>> error-gt1-fatal-tlb                                     0x1000000000000011      0
>>> error-gt1-fatal-l3-fabric                               0x1000000000000012      0
>>> error-gt1-correctable-subslice                          0x1000000000000013      0
>>> error-gt1-correctable-l3bank                            0x1000000000000014      0
>>> error-gt1-fatal-subslice                                0x1000000000000015      0
>>> error-gt1-fatal-l3bank                                  0x1000000000000016      0
>>> error-gt1-sgunit-correctable                            0x1000000000000017      0
>>> error-gt1-sgunit-nonfatal                               0x1000000000000018      0
>>> error-gt1-sgunit-fatal                                  0x1000000000000019      0
>>> error-gt1-soc-fatal-psf-csc-0                           0x100000000000001a      0
>>> error-gt1-soc-fatal-psf-csc-1                           0x100000000000001b      0
>>> error-gt1-soc-fatal-psf-csc-2                           0x100000000000001c      0
>>> error-gt1-soc-fatal-punit                               0x100000000000001d      0
>>> error-gt1-soc-fatal-psf-0                               0x100000000000001e      0
>>> error-gt1-soc-fatal-psf-1                               0x100000000000001f      0
>>> error-gt1-soc-fatal-psf-2                               0x1000000000000020      0
>>> error-gt1-soc-fatal-cd0                                 0x1000000000000021      0
>>> error-gt1-soc-fatal-cd0-mdfi                            0x1000000000000022      0
>>> error-gt1-soc-fatal-mdfi-east                           0x1000000000000023      0
>>> error-gt1-soc-fatal-mdfi-south                          0x1000000000000024      0
>>> error-gt1-soc-fatal-hbm-ss0-0                           0x1000000000000025      0
>>> error-gt1-soc-fatal-hbm-ss0-1                           0x1000000000000026      0
>>> error-gt1-soc-fatal-hbm-ss0-2                           0x1000000000000027      0
>>> error-gt1-soc-fatal-hbm-ss0-3                           0x1000000000000028      0
>>> error-gt1-soc-fatal-hbm-ss0-4                           0x1000000000000029      0
>>> error-gt1-soc-fatal-hbm-ss0-5                           0x100000000000002a      0
>>> error-gt1-soc-fatal-hbm-ss0-6                           0x100000000000002b      0
>>> error-gt1-soc-fatal-hbm-ss0-7                           0x100000000000002c      0
>>> error-gt1-soc-fatal-hbm-ss1-0                           0x100000000000002d      0
>>> error-gt1-soc-fatal-hbm-ss1-1                           0x100000000000002e      0
>>> error-gt1-soc-fatal-hbm-ss1-2                           0x100000000000002f      0
>>> error-gt1-soc-fatal-hbm-ss1-3                           0x1000000000000030      0
>>> error-gt1-soc-fatal-hbm-ss1-4                           0x1000000000000031      0
>>> error-gt1-soc-fatal-hbm-ss1-5                           0x1000000000000032      0
>>> error-gt1-soc-fatal-hbm-ss1-6                           0x1000000000000033      0
>>> error-gt1-soc-fatal-hbm-ss1-7                           0x1000000000000034      0
>>> error-gt1-soc-fatal-hbm-ss2-0                           0x1000000000000035      0
>>> error-gt1-soc-fatal-hbm-ss2-1                           0x1000000000000036      0
>>> error-gt1-soc-fatal-hbm-ss2-2                           0x1000000000000037      0
>>> error-gt1-soc-fatal-hbm-ss2-3                           0x1000000000000038      0
>>> error-gt1-soc-fatal-hbm-ss2-4                           0x1000000000000039      0
>>> error-gt1-soc-fatal-hbm-ss2-5                           0x100000000000003a      0
>>> error-gt1-soc-fatal-hbm-ss2-6                           0x100000000000003b      0
>>> error-gt1-soc-fatal-hbm-ss2-7                           0x100000000000003c      0
>>> error-gt1-soc-fatal-hbm-ss3-0                           0x100000000000003d      0
>>> error-gt1-soc-fatal-hbm-ss3-1                           0x100000000000003e      0
>>> error-gt1-soc-fatal-hbm-ss3-2                           0x100000000000003f      0
>>> error-gt1-soc-fatal-hbm-ss3-3                           0x1000000000000040      0
>>> error-gt1-soc-fatal-hbm-ss3-4                           0x1000000000000041      0
>>> error-gt1-soc-fatal-hbm-ss3-5                           0x1000000000000042      0
>>> error-gt1-soc-fatal-hbm-ss3-6                           0x1000000000000043      0
>>> error-gt1-soc-fatal-hbm-ss3-7                           0x1000000000000044      0
>>>
>>> wait on a error event:
>>>
>>> $ ./drm_ras WAIT_ON_EVENT --device=drm:/dev/dri/card1
>>> waiting for error event
>>> error event received
>>> counter value 0
>>>
>>> list all errors:
>>>
>>> $ ./drm_ras LIST_ERRORS --device=drm:/dev/dri/card1
>>> name                                                    config-id
>>>
>>> error-gt0-correctable-guc                               0x0000000000000001
>>> error-gt0-correctable-slm                               0x0000000000000003
>>> error-gt0-correctable-eu-ic                             0x0000000000000004
>>> error-gt0-correctable-eu-grf                            0x0000000000000005
>>> error-gt0-fatal-guc                                     0x0000000000000009
>>> error-gt0-fatal-slm                                     0x000000000000000d
>>> error-gt0-fatal-eu-grf                                  0x000000000000000f
>>> error-gt0-fatal-fpu                                     0x0000000000000010
>>> error-gt0-fatal-tlb                                     0x0000000000000011
>>> error-gt0-fatal-l3-fabric                               0x0000000000000012
>>> error-gt0-correctable-subslice                          0x0000000000000013
>>> error-gt0-correctable-l3bank                            0x0000000000000014
>>> error-gt0-fatal-subslice                                0x0000000000000015
>>> error-gt0-fatal-l3bank                                  0x0000000000000016
>>> error-gt0-sgunit-correctable                            0x0000000000000017
>>> error-gt0-sgunit-nonfatal                               0x0000000000000018
>>> error-gt0-sgunit-fatal                                  0x0000000000000019
>>> error-gt0-soc-fatal-psf-csc-0                           0x000000000000001a
>>> error-gt0-soc-fatal-psf-csc-1                           0x000000000000001b
>>> error-gt0-soc-fatal-psf-csc-2                           0x000000000000001c
>>> error-gt0-soc-fatal-punit                               0x000000000000001d
>>> error-gt0-soc-fatal-psf-0                               0x000000000000001e
>>> error-gt0-soc-fatal-psf-1                               0x000000000000001f
>>> error-gt0-soc-fatal-psf-2                               0x0000000000000020
>>> error-gt0-soc-fatal-cd0                                 0x0000000000000021
>>> error-gt0-soc-fatal-cd0-mdfi                            0x0000000000000022
>>> error-gt0-soc-fatal-mdfi-east                           0x0000000000000023
>>> error-gt0-soc-fatal-mdfi-south                          0x0000000000000024
>>> error-gt0-soc-fatal-hbm-ss0-0                           0x0000000000000025
>>> error-gt0-soc-fatal-hbm-ss0-1                           0x0000000000000026
>>> error-gt0-soc-fatal-hbm-ss0-2                           0x0000000000000027
>>> error-gt0-soc-fatal-hbm-ss0-3                           0x0000000000000028
>>> error-gt0-soc-fatal-hbm-ss0-4                           0x0000000000000029
>>> error-gt0-soc-fatal-hbm-ss0-5                           0x000000000000002a
>>> error-gt0-soc-fatal-hbm-ss0-6                           0x000000000000002b
>>> error-gt0-soc-fatal-hbm-ss0-7                           0x000000000000002c
>>> error-gt0-soc-fatal-hbm-ss1-0                           0x000000000000002d
>>> error-gt0-soc-fatal-hbm-ss1-1                           0x000000000000002e
>>> error-gt0-soc-fatal-hbm-ss1-2                           0x000000000000002f
>>> error-gt0-soc-fatal-hbm-ss1-3                           0x0000000000000030
>>> error-gt0-soc-fatal-hbm-ss1-4                           0x0000000000000031
>>> error-gt0-soc-fatal-hbm-ss1-5                           0x0000000000000032
>>> error-gt0-soc-fatal-hbm-ss1-6                           0x0000000000000033
>>> error-gt0-soc-fatal-hbm-ss1-7                           0x0000000000000034
>>> error-gt0-soc-fatal-hbm-ss2-0                           0x0000000000000035
>>> error-gt0-soc-fatal-hbm-ss2-1                           0x0000000000000036
>>> error-gt0-soc-fatal-hbm-ss2-2                           0x0000000000000037
>>> error-gt0-soc-fatal-hbm-ss2-3                           0x0000000000000038
>>> error-gt0-soc-fatal-hbm-ss2-4                           0x0000000000000039
>>> error-gt0-soc-fatal-hbm-ss2-5                           0x000000000000003a
>>> error-gt0-soc-fatal-hbm-ss2-6                           0x000000000000003b
>>> error-gt0-soc-fatal-hbm-ss2-7                           0x000000000000003c
>>> error-gt0-soc-fatal-hbm-ss3-0                           0x000000000000003d
>>> error-gt0-soc-fatal-hbm-ss3-1                           0x000000000000003e
>>> error-gt0-soc-fatal-hbm-ss3-2                           0x000000000000003f
>>> error-gt0-soc-fatal-hbm-ss3-3                           0x0000000000000040
>>> error-gt0-soc-fatal-hbm-ss3-4                           0x0000000000000041
>>> error-gt0-soc-fatal-hbm-ss3-5                           0x0000000000000042
>>> error-gt0-soc-fatal-hbm-ss3-6                           0x0000000000000043
>>> error-gt0-soc-fatal-hbm-ss3-7                           0x0000000000000044
>>> error-gt0-gsc-correctable-sram-ecc                      0x0000000000000045
>>> error-gt0-gsc-nonfatal-mia-shutdown                     0x0000000000000046
>>> error-gt0-gsc-nonfatal-mia-int                          0x0000000000000047
>>> error-gt0-gsc-nonfatal-sram-ecc                         0x0000000000000048
>>> error-gt0-gsc-nonfatal-wdg-timeout                      0x0000000000000049
>>> error-gt0-gsc-nonfatal-rom-parity                       0x000000000000004a
>>> error-gt0-gsc-nonfatal-ucode-parity                     0x000000000000004b
>>> error-gt0-gsc-nonfatal-glitch-det                       0x000000000000004c
>>> error-gt0-gsc-nonfatal-fuse-pull                        0x000000000000004d
>>> error-gt0-gsc-nonfatal-fuse-crc-check                   0x000000000000004e
>>> error-gt0-gsc-nonfatal-selfmbist                        0x000000000000004f
>>> error-gt0-gsc-nonfatal-aon-parity                       0x0000000000000050
>>> error-gt1-correctable-guc                               0x1000000000000001
>>> error-gt1-correctable-slm                               0x1000000000000003
>>> error-gt1-correctable-eu-ic                             0x1000000000000004
>>> error-gt1-correctable-eu-grf                            0x1000000000000005
>>> error-gt1-fatal-guc                                     0x1000000000000009
>>> error-gt1-fatal-slm                                     0x100000000000000d
>>> error-gt1-fatal-eu-grf                                  0x100000000000000f
>>> error-gt1-fatal-fpu                                     0x1000000000000010
>>> error-gt1-fatal-tlb                                     0x1000000000000011
>>> error-gt1-fatal-l3-fabric                               0x1000000000000012
>>> error-gt1-correctable-subslice                          0x1000000000000013
>>> error-gt1-correctable-l3bank                            0x1000000000000014
>>> error-gt1-fatal-subslice                                0x1000000000000015
>>> error-gt1-fatal-l3bank                                  0x1000000000000016
>>> error-gt1-sgunit-correctable                            0x1000000000000017
>>> error-gt1-sgunit-nonfatal                               0x1000000000000018
>>> error-gt1-sgunit-fatal                                  0x1000000000000019
>>> error-gt1-soc-fatal-psf-csc-0                           0x100000000000001a
>>> error-gt1-soc-fatal-psf-csc-1                           0x100000000000001b
>>> error-gt1-soc-fatal-psf-csc-2                           0x100000000000001c
>>> error-gt1-soc-fatal-punit                               0x100000000000001d
>>> error-gt1-soc-fatal-psf-0                               0x100000000000001e
>>> error-gt1-soc-fatal-psf-1                               0x100000000000001f
>>> error-gt1-soc-fatal-psf-2                               0x1000000000000020
>>> error-gt1-soc-fatal-cd0                                 0x1000000000000021
>>> error-gt1-soc-fatal-cd0-mdfi                            0x1000000000000022
>>> error-gt1-soc-fatal-mdfi-east                           0x1000000000000023
>>> error-gt1-soc-fatal-mdfi-south                          0x1000000000000024
>>> error-gt1-soc-fatal-hbm-ss0-0                           0x1000000000000025
>>> error-gt1-soc-fatal-hbm-ss0-1                           0x1000000000000026
>>> error-gt1-soc-fatal-hbm-ss0-2                           0x1000000000000027
>>> error-gt1-soc-fatal-hbm-ss0-3                           0x1000000000000028
>>> error-gt1-soc-fatal-hbm-ss0-4                           0x1000000000000029
>>> error-gt1-soc-fatal-hbm-ss0-5                           0x100000000000002a
>>> error-gt1-soc-fatal-hbm-ss0-6                           0x100000000000002b
>>> error-gt1-soc-fatal-hbm-ss0-7                           0x100000000000002c
>>> error-gt1-soc-fatal-hbm-ss1-0                           0x100000000000002d
>>> error-gt1-soc-fatal-hbm-ss1-1                           0x100000000000002e
>>> error-gt1-soc-fatal-hbm-ss1-2                           0x100000000000002f
>>> error-gt1-soc-fatal-hbm-ss1-3                           0x1000000000000030
>>> error-gt1-soc-fatal-hbm-ss1-4                           0x1000000000000031
>>> error-gt1-soc-fatal-hbm-ss1-5                           0x1000000000000032
>>> error-gt1-soc-fatal-hbm-ss1-6                           0x1000000000000033
>>> error-gt1-soc-fatal-hbm-ss1-7                           0x1000000000000034
>>> error-gt1-soc-fatal-hbm-ss2-0                           0x1000000000000035
>>> error-gt1-soc-fatal-hbm-ss2-1                           0x1000000000000036
>>> error-gt1-soc-fatal-hbm-ss2-2                           0x1000000000000037
>>> error-gt1-soc-fatal-hbm-ss2-3                           0x1000000000000038
>>> error-gt1-soc-fatal-hbm-ss2-4                           0x1000000000000039
>>> error-gt1-soc-fatal-hbm-ss2-5                           0x100000000000003a
>>> error-gt1-soc-fatal-hbm-ss2-6                           0x100000000000003b
>>> error-gt1-soc-fatal-hbm-ss2-7                           0x100000000000003c
>>> error-gt1-soc-fatal-hbm-ss3-0                           0x100000000000003d
>>> error-gt1-soc-fatal-hbm-ss3-1                           0x100000000000003e
>>> error-gt1-soc-fatal-hbm-ss3-2                           0x100000000000003f
>>> error-gt1-soc-fatal-hbm-ss3-3                           0x1000000000000040
>>> error-gt1-soc-fatal-hbm-ss3-4                           0x1000000000000041
>>> error-gt1-soc-fatal-hbm-ss3-5                           0x1000000000000042
>>> error-gt1-soc-fatal-hbm-ss3-6                           0x1000000000000043
>>> error-gt1-soc-fatal-hbm-ss3-7                           0x1000000000000044
>>>
>>> Cc: Alex Deucher <alexander.deucher@amd.com>
>>> Cc: David Airlie <airlied@gmail.com>
>>> Cc: Simona Vetter <simona@ffwll.ch>
>>> Cc: Joonas Lahtinen <joonas.lahtinen@linux.intel.com>
>>> Cc: Rodrigo Vivi <rodrigo.vivi@intel.com>
>>> Cc: Hawking Zhang <Hawking.Zhang@amd.com>
>>> Cc: Lijo Lazar <lijo.lazar@amd.com>
>>> Cc: Ruhl, Michael J <michael.j.ruhl@intel.com>
>>> Cc: Riana Tauro <riana.tauro@intel.com>
>>> Cc: Anshuman Gupta <anshuman.gupta@intel.com>
>>>
>>>
>>> Aravind Iddamsetty (5):
>>>   drm/netlink: Add netlink infrastructure
>>>   drm/xe/RAS: Register netlink capability
>>>   drm/xe/RAS: Expose the error counters
>>>   drm/netlink: Define multicast groups
>>>   drm/xe/RAS: send multicast event on occurrence of an error
>>>
>>>  drivers/gpu/drm/Makefile             |   1 +
>>>  drivers/gpu/drm/drm_drv.c            |   7 +
>>>  drivers/gpu/drm/drm_netlink.c        | 219 +++++++++++
>>>  drivers/gpu/drm/xe/Makefile          |   2 +
>>>  drivers/gpu/drm/xe/xe_device.c       |   6 +
>>>  drivers/gpu/drm/xe/xe_device_types.h |   1 +
>>>  drivers/gpu/drm/xe/xe_hw_error.c     |  56 ++-
>>>  drivers/gpu/drm/xe/xe_netlink.c      | 531 +++++++++++++++++++++++++++
>>>  include/drm/drm_device.h             |  10 +
>>>  include/drm/drm_drv.h                |   7 +
>>>  include/drm/drm_netlink.h            |  46 +++
>>>  include/uapi/drm/drm_netlink.h       | 105 ++++++
>>>  include/uapi/drm/xe_drm.h            |  85 +++++
>>>  13 files changed, 1071 insertions(+), 5 deletions(-)
>>>  create mode 100644 drivers/gpu/drm/drm_netlink.c
>>>  create mode 100644 drivers/gpu/drm/xe/xe_netlink.c
>>>  create mode 100644 include/drm/drm_netlink.h
>>>  create mode 100644 include/uapi/drm/drm_netlink.h
>>>
>>> -- 
>>> 2.25.1
>>>
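
Reading the LIST_ERRORS output above, the gt1 config-ids look like the gt0
ones with bit 60 set (0x44 versus 0x1000000000000044). A minimal sketch of
that packing follows, assuming the GT index occupies the top four bits. The
macro and helper names are made up for illustration, and the field width is a
guess from the listing, not something taken from the xe uapi header.

```c
#include <assert.h>
#include <stdint.h>

/* Assumption: GT index lives in bits 63:60, inferred from the listing only. */
#define MOCK_GT_SHIFT   60
#define MOCK_GT_MAX     16
#define MOCK_ERROR_MASK ((UINT64_C(1) << MOCK_GT_SHIFT) - 1)

/* Extract the GT index from a config-id. */
static inline unsigned int mock_config_to_gt(uint64_t config)
{
	return (unsigned int)(config >> MOCK_GT_SHIFT);
}

/* Extract the per-GT error index from a config-id. */
static inline uint64_t mock_config_to_error(uint64_t config)
{
	return config & MOCK_ERROR_MASK;
}

/* Build a config-id from a GT index and a per-GT error index. */
static inline uint64_t mock_make_config(unsigned int gt, uint64_t error)
{
	assert(gt < MOCK_GT_MAX);
	return ((uint64_t)gt << MOCK_GT_SHIFT) | (error & MOCK_ERROR_MASK);
}
```

With these helpers, a tool could look up the same error on another GT by
rebuilding the id with a different gt argument instead of keeping a second
table.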


* Re: [RFC v5 1/5] drm/netlink: Add netlink infrastructure
  2025-08-15 21:48   ` Rodrigo Vivi
@ 2025-08-26  5:58     ` Aravind Iddamsetty
  0 siblings, 0 replies; 24+ messages in thread
From: Aravind Iddamsetty @ 2025-08-26  5:58 UTC (permalink / raw)
  To: Rodrigo Vivi
  Cc: intel-xe, dri-devel, Alex Deucher, David Airlie, Simona Vetter,
	Joonas Lahtinen, Hawking Zhang, Lijo Lazar, Michael J,
	Riana Tauro, Anshuman Gupta


On 16-08-2025 03:18, Rodrigo Vivi wrote:
> On Wed, Jul 30, 2025 at 12:19:52PM +0530, Aravind Iddamsetty wrote:
>> Define the netlink registration interface, commands and attributes that
>> can be commonly used across drm drivers. This patch intends to use
>> the generic netlink family to expose various stats of device. At present
>> it defines some commands that shall be used to expose RAS error counters.
>>
>> v2:
>> define common interfaces to genl netlink subsystem that all drm drivers
>> can leverage.(Tomer Tayar)
>>
>> v3: drop DRIVER_NETLINK flag and use the driver_genl_ops structure to
>> register to netlink subsystem (Daniel Vetter)
>>
>> v4:(Michael J. Ruhl)
>> 1. rename drm_genl_send to drm_genl_reply
>> 2. catch error from xa_store and handle appropriately
>>
>> v5:
>> 1. compile only if CONFIG_NET is enabled
>>
>> V6: Add support for reading an IP block errors
>>
>> Reviewed-by: Michael J. Ruhl <michael.j.ruhl@intel.com> #v4
>> Signed-off-by: Aravind Iddamsetty <aravind.iddamsetty@linux.intel.com>
>> ---
>>  drivers/gpu/drm/Makefile       |   1 +
>>  drivers/gpu/drm/drm_drv.c      |   7 ++
>>  drivers/gpu/drm/drm_netlink.c  | 212 +++++++++++++++++++++++++++++++++
>>  include/drm/drm_device.h       |  10 ++
>>  include/drm/drm_drv.h          |   7 ++
>>  include/drm/drm_netlink.h      |  41 +++++++
>>  include/uapi/drm/drm_netlink.h | 101 ++++++++++++++++
>>  7 files changed, 379 insertions(+)
>>  create mode 100644 drivers/gpu/drm/drm_netlink.c
>>  create mode 100644 include/drm/drm_netlink.h
>>  create mode 100644 include/uapi/drm/drm_netlink.h
>>
>> diff --git a/drivers/gpu/drm/Makefile b/drivers/gpu/drm/Makefile
>> index 4dafbdc8f86a..39d5183ab35c 100644
>> --- a/drivers/gpu/drm/Makefile
>> +++ b/drivers/gpu/drm/Makefile
>> @@ -77,6 +77,7 @@ drm-$(CONFIG_DRM_CLIENT) += \
>>  	drm_client.o \
>>  	drm_client_event.o \
>>  	drm_client_modeset.o
>> +drm-$(CONFIG_NET) += drm_netlink.o
>>  drm-$(CONFIG_DRM_LIB_RANDOM) += lib/drm_random.o
>>  drm-$(CONFIG_COMPAT) += drm_ioc32.o
>>  drm-$(CONFIG_DRM_PANEL) += drm_panel.o
>> diff --git a/drivers/gpu/drm/drm_drv.c b/drivers/gpu/drm/drm_drv.c
>> index 02556363e918..cce55423141c 100644
>> --- a/drivers/gpu/drm/drm_drv.c
>> +++ b/drivers/gpu/drm/drm_drv.c
>> @@ -1088,6 +1088,12 @@ int drm_dev_register(struct drm_device *dev, unsigned long flags)
>>  	if (ret)
>>  		goto err_minors;
>>  
>> +	if (driver->genl_ops) {
>> +		ret = drm_genl_register(dev);
>> +		if (ret)
>> +			goto err_minors;
>> +	}
> Even if we don't go with multiple 'groups' I believe that the driver should
> explicitly call the netlink registration.

The check is whether the driver supports netlink callbacks; only then is
the netlink family registered for the device.

Also, I believe we want drivers to explicitly call drm_ras_register, see
below.

>
>> +
>>  	ret = create_compat_control_link(dev);
>>  	if (ret)
>>  		goto err_minors;
>> @@ -1229,6 +1235,7 @@ static void drm_core_exit(void)
>>  	drm_privacy_screen_lookup_exit();
>>  	drm_panic_exit();
>>  	accel_core_exit();
>> +	drm_genl_exit();
>>  	unregister_chrdev(DRM_MAJOR, "drm");
>>  	debugfs_remove(drm_debugfs_root);
>>  	drm_sysfs_destroy();
>> diff --git a/drivers/gpu/drm/drm_netlink.c b/drivers/gpu/drm/drm_netlink.c
>> new file mode 100644
>> index 000000000000..da4bfde32a22
>> --- /dev/null
>> +++ b/drivers/gpu/drm/drm_netlink.c
> drm_ras.c ?
ok 
>
>> @@ -0,0 +1,212 @@
>> +// SPDX-License-Identifier: MIT
>> +/*
>> + * Copyright © 2023 Intel Corporation
> 2025 here and in any other file
My bad, I just respun the old series without checking for this; will fix
it.
>
>> + */
>> +
>> +#include <net/genetlink.h>
>> +#include <uapi/drm/drm_netlink.h>
> uapi/drm/drm_ras.h ?!
>
> like we don't have a drm_ioctl.h but drm_mode.h
ok makes sense.
>
>> +
>> +#include <drm/drm_device.h>
>> +#include <drm/drm_drv.h>
>> +#include <drm/drm_file.h>
>> +#include <drm/drm_managed.h>
>> +#include <drm/drm_netlink.h>
>> +#include <drm/drm_print.h>
>> +
>> +DEFINE_XARRAY(drm_dev_xarray);
>> +
>> +/**
>> + * drm_genl_reply - response to a request
>> + * @msg: socket buffer
>> + * @info: receiver information
>> + * @usrhdr: pointer to user specific header in the message buffer
>> + *
>> + * RETURNS:
>> + * 0 on success and negative error code on failure
>> + */
>> +int drm_genl_reply(struct sk_buff *msg, struct genl_info *info, void *usrhdr)
> drm_ras_reply so we standardize in a single namespace everywhere ?!
>
> and same for all other functions and structs, except for things
> that are declared outside drm
OK, so the way I understood your comment on the first patch, netlink
constructs should be abstracted away from drm drivers, so this interface
should not expect sk_buff or genl things and should handle all of that
internally.
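
One way to picture that abstraction is the userspace sketch below. It is an
illustration under assumptions, not the planned drm_ras API: every name in it
(mock_ras_ops, mock_ras_node, mock_core_read_one, mock_drv_read) is
hypothetical. The point is only that the driver callback sees an error id and
an output value, while message parsing and reply building stay in the core.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define MOCK_NUM_COUNTERS 4

/* Hypothetical driver-facing ops: no sk_buff or genl_info in the signature. */
struct mock_ras_ops {
	int (*read_counter)(void *drvdata, uint64_t error_id, uint64_t *value);
};

/* Hypothetical per-device registration record kept by the core. */
struct mock_ras_node {
	const struct mock_ras_ops *ops;
	void *drvdata;
};

/*
 * Stand-in for the core handler. A real one would parse the netlink request
 * before this point and build the reply after it.
 */
static int mock_core_read_one(const struct mock_ras_node *node,
			      uint64_t error_id, uint64_t *value)
{
	assert(node != NULL && value != NULL);

	if (!node->ops || !node->ops->read_counter)
		return -1; /* driver did not provide the callback */

	return node->ops->read_counter(node->drvdata, error_id, value);
}

/* Example driver callback backed by a plain counter array. */
static int mock_drv_read(void *drvdata, uint64_t error_id, uint64_t *value)
{
	const uint64_t *counters = drvdata;

	if (error_id >= MOCK_NUM_COUNTERS)
		return -1;

	*value = counters[error_id];
	return 0;
}
```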
>
>> +{
>> +	int ret;
>> +
>> +	genlmsg_end(msg, usrhdr);
>> +
>> +	ret = genlmsg_reply(msg, info);
>> +	if (ret)
>> +		nlmsg_free(msg);
>> +
>> +	return ret;
>> +}
>> +EXPORT_SYMBOL(drm_genl_reply);
>> +
>> +/**
>> + * drm_genl_alloc_msg - allocate genl message buffer
>> + * @dev: drm_device for which the message is being allocated
>> + * @info: receiver information
>> + * @msg_size: size of the msg buffer that needs to be allocated
>> + * @usrhdr: pointer to user specific header in the message buffer
>> + *
>> + * RETURNS:
>> + * pointer to new allocated buffer on success, NULL on failure
>> + */
>> +struct sk_buff *
>> +drm_genl_alloc_msg(struct drm_device *dev,
>> +		   struct genl_info *info,
>> +		   size_t msg_size, void **usrhdr)
>> +{
>> +	struct sk_buff *new_msg;
>> +
>> +	new_msg = genlmsg_new(msg_size, GFP_KERNEL);
>> +	if (!new_msg)
>> +		return new_msg;
>> +
>> +	*usrhdr = genlmsg_put_reply(new_msg, info, dev->drm_genl_family, 0, info->genlhdr->cmd);
>> +	if (!*usrhdr) {
>> +		nlmsg_free(new_msg);
>> +		new_msg = NULL;
>> +	}
>> +
>> +	return new_msg;
>> +}
>> +EXPORT_SYMBOL(drm_genl_alloc_msg);
>> +
>> +static struct drm_device *genl_to_dev(struct genl_info *info)
>> +{
>> +	return xa_load(&drm_dev_xarray, info->nlhdr->nlmsg_type);
>> +}
>> +
>> +static int drm_genl_list_errors(struct sk_buff *msg, struct genl_info *info)
>> +{
>> +	struct drm_device *dev = genl_to_dev(info);
>> +
>> +	if (info->genlhdr->cmd == DRM_RAS_CMD_READ_ALL) {
>> +		if (GENL_REQ_ATTR_CHECK(info, DRM_RAS_ATTR_READ_ALL))
>> +			return -EINVAL;
>> +	} else {
>> +		if (GENL_REQ_ATTR_CHECK(info, DRM_RAS_ATTR_QUERY))
>> +			return -EINVAL;
>> +	}
>> +
>> +	if (WARN_ON(!dev->driver->genl_ops[info->genlhdr->cmd].doit))
>> +		return -EOPNOTSUPP;
>> +
>> +	return dev->driver->genl_ops[info->genlhdr->cmd].doit(dev, msg, info);
>> +}
>> +
>> +static int drm_genl_read_error(struct sk_buff *msg, struct genl_info *info)
>> +{
>> +	struct drm_device *dev = genl_to_dev(info);
>> +
>> +	if (GENL_REQ_ATTR_CHECK(info, DRM_RAS_ATTR_ERROR_ID))
>> +		return -EINVAL;
>> +
>> +	if (WARN_ON(!dev->driver->genl_ops[info->genlhdr->cmd].doit))
>> +		return -EOPNOTSUPP;
>> +
>> +	return dev->driver->genl_ops[info->genlhdr->cmd].doit(dev, msg, info);
>> +}
>> +
>> +/* attribute policies */
>> +static const struct nla_policy drm_attr_policy_query[DRM_ATTR_MAX + 1] = {
>> +	[DRM_RAS_ATTR_QUERY] = { .type = NLA_U8 },
>> +};
>> +
>> +static const struct nla_policy drm_attr_policy_read_one[DRM_ATTR_MAX + 1] = {
>> +	[DRM_RAS_ATTR_ERROR_ID] = { .type = NLA_U64 },
>> +};
>> +
>> +static const struct nla_policy drm_attr_policy_read_all[DRM_ATTR_MAX + 1] = {
>> +	[DRM_RAS_ATTR_READ_ALL] = { .type = NLA_U8 },
>> +};
>> +
>> +/* drm genl operations definition */
>> +const struct genl_ops drm_genl_ops[] = {
>> +	{
>> +		.cmd = DRM_RAS_CMD_QUERY,
>> +		.doit = drm_genl_list_errors,
>> +		.policy = drm_attr_policy_query,
>> +	},
>> +	{
>> +		.cmd = DRM_RAS_CMD_READ_ONE,
>> +		.doit = drm_genl_read_error,
>> +		.policy = drm_attr_policy_read_one,
>> +	},
>> +	{
>> +		.cmd = DRM_RAS_CMD_READ_ALL,
>> +		.doit = drm_genl_list_errors,
>> +		.policy = drm_attr_policy_read_all,
>> +	},
>> +	{
>> +		.cmd = DRM_RAS_CMD_READ_BLOCK,
>> +		.doit = drm_genl_read_error,
>> +		.policy = drm_attr_policy_read_one,
>> +	},
>> +
>> +};
>> +
>> +static void drm_genl_family_init(struct drm_device *dev)
>> +{
>> +	dev->drm_genl_family = drmm_kzalloc(dev, sizeof(struct genl_family),
>> +					    GFP_KERNEL);
>> +
>> +	/* Use drm primary node name eg: card0 to name the genl family */
>> +	snprintf(dev->drm_genl_family->name, sizeof(dev->drm_genl_family->name),
>> +		 "%s", dev->primary->kdev->kobj.name);
> for the family name I believe we deserve the 'drmras', then
> the card minor number, then the group name.
>
> For instance, but not necessarily suggesting xe to do it:
>
> drmras-0-gt
> drmras-0-soc
>
> .....
>
> driver can select their own name...
OK, the drm_ras_register that we will introduce shall take a name.
As I mentioned earlier, we needn't expose the IP as a group.
>
>
>> +	dev->drm_genl_family->version = DRM_GENL_VERSION;
> I believe driver could control their own version so if something changes in
> the group names for instance or supported commands they can change it.
But the commands themselves are common; I'm not sure we want drivers to
have their own private set of commands.
>
>> +	dev->drm_genl_family->parallel_ops = true;
>> +	dev->drm_genl_family->ops = drm_genl_ops;
>> +	dev->drm_genl_family->n_ops = ARRAY_SIZE(drm_genl_ops);
>> +	dev->drm_genl_family->maxattr = DRM_ATTR_MAX;
>> +	dev->drm_genl_family->module = dev->dev->driver->owner;
>> +}
>> +
>> +static void drm_genl_deregister(struct drm_device *dev, void *arg)
>> +{
>> +	drm_dbg_driver(dev, "unregistering genl family %s\n", dev->drm_genl_family->name);
>> +
>> +	xa_erase(&drm_dev_xarray, dev->drm_genl_family->id);
>> +
>> +	genl_unregister_family(dev->drm_genl_family);
>> +}
>> +
>> +/**
>> + * drm_genl_register - Register genl family
>> + * @dev: drm_device for which genl family needs to be registered
>> + *
>> + * RETURNS:
>> + * 0 on success and negative error code on failure
>> + */
>> +int drm_genl_register(struct drm_device *dev)
>> +{
>> +	int ret;
>> +
>> +	drm_genl_family_init(dev);
>> +
>> +	ret = genl_register_family(dev->drm_genl_family);
>> +	if (ret < 0) {
>> +		drm_warn(dev, "genl family registration failed\n");
>> +		return ret;
>> +	}
>> +
>> +	drm_dbg_driver(dev, "genl family id %d and name %s\n", dev->drm_genl_family->id,
>> +		       dev->drm_genl_family->name);
>> +
>> +	ret = xa_err(xa_store(&drm_dev_xarray, dev->drm_genl_family->id, dev, GFP_KERNEL));
>> +	if (ret)
>> +		goto genl_unregister;
>> +
>> +	ret = drmm_add_action_or_reset(dev, drm_genl_deregister, NULL);
>> +
>> +	return ret;
>> +
>> +genl_unregister:
>> +	genl_unregister_family(dev->drm_genl_family);
>> +	return ret;
>> +}
>> +
>> +/**
>> + * drm_genl_exit: destroy drm_dev_xarray
>> + */
>> +void drm_genl_exit(void)
>> +{
>> +	xa_destroy(&drm_dev_xarray);
>> +}
>> diff --git a/include/drm/drm_device.h b/include/drm/drm_device.h
>> index 08b3b2467c4c..8b60a17e4156 100644
>> --- a/include/drm/drm_device.h
>> +++ b/include/drm/drm_device.h
>> @@ -8,6 +8,7 @@
>>  #include <linux/sched.h>
>>  
>>  #include <drm/drm_mode_config.h>
>> +#include <drm/drm_netlink.h>
>>  
>>  struct drm_driver;
>>  struct drm_minor;
>> @@ -22,6 +23,8 @@ struct inode;
>>  struct pci_dev;
>>  struct pci_controller;
>>  
>> +struct genl_family;
>> +
>>  /*
>>   * Recovery methods for wedged device in order of less to more side-effects.
>>   * To be used with drm_dev_wedged_event() as recovery @method. Callers can
>> @@ -356,6 +359,13 @@ struct drm_device {
>>  	 * Root directory for debugfs files.
>>  	 */
>>  	struct dentry *debugfs_root;
>> +
>> +	/**
>> +	 * @drm_genl_family:
>> +	 *
>> +	 * Generic netlink family registration structure.
>> +	 */
>> +	struct genl_family *drm_genl_family;
> we should probably have this inside a struct drm_ras and without the 1-1
> tie here
Ok.
>
>
>>  };
>>  
>>  void drm_dev_set_dma_dev(struct drm_device *dev, struct device *dma_dev);
>> diff --git a/include/drm/drm_drv.h b/include/drm/drm_drv.h
>> index 3f76a32d6b84..908888ac0db2 100644
>> --- a/include/drm/drm_drv.h
>> +++ b/include/drm/drm_drv.h
>> @@ -431,6 +431,13 @@ struct drm_driver {
>>  	 * some examples.
>>  	 */
>>  	const struct file_operations *fops;
>> +
>> +	/**
>> +	 * @genl_ops:
>> +	 *
>> +	 * Drivers private callback to genl commands
>> +	 */
>> +	const struct driver_genl_ops *genl_ops;
> as well the ops should be encapsulated in the drm_ras struct
got it.
>
>>  };
>>  
>>  void *__devm_drm_dev_alloc(struct device *parent,
>> diff --git a/include/drm/drm_netlink.h b/include/drm/drm_netlink.h
>> new file mode 100644
>> index 000000000000..4a746222337a
>> --- /dev/null
>> +++ b/include/drm/drm_netlink.h
>> @@ -0,0 +1,41 @@
>> +/* SPDX-License-Identifier: MIT */
>> +/*
>> + * Copyright © 2023 Intel Corporation
>> + */
>> +
>> +#ifndef __DRM_NETLINK_H__
>> +#define __DRM_NETLINK_H__
>> +
>> +#include <linux/types.h>
>> +
>> +struct drm_device;
>> +struct genl_info;
>> +struct sk_buff;
>> +
>> +struct driver_genl_ops {
>> +	int		       (*doit)(struct drm_device *dev,
> when I first saw the doit I was going to complain about it,
> until I learned this is part of netlink definition :)
>
>> +				       struct sk_buff *skb,
>> +				       struct genl_info *info);
>> +};
>> +
>> +#if IS_ENABLED(CONFIG_NET)
>> +int drm_genl_register(struct drm_device *dev);
>> +void drm_genl_exit(void);
>> +int drm_genl_reply(struct sk_buff *msg, struct genl_info *info, void *usrhdr);
>> +struct sk_buff *
>> +drm_genl_alloc_msg(struct drm_device *dev,
>> +		   struct genl_info *info,
>> +		   size_t msg_size, void **usrhdr);
>> +#else
>> +static inline int drm_genl_register(struct drm_device *dev) { return 0; }
>> +static inline void drm_genl_exit(void) {}
>> +static inline int drm_genl_reply(struct sk_buff *msg,
>> +				 struct genl_info *info,
>> +				 void *usrhdr) { return 0; }
>> +static inline struct sk_buff *
>> +drm_genl_alloc_msg(struct drm_device *dev,
>> +		   struct genl_info *info,
>> +		   size_t msg_size, void **usrhdr) { return NULL; }
>> +#endif
>> +
>> +#endif
>> diff --git a/include/uapi/drm/drm_netlink.h b/include/uapi/drm/drm_netlink.h
>> new file mode 100644
>> index 000000000000..58afb6e8d84a
>> --- /dev/null
>> +++ b/include/uapi/drm/drm_netlink.h
>> @@ -0,0 +1,101 @@
>> +/* SPDX-License-Identifier: MIT */
>> +/*
>> + * Copyright 2023 Intel Corporation
>> + *
>> + * Permission is hereby granted, free of charge, to any person obtaining a
>> + * copy of this software and associated documentation files (the "Software"),
>> + * to deal in the Software without restriction, including without limitation
>> + * the rights to use, copy, modify, merge, publish, distribute, sublicense,
>> + * and/or sell copies of the Software, and to permit persons to whom the
>> + * Software is furnished to do so, subject to the following conditions:
>> + *
>> + * The above copyright notice and this permission notice (including the next
>> + * paragraph) shall be included in all copies or substantial portions of the
>> + * Software.
>> + *
>> + * THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR
>> + * IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY,
>> + * FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT.  IN NO EVENT SHALL
>> + * VA LINUX SYSTEMS AND/OR ITS SUPPLIERS BE LIABLE FOR ANY CLAIM, DAMAGES OR
>> + * OTHER LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE,
>> + * ARISING FROM, OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR
>> + * OTHER DEALINGS IN THE SOFTWARE.
> This header kind of conflicts/overlaps the MIT SPDX above. We should remove it
> and go only with the SPDX imho, unless I'm missing something
My bad, maybe I combined the two when copying from some header.
>
>> + */
>> +
>> +#ifndef _DRM_NETLINK_H_
>> +#define _DRM_NETLINK_H_
>> +
>> +#define DRM_GENL_VERSION 1
>> +
>> +#if defined(__cplusplus)
>> +extern "C" {
>> +#endif
>> +
>> +/**
>> + * enum drm_genl_error_cmds - Supported error commands
>> + *
>> + */
>> +enum drm_genl_error_cmds {
>> +	DRM_CMD_UNSPEC,
>> +	/**
>> +	 * @DRM_RAS_CMD_QUERY: Command to list all errors names with config-id in verbose mode.
>> +	 * In normal mode will list IP blocks, total instances available and error types supported
>> +	 */
>> +	DRM_RAS_CMD_QUERY,
> here is the part where naming inconsistency is more visible, file has one
> namespacing, struct has another, and command has even a third one.
>
> drm_ras everywhere to solve this please.
Ok makes sense.
>
>> +	/** @DRM_RAS_CMD_READ_ONE: Command to get a counter for a specific error */
>> +	DRM_RAS_CMD_READ_ONE,
>> +	/** @DRM_RAS_CMD_READ_BLOCK: Command to get a counter of specific error type from an IP
>> +	 * block
>> +	 */
>> +	DRM_RAS_CMD_READ_BLOCK,
> here is the part that I believe this API already shows how it is not
> expansible. you had to create an argument to filter the type of errors
> instead of declaring the errors per ip block like AMD folks had asked for.
Based on the comment here
https://lore.kernel.org/all/aa23f0ef-a4ab-ca73-5ab3-ef23d6e36e89@amd.com/
they want to extract the details from the command itself using a bitmask.
Let me know if you think I misunderstood.
>
>> +	/** @DRM_RAS_CMD_READ_ALL: Command to get counters of all errors */
>> +	DRM_RAS_CMD_READ_ALL,
>> +
>> +	__DRM_CMD_MAX,
>> +	DRM_CMD_MAX = __DRM_CMD_MAX - 1,
>> +};
>> +
>> +enum drm_cmd_request_type {
>> +	DRM_RAS_CMD_QUERY_VERBOSE = 1,
>> +	DRM_RAS_CMD_QUERY_NORMAL = 2,
>> +};
> I don't understand why we need verbose vs normal. Perhaps this should
> be a separate path or explain with examples?
>
> it took me a while to realize that the drm_ras igt tool would only
> list my available errors if I was using --verbose, otherwise we would
> return at the beginning of the list_error functions in xe...
VERBOSE supports the current error model in Xe, where each error is
enumerated separately, and NORMAL reports groups for AMD's model.

By default the drm_ras tool uses normal, which is why it doesn't work with
xe unless --verbose is passed.
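
To make the behaviour described above concrete, here is a small userspace
sketch of a query handler that only implements the verbose mode. The enum
values are copied from the quoted uapi header; the mock_error table and
mock_list_errors() are hypothetical stand-ins, not the xe implementation.

```c
#include <assert.h>
#include <stddef.h>
#include <stdio.h>

/* Values as in the quoted include/uapi/drm/drm_netlink.h. */
enum drm_cmd_request_type {
	DRM_RAS_CMD_QUERY_VERBOSE = 1,
	DRM_RAS_CMD_QUERY_NORMAL = 2,
};

/* Hypothetical per-error record; names and ids taken from the listing. */
struct mock_error {
	const char *name;
	unsigned long long config_id;
};

static const struct mock_error mock_errors[] = {
	{ "error-gt0-correctable-guc", 0x1ULL },
	{ "error-gt0-fatal-guc",       0x9ULL },
};

/*
 * Returns the number of entries written to @out. A driver that enumerates
 * every error separately has nothing to report in normal (grouped) mode,
 * so it returns early, which is what the tool sees without --verbose.
 */
static size_t mock_list_errors(int request_type, FILE *out)
{
	size_t i, n = sizeof(mock_errors) / sizeof(mock_errors[0]);

	assert(out != NULL);

	if (request_type != DRM_RAS_CMD_QUERY_VERBOSE)
		return 0;

	for (i = 0; i < n; i++)
		fprintf(out, "%-40s 0x%016llx\n",
			mock_errors[i].name, mock_errors[i].config_id);

	return n;
}
```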

>
>> +
>> +/**
>> + * enum drm_error_attr - Attributes to use with drm_genl_error_cmds
>> + *
>> + */
>> +enum drm_error_attr {
>> +	DRM_ATTR_UNSPEC,
>> +	DRM_ATTR_PAD = DRM_ATTR_UNSPEC,
>> +	/**
>> +	 * @DRM_RAS_ATTR_QUERY: Should be used with DRM_RAS_CMD_QUERY,
>> +	 * DRM_RAS_CMD_READ_ALL
>> +	 */
>> +	DRM_RAS_ATTR_QUERY, /* NLA_U8 */
>> +	/**
>> +	 * @DRM_RAS_ATTR_READ_ALL: Should be used with DRM_RAS_CMD_READ_ALL
>> +	 */
>> +	DRM_RAS_ATTR_READ_ALL, /* NLA_U8 */
>> +	/**
>> +	 * @DRM_RAS_ATTR_QUERY_REPLY: First Nested attributed sent as a
>> +	 * response to DRM_RAS_CMD_QUERY, DRM_RAS_CMD_READ_ALL commands.
>> +	 */
>> +	DRM_RAS_ATTR_QUERY_REPLY, /* NLA_NESTED */
>> +	/** @DRM_RAS_ATTR_ERROR_NAME: Used to pass error name */
>> +	DRM_RAS_ATTR_ERROR_NAME, /* NLA_NUL_STRING */
>> +	/** @DRM_RAS_ATTR_ERROR_ID: Used to pass error id, should be used with
>> +	 * DRM_RAS_CMD_READ_ONE, DRM_RAS_CMD_READ_BLOCK
>> +	 */
>> +	DRM_RAS_ATTR_ERROR_ID, /* NLA_U64 */
>> +	/** @DRM_RAS_ATTR_ERROR_VALUE: Used to pass error value */
>> +	DRM_RAS_ATTR_ERROR_VALUE, /* NLA_U64 */
>
> I'm also confused on all of the errors here and why we would need them
> and it also looks not expansible...
Netlink commands carry attributes, and each command's policy enforces
type checking on the attributes that command can accept.
Attributes can be specific to a command.

These attributes are used in the request and response commands, as
mentioned in the comments.

/* attribute policies */
static const struct nla_policy drm_attr_policy_query[DRM_ATTR_MAX + 1] = {
        [DRM_RAS_ATTR_QUERY] = { .type = NLA_U8 },
};

static const struct nla_policy drm_attr_policy_read_one[DRM_ATTR_MAX + 1] = {
        [DRM_RAS_ATTR_ERROR_ID] = { .type = NLA_U64 },
};

static const struct nla_policy drm_attr_policy_read_all[DRM_ATTR_MAX + 1] = {
        [DRM_RAS_ATTR_READ_ALL] = { .type = NLA_U8 },
};
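
As a rough illustration of what such a per-command policy table buys, here is
a toy userspace model. It does not use the kernel's nla_policy machinery; all
mock_* names are hypothetical, and rejecting every attribute without a policy
entry is a simplification of what the real validator does.

```c
#include <assert.h>
#include <stddef.h>

/* Toy stand-ins for NLA_U8 / NLA_U64. */
enum mock_type { MOCK_UNSPEC = 0, MOCK_U8, MOCK_U64 };

/* Mirrors the order of enum drm_error_attr in the quoted header. */
enum mock_attr {
	MOCK_ATTR_UNSPEC,
	MOCK_ATTR_QUERY,
	MOCK_ATTR_READ_ALL,
	MOCK_ATTR_QUERY_REPLY,
	MOCK_ATTR_ERROR_NAME,
	MOCK_ATTR_ERROR_ID,
	MOCK_ATTR_ERROR_VALUE,
	__MOCK_ATTR_MAX,
	MOCK_ATTR_MAX = __MOCK_ATTR_MAX - 1,
};

struct mock_policy { enum mock_type type; };

/* One received attribute: its id and payload length in bytes. */
struct mock_attr_msg { enum mock_attr id; size_t len; };

/* Policy for a READ_ONE style command: only a u64 error id is accepted. */
static const struct mock_policy mock_policy_read_one[MOCK_ATTR_MAX + 1] = {
	[MOCK_ATTR_ERROR_ID] = { .type = MOCK_U64 },
};

/* Payload size expected for each toy type; 0 means no valid size. */
static size_t mock_type_len(enum mock_type type)
{
	switch (type) {
	case MOCK_U8:  return 1;
	case MOCK_U64: return 8;
	default:       return 0;
	}
}

/* Returns 0 if the attribute passes the policy, -1 otherwise. */
static int mock_validate(const struct mock_policy *policy,
			 const struct mock_attr_msg *attr)
{
	assert(policy != NULL && attr != NULL);

	if ((unsigned int)attr->id > MOCK_ATTR_MAX)
		return -1; /* unknown attribute */
	if (policy[attr->id].type == MOCK_UNSPEC)
		return -1; /* attribute not meant for this command */

	return attr->len == mock_type_len(policy[attr->id].type) ? 0 : -1;
}
```

Because each command has its own table, an attribute that is valid for one
command (say the query attribute) is rejected when sent with another.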

Please let me know if it is unclear.

Thanks,
Aravind.
>
>> +
>> +	__DRM_ATTR_MAX,
>> +	DRM_ATTR_MAX = __DRM_ATTR_MAX - 1,
>> +};
>> +
>> +#if defined(__cplusplus)
>> +}
>> +#endif
>> +
>> +#endif
>> -- 
>> 2.25.1
>>


* Re: [RFC v5 2/5] drm/xe/RAS: Register netlink capability
  2025-08-15 21:52   ` Rodrigo Vivi
@ 2025-08-26  9:01     ` Aravind Iddamsetty
  0 siblings, 0 replies; 24+ messages in thread
From: Aravind Iddamsetty @ 2025-08-26  9:01 UTC (permalink / raw)
  To: Rodrigo Vivi
  Cc: intel-xe, dri-devel, Alex Deucher, David Airlie, Simona Vetter,
	Joonas Lahtinen, Hawking Zhang, Lijo Lazar, Ruhl, Michael J,
	Riana Tauro, Anshuman Gupta


On 16-08-2025 03:22, Rodrigo Vivi wrote:
> On Wed, Jul 30, 2025 at 12:19:53PM +0530, Aravind Iddamsetty wrote:
>> Register netlink capability with the DRM and register the driver
>> callbacks to DRM RAS netlink commands.
>>
>> v2:
>> Move the netlink registration parts to DRM subsystem (Tomer Tayar)
>>
>> v3: compile only if CONFIG_NET is enabled
>>
>> Reviewed-by: Michael J. Ruhl <michael.j.ruhl@intel.com> #v2
>> Signed-off-by: Aravind Iddamsetty <aravind.iddamsetty@linux.intel.com>
>> ---
>>  drivers/gpu/drm/xe/Makefile          |  2 ++
>>  drivers/gpu/drm/xe/xe_device.c       |  6 ++++++
>>  drivers/gpu/drm/xe/xe_device_types.h |  1 +
>>  drivers/gpu/drm/xe/xe_netlink.c      | 26 ++++++++++++++++++++++++++
>>  4 files changed, 35 insertions(+)
>>  create mode 100644 drivers/gpu/drm/xe/xe_netlink.c
>>
>> diff --git a/drivers/gpu/drm/xe/Makefile b/drivers/gpu/drm/xe/Makefile
>> index 80eecd35e807..e960c2dbe658 100644
>> --- a/drivers/gpu/drm/xe/Makefile
>> +++ b/drivers/gpu/drm/xe/Makefile
>> @@ -304,6 +304,8 @@ xe-$(CONFIG_DRM_XE_DISPLAY) += \
>>  	i915-display/skl_universal_plane.o \
>>  	i915-display/skl_watermark.o
>>  
>> +xe-$(CONFIG_NET) += xe_netlink.o
>> +
>>  ifeq ($(CONFIG_ACPI),y)
>>  	xe-$(CONFIG_DRM_XE_DISPLAY) += \
>>  		i915-display/intel_acpi.o \
>> diff --git a/drivers/gpu/drm/xe/xe_device.c b/drivers/gpu/drm/xe/xe_device.c
>> index 806dbdf8118c..ca7a17c16aa5 100644
>> --- a/drivers/gpu/drm/xe/xe_device.c
>> +++ b/drivers/gpu/drm/xe/xe_device.c
>> @@ -363,6 +363,8 @@ static const struct file_operations xe_driver_fops = {
>>  	.fop_flags = FOP_UNSIGNED_OFFSET,
>>  };
>>  
>> +extern const struct driver_genl_ops xe_genl_ops[];
>> +
>>  static struct drm_driver driver = {
>>  	/* Don't use MTRRs here; the Xserver or userspace app should
>>  	 * deal with them for Intel hardware.
>> @@ -381,6 +383,10 @@ static struct drm_driver driver = {
>>  #ifdef CONFIG_PROC_FS
>>  	.show_fdinfo = xe_drm_client_fdinfo,
>>  #endif
>> +#ifdef CONFIG_NET
>> +	.genl_ops = xe_genl_ops,
>> +#endif
>> +
> we should definitely have a drm function to register it instead of hard-coding
> it here, regardless if we go with the group split or not.
> It is not okay forcing this to every platform, even the ones without any RAS
> available for instance.
ok.
>>  	.ioctls = xe_ioctls,
>>  	.num_ioctls = ARRAY_SIZE(xe_ioctls),
>>  	.fops = &xe_driver_fops,
>> diff --git a/drivers/gpu/drm/xe/xe_device_types.h b/drivers/gpu/drm/xe/xe_device_types.h
>> index 3a851c7a55dd..08d3e53e4b37 100644
>> --- a/drivers/gpu/drm/xe/xe_device_types.h
>> +++ b/drivers/gpu/drm/xe/xe_device_types.h
>> @@ -10,6 +10,7 @@
>>  
>>  #include <drm/drm_device.h>
>>  #include <drm/drm_file.h>
>> +#include <drm/drm_netlink.h>
>>  #include <drm/ttm/ttm_device.h>
>>  
>>  #include "xe_devcoredump_types.h"
>> diff --git a/drivers/gpu/drm/xe/xe_netlink.c b/drivers/gpu/drm/xe/xe_netlink.c
>> new file mode 100644
>> index 000000000000..9e588fb19631
>> --- /dev/null
>> +++ b/drivers/gpu/drm/xe/xe_netlink.c
>> @@ -0,0 +1,26 @@
>> +// SPDX-License-Identifier: MIT
>> +/*
>> + * Copyright © 2023 Intel Corporation
>> + */
>> +
>> +#include <net/genetlink.h>
>> +#include <uapi/drm/drm_netlink.h>
>> +
>> +#include "xe_device.h"
>> +
>> +static int xe_genl_list_errors(struct drm_device *drm, struct sk_buff *msg, struct genl_info *info)
>> +{
>> +	return 0;
>> +}
>> +
>> +static int xe_genl_read_error(struct drm_device *drm, struct sk_buff *msg, struct genl_info *info)
>> +{
>> +	return 0;
>> +}
>> +
>> +/* driver callbacks to DRM netlink commands*/
>> +const struct driver_genl_ops xe_genl_ops[] = {
>> +	[DRM_RAS_CMD_QUERY] =		{ .doit = xe_genl_list_errors },
>> +	[DRM_RAS_CMD_READ_ONE] =	{ .doit = xe_genl_read_error },
>> +	[DRM_RAS_CMD_READ_ALL] =	{ .doit = xe_genl_list_errors, },
>> +};
> this is another place that is strange. You declare it here and drm
> magically uses it. Another reason for more explicit registration,
> with a struct drm_ras that these commands are part of,
> as well as the group name, etc.

Agreed, this will be part of the explicit registration.

Thanks,
Aravind.
>
>> -- 
>> 2.25.1
>>


* Re: [RFC v5 3/5] drm/xe/RAS: Expose the error counters
  2025-08-15 21:58   ` Rodrigo Vivi
@ 2025-08-26  9:26     ` Aravind Iddamsetty
  0 siblings, 0 replies; 24+ messages in thread
From: Aravind Iddamsetty @ 2025-08-26  9:26 UTC (permalink / raw)
  To: Rodrigo Vivi
  Cc: intel-xe, dri-devel, Alex Deucher, David Airlie, Simona Vetter,
	Joonas Lahtinen, Hawking Zhang, Lijo Lazar, Michael J,
	Riana Tauro, Anshuman Gupta


On 16-08-2025 03:28, Rodrigo Vivi wrote:
> On Wed, Jul 30, 2025 at 12:19:54PM +0530, Aravind Iddamsetty wrote:
>> We expose the various error counters supported by the hardware via the genl
>> subsystem through the registered commands to userspace. The
>> DRM_RAS_CMD_QUERY lists the error names with config id,
>> DRM_RAS_CMD_READ_ONE returns the counter value for the requested config
>> id and the DRM_RAS_CMD_READ_ALL lists the counters for all errors along
>> with their names and config ids.
>>
>> v2: Rebase
>>
>> v3:
>> 1. presently xe_list_errors fills blank data for IGFX, prevent it by
>> having an early check of IS_DGFX (Michael J. Ruhl)
>> 2. update errors from all sources
>>
>> v4: Check QUERY param; if it is normal, return not supported.
>>
>> Reviewed-by: Michael J. Ruhl <michael.j.ruhl@intel.com>
>> Signed-off-by: Aravind Iddamsetty <aravind.iddamsetty@linux.intel.com>
>> ---
>>  drivers/gpu/drm/xe/xe_hw_error.c |  15 +-
>>  drivers/gpu/drm/xe/xe_netlink.c  | 509 ++++++++++++++++++++++++++++++-
>>  include/uapi/drm/xe_drm.h        |  85 ++++++
>>  3 files changed, 602 insertions(+), 7 deletions(-)
>>
>> diff --git a/drivers/gpu/drm/xe/xe_hw_error.c b/drivers/gpu/drm/xe/xe_hw_error.c
>> index 6a7cd59caac1..bdd9c88674b2 100644
>> --- a/drivers/gpu/drm/xe/xe_hw_error.c
>> +++ b/drivers/gpu/drm/xe/xe_hw_error.c
>> @@ -531,16 +531,21 @@ static void xe_clear_all_soc_errors(struct xe_device *xe)
>>  
>>  		while (hw_err < HARDWARE_ERROR_MAX) {
>>  			for (i = 0; i < XE_SOC_NUM_IEH; i++)
>> -				xe_mmio_write32(&gt->tile->mmio, SOC_GSYSEVTCTL_REG(base, slave_base, i),
>> +				xe_mmio_write32(&gt->tile->mmio,
>> +						SOC_GSYSEVTCTL_REG(base, slave_base, i),
>>  						~REG_BIT(hw_err));
>>  
>> -			xe_mmio_write32(&gt->tile->mmio, SOC_GLOBAL_ERR_STAT_MASTER_REG(base, hw_err),
>> +			xe_mmio_write32(&gt->tile->mmio,
>> +					SOC_GLOBAL_ERR_STAT_MASTER_REG(base, hw_err),
>>  					REG_GENMASK(31, 0));
>> -			xe_mmio_write32(&gt->tile->mmio, SOC_LOCAL_ERR_STAT_MASTER_REG(base, hw_err),
>> +			xe_mmio_write32(&gt->tile->mmio,
>> +					SOC_LOCAL_ERR_STAT_MASTER_REG(base, hw_err),
>>  					REG_GENMASK(31, 0));
>> -			xe_mmio_write32(&gt->tile->mmio, SOC_GLOBAL_ERR_STAT_SLAVE_REG(slave_base, hw_err),
>> +			xe_mmio_write32(&gt->tile->mmio,
>> +					SOC_GLOBAL_ERR_STAT_SLAVE_REG(slave_base, hw_err),
>>  					REG_GENMASK(31, 0));
>> -			xe_mmio_write32(&gt->tile->mmio, SOC_LOCAL_ERR_STAT_SLAVE_REG(slave_base, hw_err),
>> +			xe_mmio_write32(&gt->tile->mmio,
>> +					SOC_LOCAL_ERR_STAT_SLAVE_REG(slave_base, hw_err),
>>  					REG_GENMASK(31, 0));
> probably a fixup for the patch in the other series?
> Btw, the dependency was hard to understand. We should make this series independent
> of the other one.
Agreed, this shouldn't have been part of this series; I probably picked
it up by mistake while rebasing.
>>  			hw_err++;
>>  		}
>> diff --git a/drivers/gpu/drm/xe/xe_netlink.c b/drivers/gpu/drm/xe/xe_netlink.c
>> index 9e588fb19631..20240875284a 100644
>> --- a/drivers/gpu/drm/xe/xe_netlink.c
>> +++ b/drivers/gpu/drm/xe/xe_netlink.c
>> @@ -6,16 +6,521 @@
>>  #include <net/genetlink.h>
>>  #include <uapi/drm/drm_netlink.h>
>>  
>> +#include <drm/xe_drm.h>
>> +
>>  #include "xe_device.h"
>>  
>> -static int xe_genl_list_errors(struct drm_device *drm, struct sk_buff *msg, struct genl_info *info)
>> +#define MAX_ERROR_NAME	100
>> +
>> +static const char * const xe_hw_error_events[] = {
>> +		[DRM_XE_GENL_GT_ERROR_CORRECTABLE_L3_SNG] = "correctable-l3-sng",
>> +		[DRM_XE_GENL_GT_ERROR_CORRECTABLE_GUC] = "correctable-guc",
>> +		[DRM_XE_GENL_GT_ERROR_CORRECTABLE_SAMPLER] = "correctable-sampler",
>> +		[DRM_XE_GENL_GT_ERROR_CORRECTABLE_SLM] = "correctable-slm",
>> +		[DRM_XE_GENL_GT_ERROR_CORRECTABLE_EU_IC] = "correctable-eu-ic",
>> +		[DRM_XE_GENL_GT_ERROR_CORRECTABLE_EU_GRF] = "correctable-eu-grf",
>> +		[DRM_XE_GENL_GT_ERROR_FATAL_ARR_BIST] = "fatal-array-bist",
>> +		[DRM_XE_GENL_GT_ERROR_FATAL_L3_DOUB] = "fatal-l3-double",
>> +		[DRM_XE_GENL_GT_ERROR_FATAL_L3_ECC_CHK] = "fatal-l3-ecc-checker",
>> +		[DRM_XE_GENL_GT_ERROR_FATAL_GUC] = "fatal-guc",
>> +		[DRM_XE_GENL_GT_ERROR_FATAL_IDI_PAR] = "fatal-idi-parity",
>> +		[DRM_XE_GENL_GT_ERROR_FATAL_SQIDI] = "fatal-sqidi",
>> +		[DRM_XE_GENL_GT_ERROR_FATAL_SAMPLER] = "fatal-sampler",
>> +		[DRM_XE_GENL_GT_ERROR_FATAL_SLM] = "fatal-slm",
>> +		[DRM_XE_GENL_GT_ERROR_FATAL_EU_IC] = "fatal-eu-ic",
>> +		[DRM_XE_GENL_GT_ERROR_FATAL_EU_GRF] = "fatal-eu-grf",
>> +		[DRM_XE_GENL_GT_ERROR_FATAL_FPU] = "fatal-fpu",
>> +		[DRM_XE_GENL_GT_ERROR_FATAL_TLB] = "fatal-tlb",
>> +		[DRM_XE_GENL_GT_ERROR_FATAL_L3_FABRIC] = "fatal-l3-fabric",
>> +		[DRM_XE_GENL_GT_ERROR_CORRECTABLE_SUBSLICE] = "correctable-subslice",
>> +		[DRM_XE_GENL_GT_ERROR_CORRECTABLE_L3BANK] = "correctable-l3bank",
>> +		[DRM_XE_GENL_GT_ERROR_FATAL_SUBSLICE] = "fatal-subslice",
>> +		[DRM_XE_GENL_GT_ERROR_FATAL_L3BANK] = "fatal-l3bank",
>> +		[DRM_XE_GENL_SGUNIT_ERROR_CORRECTABLE] = "sgunit-correctable",
>> +		[DRM_XE_GENL_SGUNIT_ERROR_NONFATAL] = "sgunit-nonfatal",
>> +		[DRM_XE_GENL_SGUNIT_ERROR_FATAL] = "sgunit-fatal",
>> +		[DRM_XE_GENL_SOC_ERROR_NONFATAL_CSC_PSF_CMD] = "soc-nonfatal-csc-psf-cmd-parity",
>> +		[DRM_XE_GENL_SOC_ERROR_NONFATAL_CSC_PSF_CMP] = "soc-nonfatal-csc-psf-unexpected-completion",
>> +		[DRM_XE_GENL_SOC_ERROR_NONFATAL_CSC_PSF_REQ] = "soc-nonfatal-csc-psf-unsupported-request",
>> +		[DRM_XE_GENL_SOC_ERROR_NONFATAL_ANR_MDFI] = "soc-nonfatal-anr-mdfi",
>> +		[DRM_XE_GENL_SOC_ERROR_NONFATAL_MDFI_T2T] = "soc-nonfatal-mdfi-t2t",
>> +		[DRM_XE_GENL_SOC_ERROR_NONFATAL_MDFI_T2C] = "soc-nonfatal-mdfi-t2c",
>> +		[DRM_XE_GENL_SOC_ERROR_NONFATAL_HBM(0, 0)] = "soc-nonfatal-hbm-ss0-0",
>> +		[DRM_XE_GENL_SOC_ERROR_NONFATAL_HBM(0, 1)] = "soc-nonfatal-hbm-ss0-1",
>> +		[DRM_XE_GENL_SOC_ERROR_NONFATAL_HBM(0, 2)] = "soc-nonfatal-hbm-ss0-2",
>> +		[DRM_XE_GENL_SOC_ERROR_NONFATAL_HBM(0, 3)] = "soc-nonfatal-hbm-ss0-3",
>> +		[DRM_XE_GENL_SOC_ERROR_NONFATAL_HBM(0, 4)] = "soc-nonfatal-hbm-ss0-4",
>> +		[DRM_XE_GENL_SOC_ERROR_NONFATAL_HBM(0, 5)] = "soc-nonfatal-hbm-ss0-5",
>> +		[DRM_XE_GENL_SOC_ERROR_NONFATAL_HBM(0, 6)] = "soc-nonfatal-hbm-ss0-6",
>> +		[DRM_XE_GENL_SOC_ERROR_NONFATAL_HBM(0, 7)] = "soc-nonfatal-hbm-ss0-7",
>> +		[DRM_XE_GENL_SOC_ERROR_NONFATAL_HBM(0, 8)] = "soc-nonfatal-hbm-ss1-0",
>> +		[DRM_XE_GENL_SOC_ERROR_NONFATAL_HBM(0, 9)] = "soc-nonfatal-hbm-ss1-1",
>> +		[DRM_XE_GENL_SOC_ERROR_NONFATAL_HBM(0, 10)] = "soc-nonfatal-hbm-ss1-2",
>> +		[DRM_XE_GENL_SOC_ERROR_NONFATAL_HBM(0, 11)] = "soc-nonfatal-hbm-ss1-3",
>> +		[DRM_XE_GENL_SOC_ERROR_NONFATAL_HBM(0, 12)] = "soc-nonfatal-hbm-ss1-4",
>> +		[DRM_XE_GENL_SOC_ERROR_NONFATAL_HBM(0, 13)] = "soc-nonfatal-hbm-ss1-5",
>> +		[DRM_XE_GENL_SOC_ERROR_NONFATAL_HBM(0, 14)] = "soc-nonfatal-hbm-ss1-6",
>> +		[DRM_XE_GENL_SOC_ERROR_NONFATAL_HBM(0, 15)] = "soc-nonfatal-hbm-ss1-7",
>> +		[DRM_XE_GENL_SOC_ERROR_NONFATAL_HBM(1, 0)] = "soc-nonfatal-hbm-ss2-0",
>> +		[DRM_XE_GENL_SOC_ERROR_NONFATAL_HBM(1, 1)] = "soc-nonfatal-hbm-ss2-1",
>> +		[DRM_XE_GENL_SOC_ERROR_NONFATAL_HBM(1, 2)] = "soc-nonfatal-hbm-ss2-2",
>> +		[DRM_XE_GENL_SOC_ERROR_NONFATAL_HBM(1, 3)] = "soc-nonfatal-hbm-ss2-3",
>> +		[DRM_XE_GENL_SOC_ERROR_NONFATAL_HBM(1, 4)] = "soc-nonfatal-hbm-ss2-4",
>> +		[DRM_XE_GENL_SOC_ERROR_NONFATAL_HBM(1, 5)] = "soc-nonfatal-hbm-ss2-5",
>> +		[DRM_XE_GENL_SOC_ERROR_NONFATAL_HBM(1, 6)] = "soc-nonfatal-hbm-ss2-6",
>> +		[DRM_XE_GENL_SOC_ERROR_NONFATAL_HBM(1, 7)] = "soc-nonfatal-hbm-ss2-7",
>> +		[DRM_XE_GENL_SOC_ERROR_NONFATAL_HBM(1, 8)] = "soc-nonfatal-hbm-ss3-0",
>> +		[DRM_XE_GENL_SOC_ERROR_NONFATAL_HBM(1, 9)] = "soc-nonfatal-hbm-ss3-1",
>> +		[DRM_XE_GENL_SOC_ERROR_NONFATAL_HBM(1, 10)] = "soc-nonfatal-hbm-ss3-2",
>> +		[DRM_XE_GENL_SOC_ERROR_NONFATAL_HBM(1, 11)] = "soc-nonfatal-hbm-ss3-3",
>> +		[DRM_XE_GENL_SOC_ERROR_NONFATAL_HBM(1, 12)] = "soc-nonfatal-hbm-ss3-4",
>> +		[DRM_XE_GENL_SOC_ERROR_NONFATAL_HBM(1, 13)] = "soc-nonfatal-hbm-ss3-5",
>> +		[DRM_XE_GENL_SOC_ERROR_NONFATAL_HBM(1, 14)] = "soc-nonfatal-hbm-ss3-6",
>> +		[DRM_XE_GENL_SOC_ERROR_NONFATAL_HBM(1, 15)] = "soc-nonfatal-hbm-ss3-7",
>> +		[DRM_XE_GENL_SOC_ERROR_FATAL_CSC_PSF_CMD] = "soc-fatal-csc-psf-cmd-parity",
>> +		[DRM_XE_GENL_SOC_ERROR_FATAL_CSC_PSF_CMP] = "soc-fatal-csc-psf-unexpected-completion",
>> +		[DRM_XE_GENL_SOC_ERROR_FATAL_CSC_PSF_REQ] = "soc-fatal-csc-psf-unsupported-request",
>> +		[DRM_XE_GENL_SOC_ERROR_FATAL_PUNIT] = "soc-fatal-punit",
>> +		[DRM_XE_GENL_SOC_ERROR_FATAL_PCIE_PSF_CMD] = "soc-fatal-pcie-psf-command-parity",
>> +		[DRM_XE_GENL_SOC_ERROR_FATAL_PCIE_PSF_CMP] = "soc-fatal-pcie-psf-unexpected-completion",
>> +		[DRM_XE_GENL_SOC_ERROR_FATAL_PCIE_PSF_REQ] = "soc-fatal-pcie-psf-unsupported-request",
>> +		[DRM_XE_GENL_SOC_ERROR_FATAL_ANR_MDFI] = "soc-fatal-anr-mdfi",
>> +		[DRM_XE_GENL_SOC_ERROR_FATAL_MDFI_T2T] = "soc-fatal-mdfi-t2t",
>> +		[DRM_XE_GENL_SOC_ERROR_FATAL_MDFI_T2C] = "soc-fatal-mdfi-t2c",
>> +		[DRM_XE_GENL_SOC_ERROR_FATAL_PCIE_AER] = "soc-fatal-malformed-pcie-aer",
>> +		[DRM_XE_GENL_SOC_ERROR_FATAL_PCIE_ERR] = "soc-fatal-malformed-pcie-err",
>> +		[DRM_XE_GENL_SOC_ERROR_FATAL_UR_COND] = "soc-fatal-ur-condition-ieh",
>> +		[DRM_XE_GENL_SOC_ERROR_FATAL_SERR_SRCS] = "soc-fatal-from-serr-sources",
>> +		[DRM_XE_GENL_SOC_ERROR_FATAL_HBM(0, 0)] = "soc-fatal-hbm-ss0-0",
>> +		[DRM_XE_GENL_SOC_ERROR_FATAL_HBM(0, 1)] = "soc-fatal-hbm-ss0-1",
>> +		[DRM_XE_GENL_SOC_ERROR_FATAL_HBM(0, 2)] = "soc-fatal-hbm-ss0-2",
>> +		[DRM_XE_GENL_SOC_ERROR_FATAL_HBM(0, 3)] = "soc-fatal-hbm-ss0-3",
>> +		[DRM_XE_GENL_SOC_ERROR_FATAL_HBM(0, 4)] = "soc-fatal-hbm-ss0-4",
>> +		[DRM_XE_GENL_SOC_ERROR_FATAL_HBM(0, 5)] = "soc-fatal-hbm-ss0-5",
>> +		[DRM_XE_GENL_SOC_ERROR_FATAL_HBM(0, 6)] = "soc-fatal-hbm-ss0-6",
>> +		[DRM_XE_GENL_SOC_ERROR_FATAL_HBM(0, 7)] = "soc-fatal-hbm-ss0-7",
>> +		[DRM_XE_GENL_SOC_ERROR_FATAL_HBM(0, 8)] = "soc-fatal-hbm-ss1-0",
>> +		[DRM_XE_GENL_SOC_ERROR_FATAL_HBM(0, 9)] = "soc-fatal-hbm-ss1-1",
>> +		[DRM_XE_GENL_SOC_ERROR_FATAL_HBM(0, 10)] = "soc-fatal-hbm-ss1-2",
>> +		[DRM_XE_GENL_SOC_ERROR_FATAL_HBM(0, 11)] = "soc-fatal-hbm-ss1-3",
>> +		[DRM_XE_GENL_SOC_ERROR_FATAL_HBM(0, 12)] = "soc-fatal-hbm-ss1-4",
>> +		[DRM_XE_GENL_SOC_ERROR_FATAL_HBM(0, 13)] = "soc-fatal-hbm-ss1-5",
>> +		[DRM_XE_GENL_SOC_ERROR_FATAL_HBM(0, 14)] = "soc-fatal-hbm-ss1-6",
>> +		[DRM_XE_GENL_SOC_ERROR_FATAL_HBM(0, 15)] = "soc-fatal-hbm-ss1-7",
>> +		[DRM_XE_GENL_SOC_ERROR_FATAL_HBM(1, 0)] = "soc-fatal-hbm-ss2-0",
>> +		[DRM_XE_GENL_SOC_ERROR_FATAL_HBM(1, 1)] = "soc-fatal-hbm-ss2-1",
>> +		[DRM_XE_GENL_SOC_ERROR_FATAL_HBM(1, 2)] = "soc-fatal-hbm-ss2-2",
>> +		[DRM_XE_GENL_SOC_ERROR_FATAL_HBM(1, 3)] = "soc-fatal-hbm-ss2-3",
>> +		[DRM_XE_GENL_SOC_ERROR_FATAL_HBM(1, 4)] = "soc-fatal-hbm-ss2-4",
>> +		[DRM_XE_GENL_SOC_ERROR_FATAL_HBM(1, 5)] = "soc-fatal-hbm-ss2-5",
>> +		[DRM_XE_GENL_SOC_ERROR_FATAL_HBM(1, 6)] = "soc-fatal-hbm-ss2-6",
>> +		[DRM_XE_GENL_SOC_ERROR_FATAL_HBM(1, 7)] = "soc-fatal-hbm-ss2-7",
>> +		[DRM_XE_GENL_SOC_ERROR_FATAL_HBM(1, 8)] = "soc-fatal-hbm-ss3-0",
>> +		[DRM_XE_GENL_SOC_ERROR_FATAL_HBM(1, 9)] = "soc-fatal-hbm-ss3-1",
>> +		[DRM_XE_GENL_SOC_ERROR_FATAL_HBM(1, 10)] = "soc-fatal-hbm-ss3-2",
>> +		[DRM_XE_GENL_SOC_ERROR_FATAL_HBM(1, 11)] = "soc-fatal-hbm-ss3-3",
>> +		[DRM_XE_GENL_SOC_ERROR_FATAL_HBM(1, 12)] = "soc-fatal-hbm-ss3-4",
>> +		[DRM_XE_GENL_SOC_ERROR_FATAL_HBM(1, 13)] = "soc-fatal-hbm-ss3-5",
>> +		[DRM_XE_GENL_SOC_ERROR_FATAL_HBM(1, 14)] = "soc-fatal-hbm-ss3-6",
>> +		[DRM_XE_GENL_SOC_ERROR_FATAL_HBM(1, 15)] = "soc-fatal-hbm-ss3-7",
>> +		[DRM_XE_GENL_GSC_ERROR_CORRECTABLE_SRAM_ECC] = "gsc-correctable-sram-ecc",
>> +		[DRM_XE_GENL_GSC_ERROR_NONFATAL_MIA_SHUTDOWN] = "gsc-nonfatal-mia-shutdown",
>> +		[DRM_XE_GENL_GSC_ERROR_NONFATAL_MIA_INTERNAL] = "gsc-nonfatal-mia-internal",
>> +		[DRM_XE_GENL_GSC_ERROR_NONFATAL_SRAM_ECC] = "gsc-nonfatal-sram-ecc",
>> +		[DRM_XE_GENL_GSC_ERROR_NONFATAL_WDG_TIMEOUT] = "gsc-nonfatal-wdg-timeout",
>> +		[DRM_XE_GENL_GSC_ERROR_NONFATAL_ROM_PARITY] = "gsc-nonfatal-rom-parity",
>> +		[DRM_XE_GENL_GSC_ERROR_NONFATAL_UCODE_PARITY] = "gsc-nonfatal-ucode-parity",
>> +		[DRM_XE_GENL_GSC_ERROR_NONFATAL_VLT_GLITCH] = "gsc-nonfatal-vlt-glitch",
>> +		[DRM_XE_GENL_GSC_ERROR_NONFATAL_FUSE_PULL] = "gsc-nonfatal-fuse-pull",
>> +		[DRM_XE_GENL_GSC_ERROR_NONFATAL_FUSE_CRC_CHECK] = "gsc-nonfatal-fuse-crc-check",
>> +		[DRM_XE_GENL_GSC_ERROR_NONFATAL_SELF_MBIST] = "gsc-nonfatal-self-mbist",
>> +		[DRM_XE_GENL_GSC_ERROR_NONFATAL_AON_RF_PARITY] = "gsc-nonfatal-aon-parity",
>> +		[DRM_XE_GENL_SGGI_ERROR_NONFATAL] = "sggi-nonfatal-data-parity",
>> +		[DRM_XE_GENL_SGLI_ERROR_NONFATAL] = "sgli-nonfatal-data-parity",
>> +		[DRM_XE_GENL_SGCI_ERROR_NONFATAL] = "sgci-nonfatal-data-parity",
>> +		[DRM_XE_GENL_MERT_ERROR_NONFATAL] = "mert-nonfatal-data-parity",
>> +		[DRM_XE_GENL_SGGI_ERROR_FATAL] = "sggi-fatal-data-parity",
>> +		[DRM_XE_GENL_SGLI_ERROR_FATAL] = "sgli-fatal-data-parity",
>> +		[DRM_XE_GENL_SGCI_ERROR_FATAL] = "sgci-fatal-data-parity",
>> +		[DRM_XE_GENL_MERT_ERROR_FATAL] = "mert-fatal-data-parity",
>> +};
>> +
>> +static const unsigned long xe_hw_error_map[] = {
>> +	[DRM_XE_GENL_GT_ERROR_CORRECTABLE_L3_SNG] = XE_HW_ERR_GT_CORR_L3_SNG,
>> +	[DRM_XE_GENL_GT_ERROR_CORRECTABLE_GUC] = XE_HW_ERR_GT_CORR_GUC,
>> +	[DRM_XE_GENL_GT_ERROR_CORRECTABLE_SAMPLER] = XE_HW_ERR_GT_CORR_SAMPLER,
>> +	[DRM_XE_GENL_GT_ERROR_CORRECTABLE_SLM] = XE_HW_ERR_GT_CORR_SLM,
>> +	[DRM_XE_GENL_GT_ERROR_CORRECTABLE_EU_IC] = XE_HW_ERR_GT_CORR_EU_IC,
>> +	[DRM_XE_GENL_GT_ERROR_CORRECTABLE_EU_GRF] = XE_HW_ERR_GT_CORR_EU_GRF,
>> +	[DRM_XE_GENL_GT_ERROR_FATAL_ARR_BIST] = XE_HW_ERR_GT_FATAL_ARR_BIST,
>> +	[DRM_XE_GENL_GT_ERROR_FATAL_L3_DOUB] = XE_HW_ERR_GT_FATAL_L3_DOUB,
>> +	[DRM_XE_GENL_GT_ERROR_FATAL_L3_ECC_CHK] = XE_HW_ERR_GT_FATAL_L3_ECC_CHK,
>> +	[DRM_XE_GENL_GT_ERROR_FATAL_GUC] = XE_HW_ERR_GT_FATAL_GUC,
>> +	[DRM_XE_GENL_GT_ERROR_FATAL_IDI_PAR] = XE_HW_ERR_GT_FATAL_IDI_PAR,
>> +	[DRM_XE_GENL_GT_ERROR_FATAL_SQIDI] = XE_HW_ERR_GT_FATAL_SQIDI,
>> +	[DRM_XE_GENL_GT_ERROR_FATAL_SAMPLER] = XE_HW_ERR_GT_FATAL_SAMPLER,
>> +	[DRM_XE_GENL_GT_ERROR_FATAL_SLM] = XE_HW_ERR_GT_FATAL_SLM,
>> +	[DRM_XE_GENL_GT_ERROR_FATAL_EU_IC] = XE_HW_ERR_GT_FATAL_EU_IC,
>> +	[DRM_XE_GENL_GT_ERROR_FATAL_EU_GRF] = XE_HW_ERR_GT_FATAL_EU_GRF,
>> +	[DRM_XE_GENL_GT_ERROR_FATAL_FPU] = XE_HW_ERR_GT_FATAL_FPU,
>> +	[DRM_XE_GENL_GT_ERROR_FATAL_TLB] = XE_HW_ERR_GT_FATAL_TLB,
>> +	[DRM_XE_GENL_GT_ERROR_FATAL_L3_FABRIC] = XE_HW_ERR_GT_FATAL_L3_FABRIC,
>> +	[DRM_XE_GENL_GT_ERROR_CORRECTABLE_SUBSLICE] = XE_HW_ERR_GT_CORR_SUBSLICE,
>> +	[DRM_XE_GENL_GT_ERROR_CORRECTABLE_L3BANK] = XE_HW_ERR_GT_CORR_L3BANK,
>> +	[DRM_XE_GENL_GT_ERROR_FATAL_SUBSLICE] = XE_HW_ERR_GT_FATAL_SUBSLICE,
>> +	[DRM_XE_GENL_GT_ERROR_FATAL_L3BANK] = XE_HW_ERR_GT_FATAL_L3BANK,
>> +	[DRM_XE_GENL_SGUNIT_ERROR_CORRECTABLE] = XE_HW_ERR_TILE_CORR_SGUNIT,
>> +	[DRM_XE_GENL_SGUNIT_ERROR_NONFATAL] = XE_HW_ERR_TILE_NONFATAL_SGUNIT,
>> +	[DRM_XE_GENL_SGUNIT_ERROR_FATAL] = XE_HW_ERR_TILE_FATAL_SGUNIT,
>> +	[DRM_XE_GENL_SOC_ERROR_NONFATAL_CSC_PSF_CMD] = XE_HW_ERR_SOC_NONFATAL_CSC_PSF_CMD,
>> +	[DRM_XE_GENL_SOC_ERROR_NONFATAL_CSC_PSF_CMP] = XE_HW_ERR_SOC_NONFATAL_CSC_PSF_CMP,
>> +	[DRM_XE_GENL_SOC_ERROR_NONFATAL_CSC_PSF_REQ] = XE_HW_ERR_SOC_NONFATAL_CSC_PSF_REQ,
>> +	[DRM_XE_GENL_SOC_ERROR_NONFATAL_ANR_MDFI] = XE_HW_ERR_SOC_NONFATAL_ANR_MDFI,
>> +	[DRM_XE_GENL_SOC_ERROR_NONFATAL_MDFI_T2T] = XE_HW_ERR_SOC_NONFATAL_MDFI_T2T,
>> +	[DRM_XE_GENL_SOC_ERROR_NONFATAL_MDFI_T2C] = XE_HW_ERR_SOC_NONFATAL_MDFI_T2C,
>> +	[DRM_XE_GENL_SOC_ERROR_NONFATAL_HBM(0, 0)] = XE_HW_ERR_SOC_NONFATAL_HBM0_CHNL0,
>> +	[DRM_XE_GENL_SOC_ERROR_NONFATAL_HBM(0, 1)] = XE_HW_ERR_SOC_NONFATAL_HBM0_CHNL1,
>> +	[DRM_XE_GENL_SOC_ERROR_NONFATAL_HBM(0, 2)] = XE_HW_ERR_SOC_NONFATAL_HBM0_CHNL2,
>> +	[DRM_XE_GENL_SOC_ERROR_NONFATAL_HBM(0, 3)] = XE_HW_ERR_SOC_NONFATAL_HBM0_CHNL3,
>> +	[DRM_XE_GENL_SOC_ERROR_NONFATAL_HBM(0, 4)] = XE_HW_ERR_SOC_NONFATAL_HBM0_CHNL4,
>> +	[DRM_XE_GENL_SOC_ERROR_NONFATAL_HBM(0, 5)] = XE_HW_ERR_SOC_NONFATAL_HBM0_CHNL5,
>> +	[DRM_XE_GENL_SOC_ERROR_NONFATAL_HBM(0, 6)] = XE_HW_ERR_SOC_NONFATAL_HBM0_CHNL6,
>> +	[DRM_XE_GENL_SOC_ERROR_NONFATAL_HBM(0, 7)] = XE_HW_ERR_SOC_NONFATAL_HBM0_CHNL7,
>> +	[DRM_XE_GENL_SOC_ERROR_NONFATAL_HBM(0, 8)] = XE_HW_ERR_SOC_NONFATAL_HBM1_CHNL0,
>> +	[DRM_XE_GENL_SOC_ERROR_NONFATAL_HBM(0, 9)] = XE_HW_ERR_SOC_NONFATAL_HBM1_CHNL1,
>> +	[DRM_XE_GENL_SOC_ERROR_NONFATAL_HBM(0, 10)] = XE_HW_ERR_SOC_NONFATAL_HBM1_CHNL2,
>> +	[DRM_XE_GENL_SOC_ERROR_NONFATAL_HBM(0, 11)] = XE_HW_ERR_SOC_NONFATAL_HBM1_CHNL3,
>> +	[DRM_XE_GENL_SOC_ERROR_NONFATAL_HBM(0, 12)] = XE_HW_ERR_SOC_NONFATAL_HBM1_CHNL4,
>> +	[DRM_XE_GENL_SOC_ERROR_NONFATAL_HBM(0, 13)] = XE_HW_ERR_SOC_NONFATAL_HBM1_CHNL5,
>> +	[DRM_XE_GENL_SOC_ERROR_NONFATAL_HBM(0, 14)] = XE_HW_ERR_SOC_NONFATAL_HBM1_CHNL6,
>> +	[DRM_XE_GENL_SOC_ERROR_NONFATAL_HBM(0, 15)] = XE_HW_ERR_SOC_NONFATAL_HBM1_CHNL7,
>> +	[DRM_XE_GENL_SOC_ERROR_NONFATAL_HBM(1, 0)] = XE_HW_ERR_SOC_NONFATAL_HBM2_CHNL0,
>> +	[DRM_XE_GENL_SOC_ERROR_NONFATAL_HBM(1, 1)] = XE_HW_ERR_SOC_NONFATAL_HBM2_CHNL1,
>> +	[DRM_XE_GENL_SOC_ERROR_NONFATAL_HBM(1, 2)] = XE_HW_ERR_SOC_NONFATAL_HBM2_CHNL2,
>> +	[DRM_XE_GENL_SOC_ERROR_NONFATAL_HBM(1, 3)] = XE_HW_ERR_SOC_NONFATAL_HBM2_CHNL3,
>> +	[DRM_XE_GENL_SOC_ERROR_NONFATAL_HBM(1, 4)] = XE_HW_ERR_SOC_NONFATAL_HBM2_CHNL4,
>> +	[DRM_XE_GENL_SOC_ERROR_NONFATAL_HBM(1, 5)] = XE_HW_ERR_SOC_NONFATAL_HBM2_CHNL5,
>> +	[DRM_XE_GENL_SOC_ERROR_NONFATAL_HBM(1, 6)] = XE_HW_ERR_SOC_NONFATAL_HBM2_CHNL6,
>> +	[DRM_XE_GENL_SOC_ERROR_NONFATAL_HBM(1, 7)] = XE_HW_ERR_SOC_NONFATAL_HBM2_CHNL7,
>> +	[DRM_XE_GENL_SOC_ERROR_NONFATAL_HBM(1, 8)] = XE_HW_ERR_SOC_NONFATAL_HBM3_CHNL0,
>> +	[DRM_XE_GENL_SOC_ERROR_NONFATAL_HBM(1, 9)] = XE_HW_ERR_SOC_NONFATAL_HBM3_CHNL1,
>> +	[DRM_XE_GENL_SOC_ERROR_NONFATAL_HBM(1, 10)] = XE_HW_ERR_SOC_NONFATAL_HBM3_CHNL2,
>> +	[DRM_XE_GENL_SOC_ERROR_NONFATAL_HBM(1, 11)] = XE_HW_ERR_SOC_NONFATAL_HBM3_CHNL3,
>> +	[DRM_XE_GENL_SOC_ERROR_NONFATAL_HBM(1, 12)] = XE_HW_ERR_SOC_NONFATAL_HBM3_CHNL4,
>> +	[DRM_XE_GENL_SOC_ERROR_NONFATAL_HBM(1, 13)] = XE_HW_ERR_SOC_NONFATAL_HBM3_CHNL5,
>> +	[DRM_XE_GENL_SOC_ERROR_NONFATAL_HBM(1, 14)] = XE_HW_ERR_SOC_NONFATAL_HBM3_CHNL6,
>> +	[DRM_XE_GENL_SOC_ERROR_NONFATAL_HBM(1, 15)] = XE_HW_ERR_SOC_NONFATAL_HBM3_CHNL7,
>> +	[DRM_XE_GENL_SOC_ERROR_FATAL_CSC_PSF_CMD] = XE_HW_ERR_SOC_FATAL_CSC_PSF_CMD,
>> +	[DRM_XE_GENL_SOC_ERROR_FATAL_CSC_PSF_CMP] = XE_HW_ERR_SOC_FATAL_CSC_PSF_CMP,
>> +	[DRM_XE_GENL_SOC_ERROR_FATAL_CSC_PSF_REQ] = XE_HW_ERR_SOC_FATAL_CSC_PSF_REQ,
>> +	[DRM_XE_GENL_SOC_ERROR_FATAL_PUNIT] = XE_HW_ERR_SOC_FATAL_PUNIT,
>> +	[DRM_XE_GENL_SOC_ERROR_FATAL_PCIE_PSF_CMD] = XE_HW_ERR_SOC_FATAL_PCIE_PSF_CMD,
>> +	[DRM_XE_GENL_SOC_ERROR_FATAL_PCIE_PSF_CMP] = XE_HW_ERR_SOC_FATAL_PCIE_PSF_CMP,
>> +	[DRM_XE_GENL_SOC_ERROR_FATAL_PCIE_PSF_REQ] = XE_HW_ERR_SOC_FATAL_PCIE_PSF_REQ,
>> +	[DRM_XE_GENL_SOC_ERROR_FATAL_ANR_MDFI] = XE_HW_ERR_SOC_FATAL_ANR_MDFI,
>> +	[DRM_XE_GENL_SOC_ERROR_FATAL_MDFI_T2T] = XE_HW_ERR_SOC_FATAL_MDFI_T2T,
>> +	[DRM_XE_GENL_SOC_ERROR_FATAL_MDFI_T2C] = XE_HW_ERR_SOC_FATAL_MDFI_T2C,
>> +	[DRM_XE_GENL_SOC_ERROR_FATAL_PCIE_AER] = XE_HW_ERR_SOC_FATAL_PCIE_AER,
>> +	[DRM_XE_GENL_SOC_ERROR_FATAL_PCIE_ERR] = XE_HW_ERR_SOC_FATAL_PCIE_ERR,
>> +	[DRM_XE_GENL_SOC_ERROR_FATAL_UR_COND] = XE_HW_ERR_SOC_FATAL_UR_COND,
>> +	[DRM_XE_GENL_SOC_ERROR_FATAL_SERR_SRCS] = XE_HW_ERR_SOC_FATAL_SERR_SRCS,
>> +	[DRM_XE_GENL_SOC_ERROR_FATAL_HBM(0, 0)] = XE_HW_ERR_SOC_FATAL_HBM0_CHNL0,
>> +	[DRM_XE_GENL_SOC_ERROR_FATAL_HBM(0, 1)] = XE_HW_ERR_SOC_FATAL_HBM0_CHNL1,
>> +	[DRM_XE_GENL_SOC_ERROR_FATAL_HBM(0, 2)] = XE_HW_ERR_SOC_FATAL_HBM0_CHNL2,
>> +	[DRM_XE_GENL_SOC_ERROR_FATAL_HBM(0, 3)] = XE_HW_ERR_SOC_FATAL_HBM0_CHNL3,
>> +	[DRM_XE_GENL_SOC_ERROR_FATAL_HBM(0, 4)] = XE_HW_ERR_SOC_FATAL_HBM0_CHNL4,
>> +	[DRM_XE_GENL_SOC_ERROR_FATAL_HBM(0, 5)] = XE_HW_ERR_SOC_FATAL_HBM0_CHNL5,
>> +	[DRM_XE_GENL_SOC_ERROR_FATAL_HBM(0, 6)] = XE_HW_ERR_SOC_FATAL_HBM0_CHNL6,
>> +	[DRM_XE_GENL_SOC_ERROR_FATAL_HBM(0, 7)] = XE_HW_ERR_SOC_FATAL_HBM0_CHNL7,
>> +	[DRM_XE_GENL_SOC_ERROR_FATAL_HBM(0, 8)] = XE_HW_ERR_SOC_FATAL_HBM1_CHNL0,
>> +	[DRM_XE_GENL_SOC_ERROR_FATAL_HBM(0, 9)] = XE_HW_ERR_SOC_FATAL_HBM1_CHNL1,
>> +	[DRM_XE_GENL_SOC_ERROR_FATAL_HBM(0, 10)] = XE_HW_ERR_SOC_FATAL_HBM1_CHNL2,
>> +	[DRM_XE_GENL_SOC_ERROR_FATAL_HBM(0, 11)] = XE_HW_ERR_SOC_FATAL_HBM1_CHNL3,
>> +	[DRM_XE_GENL_SOC_ERROR_FATAL_HBM(0, 12)] = XE_HW_ERR_SOC_FATAL_HBM1_CHNL4,
>> +	[DRM_XE_GENL_SOC_ERROR_FATAL_HBM(0, 13)] = XE_HW_ERR_SOC_FATAL_HBM1_CHNL5,
>> +	[DRM_XE_GENL_SOC_ERROR_FATAL_HBM(0, 14)] = XE_HW_ERR_SOC_FATAL_HBM1_CHNL6,
>> +	[DRM_XE_GENL_SOC_ERROR_FATAL_HBM(0, 15)] = XE_HW_ERR_SOC_FATAL_HBM1_CHNL7,
>> +	[DRM_XE_GENL_SOC_ERROR_FATAL_HBM(1, 0)] = XE_HW_ERR_SOC_FATAL_HBM2_CHNL0,
>> +	[DRM_XE_GENL_SOC_ERROR_FATAL_HBM(1, 1)] = XE_HW_ERR_SOC_FATAL_HBM2_CHNL1,
>> +	[DRM_XE_GENL_SOC_ERROR_FATAL_HBM(1, 2)] = XE_HW_ERR_SOC_FATAL_HBM2_CHNL2,
>> +	[DRM_XE_GENL_SOC_ERROR_FATAL_HBM(1, 3)] = XE_HW_ERR_SOC_FATAL_HBM2_CHNL3,
>> +	[DRM_XE_GENL_SOC_ERROR_FATAL_HBM(1, 4)] = XE_HW_ERR_SOC_FATAL_HBM2_CHNL4,
>> +	[DRM_XE_GENL_SOC_ERROR_FATAL_HBM(1, 5)] = XE_HW_ERR_SOC_FATAL_HBM2_CHNL5,
>> +	[DRM_XE_GENL_SOC_ERROR_FATAL_HBM(1, 6)] = XE_HW_ERR_SOC_FATAL_HBM2_CHNL6,
>> +	[DRM_XE_GENL_SOC_ERROR_FATAL_HBM(1, 7)] = XE_HW_ERR_SOC_FATAL_HBM2_CHNL7,
>> +	[DRM_XE_GENL_SOC_ERROR_FATAL_HBM(1, 8)] = XE_HW_ERR_SOC_FATAL_HBM3_CHNL0,
>> +	[DRM_XE_GENL_SOC_ERROR_FATAL_HBM(1, 9)] = XE_HW_ERR_SOC_FATAL_HBM3_CHNL1,
>> +	[DRM_XE_GENL_SOC_ERROR_FATAL_HBM(1, 10)] = XE_HW_ERR_SOC_FATAL_HBM3_CHNL2,
>> +	[DRM_XE_GENL_SOC_ERROR_FATAL_HBM(1, 11)] = XE_HW_ERR_SOC_FATAL_HBM3_CHNL3,
>> +	[DRM_XE_GENL_SOC_ERROR_FATAL_HBM(1, 12)] = XE_HW_ERR_SOC_FATAL_HBM3_CHNL4,
>> +	[DRM_XE_GENL_SOC_ERROR_FATAL_HBM(1, 13)] = XE_HW_ERR_SOC_FATAL_HBM3_CHNL5,
>> +	[DRM_XE_GENL_SOC_ERROR_FATAL_HBM(1, 14)] = XE_HW_ERR_SOC_FATAL_HBM3_CHNL6,
>> +	[DRM_XE_GENL_SOC_ERROR_FATAL_HBM(1, 15)] = XE_HW_ERR_SOC_FATAL_HBM3_CHNL7,
>> +	[DRM_XE_GENL_GSC_ERROR_CORRECTABLE_SRAM_ECC] = XE_HW_ERR_GSC_CORR_SRAM,
>> +	[DRM_XE_GENL_GSC_ERROR_NONFATAL_MIA_SHUTDOWN] = XE_HW_ERR_GSC_NONFATAL_MIA_SHUTDOWN,
>> +	[DRM_XE_GENL_GSC_ERROR_NONFATAL_MIA_INTERNAL] = XE_HW_ERR_GSC_NONFATAL_MIA_INTERNAL,
>> +	[DRM_XE_GENL_GSC_ERROR_NONFATAL_SRAM_ECC] = XE_HW_ERR_GSC_NONFATAL_SRAM,
>> +	[DRM_XE_GENL_GSC_ERROR_NONFATAL_WDG_TIMEOUT] = XE_HW_ERR_GSC_NONFATAL_WDG,
>> +	[DRM_XE_GENL_GSC_ERROR_NONFATAL_ROM_PARITY] = XE_HW_ERR_GSC_NONFATAL_ROM_PARITY,
>> +	[DRM_XE_GENL_GSC_ERROR_NONFATAL_UCODE_PARITY] = XE_HW_ERR_GSC_NONFATAL_UCODE_PARITY,
>> +	[DRM_XE_GENL_GSC_ERROR_NONFATAL_VLT_GLITCH] = XE_HW_ERR_GSC_NONFATAL_VLT_GLITCH,
>> +	[DRM_XE_GENL_GSC_ERROR_NONFATAL_FUSE_PULL] = XE_HW_ERR_GSC_NONFATAL_FUSE_PULL,
>> +	[DRM_XE_GENL_GSC_ERROR_NONFATAL_FUSE_CRC_CHECK] = XE_HW_ERR_GSC_NONFATAL_FUSE_CRC,
>> +	[DRM_XE_GENL_GSC_ERROR_NONFATAL_SELF_MBIST] = XE_HW_ERR_GSC_NONFATAL_SELF_MBIST,
>> +	[DRM_XE_GENL_GSC_ERROR_NONFATAL_AON_RF_PARITY] = XE_HW_ERR_GSC_NONFATAL_AON_RF_PARITY,
>> +	[DRM_XE_GENL_SGGI_ERROR_NONFATAL] = XE_HW_ERR_TILE_NONFATAL_SGGI,
>> +	[DRM_XE_GENL_SGLI_ERROR_NONFATAL] = XE_HW_ERR_TILE_NONFATAL_SGLI,
>> +	[DRM_XE_GENL_SGCI_ERROR_NONFATAL] = XE_HW_ERR_TILE_NONFATAL_SGCI,
>> +	[DRM_XE_GENL_MERT_ERROR_NONFATAL] = XE_HW_ERR_TILE_NONFATAL_MERT,
>> +	[DRM_XE_GENL_SGGI_ERROR_FATAL] = XE_HW_ERR_TILE_FATAL_SGGI,
>> +	[DRM_XE_GENL_SGLI_ERROR_FATAL] = XE_HW_ERR_TILE_FATAL_SGLI,
>> +	[DRM_XE_GENL_SGCI_ERROR_FATAL] = XE_HW_ERR_TILE_FATAL_SGCI,
>> +	[DRM_XE_GENL_MERT_ERROR_FATAL] = XE_HW_ERR_TILE_FATAL_MERT,
>> +};
> probably deserves a separate header file?
Is that for better readability? Since the header would only ever be used
in this file, would it have any advantage?
>
>> +
>> +static unsigned int config_gt_id(const u64 config)
>> +{
>> +	return config >> __XE_GENL_GT_SHIFT;
>> +}
>> +
>> +static u64 config_counter(const u64 config)
>> +{
>> +	return config & ~(~0ULL << __XE_GENL_GT_SHIFT);
>> +}
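
For readers following the config id handling above: the two helpers split
a 64-bit config into a GT index (high bits) and an error counter id (low
bits). A standalone sketch of that packing is below. The shift value of 56
and all toy_* names are assumptions for illustration; the real
__XE_GENL_GT_SHIFT and DRM_XE_HW_ERROR() are defined in the uapi header of
this patch and may differ.

```c
#include <stdint.h>

/* Assumed shift for illustration; the patch's __XE_GENL_GT_SHIFT may differ. */
#define TOY_GT_SHIFT 56

/* Pack a GT index and an error id into one config value (hypothetical helper). */
static inline uint64_t toy_hw_error(unsigned int gt, uint64_t id)
{
	return ((uint64_t)gt << TOY_GT_SHIFT) | id;
}

/* High bits: which GT the counter belongs to. */
static inline unsigned int toy_config_gt_id(uint64_t config)
{
	return (unsigned int)(config >> TOY_GT_SHIFT);
}

/* Low bits: the error counter id, with the GT bits masked off. */
static inline uint64_t toy_config_counter(uint64_t config)
{
	return config & ~(~0ULL << TOY_GT_SHIFT);
}
```

Packing and unpacking round-trip, so userspace can hand back a config id
from the query reply unchanged when it reads a single counter.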
>> +
>> +static bool is_gt_error(const u64 config)
>> +{
>> +	unsigned int error;
>> +
>> +	error = config_counter(config);
>> +	if (error <= DRM_XE_GENL_GT_ERROR_FATAL_FPU)
>> +		return true;
>> +
>> +	return false;
>> +}
>> +
>> +static bool is_gt_vector_error(const u64 config)
>> +{
>> +	unsigned int error;
>> +
>> +	error = config_counter(config);
>> +	if (error >= DRM_XE_GENL_GT_ERROR_FATAL_TLB &&
>> +	    error <= DRM_XE_GENL_GT_ERROR_FATAL_L3BANK)
>> +		return true;
>> +
>> +	return false;
>> +}
>> +
>> +static bool is_pvc_invalid_gt_errors(const u64 config)
>> +{
>> +	switch (config_counter(config)) {
>> +	case DRM_XE_GENL_GT_ERROR_CORRECTABLE_L3_SNG:
>> +	case DRM_XE_GENL_GT_ERROR_CORRECTABLE_SAMPLER:
>> +	case DRM_XE_GENL_GT_ERROR_FATAL_ARR_BIST:
>> +	case DRM_XE_GENL_GT_ERROR_FATAL_L3_DOUB:
>> +	case DRM_XE_GENL_GT_ERROR_FATAL_L3_ECC_CHK:
>> +	case DRM_XE_GENL_GT_ERROR_FATAL_IDI_PAR:
>> +	case DRM_XE_GENL_GT_ERROR_FATAL_SQIDI:
>> +	case DRM_XE_GENL_GT_ERROR_FATAL_SAMPLER:
>> +	case DRM_XE_GENL_GT_ERROR_FATAL_EU_IC:
>> +		return true;
>> +	default:
>> +		return false;
>> +	}
>> +}
>> +
>> +static bool is_gsc_hw_error(const u64 config)
>> +{
>> +	if (config_counter(config) >= DRM_XE_GENL_GSC_ERROR_CORRECTABLE_SRAM_ECC &&
>> +	    config_counter(config) <= DRM_XE_GENL_GSC_ERROR_NONFATAL_AON_RF_PARITY)
>> +		return true;
>> +
>> +	return false;
>> +}
>> +
>> +static bool is_soc_error(const u64 config)
>>  {
>> +	if (config_counter(config) >= DRM_XE_GENL_SOC_ERROR_NONFATAL_CSC_PSF_CMD &&
>> +	    config_counter(config) <= DRM_XE_GENL_SOC_ERROR_FATAL_HBM(1, 15))
>> +		return true;
>> +
>> +	return false;
>> +}
>> +
>> +static int
>> +config_status(struct xe_device *xe, u64 config)
>> +{
>> +	unsigned int gt_id = config_gt_id(config);
>> +	struct xe_gt *gt = xe_device_get_gt(xe, gt_id);
>> +
>> +	if (!IS_DGFX(xe))
>> +		return -ENODEV;
>> +
>> +	if (gt->info.type == XE_GT_TYPE_UNINITIALIZED)
>> +		return -ENOENT;
>> +
>> +	/* GSC HW ERRORS are present on root tile of
>> +	 * platform supporting MEMORY SPARING only
>> +	 */
>> +	if (is_gsc_hw_error(config) && !(xe->info.platform == XE_PVC && !gt_id))
>> +		return -ENODEV;
>> +
>> +	/* GT vector errors are valid only on platforms supporting error vectors */
>> +	if (is_gt_vector_error(config) && xe->info.platform != XE_PVC)
>> +		return -ENODEV;
>> +
>> +	/* Skip gt errors not supported on pvc */
>> +	if (is_pvc_invalid_gt_errors(config) && xe->info.platform == XE_PVC)
>> +		return -ENODEV;
>> +
>> +	/* FATAL FPU error is valid on PVC only */
>> +	if (config_counter(config) == DRM_XE_GENL_GT_ERROR_FATAL_FPU &&
>> +	    !(xe->info.platform == XE_PVC))
>> +		return -ENODEV;
>> +
>> +	if (is_soc_error(config) && !(xe->info.platform == XE_PVC))
>> +		return -ENODEV;
>> +
>> +	return (config_counter(config) >=
>> +			ARRAY_SIZE(xe_hw_error_map)) ? -ENOENT : 0;
>> +}
>> +
>> +static u64 get_counter_value(struct xe_device *xe, u64 config)
>> +{
>> +	const unsigned int gt_id = config_gt_id(config);
>> +	struct xe_gt *gt = xe_device_get_gt(xe, gt_id);
>> +	unsigned int id = config_counter(config);
>> +
>> +	if (is_gt_error(config) || is_gt_vector_error(config))
>> +		return xa_to_value(xa_load(&gt->errors.hw_error, xe_hw_error_map[id]));
>> +
>> +	return xa_to_value(xa_load(&gt->tile->errors.hw_error, xe_hw_error_map[id]));
>> +}
>> +
>> +static int fill_error_details(struct xe_device *xe, struct genl_info *info, struct sk_buff *new_msg)
>> +{
>> +	struct nlattr *entry_attr;
>> +	bool counter = false;
>> +	struct xe_gt *gt;
>> +	int i, j;
>> +
>> +	BUILD_BUG_ON(ARRAY_SIZE(xe_hw_error_events) !=
>> +		     ARRAY_SIZE(xe_hw_error_map));
>> +
>> +	if (info->genlhdr->cmd == DRM_RAS_CMD_READ_ALL)
>> +		counter = true;
>> +
>> +	entry_attr = nla_nest_start(new_msg, DRM_RAS_ATTR_QUERY_REPLY);
>> +	if (!entry_attr)
>> +		return -EMSGSIZE;
>> +
>> +	for_each_gt(gt, xe, j) {
>> +		char str[MAX_ERROR_NAME];
>> +		u64 val;
>> +
>> +		for (i = 0; i < ARRAY_SIZE(xe_hw_error_events); i++) {
>> +			u64 config = DRM_XE_HW_ERROR(j, i);
>> +
>> +			if (config_status(xe, config))
>> +				continue;
>> +
>> +			/* should this be cleared everytime */
>> +			snprintf(str, sizeof(str), "error-gt%d-%s", j, xe_hw_error_events[i]);
>> +
>> +			if (nla_put_string(new_msg, DRM_RAS_ATTR_ERROR_NAME, str))
>> +				goto err;
>> +			if (nla_put_u64_64bit(new_msg, DRM_RAS_ATTR_ERROR_ID, config, DRM_ATTR_PAD))
>> +				goto err;
>> +			if (counter) {
>> +				val = get_counter_value(xe, config);
>> +				if (nla_put_u64_64bit(new_msg, DRM_RAS_ATTR_ERROR_VALUE, val,
>> +						      DRM_ATTR_PAD))
>> +					goto err;
>> +			}
>> +		}
>> +	}
>> +
>> +	nla_nest_end(new_msg, entry_attr);
>> +
>>  	return 0;
>> +err:
>> +	drm_dbg_driver(&xe->drm, "msg buff is small\n");
>> +	nla_nest_cancel(new_msg, entry_attr);
>> +	nlmsg_free(new_msg);
>> +
>> +	return -EMSGSIZE;
>> +}
>> +
>> +static int xe_genl_list_errors(struct drm_device *drm, struct sk_buff *msg, struct genl_info *info)
>> +{
>> +	struct xe_device *xe = to_xe_device(drm);
>> +	size_t msg_size = NLMSG_DEFAULT_SIZE;
>> +	enum drm_cmd_request_type query_type;
>> +	struct sk_buff *new_msg;
>> +	int retries = 2;
>> +	void *usrhdr;
>> +	int ret = 0;
>> +
>> +	if (!IS_DGFX(xe))
>> +		return -ENODEV;
>> +
>> +	/* Support verbose only errors */
>> +	query_type = nla_get_u8(info->attrs[DRM_RAS_ATTR_QUERY]);
>> +	if (query_type == DRM_RAS_CMD_QUERY_NORMAL)
>> +		return -EOPNOTSUPP;
>> +
>> +	do {
>> +		new_msg = drm_genl_alloc_msg(drm, info, msg_size, &usrhdr);
>> +		if (!new_msg)
>> +			return -ENOMEM;
>> +
>> +		ret = fill_error_details(xe, info, new_msg);
>> +		if (!ret)
>> +			break;
>> +
>> +		msg_size += NLMSG_DEFAULT_SIZE;
>> +	} while (retries--);
>> +
>> +	if (!ret)
>> +		ret = drm_genl_reply(new_msg, info, usrhdr);
>> +
>> +	return ret;
>>  }
>>  
>>  static int xe_genl_read_error(struct drm_device *drm, struct sk_buff *msg, struct genl_info *info)
>>  {
>> -	return 0;
>> +	struct xe_device *xe = to_xe_device(drm);
>> +	size_t msg_size = NLMSG_DEFAULT_SIZE;
>> +	struct sk_buff *new_msg;
>> +	void *usrhdr;
>> +	int ret = 0;
>> +	int retries = 2;
>> +	u64 config, val;
>> +
>> +	if (info->genlhdr->cmd == DRM_RAS_CMD_READ_BLOCK)
>> +		return -EOPNOTSUPP;
>> +
>> +	config = nla_get_u64(info->attrs[DRM_RAS_ATTR_ERROR_ID]);
>> +	ret = config_status(xe, config);
>> +	if (ret)
>> +		return ret;
>> +	do {
>> +		new_msg = drm_genl_alloc_msg(drm, info, msg_size, &usrhdr);
>> +		if (!new_msg)
>> +			return -ENOMEM;
>> +
>> +		val = get_counter_value(xe, config);
>> +		if (nla_put_u64_64bit(new_msg, DRM_RAS_ATTR_ERROR_VALUE, val, DRM_ATTR_PAD)) {
>> +			msg_size += NLMSG_DEFAULT_SIZE;
>> +			continue;
>> +		}
>> +
>> +		break;
>> +	} while (retries--);
>> +
>> +	ret = drm_genl_reply(new_msg, info, usrhdr);
>> +
>> +	return ret;
>>  }
> This .c file, with no public function and no .h, is what drew my attention to
> something being wrong with the registration process. We need to have init and finish
> per component.

I had designed these similar to drm_driver ops; even with the new
registration model I believe they would still be static.

Is there a concern?
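
To illustrate the ops-style design mentioned above, here is a minimal sketch, not the actual DRM or xe code. It assumes a simplified model in which the callbacks stay static in one .c file and only a const ops table is visible to the core. All `example_*` names and types are hypothetical stand-ins.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical, simplified stand-in for a device; not the real DRM type. */
struct example_device {
	int has_ras;
};

/* Hypothetical ops table, loosely modelled on how drm_driver holds callbacks. */
struct example_ras_ops {
	int (*list_errors)(struct example_device *dev);
	int (*read_error)(struct example_device *dev, uint64_t config);
};

/* The callbacks are static: no other file refers to them by name. */
static int example_list_errors(struct example_device *dev)
{
	return dev->has_ras ? 0 : -ENODEV;
}

static int example_read_error(struct example_device *dev, uint64_t config)
{
	(void)config;	/* a real driver would look up the counter here */
	return dev->has_ras ? 0 : -ENODEV;
}

/* The only symbol a registration step would need to see. */
const struct example_ras_ops example_xe_ras_ops = {
	.list_errors = example_list_errors,
	.read_error = example_read_error,
};

/* Core-side dispatch: call through the table, tolerate missing callbacks. */
static inline int example_dispatch_read(const struct example_ras_ops *ops,
					struct example_device *dev,
					uint64_t config)
{
	if (!ops || !ops->read_error)
		return -EOPNOTSUPP;

	return ops->read_error(dev, config);
}
```

In this model an init/finish pair per component would only need to hand the table to the core, so the callbacks themselves could remain static either way.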

>
>>  
>>  /* driver callbacks to DRM netlink commands*/
>> diff --git a/include/uapi/drm/xe_drm.h b/include/uapi/drm/xe_drm.h
>> index e2426413488f..d352a96e4826 100644
>> --- a/include/uapi/drm/xe_drm.h
>> +++ b/include/uapi/drm/xe_drm.h
>> @@ -1974,6 +1974,91 @@ struct drm_xe_query_eu_stall {
>>  	__u64 sampling_rates[];
>>  };
>>  
>> +/*
>> + * Top bits of every counter are GT id.
>> + */
>> +#define __XE_GENL_GT_SHIFT	(56)
>> +/**
>> + * DOC: XE GENL netlink event IDs
>> + * TODO: Add more details
> Yes, please. It is hard to understand why this is here.
>
> And also I have the feeling that this deserves to be together with
> the other definitions above all together in a separate header.
> perhaps even per platform?!
This is UAPI: the IDs below are used by the drm_ras tool to read a
particular error counter, so I felt they should be part of this header.
But why separate headers? Shouldn't UAPI stuff be in this header
instead?
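
As a rough illustration of how a userspace tool could compose and decode these IDs, here is a small sketch. The macros are copied from the quoted hunk, with `uint64_t` substituted for `__u64` so it builds outside the kernel. The `example_*` helpers are hypothetical and only mirror what `config_gt_id()` and `config_counter()` presumably do in the driver.

```c
#include <assert.h>
#include <stdint.h>

/* Copied from the quoted xe_drm.h hunk (uint64_t instead of __u64). */
#define __XE_GENL_GT_SHIFT	(56)
#define DRM_XE_HW_ERROR(gt, id) \
	((id) | ((uint64_t)(gt) << __XE_GENL_GT_SHIFT))

#define DRM_XE_GENL_GT_ERROR_FATAL_FPU			(16)
#define DRM_XE_GENL_SOC_ERROR_FATAL_SERR_SRCS		(45)

#define DRM_XE_GENL_SOC_ERROR_NONFATAL_HBM(ss, n)\
		(DRM_XE_GENL_SOC_ERROR_FATAL_SERR_SRCS + 0x1 + (ss) * 0x10 + (n))
#define DRM_XE_GENL_SOC_ERROR_FATAL_HBM(ss, n)\
		(DRM_XE_GENL_SOC_ERROR_NONFATAL_HBM(1, 15) + 0x1 + (ss) * 0x10 + (n))

/* Hypothetical helper: build a config value, checking both fields fit. */
static inline uint64_t example_make_config(unsigned int gt, uint64_t id)
{
	assert(gt <= 0xff);				/* top 8 bits hold the GT id */
	assert(id < (UINT64_C(1) << __XE_GENL_GT_SHIFT));	/* low 56 bits hold the counter */
	return DRM_XE_HW_ERROR(gt, id);
}

/* Hypothetical helper: extract the GT id from the top bits. */
static inline unsigned int example_config_gt_id(uint64_t config)
{
	return (unsigned int)(config >> __XE_GENL_GT_SHIFT);
}

/* Hypothetical helper: extract the counter id from the low bits. */
static inline uint64_t example_config_counter(uint64_t config)
{
	return config & ((UINT64_C(1) << __XE_GENL_GT_SHIFT) - 1);
}
```

With these definitions the HBM ranges work out to 46..77 for non-fatal and 78..109 for fatal, which matches the "109 is the last ID used by SOC errors" comment in the header.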

Thanks,
Aravind.
>
>> + */
>> +#define DRM_XE_HW_ERROR(gt, id) \
>> +	((id) | ((__u64)(gt) << __XE_GENL_GT_SHIFT))
>> +
>> +#define DRM_XE_GENL_GT_ERROR_CORRECTABLE_L3_SNG		(0)
>> +#define DRM_XE_GENL_GT_ERROR_CORRECTABLE_GUC			(1)
>> +#define DRM_XE_GENL_GT_ERROR_CORRECTABLE_SAMPLER		(2)
>> +#define DRM_XE_GENL_GT_ERROR_CORRECTABLE_SLM			(3)
>> +#define DRM_XE_GENL_GT_ERROR_CORRECTABLE_EU_IC		(4)
>> +#define DRM_XE_GENL_GT_ERROR_CORRECTABLE_EU_GRF		(5)
>> +#define DRM_XE_GENL_GT_ERROR_FATAL_ARR_BIST			(6)
>> +#define DRM_XE_GENL_GT_ERROR_FATAL_L3_DOUB			(7)
>> +#define DRM_XE_GENL_GT_ERROR_FATAL_L3_ECC_CHK		(8)
>> +#define DRM_XE_GENL_GT_ERROR_FATAL_GUC			(9)
>> +#define DRM_XE_GENL_GT_ERROR_FATAL_IDI_PAR			(10)
>> +#define DRM_XE_GENL_GT_ERROR_FATAL_SQIDI			(11)
>> +#define DRM_XE_GENL_GT_ERROR_FATAL_SAMPLER			(12)
>> +#define DRM_XE_GENL_GT_ERROR_FATAL_SLM			(13)
>> +#define DRM_XE_GENL_GT_ERROR_FATAL_EU_IC			(14)
>> +#define DRM_XE_GENL_GT_ERROR_FATAL_EU_GRF			(15)
>> +#define DRM_XE_GENL_GT_ERROR_FATAL_FPU			(16)
>> +#define DRM_XE_GENL_GT_ERROR_FATAL_TLB			(17)
>> +#define DRM_XE_GENL_GT_ERROR_FATAL_L3_FABRIC			(18)
>> +#define DRM_XE_GENL_GT_ERROR_CORRECTABLE_SUBSLICE		(19)
>> +#define DRM_XE_GENL_GT_ERROR_CORRECTABLE_L3BANK		(20)
>> +#define DRM_XE_GENL_GT_ERROR_FATAL_SUBSLICE			(21)
>> +#define DRM_XE_GENL_GT_ERROR_FATAL_L3BANK			(22)
>> +#define DRM_XE_GENL_SGUNIT_ERROR_CORRECTABLE			(23)
>> +#define DRM_XE_GENL_SGUNIT_ERROR_NONFATAL			(24)
>> +#define DRM_XE_GENL_SGUNIT_ERROR_FATAL			(25)
>> +#define DRM_XE_GENL_SOC_ERROR_NONFATAL_CSC_PSF_CMD		(26)
>> +#define DRM_XE_GENL_SOC_ERROR_NONFATAL_CSC_PSF_CMP		(27)
>> +#define DRM_XE_GENL_SOC_ERROR_NONFATAL_CSC_PSF_REQ		(28)
>> +#define DRM_XE_GENL_SOC_ERROR_NONFATAL_ANR_MDFI		(29)
>> +#define DRM_XE_GENL_SOC_ERROR_NONFATAL_MDFI_T2T		(30)
>> +#define DRM_XE_GENL_SOC_ERROR_NONFATAL_MDFI_T2C		(31)
>> +#define DRM_XE_GENL_SOC_ERROR_FATAL_CSC_PSF_CMD		(32)
>> +#define DRM_XE_GENL_SOC_ERROR_FATAL_CSC_PSF_CMP		(33)
>> +#define DRM_XE_GENL_SOC_ERROR_FATAL_CSC_PSF_REQ		(34)
>> +#define DRM_XE_GENL_SOC_ERROR_FATAL_PUNIT			(35)
>> +#define DRM_XE_GENL_SOC_ERROR_FATAL_PCIE_PSF_CMD		(36)
>> +#define DRM_XE_GENL_SOC_ERROR_FATAL_PCIE_PSF_CMP		(37)
>> +#define DRM_XE_GENL_SOC_ERROR_FATAL_PCIE_PSF_REQ		(38)
>> +#define DRM_XE_GENL_SOC_ERROR_FATAL_ANR_MDFI			(39)
>> +#define DRM_XE_GENL_SOC_ERROR_FATAL_MDFI_T2T			(40)
>> +#define DRM_XE_GENL_SOC_ERROR_FATAL_MDFI_T2C			(41)
>> +#define DRM_XE_GENL_SOC_ERROR_FATAL_PCIE_AER			(42)
>> +#define DRM_XE_GENL_SOC_ERROR_FATAL_PCIE_ERR			(43)
>> +#define DRM_XE_GENL_SOC_ERROR_FATAL_UR_COND			(44)
>> +#define DRM_XE_GENL_SOC_ERROR_FATAL_SERR_SRCS		(45)
>> +
>> +#define DRM_XE_GENL_SOC_ERROR_NONFATAL_HBM(ss, n)\
>> +		(DRM_XE_GENL_SOC_ERROR_FATAL_SERR_SRCS + 0x1 + (ss) * 0x10 + (n))
>> +#define DRM_XE_GENL_SOC_ERROR_FATAL_HBM(ss, n)\
>> +		(DRM_XE_GENL_SOC_ERROR_NONFATAL_HBM(1, 15) + 0x1 + (ss) * 0x10 + (n))
>> +
>> +/* 109 is the last ID used by SOC errors */
>> +#define DRM_XE_GENL_GSC_ERROR_CORRECTABLE_SRAM_ECC		(110)
>> +#define DRM_XE_GENL_GSC_ERROR_NONFATAL_MIA_SHUTDOWN		(111)
>> +#define DRM_XE_GENL_GSC_ERROR_NONFATAL_MIA_INTERNAL		(112)
>> +#define DRM_XE_GENL_GSC_ERROR_NONFATAL_SRAM_ECC		(113)
>> +#define DRM_XE_GENL_GSC_ERROR_NONFATAL_WDG_TIMEOUT		(114)
>> +#define DRM_XE_GENL_GSC_ERROR_NONFATAL_ROM_PARITY		(115)
>> +#define DRM_XE_GENL_GSC_ERROR_NONFATAL_UCODE_PARITY		(116)
>> +#define DRM_XE_GENL_GSC_ERROR_NONFATAL_VLT_GLITCH		(117)
>> +#define DRM_XE_GENL_GSC_ERROR_NONFATAL_FUSE_PULL		(118)
>> +#define DRM_XE_GENL_GSC_ERROR_NONFATAL_FUSE_CRC_CHECK	(119)
>> +#define DRM_XE_GENL_GSC_ERROR_NONFATAL_SELF_MBIST		(120)
>> +#define DRM_XE_GENL_GSC_ERROR_NONFATAL_AON_RF_PARITY		(121)
>> +#define DRM_XE_GENL_SGGI_ERROR_NONFATAL			(122)
>> +#define DRM_XE_GENL_SGLI_ERROR_NONFATAL			(123)
>> +#define DRM_XE_GENL_SGCI_ERROR_NONFATAL			(124)
>> +#define DRM_XE_GENL_MERT_ERROR_NONFATAL			(125)
>> +#define DRM_XE_GENL_SGGI_ERROR_FATAL				(126)
>> +#define DRM_XE_GENL_SGLI_ERROR_FATAL				(127)
>> +#define DRM_XE_GENL_SGCI_ERROR_FATAL				(128)
>> +#define DRM_XE_GENL_MERT_ERROR_FATAL				(129)
>> +
>>  #if defined(__cplusplus)
>>  }
>>  #endif
>> -- 
>> 2.25.1
>>


* Re: [RFC v5 5/5] drm/xe/RAS: send multicast event on occurrence of an error
  2025-08-15 22:01   ` Rodrigo Vivi
@ 2025-08-26  9:34     ` Aravind Iddamsetty
  0 siblings, 0 replies; 24+ messages in thread
From: Aravind Iddamsetty @ 2025-08-26  9:34 UTC (permalink / raw)
  To: Rodrigo Vivi
  Cc: intel-xe, dri-devel, Alex Deucher, David Airlie, Simona Vetter,
	Joonas Lahtinen, Hawking Zhang, Lijo Lazar, Michael J,
	Riana Tauro, Anshuman Gupta



On 16-08-2025 03:31, Rodrigo Vivi wrote:
> On Wed, Jul 30, 2025 at 12:19:56PM +0530, Aravind Iddamsetty wrote:
>> Whenever a correctable or an uncorrectable error happens an event is sent
>> to the corresponding listeners of these groups.
>>
>> v2: Rebase
>> v3: protect with CONFIG_NET define.
>>
>> Reviewed-by: Michael J. Ruhl <michael.j.ruhl@intel.com> #v2
>> Signed-off-by: Aravind Iddamsetty <aravind.iddamsetty@linux.intel.com>
>> ---
>>  drivers/gpu/drm/xe/xe_hw_error.c | 41 ++++++++++++++++++++++++++++++++
>>  1 file changed, 41 insertions(+)
>>
>> diff --git a/drivers/gpu/drm/xe/xe_hw_error.c b/drivers/gpu/drm/xe/xe_hw_error.c
>> index bdd9c88674b2..e6e2e6250b70 100644
>> --- a/drivers/gpu/drm/xe/xe_hw_error.c
>> +++ b/drivers/gpu/drm/xe/xe_hw_error.c
>> @@ -2,6 +2,8 @@
>>  /*
>>   * Copyright © 2023 Intel Corporation
>>   */
>> +#include <net/genetlink.h>
>> +#include <uapi/drm/drm_netlink.h>
>>  
>>  #include "xe_gt_printk.h"
>>  #include "xe_hw_error.h"
>> @@ -776,6 +778,43 @@ xe_soc_hw_error_handler(struct xe_tile *tile, const enum hardware_error hw_err)
>>  				(HARDWARE_ERROR_MAX << 1) + 1);
>>  }
>>  
>> +#ifdef CONFIG_NET
>> +static void
>> +generate_netlink_event(struct xe_device *xe, const enum hardware_error hw_err)
>> +{
>> +	struct sk_buff *msg;
>> +	void *hdr;
>> +
>> +	if (!xe->drm.drm_genl_family)
>> +		return;
>> +
>> +	msg = nlmsg_new(NLMSG_DEFAULT_SIZE, GFP_ATOMIC);
>> +	if (!msg) {
>> +		drm_dbg_driver(&xe->drm, "couldn't allocate memory for error multicast event\n");
>> +		return;
>> +	}
>> +
>> +	hdr = genlmsg_put(msg, 0, 0, xe->drm.drm_genl_family, 0, DRM_RAS_CMD_ERROR_EVENT);
> this is something that could be wrapped up in the drm_ras
Are you referring to the entire generate_netlink_event function? I
thought a driver might want to pass in custom info as part of the event,
for example the error ID that was reported by HW.
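
To make the "custom info" idea concrete, here is a userspace sketch of how an error ID could ride in a message as one netlink-style attribute (16-bit length, 16-bit type, payload, padded to 4 bytes). It is only a model of the wire layout, not kernel code: the `EX_*` and `ex_*` names and the attribute number are hypothetical, and it ignores the extra pad attribute that `nla_put_u64_64bit()` may emit.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

#define EX_NLA_ALIGNTO		4u
#define EX_NLA_ALIGN(len)	(((len) + EX_NLA_ALIGNTO - 1) & ~(EX_NLA_ALIGNTO - 1))
#define EX_NLA_HDRLEN		4u	/* two 16-bit fields: length, then type */

/* Hypothetical attribute number, for illustration only. */
#define EX_ATTR_ERROR_ID	1u

/* Append one u64 attribute at offset *used; return -1 if it does not fit. */
static inline int ex_put_u64(uint8_t *buf, size_t cap, size_t *used,
			     uint16_t type, uint64_t value)
{
	uint16_t len = (uint16_t)(EX_NLA_HDRLEN + sizeof(value));
	size_t total = EX_NLA_ALIGN(len);

	if (*used > cap || cap - *used < total)
		return -1;

	memset(buf + *used, 0, total);
	memcpy(buf + *used, &len, sizeof(len));		/* length excludes padding */
	memcpy(buf + *used + 2, &type, sizeof(type));
	memcpy(buf + *used + EX_NLA_HDRLEN, &value, sizeof(value));
	*used += total;

	return 0;
}

/* Walk the attributes and copy out the first u64 of the requested type. */
static inline int ex_get_u64(const uint8_t *buf, size_t used,
			     uint16_t type, uint64_t *value)
{
	size_t off = 0;

	while (off + EX_NLA_HDRLEN <= used) {
		uint16_t len, t;

		memcpy(&len, buf + off, sizeof(len));
		memcpy(&t, buf + off + 2, sizeof(t));
		if (len < EX_NLA_HDRLEN || len > used - off)
			return -1;	/* malformed attribute */

		if (t == type && len == EX_NLA_HDRLEN + sizeof(*value)) {
			memcpy(value, buf + off + EX_NLA_HDRLEN, sizeof(*value));
			return 0;
		}
		off += EX_NLA_ALIGN(len);
	}

	return -1;
}
```

A listener on the multicast group could then learn which counter changed from the event itself instead of re-reading every counter.
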
Thanks,
Aravind.
>
>> +	if (!hdr) {
>> +		drm_dbg_driver(&xe->drm, "mutlicast msg buffer is small\n");
>> +		nlmsg_free(msg);
>> +		return;
>> +	}
>> +
>> +	genlmsg_end(msg, hdr);
>> +
>> +	genlmsg_multicast(xe->drm.drm_genl_family, msg, 0,
>> +			  hw_err ?
>> +			  DRM_GENL_MCAST_UNCORR_ERR
>> +			  : DRM_GENL_MCAST_CORR_ERR,
>> +			  GFP_ATOMIC);
>> +}
>> +#else
>> +static void
>> +generate_netlink_event(struct xe_device *xe, const enum hardware_error hw_err)
>> +{}
>> +#endif
>> +
>>  static void
>>  xe_hw_error_source_handler(struct xe_tile *tile, const enum hardware_error hw_err)
>>  {
>> @@ -837,6 +876,8 @@ xe_hw_error_source_handler(struct xe_tile *tile, const enum hardware_error hw_er
>>  	}
>>  
>>  	xe_mmio_write32(&tile->mmio, DEV_ERR_STAT_REG(hw_err), errsrc);
>> +
>> +	generate_netlink_event(tile_to_xe(tile), hw_err);
>>  unlock:
>>  	spin_unlock_irqrestore(&tile_to_xe(tile)->irq.lock, flags);
>>  }
>> -- 
>> 2.25.1
>>

Thread overview: 24+ messages
2025-07-30  6:49 [RFC v5 0/5] Proposal to use netlink for RAS and Telemetry across drm subsystem Aravind Iddamsetty
2025-07-30  6:49 ` [RFC v5 1/5] drm/netlink: Add netlink infrastructure Aravind Iddamsetty
2025-08-15 17:07   ` Zack McKevitt
2025-08-21  9:45     ` Aravind Iddamsetty
2025-08-25 17:31       ` Zack McKevitt
2025-08-15 21:48   ` Rodrigo Vivi
2025-08-26  5:58     ` Aravind Iddamsetty
2025-07-30  6:49 ` [RFC v5 2/5] drm/xe/RAS: Register netlink capability Aravind Iddamsetty
2025-08-15 21:52   ` Rodrigo Vivi
2025-08-26  9:01     ` Aravind Iddamsetty
2025-07-30  6:49 ` [RFC v5 3/5] drm/xe/RAS: Expose the error counters Aravind Iddamsetty
2025-08-15 21:58   ` Rodrigo Vivi
2025-08-26  9:26     ` Aravind Iddamsetty
2025-07-30  6:49 ` [RFC v5 4/5] drm/netlink: Define multicast groups Aravind Iddamsetty
2025-08-15 22:00   ` Rodrigo Vivi
2025-07-30  6:49 ` [RFC v5 5/5] drm/xe/RAS: send multicast event on occurrence of an error Aravind Iddamsetty
2025-08-15 22:01   ` Rodrigo Vivi
2025-08-26  9:34     ` Aravind Iddamsetty
2025-07-30 21:00 ` [RFC v5 0/5] Proposal to use netlink for RAS and Telemetry across drm subsystem Lukas Wunner
2025-07-31 15:30   ` Aravind Iddamsetty
2025-08-13 20:21 ` Rodrigo Vivi
2025-08-15 21:24   ` Rodrigo Vivi
2025-08-26  4:42     ` Aravind Iddamsetty
2025-08-25  9:38   ` Aravind Iddamsetty
