dri-devel.lists.freedesktop.org archive mirror
* [PATCH 0/2] Introduce DRM_RAS using generic netlink for RAS
@ 2025-09-29 21:44 Rodrigo Vivi
  2025-09-29 21:44 ` [PATCH 1/2] drm/ras: Introduce the DRM RAS infrastructure over generic netlink Rodrigo Vivi
From: Rodrigo Vivi @ 2025-09-29 21:44 UTC
  To: dri-devel, intel-xe
  Cc: Rodrigo Vivi, Hawking Zhang, Alex Deucher, Zack McKevitt,
	Lukas Wunner, Dave Airlie, Simona Vetter, Aravind Iddamsetty,
	Joonas Lahtinen

This work continues the great work started by Aravind ([1] and [2]) to fulfill
the RAS requirements and proposal previously discussed and agreed upon at the
accelerators BoF at Linux Plumbers 2022 [3].

[1]: https://lore.kernel.org/dri-devel/20250730064956.1385855-1-aravind.iddamsetty@linux.intel.com/
[2]: https://lore.kernel.org/all/4cbdfcc5-5020-a942-740e-a602d4c00cc2@linux.intel.com/
[3]: https://airlied.blogspot.com/2022/09/accelerators-bof-outcomes-summary.html

During the last review round, Lukas pointed out that netlink has evolved in
parallel over these years, and that any new netlink family is now required to
use the YAML description and scripts.

With this new requirement in place, the family name is hardcoded in the YAML
file, so we are forced to have a single family name for all of drm, which in
turn forces us to have a registration step.

So, while adding the registration, we introduced the concept of a drm-ras-node.
For now, the only node type supported is the agreed error-counter, but this
could be expanded to other cases such as telemetry, as requested by Zack for
the Qualcomm accel driver.

In this first version, only querying counters is supported, but this can be
extended in the future with multicast notifications and with clearing the
counters.

This design, with multiple nodes per device, is already flexible enough for a
driver to decide whether it wants to handle errors per device, per IP block, or
per error category. I believe this fully addresses the AMD feedback from the
earlier reviews.
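
To make the node concept concrete, here is a minimal Python sketch of one
registry shared by all devices, where a driver registers as many nodes as it
likes and each node carries its own error counters. Every class, method, and
field in it is hypothetical and only illustrates the idea; it is not the
kernel's drm_ras API. Sequential node ids are likewise an assumption.

```python
from __future__ import annotations

from dataclasses import dataclass, field


@dataclass
class RasNode:
    """One registered node; all names here are illustrative, not kernel API."""
    node_id: int
    device_name: str
    node_name: str
    node_type: str = "error-counter"
    # Maps error-id -> [error-name, error-value].
    counters: dict[int, list] = field(default_factory=dict)


class RasRegistry:
    """Single registry for every device, mirroring the single family name."""

    def __init__(self) -> None:
        self._nodes: list[RasNode] = []

    def register(self, device_name: str, node_name: str,
                 errors: dict[int, str]) -> int:
        # Assumption: node ids are handed out sequentially at registration.
        node_id = len(self._nodes)
        node = RasNode(node_id, device_name, node_name)
        node.counters = {eid: [name, 0] for eid, name in errors.items()}
        self._nodes.append(node)
        return node_id

    def list_nodes(self) -> list[dict]:
        return [{"device-name": n.device_name, "node-id": n.node_id,
                 "node-name": n.node_name, "node-type": n.node_type}
                for n in self._nodes]

    def query_error_counter(self, node_id: int, error_id: int) -> dict:
        name, value = self._nodes[node_id].counters[error_id]
        return {"error-id": error_id, "error-name": name,
                "error-value": value}


# A driver that groups errors per category registers one node per category.
registry = RasRegistry()
errors = {12: "SG Unit Error", 16: "SoC Error"}
registry.register("00:02.0", "non-fatal", errors)
registry.register("00:02.0", "correctable", errors)
print(registry.list_nodes())
print(registry.query_error_counter(0, 12))
```

A driver that prefers one node per IP block, or a single node for the whole
device, would only change how many times it calls register() and which
counters it passes in.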

So, my proposal is to start simple with this case as is, and then iterate with
drm-ras in tree so that we evolve together according to the various drivers' RAS
needs.

I have provided documentation and a first Xe implementation of the counters as
a reference.

Also, it is worth mentioning that we have an in-tree pyynl/cli.py tool that
fully exercises this new API. I hope it can serve as the reference code for the
uAPI usage while we continue with the plan of introducing IGT tests and tools
for this, and of aligning the internal vendor tools with the open source
developments and changing them to support these flows.

Example on MTL:

$ sudo ./tools/net/ynl/pyynl/cli.py \
  --spec Documentation/netlink/specs/drm_ras.yaml \
  --dump list-nodes
[{'device-name': '00:02.0',
  'node-id': 0,
  'node-name': 'non-fatal',
  'node-type': 'error-counter'},
 {'device-name': '00:02.0',
  'node-id': 1,
  'node-name': 'correctable',
  'node-type': 'error-counter'}]

$ sudo ./tools/net/ynl/pyynl/cli.py \
  --spec Documentation/netlink/specs/drm_ras.yaml \
  --dump get-error-counters --json '{"node-id":1}'
[{'error-id': 0, 'error-name': 'GT Error', 'error-value': 0},
 {'error-id': 4, 'error-name': 'Display Error', 'error-value': 0},
 {'error-id': 8, 'error-name': 'GSC Error', 'error-value': 0},
 {'error-id': 12, 'error-name': 'SG Unit Error', 'error-value': 0},
 {'error-id': 16, 'error-name': 'SoC Error', 'error-value': 0},
 {'error-id': 17, 'error-name': 'CSC Error', 'error-value': 0}]

$ sudo ./tools/net/ynl/pyynl/cli.py \
  --spec Documentation/netlink/specs/drm_ras.yaml \
  --do query-error-counter --json '{"node-id": 0, "error-id": 12}'
{'error-id': 12, 'error-name': 'SG Unit Error', 'error-value': 0}

$ sudo ./tools/net/ynl/pyynl/cli.py \
  --spec Documentation/netlink/specs/drm_ras.yaml \
  --do query-error-counter --json '{"node-id": 1, "error-id": 16}'
{'error-id': 16, 'error-name': 'SoC Error', 'error-value': 0}
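
For scripting around the commands above, a small wrapper could build the
cli.py argument list and parse the reply. This is only a sketch: the helper
names are made up for illustration, and it assumes the tool keeps printing
Python literals (single-quoted dicts and lists) as in the output shown, which
is why it uses ast.literal_eval rather than a JSON parser. It does not run the
tool; the sample reply is copied from the get-error-counters example.

```python
import ast
import json

# Paths taken from the example commands; adjust to the local kernel tree.
CLI = "./tools/net/ynl/pyynl/cli.py"
SPEC = "Documentation/netlink/specs/drm_ras.yaml"


def build_argv(mode, op, attrs=None):
    """Build the cli.py argument list; mode is 'do' or 'dump'."""
    if mode not in ("do", "dump"):
        raise ValueError("mode must be 'do' or 'dump'")
    argv = [CLI, "--spec", SPEC, f"--{mode}", op]
    if attrs is not None:
        argv += ["--json", json.dumps(attrs)]
    return argv


def parse_reply(text):
    """Parse a reply printed as a Python literal (assumed output format)."""
    return ast.literal_eval(text)


def nonzero_counters(counters):
    """Keep only the counters that have recorded at least one error."""
    return [c for c in counters if c["error-value"] != 0]


SAMPLE_REPLY = """
[{'error-id': 0, 'error-name': 'GT Error', 'error-value': 0},
 {'error-id': 4, 'error-name': 'Display Error', 'error-value': 0},
 {'error-id': 8, 'error-name': 'GSC Error', 'error-value': 0},
 {'error-id': 12, 'error-name': 'SG Unit Error', 'error-value': 0},
 {'error-id': 16, 'error-name': 'SoC Error', 'error-value': 0},
 {'error-id': 17, 'error-name': 'CSC Error', 'error-value': 0}]
"""

counters = parse_reply(SAMPLE_REPLY.strip())
print([c["error-id"] for c in counters])
print(nonzero_counters(counters))
print(build_argv("do", "query-error-counter", {"node-id": 0, "error-id": 12}))
```

The list returned by build_argv() can be handed to subprocess.run(), prefixed
with sudo as in the examples above.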

Thanks,
Rodrigo.

Cc: Hawking Zhang <Hawking.Zhang@amd.com>
Cc: Alex Deucher <alexander.deucher@amd.com>
Cc: Zack McKevitt <zachary.mckevitt@oss.qualcomm.com>
Cc: Lukas Wunner <lukas@wunner.de>
Cc: Dave Airlie <airlied@gmail.com>
Cc: Simona Vetter <simona.vetter@ffwll.ch>
Cc: Aravind Iddamsetty <aravind.iddamsetty@linux.intel.com>
Cc: Joonas Lahtinen <joonas.lahtinen@linux.intel.com>
Signed-off-by: Rodrigo Vivi <rodrigo.vivi@intel.com>

Rodrigo Vivi (2):
  drm/ras: Introduce the DRM RAS infrastructure over generic netlink
  drm/xe: Introduce the usage of drm_ras with supported HW errors

 Documentation/gpu/drm-ras.rst              | 109 +++++++
 Documentation/netlink/specs/drm_ras.yaml   | 130 ++++++++
 drivers/gpu/drm/Kconfig                    |   9 +
 drivers/gpu/drm/Makefile                   |   1 +
 drivers/gpu/drm/drm_drv.c                  |   6 +
 drivers/gpu/drm/drm_ras.c                  | 357 +++++++++++++++++++++
 drivers/gpu/drm/drm_ras_genl_family.c      |  42 +++
 drivers/gpu/drm/drm_ras_nl.c               |  54 ++++
 drivers/gpu/drm/xe/regs/xe_hw_error_regs.h |  22 ++
 drivers/gpu/drm/xe/xe_hw_error.c           | 155 ++++++++-
 include/drm/drm_ras.h                      |  76 +++++
 include/drm/drm_ras_genl_family.h          |  17 +
 include/drm/drm_ras_nl.h                   |  24 ++
 include/uapi/drm/drm_ras.h                 |  49 +++
 14 files changed, 1049 insertions(+), 2 deletions(-)
 create mode 100644 Documentation/gpu/drm-ras.rst
 create mode 100644 Documentation/netlink/specs/drm_ras.yaml
 create mode 100644 drivers/gpu/drm/drm_ras.c
 create mode 100644 drivers/gpu/drm/drm_ras_genl_family.c
 create mode 100644 drivers/gpu/drm/drm_ras_nl.c
 create mode 100644 include/drm/drm_ras.h
 create mode 100644 include/drm/drm_ras_genl_family.h
 create mode 100644 include/drm/drm_ras_nl.h
 create mode 100644 include/uapi/drm/drm_ras.h

-- 
2.51.0


Thread overview: 23+ messages
2025-09-29 21:44 [PATCH 0/2] Introduce DRM_RAS using generic netlink for RAS Rodrigo Vivi
2025-09-29 21:44 ` [PATCH 1/2] drm/ras: Introduce the DRM RAS infrastructure over generic netlink Rodrigo Vivi
2025-10-31  1:32   ` Jakub Kicinski
2025-11-06 13:30     ` Rodrigo Vivi
2025-11-06 14:58       ` Jakub Kicinski
2025-09-29 21:44 ` [PATCH 2/2] drm/xe: Introduce the usage of drm_ras with supported HW errors Rodrigo Vivi
2025-09-30  2:07   ` kernel test robot
2025-10-02 20:38 ` [PATCH 0/2] Introduce DRM_RAS using generic netlink for RAS Zack McKevitt
2025-10-28 19:14   ` Rodrigo Vivi
2025-11-06 13:42   ` Rodrigo Vivi
2025-11-07 20:20     ` Zack McKevitt
2025-11-08  3:01       ` Rodrigo Vivi
2025-10-28 19:13 ` DRM_RAS for CPER Error logging?! Rodrigo Vivi
2025-10-29  2:00   ` Zhang, Hawking
2025-11-06 13:16     ` Rodrigo Vivi
2025-11-10  3:34       ` Dave Airlie
2025-11-10  5:13         ` John Hubbard
2025-11-10 20:35         ` Rodrigo Vivi
2025-11-17 14:39         ` Jason Gunthorpe
2025-10-30 14:47   ` Rodrigo Vivi
2025-10-30 15:37     ` DRM_RAS (netlink genl family) " Rodrigo Vivi
2025-10-31  5:38     ` DRM_RAS " Lukas Wunner
2025-11-06 13:08       ` Rodrigo Vivi
