[PATCH 0/2] Introduce DRM_RAS using generic netlink for RAS

* [PATCH 0/2] Introduce DRM_RAS using generic netlink for RAS
@ 2025-09-29 21:44 Rodrigo Vivi
  2025-09-29 21:44 ` [PATCH 1/2] drm/ras: Introduce the DRM RAS infrastructure over generic netlink Rodrigo Vivi
                   ` (3 more replies)
  0 siblings, 4 replies; 22+ messages in thread
From: Rodrigo Vivi @ 2025-09-29 21:44 UTC
  To: dri-devel, intel-xe
  Cc: Rodrigo Vivi, Hawking Zhang, Alex Deucher, Zack McKevitt,
	Lukas Wunner, Dave Airlie, Simona Vetter, Aravind Iddamsetty,
	Joonas Lahtinen

This work is a continuation of the great work started by Aravind ([1] and [2])
to fulfill the RAS requirements and the proposal previously discussed and
agreed upon at the accelerators BoF at Linux Plumbers 2022 [3].

[1]: https://lore.kernel.org/dri-devel/20250730064956.1385855-1-aravind.iddamsetty@linux.intel.com/
[2]: https://lore.kernel.org/all/4cbdfcc5-5020-a942-740e-a602d4c00cc2@linux.intel.com/
[3]: https://airlied.blogspot.com/2022/09/accelerators-bof-outcomes-summary.html

During the last review round, Lukas pointed out that netlink has evolved in
parallel over these years, and that any new netlink family is now required to
use the YAML description and scripts.

With this new requirement in place, the family name is hardcoded in the YAML file,
so we are forced to have a single family name for the entire drm, which in turn
forces us to have a registration mechanism.

So, while doing the registration, we created the concept of a drm-ras-node.
For now the only node type supported is the agreed error-counter, but that could
be expanded to other cases like telemetry, as requested by Zack for the Qualcomm
accel driver.

In this first version, only querying counters is supported. This can be extended
in the future with multicast notifications and with clearing of the counters.

This design with multiple nodes per device is already flexible enough for a driver
to decide whether it wants to handle errors per device, per IP block, or per error
category. I believe this fully addresses the AMD feedback from the earlier
reviews.

So, my proposal is to start simple with this case as is, and then iterate with
drm-ras in tree so that we evolve together according to the RAS needs of the
various drivers.

I have provided documentation and a first Xe implementation of the counters
as a reference.

Also, it is worth mentioning that the in-tree pyynl/cli.py tool can fully
exercise this new API, so I hope it can serve as the reference code for the uAPI
usage while we continue with the plan of introducing IGT tests and tools for this,
and of opening up the internal vendor tools alongside the open source
developments and changing them to support these flows.

Example on MTL:

$ sudo ./tools/net/ynl/pyynl/cli.py \
  --spec Documentation/netlink/specs/drm_ras.yaml \
  --dump list-nodes
[{'device-name': '00:02.0',
  'node-id': 0,
  'node-name': 'non-fatal',
  'node-type': 'error-counter'},
 {'device-name': '00:02.0',
  'node-id': 1,
  'node-name': 'correctable',
  'node-type': 'error-counter'}]
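
Since node IDs are assigned dynamically at registration, a script would
typically resolve them from the list-nodes dump before issuing any other
query. A minimal sketch, assuming cli.py prints a Python-literal list exactly
as shown above; node_ids_by_name() is a hypothetical helper for illustration,
not part of pyynl:

```python
import ast

# Output copied verbatim from the list-nodes example above.
LIST_NODES_OUTPUT = """
[{'device-name': '00:02.0',
  'node-id': 0,
  'node-name': 'non-fatal',
  'node-type': 'error-counter'},
 {'device-name': '00:02.0',
  'node-id': 1,
  'node-name': 'correctable',
  'node-type': 'error-counter'}]
"""


def node_ids_by_name(cli_output):
    """Map (device-name, node-name) to the dynamically assigned node-id."""
    nodes = ast.literal_eval(cli_output.strip())
    return {(n['device-name'], n['node-name']): n['node-id'] for n in nodes}


lookup = node_ids_by_name(LIST_NODES_OUTPUT)
# Resolve the ID to pass as "node-id" in later requests.
print(lookup[('00:02.0', 'correctable')])
```

Keying on the (device, node name) pair rather than on a fixed ID keeps such a
script working if the registration order changes.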

$ sudo ./tools/net/ynl/pyynl/cli.py \
  --spec Documentation/netlink/specs/drm_ras.yaml \
  --dump get-error-counters --json '{"node-id":1}'
[{'error-id': 0, 'error-name': 'GT Error', 'error-value': 0},
 {'error-id': 4, 'error-name': 'Display Error', 'error-value': 0},
 {'error-id': 8, 'error-name': 'GSC Error', 'error-value': 0},
 {'error-id': 12, 'error-name': 'SG Unit Error', 'error-value': 0},
 {'error-id': 16, 'error-name': 'SoC Error', 'error-value': 0},
 {'error-id': 17, 'error-name': 'CSC Error', 'error-value': 0}]

$ sudo ./tools/net/ynl/pyynl/cli.py \
  --spec Documentation/netlink/specs/drm_ras.yaml \
  --do query-error-counter --json '{"node-id": 0, "error-id": 12}'
{'error-id': 12, 'error-name': 'SG Unit Error', 'error-value': 0}

$ sudo ./tools/net/ynl/pyynl/cli.py \
  --spec Documentation/netlink/specs/drm_ras.yaml \
  --do query-error-counter --json '{"node-id": 1, "error-id": 16}'
{'error-id': 16, 'error-name': 'SoC Error', 'error-value': 0}
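
For monitoring, the get-error-counters dump can be turned into a snapshot and
compared against a later one. This is only a sketch under the assumption that
the cli.py output is the Python-literal list shown above; parse_counters() and
changed() are hypothetical names, not an existing tool:

```python
import ast

# Output copied verbatim from the get-error-counters example above (node-id 1).
DUMP_OUTPUT = """
[{'error-id': 0, 'error-name': 'GT Error', 'error-value': 0},
 {'error-id': 4, 'error-name': 'Display Error', 'error-value': 0},
 {'error-id': 8, 'error-name': 'GSC Error', 'error-value': 0},
 {'error-id': 12, 'error-name': 'SG Unit Error', 'error-value': 0},
 {'error-id': 16, 'error-name': 'SoC Error', 'error-value': 0},
 {'error-id': 17, 'error-name': 'CSC Error', 'error-value': 0}]
"""


def parse_counters(cli_output):
    """Return {error-id: (error-name, error-value)} for one node."""
    entries = ast.literal_eval(cli_output.strip())
    return {e['error-id']: (e['error-name'], e['error-value']) for e in entries}


def changed(before, after):
    """Return {error-id: (name, delta)} for counters whose value differs."""
    result = {}
    for error_id, (name, value) in after.items():
        old = before.get(error_id, (name, 0))[1]
        if value != old:
            result[error_id] = (name, value - old)
    return result


snapshot = parse_counters(DUMP_OUTPUT)
# Comparing a snapshot with itself reports no changes.
print(changed(snapshot, snapshot))
```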

Thanks,
Rodrigo.

Cc: Hawking Zhang <Hawking.Zhang@amd.com>
Cc: Alex Deucher <alexander.deucher@amd.com>
Cc: Zack McKevitt <zachary.mckevitt@oss.qualcomm.com>
Cc: Lukas Wunner <lukas@wunner.de>
Cc: Dave Airlie <airlied@gmail.com>
Cc: Simona Vetter <simona.vetter@ffwll.ch>
Cc: Aravind Iddamsetty <aravind.iddamsetty@linux.intel.com>
Cc: Joonas Lahtinen <joonas.lahtinen@linux.intel.com>
Signed-off-by: Rodrigo Vivi <rodrigo.vivi@intel.com>

Rodrigo Vivi (2):
  drm/ras: Introduce the DRM RAS infrastructure over generic netlink
  drm/xe: Introduce the usage of drm_ras with supported HW errors

 Documentation/gpu/drm-ras.rst              | 109 +++++++
 Documentation/netlink/specs/drm_ras.yaml   | 130 ++++++++
 drivers/gpu/drm/Kconfig                    |   9 +
 drivers/gpu/drm/Makefile                   |   1 +
 drivers/gpu/drm/drm_drv.c                  |   6 +
 drivers/gpu/drm/drm_ras.c                  | 357 +++++++++++++++++++++
 drivers/gpu/drm/drm_ras_genl_family.c      |  42 +++
 drivers/gpu/drm/drm_ras_nl.c               |  54 ++++
 drivers/gpu/drm/xe/regs/xe_hw_error_regs.h |  22 ++
 drivers/gpu/drm/xe/xe_hw_error.c           | 155 ++++++++-
 include/drm/drm_ras.h                      |  76 +++++
 include/drm/drm_ras_genl_family.h          |  17 +
 include/drm/drm_ras_nl.h                   |  24 ++
 include/uapi/drm/drm_ras.h                 |  49 +++
 14 files changed, 1049 insertions(+), 2 deletions(-)
 create mode 100644 Documentation/gpu/drm-ras.rst
 create mode 100644 Documentation/netlink/specs/drm_ras.yaml
 create mode 100644 drivers/gpu/drm/drm_ras.c
 create mode 100644 drivers/gpu/drm/drm_ras_genl_family.c
 create mode 100644 drivers/gpu/drm/drm_ras_nl.c
 create mode 100644 include/drm/drm_ras.h
 create mode 100644 include/drm/drm_ras_genl_family.h
 create mode 100644 include/drm/drm_ras_nl.h
 create mode 100644 include/uapi/drm/drm_ras.h

-- 
2.51.0

* [PATCH 1/2] drm/ras: Introduce the DRM RAS infrastructure over generic netlink
  2025-09-29 21:44 [PATCH 0/2] Introduce DRM_RAS using generic netlink for RAS Rodrigo Vivi
@ 2025-09-29 21:44 ` Rodrigo Vivi
  2025-10-31  1:32   ` Jakub Kicinski
  2025-09-29 21:44 ` [PATCH 2/2] drm/xe: Introduce the usage of drm_ras with supported HW errors Rodrigo Vivi
                   ` (2 subsequent siblings)
  3 siblings, 1 reply; 22+ messages in thread
From: Rodrigo Vivi @ 2025-09-29 21:44 UTC
  To: dri-devel, intel-xe
  Cc: Rodrigo Vivi, Zack McKevitt, Lukas Wunner, Lijo Lazar,
	Hawking Zhang, Aravind Iddamsetty

Introduces the DRM RAS infrastructure over generic netlink.

The new interface allows drivers to expose RAS nodes and their
associated error counters to userspace in a structured and extensible
way. Each drm_ras node can register its own set of error counters, which
are then discoverable and queryable through netlink operations. This
lays the groundwork for reporting and managing hardware error states
in a unified manner across different DRM drivers.

Currently it only supports error-counter nodes, but it can be
extended later.
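
As an illustration of the error-counter contract described in the DOC comment
of drm_ras.c (a [first, last] ID range, with -ENOENT marking IDs to skip, and
a restart point for multi-part dumps), here is a small user-space model. It
is a sketch, not the kernel code: the Node class, the `budget` parameter that
stands in for a full message buffer, and the 0..17 range read off the MTL
example output are all assumptions made for illustration.

```python
import errno


class Node:
    """Hypothetical stand-in for an error-counter node."""

    def __init__(self, first, last, counters):
        self.first, self.last = first, last
        self.counters = counters  # {error_id: (name, value)}; other IDs are holes

    def query(self, error_id):
        # Models the driver callback: -ENOENT means "skip this ID".
        if error_id not in self.counters:
            return -errno.ENOENT, None, None
        name, value = self.counters[error_id]
        return 0, name, value


def dump(node, restart=0, budget=2):
    """Emit up to `budget` (>= 1) entries; return (entries, restart or None)."""
    out = []
    for error_id in range(max(node.first, restart), node.last + 1):
        ret, name, value = node.query(error_id)
        if ret == -errno.ENOENT:
            continue
        if len(out) == budget:
            # "Buffer full": remember where the next part must resume.
            return out, error_id
        out.append((error_id, name, value))
    return out, None


def dump_all(node, budget=2):
    """Chain partial dumps until the whole range has been walked."""
    entries, restart = [], 0
    while restart is not None:
        part, restart = dump(node, restart, budget)
        entries += part
    return entries


node = Node(0, 17, {0: ('GT Error', 0), 4: ('Display Error', 0),
                    8: ('GSC Error', 0), 12: ('SG Unit Error', 0),
                    16: ('SoC Error', 0), 17: ('CSC Error', 0)})
print([error_id for error_id, _, _ in dump_all(node)])
```

The model resumes at the first ID that did not fit, so no counter is reported
twice or lost across parts.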

The registration is also not tied to any drm node, so it can be
used by accel devices as well.

It uses the new and mandatory YAML description format stored in
Documentation/netlink/specs/. This forces a single generic netlink
family namespace for the entire drm: "drm-ras".
But multiple endpoints are supported within the single family.

Any modification to this API needs to be applied to
Documentation/netlink/specs/drm_ras.yaml before regenerating the
code:

$ tools/net/ynl/pyynl/ynl_gen_c.py --spec \
 Documentation/netlink/specs/drm_ras.yaml --mode uapi --header \
 > include/uapi/drm/drm_ras.h

$ tools/net/ynl/pyynl/ynl_gen_c.py --spec \
 Documentation/netlink/specs/drm_ras.yaml --mode kernel --header \
 > include/drm/drm_ras_nl.h

$ tools/net/ynl/pyynl/ynl_gen_c.py --spec \
 Documentation/netlink/specs/drm_ras.yaml --mode kernel --source \
 > drivers/gpu/drm/drm_ras_nl.c
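
The three regeneration steps could also be driven from one small script so
that none is forgotten after a spec change. This is only a sketch: it uses
exactly the generator path, flags and output files listed above, while
build_commands() and regenerate() are hypothetical helper names, and
regenerate() assumes it is run from the top of the kernel tree.

```python
import subprocess

GEN = "tools/net/ynl/pyynl/ynl_gen_c.py"
SPEC = "Documentation/netlink/specs/drm_ras.yaml"

# (mode, kind flag, output file), as listed in the commands above.
TARGETS = [
    ("uapi", "--header", "include/uapi/drm/drm_ras.h"),
    ("kernel", "--header", "include/drm/drm_ras_nl.h"),
    ("kernel", "--source", "drivers/gpu/drm/drm_ras_nl.c"),
]


def build_commands():
    """Return a list of (argv, output path) pairs, one per generated file."""
    return [([GEN, "--spec", SPEC, "--mode", mode, kind], out)
            for mode, kind, out in TARGETS]


def regenerate():
    """Run each generator command, redirecting stdout to its target file."""
    for argv, out in build_commands():
        with open(out, "w") as f:
            subprocess.run(argv, stdout=f, check=True)


# Print the commands instead of running them; call regenerate() to apply.
for argv, out in build_commands():
    print(" ".join(argv), ">", out)
```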

Cc: Zack McKevitt <zachary.mckevitt@oss.qualcomm.com>
Cc: Lukas Wunner <lukas@wunner.de>
Cc: Lijo Lazar <lijo.lazar@amd.com>
Cc: Hawking Zhang <Hawking.Zhang@amd.com>
Co-developed-by: Aravind Iddamsetty <aravind.iddamsetty@linux.intel.com>
Signed-off-by: Aravind Iddamsetty <aravind.iddamsetty@linux.intel.com>
Signed-off-by: Rodrigo Vivi <rodrigo.vivi@intel.com>
---
 Documentation/gpu/drm-ras.rst            | 109 +++++++
 Documentation/netlink/specs/drm_ras.yaml | 130 +++++++++
 drivers/gpu/drm/Kconfig                  |   9 +
 drivers/gpu/drm/Makefile                 |   1 +
 drivers/gpu/drm/drm_drv.c                |   6 +
 drivers/gpu/drm/drm_ras.c                | 357 +++++++++++++++++++++++
 drivers/gpu/drm/drm_ras_genl_family.c    |  42 +++
 drivers/gpu/drm/drm_ras_nl.c             |  54 ++++
 include/drm/drm_ras.h                    |  76 +++++
 include/drm/drm_ras_genl_family.h        |  17 ++
 include/drm/drm_ras_nl.h                 |  24 ++
 include/uapi/drm/drm_ras.h               |  49 ++++
 12 files changed, 874 insertions(+)
 create mode 100644 Documentation/gpu/drm-ras.rst
 create mode 100644 Documentation/netlink/specs/drm_ras.yaml
 create mode 100644 drivers/gpu/drm/drm_ras.c
 create mode 100644 drivers/gpu/drm/drm_ras_genl_family.c
 create mode 100644 drivers/gpu/drm/drm_ras_nl.c
 create mode 100644 include/drm/drm_ras.h
 create mode 100644 include/drm/drm_ras_genl_family.h
 create mode 100644 include/drm/drm_ras_nl.h
 create mode 100644 include/uapi/drm/drm_ras.h

diff --git a/Documentation/gpu/drm-ras.rst b/Documentation/gpu/drm-ras.rst
new file mode 100644
index 000000000000..992c36dd4d8d
--- /dev/null
+++ b/Documentation/gpu/drm-ras.rst
@@ -0,0 +1,109 @@
+.. SPDX-License-Identifier: GPL-2.0+
+
+============================
+DRM RAS over Generic Netlink
+============================
+
+The DRM RAS (Reliability, Availability, Serviceability) interface provides a
+standardized way for GPU/accelerator drivers to expose error counters and
+other reliability nodes to user space via Generic Netlink. This allows
+diagnostic tools, monitoring daemons, or test infrastructure to query hardware
+health in a uniform way across different DRM drivers.
+
+Key Goals:
+
+* Provide a standardized RAS solution for GPU and accelerator drivers, enabling
+  data center monitoring and reliability operations.
+* Implement a single drm-ras Generic Netlink family to meet modern Netlink YAML
+  specifications and centralize all RAS-related communication in one namespace.
+* Support a basic error counter interface, addressing the immediate, essential
+  monitoring needs.
+* Offer a flexible, future-proof interface that can be extended to support
+  additional types of RAS data in the future.
+* Allow multiple nodes per driver, enabling drivers to register separate
+  nodes for different IP blocks, sub-blocks, or other logical subdivisions
+  as applicable.
+
+Nodes
+=====
+
+Nodes are logical abstractions representing an error source or block within
+the device. Currently, only error counter nodes are supported.
+
+Drivers are responsible for registering and unregistering nodes via the
+`drm_ras_node_register()` and `drm_ras_node_unregister()` APIs.
+
+Node Management
+-------------------
+
+.. kernel-doc:: drivers/gpu/drm/drm_ras.c
+   :doc: DRM RAS Node Management
+.. kernel-doc:: drivers/gpu/drm/drm_ras.c
+   :internal:
+
+Generic Netlink Usage
+=====================
+
+The interface is implemented as a Generic Netlink family named ``drm-ras``.
+User space tools can:
+
+* List registered nodes with the ``list-nodes`` command.
+* List all error counters in a node with the ``get-error-counters`` command.
+* Query a single error counter with the ``query-error-counter`` command.
+
+YAML-based Interface
+--------------------
+
+The interface is described in a YAML specification:
+
+:ref:`Documentation/netlink/specs/drm_ras.yaml`
+
+This YAML is used to auto-generate the uAPI header and kernel code via
+``tools/net/ynl/pyynl/ynl_gen_c.py``, and drives the structure of netlink
+attributes and operations.
+
+Usage Notes
+-----------
+
+* User space must first enumerate nodes to obtain their IDs.
+* Node IDs are then used for all further queries, such as error counters.
+* The interface supports future extension by adding new node types and
+  additional attributes.
+
+Example: List nodes using pyynl CLI tool
+
+.. code-block:: bash
+
+    ./tools/net/ynl/pyynl/cli.py --spec Documentation/netlink/specs/drm_ras.yaml --dump list-nodes
+    [{'device-name': '03:00.0',
+      'node-id': 0,
+      'node-name': 'tile0-gt0-correctable',
+      'node-type': 'error-counter'},
+     {'device-name': '03:00.0',
+      'node-id': 1,
+      'node-name': 'tile0-gt1-uncorrectable',
+      'node-type': 'error-counter'},
+     {'device-name': '03:00.0',
+      'node-id': 2,
+      'node-name': 'soc-uncorrectable',
+      'node-type': 'error-counter'}]
+
+Example: List all error counters using pyynl CLI tool
+
+.. code-block:: bash
+
+    ./tools/net/ynl/pyynl/cli.py --spec Documentation/netlink/specs/drm_ras.yaml --dump get-error-counters --json '{"node-id":0}'
+    [{'error-id': 0, 'error-name': 'correctable-l3', 'error-value': 0},
+     {'error-id': 3, 'error-name': 'correctable-sampler', 'error-value': 0}]
+
+    ./tools/net/ynl/pyynl/cli.py --spec Documentation/netlink/specs/drm_ras.yaml --dump get-error-counters --json '{"node-id":1}'
+    [{'error-id': 13, 'error-name': 'correctable-l3', 'error-value': 0},
+     {'error-id': 17, 'error-name': 'correctable-sampler', 'error-value': 0}]
+
+Example: Query an error counter for a given node
+
+.. code-block:: bash
+
+    ./tools/net/ynl/pyynl/cli.py --spec Documentation/netlink/specs/drm_ras.yaml --do query-error-counter --json '{"node-id": 0, "error-id": 0}'
+    {'error-id': 0, 'error-name': 'correctable-l3', 'error-value': 0}
+
diff --git a/Documentation/netlink/specs/drm_ras.yaml b/Documentation/netlink/specs/drm_ras.yaml
new file mode 100644
index 000000000000..be0e379c5bc9
--- /dev/null
+++ b/Documentation/netlink/specs/drm_ras.yaml
@@ -0,0 +1,130 @@
+# SPDX-License-Identifier: ((GPL-2.0 WITH Linux-syscall-note) OR BSD-3-Clause)
+---
+name: drm-ras
+protocol: genetlink
+uapi-header: drm/drm_ras.h
+
+doc: >-
+  DRM RAS (Reliability, Availability, Serviceability) over Generic Netlink.
+  Provides a standardized mechanism for DRM drivers to register "nodes"
+  representing hardware/software components capable of reporting error counters.
+  Userspace tools can query the list of nodes or individual error counters
+  via the Generic Netlink interface.
+
+definitions:
+  -
+    type: enum
+    name: node-type
+    value-start: 1
+    entries: [error-counter]
+    doc: >-
+         Type of the node. Currently, only error-counter nodes are
+         supported, which expose reliability counters for a hardware/software
+         component.
+
+attribute-sets:
+  -
+    name: node-attrs
+    attributes:
+      -
+        name: node-id
+        type: u32
+        doc: >-
+             Unique identifier for the node.
+             Assigned dynamically by the DRM RAS core upon registration.
+      -
+        name: device-name
+        type: string
+        doc: >-
+             Device name chosen by the driver at registration.
+             Can be a PCI BDF, UUID, or module name if unique.
+      -
+        name: node-name
+        type: string
+        doc: >-
+             Node name chosen by the driver at registration.
+             Can be an IP block name, or any name that identifies the
+             RAS node inside the device.
+      -
+        name: node-type
+        type: u32
+        doc: Type of this node, identifying its function.
+        enum: node-type
+  -
+    name: error-counter-attrs
+    attributes:
+      -
+        name: node-id
+        type: u32
+        doc: Node ID targeted by this error counter operation.
+      -
+        name: error-id
+        type: u32
+        doc: Unique identifier for a specific error counter within a node.
+      -
+        name: error-name
+        type: string
+        doc: Name of the error.
+      -
+        name: error-value
+        type: u32
+        doc: Current value of the requested error counter.
+
+operations:
+  list:
+    -
+      name: list-nodes
+      doc: >-
+           Retrieve the full list of currently registered DRM RAS nodes.
+           Each node includes its dynamically assigned ID, name, and type.
+           **Important:** User space must call this operation first to obtain
+           the node IDs. These IDs are required for all subsequent
+           operations on nodes, such as querying error counters.
+      attribute-set: node-attrs
+      flags: [admin-perm]
+      dump:
+        reply:
+          attributes:
+            - node-id
+            - device-name
+            - node-name
+            - node-type
+    -
+      name: get-error-counters
+      doc: >-
+           Retrieve the full list of error counters for a given node.
+           The response includes the id, the name, and the current
+           value of each counter.
+      attribute-set: error-counter-attrs
+      flags: [admin-perm]
+      dump:
+        request:
+          attributes:
+            - node-id
+        reply:
+          attributes:
+            - error-id
+            - error-name
+            - error-value
+    -
+      name: query-error-counter
+      doc: >-
+           Query the information of a specific error counter for a given node.
+           Users must provide the node ID and the error counter ID.
+           The response contains the id, the name, and the current value
+           of the counter.
+      attribute-set: error-counter-attrs
+      flags: [admin-perm]
+      do:
+        request:
+          attributes:
+            - node-id
+            - error-id
+        reply:
+          attributes:
+            - error-id
+            - error-name
+            - error-value
+
+kernel-family:
+  headers: ["drm/drm_ras_nl.h"]
diff --git a/drivers/gpu/drm/Kconfig b/drivers/gpu/drm/Kconfig
index fda170730468..2043de78813d 100644
--- a/drivers/gpu/drm/Kconfig
+++ b/drivers/gpu/drm/Kconfig
@@ -130,6 +130,15 @@ config DRM_PANIC_SCREEN_QR_VERSION
 	  Smaller QR code are easier to read, but will contain less debugging
 	  data. Default is 40.
 
+config DRM_RAS
+	bool "DRM RAS support"
+	depends on DRM
+	help
+	  Enables the DRM RAS (Reliability, Availability and Serviceability)
+	  support for DRM drivers. This provides a Generic Netlink interface
+	  for error reporting and queries.
+	  If in doubt, say "N".
+
 config DRM_DEBUG_DP_MST_TOPOLOGY_REFS
         bool "Enable refcount backtrace history in the DP MST helpers"
 	depends on STACKTRACE_SUPPORT
diff --git a/drivers/gpu/drm/Makefile b/drivers/gpu/drm/Makefile
index 4b2f7d794275..31cdf98f09ce 100644
--- a/drivers/gpu/drm/Makefile
+++ b/drivers/gpu/drm/Makefile
@@ -93,6 +93,7 @@ drm-$(CONFIG_DRM_ACCEL) += ../../accel/drm_accel.o
 drm-$(CONFIG_DRM_PANIC) += drm_panic.o
 drm-$(CONFIG_DRM_DRAW) += drm_draw.o
 drm-$(CONFIG_DRM_PANIC_SCREEN_QR_CODE) += drm_panic_qr.o
+drm-$(CONFIG_DRM_RAS) += drm_ras.o drm_ras_nl.o drm_ras_genl_family.o
 obj-$(CONFIG_DRM)	+= drm.o
 
 obj-$(CONFIG_DRM_PANEL_ORIENTATION_QUIRKS) += drm_panel_orientation_quirks.o
diff --git a/drivers/gpu/drm/drm_drv.c b/drivers/gpu/drm/drm_drv.c
index 8e3cb08241c8..96841b5c0b9d 100644
--- a/drivers/gpu/drm/drm_drv.c
+++ b/drivers/gpu/drm/drm_drv.c
@@ -53,6 +53,7 @@
 #include <drm/drm_panic.h>
 #include <drm/drm_print.h>
 #include <drm/drm_privacy_screen_machine.h>
+#include <drm/drm_ras_genl_family.h>
 
 #include "drm_crtc_internal.h"
 #include "drm_internal.h"
@@ -1220,6 +1221,7 @@ static const struct file_operations drm_stub_fops = {
 
 static void drm_core_exit(void)
 {
+	drm_ras_genl_family_unregister();
 	drm_privacy_screen_lookup_exit();
 	drm_panic_exit();
 	accel_core_exit();
@@ -1258,6 +1260,10 @@ static int __init drm_core_init(void)
 
 	drm_privacy_screen_lookup_init();
 
+	ret = drm_ras_genl_family_register();
+	if (ret < 0)
+		goto error;
+
 	drm_core_init_complete = true;
 
 	DRM_DEBUG("Initialized\n");
diff --git a/drivers/gpu/drm/drm_ras.c b/drivers/gpu/drm/drm_ras.c
new file mode 100644
index 000000000000..975b3d197edc
--- /dev/null
+++ b/drivers/gpu/drm/drm_ras.c
@@ -0,0 +1,357 @@
+// SPDX-License-Identifier: MIT
+/*
+ * Copyright © 2025 Intel Corporation
+ */
+
+#include <linux/module.h>
+#include <linux/kernel.h>
+#include <linux/netdevice.h>
+#include <linux/xarray.h>
+#include <net/genetlink.h>
+
+#include <drm/drm_ras.h>
+
+/**
+ * DOC: DRM RAS Node Management
+ *
+ * This module provides the infrastructure to manage RAS (Reliability,
+ * Availability, and Serviceability) nodes for DRM drivers. Each
+ * DRM driver may register one or more RAS nodes, which represent
+ * logical components capable of reporting error counters and other
+ * reliability metrics.
+ *
+ * The nodes are stored in a global xarray `drm_ras_xa` to allow
+ * efficient lookup by ID. Nodes can be registered or unregistered
+ * dynamically at runtime.
+ *
+ * A Generic Netlink family `drm-ras` exposes three main operations to
+ * userspace:
+ *
+ * 1. LIST_NODES: Dump all currently registered RAS nodes.
+ *    The user receives an array of node IDs, names, and types.
+ *
+ * 2. GET_ERROR_COUNTERS: Dump all error counters of a given node.
+ *    The user receives an array of error IDs, names, and current value.
+ *
+ * 3. QUERY_ERROR_COUNTER: Query a specific error counter for a given node.
+ *    Userspace must provide the node ID and the counter ID, and
+ *    receives the ID, the error name, and its current value.
+ *
+ * Node registration:
+ * - drm_ras_node_register(): Registers a new node and assigns
+ *   it a unique ID in the xarray.
+ * - drm_ras_node_unregister(): Removes a previously registered
+ *   node from the xarray.
+ *
+ * Node type:
+ * - ERROR_COUNTER:
+ *     + Currently, only error counters are supported.
+ *     + The driver must implement the query_error_counter() callback to provide
+ *       the name and the value of the error counter.
+ *     + The driver must provide an error_counter_range.last value informing the
+ *       last valid error ID.
+ *     + The driver can provide an error_counter_range.first value informing the
+ *       first valid error ID.
+ *     + The error counters in the driver don't need to be contiguous, but the
+ *       driver must return -ENOENT to the query_error_counter as an indication
+ *       that the ID should be skipped and not listed in the netlink API.
+ *
+ * Netlink handlers:
+ * - drm_ras_nl_list_nodes_dumpit(): Implements the LIST_NODES
+ *   operation, iterating over the xarray.
+ * - drm_ras_nl_get_error_counters_dumpit(): Implements the GET_ERROR_COUNTERS
+ *   operation, iterating over the known valid error_counter_range.
+ * - drm_ras_nl_query_error_counter_doit(): Implements the QUERY_ERROR_COUNTER
+ *   operation, fetching a counter value from a specific node.
+ */
+
+static DEFINE_XARRAY_ALLOC(drm_ras_xa);
+
+/*
+ * The netlink callback context carries dump state across multiple dumpit calls
+ */
+struct drm_ras_ctx {
+	/* Which xarray id to restart the dump from */
+	unsigned long restart;
+};
+
+/**
+ * drm_ras_nl_list_nodes_dumpit() - Dump all registered RAS nodes
+ * @skb: Netlink message buffer
+ * @cb: Callback context for multi-part dumps
+ *
+ * Iterates over all registered RAS nodes in the global xarray and appends
+ * their attributes (ID, name, type) to the given netlink message buffer.
+ * Uses @cb->ctx to track progress in case the message buffer fills up, allowing
+ * multi-part dump support. On buffer overflow, updates the context to resume
+ * from the last node on the next invocation.
+ *
+ * Return: 0 if all nodes fit in @skb, number of bytes added to @skb if
+ *          the buffer filled up (requires multi-part continuation), or
+ *          a negative error code on failure.
+ */
+int drm_ras_nl_list_nodes_dumpit(struct sk_buff *skb,
+				 struct netlink_callback *cb)
+{
+	struct drm_ras_ctx *ctx = (void *)cb->ctx;
+	struct drm_ras_node *node;
+	struct nlattr *hdr;
+	unsigned long id;
+	int ret = 0;
+
+	xa_for_each(&drm_ras_xa, id, node) {
+		if (id < ctx->restart)
+			continue;
+
+		hdr = genlmsg_put(skb, NETLINK_CB(cb->skb).portid,
+				  cb->nlh->nlmsg_seq,
+				  &drm_ras_nl_family, NLM_F_MULTI,
+				  DRM_RAS_CMD_LIST_NODES);
+		if (!hdr) {
+			ret = -EMSGSIZE;
+			break;
+		}
+
+		ret = nla_put_u32(skb, DRM_RAS_A_NODE_ATTRS_NODE_ID, node->id);
+		if (ret) {
+			genlmsg_cancel(skb, hdr);
+			break;
+		}
+
+		ret = nla_put_string(skb, DRM_RAS_A_NODE_ATTRS_DEVICE_NAME,
+				     node->device_name);
+		if (ret) {
+			genlmsg_cancel(skb, hdr);
+			break;
+		}
+
+		ret = nla_put_string(skb, DRM_RAS_A_NODE_ATTRS_NODE_NAME,
+				     node->node_name);
+		if (ret) {
+			genlmsg_cancel(skb, hdr);
+			break;
+		}
+
+		ret = nla_put_u32(skb, DRM_RAS_A_NODE_ATTRS_NODE_TYPE,
+				  node->type);
+		if (ret) {
+			genlmsg_cancel(skb, hdr);
+			break;
+		}
+
+		genlmsg_end(skb, hdr);
+	}
+
+	if (ret == -EMSGSIZE) {
+		ctx->restart = id;
+		return skb->len;
+	}
+
+	return ret;
+}
+
+static int get_node_error_counter(u32 node_id, u32 error_id,
+				  const char **name, u32 *value)
+{
+	struct drm_ras_node *node;
+
+	node = xa_load(&drm_ras_xa, node_id);
+	if (!node || !node->query_error_counter)
+		return -ENOENT;
+
+	if (error_id < node->error_counter_range.first ||
+	    error_id > node->error_counter_range.last)
+		return -EINVAL;
+
+	return node->query_error_counter(node, error_id, name, value);
+}
+
+static int msg_reply_value(struct sk_buff *msg, u32 error_id,
+			   const char *error_name, u32 value)
+{
+	int ret;
+
+	ret = nla_put_u32(msg, DRM_RAS_A_ERROR_COUNTER_ATTRS_ERROR_ID, error_id);
+	if (ret)
+		return ret;
+
+	ret = nla_put_string(msg, DRM_RAS_A_ERROR_COUNTER_ATTRS_ERROR_NAME,
+			     error_name);
+	if (ret)
+		return ret;
+
+	return nla_put_u32(msg, DRM_RAS_A_ERROR_COUNTER_ATTRS_ERROR_VALUE,
+			   value);
+}
+
+static int doit_reply_value(struct genl_info *info, u32 node_id,
+			    u32 error_id)
+{
+	struct sk_buff *msg;
+	struct nlattr *hdr;
+	const char *error_name;
+	u32 value;
+	int ret;
+
+	msg = genlmsg_new(NLMSG_GOODSIZE, GFP_KERNEL);
+	if (!msg)
+		return -ENOMEM;
+	hdr = genlmsg_put_reply(msg, info, &drm_ras_nl_family, 0,
+				DRM_RAS_CMD_QUERY_ERROR_COUNTER);
+	if (!hdr) {
+		nlmsg_free(msg);
+		return -EMSGSIZE;
+	}
+
+	ret = get_node_error_counter(node_id, error_id,
+				     &error_name, &value);
+	if (ret)
+		return ret;
+
+	ret = msg_reply_value(msg, error_id, error_name, value);
+	if (ret)
+		return ret;
+
+	genlmsg_end(msg, hdr);
+
+	return genlmsg_reply(msg, info);
+}
+
+/**
+ * drm_ras_nl_get_error_counters_dumpit() - Dump all Error Counters
+ * @skb: Netlink message buffer
+ * @cb: Callback context for multi-part dumps
+ *
+ * Iterates over all error counters in a given Node and appends
+ * their attributes (ID, name, value) to the given netlink message buffer.
+ * Uses @cb->ctx to track progress in case the message buffer fills up, allowing
+ * multi-part dump support. On buffer overflow, updates the context to resume
+ * from the last error on the next invocation.
+ *
+ * Return: 0 if all errors fit in @skb, number of bytes added to @skb if
+ *          the buffer filled up (requires multi-part continuation), or
+ *          a negative error code on failure.
+ */
+int drm_ras_nl_get_error_counters_dumpit(struct sk_buff *skb,
+					 struct netlink_callback *cb)
+{
+	const struct genl_info *info = genl_info_dump(cb);
+	struct drm_ras_ctx *ctx = (void *)cb->ctx;
+	struct drm_ras_node *node;
+	struct nlattr *hdr;
+	const char *error_name;
+	u32 node_id, error_id, value;
+	int ret = 0;
+
+	if (!info->attrs || !info->attrs[DRM_RAS_A_ERROR_COUNTER_ATTRS_NODE_ID])
+		return -EINVAL;
+
+	node_id = nla_get_u32(info->attrs[DRM_RAS_A_ERROR_COUNTER_ATTRS_NODE_ID]);
+
+	node = xa_load(&drm_ras_xa, node_id);
+	if (!node)
+		return -ENOENT;
+
+	for (error_id = max(node->error_counter_range.first, ctx->restart);
+	     error_id <= node->error_counter_range.last;
+	     error_id++) {
+		ret = get_node_error_counter(node_id, error_id,
+					     &error_name, &value);
+		/*
+		 * For a non-contiguous range, the driver returns -ENOENT to
+		 * indicate that this ID is skipped when listing all errors.
+		 */
+		if (ret == -ENOENT)
+			continue;
+		if (ret)
+			return ret;
+
+		hdr = genlmsg_put(skb, NETLINK_CB(cb->skb).portid,
+				  cb->nlh->nlmsg_seq,
+				  &drm_ras_nl_family, NLM_F_MULTI,
+				  DRM_RAS_CMD_GET_ERROR_COUNTERS);
+		if (!hdr) {
+			ret = -EMSGSIZE;
+			break;
+		}
+
+		ret = msg_reply_value(skb, error_id, error_name, value);
+		if (ret)
+			break;
+
+		genlmsg_end(skb, hdr);
+	}
+
+	if (ret == -EMSGSIZE) {
+		ctx->restart = error_id;
+		return skb->len;
+	}
+
+	return ret;
+}
+
+/**
+ * drm_ras_nl_query_error_counter_doit() - Query an error counter of a node
+ * @skb: Netlink message buffer
+ * @info: Generic Netlink info containing attributes of the request
+ *
+ * Extracts the node ID and error ID from the netlink attributes and
+ * retrieves the current value of the corresponding error counter. Sends the
+ * result back to the requesting user via the standard Genl reply.
+ *
+ * Return: 0 on success, or negative errno on failure.
+ */
+int drm_ras_nl_query_error_counter_doit(struct sk_buff *skb,
+					struct genl_info *info)
+{
+	u32 node_id, error_id;
+
+	if (!info->attrs ||
+	    !info->attrs[DRM_RAS_A_ERROR_COUNTER_ATTRS_NODE_ID] ||
+	    !info->attrs[DRM_RAS_A_ERROR_COUNTER_ATTRS_ERROR_ID])
+		return -EINVAL;
+
+	node_id = nla_get_u32(info->attrs[DRM_RAS_A_ERROR_COUNTER_ATTRS_NODE_ID]);
+	error_id = nla_get_u32(info->attrs[DRM_RAS_A_ERROR_COUNTER_ATTRS_ERROR_ID]);
+
+	return doit_reply_value(info, node_id, error_id);
+}
+
+/**
+ * drm_ras_node_register() - Register a new RAS node
+ * @node: Node structure to register
+ *
+ * Adds the given RAS node to the global node xarray and assigns it a unique
+ * ID. @node->device_name, @node->node_name and @node->type must be valid.
+ *
+ * Return: 0 on success, or negative errno on failure.
+ */
+int drm_ras_node_register(struct drm_ras_node *node)
+{
+	if (!node->device_name || !node->node_name)
+		return -EINVAL;
+
+	/* Currently, only Error Counter nodes are supported */
+	if (node->type != DRM_RAS_NODE_TYPE_ERROR_COUNTER)
+		return -EINVAL;
+
+	/* Mandatory entries for Error Counter Node */
+	if (node->type == DRM_RAS_NODE_TYPE_ERROR_COUNTER &&
+	    (!node->error_counter_range.last || !node->query_error_counter))
+		return -EINVAL;
+
+	return xa_alloc(&drm_ras_xa, &node->id, node, xa_limit_32b, GFP_KERNEL);
+}
+EXPORT_SYMBOL(drm_ras_node_register);
+
+/**
+ * drm_ras_node_unregister() - Unregister a previously registered node
+ * @node: Node structure to unregister
+ *
+ * Removes the given node from the global node xarray using its ID.
+ */
+void drm_ras_node_unregister(struct drm_ras_node *node)
+{
+	xa_erase(&drm_ras_xa, node->id);
+}
+EXPORT_SYMBOL(drm_ras_node_unregister);
diff --git a/drivers/gpu/drm/drm_ras_genl_family.c b/drivers/gpu/drm/drm_ras_genl_family.c
new file mode 100644
index 000000000000..2d818b8c3808
--- /dev/null
+++ b/drivers/gpu/drm/drm_ras_genl_family.c
@@ -0,0 +1,42 @@
+// SPDX-License-Identifier: MIT
+/*
+ * Copyright © 2025 Intel Corporation
+ */
+
+#include <drm/drm_ras_genl_family.h>
+#include <drm/drm_ras_nl.h>
+
+/* Track family registration so drm_core_exit() can be called at any time */
+static bool registered;
+
+/**
+ * drm_ras_genl_family_register() - Register drm-ras genl family
+ *
+ * Only to be called once, from drm_core_init().
+ */
+int drm_ras_genl_family_register(void)
+{
+	int ret;
+
+	registered = false;
+
+	ret = genl_register_family(&drm_ras_nl_family);
+	if (ret)
+		return ret;
+
+	registered = true;
+	return 0;
+}
+
+/**
+ * drm_ras_genl_family_unregister() - Unregister drm-ras genl family
+ *
+ * May be called at any moment during drm_drv_exit(), but only once.
+ */
+void drm_ras_genl_family_unregister(void)
+{
+	if (registered) {
+		genl_unregister_family(&drm_ras_nl_family);
+		registered = false;
+	}
+}
diff --git a/drivers/gpu/drm/drm_ras_nl.c b/drivers/gpu/drm/drm_ras_nl.c
new file mode 100644
index 000000000000..fcd1392410e4
--- /dev/null
+++ b/drivers/gpu/drm/drm_ras_nl.c
@@ -0,0 +1,54 @@
+// SPDX-License-Identifier: ((GPL-2.0 WITH Linux-syscall-note) OR BSD-3-Clause)
+/* Do not edit directly, auto-generated from: */
+/*	Documentation/netlink/specs/drm_ras.yaml */
+/* YNL-GEN kernel source */
+
+#include <net/netlink.h>
+#include <net/genetlink.h>
+
+#include <uapi/drm/drm_ras.h>
+#include <drm/drm_ras_nl.h>
+
+/* DRM_RAS_CMD_GET_ERROR_COUNTERS - dump */
+static const struct nla_policy drm_ras_get_error_counters_nl_policy[DRM_RAS_A_ERROR_COUNTER_ATTRS_NODE_ID + 1] = {
+	[DRM_RAS_A_ERROR_COUNTER_ATTRS_NODE_ID] = { .type = NLA_U32, },
+};
+
+/* DRM_RAS_CMD_QUERY_ERROR_COUNTER - do */
+static const struct nla_policy drm_ras_query_error_counter_nl_policy[DRM_RAS_A_ERROR_COUNTER_ATTRS_ERROR_ID + 1] = {
+	[DRM_RAS_A_ERROR_COUNTER_ATTRS_NODE_ID] = { .type = NLA_U32, },
+	[DRM_RAS_A_ERROR_COUNTER_ATTRS_ERROR_ID] = { .type = NLA_U32, },
+};
+
+/* Ops table for drm_ras */
+static const struct genl_split_ops drm_ras_nl_ops[] = {
+	{
+		.cmd	= DRM_RAS_CMD_LIST_NODES,
+		.dumpit	= drm_ras_nl_list_nodes_dumpit,
+		.flags	= GENL_ADMIN_PERM | GENL_CMD_CAP_DUMP,
+	},
+	{
+		.cmd		= DRM_RAS_CMD_GET_ERROR_COUNTERS,
+		.dumpit		= drm_ras_nl_get_error_counters_dumpit,
+		.policy		= drm_ras_get_error_counters_nl_policy,
+		.maxattr	= DRM_RAS_A_ERROR_COUNTER_ATTRS_NODE_ID,
+		.flags		= GENL_ADMIN_PERM | GENL_CMD_CAP_DUMP,
+	},
+	{
+		.cmd		= DRM_RAS_CMD_QUERY_ERROR_COUNTER,
+		.doit		= drm_ras_nl_query_error_counter_doit,
+		.policy		= drm_ras_query_error_counter_nl_policy,
+		.maxattr	= DRM_RAS_A_ERROR_COUNTER_ATTRS_ERROR_ID,
+		.flags		= GENL_ADMIN_PERM | GENL_CMD_CAP_DO,
+	},
+};
+
+struct genl_family drm_ras_nl_family __ro_after_init = {
+	.name		= DRM_RAS_FAMILY_NAME,
+	.version	= DRM_RAS_FAMILY_VERSION,
+	.netnsok	= true,
+	.parallel_ops	= true,
+	.module		= THIS_MODULE,
+	.split_ops	= drm_ras_nl_ops,
+	.n_split_ops	= ARRAY_SIZE(drm_ras_nl_ops),
+};
diff --git a/include/drm/drm_ras.h b/include/drm/drm_ras.h
new file mode 100644
index 000000000000..bba47a282ef8
--- /dev/null
+++ b/include/drm/drm_ras.h
@@ -0,0 +1,76 @@
+/* SPDX-License-Identifier: MIT */
+/*
+ * Copyright © 2025 Intel Corporation
+ */
+
+#ifndef __DRM_RAS_H__
+#define __DRM_RAS_H__
+
+#include "drm_ras_nl.h"
+
+/**
+ * struct drm_ras_node - A DRM RAS Node
+ */
+struct drm_ras_node {
+	/** @id: Unique identifier for the node. Dynamically assigned. */
+	u32 id;
+	/**
+	 * @device_name: Human-readable name of the device. Given by the driver.
+	 */
+	const char *device_name;
+	/** @node_name: Human-readable name of the node. Given by the driver. */
+	const char *node_name;
+	/** @type: Type of the node (enum drm_ras_node_type). */
+	enum drm_ras_node_type type;
+
+	/* Error-Counter Related Callback and Variables */
+
+	/** @error_counter_range: Range of valid Error IDs for this node. */
+	struct {
+		/** @first: First valid Error ID. */
+		u32 first;
+		/** @last: Last valid Error ID. Mandatory entry. */
+		u32 last;
+	} error_counter_range;
+
+	/**
+	 * @query_error_counter:
+	 *
+	 * This callback is used by drm-ras to query a specific error counter
+	 * supported by this node. drm-ras uses @error_counter_range for the
+	 * input check and to iterate over all counters.
+	 *
+	 * Driver should expect query_error_counter() to be called with
+	 * error_id from `error_counter_range.first` to
+	 * `error_counter_range.last`.
+	 *
+	 * The @query_error_counter is a mandatory callback for
+	 * error-counter nodes.
+	 *
+	 * Returns: 0 on success,
+	 *          -ENOENT when error_id is not supported as an indication that
+	 *                  drm_ras should silently skip this entry. Used for
+	 *                  supporting non-contiguous error ranges.
+	 *                  Driver is responsible for maintaining the list of
+	 *                  supported error IDs in the range of first to last.
+	 *          Other negative values on errors that should terminate the
+	 *          netlink query.
+	 */
+	int (*query_error_counter)(struct drm_ras_node *ep, u32 error_id,
+				   const char **name, u32 *val);
+
+	/** @priv: Driver private data */
+	void *priv;
+};
+
+struct drm_device;
+
+#if IS_ENABLED(CONFIG_DRM_RAS)
+int drm_ras_node_register(struct drm_ras_node *ep);
+void drm_ras_node_unregister(struct drm_ras_node *ep);
+#else
+static inline int drm_ras_node_register(struct drm_ras_node *ep) { return 0; }
+static inline void drm_ras_node_unregister(struct drm_ras_node *ep) { }
+#endif
+
+#endif
diff --git a/include/drm/drm_ras_genl_family.h b/include/drm/drm_ras_genl_family.h
new file mode 100644
index 000000000000..5931b53429f1
--- /dev/null
+++ b/include/drm/drm_ras_genl_family.h
@@ -0,0 +1,17 @@
+/* SPDX-License-Identifier: MIT */
+/*
+ * Copyright © 2025 Intel Corporation
+ */
+
+#ifndef __DRM_RAS_GENL_FAMILY_H__
+#define __DRM_RAS_GENL_FAMILY_H__
+
+#if IS_ENABLED(CONFIG_DRM_RAS)
+int drm_ras_genl_family_register(void);
+void drm_ras_genl_family_unregister(void);
+#else
+static inline int drm_ras_genl_family_register(void) { return 0; }
+static inline void drm_ras_genl_family_unregister(void) { }
+#endif
+
+#endif
diff --git a/include/drm/drm_ras_nl.h b/include/drm/drm_ras_nl.h
new file mode 100644
index 000000000000..9613b7d9ffdb
--- /dev/null
+++ b/include/drm/drm_ras_nl.h
@@ -0,0 +1,24 @@
+/* SPDX-License-Identifier: ((GPL-2.0 WITH Linux-syscall-note) OR BSD-3-Clause) */
+/* Do not edit directly, auto-generated from: */
+/*	Documentation/netlink/specs/drm_ras.yaml */
+/* YNL-GEN kernel header */
+
+#ifndef _LINUX_DRM_RAS_GEN_H
+#define _LINUX_DRM_RAS_GEN_H
+
+#include <net/netlink.h>
+#include <net/genetlink.h>
+
+#include <uapi/drm/drm_ras.h>
+#include <drm/drm_ras_nl.h>
+
+int drm_ras_nl_list_nodes_dumpit(struct sk_buff *skb,
+				 struct netlink_callback *cb);
+int drm_ras_nl_get_error_counters_dumpit(struct sk_buff *skb,
+					 struct netlink_callback *cb);
+int drm_ras_nl_query_error_counter_doit(struct sk_buff *skb,
+					struct genl_info *info);
+
+extern struct genl_family drm_ras_nl_family;
+
+#endif /* _LINUX_DRM_RAS_GEN_H */
diff --git a/include/uapi/drm/drm_ras.h b/include/uapi/drm/drm_ras.h
new file mode 100644
index 000000000000..3415ba345ac8
--- /dev/null
+++ b/include/uapi/drm/drm_ras.h
@@ -0,0 +1,49 @@
+/* SPDX-License-Identifier: ((GPL-2.0 WITH Linux-syscall-note) OR BSD-3-Clause) */
+/* Do not edit directly, auto-generated from: */
+/*	Documentation/netlink/specs/drm_ras.yaml */
+/* YNL-GEN uapi header */
+
+#ifndef _UAPI_LINUX_DRM_RAS_H
+#define _UAPI_LINUX_DRM_RAS_H
+
+#define DRM_RAS_FAMILY_NAME	"drm-ras"
+#define DRM_RAS_FAMILY_VERSION	1
+
+/*
+ * Type of the node. Currently, only error-counter nodes are supported, which
+ * expose reliability counters for a hardware/software component.
+ */
+enum drm_ras_node_type {
+	DRM_RAS_NODE_TYPE_ERROR_COUNTER = 1,
+};
+
+enum {
+	DRM_RAS_A_NODE_ATTRS_NODE_ID = 1,
+	DRM_RAS_A_NODE_ATTRS_DEVICE_NAME,
+	DRM_RAS_A_NODE_ATTRS_NODE_NAME,
+	DRM_RAS_A_NODE_ATTRS_NODE_TYPE,
+
+	__DRM_RAS_A_NODE_ATTRS_MAX,
+	DRM_RAS_A_NODE_ATTRS_MAX = (__DRM_RAS_A_NODE_ATTRS_MAX - 1)
+};
+
+enum {
+	DRM_RAS_A_ERROR_COUNTER_ATTRS_NODE_ID = 1,
+	DRM_RAS_A_ERROR_COUNTER_ATTRS_ERROR_ID,
+	DRM_RAS_A_ERROR_COUNTER_ATTRS_ERROR_NAME,
+	DRM_RAS_A_ERROR_COUNTER_ATTRS_ERROR_VALUE,
+
+	__DRM_RAS_A_ERROR_COUNTER_ATTRS_MAX,
+	DRM_RAS_A_ERROR_COUNTER_ATTRS_MAX = (__DRM_RAS_A_ERROR_COUNTER_ATTRS_MAX - 1)
+};
+
+enum {
+	DRM_RAS_CMD_LIST_NODES = 1,
+	DRM_RAS_CMD_GET_ERROR_COUNTERS,
+	DRM_RAS_CMD_QUERY_ERROR_COUNTER,
+
+	__DRM_RAS_CMD_MAX,
+	DRM_RAS_CMD_MAX = (__DRM_RAS_CMD_MAX - 1)
+};
+
+#endif /* _UAPI_LINUX_DRM_RAS_H */
-- 
2.51.0
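
For reference, the -ENOENT convention documented for query_error_counter above can be modeled in plain userspace C. This is only a sketch under stated assumptions: struct ras_node, sparse_query() and dump_counters() are hypothetical stand-ins for illustration, not the drm_ras code from the patch.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>
#include <stdint.h>

/* Userspace stand-in for struct drm_ras_node; only the fields used here. */
struct ras_node {
	uint32_t first, last;	/* inclusive range of error IDs */
	int (*query)(struct ras_node *node, uint32_t error_id,
		     const char **name, uint32_t *val);
};

/* Hypothetical driver callback: only even IDs exist, ID 6 fails hard. */
static int sparse_query(struct ras_node *node, uint32_t error_id,
			const char **name, uint32_t *val)
{
	(void)node;
	if (error_id == 6)
		return -EIO;
	if (error_id & 1)
		return -ENOENT;
	*name = "even-counter";
	*val = error_id * 10;
	return 0;
}

/*
 * Walk first..last the way a dump handler might: skip -ENOENT holes and
 * stop on any other error. Returns the number of counters reported, or
 * a negative errno.
 */
static int dump_counters(struct ras_node *node, uint32_t *sum)
{
	int reported = 0;

	assert(node && node->query && sum);
	*sum = 0;
	for (uint32_t id = node->first; id <= node->last; id++) {
		const char *name;
		uint32_t val;
		int ret = node->query(node, id, &name, &val);

		if (ret == -ENOENT)
			continue;	/* hole in a non-contiguous range */
		if (ret)
			return ret;	/* terminate the whole query */
		*sum += val;
		reported++;
	}
	return reported;
}
```

With a range of 0..5 the walk reports only the even IDs; extending the range to include ID 6 makes the whole walk fail with -EIO, which is the "terminate the netlink query" case.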



* [PATCH 2/2] drm/xe: Introduce the usage of drm_ras with supported HW errors
  2025-09-29 21:44 [PATCH 0/2] Introduce DRM_RAS using generic netlink for RAS Rodrigo Vivi
  2025-09-29 21:44 ` [PATCH 1/2] drm/ras: Introduce the DRM RAS infrastructure over generic netlink Rodrigo Vivi
@ 2025-09-29 21:44 ` Rodrigo Vivi
  2025-09-30  2:07   ` kernel test robot
  2025-10-02 20:38 ` [PATCH 0/2] Introduce DRM_RAS using generic netlink for RAS Zack McKevitt
  2025-10-28 19:13 ` DRM_RAS for CPER Error logging?! Rodrigo Vivi
  3 siblings, 1 reply; 22+ messages in thread
From: Rodrigo Vivi @ 2025-09-29 21:44 UTC (permalink / raw)
  To: dri-devel, intel-xe; +Cc: Rodrigo Vivi, Riana Tauro

All MTL+ devices support these correctable and non-fatal error
notifications over the IRQ. None of the currently supported platforms
supports error counters directly in the HW.

But since we already support the error interrupt for
these errors, let's incorporate the counters inside the driver
itself and start using the drm_ras generic netlink to report them.

Keep the CSC_work only for discrete devices.

Cc: Riana Tauro <riana.tauro@intel.com>
Signed-off-by: Rodrigo Vivi <rodrigo.vivi@intel.com>
---
 drivers/gpu/drm/xe/regs/xe_hw_error_regs.h |  22 +++
 drivers/gpu/drm/xe/xe_hw_error.c           | 155 ++++++++++++++++++++-
 2 files changed, 175 insertions(+), 2 deletions(-)

diff --git a/drivers/gpu/drm/xe/regs/xe_hw_error_regs.h b/drivers/gpu/drm/xe/regs/xe_hw_error_regs.h
index c146b9ef44eb..860fc3b8a3c4 100644
--- a/drivers/gpu/drm/xe/regs/xe_hw_error_regs.h
+++ b/drivers/gpu/drm/xe/regs/xe_hw_error_regs.h
@@ -16,5 +16,27 @@
 #define DEV_ERR_STAT_REG(x)			XE_REG(_PICK_EVEN((x), \
 								  DEV_ERR_STAT_CORRECTABLE, \
 								  DEV_ERR_STAT_NONFATAL))
+#define   XE_SGMI_DATA_PARITY_ERROR		BIT(25)
+#define   XE_MERT_ERROR				BIT(20)
 #define   XE_CSC_ERROR				BIT(17)
+#define   XE_SOC_ERROR				BIT(16)
+#define   XE_SGCI_DATA_PARITY_ERROR		BIT(13)
+#define   XE_SGUNIT_ERROR			BIT(12)
+#define   XE_SGLI_DATA_PARITY_ERROR		BIT(9)
+#define   XE_GSC_ERROR				BIT(8)
+#define   XE_SGDI_DATA_PARITY_ERROR		BIT(5)
+#define   XE_DISPLAY_ERROR			BIT(4)
+#define   XE_SGGI_DATA_PARITY_ERROR		BIT(1)
+#define   XE_GT_ERROR				BIT(0)
+
+#define DEV_ERR_STAT_NONFATAL_VALID_MASK \
+	(XE_SGMI_DATA_PARITY_ERROR | XE_MERT_ERROR | XE_CSC_ERROR | XE_SOC_ERROR | \
+	 XE_SGCI_DATA_PARITY_ERROR | XE_SGUNIT_ERROR | XE_SGLI_DATA_PARITY_ERROR | \
+	 XE_GSC_ERROR | XE_SGDI_DATA_PARITY_ERROR | XE_DISPLAY_ERROR |	\
+	 XE_SGGI_DATA_PARITY_ERROR | XE_GT_ERROR)
+
+#define DEV_ERR_STAT_CORRECTABLE_VALID_MASK \
+	(XE_CSC_ERROR | XE_SOC_ERROR | XE_SGUNIT_ERROR | XE_GSC_ERROR | \
+	 XE_DISPLAY_ERROR | XE_GT_ERROR)
+
 #endif
diff --git a/drivers/gpu/drm/xe/xe_hw_error.c b/drivers/gpu/drm/xe/xe_hw_error.c
index 8c65291f36fc..615d10cd83f0 100644
--- a/drivers/gpu/drm/xe/xe_hw_error.c
+++ b/drivers/gpu/drm/xe/xe_hw_error.c
@@ -3,7 +3,13 @@
  * Copyright © 2025 Intel Corporation
  */
 
+#include <linux/atomic.h>
 #include <linux/fault-inject.h>
+#include <linux/find.h>
+#include <linux/types.h>
+
+#include <drm/drm_managed.h>
+#include <drm/drm_ras.h>
 
 #include "regs/xe_gsc_regs.h"
 #include "regs/xe_hw_error_regs.h"
@@ -46,6 +52,93 @@ static const char *hw_error_to_str(const enum hardware_error hw_err)
 	}
 }
 
+struct error_info {
+	const char *name;
+	atomic64_t counter;
+};
+
+#define ERR_INFO(_bit, _name) \
+	[__ffs(_bit)] = { .name = _name, .counter = ATOMIC64_INIT(0) }
+
+static struct error_info dev_err_stat_nonfatal_reg[] = {
+	ERR_INFO(XE_GT_ERROR, "GT Error"),
+	ERR_INFO(XE_SGGI_DATA_PARITY_ERROR, "SGGI Data Parity Error"),
+	ERR_INFO(XE_DISPLAY_ERROR, "Display Error"),
+	ERR_INFO(XE_SGDI_DATA_PARITY_ERROR, "SGDI Data Parity Error"),
+	ERR_INFO(XE_GSC_ERROR, "GSC Error"),
+	ERR_INFO(XE_SGLI_DATA_PARITY_ERROR, "SGLI Data Parity Error"),
+	ERR_INFO(XE_SGUNIT_ERROR, "SG Unit Error"),
+	ERR_INFO(XE_SGCI_DATA_PARITY_ERROR, "SGCI Data Parity Error"),
+	ERR_INFO(XE_SOC_ERROR, "SoC Error"),
+	ERR_INFO(XE_CSC_ERROR, "CSC Error"),
+	ERR_INFO(XE_MERT_ERROR, "MERT Error"),
+	ERR_INFO(XE_SGMI_DATA_PARITY_ERROR, "SGMI Data Parity Error"),
+};
+
+static struct error_info dev_err_stat_correctable_reg[] = {
+	ERR_INFO(XE_GT_ERROR, "GT Error"),
+	ERR_INFO(XE_DISPLAY_ERROR, "Display Error"),
+	ERR_INFO(XE_GSC_ERROR, "GSC Error"),
+	ERR_INFO(XE_SGUNIT_ERROR, "SG Unit Error"),
+	ERR_INFO(XE_SOC_ERROR, "SoC Error"),
+	ERR_INFO(XE_CSC_ERROR, "CSC Error"),
+};
+
+static int hw_query_error_counter(struct error_info *error_list,
+				  u32 error_id, const char **name, u32 *val)
+{
+	*name = error_list[error_id].name;
+	*val = atomic64_read(&error_list[error_id].counter);
+
+	return 0;
+}
+
+static int query_error_counter_non_fatal(struct drm_ras_node *ep,
+					 u32 error_id,
+					 const char **name,
+					 u32 *val)
+{
+	if (error_id >= ARRAY_SIZE(dev_err_stat_nonfatal_reg))
+		return -EINVAL;
+
+	if (!(DEV_ERR_STAT_NONFATAL_VALID_MASK & BIT(error_id)) ||
+	    !dev_err_stat_nonfatal_reg[error_id].name)
+		return -ENOENT;
+
+	return hw_query_error_counter(dev_err_stat_nonfatal_reg,
+				      error_id, name, val);
+}
+
+static int query_error_counter_correctable(struct drm_ras_node *ep,
+					   u32 error_id,
+					   const char **name,
+					   u32 *val)
+{
+	if (error_id >= ARRAY_SIZE(dev_err_stat_correctable_reg))
+		return -EINVAL;
+
+	if (!(DEV_ERR_STAT_CORRECTABLE_VALID_MASK & BIT(error_id)) ||
+	    !dev_err_stat_correctable_reg[error_id].name)
+		return -ENOENT;
+
+	return hw_query_error_counter(dev_err_stat_correctable_reg,
+				      error_id, name, val);
+}
+
+static struct drm_ras_node node_non_fatal = {
+	.node_name = "non-fatal",
+	.type = DRM_RAS_NODE_TYPE_ERROR_COUNTER,
+	.error_counter_range.last = __ffs(XE_SGMI_DATA_PARITY_ERROR),
+	.query_error_counter = query_error_counter_non_fatal,
+};
+
+static struct drm_ras_node node_correctable = {
+	.node_name = "correctable",
+	.type = DRM_RAS_NODE_TYPE_ERROR_COUNTER,
+	.error_counter_range.last = __ffs(XE_CSC_ERROR),
+	.query_error_counter = query_error_counter_correctable,
+};
+
 static bool fault_inject_csc_hw_error(void)
 {
 	return IS_ENABLED(CONFIG_DEBUG_FS) && should_fail(&inject_csc_hw_error, 1);
@@ -97,6 +190,29 @@ static void csc_hw_error_handler(struct xe_tile *tile, const enum hardware_error
 	xe_mmio_write32(mmio, HEC_UNCORR_ERR_STATUS(base), err_src);
 }
 
+static void hw_error_counter(struct xe_device *xe,
+			     const enum hardware_error hw_err, const u32 err_src)
+{
+	struct error_info *err_info;
+	unsigned long err_bits = err_src;
+	unsigned long error;
+
+	if (hw_err == HARDWARE_ERROR_NONFATAL) {
+		err_info = dev_err_stat_nonfatal_reg;
+	} else if (hw_err == HARDWARE_ERROR_CORRECTABLE) {
+		err_info = dev_err_stat_correctable_reg;
+	} else {
+		drm_err_ratelimited(&xe->drm, HW_ERR
+				    "Error from non-supported source, err_src=0x%x\n",
+				    err_src);
+		return;
+	}
+
+	for_each_set_bit(error, &err_bits, 32) {
+		atomic64_inc(&err_info[error].counter);
+	}
+}
+
 static void hw_error_source_handler(struct xe_tile *tile, const enum hardware_error hw_err)
 {
 	const char *hw_err_str = hw_error_to_str(hw_err);
@@ -118,6 +234,8 @@ static void hw_error_source_handler(struct xe_tile *tile, const enum hardware_er
 	if (err_src & XE_CSC_ERROR)
 		csc_hw_error_handler(tile, hw_err);
 
+	hw_error_counter(xe, hw_err, err_src);
+
 	xe_mmio_write32(&tile->mmio, DEV_ERR_STAT_REG(hw_err), err_src);
 
 unlock:
@@ -162,6 +280,36 @@ static void process_hw_errors(struct xe_device *xe)
 	}
 }
 
+static void hw_error_counter_fini(struct drm_device *dev, void *res)
+{
+	drm_ras_node_unregister(&node_non_fatal);
+	drm_ras_node_unregister(&node_correctable);
+}
+
+static void hw_error_counter_init(struct xe_device *xe)
+{
+	struct pci_dev *pdev = to_pci_dev(xe->drm.dev);
+	char *name;
+
+	name = kasprintf(GFP_KERNEL, "%02x:%02x.%d",
+			 pdev->bus->number,
+			 PCI_SLOT(pdev->devfn),
+			 PCI_FUNC(pdev->devfn));
+	if (!name) {
+		drm_err(&xe->drm, "Failed to allocate memory for device name for drm_ras\n");
+		return;
+	}
+
+	node_non_fatal.device_name = name;
+	drm_ras_node_register(&node_non_fatal);
+
+	node_correctable.device_name = name;
+	drm_ras_node_register(&node_correctable);
+
+	if (drmm_add_action_or_reset(&xe->drm, hw_error_counter_fini, xe))
+		drm_err(&xe->drm, "Failed to add action for hw error counter fini\n");
+}
+
 /**
  * xe_hw_error_init - Initialize hw errors
  * @xe: xe device instance
@@ -173,10 +321,13 @@ void xe_hw_error_init(struct xe_device *xe)
 {
 	struct xe_tile *tile = xe_device_get_root_tile(xe);
 
-	if (!IS_DGFX(xe) || IS_SRIOV_VF(xe))
+	if (IS_SRIOV_VF(xe))
 		return;
 
-	INIT_WORK(&tile->csc_hw_error_work, csc_hw_error_work);
+	if (IS_DGFX(xe))
+		INIT_WORK(&tile->csc_hw_error_work, csc_hw_error_work);
 
 	process_hw_errors(xe);
+
+	hw_error_counter_init(xe);
 }
-- 
2.51.0
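
One detail that may be worth a second look in hw_error_counter(): the loop walks all 32 bits of err_src, while the correctable table only has entries up to bit 17 and the non-fatal one up to bit 25. Below is a minimal userspace sketch of per-bit counting that masks the source and bounds the index first. It assumes plain uint64_t counters instead of atomic64_t, and count_errors() plus the reduced table are hypothetical names for illustration only.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define BIT(n)		(1u << (n))
#define ARRAY_SIZE(a)	(sizeof(a) / sizeof((a)[0]))

/* Subset of the correctable bits from the patch, by bit position. */
#define GT_ERROR		BIT(0)
#define CSC_ERROR		BIT(17)
#define CORRECTABLE_VALID_MASK	(GT_ERROR | CSC_ERROR)

struct error_info {
	const char *name;
	uint64_t counter;	/* the driver uses atomic64_t */
};

static struct error_info correctable[] = {
	[0]  = { .name = "GT Error" },
	[17] = { .name = "CSC Error" },
};

/* Count only bits that are both valid and inside the table. */
static unsigned int count_errors(struct error_info *tbl, size_t n,
				 uint32_t valid_mask, uint32_t err_src)
{
	unsigned int counted = 0;

	assert(tbl);
	err_src &= valid_mask;
	for (unsigned int bit = 0; bit < 32 && bit < n; bit++) {
		if (err_src & BIT(bit)) {
			tbl[bit].counter++;
			counted++;
		}
	}
	return counted;
}
```

In this model a source value with bits 0, 17 and 25 set bumps only the two known counters; bit 25 is dropped by the mask instead of indexing past the 18-entry table.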



* Re: [PATCH 2/2] drm/xe: Introduce the usage of drm_ras with supported HW errors
  2025-09-29 21:44 ` [PATCH 2/2] drm/xe: Introduce the usage of drm_ras with supported HW errors Rodrigo Vivi
@ 2025-09-30  2:07   ` kernel test robot
  0 siblings, 0 replies; 22+ messages in thread
From: kernel test robot @ 2025-09-30  2:07 UTC (permalink / raw)
  To: Rodrigo Vivi, dri-devel, intel-xe
  Cc: oe-kbuild-all, Rodrigo Vivi, Riana Tauro

Hi Rodrigo,

kernel test robot noticed the following build errors:

[auto build test ERROR on drm-xe/drm-xe-next]
[also build test ERROR on drm/drm-next drm-exynos/exynos-drm-next drm-intel/for-linux-next drm-misc/drm-misc-next drm-tip/drm-tip next-20250929]
[cannot apply to drm-intel/for-linux-next-fixes linus/master v6.17]
[If your patch is applied to the wrong git tree, kindly drop us a note.
And when submitting patch, we suggest to use '--base' as documented in
https://git-scm.com/docs/git-format-patch#_base_tree_information]

url:    https://github.com/intel-lab-lkp/linux/commits/Rodrigo-Vivi/drm-ras-Introduce-the-DRM-RAS-infrastructure-over-generic-netlink/20250930-054726
base:   https://gitlab.freedesktop.org/drm/xe/kernel.git drm-xe-next
patch link:    https://lore.kernel.org/r/20250929214415.326414-6-rodrigo.vivi%40intel.com
patch subject: [PATCH 2/2] drm/xe: Introduce the usage of drm_ras with supported HW errors
config: parisc-randconfig-002-20250930 (https://download.01.org/0day-ci/archive/20250930/202509300919.eiP7GKSP-lkp@intel.com/config)
compiler: hppa-linux-gcc (GCC) 9.5.0
reproduce (this is a W=1 build): (https://download.01.org/0day-ci/archive/20250930/202509300919.eiP7GKSP-lkp@intel.com/reproduce)

If you fix the issue in a separate patch/commit (i.e. not just a new version of
the same patch/commit), kindly add following tags
| Reported-by: kernel test robot <lkp@intel.com>
| Closes: https://lore.kernel.org/oe-kbuild-all/202509300919.eiP7GKSP-lkp@intel.com/

All errors (new ones prefixed by >>):

   In file included from drivers/gpu/drm/xe/xe_hw_error.c:8:
>> include/linux/find.h:6:2: error: #error only <linux/bitmap.h> can be included directly
       6 | #error only <linux/bitmap.h> can be included directly
         |  ^~~~~
   drivers/gpu/drm/xe/xe_hw_error.c:61:3: error: nonconstant array index in initializer
      61 |  [__ffs(_bit)] = { .name = _name, .counter = ATOMIC64_INIT(0) }
         |   ^~~~~
   drivers/gpu/drm/xe/xe_hw_error.c:64:2: note: in expansion of macro 'ERR_INFO'
      64 |  ERR_INFO(XE_GT_ERROR, "GT Error"),
         |  ^~~~~~~~
   drivers/gpu/drm/xe/xe_hw_error.c:61:3: note: (near initialization for 'dev_err_stat_nonfatal_reg')
      61 |  [__ffs(_bit)] = { .name = _name, .counter = ATOMIC64_INIT(0) }
         |   ^~~~~
   drivers/gpu/drm/xe/xe_hw_error.c:64:2: note: in expansion of macro 'ERR_INFO'
      64 |  ERR_INFO(XE_GT_ERROR, "GT Error"),
         |  ^~~~~~~~
   drivers/gpu/drm/xe/xe_hw_error.c:61:3: error: nonconstant array index in initializer
      61 |  [__ffs(_bit)] = { .name = _name, .counter = ATOMIC64_INIT(0) }
         |   ^~~~~
   drivers/gpu/drm/xe/xe_hw_error.c:65:2: note: in expansion of macro 'ERR_INFO'
      65 |  ERR_INFO(XE_SGGI_DATA_PARITY_ERROR, "SGGI Data Parity Error"),
         |  ^~~~~~~~
   drivers/gpu/drm/xe/xe_hw_error.c:61:3: note: (near initialization for 'dev_err_stat_nonfatal_reg')
      61 |  [__ffs(_bit)] = { .name = _name, .counter = ATOMIC64_INIT(0) }
         |   ^~~~~
   drivers/gpu/drm/xe/xe_hw_error.c:65:2: note: in expansion of macro 'ERR_INFO'
      65 |  ERR_INFO(XE_SGGI_DATA_PARITY_ERROR, "SGGI Data Parity Error"),
         |  ^~~~~~~~
   drivers/gpu/drm/xe/xe_hw_error.c:61:3: error: nonconstant array index in initializer
      61 |  [__ffs(_bit)] = { .name = _name, .counter = ATOMIC64_INIT(0) }
         |   ^~~~~
   drivers/gpu/drm/xe/xe_hw_error.c:66:2: note: in expansion of macro 'ERR_INFO'
      66 |  ERR_INFO(XE_DISPLAY_ERROR, "Display Error"),
         |  ^~~~~~~~
   drivers/gpu/drm/xe/xe_hw_error.c:61:3: note: (near initialization for 'dev_err_stat_nonfatal_reg')
      61 |  [__ffs(_bit)] = { .name = _name, .counter = ATOMIC64_INIT(0) }
         |   ^~~~~
   drivers/gpu/drm/xe/xe_hw_error.c:66:2: note: in expansion of macro 'ERR_INFO'
      66 |  ERR_INFO(XE_DISPLAY_ERROR, "Display Error"),
         |  ^~~~~~~~
   drivers/gpu/drm/xe/xe_hw_error.c:61:3: error: nonconstant array index in initializer
      61 |  [__ffs(_bit)] = { .name = _name, .counter = ATOMIC64_INIT(0) }
         |   ^~~~~
   drivers/gpu/drm/xe/xe_hw_error.c:67:2: note: in expansion of macro 'ERR_INFO'
      67 |  ERR_INFO(XE_SGDI_DATA_PARITY_ERROR, "SGDI Data Parity Error"),
         |  ^~~~~~~~
   drivers/gpu/drm/xe/xe_hw_error.c:61:3: note: (near initialization for 'dev_err_stat_nonfatal_reg')
      61 |  [__ffs(_bit)] = { .name = _name, .counter = ATOMIC64_INIT(0) }
         |   ^~~~~
   drivers/gpu/drm/xe/xe_hw_error.c:67:2: note: in expansion of macro 'ERR_INFO'
      67 |  ERR_INFO(XE_SGDI_DATA_PARITY_ERROR, "SGDI Data Parity Error"),
         |  ^~~~~~~~
   drivers/gpu/drm/xe/xe_hw_error.c:61:3: error: nonconstant array index in initializer
      61 |  [__ffs(_bit)] = { .name = _name, .counter = ATOMIC64_INIT(0) }
         |   ^~~~~
   drivers/gpu/drm/xe/xe_hw_error.c:68:2: note: in expansion of macro 'ERR_INFO'
      68 |  ERR_INFO(XE_GSC_ERROR, "GSC Error"),
         |  ^~~~~~~~
   drivers/gpu/drm/xe/xe_hw_error.c:61:3: note: (near initialization for 'dev_err_stat_nonfatal_reg')
      61 |  [__ffs(_bit)] = { .name = _name, .counter = ATOMIC64_INIT(0) }
         |   ^~~~~
   drivers/gpu/drm/xe/xe_hw_error.c:68:2: note: in expansion of macro 'ERR_INFO'
      68 |  ERR_INFO(XE_GSC_ERROR, "GSC Error"),
         |  ^~~~~~~~
   drivers/gpu/drm/xe/xe_hw_error.c:61:3: error: nonconstant array index in initializer
      61 |  [__ffs(_bit)] = { .name = _name, .counter = ATOMIC64_INIT(0) }
         |   ^~~~~
   drivers/gpu/drm/xe/xe_hw_error.c:69:2: note: in expansion of macro 'ERR_INFO'
      69 |  ERR_INFO(XE_SGLI_DATA_PARITY_ERROR, "SGLI Data Parity Error"),
         |  ^~~~~~~~
   drivers/gpu/drm/xe/xe_hw_error.c:61:3: note: (near initialization for 'dev_err_stat_nonfatal_reg')
      61 |  [__ffs(_bit)] = { .name = _name, .counter = ATOMIC64_INIT(0) }
         |   ^~~~~
   drivers/gpu/drm/xe/xe_hw_error.c:69:2: note: in expansion of macro 'ERR_INFO'
      69 |  ERR_INFO(XE_SGLI_DATA_PARITY_ERROR, "SGLI Data Parity Error"),
         |  ^~~~~~~~
   drivers/gpu/drm/xe/xe_hw_error.c:61:3: error: nonconstant array index in initializer
      61 |  [__ffs(_bit)] = { .name = _name, .counter = ATOMIC64_INIT(0) }
         |   ^~~~~
   drivers/gpu/drm/xe/xe_hw_error.c:70:2: note: in expansion of macro 'ERR_INFO'
      70 |  ERR_INFO(XE_SGUNIT_ERROR, "SG Unit Error"),
         |  ^~~~~~~~
   drivers/gpu/drm/xe/xe_hw_error.c:61:3: note: (near initialization for 'dev_err_stat_nonfatal_reg')
      61 |  [__ffs(_bit)] = { .name = _name, .counter = ATOMIC64_INIT(0) }
         |   ^~~~~
   drivers/gpu/drm/xe/xe_hw_error.c:70:2: note: in expansion of macro 'ERR_INFO'
      70 |  ERR_INFO(XE_SGUNIT_ERROR, "SG Unit Error"),
         |  ^~~~~~~~
   drivers/gpu/drm/xe/xe_hw_error.c:61:3: error: nonconstant array index in initializer
      61 |  [__ffs(_bit)] = { .name = _name, .counter = ATOMIC64_INIT(0) }
         |   ^~~~~
   drivers/gpu/drm/xe/xe_hw_error.c:71:2: note: in expansion of macro 'ERR_INFO'
      71 |  ERR_INFO(XE_SGCI_DATA_PARITY_ERROR, "SGCI Data Parity Error"),
         |  ^~~~~~~~
   drivers/gpu/drm/xe/xe_hw_error.c:61:3: note: (near initialization for 'dev_err_stat_nonfatal_reg')
      61 |  [__ffs(_bit)] = { .name = _name, .counter = ATOMIC64_INIT(0) }
         |   ^~~~~
   drivers/gpu/drm/xe/xe_hw_error.c:71:2: note: in expansion of macro 'ERR_INFO'
      71 |  ERR_INFO(XE_SGCI_DATA_PARITY_ERROR, "SGCI Data Parity Error"),
         |  ^~~~~~~~
   drivers/gpu/drm/xe/xe_hw_error.c:61:3: error: nonconstant array index in initializer
      61 |  [__ffs(_bit)] = { .name = _name, .counter = ATOMIC64_INIT(0) }


vim +6 include/linux/find.h

47d8c15615c0a2 Yury Norov 2021-08-14  4  
47d8c15615c0a2 Yury Norov 2021-08-14  5  #ifndef __LINUX_BITMAP_H
47d8c15615c0a2 Yury Norov 2021-08-14 @6  #error only <linux/bitmap.h> can be included directly
47d8c15615c0a2 Yury Norov 2021-08-14  7  #endif
47d8c15615c0a2 Yury Norov 2021-08-14  8  
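
The first error is addressed by including <linux/bitmap.h> instead of <linux/find.h>, as the message itself says. For the "nonconstant array index" errors, one possible approach is to use the bit position itself as the designator and derive the masks from it. The sketch below is plain userspace C under that assumption; the *_BIT names, err_table, err_name() and count_error() are hypothetical and do not come from the patch.

```c
#include <assert.h>
#include <stddef.h>

#define BIT(n)	(1u << (n))

/* Bit positions as integer constants; the values mirror the patch. */
enum {
	GT_ERROR_BIT		= 0,
	DISPLAY_ERROR_BIT	= 4,
	GSC_ERROR_BIT		= 8,
};

/* Masks are still available, derived from the positions. */
#define GT_ERROR	BIT(GT_ERROR_BIT)
#define DISPLAY_ERROR	BIT(DISPLAY_ERROR_BIT)
#define GSC_ERROR	BIT(GSC_ERROR_BIT)

struct error_info {
	const char *name;
	unsigned long long counter;
};

/* The designator is an enum constant, so it is a constant expression. */
#define ERR_INFO(_bit, _name)	[_bit] = { .name = _name }

static struct error_info err_table[] = {
	ERR_INFO(GT_ERROR_BIT, "GT Error"),
	ERR_INFO(DISPLAY_ERROR_BIT, "Display Error"),
	ERR_INFO(GSC_ERROR_BIT, "GSC Error"),
};

/* Return the name for a bit, or NULL for holes and out-of-range bits. */
static const char *err_name(unsigned int bit)
{
	if (bit >= sizeof(err_table) / sizeof(err_table[0]))
		return NULL;
	return err_table[bit].name;
}

/* Bump the counter of a named bit. */
static void count_error(unsigned int bit)
{
	assert(err_name(bit));	/* caller must pass a named bit */
	err_table[bit].counter++;
}
```

The table keeps the same sparse layout as in the patch (unnamed slots stay zeroed), so a lookup by bit position and the -ENOENT check on a NULL name would work unchanged.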

-- 
0-DAY CI Kernel Test Service
https://github.com/intel/lkp-tests/wiki


* Re: [PATCH 0/2] Introduce DRM_RAS using generic netlink for RAS
  2025-09-29 21:44 [PATCH 0/2] Introduce DRM_RAS using generic netlink for RAS Rodrigo Vivi
  2025-09-29 21:44 ` [PATCH 1/2] drm/ras: Introduce the DRM RAS infrastructure over generic netlink Rodrigo Vivi
  2025-09-29 21:44 ` [PATCH 2/2] drm/xe: Introduce the usage of drm_ras with supported HW errors Rodrigo Vivi
@ 2025-10-02 20:38 ` Zack McKevitt
  2025-10-28 19:14   ` Rodrigo Vivi
  2025-11-06 13:42   ` Rodrigo Vivi
  2025-10-28 19:13 ` DRM_RAS for CPER Error logging?! Rodrigo Vivi
  3 siblings, 2 replies; 22+ messages in thread
From: Zack McKevitt @ 2025-10-02 20:38 UTC (permalink / raw)
  To: Rodrigo Vivi, dri-devel, intel-xe
  Cc: Hawking Zhang, Alex Deucher, Lukas Wunner, Dave Airlie,
	Simona Vetter, Aravind Iddamsetty, Joonas Lahtinen

I think this looks good, adding telemetry functionality as a node type 
and in the yaml spec looks straightforward (despite some potential 
naming awkwardness with the RAS module). Thanks for adding this.

Have you considered how this might work for containerized workloads? 
Specifically, I think it would be best if the underlying drm_ras nodes 
are only accessible for containerized workloads where the device has 
been explicitly passed in. Do you know if this is handled automatically 
with the existing netlink implementation? I imagine that this would be 
of interest to the broader community outside of Qualcomm as well.

> Also, it is worth to mention that we have a in-tree pyynl/cli.py tool that entirely
> exercises this new API, hence I hope this can be the reference code for the uAPI
> usage, while we continue with the plan of introducing IGT tests and tools for this
> and adjusting the internal vendor tools to open with open source developments and
> changing them to support these flows.

I think it would be nice to see some accompanying userspace code that 
makes use of this implementation to have as a reference if at all possible.

As a side note, I will be on vacation for a couple of weeks as of this 
weekend and my response time will be affected.

Thanks,

Zack


* DRM_RAS for CPER Error logging?!
  2025-09-29 21:44 [PATCH 0/2] Introduce DRM_RAS using generic netlink for RAS Rodrigo Vivi
                   ` (2 preceding siblings ...)
  2025-10-02 20:38 ` [PATCH 0/2] Introduce DRM_RAS using generic netlink for RAS Zack McKevitt
@ 2025-10-28 19:13 ` Rodrigo Vivi
  2025-10-29  2:00   ` Zhang, Hawking
  2025-10-30 14:47   ` Rodrigo Vivi
  3 siblings, 2 replies; 22+ messages in thread
From: Rodrigo Vivi @ 2025-10-28 19:13 UTC (permalink / raw)
  To: dri-devel, intel-xe, Dave Airlie, Joonas Lahtinen, Simona Vetter,
	Hawking Zhang, Alex Deucher, Zack McKevitt, Lukas Wunner,
	Aravind Iddamsetty
  Cc: Hawking Zhang, Alex Deucher, Zack McKevitt, Lukas Wunner,
	Dave Airlie, Simona Vetter, Aravind Iddamsetty, Joonas Lahtinen

On Mon, Sep 29, 2025 at 05:44:12PM -0400, Rodrigo Vivi wrote:

Hey Dave, Sima, AMD folks, Qualcomm folks,

I have a key question for you below.

> This work is a continuation of the great work started by Aravind ([1] and [2])
> in order to fulfill the RAS requirements and proposal as previously discussed
> and agreed in the Linux Plumbers accelerator's bof of 2022 [3].
> 
> [1]: https://lore.kernel.org/dri-devel/20250730064956.1385855-1-aravind.iddamsetty@linux.intel.com/
> [2]: https://lore.kernel.org/all/4cbdfcc5-5020-a942-740e-a602d4c00cc2@linux.intel.com/
> [3]: https://airlied.blogspot.com/2022/09/accelerators-bof-outcomes-summary.html
> 
> During the past review round, Lukas pointed out that netlink had evolved
> in parallel during these years and that now, any new usage of netlink families
> would require the usage of the YAML description and scripts.
> 
> With this new requirement in place, the family name is hardcoded in the yaml file,
> so we are forced to have a single family name for the entire drm, and then we now
> we are forced to have a registration.
> 
> So, while doing the registration, we now created the concept of drm-ras-node.
> For now the only node type supported is the agreed error-counter. But that could
> be expanded for other cases like telemetry, requested by Zack for the qualcomm accel
> driver.
> 
> In this first version, only querying counter is supported. But also this is expandable
> to future introduction of multicast notification and also clearing the counters.
> 
> This design with multiple nodes per device is already flexible enough for driver
> to decide if it wants to handle error per device, or per IP block, or per error
> category. I believe this fully attend to the requested AMD feedback in the earlier
> reviews.
> 
> So, my proposal is to start simple with this case as is, and then iterate over
> with the drm-ras in tree so we evolve together according to various driver's RAS
> needs.
> 
> I have provided a documentation and the first Xe implementation of the counter
> as reference.
> 
> Also, it is worth to mention that we have a in-tree pyynl/cli.py tool that entirely
> exercises this new API, hence I hope this can be the reference code for the uAPI
> usage, while we continue with the plan of introducing IGT tests and tools for this
> and adjusting the internal vendor tools to open with open source developments and
> changing them to support these flows.
> 
> Example on MTL:
> 
> $ sudo ./tools/net/ynl/pyynl/cli.py \
>   --spec Documentation/netlink/specs/drm_ras.yaml \
>   --dump list-nodes
> [{'device-name': '00:02.0',
>   'node-id': 0,
>   'node-name': 'non-fatal',
>   'node-type': 'error-counter'},
>  {'device-name': '00:02.0',
>   'node-id': 1,
>   'node-name': 'correctable',
>   'node-type': 'error-counter'}]

As you can see in the drm-ras patch, we now have only a single family called
'drm-ras'. With that, we have to register entry points, called 'nodes',
and for now only one type exists: 'error-counter', as I believe was agreed
in the Linux Plumbers accelerator's bof of 2022 [3].

Zack already indicated that for Qualcomm he doesn't need the error counters,
but another type, perhaps telemetry.

I need your feedback and input on yet another case here that goes side
by side with error-counters: Error logging.

One of the RAS requirements that we have is to emit CPER logs in certain
cases. AMD currently uses debugfs to print the CPER entries that
accumulate in a ring buffer (iiuc).

Some folks are asking us to emit the CPER records through tracefs, because
debugfs might not be available in some enterprise production images.

However, there is a concern with using tracefs for the error-logging case:
tracefs has no active query path. If the user needs to poll for the latest
CPER records, they would need to piggyback on some other API that forces
the emit-trace(cper).

I believe the cleanest way is to add another drm-ras node type, named
'error-logging', with a single query-logs operation that dumps the
ring buffer holding the latest known CPER records. Is this acceptable?
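
To make the intended semantics concrete, here is a minimal sketch in Python,
purely for illustration. It models a fixed-size ring buffer of CPER records
that a dump-style query returns in full. The class and method names
(ErrorLogNode, record, query_logs) and the capacity are hypothetical and are
not part of the posted patches.

```python
from collections import deque


class ErrorLogNode:
    """Hypothetical model of an 'error-logging' node backed by a ring buffer."""

    def __init__(self, name, capacity=4):
        self.name = name
        # deque(maxlen=N) drops the oldest entry once N is exceeded,
        # which is the ring-buffer behaviour described above.
        self._records = deque(maxlen=capacity)

    def record(self, fru, cper):
        """Error path: store one CPER blob together with its FRU string."""
        self._records.append({"FRU": fru, "CPER": bytes(cper)})

    def query_logs(self):
        """Dump-style query: snapshot of the retained records, oldest first.
        The records are not consumed, so the query can be repeated."""
        return list(self._records)


node = ErrorLogNode("correctable", capacity=2)
for i in range(3):
    node.record("device 00:02.0", b"CPER" + bytes([i]))

logs = node.query_logs()
print(len(logs), logs[0]["CPER"], logs[-1]["CPER"])
```

The point of the sketch is the contrast with a trace event: the query can be
issued at any time and returns whatever the buffer still retains, with no need
to trigger an emit from some other API.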

AMD folks, would you consider this to replace the current debugfs you
have?

Please let me know your thoughts.

We don't have a working example yet, but it would look something like this:

$ sudo ./tools/net/ynl/pyynl/cli.py \
  --spec Documentation/netlink/specs/drm_ras.yaml \
  --dump list-nodes
[{'device-name': '00:02.0',
  'node-id': 0,
  'node-name': 'non-fatal',
  'node-type': 'error-counter'},
 {'device-name': '00:02.0',
  'node-id': 1,
  'node-name': 'correctable',
  'node-type': 'error-counter'},
 {'device-name': '00:02.0',
  'node-id': 2,
  'node-name': 'non-fatal',
  'node-type': 'error-logging'},
 {'device-name': '00:02.0',
  'node-id': 3,
  'node-name': 'correctable',
  'node-type': 'error-logging'}]

$ sudo ./tools/net/ynl/pyynl/cli.py \
   --spec Documentation/netlink/specs/drm_ras.yaml \
   --dump get-logs --json '{"node-id":3}'
[{'FRU': 'String with device info', 'CPER': !@#$#!@#$},
{'FRU': 'String with device info', 'CPER': !@#$#!@#$},
{'FRU': 'String with device info', 'CPER': !@#$#!@#$},
{'FRU': 'String with device info', 'CPER': !@#$#!@#$},
{'FRU': 'String with device info', 'CPER': !@#$#!@#$},
]

Of course, the details of the error-logging fields, along with the CPER
binary format, are yet to be defined.

Oh, and the node names and the split between them are device specific. The
infra is flexible enough: each driver can do whatever makes sense for its
device.
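
Since names and split are device specific, userspace would have to discover
node ids at runtime instead of hard-coding them. A minimal Python sketch of
that lookup, assuming the list-nodes output shape shown above; the helper name
nodes_by_type is hypothetical, and the 'error-logging' entries are the proposed
node type, which does not exist yet.

```python
from collections import defaultdict

# Node list in the shape printed by `--dump list-nodes`. The two
# 'error-logging' entries are the proposed, not yet existing, node type.
nodes = [
    {"device-name": "00:02.0", "node-id": 0,
     "node-name": "non-fatal", "node-type": "error-counter"},
    {"device-name": "00:02.0", "node-id": 1,
     "node-name": "correctable", "node-type": "error-counter"},
    {"device-name": "00:02.0", "node-id": 2,
     "node-name": "non-fatal", "node-type": "error-logging"},
    {"device-name": "00:02.0", "node-id": 3,
     "node-name": "correctable", "node-type": "error-logging"},
]


def nodes_by_type(nodes):
    """Map (device-name, node-type) to a {node-name: node-id} dictionary."""
    index = defaultdict(dict)
    for node in nodes:
        key = (node["device-name"], node["node-type"])
        index[key][node["node-name"]] = node["node-id"]
    return dict(index)


index = nodes_by_type(nodes)
print(index[("00:02.0", "error-logging")])
```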

Any feedback or comment is really appreciated.

Thanks in advance,
Rodrigo.

> 
> $ sudo ./tools/net/ynl/pyynl/cli.py \
>   --spec Documentation/netlink/specs/drm_ras.yaml \
>   --dump get-error-counters --json '{"node-id":1}'
> [{'error-id': 0, 'error-name': 'GT Error', 'error-value': 0},
>  {'error-id': 4, 'error-name': 'Display Error', 'error-value': 0},
>  {'error-id': 8, 'error-name': 'GSC Error', 'error-value': 0},
>  {'error-id': 12, 'error-name': 'SG Unit Error', 'error-value': 0},
>  {'error-id': 16, 'error-name': 'SoC Error', 'error-value': 0},
>  {'error-id': 17, 'error-name': 'CSC Error', 'error-value': 0}]
> 
> $ sudo ./tools/net/ynl/pyynl/cli.py \
>   --spec Documentation/netlink/specs/drm_ras.yaml \
>   --do query-error-counter --json '{"node-id": 0, "error-id": 12}'
> {'error-id': 12, 'error-name': 'SG Unit Error', 'error-value': 0}
> 
> $ sudo ./tools/net/ynl/pyynl/cli.py \
>   --spec Documentation/netlink/specs/drm_ras.yaml \
>   --do query-error-counter --json '{"node-id": 1, "error-id": 16}'
> {'error-id': 16, 'error-name': 'SoC Error', 'error-value': 0}
> 
> Thanks,
> Rodrigo.
> 
> Cc: Hawking Zhang <Hawking.Zhang@amd.com>
> Cc: Alex Deucher <alexander.deucher@amd.com>
> Cc: Zack McKevitt <zachary.mckevitt@oss.qualcomm.com>
> Cc: Lukas Wunner <lukas@wunner.de>
> Cc: Dave Airlie <airlied@gmail.com>
> Cc: Simona Vetter <simona.vetter@ffwll.ch>
> Cc: Aravind Iddamsetty <aravind.iddamsetty@linux.intel.com>
> Cc: Joonas Lahtinen <joonas.lahtinen@linux.intel.com>
> Signed-off-by: Rodrigo Vivi <rodrigo.vivi@intel.com>
> 
> Rodrigo Vivi (2):
>   drm/ras: Introduce the DRM RAS infrastructure over generic netlink
>   drm/xe: Introduce the usage of drm_ras with supported HW errors
> 
>  Documentation/gpu/drm-ras.rst              | 109 +++++++
>  Documentation/netlink/specs/drm_ras.yaml   | 130 ++++++++
>  drivers/gpu/drm/Kconfig                    |   9 +
>  drivers/gpu/drm/Makefile                   |   1 +
>  drivers/gpu/drm/drm_drv.c                  |   6 +
>  drivers/gpu/drm/drm_ras.c                  | 357 +++++++++++++++++++++
>  drivers/gpu/drm/drm_ras_genl_family.c      |  42 +++
>  drivers/gpu/drm/drm_ras_nl.c               |  54 ++++
>  drivers/gpu/drm/xe/regs/xe_hw_error_regs.h |  22 ++
>  drivers/gpu/drm/xe/xe_hw_error.c           | 155 ++++++++-
>  include/drm/drm_ras.h                      |  76 +++++
>  include/drm/drm_ras_genl_family.h          |  17 +
>  include/drm/drm_ras_nl.h                   |  24 ++
>  include/uapi/drm/drm_ras.h                 |  49 +++
>  14 files changed, 1049 insertions(+), 2 deletions(-)
>  create mode 100644 Documentation/gpu/drm-ras.rst
>  create mode 100644 Documentation/netlink/specs/drm_ras.yaml
>  create mode 100644 drivers/gpu/drm/drm_ras.c
>  create mode 100644 drivers/gpu/drm/drm_ras_genl_family.c
>  create mode 100644 drivers/gpu/drm/drm_ras_nl.c
>  create mode 100644 include/drm/drm_ras.h
>  create mode 100644 include/drm/drm_ras_genl_family.h
>  create mode 100644 include/drm/drm_ras_nl.h
>  create mode 100644 include/uapi/drm/drm_ras.h
> 
> -- 
> 2.51.0
> 

* Re: [PATCH 0/2] Introduce DRM_RAS using generic netlink for RAS
  2025-10-02 20:38 ` [PATCH 0/2] Introduce DRM_RAS using generic netlink for RAS Zack McKevitt
@ 2025-10-28 19:14   ` Rodrigo Vivi
  2025-11-06 13:42   ` Rodrigo Vivi
  1 sibling, 0 replies; 22+ messages in thread
From: Rodrigo Vivi @ 2025-10-28 19:14 UTC (permalink / raw)
  To: Zack McKevitt
  Cc: dri-devel, intel-xe, Hawking Zhang, Alex Deucher, Lukas Wunner,
	Dave Airlie, Simona Vetter, Aravind Iddamsetty, Joonas Lahtinen

On Thu, Oct 02, 2025 at 02:38:47PM -0600, Zack McKevitt wrote:
> I think this looks good, adding telemetry functionality as a node type and
> in the yaml spec looks straightforward (despite some potential naming
> awkwardness with the RAS module). Thanks for adding this.
> 
> Have you considered how this might work for containerized workloads?
> Specifically, I think it would be best if the underlying drm_ras nodes are
> only accessible for containerized workloads where the device has been
> explicitly passed in. Do you know if this is handled automatically with the
> existing netlink implementation? I imagine that this would be of interest to
> the broader community outside of Qualcomm as well.
> 
> > Also, it is worth to mention that we have a in-tree pyynl/cli.py tool that entirely
> > exercises this new API, hence I hope this can be the reference code for the uAPI
> > usage, while we continue with the plan of introducing IGT tests and tools for this
> > and adjusting the internal vendor tools to open with open source developments and
> > changing them to support these flows.
> 
> I think it would be nice to see some accompanying userspace code that makes
> use of this implementation to have as a reference if at all possible.

Yes, we are going to provide a true userspace IGT tool that exercises the API
directly, instead of relying only on the in-tree pyynl/cli.py.

Development is in progress; sorry for the delay here.
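
In the meantime, one stopgap for scripting is a thin wrapper around the same
cli.py invocations shown in the cover letter. The Python sketch below only
builds the argument list; the helper name ynl_cli_argv is hypothetical, and the
flags and paths are the ones used in the examples in this thread.

```python
import json


def ynl_cli_argv(op, attrs=None, dump=False,
                 cli="./tools/net/ynl/pyynl/cli.py",
                 spec="Documentation/netlink/specs/drm_ras.yaml"):
    """Build the argv for one cli.py invocation (it is not executed here)."""
    argv = [cli, "--spec", spec, "--dump" if dump else "--do", op]
    if attrs is not None:
        # cli.py takes the request attributes as a JSON object.
        argv += ["--json", json.dumps(attrs)]
    return argv


print(ynl_cli_argv("list-nodes", dump=True))
print(ynl_cli_argv("query-error-counter", {"node-id": 0, "error-id": 12}))
```

The resulting list can be handed to subprocess.run() from the kernel tree
root; that needs root privileges and a kernel with the drm-ras patches applied.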

> 
> As a side note, I will be on vacation for a couple of weeks as of this
> weekend and my response time will be affected.
> 
> Thanks,
> 
> Zack

* RE: DRM_RAS for CPER Error logging?!
  2025-10-28 19:13 ` DRM_RAS for CPER Error logging?! Rodrigo Vivi
@ 2025-10-29  2:00   ` Zhang, Hawking
  2025-11-06 13:16     ` Rodrigo Vivi
  2025-10-30 14:47   ` Rodrigo Vivi
  1 sibling, 1 reply; 22+ messages in thread
From: Zhang, Hawking @ 2025-10-29  2:00 UTC (permalink / raw)
  To: Rodrigo Vivi, dri-devel@lists.freedesktop.org,
	intel-xe@lists.freedesktop.org, Dave Airlie, Joonas Lahtinen,
	Simona Vetter, Deucher, Alexander, Zack McKevitt, Lukas Wunner,
	Aravind Iddamsetty, Zhou1, Tao, Liu, Xiang(Dean)
  Cc: Deucher, Alexander, Zack McKevitt, Lukas Wunner, Dave Airlie,
	Simona Vetter, Aravind Iddamsetty, Joonas Lahtinen


+ Tao Zhou <Tao.Zhou1@amd.com> and Xiang (Dean) Liu <Xiang.Liu@amd.com> for awareness.

RE - AMD folks, would you consider this to replace the current debugfs you have?

[Hawking]:

Replacing the debugfs is not the primary concern. The main concern is whether drm_ras can effectively support the necessary RAS information for all device vendors, as this largely depends on the design of the hardware and firmware.

AMD is currently evaluating the proposed interface for error logging.

Regards,
Hawking

-----Original Message-----
From: Rodrigo Vivi <rodrigo.vivi@intel.com>
Sent: Wednesday, October 29, 2025 03:13
To: dri-devel@lists.freedesktop.org; intel-xe@lists.freedesktop.org; Dave Airlie <airlied@gmail.com>; Joonas Lahtinen <joonas.lahtinen@linux.intel.com>; Simona Vetter <simona.vetter@ffwll.ch>; Zhang, Hawking <Hawking.Zhang@amd.com>; Deucher, Alexander <Alexander.Deucher@amd.com>; Zack McKevitt <zachary.mckevitt@oss.qualcomm.com>; Lukas Wunner <lukas@wunner.de>; Aravind Iddamsetty <aravind.iddamsetty@linux.intel.com>
Cc: Zhang, Hawking <Hawking.Zhang@amd.com>; Deucher, Alexander <Alexander.Deucher@amd.com>; Zack McKevitt <zachary.mckevitt@oss.qualcomm.com>; Lukas Wunner <lukas@wunner.de>; Dave Airlie <airlied@gmail.com>; Simona Vetter <simona.vetter@ffwll.ch>; Aravind Iddamsetty <aravind.iddamsetty@linux.intel.com>; Joonas Lahtinen <joonas.lahtinen@linux.intel.com>
Subject: DRM_RAS for CPER Error logging?!

On Mon, Sep 29, 2025 at 05:44:12PM -0400, Rodrigo Vivi wrote:

Hey Dave, Sima, AMD folks, Qualcomm folks,

I have a key question to you below here.

> This work is a continuation of the great work started by Aravind ([1]
> and [2]) in order to fulfill the RAS requirements and proposal as
> previously discussed and agreed in the Linux Plumbers accelerator's bof of 2022 [3].
>
> [1]: https://lore.kernel.org/dri-devel/20250730064956.1385855-1-aravind.iddamsetty@linux.intel.com/
> [2]: https://lore.kernel.org/all/4cbdfcc5-5020-a942-740e-a602d4c00cc2@linux.intel.com/
> [3]: https://airlied.blogspot.com/2022/09/accelerators-bof-outcomes-summary.html
>
> During the past review round, Lukas pointed out that netlink had
> evolved in parallel during these years and that now, any new usage of
> netlink families would require the usage of the YAML description and scripts.
>
> With this new requirement in place, the family name is hardcoded in
> the yaml file, so we are forced to have a single family name for the
> entire drm, and then we now we are forced to have a registration.
>
> So, while doing the registration, we now created the concept of drm-ras-node.
> For now the only node type supported is the agreed error-counter. But
> that could be expanded for other cases like telemetry, requested by
> Zack for the qualcomm accel driver.
>
> In this first version, only querying counter is supported. But also
> this is expandable to future introduction of multicast notification and also clearing the counters.
>
> This design with multiple nodes per device is already flexible enough
> for driver to decide if it wants to handle error per device, or per IP
> block, or per error category. I believe this fully attend to the
> requested AMD feedback in the earlier reviews.
>
> So, my proposal is to start simple with this case as is, and then
> iterate over with the drm-ras in tree so we evolve together according
> to various driver's RAS needs.
>
> I have provided a documentation and the first Xe implementation of the
> counter as reference.
>
> Also, it is worth to mention that we have a in-tree pyynl/cli.py tool
> that entirely exercises this new API, hence I hope this can be the
> reference code for the uAPI usage, while we continue with the plan of
> introducing IGT tests and tools for this and adjusting the internal
> vendor tools to open with open source developments and changing them to support these flows.
>
> Example on MTL:
>
> $ sudo ./tools/net/ynl/pyynl/cli.py \
>   --spec Documentation/netlink/specs/drm_ras.yaml \
>   --dump list-nodes
> [{'device-name': '00:02.0',
>   'node-id': 0,
>   'node-name': 'non-fatal',
>   'node-type': 'error-counter'},
>  {'device-name': '00:02.0',
>   'node-id': 1,
>   'node-name': 'correctable',
>   'node-type': 'error-counter'}]

As you can see in the drm-ras patch, we now have a single family called
'drm-ras'. Drivers register entry points, called 'nodes', with that family,
and for now only one node type exists: 'error-counter'. I believe this is
what was agreed at the Linux Plumbers accelerators BoF of 2022 [3].

Zack already indicated that for Qualcomm he doesn't need the error counters, but another type, perhaps telemetry.

I need your feedback and input on yet another case here that goes side by side with error-counters: Error logging.

One of the RAS requirements that we have is to emit CPER logs in certain
cases. AMD currently uses debugfs to print the CPER entries that
accumulate in a ring buffer (iiuc).

Some folks are asking us to emit the CPER records through tracefs, because
debugfs might not be available in some enterprise production images.

However, there is a concern with using tracefs for the error-logging case:
tracefs has no active query path. If the user needs to poll for the latest
CPER records, they would need to piggyback on some other API that forces
the emit-trace(cper).

I believe the cleanest way is to add another drm-ras node type, named
'error-logging', with a single query-logs operation that dumps the
ring buffer holding the latest known CPER records. Is this acceptable?

AMD folks, would you consider this to replace the current debugfs you have?

Please let me know your thoughts.

We don't have a working example yet, but it would look something like this:

$ sudo ./tools/net/ynl/pyynl/cli.py \
  --spec Documentation/netlink/specs/drm_ras.yaml \
  --dump list-nodes
[{'device-name': '00:02.0',
  'node-id': 0,
  'node-name': 'non-fatal',
  'node-type': 'error-counter'},
 {'device-name': '00:02.0',
  'node-id': 1,
  'node-name': 'correctable',
  'node-type': 'error-counter'},
 {'device-name': '00:02.0',
  'node-id': 2,
  'node-name': 'non-fatal',
  'node-type': 'error-logging'},
 {'device-name': '00:02.0',
  'node-id': 3,
  'node-name': 'correctable',
  'node-type': 'error-logging'}]

$ sudo ./tools/net/ynl/pyynl/cli.py \
   --spec Documentation/netlink/specs/drm_ras.yaml \
   --dump get-logs --json '{"node-id":3}'
[{'FRU': 'String with device info', 'CPER': !@#$#!@#$},
{'FRU': 'String with device info', 'CPER': !@#$#!@#$},
{'FRU': 'String with device info', 'CPER': !@#$#!@#$},
{'FRU': 'String with device info', 'CPER': !@#$#!@#$},
{'FRU': 'String with device info', 'CPER': !@#$#!@#$},
]

Of course, the details of the error-logging fields, along with the CPER
binary format, are yet to be defined.

Oh, and the node names and the split between them are device specific. The
infra is flexible enough: each driver can do whatever makes sense for its
device.

Any feedback or comment is really appreciated.

Thanks in advance,
Rodrigo.

>
> $ sudo ./tools/net/ynl/pyynl/cli.py \
>   --spec Documentation/netlink/specs/drm_ras.yaml \
>   --dump get-error-counters --json '{"node-id":1}'
> [{'error-id': 0, 'error-name': 'GT Error', 'error-value': 0},
>  {'error-id': 4, 'error-name': 'Display Error', 'error-value': 0},
>  {'error-id': 8, 'error-name': 'GSC Error', 'error-value': 0},
>  {'error-id': 12, 'error-name': 'SG Unit Error', 'error-value': 0},
>  {'error-id': 16, 'error-name': 'SoC Error', 'error-value': 0},
>  {'error-id': 17, 'error-name': 'CSC Error', 'error-value': 0}]
>
> $ sudo ./tools/net/ynl/pyynl/cli.py \
>   --spec Documentation/netlink/specs/drm_ras.yaml \
>   --do query-error-counter --json '{"node-id": 0, "error-id": 12}'
> {'error-id': 12, 'error-name': 'SG Unit Error', 'error-value': 0}
>
> $ sudo ./tools/net/ynl/pyynl/cli.py \
>   --spec Documentation/netlink/specs/drm_ras.yaml \
>   --do query-error-counter --json '{"node-id": 1, "error-id": 16}'
> {'error-id': 16, 'error-name': 'SoC Error', 'error-value': 0}
>
> Thanks,
> Rodrigo.
>
> Cc: Hawking Zhang <Hawking.Zhang@amd.com>
> Cc: Alex Deucher <alexander.deucher@amd.com>
> Cc: Zack McKevitt <zachary.mckevitt@oss.qualcomm.com>
> Cc: Lukas Wunner <lukas@wunner.de>
> Cc: Dave Airlie <airlied@gmail.com>
> Cc: Simona Vetter <simona.vetter@ffwll.ch>
> Cc: Aravind Iddamsetty <aravind.iddamsetty@linux.intel.com>
> Cc: Joonas Lahtinen <joonas.lahtinen@linux.intel.com>
> Signed-off-by: Rodrigo Vivi <rodrigo.vivi@intel.com>
>
> Rodrigo Vivi (2):
>   drm/ras: Introduce the DRM RAS infrastructure over generic netlink
>   drm/xe: Introduce the usage of drm_ras with supported HW errors
>
>  Documentation/gpu/drm-ras.rst              | 109 +++++++
>  Documentation/netlink/specs/drm_ras.yaml   | 130 ++++++++
>  drivers/gpu/drm/Kconfig                    |   9 +
>  drivers/gpu/drm/Makefile                   |   1 +
>  drivers/gpu/drm/drm_drv.c                  |   6 +
>  drivers/gpu/drm/drm_ras.c                  | 357 +++++++++++++++++++++
>  drivers/gpu/drm/drm_ras_genl_family.c      |  42 +++
>  drivers/gpu/drm/drm_ras_nl.c               |  54 ++++
>  drivers/gpu/drm/xe/regs/xe_hw_error_regs.h |  22 ++
>  drivers/gpu/drm/xe/xe_hw_error.c           | 155 ++++++++-
>  include/drm/drm_ras.h                      |  76 +++++
>  include/drm/drm_ras_genl_family.h          |  17 +
>  include/drm/drm_ras_nl.h                   |  24 ++
>  include/uapi/drm/drm_ras.h                 |  49 +++
>  14 files changed, 1049 insertions(+), 2 deletions(-)
>  create mode 100644 Documentation/gpu/drm-ras.rst
>  create mode 100644 Documentation/netlink/specs/drm_ras.yaml
>  create mode 100644 drivers/gpu/drm/drm_ras.c
>  create mode 100644 drivers/gpu/drm/drm_ras_genl_family.c
>  create mode 100644 drivers/gpu/drm/drm_ras_nl.c
>  create mode 100644 include/drm/drm_ras.h
>  create mode 100644 include/drm/drm_ras_genl_family.h
>  create mode 100644 include/drm/drm_ras_nl.h
>  create mode 100644 include/uapi/drm/drm_ras.h
>
> --
> 2.51.0
>

* Re: DRM_RAS for CPER Error logging?!
  2025-10-28 19:13 ` DRM_RAS for CPER Error logging?! Rodrigo Vivi
  2025-10-29  2:00   ` Zhang, Hawking
@ 2025-10-30 14:47   ` Rodrigo Vivi
  2025-10-30 15:37     ` DRM_RAS (netlink genl family) " Rodrigo Vivi
  2025-10-31  5:38     ` DRM_RAS " Lukas Wunner
  1 sibling, 2 replies; 22+ messages in thread
From: Rodrigo Vivi @ 2025-10-30 14:47 UTC (permalink / raw)
  To: dri-devel, intel-xe, Dave Airlie, Joonas Lahtinen, Simona Vetter,
	Hawking Zhang, Alex Deucher, Zack McKevitt, Lukas Wunner,
	Aravind Iddamsetty, netdev, David S. Miller, Eric Dumazet,
	Jakub Kicinski, Paolo Abeni, Simon Horman

On Tue, Oct 28, 2025 at 03:13:15PM -0400, Rodrigo Vivi wrote:
> On Mon, Sep 29, 2025 at 05:44:12PM -0400, Rodrigo Vivi wrote:
> 
> Hey Dave, Sima, AMD folks, Qualcomm folks,

+ Netlink list and maintainers to get some feedback on the netlink usage
proposed here.

In particular, I would like to check whether there is any concern with a CPER
blob going through netlink, or any size limitation we should be aware of.
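
For context on the size question, here is a small Python sketch of the kind of
check a receiver could do on such a blob. It assumes the CPER record header
layout from the UEFI specification (Appendix N) and a blob carried in a single
netlink attribute; how the blob would really be carried is not defined yet, and
the function name check_cper_blob is hypothetical.

```python
import struct

# Leading fields of the CPER record header (UEFI spec, Appendix N),
# little-endian and packed: char[4] signature "CPER", u16 revision,
# u32 signature end, u16 section count, u32 error severity,
# u32 validation bits, u32 record length (whole record, in bytes).
CPER_HEAD = struct.Struct("<4sHIHIII")

# struct nlattr keeps nla_len in a u16 that also counts its 4-byte header,
# so a single attribute payload cannot exceed this many bytes.
NLA_MAX_PAYLOAD = 0xFFFF - 4


def check_cper_blob(blob):
    """Return the record length a CPER blob claims, after basic validation."""
    if len(blob) < CPER_HEAD.size:
        raise ValueError("blob shorter than the CPER header prefix")
    sig, _rev, sig_end, _nsec, _sev, _valid, length = CPER_HEAD.unpack_from(blob)
    if sig != b"CPER" or sig_end != 0xFFFFFFFF:
        raise ValueError("not a CPER record")
    if length != len(blob):
        raise ValueError("record length field does not match blob size")
    if length > NLA_MAX_PAYLOAD:
        raise ValueError("record would not fit in one netlink attribute")
    return length


# Synthetic 200-byte record: valid header prefix followed by zero padding.
blob = CPER_HEAD.pack(b"CPER", 0x0100, 0xFFFFFFFF, 1, 2, 0, 200)
blob += bytes(200 - len(blob))
print(check_cper_blob(blob))
```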

> 
> I have a key question to you below here.
> 
> > This work is a continuation of the great work started by Aravind ([1] and [2])
> > in order to fulfill the RAS requirements and proposal as previously discussed
> > and agreed in the Linux Plumbers accelerator's bof of 2022 [3].
> > 
> > [1]: https://lore.kernel.org/dri-devel/20250730064956.1385855-1-aravind.iddamsetty@linux.intel.com/
> > [2]: https://lore.kernel.org/all/4cbdfcc5-5020-a942-740e-a602d4c00cc2@linux.intel.com/
> > [3]: https://airlied.blogspot.com/2022/09/accelerators-bof-outcomes-summary.html
> > 
> > During the past review round, Lukas pointed out that netlink had evolved
> > in parallel during these years and that now, any new usage of netlink families
> > would require the usage of the YAML description and scripts.
> > 
> > With this new requirement in place, the family name is hardcoded in the yaml file,
> > so we are forced to have a single family name for the entire drm, and then we now
> > we are forced to have a registration.
> > 
> > So, while doing the registration, we now created the concept of drm-ras-node.
> > For now the only node type supported is the agreed error-counter. But that could
> > be expanded for other cases like telemetry, requested by Zack for the qualcomm accel
> > driver.
> > 
> > In this first version, only querying counter is supported. But also this is expandable
> > to future introduction of multicast notification and also clearing the counters.
> > 
> > This design with multiple nodes per device is already flexible enough for driver
> > to decide if it wants to handle error per device, or per IP block, or per error
> > category. I believe this fully attend to the requested AMD feedback in the earlier
> > reviews.
> > 
> > So, my proposal is to start simple with this case as is, and then iterate over
> > with the drm-ras in tree so we evolve together according to various driver's RAS
> > needs.
> > 
> > I have provided a documentation and the first Xe implementation of the counter
> > as reference.
> > 
> > Also, it is worth to mention that we have a in-tree pyynl/cli.py tool that entirely
> > exercises this new API, hence I hope this can be the reference code for the uAPI
> > usage, while we continue with the plan of introducing IGT tests and tools for this
> > and adjusting the internal vendor tools to open with open source developments and
> > changing them to support these flows.
> > 
> > Example on MTL:
> > 
> > $ sudo ./tools/net/ynl/pyynl/cli.py \
> >   --spec Documentation/netlink/specs/drm_ras.yaml \
> >   --dump list-nodes
> > [{'device-name': '00:02.0',
> >   'node-id': 0,
> >   'node-name': 'non-fatal',
> >   'node-type': 'error-counter'},
> >  {'device-name': '00:02.0',
> >   'node-id': 1,
> >   'node-name': 'correctable',
> >   'node-type': 'error-counter'}]
> 
> As you can see on the drm-ras patch, we now have only a single family called
> 'drm-ras', with that we have to register entry points, called 'nodes'
> and for now only one type is existing: 'error-counter'
> 
> As I believe it was agreed in the Linux Plumbers accelerator's bof of 2022 [3].
> 
> Zack already indicated that for Qualcomm he doesn't need the error counters,
> but another type, perhaps telemetry.
> 
> I need your feedback and input on yet another case here that goes side
> by side with error-counters: Error logging.
> 
> One of the RAS requirements that we have is to emit CPER logs in certain
> cases. AMD is currently using debugfs for printing the CPER entries that
> accumulates in a ringbuffer. (iiuc).
> 
> Some folks are asking us to emit the CPER in the tracefs because
> debugfs might not be available in some enterprise production images.
> 
> However, there's a concern on the tracefs usage for the error-logging case.
> There is no active query path in the tracefs. If user needs to poll for
> the latest CPER records it would need to pig-back on some other API
> that would force the emit-trace(cper).
> 
> I believe that the cleanest way is to have another drm-ras node type
> named 'error-logging' with a single operation that is query-logs,
> that would be a dump of the available ring-buffer with latest known
> cper records. Is this acceptable?
> 
> AMD folks, would you consider this to replace the current debugfs you
> have?
> 
> Please let me know your thoughts.
> 
> We won't have an example for now, but it would be something like:
> 
> Thanks,
> Rodrigo.
> 
> $ sudo ./tools/net/ynl/pyynl/cli.py \
>   --spec Documentation/netlink/specs/drm_ras.yaml \
>   --dump list-nodes
> [{'device-name': '00:02.0',
>   'node-id': 0,
>   'node-name': 'non-fatal',
>   'node-type': 'error-counter'},
>  {'device-name': '00:02.0',
>   'node-id': 1,
>   'node-name': 'correctable',
>   'node-type': 'error-counter'},
>  {'device-name': '00:02.0',
>   'node-id': 2,
>   'node-name': 'non-fatal',
>   'node-type': 'error-logging'},
>  {'device-name': '00:02.0',
>   'node-id': 3,
>   'node-name': 'correctable',
>   'node-type': 'error-logging'}]
> 
> $ sudo ./tools/net/ynl/pyynl/cli.py \
>    --spec Documentation/netlink/specs/drm_ras.yaml \
>    --dump get-logs --json '{"node-id":3}'
> [{'FRU': 'String with device info', 'CPER': !@#$#!@#$},
> {'FRU': 'String with device info', 'CPER': !@#$#!@#$},
> {'FRU': 'String with device info', 'CPER': !@#$#!@#$},
> {'FRU': 'String with device info', 'CPER': !@#$#!@#$},
> {'FRU': 'String with device info', 'CPER': !@#$#!@#$},
> ]
> 
> Of course, details of the error-logging fields along with the CPER binary
> is yet to be defined.
> 
> Oh, and the nodes names and split is device specific. The infra is flexible
> enough. Driver can do whatever it makes sense for their device.
> 
> Any feedback or comment is really appreciated.
> 
> Thanks in advance,
> Rodrigo.
> 
> > 
> > $ sudo ./tools/net/ynl/pyynl/cli.py \
> >   --spec Documentation/netlink/specs/drm_ras.yaml \
> >   --dump get-error-counters --json '{"node-id":1}'
> > [{'error-id': 0, 'error-name': 'GT Error', 'error-value': 0},
> >  {'error-id': 4, 'error-name': 'Display Error', 'error-value': 0},
> >  {'error-id': 8, 'error-name': 'GSC Error', 'error-value': 0},
> >  {'error-id': 12, 'error-name': 'SG Unit Error', 'error-value': 0},
> >  {'error-id': 16, 'error-name': 'SoC Error', 'error-value': 0},
> >  {'error-id': 17, 'error-name': 'CSC Error', 'error-value': 0}]
> > 
> > $ sudo ./tools/net/ynl/pyynl/cli.py \
> >   --spec Documentation/netlink/specs/drm_ras.yaml \
> >   --do query-error-counter --json '{"node-id": 0, "error-id": 12}'
> > {'error-id': 12, 'error-name': 'SG Unit Error', 'error-value': 0}
> > 
> > $ sudo ./tools/net/ynl/pyynl/cli.py \
> >   --spec Documentation/netlink/specs/drm_ras.yaml \
> >   --do query-error-counter --json '{"node-id": 1, "error-id": 16}'
> > {'error-id': 16, 'error-name': 'SoC Error', 'error-value': 0}
> > 
> > Thanks,
> > Rodrigo.
> > 
> > Cc: Hawking Zhang <Hawking.Zhang@amd.com>
> > Cc: Alex Deucher <alexander.deucher@amd.com>
> > Cc: Zack McKevitt <zachary.mckevitt@oss.qualcomm.com>
> > Cc: Lukas Wunner <lukas@wunner.de>
> > Cc: Dave Airlie <airlied@gmail.com>
> > Cc: Simona Vetter <simona.vetter@ffwll.ch>
> > Cc: Aravind Iddamsetty <aravind.iddamsetty@linux.intel.com>
> > Cc: Joonas Lahtinen <joonas.lahtinen@linux.intel.com>
> > Signed-off-by: Rodrigo Vivi <rodrigo.vivi@intel.com>
> > 
> > Rodrigo Vivi (2):
> >   drm/ras: Introduce the DRM RAS infrastructure over generic netlink
> >   drm/xe: Introduce the usage of drm_ras with supported HW errors
> > 
> >  Documentation/gpu/drm-ras.rst              | 109 +++++++
> >  Documentation/netlink/specs/drm_ras.yaml   | 130 ++++++++
> >  drivers/gpu/drm/Kconfig                    |   9 +
> >  drivers/gpu/drm/Makefile                   |   1 +
> >  drivers/gpu/drm/drm_drv.c                  |   6 +
> >  drivers/gpu/drm/drm_ras.c                  | 357 +++++++++++++++++++++
> >  drivers/gpu/drm/drm_ras_genl_family.c      |  42 +++
> >  drivers/gpu/drm/drm_ras_nl.c               |  54 ++++
> >  drivers/gpu/drm/xe/regs/xe_hw_error_regs.h |  22 ++
> >  drivers/gpu/drm/xe/xe_hw_error.c           | 155 ++++++++-
> >  include/drm/drm_ras.h                      |  76 +++++
> >  include/drm/drm_ras_genl_family.h          |  17 +
> >  include/drm/drm_ras_nl.h                   |  24 ++
> >  include/uapi/drm/drm_ras.h                 |  49 +++
> >  14 files changed, 1049 insertions(+), 2 deletions(-)
> >  create mode 100644 Documentation/gpu/drm-ras.rst
> >  create mode 100644 Documentation/netlink/specs/drm_ras.yaml
> >  create mode 100644 drivers/gpu/drm/drm_ras.c
> >  create mode 100644 drivers/gpu/drm/drm_ras_genl_family.c
> >  create mode 100644 drivers/gpu/drm/drm_ras_nl.c
> >  create mode 100644 include/drm/drm_ras.h
> >  create mode 100644 include/drm/drm_ras_genl_family.h
> >  create mode 100644 include/drm/drm_ras_nl.h
> >  create mode 100644 include/uapi/drm/drm_ras.h
> > 
> > -- 
> > 2.51.0
> > 

* Re: DRM_RAS (netlink genl family) for CPER Error logging?!
  2025-10-30 14:47   ` Rodrigo Vivi
@ 2025-10-30 15:37     ` Rodrigo Vivi
  2025-10-31  5:38     ` DRM_RAS " Lukas Wunner
  1 sibling, 0 replies; 22+ messages in thread
From: Rodrigo Vivi @ 2025-10-30 15:37 UTC (permalink / raw)
  To: dri-devel, intel-xe, Dave Airlie, Joonas Lahtinen, Simona Vetter,
	Hawking Zhang, Alex Deucher, Zack McKevitt, Lukas Wunner,
	Aravind Iddamsetty, netdev, David S. Miller, Eric Dumazet,
	Jakub Kicinski, Paolo Abeni, Simon Horman

On Thu, Oct 30, 2025 at 10:47:18AM -0400, Rodrigo Vivi wrote:
> On Tue, Oct 28, 2025 at 03:13:15PM -0400, Rodrigo Vivi wrote:
> > On Mon, Sep 29, 2025 at 05:44:12PM -0400, Rodrigo Vivi wrote:
> > 
> > Hey Dave, Sima, AMD folks, Qualcomm folks,
> 
> + Netlink list and maintainers to get some feedback on the netlink usage
> proposed here.

The netdev mailing list blocked my bounces of the original discussions,
so for the overall context:

Usage of netlink as a drm-ras solution (with error counters in mind):
https://airlied.blogspot.com/2022/09/accelerators-bof-outcomes-summary.html

Proposal for error-counters with drm-ras generic netlink:
https://lore.kernel.org/dri-devel/20250929214415.326414-4-rodrigo.vivi@intel.com/

Question about the error-logging RAS sub-case with CPER over this drm-ras netlink:
https://lore.kernel.org/dri-devel/aQEVy1qjaDCwL_cc@intel.com/

> 
> Specially to check if there's any concern with CPER blob going through
> netlink or if there's any size limitation or concern.
> 
> > 
> > I have a key question to you below here.
> > 
> > > This work is a continuation of the great work started by Aravind ([1] and [2])
> > > in order to fulfill the RAS requirements and proposal as previously discussed
> > > and agreed in the Linux Plumbers accelerator's bof of 2022 [3].
> > > 
> > > [1]: https://lore.kernel.org/dri-devel/20250730064956.1385855-1-aravind.iddamsetty@linux.intel.com/
> > > [2]: https://lore.kernel.org/all/4cbdfcc5-5020-a942-740e-a602d4c00cc2@linux.intel.com/
> > > [3]: https://airlied.blogspot.com/2022/09/accelerators-bof-outcomes-summary.html
> > > 
> > > During the past review round, Lukas pointed out that netlink had evolved
> > > in parallel during these years and that now, any new usage of netlink families
> > > would require the usage of the YAML description and scripts.
> > > 
> > > With this new requirement in place, the family name is hardcoded in the yaml file,
> > > so we are forced to have a single family name for the entire drm, and then we now
> > > we are forced to have a registration.
> > > 
> > > So, while doing the registration, we now created the concept of drm-ras-node.
> > > For now the only node type supported is the agreed error-counter. But that could
> > > be expanded for other cases like telemetry, requested by Zack for the qualcomm accel
> > > driver.
> > > 
> > > In this first version, only querying counter is supported. But also this is expandable
> > > to future introduction of multicast notification and also clearing the counters.
> > > 
> > > This design with multiple nodes per device is already flexible enough for driver
> > > to decide if it wants to handle error per device, or per IP block, or per error
> > > category. I believe this fully attend to the requested AMD feedback in the earlier
> > > reviews.
> > > 
> > > So, my proposal is to start simple with this case as is, and then iterate over
> > > with the drm-ras in tree so we evolve together according to various driver's RAS
> > > needs.
> > > 
> > > I have provided a documentation and the first Xe implementation of the counter
> > > as reference.
> > > 
> > > Also, it is worth to mention that we have a in-tree pyynl/cli.py tool that entirely
> > > exercises this new API, hence I hope this can be the reference code for the uAPI
> > > usage, while we continue with the plan of introducing IGT tests and tools for this
> > > and adjusting the internal vendor tools to open with open source developments and
> > > changing them to support these flows.
> > > 
> > > Example on MTL:
> > > 
> > > $ sudo ./tools/net/ynl/pyynl/cli.py \
> > >   --spec Documentation/netlink/specs/drm_ras.yaml \
> > >   --dump list-nodes
> > > [{'device-name': '00:02.0',
> > >   'node-id': 0,
> > >   'node-name': 'non-fatal',
> > >   'node-type': 'error-counter'},
> > >  {'device-name': '00:02.0',
> > >   'node-id': 1,
> > >   'node-name': 'correctable',
> > >   'node-type': 'error-counter'}]
> > 
> > As you can see on the drm-ras patch, we now have only a single family called
> > 'drm-ras', with that we have to register entry points, called 'nodes'
> > and for now only one type is existing: 'error-counter'
> > 
> > As I believe it was agreed in the Linux Plumbers accelerator's bof of 2022 [3].
> > 
> > Zack already indicated that for Qualcomm he doesn't need the error counters,
> > but another type, perhaps telemetry.
> > 
> > I need your feedback and input on yet another case here that goes side
> > by side with error-counters: Error logging.
> > 
> > One of the RAS requirements that we have is to emit CPER logs in certain
> > cases. AMD is currently using debugfs for printing the CPER entries that
> > accumulates in a ringbuffer. (iiuc).
> > 
> > Some folks are asking us to emit the CPER in the tracefs because
> > debugfs might not be available in some enterprise production images.
> > 
> > However, there's a concern on the tracefs usage for the error-logging case.
> > There is no active query path in the tracefs. If user needs to poll for
> > the latest CPER records it would need to pig-back on some other API
> > that would force the emit-trace(cper).
> > 
> > I believe that the cleanest way is to have another drm-ras node type
> > named 'error-logging' with a single operation that is query-logs,
> > that would be a dump of the available ring-buffer with latest known
> > cper records. Is this acceptable?
> > 
> > AMD folks, would you consider this to replace the current debugfs you
> > have?
> > 
> > Please let me know your thoughts.
> > 
> > We won't have an example for now, but it would be something like:
> > 
> > Thanks,
> > Rodrigo.
> > 
> > $ sudo ./tools/net/ynl/pyynl/cli.py \
> >   --spec Documentation/netlink/specs/drm_ras.yaml \
> >   --dump list-nodes
> > [{'device-name': '00:02.0',
> >   'node-id': 0,
> >   'node-name': 'non-fatal',
> >   'node-type': 'error-counter'},
> >  {'device-name': '00:02.0',
> >   'node-id': 1,
> >   'node-name': 'correctable',
> >   'node-type': 'error-counter'}
> >  'device-name': '00:02.0',
> >   'node-id': 2,
> >   'node-name': 'non-fatal',
> >   'node-type': 'error-logging'},
> >  {'device-name': '00:02.0',
> >   'node-id': 3,
> >   'node-name': 'correctable',
> >   'node-type': 'error-logging'}]
> > 
> > $ sudo ./tools/net/ynl/pyynl/cli.py \
> >    --spec Documentation/netlink/specs/drm_ras.yaml \
> >    --dump get-logs --json '{"node-id":3}'
> > [{'FRU': 'String with device info', 'CPER': !@#$#!@#$},
> > {'FRU': 'String with device info', 'CPER': !@#$#!@#$},
> > {'FRU': 'String with device info', 'CPER': !@#$#!@#$},
> > {'FRU': 'String with device info', 'CPER': !@#$#!@#$},
> > {'FRU': 'String with device info', 'CPER': !@#$#!@#$},
> > ]
> > 
> > Of course, details of the error-logging fields along with the CPER binary
> > is yet to be defined.
> > 
> > Oh, and the nodes names and split is device specific. The infra is flexible
> > enough. Driver can do whatever it makes sense for their device.
> > 
> > Any feedback or comment is really appreciated.
> > 
> > Thanks in advance,
> > Rodrigo.
> > 
> > > 
> > > $ sudo ./tools/net/ynl/pyynl/cli.py \
> > >   --spec Documentation/netlink/specs/drm_ras.yaml \
> > >   --dump get-error-counters --json '{"node-id":1}'
> > > [{'error-id': 0, 'error-name': 'GT Error', 'error-value': 0},
> > >  {'error-id': 4, 'error-name': 'Display Error', 'error-value': 0},
> > >  {'error-id': 8, 'error-name': 'GSC Error', 'error-value': 0},
> > >  {'error-id': 12, 'error-name': 'SG Unit Error', 'error-value': 0},
> > >  {'error-id': 16, 'error-name': 'SoC Error', 'error-value': 0},
> > >  {'error-id': 17, 'error-name': 'CSC Error', 'error-value': 0}]
> > > 
> > > $ sudo ./tools/net/ynl/pyynl/cli.py \
> > >   --spec Documentation/netlink/specs/drm_ras.yaml \
> > >   --do query-error-counter --json '{"node-id": 0, "error-id": 12}'
> > > {'error-id': 12, 'error-name': 'SG Unit Error', 'error-value': 0}
> > > 
> > > $ sudo ./tools/net/ynl/pyynl/cli.py \
> > >   --spec Documentation/netlink/specs/drm_ras.yaml \
> > >   --do query-error-counter --json '{"node-id": 1, "error-id": 16}'
> > > {'error-id': 16, 'error-name': 'SoC Error', 'error-value': 0}
> > > 
> > > Thanks,
> > > Rodrigo.
> > > 
> > > Cc: Hawking Zhang <Hawking.Zhang@amd.com>
> > > Cc: Alex Deucher <alexander.deucher@amd.com>
> > > Cc: Zack McKevitt <zachary.mckevitt@oss.qualcomm.com>
> > > Cc: Lukas Wunner <lukas@wunner.de>
> > > Cc: Dave Airlie <airlied@gmail.com>
> > > Cc: Simona Vetter <simona.vetter@ffwll.ch>
> > > Cc: Aravind Iddamsetty <aravind.iddamsetty@linux.intel.com>
> > > Cc: Joonas Lahtinen <joonas.lahtinen@linux.intel.com>
> > > Signed-off-by: Rodrigo Vivi <rodrigo.vivi@intel.com>
> > > 
> > > Rodrigo Vivi (2):
> > >   drm/ras: Introduce the DRM RAS infrastructure over generic netlink
> > >   drm/xe: Introduce the usage of drm_ras with supported HW errors
> > > 
> > >  Documentation/gpu/drm-ras.rst              | 109 +++++++
> > >  Documentation/netlink/specs/drm_ras.yaml   | 130 ++++++++
> > >  drivers/gpu/drm/Kconfig                    |   9 +
> > >  drivers/gpu/drm/Makefile                   |   1 +
> > >  drivers/gpu/drm/drm_drv.c                  |   6 +
> > >  drivers/gpu/drm/drm_ras.c                  | 357 +++++++++++++++++++++
> > >  drivers/gpu/drm/drm_ras_genl_family.c      |  42 +++
> > >  drivers/gpu/drm/drm_ras_nl.c               |  54 ++++
> > >  drivers/gpu/drm/xe/regs/xe_hw_error_regs.h |  22 ++
> > >  drivers/gpu/drm/xe/xe_hw_error.c           | 155 ++++++++-
> > >  include/drm/drm_ras.h                      |  76 +++++
> > >  include/drm/drm_ras_genl_family.h          |  17 +
> > >  include/drm/drm_ras_nl.h                   |  24 ++
> > >  include/uapi/drm/drm_ras.h                 |  49 +++
> > >  14 files changed, 1049 insertions(+), 2 deletions(-)
> > >  create mode 100644 Documentation/gpu/drm-ras.rst
> > >  create mode 100644 Documentation/netlink/specs/drm_ras.yaml
> > >  create mode 100644 drivers/gpu/drm/drm_ras.c
> > >  create mode 100644 drivers/gpu/drm/drm_ras_genl_family.c
> > >  create mode 100644 drivers/gpu/drm/drm_ras_nl.c
> > >  create mode 100644 include/drm/drm_ras.h
> > >  create mode 100644 include/drm/drm_ras_genl_family.h
> > >  create mode 100644 include/drm/drm_ras_nl.h
> > >  create mode 100644 include/uapi/drm/drm_ras.h
> > > 
> > > -- 
> > > 2.51.0
> > > 

^ permalink raw reply	[flat|nested] 22+ messages in thread

* Re: [PATCH 1/2] drm/ras: Introduce the DRM RAS infrastructure over generic netlink
  2025-09-29 21:44 ` [PATCH 1/2] drm/ras: Introduce the DRM RAS infrastructure over generic netlink Rodrigo Vivi
@ 2025-10-31  1:32   ` Jakub Kicinski
  2025-11-06 13:30     ` Rodrigo Vivi
  0 siblings, 1 reply; 22+ messages in thread
From: Jakub Kicinski @ 2025-10-31  1:32 UTC (permalink / raw)
  To: Rodrigo Vivi
  Cc: dri-devel, intel-xe, Zack McKevitt, Lukas Wunner, Lijo Lazar,
	Hawking Zhang, Aravind Iddamsetty

On Mon, 29 Sep 2025 17:44:13 -0400 Rodrigo Vivi wrote:
> Introduces the DRM RAS infrastructure over generic netlink.

Can't comment on the merits but in terms of netlink..

> +    ./tools/net/ynl/pyynl/cli.py --spec Documentation/netlink/specs/drm_ras.yaml --dump list-nodes

We recommend using the "installed" syntax in examples, so:

	ynl --family drm_ras

instead of:

	./tools/net/ynl/pyynl/cli.py --spec
	Documentation/netlink/specs/drm_ras.yaml 

If you're using Fedora or another good distro, the ynl CLI is packaged (for
Fedora, in kernel-tools). The in-tree syntax is a bit verbose.

> +	xa_for_each(&drm_ras_xa, id, node) {
> +		if (id < ctx->restart)
> +			continue;

IIRC xa_for_each_start can make this simpler?

> +		hdr = genlmsg_put(skb, NETLINK_CB(cb->skb).portid,
> +				  cb->nlh->nlmsg_seq,
> +				  &drm_ras_nl_family, NLM_F_MULTI,
> +				  DRM_RAS_CMD_LIST_NODES);

Use genlmsg_iput() here, with genl_info_dump(cb) to get the info.

> +		if (!hdr) {
> +			ret = -EMSGSIZE;
> +			break;
> +		}
> +
> +		ret = nla_put_u32(skb, DRM_RAS_A_NODE_ATTRS_NODE_ID, node->id);
> +		if (ret) {
> +			genlmsg_cancel(skb, hdr);
> +			break;
> +		}
> +
> +		ret = nla_put_string(skb, DRM_RAS_A_NODE_ATTRS_DEVICE_NAME,
> +				     node->device_name);
> +		if (ret) {
> +			genlmsg_cancel(skb, hdr);
> +			break;
> +		}
> +
> +		ret = nla_put_string(skb, DRM_RAS_A_NODE_ATTRS_NODE_NAME,
> +				     node->node_name);
> +		if (ret) {
> +			genlmsg_cancel(skb, hdr);
> +			break;
> +		}
> +
> +		ret = nla_put_u32(skb, DRM_RAS_A_NODE_ATTRS_NODE_TYPE,
> +				  node->type);
> +		if (ret) {
> +			genlmsg_cancel(skb, hdr);
> +			break;
> +		}
> +
> +		genlmsg_end(skb, hdr);
> +	}
> +
> +	if (ret == -EMSGSIZE) {
> +		ctx->restart = id;
> +		return skb->len;
> +	}
> +
> +	return ret;

Separate handling of -EMSGSIZE and returning skb->len is not necessary
as of a few releases ago. Just return ret; core will do the right thing
if ret == -EMSGSIZE and skb->len != 0
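
To illustrate the two dump comments above (start the walk at the restart
index, and simply return the error code), here is a small userspace model.
It is not kernel code: the model_* names, the fixed slot table and the
fixed-capacity message are hypothetical stand-ins for the xarray, the skb
and the netlink core, and the core's behaviour is modelled only as far as
it is described in the review comment.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>

#define MODEL_SLOTS 8	/* hypothetical id space, standing in for the xarray */

/* Hypothetical stand-in for the skb: holds at most `cap` node ids. */
struct model_msg {
	unsigned long ids[MODEL_SLOTS];
	size_t len;
	size_t cap;
};

/* Hypothetical stand-in for the per-dump context (ctx->restart). */
struct model_ctx {
	unsigned long restart;
};

/*
 * Models the dumpit callback. The loop begins at ctx->restart, which is
 * what xa_for_each_start() would provide, so no "id < restart" check is
 * needed. When the message is full, the error is returned as is.
 */
int model_dumpit(const unsigned char present[MODEL_SLOTS],
		 struct model_msg *msg, struct model_ctx *ctx)
{
	unsigned long id;

	assert(msg->cap <= MODEL_SLOTS);

	for (id = ctx->restart; id < MODEL_SLOTS; id++) {
		if (!present[id])
			continue;	/* empty slot, as in a sparse xarray */
		if (msg->len == msg->cap) {
			ctx->restart = id;	/* resume here on the next call */
			return -EMSGSIZE;
		}
		msg->ids[msg->len++] = id;
	}
	return 0;
}

/*
 * Models the core as described above: if dumpit returns -EMSGSIZE and
 * the message is not empty, deliver it and call dumpit again.
 */
int model_dump_all(const unsigned char present[MODEL_SLOTS], size_t cap,
		   unsigned long *out, size_t out_cap, size_t *count)
{
	struct model_ctx ctx = { .restart = 0 };
	int ret;

	*count = 0;
	do {
		struct model_msg msg = { .len = 0, .cap = cap };
		size_t i;

		ret = model_dumpit(present, &msg, &ctx);
		if (ret == -EMSGSIZE && msg.len == 0)
			return ret;	/* nothing fit at all: a real error */
		for (i = 0; i < msg.len && *count < out_cap; i++)
			out[(*count)++] = msg.ids[i];
	} while (ret == -EMSGSIZE);

	return ret;
}
```

With ids 0, 2, 3, 6 and 7 present and a capacity of two entries per
message, the model needs three calls to model_dumpit() and the callback
never has to translate -EMSGSIZE into a length itself.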


> +static int doit_reply_value(struct genl_info *info, u32 node_id,
> +			    u32 error_id)
> +{
> +	struct sk_buff *msg;
> +	struct nlattr *hdr;
> +	const char *error_name;
> +	u32 value;
> +	int ret;
> +
> +	msg = genlmsg_new(NLMSG_GOODSIZE, GFP_KERNEL);
> +	if (!msg)
> +		return -ENOMEM;
> +	hdr = genlmsg_put_reply(msg, info, &drm_ras_nl_family, 0,
> +				DRM_RAS_CMD_QUERY_ERROR_COUNTER);
> +	if (!hdr) {
> +		nlmsg_free(msg);
> +		return -EMSGSIZE;
> +	}
> +
> +	ret = get_node_error_counter(node_id, error_id,
> +				     &error_name, &value);
> +	if (ret)
> +		return ret;
> +
> +	ret = msg_reply_value(msg, error_id, error_name, value);
> +	if (ret)
> +		return ret;

Leaking message on errors?

> +	genlmsg_end(msg, hdr);
> +
> +	return genlmsg_reply(msg, info);
> +}
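
The leak pointed out above can be shown with a userspace model of the same
control flow. This is only a sketch: every model_* name is hypothetical,
and the live-message counter exists purely to make a leak observable. In
the quoted kernel function the matching cleanup would presumably be the
nlmsg_free(msg) call that the hdr failure path already uses.

```c
#include <assert.h>
#include <errno.h>
#include <stdlib.h>

/* Count of live model messages, so that a leak is observable. */
int model_live_msgs;

struct model_msg {
	int value;
};

struct model_msg *model_msg_new(void)
{
	struct model_msg *msg = calloc(1, sizeof(*msg));

	if (msg)
		model_live_msgs++;
	return msg;
}

void model_msg_free(struct model_msg *msg)
{
	if (!msg)
		return;
	model_live_msgs--;
	free(msg);
}

/* Hypothetical counter lookup: only error id 12 exists in this model. */
int model_get_counter(unsigned int error_id, int *value)
{
	if (error_id != 12)
		return -ENOENT;
	*value = 0;
	return 0;
}

/* In this model, sending the reply consumes the message. */
int model_msg_reply(struct model_msg *msg)
{
	assert(msg);
	model_msg_free(msg);
	return 0;
}

int model_doit_reply_value(unsigned int error_id)
{
	struct model_msg *msg;
	int ret;

	msg = model_msg_new();
	if (!msg)
		return -ENOMEM;

	ret = model_get_counter(error_id, &msg->value);
	if (ret)
		goto err_free;	/* the quoted code returns here without freeing */

	return model_msg_reply(msg);

err_free:
	model_msg_free(msg);
	return ret;
}
```

A single error label keeps every failure after the allocation on the same
cleanup path, so later additions to the function cannot forget the free.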


^ permalink raw reply	[flat|nested] 22+ messages in thread

* Re: DRM_RAS for CPER Error logging?!
  2025-10-30 14:47   ` Rodrigo Vivi
  2025-10-30 15:37     ` DRM_RAS (netlink genl family) " Rodrigo Vivi
@ 2025-10-31  5:38     ` Lukas Wunner
  2025-11-06 13:08       ` Rodrigo Vivi
  1 sibling, 1 reply; 22+ messages in thread
From: Lukas Wunner @ 2025-10-31  5:38 UTC (permalink / raw)
  To: Rodrigo Vivi
  Cc: dri-devel, intel-xe, Dave Airlie, Joonas Lahtinen, Simona Vetter,
	Hawking Zhang, Alex Deucher, Zack McKevitt, Aravind Iddamsetty,
	netdev, David S. Miller, Eric Dumazet, Jakub Kicinski,
	Paolo Abeni, Simon Horman

On Thu, Oct 30, 2025 at 10:47:18AM -0400, Rodrigo Vivi wrote:
> On Tue, Oct 28, 2025 at 03:13:15PM -0400, Rodrigo Vivi wrote:
> > On Mon, Sep 29, 2025 at 05:44:12PM -0400, Rodrigo Vivi wrote:
> > 
> > Hey Dave, Sima, AMD folks, Qualcomm folks,
> 
> + Netlink list and maintainers to get some feedback on the netlink usage
> proposed here.
> 
> Specially to check if there's any concern with CPER blob going through
> netlink or if there's any size limitation or concern.

How large are those blobs?  If the netlink message exceeds PAGE_SIZE
because of the CPER blob, a workaround might be to attach it to the
skb as fragments with skb_add_rx_frag().

Thanks,

Lukas

^ permalink raw reply	[flat|nested] 22+ messages in thread

* Re: DRM_RAS for CPER Error logging?!
  2025-10-31  5:38     ` DRM_RAS " Lukas Wunner
@ 2025-11-06 13:08       ` Rodrigo Vivi
  0 siblings, 0 replies; 22+ messages in thread
From: Rodrigo Vivi @ 2025-11-06 13:08 UTC (permalink / raw)
  To: Lukas Wunner
  Cc: dri-devel, intel-xe, Dave Airlie, Joonas Lahtinen, Simona Vetter,
	Hawking Zhang, Alex Deucher, Zack McKevitt, Aravind Iddamsetty,
	netdev, David S. Miller, Eric Dumazet, Jakub Kicinski,
	Paolo Abeni, Simon Horman

On Fri, Oct 31, 2025 at 06:38:57AM +0100, Lukas Wunner wrote:
> On Thu, Oct 30, 2025 at 10:47:18AM -0400, Rodrigo Vivi wrote:
> > On Tue, Oct 28, 2025 at 03:13:15PM -0400, Rodrigo Vivi wrote:
> > > On Mon, Sep 29, 2025 at 05:44:12PM -0400, Rodrigo Vivi wrote:
> > > 
> > > Hey Dave, Sima, AMD folks, Qualcomm folks,
> > 
> > + Netlink list and maintainers to get some feedback on the netlink usage
> > proposed here.
> > 
> > Specially to check if there's any concern with CPER blob going through
> > netlink or if there's any size limitation or concern.
> 
> How large are those blobs? 

The honest answer is: I don't know!

By spec it has no size limit, but since CPER is generally made
for FW storage, it is usually not very big.

From what I could see, the usual max seems to be around 64 KB. But
for our case we are looking at something much smaller than that.

> If the netlink message exceeds PAGE_SIZE
> because of the CPER blob, a workaround might be to attach it to the
> skb as fragments with skb_add_rx_frag().

Yeap, I imagined that there should be a way.
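
As a rough worked example of the sizes discussed here, the sketch below
counts how many page-sized fragments a blob would need. The 4096-byte page
size and the reading of "around 64 KB" as 65536 bytes are assumptions made
for illustration; the real PAGE_SIZE depends on the architecture, and the
helper name is hypothetical.

```c
#include <assert.h>
#include <stddef.h>

/* Assumed page size for this sketch only. */
#define MODEL_PAGE_SIZE 4096u

static_assert((MODEL_PAGE_SIZE & (MODEL_PAGE_SIZE - 1)) == 0,
	      "page size is expected to be a power of two");

/* Number of page-sized fragments needed to carry blob_len bytes. */
size_t model_frags_needed(size_t blob_len)
{
	return blob_len / MODEL_PAGE_SIZE + (blob_len % MODEL_PAGE_SIZE != 0);
}
```

Under these assumptions a 65536-byte blob needs 16 fragments, while
anything up to one page needs a single fragment.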

Thank you

> 
> Thanks,
> 
> Lukas

^ permalink raw reply	[flat|nested] 22+ messages in thread

* Re: DRM_RAS for CPER Error logging?!
  2025-10-29  2:00   ` Zhang, Hawking
@ 2025-11-06 13:16     ` Rodrigo Vivi
  2025-11-10  3:34       ` Dave Airlie
  0 siblings, 1 reply; 22+ messages in thread
From: Rodrigo Vivi @ 2025-11-06 13:16 UTC (permalink / raw)
  To: Zhang, Hawking
  Cc: dri-devel@lists.freedesktop.org, intel-xe@lists.freedesktop.org,
	Dave Airlie, Joonas Lahtinen, Simona Vetter, Deucher, Alexander,
	Zack McKevitt, Lukas Wunner, Aravind Iddamsetty, Zhou1, Tao,
	Liu, Xiang(Dean)

On Wed, Oct 29, 2025 at 02:00:38AM +0000, Zhang, Hawking wrote:
>    [AMD Official Use Only - AMD Internal Distribution Only]                     
>    + [1]@Zhou1, Tao and [2]@Liu, Xiang(Dean) for the awareness.                 
>                                                                                 
>    RE - AMD folks, would you consider this to replace the current debugfs you   
>    have?                                                                        
>                                                                                 
>    [Hawking]:                                                                   
>                                                                                 
>    Replacing the debugfs is not the primary concern.

My initial plan was to go with debugfs like you are doing, but
I keep hearing complaints that debugfs is not global, and we need
to take into account the cases where debugfs is not available
in production images.

> The main concern is        
>    whether drm_ras can effectively support the necessary RAS information for    
>    all device vendors, as this largely depends on the design of the hardware    
>    and firmware.                                                                

I fully agree. This is the main reason I'm doing my best to make drm-ras
as generic and extensible as possible, with node registration that supports
different node types and names.

I imagined something like:

[{'FRU': 'String with device info', 'CPER': !@#$#!@#$},

based on the format that the current non-standard-cper tracefs uses, with
the FRU + CPER. But we could drop the FRU field and use the FRU as the node name.
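
To make the idea of dumping the latest known records from a ring buffer
more concrete, here is a userspace sketch. Nothing in it is part of the
proposal: the model_* names, the slot count and the per-record size limit
are all hypothetical, and the real record layout is still to be defined.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

#define MODEL_RING_SLOTS 4	/* hypothetical ring capacity */
#define MODEL_CPER_MAX 64	/* hypothetical per-record size limit */

struct model_cper_rec {
	size_t len;
	unsigned char data[MODEL_CPER_MAX];
};

struct model_cper_ring {
	struct model_cper_rec recs[MODEL_RING_SLOTS];
	size_t head;	/* next slot to write */
	size_t count;	/* valid records, at most MODEL_RING_SLOTS */
};

/* Store a record, overwriting the oldest one once the ring is full. */
int model_ring_push(struct model_cper_ring *ring, const void *blob, size_t len)
{
	struct model_cper_rec *rec;

	assert(len == 0 || blob != NULL);
	if (len > MODEL_CPER_MAX)
		return -1;	/* too large for this model */

	rec = &ring->recs[ring->head];
	memcpy(rec->data, blob, len);
	rec->len = len;
	ring->head = (ring->head + 1) % MODEL_RING_SLOTS;
	if (ring->count < MODEL_RING_SLOTS)
		ring->count++;
	return 0;
}

/* Models a query-logs dump: copy records out oldest first. */
size_t model_ring_dump(const struct model_cper_ring *ring,
		       struct model_cper_rec *out, size_t out_cap)
{
	size_t start = (ring->head + MODEL_RING_SLOTS - ring->count) %
		       MODEL_RING_SLOTS;
	size_t n = ring->count < out_cap ? ring->count : out_cap;
	size_t i;

	for (i = 0; i < n; i++)
		out[i] = ring->recs[(start + i) % MODEL_RING_SLOTS];
	return n;
}
```

After six pushes into the four-slot ring, a dump returns the third to
sixth records in order, which is the "latest known records" behaviour the
query would expose.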

>                                                                                 
>    AMD is currently evaluating the proposed interface for error logging.        

The design details and the implementation are pretty much open for discussion
at this point.

What I'm really looking for is to know whether the path is acceptable overall,
even if different drivers opt for different node types.

Is there any blocker on using this drm-ras/netlink for the CPER?

Thanks,
Rodrigo.


>                                                                                 
>    Regards,                                                                     
>    Hawking                                                                      
>                                                                                 
>    -----Original Message-----                                                   
>    From: Rodrigo Vivi <rodrigo.vivi@intel.com>                                  
>    Sent: Wednesday, October 29, 2025 03:13                                      
>    To: dri-devel@lists.freedesktop.org; intel-xe@lists.freedesktop.org; Dave    
>    Airlie <airlied@gmail.com>; Joonas Lahtinen                                  
>    <joonas.lahtinen@linux.intel.com>; Simona Vetter <simona.vetter@ffwll.ch>;   
>    Zhang, Hawking <Hawking.Zhang@amd.com>; Deucher, Alexander                   
>    <Alexander.Deucher@amd.com>; Zack McKevitt                                   
>    <zachary.mckevitt@oss.qualcomm.com>; Lukas Wunner <lukas@wunner.de>;         
>    Aravind Iddamsetty <aravind.iddamsetty@linux.intel.com>                      
>    Cc: Zhang, Hawking <Hawking.Zhang@amd.com>; Deucher, Alexander               
>    <Alexander.Deucher@amd.com>; Zack McKevitt                                   
>    <zachary.mckevitt@oss.qualcomm.com>; Lukas Wunner <lukas@wunner.de>; Dave    
>    Airlie <airlied@gmail.com>; Simona Vetter <simona.vetter@ffwll.ch>;          
>    Aravind Iddamsetty <aravind.iddamsetty@linux.intel.com>; Joonas Lahtinen     
>    <joonas.lahtinen@linux.intel.com>                                            
>    Subject: DRM_RAS for CPER Error logging?!                                    
>                                                                                 
>    On Mon, Sep 29, 2025 at 05:44:12PM -0400, Rodrigo Vivi wrote:                
>                                                                                 
>    Hey Dave, Sima, AMD folks, Qualcomm folks,                                   
>                                                                                 
>    I have a key question to you below here.                                     
>                                                                                 
>    > This work is a continuation of the great work started by Aravind ([1]      
>    > and [2]) in order to fulfill the RAS requirements and proposal as          
>    > previously discussed and agreed in the Linux Plumbers accelerator's bof    
>    of 2022 [3].                                                                 
>    >                                                                            
>    > [1]:                                                                       
>    >                                                                            
>    [3]https://lore.kernel.org/dri-devel/20250730064956.1385855-1-aravind.idd    
>    > [4]amsetty@linux.intel.com/                                                
>    > [2]:                                                                       
>    >                                                                            
>    [5]https://lore.kernel.org/all/4cbdfcc5-5020-a942-740e-a602d4c00cc2@linux    
>    > .intel.com/                                                                
>    > [3]:                                                                       
>    >                                                                            
>    [6]https://airlied.blogspot.com/2022/09/accelerators-bof-outcomes-summary    
>    > .html                                                                      
>    >                                                                            
>    > During the past review round, Lukas pointed out that netlink had           
>    > evolved in parallel during these years and that now, any new usage of      
>    > netlink families would require the usage of the YAML description and       
>    scripts.                                                                     
>    >                                                                            
>    > With this new requirement in place, the family name is hardcoded in        
>    > the yaml file, so we are forced to have a single family name for the       
>    > entire drm, and then we now we are forced to have a registration.          
>    >                                                                            
>    > So, while doing the registration, we now created the concept of            
>    drm-ras-node.                                                                
>    > For now the only node type supported is the agreed error-counter. But      
>    > that could be expanded for other cases like telemetry, requested by        
>    > Zack for the qualcomm accel driver.                                        
>    >                                                                            
>    > In this first version, only querying counter is supported. But also        
>    > this is expandable to future introduction of multicast notification and    
>    also clearing the counters.                                                  
>    >                                                                            
>    > This design with multiple nodes per device is already flexible enough      
>    > for driver to decide if it wants to handle error per device, or per IP     
>    > block, or per error category. I believe this fully attend to the           
>    > requested AMD feedback in the earlier reviews.                             
>    >                                                                            
>    > So, my proposal is to start simple with this case as is, and then          
>    > iterate over with the drm-ras in tree so we evolve together according      
>    > to various driver's RAS needs.                                             
>    >                                                                            
>    > I have provided a documentation and the first Xe implementation of the     
>    > counter as reference.                                                      
>    >                                                                            
>    > Also, it is worth to mention that we have a in-tree pyynl/cli.py tool      
>    > that entirely exercises this new API, hence I hope this can be the         
>    > reference code for the uAPI usage, while we continue with the plan of      
>    > introducing IGT tests and tools for this and adjusting the internal        
>    > vendor tools to open with open source developments and changing them to    
>    support these flows.                                                         
>    >                                                                            
>    > Example on MTL:                                                            
>    >                                                                            
>    > $ sudo ./tools/net/ynl/pyynl/cli.py \                                      
>    >   --spec Documentation/netlink/specs/drm_ras.yaml \                        
>    >   --dump list-nodes                                                        
>    > [{'device-name': '00:02.0',                                                
>    >   'node-id': 0,                                                            
>    >   'node-name': 'non-fatal',                                                
>    >   'node-type': 'error-counter'},                                           
>    >  {'device-name': '00:02.0',                                                
>    >   'node-id': 1,                                                            
>    >   'node-name': 'correctable',                                              
>    >   'node-type': 'error-counter'}]                                           
>                                                                                 
>    As you can see on the drm-ras patch, we now have only a single family        
>    called 'drm-ras', with that we have to register entry points, called         
>    'nodes'                                                                      
>    and for now only one type is existing: 'error-counter'                       
>                                                                                 
>    As I believe it was agreed in the Linux Plumbers accelerator's bof of 2022   
>    [3].                                                                         
>                                                                                 
>    Zack already indicated that for Qualcomm he doesn't need the error           
>    counters, but another type, perhaps telemetry.                               
>                                                                                 
>    I need your feedback and input on yet another case here that goes side by    
>    side with error-counters: Error logging.                                     
>                                                                                 
>    One of the RAS requirements that we have is to emit CPER logs in certain     
>    cases. AMD is currently using debugfs for printing the CPER entries that     
>    accumulates in a ringbuffer. (iiuc).                                         
>                                                                                 
>    Some folks are asking us to emit the CPER in the tracefs because debugfs     
>    might not be available in some enterprise production images.                 
>                                                                                 
>    However, there's a concern on the tracefs usage for the error-logging        
>    case.                                                                        
>    There is no active query path in the tracefs. If user needs to poll for      
>    the latest CPER records it would need to pig-back on some other API that     
>    would force the emit-trace(cper).                                            
>                                                                                 
>    I believe that the cleanest way is to have another drm-ras node type named   
>    'error-logging' with a single operation that is query-logs, that would be    
>    a dump of the available ring-buffer with latest known cper records. Is       
>    this acceptable?                                                             
>                                                                                 
>    AMD folks, would you consider this to replace the current debugfs you        
>    have?                                                                        
>                                                                                 
>    Please let me know your thoughts.                                            
>                                                                                 
>    We won't have an example for now, but it would be something like:            
>                                                                                 
>    Thanks,                                                                      
>    Rodrigo.                                                                     
>                                                                                 
>    $ sudo ./tools/net/ynl/pyynl/cli.py \                                        
>      --spec Documentation/netlink/specs/drm_ras.yaml \                          
>      --dump list-nodes                                                          
>    [{'device-name': '00:02.0',                                                  
>      'node-id': 0,                                                              
>      'node-name': 'non-fatal',                                                  
>      'node-type': 'error-counter'},                                             
>    {'device-name': '00:02.0',                                                   
>      'node-id': 1,                                                              
>      'node-name': 'correctable',                                                
>      'node-type': 'error-counter'}                                              
>    'device-name': '00:02.0',                                                    
>      'node-id': 2,                                                              
>      'node-name': 'non-fatal',                                                  
>      'node-type': 'error-logging'},                                             
>    {'device-name': '00:02.0',                                                   
>      'node-id': 3,                                                              
>      'node-name': 'correctable',                                                
>      'node-type': 'error-logging'}]                                             
>                                                                                 
>    $ sudo ./tools/net/ynl/pyynl/cli.py \                                        
>       --spec Documentation/netlink/specs/drm_ras.yaml \                         
>       --dump get-logs --json '{"node-id":3}'                                    
>    [{'FRU': 'String with device info', 'CPER': !@#$#!@#$},                      
>    {'FRU': 'String with device info', 'CPER': !@#$#!@#$},                       
>    {'FRU': 'String with device info', 'CPER': !@#$#!@#$},                       
>    {'FRU': 'String with device info', 'CPER': !@#$#!@#$},                       
>    {'FRU': 'String with device info', 'CPER': !@#$#!@#$}, ]                     
>                                                                                 
>    Of course, details of the error-logging fields along with the CPER binary    
>    is yet to be defined.                                                        
>                                                                                 
>    Oh, and the nodes names and split is device specific. The infra is           
>    flexible enough. Driver can do whatever it makes sense for their device.     
>                                                                                 
>    Any feedback or comment is really appreciated.                               
>                                                                                 
>    Thanks in advance,                                                           
>    Rodrigo.                                                                     
>                                                                                 
>    >                                                                            
>    > $ sudo ./tools/net/ynl/pyynl/cli.py \                                      
>    >   --spec Documentation/netlink/specs/drm_ras.yaml \                        
>    >   --dump get-error-counters --json '{"node-id":1}'                         
>    > [{'error-id': 0, 'error-name': 'GT Error', 'error-value': 0},              
>    >  {'error-id': 4, 'error-name': 'Display Error', 'error-value': 0},         
>    >  {'error-id': 8, 'error-name': 'GSC Error', 'error-value': 0},             
>    >  {'error-id': 12, 'error-name': 'SG Unit Error', 'error-value': 0},        
>    >  {'error-id': 16, 'error-name': 'SoC Error', 'error-value': 0},            
>    >  {'error-id': 17, 'error-name': 'CSC Error', 'error-value': 0}]            
>    >                                                                            
>    > $ sudo ./tools/net/ynl/pyynl/cli.py \                                      
>    >   --spec Documentation/netlink/specs/drm_ras.yaml \                        
>    >   --do query-error-counter --json '{"node-id": 0, "error-id": 12}'         
>    > {'error-id': 12, 'error-name': 'SG Unit Error', 'error-value': 0}          
>    >                                                                            
>    > $ sudo ./tools/net/ynl/pyynl/cli.py \                                      
>    >   --spec Documentation/netlink/specs/drm_ras.yaml \                        
>    >   --do query-error-counter --json '{"node-id": 1, "error-id": 16}'         
>    > {'error-id': 16, 'error-name': 'SoC Error', 'error-value': 0}              
>    >                                                                            
>    > Thanks,                                                                    
>    > Rodrigo.                                                                   
>    >                                                                            
>    > Cc: Hawking Zhang <[7]Hawking.Zhang@amd.com>                               
>    > Cc: Alex Deucher <[8]alexander.deucher@amd.com>                            
>    > Cc: Zack McKevitt <[9]zachary.mckevitt@oss.qualcomm.com>                   
>    > Cc: Lukas Wunner <[10]lukas@wunner.de>                                     
>    > Cc: Dave Airlie <[11]airlied@gmail.com>                                    
>    > Cc: Simona Vetter <[12]simona.vetter@ffwll.ch>                             
>    > Cc: Aravind Iddamsetty <[13]aravind.iddamsetty@linux.intel.com>            
>    > Cc: Joonas Lahtinen <[14]joonas.lahtinen@linux.intel.com>                  
>    > Signed-off-by: Rodrigo Vivi <[15]rodrigo.vivi@intel.com>                   
>    >                                                                            
>    > Rodrigo Vivi (2):                                                          
>    >   drm/ras: Introduce the DRM RAS infrastructure over generic netlink       
>    >   drm/xe: Introduce the usage of drm_ras with supported HW errors          
>    >                                                                            
>    >  Documentation/gpu/drm-ras.rst              | 109 +++++++                  
>    >  Documentation/netlink/specs/drm_ras.yaml   | 130 ++++++++                 
>    >  drivers/gpu/drm/Kconfig                    |   9 +                        
>    >  drivers/gpu/drm/Makefile                   |   1 +                        
>    >  drivers/gpu/drm/drm_drv.c                  |   6 +                        
>    >  drivers/gpu/drm/drm_ras.c                  | 357 +++++++++++++++++++++    
>    >  drivers/gpu/drm/drm_ras_genl_family.c      |  42 +++                      
>    >  drivers/gpu/drm/drm_ras_nl.c               |  54 ++++                     
>    >  drivers/gpu/drm/xe/regs/xe_hw_error_regs.h |  22 ++                       
>    >  drivers/gpu/drm/xe/xe_hw_error.c           | 155 ++++++++-                
>    >  include/drm/drm_ras.h                      |  76 +++++                    
>    >  include/drm/drm_ras_genl_family.h          |  17 +                        
>    >  include/drm/drm_ras_nl.h                   |  24 ++                       
>    >  include/uapi/drm/drm_ras.h                 |  49 +++                      
>    >  14 files changed, 1049 insertions(+), 2 deletions(-)
>    >  create mode 100644 Documentation/gpu/drm-ras.rst
>    >  create mode 100644 Documentation/netlink/specs/drm_ras.yaml
>    >  create mode 100644 drivers/gpu/drm/drm_ras.c
>    >  create mode 100644 drivers/gpu/drm/drm_ras_genl_family.c
>    >  create mode 100644 drivers/gpu/drm/drm_ras_nl.c
>    >  create mode 100644 include/drm/drm_ras.h
>    >  create mode 100644 include/drm/drm_ras_genl_family.h
>    >  create mode 100644 include/drm/drm_ras_nl.h
>    >  create mode 100644 include/uapi/drm/drm_ras.h
>    >                                                                            
>    > --                                                                         
>    > 2.51.0                                                                     
>    >                                                                            
> 
> References
> 
>    Visible links
>    1. mailto:Tao.Zhou1@amd.com
>    2. mailto:Xiang.Liu@amd.com
>    3. https://lore.kernel.org/dri-devel/20250730064956.1385855-1-aravind.idd
>    4. mailto:amsetty@linux.intel.com/
>    5. https://lore.kernel.org/all/4cbdfcc5-5020-a942-740e-a602d4c00cc2@linux
>    6. https://airlied.blogspot.com/2022/09/accelerators-bof-outcomes-summary
>    7. mailto:Hawking.Zhang@amd.com
>    8. mailto:alexander.deucher@amd.com
>    9. mailto:zachary.mckevitt@oss.qualcomm.com
>   10. mailto:lukas@wunner.de
>   11. mailto:airlied@gmail.com
>   12. mailto:simona.vetter@ffwll.ch
>   13. mailto:aravind.iddamsetty@linux.intel.com
>   14. mailto:joonas.lahtinen@linux.intel.com
>   15. mailto:rodrigo.vivi@intel.com

^ permalink raw reply	[flat|nested] 22+ messages in thread

* Re: [PATCH 1/2] drm/ras: Introduce the DRM RAS infrastructure over generic netlink
  2025-10-31  1:32   ` Jakub Kicinski
@ 2025-11-06 13:30     ` Rodrigo Vivi
  2025-11-06 14:58       ` Jakub Kicinski
  0 siblings, 1 reply; 22+ messages in thread
From: Rodrigo Vivi @ 2025-11-06 13:30 UTC (permalink / raw)
  To: Jakub Kicinski
  Cc: dri-devel, intel-xe, Zack McKevitt, Lukas Wunner, Lijo Lazar,
	Hawking Zhang, Aravind Iddamsetty

On Thu, Oct 30, 2025 at 06:32:54PM -0700, Jakub Kicinski wrote:
> On Mon, 29 Sep 2025 17:44:13 -0400 Rodrigo Vivi wrote:
> > Introduces the DRM RAS infrastructure over generic netlink.
> 
> Can't comment on the merits but in terms of netlink..
> 
> > +    ./tools/net/ynl/pyynl/cli.py --spec Documentation/netlink/specs/drm_ras.yaml --dump list-nodes
> 
> We recommend using the "installed" syntax in examples, so:
> 
> 	ynl --family drm_ras
> 
> instead of:
> 
> 	./tools/net/ynl/pyynl/cli.py --spec
> 	Documentation/netlink/specs/drm_ras.yaml 

That's really neat. Thank you

$ sudo ynl --family drm_ras --dump list-nodes
[{'device-name': '00:02.0',
  'node-id': 0,
  'node-name': 'non-fatal',
  'node-type': 'error-counter'},
 {'device-name': '00:02.0',
  'node-id': 1,
  'node-name': 'correctable',
  'node-type': 'error-counter'}]

> 
> If you're using Fedora or another good distro ynl CLI is packaged (for
> Fedora in kernel-tools). The in-tree syntax is a bit verbose.

I didn't know this tool was being packaged with kernel-tools.
I thought it was only a debugging aid during development.

Now I'm even wondering whether we really need to write a user-space tool
for drm-ras, or can simply recommend the kernel-tools ynl CLI as
the official consumer of this API.

> 
> > +	xa_for_each(&drm_ras_xa, id, node) {
> > +		if (id < ctx->restart)
> > +			continue;
> 
> IIRC xa_for_each_start can make this simpler?

Indeed. I will try that in the next revision.

> 
> > +		hdr = genlmsg_put(skb, NETLINK_CB(cb->skb).portid,
> > +				  cb->nlh->nlmsg_seq,
> > +				  &drm_ras_nl_family, NLM_F_MULTI,
> > +				  DRM_RAS_CMD_LIST_NODES);
> 
> genlmsg_iput()
> genl_info_dump(cb) to get info

I was following other examples that use genl,
but I will try this in the next round. Thanks.

> 
> > +		if (!hdr) {
> > +			ret = -EMSGSIZE;
> > +			break;
> > +		}
> > +
> > +		ret = nla_put_u32(skb, DRM_RAS_A_NODE_ATTRS_NODE_ID, node->id);
> > +		if (ret) {
> > +			genlmsg_cancel(skb, hdr);
> > +			break;
> > +		}
> > +
> > +		ret = nla_put_string(skb, DRM_RAS_A_NODE_ATTRS_DEVICE_NAME,
> > +				     node->device_name);
> > +		if (ret) {
> > +			genlmsg_cancel(skb, hdr);
> > +			break;
> > +		}
> > +
> > +		ret = nla_put_string(skb, DRM_RAS_A_NODE_ATTRS_NODE_NAME,
> > +				     node->node_name);
> > +		if (ret) {
> > +			genlmsg_cancel(skb, hdr);
> > +			break;
> > +		}
> > +
> > +		ret = nla_put_u32(skb, DRM_RAS_A_NODE_ATTRS_NODE_TYPE,
> > +				  node->type);
> > +		if (ret) {
> > +			genlmsg_cancel(skb, hdr);
> > +			break;
> > +		}
> > +
> > +		genlmsg_end(skb, hdr);
> > +	}
> > +
> > +	if (ret == -EMSGSIZE) {
> > +		ctx->restart = id;
> > +		return skb->len;
> > +	}
> > +
> > +	return ret;
> 
> Separate handling of -EMSGSIZE and returning skb->len is not necessary
> as of a few releases ago. Just return ret; core will do the right thing
> if ret == -EMSGSIZE and skb->len != 0

Any good modern example that I could get the right inspiration from?

> 
> 
> > +static int doit_reply_value(struct genl_info *info, u32 node_id,
> > +			    u32 error_id)
> > +{
> > +	struct sk_buff *msg;
> > +	struct nlattr *hdr;
> > +	const char *error_name;
> > +	u32 value;
> > +	int ret;
> > +
> > +	msg = genlmsg_new(NLMSG_GOODSIZE, GFP_KERNEL);
> > +	if (!msg)
> > +		return -ENOMEM;
> > +	hdr = genlmsg_put_reply(msg, info, &drm_ras_nl_family, 0,
> > +				DRM_RAS_CMD_QUERY_ERROR_COUNTER);
> > +	if (!hdr) {
> > +		nlmsg_free(msg);
> > +		return -EMSGSIZE;
> > +	}
> > +
> > +	ret = get_node_error_counter(node_id, error_id,
> > +				     &error_name, &value);
> > +	if (ret)
> > +		return ret;
> > +
> > +	ret = msg_reply_value(msg, error_id, error_name, value);
> > +	if (ret)
> > +		return ret;
> 
> Leaking message on errors?

Good catch! Will fix.

Thanks a lot,
Rodrigo.

> 
> > +	genlmsg_end(msg, hdr);
> > +
> > +	return genlmsg_reply(msg, info);
> > +}
> 

^ permalink raw reply	[flat|nested] 22+ messages in thread

* Re: [PATCH 0/2] Introduce DRM_RAS using generic netlink for RAS
  2025-10-02 20:38 ` [PATCH 0/2] Introduce DRM_RAS using generic netlink for RAS Zack McKevitt
  2025-10-28 19:14   ` Rodrigo Vivi
@ 2025-11-06 13:42   ` Rodrigo Vivi
  2025-11-07 20:20     ` Zack McKevitt
  1 sibling, 1 reply; 22+ messages in thread
From: Rodrigo Vivi @ 2025-11-06 13:42 UTC (permalink / raw)
  To: Zack McKevitt, netdev, David S. Miller, Eric Dumazet,
	Jakub Kicinski, Paolo Abeni, Simon Horman
  Cc: dri-devel, intel-xe, Hawking Zhang, Alex Deucher, Lukas Wunner,
	Dave Airlie, Simona Vetter, Aravind Iddamsetty, Joonas Lahtinen

On Thu, Oct 02, 2025 at 02:38:47PM -0600, Zack McKevitt wrote:
> I think this looks good, adding telemetry functionality as a node type and
> in the yaml spec looks straightforward (despite some potential naming
> awkwardness with the RAS module). Thanks for adding this.
> 
> Have you considered how this might work for containerized workloads?

From the use cases that we have, we are already expecting network=host,
so there shouldn't be any problem for this usage.

> Specifically, I think it would be best if the underlying drm_ras nodes are
> only accessible for containerized workloads where the device has been
> explicitly passed in. Do you know if this is handled automatically with the
> existing netlink implementation? I imagine that this would be of interest to
> the broader community outside of Qualcomm as well.

My understanding is that it is, but I am adding the netlink mailing list and
maintainers here for more specialized eyes.

> 
> > Also, it is worth to mention that we have a in-tree pyynl/cli.py tool that entirely
> > exercises this new API, hence I hope this can be the reference code for the uAPI
> > usage, while we continue with the plan of introducing IGT tests and tools for this
> > and adjusting the internal vendor tools to open with open source developments and
> > changing them to support these flows.
> 
> I think it would be nice to see some accompanying userspace code that makes
> use of this implementation to have as a reference if at all possible.

We have some folks working on the userspace tools, but I just realized that
perhaps we don't even need that and could simply use the
kernel-tools ynl CLI as the official drm-ras consumer?

$ sudo ynl --family drm_ras --dump list-nodes
[{'device-name': '00:02.0',
  'node-id': 0,
  'node-name': 'non-fatal',
  'node-type': 'error-counter'},
 {'device-name': '00:02.0',
  'node-id': 1,
  'node-name': 'correctable',
  'node-type': 'error-counter'}]

thoughts?

> 
> As a side note, I will be on vacation for a couple of weeks as of this
> weekend and my response time will be affected.

Thank you,
Please let me know if you have further thoughts here, or if you see any blocker
or an ack to move forward with this path.

Thanks,
Rodrigo.

> 
> Thanks,
> 
> Zack

^ permalink raw reply	[flat|nested] 22+ messages in thread

* Re: [PATCH 1/2] drm/ras: Introduce the DRM RAS infrastructure over generic netlink
  2025-11-06 13:30     ` Rodrigo Vivi
@ 2025-11-06 14:58       ` Jakub Kicinski
  0 siblings, 0 replies; 22+ messages in thread
From: Jakub Kicinski @ 2025-11-06 14:58 UTC (permalink / raw)
  To: Rodrigo Vivi
  Cc: dri-devel, intel-xe, Zack McKevitt, Lukas Wunner, Lijo Lazar,
	Hawking Zhang, Aravind Iddamsetty

On Thu, 6 Nov 2025 08:30:37 -0500 Rodrigo Vivi wrote:
> > If you're using Fedora or another good distro ynl CLI is packaged (for
> > Fedora in kernel-tools). The in-tree syntax is a bit verbose.  
> 
> I didn't know this tool was being packaged with kernel-tools.
> I thought it was only a debugging aid during development.
> 
> Now I'm even wondering whether we really need to write a user-space tool
> for drm-ras, or can simply recommend the kernel-tools ynl CLI as
> the official consumer of this API.

Right, it depends on the intended use of the API. In many cases,
especially for configuration interfaces, we no longer write separate
CLI tools. But for certain things typing in the JSON gets a bit
tedious, and other cases need some sort of summarization if the kernel
output is too verbose. So YMMV.
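
As an illustration of the summarization case, here is a minimal sketch of a
wrapper that post-processes the text the ynl CLI prints. It assumes the output
is a Python literal shaped like the get-error-counters dumps shown earlier in
this thread. The helper names (parse_ynl_output, summarize_counters) are
hypothetical, and the sample counter values are made up for the example.

```python
import ast

# Sample text in the shape of the get-error-counters dump from this thread.
# The non-zero values are invented purely for illustration.
SAMPLE = """
[{'error-id': 0, 'error-name': 'GT Error', 'error-value': 0},
 {'error-id': 12, 'error-name': 'SG Unit Error', 'error-value': 3},
 {'error-id': 16, 'error-name': 'SoC Error', 'error-value': 1}]
"""


def parse_ynl_output(text):
    """Turn the Python-literal text printed by the CLI into Python objects."""
    # literal_eval only accepts literals, so no code from the text is executed.
    return ast.literal_eval(text.strip())


def summarize_counters(counters):
    """Return (total, nonzero), where nonzero maps error-name to its value."""
    nonzero = {c["error-name"]: c["error-value"]
               for c in counters if c["error-value"]}
    total = sum(c["error-value"] for c in counters)
    return total, nonzero


if __name__ == "__main__":
    total, nonzero = summarize_counters(parse_ynl_output(SAMPLE))
    print(f"{total} errors; non-zero counters: {nonzero}")
```

A real wrapper would obtain the text by running the CLI (for example through
subprocess) rather than from a constant; that part is left out here because it
depends on how the tool is installed.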

> > Separate handling of -EMSGSIZE and returning skb->len is not necessary
> > as of a few releases ago. Just return ret; core will do the right thing
> > if ret == -EMSGSIZE and skb->len != 0  
> 
> Any good modern example that I could get the right inspiration from?

It's a moving target but:

net/core/netdev-genl.c
net/psp/psp_nl.c
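
For readers less familiar with the rule quoted above, the following is a toy
userspace model of it, not kernel code. It assumes a dump callback that keeps
a restart cursor (like ctx->restart in the patch) and returns -EMSGSIZE when
the buffer is full, and a core loop that treats -EMSGSIZE with a non-empty
buffer as "more to come". The names dumpit and core_dump, the list standing in
for the skb, and a capacity counted in entries rather than bytes are all
simplifications made for this sketch.

```python
import errno


def dumpit(nodes, ctx, skb, capacity):
    """Model of a dump callback: append nodes to skb until it is full."""
    for idx in range(ctx["restart"], len(nodes)):
        if len(skb) >= capacity:
            ctx["restart"] = idx          # resume from here on the next call
            return -errno.EMSGSIZE        # just return it, no skb->len games
        skb.append(nodes[idx])
    ctx["restart"] = len(nodes)
    return 0


def core_dump(nodes, capacity):
    """Model of the core loop: call dumpit with a fresh buffer each round."""
    ctx, batches = {"restart": 0}, []
    while True:
        skb = []
        ret = dumpit(nodes, ctx, skb, capacity)
        if ret == -errno.EMSGSIZE and skb:
            batches.append(skb)           # buffer filled up, more to dump
            continue
        if ret < 0:
            # An error with nothing written is a genuine failure.
            raise OSError(-ret, "dump failed")
        if skb:
            batches.append(skb)
        return batches


if __name__ == "__main__":
    print(core_dump(list(range(5)), capacity=2))
```

The point of the model is that the callback never needs to translate
-EMSGSIZE into a positive length itself; it only has to remember where to
resume.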

^ permalink raw reply	[flat|nested] 22+ messages in thread

* Re: [PATCH 0/2] Introduce DRM_RAS using generic netlink for RAS
  2025-11-06 13:42   ` Rodrigo Vivi
@ 2025-11-07 20:20     ` Zack McKevitt
  2025-11-08  3:01       ` Rodrigo Vivi
  0 siblings, 1 reply; 22+ messages in thread
From: Zack McKevitt @ 2025-11-07 20:20 UTC (permalink / raw)
  To: Rodrigo Vivi, netdev, David S. Miller, Eric Dumazet,
	Jakub Kicinski, Paolo Abeni, Simon Horman
  Cc: dri-devel, intel-xe, Hawking Zhang, Alex Deucher, Lukas Wunner,
	Dave Airlie, Simona Vetter, Aravind Iddamsetty, Joonas Lahtinen



On 11/6/2025 6:42 AM, Rodrigo Vivi wrote:
>>
>>> Also, it is worth to mention that we have a in-tree pyynl/cli.py tool that entirely
>>> exercises this new API, hence I hope this can be the reference code for the uAPI
>>> usage, while we continue with the plan of introducing IGT tests and tools for this
>>> and adjusting the internal vendor tools to open with open source developments and
>>> changing them to support these flows.
>>
>> I think it would be nice to see some accompanying userspace code that makes
>> use of this implementation to have as a reference if at all possible.
> 
> We have some folks working on the userspace tools, but I just realized that
> perhaps we don't even need that and could simply use the
> kernel-tools ynl CLI as the official drm-ras consumer?
> 
> $ sudo ynl --family drm_ras --dump list-nodes
> [{'device-name': '00:02.0',
>    'node-id': 0,
>    'node-name': 'non-fatal',
>    'node-type': 'error-counter'},
>   {'device-name': '00:02.0',
>    'node-id': 1,
>    'node-name': 'correctable',
>    'node-type': 'error-counter'}]
> 
> thoughts?
> 

I think this is probably ok for demonstrating this patch's 
functionality, but some userspace code would be helpful as a reference 
for applications that might want to integrate this directly instead of 
relying on CLI tools.

>>
>> As a side note, I will be on vacation for a couple of weeks as of this
>> weekend and my response time will be affected.
> 
> Thank you,
> Please let me know if you have further thoughts here, or if you see any blocker
> or an ack to move forward with this path.
> 
> Thanks,
> Rodrigo.
> 

No further thoughts on the patch contents, I think it looks good. I see 
that Jakub posted some TODOs while I was away, so I assume there will be 
another iteration that I will take a look at if/when that comes in.

>>
>> Thanks,
>>
>> Zack

^ permalink raw reply	[flat|nested] 22+ messages in thread

* Re: [PATCH 0/2] Introduce DRM_RAS using generic netlink for RAS
  2025-11-07 20:20     ` Zack McKevitt
@ 2025-11-08  3:01       ` Rodrigo Vivi
  0 siblings, 0 replies; 22+ messages in thread
From: Rodrigo Vivi @ 2025-11-08  3:01 UTC (permalink / raw)
  To: Zack McKevitt
  Cc: netdev, David S. Miller, Eric Dumazet, Jakub Kicinski,
	Paolo Abeni, Simon Horman, dri-devel, intel-xe, Hawking Zhang,
	Alex Deucher, Lukas Wunner, Dave Airlie, Simona Vetter,
	Aravind Iddamsetty, Joonas Lahtinen

On Fri, Nov 07, 2025 at 01:20:03PM -0700, Zack McKevitt wrote:
> 
> 
> On 11/6/2025 6:42 AM, Rodrigo Vivi wrote:
> > > 
> > > > Also, it is worth to mention that we have a in-tree pyynl/cli.py tool that entirely
> > > > exercises this new API, hence I hope this can be the reference code for the uAPI
> > > > usage, while we continue with the plan of introducing IGT tests and tools for this
> > > > and adjusting the internal vendor tools to open with open source developments and
> > > > changing them to support these flows.
> > > 
> > > I think it would be nice to see some accompanying userspace code that makes
> > > use of this implementation to have as a reference if at all possible.
> > 
> > We have some folks working on the userspace tools, but I just realized that
> > perhaps we don't even need that and could simply use the
> > kernel-tools ynl CLI as the official drm-ras consumer?
> > 
> > $ sudo ynl --family drm_ras --dump list-nodes
> > [{'device-name': '00:02.0',
> >    'node-id': 0,
> >    'node-name': 'non-fatal',
> >    'node-type': 'error-counter'},
> >   {'device-name': '00:02.0',
> >    'node-id': 1,
> >    'node-name': 'correctable',
> >    'node-type': 'error-counter'}]
> > 
> > thoughts?
> > 
> 
> I think this is probably ok for demonstrating this patch's functionality,
> but some userspace code would be helpful as a reference for applications
> that might want to integrate this directly instead of relying on CLI tools.

It makes sense. So let's continue to have some IGT tool for this.

> 
> > > 
> > > As a side note, I will be on vacation for a couple of weeks as of this
> > > weekend and my response time will be affected.
> > 
> > Thank you,
> > Please let me know if you have further thoughts here, or if you see any blocker
> > or an ack to move forward with this path.
> > 
> > Thanks,
> > Rodrigo.
> > 
> 
> No further thoughts on the patch contents, I think it looks good. I see that
> Jakub posted some TODOs while I was away, so I assume there will be another
> iteration that I will take a look at if/when that comes in.

Yes, but the changes to the error counter are not that big: just better iteration,
small fixes, and a fixed driver API for the error ID and error string.

> 
> > > 
> > > Thanks,
> > > 
> > > Zack

^ permalink raw reply	[flat|nested] 22+ messages in thread

* Re: DRM_RAS for CPER Error logging?!
  2025-11-06 13:16     ` Rodrigo Vivi
@ 2025-11-10  3:34       ` Dave Airlie
  2025-11-10  5:13         ` John Hubbard
  2025-11-10 20:35         ` Rodrigo Vivi
  0 siblings, 2 replies; 22+ messages in thread
From: Dave Airlie @ 2025-11-10  3:34 UTC (permalink / raw)
  To: Rodrigo Vivi
  Cc: Zhang, Hawking, dri-devel@lists.freedesktop.org,
	intel-xe@lists.freedesktop.org, Joonas Lahtinen, Simona Vetter,
	Deucher, Alexander, Zack McKevitt, Lukas Wunner,
	Aravind Iddamsetty, Zhou1, Tao, Liu, Xiang(Dean), Jason Gunthorpe,
	Steven Rostedt, John Hubbard

On Thu, 6 Nov 2025 at 23:16, Rodrigo Vivi <rodrigo.vivi@intel.com> wrote:
>
> On Wed, Oct 29, 2025 at 02:00:38AM +0000, Zhang, Hawking wrote:
> >    [AMD Official Use Only - AMD Internal Distribution Only]
> >    + [1]@Zhou1, Tao and [2]@Liu, Xiang(Dean) for the awareness.
> >
> >    RE - AMD folks, would you consider this to replace the current debugfs you
> >    have?
> >
> >    [Hawking]:
> >
> >    Replacing the debugfs is not the primary concern.
>
> My initial plan was to go with debugfs like you are doing, but
> I keep hearing complains that debugfs is not global and we need
> to take into account some cases where debugfs is not available
> in production images.
>
> > The main concern is
> >    whether drm_ras can effectively support the necessary RAS information for
> >    all device vendors, as this largely depends on the design of the hardware
> >    and firmware.
>
> I fully agree. This is the main reason I'm doing my best to make the drm-ras
> the most generic and expansible as possible.
>
> node registration with different node types, and names.
>
> I imagined something like:
>
> [{'FRU': 'String with device info', 'CPER': !@#$#!@#$},
>
> based on the format that the current non-standard-cper tracefs uses, with
> the FRU + CPER. But we could avoid the FRU and make the FRU as node name.
>
> >
> >    AMD is currently evaluating the proposed interface for error logging.
>
> The design of the details and the implementation is pretty much open for discussion
> at this point.
>
> What I'm really looking forward is:
>
> to know if the path is acceptable overall
> even if different drivers are opting for different node types?
>
> Is there any blocker on using this drm-ras/netlink for the CPER?

sorry for delay on this, I just had to read what CPER was :-)

I'm not offended by the idea of using tracefs here, I definitely think
debugfs is a bad idea coming from the enterprise distro land where we
don't like having it.

I'm ccing a few other people that might have opinions on exposing CPER
compatible logs for RAS purposes from devices, I assume there might be
more than GPUs wanting to do something like this,

Dave.

^ permalink raw reply	[flat|nested] 22+ messages in thread

* Re: DRM_RAS for CPER Error logging?!
  2025-11-10  3:34       ` Dave Airlie
@ 2025-11-10  5:13         ` John Hubbard
  2025-11-10 20:35         ` Rodrigo Vivi
  1 sibling, 0 replies; 22+ messages in thread
From: John Hubbard @ 2025-11-10  5:13 UTC (permalink / raw)
  To: Dave Airlie, Rodrigo Vivi
  Cc: Zhang, Hawking, dri-devel@lists.freedesktop.org,
	intel-xe@lists.freedesktop.org, Joonas Lahtinen, Simona Vetter,
	Deucher, Alexander, Zack McKevitt, Lukas Wunner,
	Aravind Iddamsetty, Zhou1, Tao, Liu, Xiang(Dean), Jason Gunthorpe,
	Steven Rostedt, Will Davis

On 11/9/25 7:34 PM, Dave Airlie wrote:
> On Thu, 6 Nov 2025 at 23:16, Rodrigo Vivi <rodrigo.vivi@intel.com> wrote:
>>
>> On Wed, Oct 29, 2025 at 02:00:38AM +0000, Zhang, Hawking wrote:
>>>     [AMD Official Use Only - AMD Internal Distribution Only]
>>>     + [1]@Zhou1, Tao and [2]@Liu, Xiang(Dean) for the awareness.
>>>
>>>     RE - AMD folks, would you consider this to replace the current debugfs you
>>>     have?
>>>
>>>     [Hawking]:
>>>
>>>     Replacing the debugfs is not the primary concern.
>>
>> My initial plan was to go with debugfs like you are doing, but
>> I keep hearing complains that debugfs is not global and we need
>> to take into account some cases where debugfs is not available
>> in production images.
>>
>>> The main concern is
>>>     whether drm_ras can effectively support the necessary RAS information for
>>>     all device vendors, as this largely depends on the design of the hardware
>>>     and firmware.
>>
>> I fully agree. This is the main reason I'm doing my best to make the drm-ras
>> the most generic and expansible as possible.
>>
>> node registration with different node types, and names.
>>
>> I imagined something like:
>>
>> [{'FRU': 'String with device info', 'CPER': !@#$#!@#$},
>>
>> based on the format that the current non-standard-cper tracefs uses, with
>> the FRU + CPER. But we could avoid the FRU and make the FRU as node name.
>>
>>>
>>>     AMD is currently evaluating the proposed interface for error logging.
>>
>> The design of the details and the implementation is pretty much open for discussion
>> at this point.
>>
>> What I'm really looking forward is:
>>
>> to know if the path is acceptable overall
>> even if different drivers are opting for different node types?
>>
>> Is there any blocker on using this drm-ras/netlink for the CPER?
> 
> sorry for delay on this, I just had to read what CPER was :-)
> 
> I'm not offended by the idea of using tracefs here, I definitely think
> debugfs is a bad idea coming from the enterprise distro land where we
> don't like having it.
> 
> I'm ccing a few other people that might have opinions on exposing CPER
> compatible logs for RAS purposes from devices, I assume there might be
> more than GPUs wanting to do something like this,
> 
> Dave.

I'm adding Will Davis, who was looking into CPER for our Open RM driver.


thanks,
-- 
John Hubbard


^ permalink raw reply	[flat|nested] 22+ messages in thread

* Re: DRM_RAS for CPER Error logging?!
  2025-11-10  3:34       ` Dave Airlie
  2025-11-10  5:13         ` John Hubbard
@ 2025-11-10 20:35         ` Rodrigo Vivi
  1 sibling, 0 replies; 22+ messages in thread
From: Rodrigo Vivi @ 2025-11-10 20:35 UTC (permalink / raw)
  To: Dave Airlie
  Cc: Zhang, Hawking, dri-devel@lists.freedesktop.org,
	intel-xe@lists.freedesktop.org, Joonas Lahtinen, Simona Vetter,
	Deucher, Alexander, Zack McKevitt, Lukas Wunner,
	Aravind Iddamsetty, Zhou1, Tao, Liu, Xiang(Dean), Jason Gunthorpe,
	Steven Rostedt, John Hubbard

On Mon, Nov 10, 2025 at 01:34:22PM +1000, Dave Airlie wrote:
> On Thu, 6 Nov 2025 at 23:16, Rodrigo Vivi <rodrigo.vivi@intel.com> wrote:
> >
> > On Wed, Oct 29, 2025 at 02:00:38AM +0000, Zhang, Hawking wrote:
> > >    [AMD Official Use Only - AMD Internal Distribution Only]
> > >    + [1]@Zhou1, Tao and [2]@Liu, Xiang(Dean) for the awareness.
> > >
> > >    RE - AMD folks, would you consider this to replace the current debugfs you
> > >    have?
> > >
> > >    [Hawking]:
> > >
> > >    Replacing the debugfs is not the primary concern.
> >
> > My initial plan was to go with debugfs like you are doing, but
> > I keep hearing complains that debugfs is not global and we need
> > to take into account some cases where debugfs is not available
> > in production images.
> >
> > > The main concern is
> > >    whether drm_ras can effectively support the necessary RAS information for
> > >    all device vendors, as this largely depends on the design of the hardware
> > >    and firmware.
> >
> > I fully agree. This is the main reason I'm doing my best to make the drm-ras
> > the most generic and expansible as possible.
> >
> > node registration with different node types, and names.
> >
> > I imagined something like:
> >
> > [{'FRU': 'String with device info', 'CPER': !@#$#!@#$},
> >
> > based on the format that the current non-standard-cper tracefs uses, with
> > the FRU + CPER. But we could avoid the FRU and make the FRU as node name.
> >
> > >
> > >    AMD is currently evaluating the proposed interface for error logging.
> >
> > The design of the details and the implementation is pretty much open for discussion
> > at this point.
> >
> > What I'm really looking forward is:
> >
> > to know if the path is acceptable overall
> > even if different drivers are opting for different node types?
> >
> > Is there any blocker on using this drm-ras/netlink for the CPER?
> 
> sorry for delay on this, I just had to read what CPER was :-)
> 
> I'm not offended by the idea of using tracefs here,

Right, that was my first thought as well.
Perhaps we simply use the

log_non_standard_event(sec_type, fru_id, fru_text, sec_sev, cper_data, cper_length)

provided directly by drivers/ras/ras.c

But one limitation with that is that it only goes from HW/FW -> Kernel -> User Space.

There is no way for user space to query for the current/last log available.

I mean, we would only generate the CPER once a certain threshold is passed, to
avoid a flood in case of a memory error storm. So, in this case, the user needs
a way to query the most recent log.
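
To make the gap concrete, here is a toy model of the behaviour described
above, under stated assumptions: events are pushed only on every
`threshold`-th error (one possible reading of "passing a certain threshold"),
while the most recent record stays available for an explicit query. The class
and method names are hypothetical and do not correspond to any existing
kernel or drm-ras interface.

```python
class ErrorLog:
    """Toy model of threshold-gated event emission plus a query path."""

    def __init__(self, threshold):
        self.threshold = threshold
        self.count = 0
        self.latest = None   # most recent record, whether emitted or not
        self.emitted = []    # stands in for pushed, trace-style events

    def report(self, record):
        """Record one error; push an event only on every threshold-th one."""
        self.count += 1
        self.latest = record
        if self.count % self.threshold == 0:
            self.emitted.append(record)

    def query_latest(self):
        """The pull path that a push-only interface does not provide."""
        return self.latest


if __name__ == "__main__":
    log = ErrorLog(threshold=3)
    for i in range(7):
        log.report(f"cper-{i}")
    print("pushed:", log.emitted)
    print("latest:", log.query_latest())
```

With a push-only interface, a listener sees only the pushed events; the record
returned by query_latest() is the one that would otherwise be unreachable.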

I believe it gets a bit ugly if we tell the admin that, in order to get the most
recent CPER log, they need to query the error counter through netlink, and
that on every single error counter query we also emit the tracefs event.

Then I thought about using netlink to query the CPER, but with a separate
node, exclusively for error-log, instead of abusing the error counter API.

But if you believe it is okay to emit tracefs on every counter check, then
we can take that path.

> I definitely think
> debugfs is a bad idea coming from the enterprise distro land where we
> don't like having it.

Yep, this is why I thought that AMD was trying to find alternatives to
their debugfs solution. But the debugfs solution does allow
querying...

> 
> I'm ccing a few other people that might have opinions on exposing CPER
> compatible logs for RAS purposes from devices, I assume there might be
> more than GPUs wanting to do something like this,

Thank you!

> 
> Dave.

^ permalink raw reply	[flat|nested] 22+ messages in thread

end of thread, other threads:[~2025-11-10 20:35 UTC | newest]

Thread overview: 22+ messages (download: mbox.gz follow: Atom feed
-- links below jump to the message on this page --
2025-09-29 21:44 [PATCH 0/2] Introduce DRM_RAS using generic netlink for RAS Rodrigo Vivi
2025-09-29 21:44 ` [PATCH 1/2] drm/ras: Introduce the DRM RAS infrastructure over generic netlink Rodrigo Vivi
2025-10-31  1:32   ` Jakub Kicinski
2025-11-06 13:30     ` Rodrigo Vivi
2025-11-06 14:58       ` Jakub Kicinski
2025-09-29 21:44 ` [PATCH 2/2] drm/xe: Introduce the usage of drm_ras with supported HW errors Rodrigo Vivi
2025-09-30  2:07   ` kernel test robot
2025-10-02 20:38 ` [PATCH 0/2] Introduce DRM_RAS using generic netlink for RAS Zack McKevitt
2025-10-28 19:14   ` Rodrigo Vivi
2025-11-06 13:42   ` Rodrigo Vivi
2025-11-07 20:20     ` Zack McKevitt
2025-11-08  3:01       ` Rodrigo Vivi
2025-10-28 19:13 ` DRM_RAS for CPER Error logging?! Rodrigo Vivi
2025-10-29  2:00   ` Zhang, Hawking
2025-11-06 13:16     ` Rodrigo Vivi
2025-11-10  3:34       ` Dave Airlie
2025-11-10  5:13         ` John Hubbard
2025-11-10 20:35         ` Rodrigo Vivi
2025-10-30 14:47   ` Rodrigo Vivi
2025-10-30 15:37     ` DRM_RAS (netlink genl family) " Rodrigo Vivi
2025-10-31  5:38     ` DRM_RAS " Lukas Wunner
2025-11-06 13:08       ` Rodrigo Vivi
