[PATCH v3 0/4] Introduce DRM_RAS using generic netlink for RAS

* [PATCH v3 0/4] Introduce DRM_RAS using generic netlink for RAS
@ 2025-12-05  8:39 Riana Tauro
  2025-12-05  8:39 ` [PATCH v3 1/4] drm/ras: Introduce the DRM RAS infrastructure over generic netlink Riana Tauro
                   ` (8 more replies)
  0 siblings, 9 replies; 31+ messages in thread
From: Riana Tauro @ 2025-12-05  8:39 UTC (permalink / raw)
  To: intel-xe, dri-devel
  Cc: aravind.iddamsetty, anshuman.gupta, rodrigo.vivi, joonas.lahtinen,
	lukas, simona.vetter, airlied, pratik.bari, joshua.santosh.ranjan,
	ashwin.kumar.kulkarni, shubham.kumar, Riana Tauro

This work continues the great work started by Aravind ([1] and [2]) to fulfill
the RAS requirements and proposal previously discussed and agreed on at the
accelerators BoF at Linux Plumbers 2022 [3].

[1]: https://lore.kernel.org/dri-devel/20250730064956.1385855-1-aravind.iddamsetty@linux.intel.com/
[2]: https://lore.kernel.org/all/4cbdfcc5-5020-a942-740e-a602d4c00cc2@linux.intel.com/
[3]: https://airlied.blogspot.com/2022/09/accelerators-bof-outcomes-summary.html

During the last review round, Lukas pointed out that netlink has evolved in
parallel over these years and that any new netlink family is now required to use
the YAML description and the associated code-generation scripts.

With this new requirement in place, the family name is hardcoded in the YAML file,
so we are forced to have a single family name for the whole of drm, which in turn
forces us to have a registration step.

So, while doing the registration, we created the concept of a drm-ras-node.
For now the only node type supported is the agreed error-counter. But that could
be expanded to other cases like telemetry, as requested by Zack for the qualcomm
accel driver.

In this first version, only querying counters is supported. But this can be expanded
in the future with multicast notifications and with clearing of the counters.

This design with multiple nodes per device is already flexible enough for a driver
to decide whether it wants to handle errors per device, per IP block, or per error
category. I believe this fully addresses the feedback AMD gave in the earlier
reviews.
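
To make the node concept concrete, here is a minimal user-space sketch in Python
of a registry holding several nodes per device. It is a toy model, not the kernel
code: the class name, the lowest-free-ID policy and the fact that register()
returns the ID directly are assumptions made for illustration (the kernel helper
returns 0 and stores the ID in the node).

```python
import errno

# Mirrors the single node-type entry in drm_ras.yaml (value-start: 1).
NODE_TYPE_ERROR_COUNTER = 1


class NodeRegistry:
    """Toy user-space model of the node registry; not kernel code."""

    def __init__(self):
        self._nodes = {}  # node-id -> node attributes

    def register(self, device_name, node_name, node_type=NODE_TYPE_ERROR_COUNTER):
        # Both names are mandatory and only error-counter nodes are accepted.
        if not device_name or not node_name:
            return -errno.EINVAL
        if node_type != NODE_TYPE_ERROR_COUNTER:
            return -errno.EINVAL
        # Assumption for this sketch: the lowest free ID is handed out.
        node_id = 0
        while node_id in self._nodes:
            node_id += 1
        self._nodes[node_id] = {
            "node-id": node_id,
            "device-name": device_name,
            "node-name": node_name,
            "node-type": "error-counter",
        }
        return node_id

    def unregister(self, node_id):
        # Frees the ID, which a later registration may reuse.
        self._nodes.pop(node_id, None)

    def list_nodes(self):
        return [self._nodes[i] for i in sorted(self._nodes)]


# One device exposing one node per error category, as in the Xe example.
registry = NodeRegistry()
for name in ("correctable-errors", "nonfatal-errors", "fatal-errors"):
    registry.register("0000:03:00.0", name)
```

Since IDs are assigned at registration time and, in this model, can be reused
after an unregister, user space should not cache them across driver reloads.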

So, my proposal is to start simple with this case as is, and then iterate with
drm-ras in tree so that we evolve together according to the various drivers' RAS
needs.

I have provided documentation and the first Xe implementation of the counter
as a reference.

Also, it is worth mentioning that we have an in-tree pyynl/cli.py tool that fully
exercises this new API, hence I hope this can be the reference code for the uAPI
usage, while we continue with the plan of introducing IGT tests and tools for this,
and of opening up the internal vendor tools alongside the open source development
and changing them to support these flows.

Example:

$ sudo ynl --family drm_ras  --dump list-nodes
[{'device-name': '0000:03:00.0',
  'node-id': 0,
  'node-name': 'correctable-errors',
  'node-type': 'error-counter'},
 {'device-name': '0000:03:00.0',
  'node-id': 1,
  'node-name': 'nonfatal-errors',
  'node-type': 'error-counter'},
 {'device-name': '0000:03:00.0',
  'node-id': 2,
  'node-name': 'fatal-errors',
  'node-type': 'error-counter'}]

$ sudo ynl --family drm_ras  --dump get-error-counters --json '{"node-id":1}'
[{'error-id': 1, 'error-name': 'Core Compute Error', 'error-value': 0},
 {'error-id': 2, 'error-name': 'SOC Internal Error', 'error-value': 0}]

$ sudo ynl --family drm_ras --do query-error-counter  --json '{"node-id":1, "error-id":1}'
{'error-id': 1, 'error-name': 'Core Compute Error', 'error-value': 0}
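
The three commands above form an enumerate-then-query flow. A minimal Python
sketch of that flow follows. The client class is a hypothetical in-memory
stand-in, not a real netlink client and not part of this series; only the
operation and attribute names are taken from the YAML spec, and the data is
copied from the example output above (nodes 0 and 2 are left without counters
purely for brevity).

```python
class FakeDrmRasClient:
    """Hypothetical in-memory stand-in for a drm-ras netlink client."""

    def __init__(self, nodes, counters):
        self._nodes = nodes        # list of node attribute dicts
        self._counters = counters  # node-id -> list of counter dicts

    def dump(self, op, attrs=None):
        attrs = attrs or {}
        if op == "list-nodes":
            return list(self._nodes)
        if op == "get-error-counters":
            return list(self._counters.get(attrs["node-id"], []))
        raise ValueError(f"unsupported dump operation: {op}")

    def do(self, op, attrs):
        if op != "query-error-counter":
            raise ValueError(f"unsupported do operation: {op}")
        for counter in self._counters.get(attrs["node-id"], []):
            if counter["error-id"] == attrs["error-id"]:
                return dict(counter)
        raise LookupError("no such error counter")


def collect_counters(client):
    """Enumerate the nodes first, then read every counter of each node."""
    report = {}
    for node in client.dump("list-nodes"):
        key = (node["device-name"], node["node-name"])
        counters = client.dump("get-error-counters", {"node-id": node["node-id"]})
        report[key] = {c["error-name"]: c["error-value"] for c in counters}
    return report


NODES = [
    {"device-name": "0000:03:00.0", "node-id": 0,
     "node-name": "correctable-errors", "node-type": "error-counter"},
    {"device-name": "0000:03:00.0", "node-id": 1,
     "node-name": "nonfatal-errors", "node-type": "error-counter"},
    {"device-name": "0000:03:00.0", "node-id": 2,
     "node-name": "fatal-errors", "node-type": "error-counter"},
]
COUNTERS = {
    1: [
        {"error-id": 1, "error-name": "Core Compute Error", "error-value": 0},
        {"error-id": 2, "error-name": "SOC Internal Error", "error-value": 0},
    ],
}

client = FakeDrmRasClient(NODES, COUNTERS)
report = collect_counters(client)
single = client.do("query-error-counter", {"node-id": 1, "error-id": 1})
```

A monitoring tool would presumably key its data on the device and node names, as
done here, rather than on the dynamically assigned node IDs.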

IGT : https://patchwork.freedesktop.org/patch/689729/?series=157409&rev=3

Rev2: Fix review comments
      Add support for GT and SOC errors

Rev3: Add uAPI for errors and nodes
      Update documentation

Riana Tauro (3):
  drm/xe/xe_drm_ras: Add support for drm ras
  drm/xe/xe_hw_error: Add support for GT hardware errors
  drm/xe/xe_hw_error: Add support for PVC SOC errors

Rodrigo Vivi (1):
  drm/ras: Introduce the DRM RAS infrastructure over generic netlink

 Documentation/gpu/drm-ras.rst              | 109 +++++
 Documentation/gpu/index.rst                |   1 +
 Documentation/netlink/specs/drm_ras.yaml   | 130 ++++++
 drivers/gpu/drm/Kconfig                    |   9 +
 drivers/gpu/drm/Makefile                   |   1 +
 drivers/gpu/drm/drm_drv.c                  |   6 +
 drivers/gpu/drm/drm_ras.c                  | 351 ++++++++++++++++
 drivers/gpu/drm/drm_ras_genl_family.c      |  42 ++
 drivers/gpu/drm/drm_ras_nl.c               |  54 +++
 drivers/gpu/drm/xe/Makefile                |   1 +
 drivers/gpu/drm/xe/regs/xe_hw_error_regs.h |  68 ++++
 drivers/gpu/drm/xe/xe_device_types.h       |   4 +
 drivers/gpu/drm/xe/xe_drm_ras.c            | 199 +++++++++
 drivers/gpu/drm/xe/xe_drm_ras.h            |  12 +
 drivers/gpu/drm/xe/xe_drm_ras_types.h      |  40 ++
 drivers/gpu/drm/xe/xe_hw_error.c           | 444 +++++++++++++++++++--
 include/drm/drm_ras.h                      |  76 ++++
 include/drm/drm_ras_genl_family.h          |  17 +
 include/drm/drm_ras_nl.h                   |  24 ++
 include/uapi/drm/drm_ras.h                 |  49 +++
 include/uapi/drm/xe_drm.h                  |  82 ++++
 21 files changed, 1682 insertions(+), 37 deletions(-)
 create mode 100644 Documentation/gpu/drm-ras.rst
 create mode 100644 Documentation/netlink/specs/drm_ras.yaml
 create mode 100644 drivers/gpu/drm/drm_ras.c
 create mode 100644 drivers/gpu/drm/drm_ras_genl_family.c
 create mode 100644 drivers/gpu/drm/drm_ras_nl.c
 create mode 100644 drivers/gpu/drm/xe/xe_drm_ras.c
 create mode 100644 drivers/gpu/drm/xe/xe_drm_ras.h
 create mode 100644 drivers/gpu/drm/xe/xe_drm_ras_types.h
 create mode 100644 include/drm/drm_ras.h
 create mode 100644 include/drm/drm_ras_genl_family.h
 create mode 100644 include/drm/drm_ras_nl.h
 create mode 100644 include/uapi/drm/drm_ras.h

-- 
2.47.1

^ permalink raw reply	[flat|nested] 31+ messages in thread

* [PATCH v3 1/4] drm/ras: Introduce the DRM RAS infrastructure over generic netlink
  2025-12-05  8:39 [PATCH v3 0/4] Introduce DRM_RAS using generic netlink for RAS Riana Tauro
@ 2025-12-05  8:39 ` Riana Tauro
  2025-12-09 21:35   ` Rodrigo Vivi
  2025-12-05  8:39 ` [PATCH v3 2/4] drm/xe/xe_drm_ras: Add support for drm ras Riana Tauro
                   ` (7 subsequent siblings)
  8 siblings, 1 reply; 31+ messages in thread
From: Riana Tauro @ 2025-12-05  8:39 UTC (permalink / raw)
  To: intel-xe, dri-devel
  Cc: aravind.iddamsetty, anshuman.gupta, rodrigo.vivi, joonas.lahtinen,
	lukas, simona.vetter, airlied, pratik.bari, joshua.santosh.ranjan,
	ashwin.kumar.kulkarni, shubham.kumar, Zack McKevitt, Lijo Lazar,
	Hawking Zhang, Jakub Kicinski, David S. Miller, Paolo Abeni,
	Eric Dumazet, netdev, Riana Tauro

From: Rodrigo Vivi <rodrigo.vivi@intel.com>

Introduces the DRM RAS infrastructure over generic netlink.

The new interface allows drivers to expose RAS nodes and their
associated error counters to userspace in a structured and extensible
way. Each drm_ras node can register its own set of error counters, which
are then discoverable and queryable through netlink operations. This
lays the groundwork for reporting and managing hardware error states
in a unified manner across different DRM drivers.

Currently it only supports error-counter nodes. But it can be
extended later.
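
As an illustration of the error-counter contract described in the node management
notes in drm_ras.c, here is a small user-space Python model. It is a sketch under
stated assumptions, not the kernel implementation: the class and function names
loosely mirror the patch, the "budget" parameter is a made-up stand-in for the
netlink buffer filling up, and the third counter name is hypothetical.

```python
import errno


class RasNode:
    """Toy model of an error-counter node with a possibly sparse ID range."""

    def __init__(self, first, last, counters):
        self.first = first
        self.last = last
        self.counters = counters  # error-id -> (name, value); gaps allowed

    def query_error_counter(self, error_id):
        # Driver callback: -ENOENT marks an ID with no counter behind it.
        if error_id not in self.counters:
            return -errno.ENOENT, None, None
        name, value = self.counters[error_id]
        return 0, name, value


def get_node_error_counter(node, error_id):
    # Core-side range check performed before calling into the driver.
    if error_id < node.first or error_id > node.last:
        return -errno.EINVAL, None, None
    return node.query_error_counter(error_id)


def dump_error_counters(node, restart=0, budget=None):
    """Return (entries, next_restart); next_restart is None when done."""
    entries = []
    for error_id in range(max(node.first, restart), node.last + 1):
        ret, name, value = get_node_error_counter(node, error_id)
        if ret == -errno.ENOENT:
            continue  # hole in a non-contiguous range: skip, do not list
        if ret:
            raise OSError(-ret, "error counter query failed")
        if budget is not None and len(entries) == budget:
            return entries, error_id  # "buffer full": resume from this ID
        entries.append({"error-id": error_id, "error-name": name,
                        "error-value": value})
    return entries, None


node = RasNode(first=1, last=4, counters={
    1: ("Core Compute Error", 0),
    2: ("SOC Internal Error", 0),
    4: ("Example Error", 0),  # hypothetical name; ID 3 is left as a hole
})
first_part, resume_at = dump_error_counters(node, budget=2)
second_part, done = dump_error_counters(node, restart=resume_at, budget=2)
```

In this model the hole at ID 3 never shows up in either part of the dump, while
an ID outside the range is rejected with -EINVAL before the driver is consulted.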

The registration is also not tied to any drm node, so it can be
used by accel devices as well.

It uses the new and mandatory YAML description format stored in
Documentation/netlink/specs/. This forces a single generic netlink
family namespace for the entire drm: "drm-ras".
But multiple endpoints are supported within the single family.

Any modification to this API needs to be applied to
Documentation/netlink/specs/drm_ras.yaml before regenerating the
code:

$ tools/net/ynl/pyynl/ynl_gen_c.py --spec \
 Documentation/netlink/specs/drm_ras.yaml --mode uapi --header \
 > include/uapi/drm/drm_ras.h

$ tools/net/ynl/pyynl/ynl_gen_c.py --spec \
 Documentation/netlink/specs/drm_ras.yaml --mode kernel --header \
 > include/drm/drm_ras_nl.h

$ tools/net/ynl/pyynl/ynl_gen_c.py --spec \
 Documentation/netlink/specs/drm_ras.yaml --mode kernel --source \
 > drivers/gpu/drm/drm_ras_nl.c

Cc: Zack McKevitt <zachary.mckevitt@oss.qualcomm.com>
Cc: Lukas Wunner <lukas@wunner.de>
Cc: Lijo Lazar <lijo.lazar@amd.com>
Cc: Hawking Zhang <Hawking.Zhang@amd.com>
Cc: Jakub Kicinski <kuba@kernel.org>
Cc: David S. Miller <davem@davemloft.net>
Cc: Paolo Abeni <pabeni@redhat.com>
Cc: Eric Dumazet <edumazet@google.com>
Cc: netdev@vger.kernel.org
Co-developed-by: Aravind Iddamsetty <aravind.iddamsetty@linux.intel.com>
Signed-off-by: Aravind Iddamsetty <aravind.iddamsetty@linux.intel.com>
Signed-off-by: Rodrigo Vivi <rodrigo.vivi@intel.com>
Signed-off-by: Riana Tauro <riana.tauro@intel.com>
---
v2: fix doc and memory leak
    use xa_for_each_start
    use standard genlmsg_iput (Jakub Kicinski)

v3: add documentation to index
    modify documentation to mention uAPI requirements (Rodrigo)
---
 Documentation/gpu/drm-ras.rst            | 109 +++++++
 Documentation/gpu/index.rst              |   1 +
 Documentation/netlink/specs/drm_ras.yaml | 130 +++++++++
 drivers/gpu/drm/Kconfig                  |   9 +
 drivers/gpu/drm/Makefile                 |   1 +
 drivers/gpu/drm/drm_drv.c                |   6 +
 drivers/gpu/drm/drm_ras.c                | 351 +++++++++++++++++++++++
 drivers/gpu/drm/drm_ras_genl_family.c    |  42 +++
 drivers/gpu/drm/drm_ras_nl.c             |  54 ++++
 include/drm/drm_ras.h                    |  76 +++++
 include/drm/drm_ras_genl_family.h        |  17 ++
 include/drm/drm_ras_nl.h                 |  24 ++
 include/uapi/drm/drm_ras.h               |  49 ++++
 13 files changed, 869 insertions(+)
 create mode 100644 Documentation/gpu/drm-ras.rst
 create mode 100644 Documentation/netlink/specs/drm_ras.yaml
 create mode 100644 drivers/gpu/drm/drm_ras.c
 create mode 100644 drivers/gpu/drm/drm_ras_genl_family.c
 create mode 100644 drivers/gpu/drm/drm_ras_nl.c
 create mode 100644 include/drm/drm_ras.h
 create mode 100644 include/drm/drm_ras_genl_family.h
 create mode 100644 include/drm/drm_ras_nl.h
 create mode 100644 include/uapi/drm/drm_ras.h

diff --git a/Documentation/gpu/drm-ras.rst b/Documentation/gpu/drm-ras.rst
new file mode 100644
index 000000000000..cec60cf5d17d
--- /dev/null
+++ b/Documentation/gpu/drm-ras.rst
@@ -0,0 +1,109 @@
+.. SPDX-License-Identifier: GPL-2.0+
+
+============================
+DRM RAS over Generic Netlink
+============================
+
+The DRM RAS (Reliability, Availability, Serviceability) interface provides a
+standardized way for GPU/accelerator drivers to expose error counters and
+other reliability nodes to user space via Generic Netlink. This allows
+diagnostic tools, monitoring daemons, or test infrastructure to query hardware
+health in a uniform way across different DRM drivers.
+
+Key Goals:
+
+* Provide a standardized RAS solution for GPU and accelerator drivers, enabling
+  data center monitoring and reliability operations.
+* Implement a single drm-ras Generic Netlink family to meet modern Netlink YAML
+  specifications and centralize all RAS-related communication in one namespace.
+* Support a basic error counter interface, addressing the immediate, essential
+  monitoring needs.
+* Offer a flexible, future-proof interface that can be extended to support
+  additional types of RAS data in the future.
+* Allow multiple nodes per driver, enabling drivers to register separate
+  nodes for different IP blocks, sub-blocks, or other logical subdivisions
+  as applicable.
+
+Nodes
+=====
+
+Nodes are logical abstractions representing an error source or block within
+the device. Currently, only error counter nodes are supported.
+
+Drivers are responsible for registering and unregistering nodes via the
+`drm_ras_node_register()` and `drm_ras_node_unregister()` APIs.
+
+Node Management
+-------------------
+
+.. kernel-doc:: drivers/gpu/drm/drm_ras.c
+   :doc: DRM RAS Node Management
+.. kernel-doc:: drivers/gpu/drm/drm_ras.c
+   :internal:
+
+Generic Netlink Usage
+=====================
+
+The interface is implemented as a Generic Netlink family named ``drm-ras``.
+User space tools can:
+
+* List registered nodes with the ``list-nodes`` command.
+* List all error counters in a node with the ``get-error-counters`` command.
+* Query error counters using the ``query-error-counter`` command.
+
+YAML-based Interface
+--------------------
+
+The interface is described in a YAML specification:
+
+:ref:`Documentation/netlink/specs/drm_ras.yaml`
+
+This YAML is used to auto-generate user space bindings via
+``tools/net/ynl/pyynl/ynl_gen_c.py``, and drives the structure of netlink
+attributes and operations.
+
+Usage Notes
+-----------
+
+* User space must first enumerate nodes to obtain their IDs.
+* Node IDs or Node names can be used for all further queries, such as error counters.
+* Error counters can be queried by either the Error ID or Error name.
+* Query Parameters should be defined as part of the uAPI to ensure user interface stability.
+* The interface supports future extension by adding new node types and
+  additional attributes.
+
+Example: List nodes using ynl
+
+.. code-block:: bash
+
+    sudo ynl --family drm_ras  --dump list-nodes
+    [{'device-name': '0000:03:00.0',
+      'node-id': 0,
+      'node-name': 'correctable-errors',
+      'node-type': 'error-counter'},
+     {'device-name': '0000:03:00.0',
+      'node-id': 1,
+      'node-name': 'nonfatal-errors',
+      'node-type': 'error-counter'},
+     {'device-name': '0000:03:00.0',
+      'node-id': 2,
+      'node-name': 'fatal-errors',
+      'node-type': 'error-counter'}]
+
+Example: List all error counters using ynl
+
+.. code-block:: bash
+
+
+   sudo ynl --family drm_ras  --dump get-error-counters --json '{"node-id":1}'
+   [{'error-id': 1, 'error-name': 'error_name_1', 'error-value': 0},
+   {'error-id': 2, 'error-name': 'error_name_2', 'error-value': 0}]
+
+
+Example: Query an error counter for a given node
+
+.. code-block:: bash
+
+   sudo ynl --family drm_ras --do query-error-counter  --json '{"node-id":2, "error-id":1}'
+   {'error-id': 1, 'error-name': 'error_name_1', 'error-value': 0}
+
diff --git a/Documentation/gpu/index.rst b/Documentation/gpu/index.rst
index 7dcb15850afd..60c73fdcfeed 100644
--- a/Documentation/gpu/index.rst
+++ b/Documentation/gpu/index.rst
@@ -9,6 +9,7 @@ GPU Driver Developer's Guide
    drm-mm
    drm-kms
    drm-kms-helpers
+   drm-ras
    drm-uapi
    drm-usage-stats
    driver-uapi
diff --git a/Documentation/netlink/specs/drm_ras.yaml b/Documentation/netlink/specs/drm_ras.yaml
new file mode 100644
index 000000000000..be0e379c5bc9
--- /dev/null
+++ b/Documentation/netlink/specs/drm_ras.yaml
@@ -0,0 +1,130 @@
+# SPDX-License-Identifier: ((GPL-2.0 WITH Linux-syscall-note) OR BSD-3-Clause)
+---
+name: drm-ras
+protocol: genetlink
+uapi-header: drm/drm_ras.h
+
+doc: >-
+  DRM RAS (Reliability, Availability, Serviceability) over Generic Netlink.
+  Provides a standardized mechanism for DRM drivers to register "nodes"
+  representing hardware/software components capable of reporting error counters.
+  Userspace tools can query the list of nodes or individual error counters
+  via the Generic Netlink interface.
+
+definitions:
+  -
+    type: enum
+    name: node-type
+    value-start: 1
+    entries: [error-counter]
+    doc: >-
+         Type of the node. Currently, only error-counter nodes are
+         supported, which expose reliability counters for a hardware/software
+         component.
+
+attribute-sets:
+  -
+    name: node-attrs
+    attributes:
+      -
+        name: node-id
+        type: u32
+        doc: >-
+             Unique identifier for the node.
+             Assigned dynamically by the DRM RAS core upon registration.
+      -
+        name: device-name
+        type: string
+        doc: >-
+             Device name chosen by the driver at registration.
+             Can be a PCI BDF, UUID, or module name if unique.
+      -
+        name: node-name
+        type: string
+        doc: >-
+             Node name chosen by the driver at registration.
+             Can be an IP block name, or any name that identifies the
+             RAS node inside the device.
+      -
+        name: node-type
+        type: u32
+        doc: Type of this node, identifying its function.
+        enum: node-type
+  -
+    name: error-counter-attrs
+    attributes:
+      -
+        name: node-id
+        type: u32
+        doc:  Node ID targeted by this error counter operation.
+      -
+        name: error-id
+        type: u32
+        doc: Unique identifier for a specific error counter within a node.
+      -
+        name: error-name
+        type: string
+        doc: Name of the error.
+      -
+        name: error-value
+        type: u32
+        doc: Current value of the requested error counter.
+
+operations:
+  list:
+    -
+      name: list-nodes
+      doc: >-
+           Retrieve the full list of currently registered DRM RAS nodes.
+           Each node includes its dynamically assigned ID, name, and type.
+           **Important:** User space must call this operation first to obtain
+           the node IDs. These IDs are required for all subsequent
+           operations on nodes, such as querying error counters.
+      attribute-set: node-attrs
+      flags: [admin-perm]
+      dump:
+        reply:
+          attributes:
+            - node-id
+            - device-name
+            - node-name
+            - node-type
+    -
+      name: get-error-counters
+      doc: >-
+           Retrieve the full list of error counters for a given node.
+           The response includes the id, the name, and the current
+           value of each counter.
+      attribute-set: error-counter-attrs
+      flags: [admin-perm]
+      dump:
+        request:
+          attributes:
+            - node-id
+        reply:
+          attributes:
+            - error-id
+            - error-name
+            - error-value
+    -
+      name: query-error-counter
+      doc: >-
+           Query the information of a specific error counter for a given node.
+           Users must provide the node ID and the error counter ID.
+           The response contains the id, the name, and the current value
+           of the counter.
+      attribute-set: error-counter-attrs
+      flags: [admin-perm]
+      do:
+        request:
+          attributes:
+            - node-id
+            - error-id
+        reply:
+          attributes:
+            - error-id
+            - error-name
+            - error-value
+
+kernel-family:
+  headers: ["drm/drm_ras_nl.h"]
diff --git a/drivers/gpu/drm/Kconfig b/drivers/gpu/drm/Kconfig
index 7e6bc0b3a589..5cfb23b80441 100644
--- a/drivers/gpu/drm/Kconfig
+++ b/drivers/gpu/drm/Kconfig
@@ -130,6 +130,15 @@ config DRM_PANIC_SCREEN_QR_VERSION
 	  Smaller QR code are easier to read, but will contain less debugging
 	  data. Default is 40.
 
+config DRM_RAS
+	bool "DRM RAS support"
+	depends on DRM
+	help
+	  Enables the DRM RAS (Reliability, Availability and Serviceability)
+	  support for DRM drivers. This provides a Generic Netlink interface
+	  for error reporting and queries.
+	  If in doubt, say "N".
+
 config DRM_DEBUG_DP_MST_TOPOLOGY_REFS
         bool "Enable refcount backtrace history in the DP MST helpers"
 	depends on STACKTRACE_SUPPORT
diff --git a/drivers/gpu/drm/Makefile b/drivers/gpu/drm/Makefile
index 4b3f3ad5058a..cd19573b2d9f 100644
--- a/drivers/gpu/drm/Makefile
+++ b/drivers/gpu/drm/Makefile
@@ -95,6 +95,7 @@ drm-$(CONFIG_DRM_ACCEL) += ../../accel/drm_accel.o
 drm-$(CONFIG_DRM_PANIC) += drm_panic.o
 drm-$(CONFIG_DRM_DRAW) += drm_draw.o
 drm-$(CONFIG_DRM_PANIC_SCREEN_QR_CODE) += drm_panic_qr.o
+drm-$(CONFIG_DRM_RAS) += drm_ras.o drm_ras_nl.o drm_ras_genl_family.o
 obj-$(CONFIG_DRM)	+= drm.o
 
 obj-$(CONFIG_DRM_PANEL_ORIENTATION_QUIRKS) += drm_panel_orientation_quirks.o
diff --git a/drivers/gpu/drm/drm_drv.c b/drivers/gpu/drm/drm_drv.c
index 2915118436ce..6b965c3d3307 100644
--- a/drivers/gpu/drm/drm_drv.c
+++ b/drivers/gpu/drm/drm_drv.c
@@ -53,6 +53,7 @@
 #include <drm/drm_panic.h>
 #include <drm/drm_print.h>
 #include <drm/drm_privacy_screen_machine.h>
+#include <drm/drm_ras_genl_family.h>
 
 #include "drm_crtc_internal.h"
 #include "drm_internal.h"
@@ -1223,6 +1224,7 @@ static const struct file_operations drm_stub_fops = {
 
 static void drm_core_exit(void)
 {
+	drm_ras_genl_family_unregister();
 	drm_privacy_screen_lookup_exit();
 	drm_panic_exit();
 	accel_core_exit();
@@ -1261,6 +1263,10 @@ static int __init drm_core_init(void)
 
 	drm_privacy_screen_lookup_init();
 
+	ret = drm_ras_genl_family_register();
+	if (ret < 0)
+		goto error;
+
 	drm_core_init_complete = true;
 
 	DRM_DEBUG("Initialized\n");
diff --git a/drivers/gpu/drm/drm_ras.c b/drivers/gpu/drm/drm_ras.c
new file mode 100644
index 000000000000..32f3897ce580
--- /dev/null
+++ b/drivers/gpu/drm/drm_ras.c
@@ -0,0 +1,351 @@
+// SPDX-License-Identifier: MIT
+/*
+ * Copyright © 2025 Intel Corporation
+ */
+
+#include <linux/module.h>
+#include <linux/kernel.h>
+#include <linux/netdevice.h>
+#include <linux/xarray.h>
+#include <net/genetlink.h>
+
+#include <drm/drm_ras.h>
+
+/**
+ * DOC: DRM RAS Node Management
+ *
+ * This module provides the infrastructure to manage RAS (Reliability,
+ * Availability, and Serviceability) nodes for DRM drivers. Each
+ * DRM driver may register one or more RAS nodes, which represent
+ * logical components capable of reporting error counters and other
+ * reliability metrics.
+ *
+ * The nodes are stored in a global xarray `drm_ras_xa` to allow
+ * efficient lookup by ID. Nodes can be registered or unregistered
+ * dynamically at runtime.
+ *
+ * A Generic Netlink family `drm-ras` exposes three main operations to
+ * userspace:
+ *
+ * 1. LIST_NODES: Dump all currently registered RAS nodes.
+ *    The user receives an array of node IDs, names, and types.
+ *
+ * 2. GET_ERROR_COUNTERS: Dump all error counters of a given node.
+ *    The user receives an array of error IDs, names, and current values.
+ *
+ * 3. QUERY_ERROR_COUNTER: Query a specific error counter for a given node.
+ *    Userspace must provide the node ID and the counter ID, and
+ *    receives the ID, the error name, and its current value.
+ *
+ * Node registration:
+ * - drm_ras_node_register(): Registers a new node and assigns
+ *   it a unique ID in the xarray.
+ * - drm_ras_node_unregister(): Removes a previously registered
+ *   node from the xarray.
+ *
+ * Node type:
+ * - ERROR_COUNTER:
+ *     + Currently, only error counters are supported.
+ *     + The driver must implement the query_error_counter() callback to provide
+ *       the name and the value of the error counter.
+ *     + The driver must provide an error_counter_range.last value informing the
+ *       last valid error ID.
+ *     + The driver can provide an error_counter_range.first value informing the
+ *       first valid error ID.
+ *     + The error counters in the driver don't need to be contiguous, but the
+ *       driver must return -ENOENT to the query_error_counter as an indication
+ *       that the ID should be skipped and not listed in the netlink API.
+ *
+ * Netlink handlers:
+ * - drm_ras_nl_list_nodes_dumpit(): Implements the LIST_NODES
+ *   operation, iterating over the xarray.
+ * - drm_ras_nl_get_error_counters_dumpit(): Implements the GET_ERROR_COUNTERS
+ *   operation, iterating over the known valid error_counter_range.
+ * - drm_ras_nl_query_error_counter_doit(): Implements the QUERY_ERROR_COUNTER
+ *   operation, fetching a counter value from a specific node.
+ */
+
+static DEFINE_XARRAY_ALLOC(drm_ras_xa);
+
+/*
+ * The netlink callback context carries dump state across multiple dumpit calls
+ */
+struct drm_ras_ctx {
+	/* Which xarray id to restart the dump from */
+	unsigned long restart;
+};
+
+/**
+ * drm_ras_nl_list_nodes_dumpit() - Dump all registered RAS nodes
+ * @skb: Netlink message buffer
+ * @cb: Callback context for multi-part dumps
+ *
+ * Iterates over all registered RAS nodes in the global xarray and appends
+ * their attributes (ID, name, type) to the given netlink message buffer.
+ * Uses @cb->ctx to track progress in case the message buffer fills up, allowing
+ * multi-part dump support. On buffer overflow, updates the context to resume
+ * from the last node on the next invocation.
+ *
+ * Return: 0 if all nodes fit in @skb, number of bytes added to @skb if
+ *          the buffer filled up (requires multi-part continuation), or
+ *          a negative error code on failure.
+ */
+int drm_ras_nl_list_nodes_dumpit(struct sk_buff *skb,
+				 struct netlink_callback *cb)
+{
+	const struct genl_info *info = genl_info_dump(cb);
+	struct drm_ras_ctx *ctx = (void *)cb->ctx;
+	struct drm_ras_node *node;
+	struct nlattr *hdr;
+	unsigned long id;
+	int ret = 0;
+
+	xa_for_each_start(&drm_ras_xa, id, node, ctx->restart) {
+		hdr = genlmsg_iput(skb, info);
+		if (!hdr) {
+			ret = -EMSGSIZE;
+			break;
+		}
+
+		ret = nla_put_u32(skb, DRM_RAS_A_NODE_ATTRS_NODE_ID, node->id);
+		if (ret) {
+			genlmsg_cancel(skb, hdr);
+			break;
+		}
+
+		ret = nla_put_string(skb, DRM_RAS_A_NODE_ATTRS_DEVICE_NAME,
+				     node->device_name);
+		if (ret) {
+			genlmsg_cancel(skb, hdr);
+			break;
+		}
+
+		ret = nla_put_string(skb, DRM_RAS_A_NODE_ATTRS_NODE_NAME,
+				     node->node_name);
+		if (ret) {
+			genlmsg_cancel(skb, hdr);
+			break;
+		}
+
+		ret = nla_put_u32(skb, DRM_RAS_A_NODE_ATTRS_NODE_TYPE,
+				  node->type);
+		if (ret) {
+			genlmsg_cancel(skb, hdr);
+			break;
+		}
+
+		genlmsg_end(skb, hdr);
+	}
+
+	if (ret == -EMSGSIZE)
+		ctx->restart = id;
+
+	return ret;
+}
+
+static int get_node_error_counter(u32 node_id, u32 error_id,
+				  const char **name, u32 *value)
+{
+	struct drm_ras_node *node;
+
+	node = xa_load(&drm_ras_xa, node_id);
+	if (!node || !node->query_error_counter)
+		return -ENOENT;
+
+	if (error_id < node->error_counter_range.first ||
+	    error_id > node->error_counter_range.last)
+		return -EINVAL;
+
+	return node->query_error_counter(node, error_id, name, value);
+}
+
+static int msg_reply_value(struct sk_buff *msg, u32 error_id,
+			   const char *error_name, u32 value)
+{
+	int ret;
+
+	ret = nla_put_u32(msg, DRM_RAS_A_ERROR_COUNTER_ATTRS_ERROR_ID, error_id);
+	if (ret)
+		return ret;
+
+	ret = nla_put_string(msg, DRM_RAS_A_ERROR_COUNTER_ATTRS_ERROR_NAME,
+			     error_name);
+	if (ret)
+		return ret;
+
+	return nla_put_u32(msg, DRM_RAS_A_ERROR_COUNTER_ATTRS_ERROR_VALUE,
+			   value);
+}
+
+static int doit_reply_value(struct genl_info *info, u32 node_id,
+			    u32 error_id)
+{
+	struct sk_buff *msg;
+	struct nlattr *hdr;
+	const char *error_name;
+	u32 value;
+	int ret;
+
+	msg = genlmsg_new(NLMSG_GOODSIZE, GFP_KERNEL);
+	if (!msg)
+		return -ENOMEM;
+
+	hdr = genlmsg_iput(msg, info);
+	if (!hdr) {
+		nlmsg_free(msg);
+		return -EMSGSIZE;
+	}
+
+	ret = get_node_error_counter(node_id, error_id,
+				     &error_name, &value);
+	if (ret)
+		return ret;
+
+	ret = msg_reply_value(msg, error_id, error_name, value);
+	if (ret) {
+		genlmsg_cancel(msg, hdr);
+		nlmsg_free(msg);
+		return ret;
+	}
+
+	genlmsg_end(msg, hdr);
+
+	return genlmsg_reply(msg, info);
+}
+
+/**
+ * drm_ras_nl_get_error_counters_dumpit() - Dump all Error Counters
+ * @skb: Netlink message buffer
+ * @cb: Callback context for multi-part dumps
+ *
+ * Iterates over all error counters in a given Node and appends
+ * their attributes (ID, name, value) to the given netlink message buffer.
+ * Uses @cb->ctx to track progress in case the message buffer fills up, allowing
+ * multi-part dump support. On buffer overflow, updates the context to resume
+ * from the last error counter on the next invocation.
+ *
+ * Return: 0 if all errors fit in @skb, number of bytes added to @skb if
+ *          the buffer filled up (requires multi-part continuation), or
+ *          a negative error code on failure.
+ */
+int drm_ras_nl_get_error_counters_dumpit(struct sk_buff *skb,
+					 struct netlink_callback *cb)
+{
+	const struct genl_info *info = genl_info_dump(cb);
+	struct drm_ras_ctx *ctx = (void *)cb->ctx;
+	struct drm_ras_node *node;
+	struct nlattr *hdr;
+	const char *error_name;
+	u32 node_id, error_id, value;
+	int ret = 0;
+
+	if (!info->attrs || !info->attrs[DRM_RAS_A_ERROR_COUNTER_ATTRS_NODE_ID])
+		return -EINVAL;
+
+	node_id = nla_get_u32(info->attrs[DRM_RAS_A_ERROR_COUNTER_ATTRS_NODE_ID]);
+
+	node = xa_load(&drm_ras_xa, node_id);
+	if (!node)
+		return -ENOENT;
+
+	for (error_id = max(node->error_counter_range.first, ctx->restart);
+	     error_id <= node->error_counter_range.last;
+	     error_id++) {
+		ret = get_node_error_counter(node_id, error_id,
+					     &error_name, &value);
+		/*
+		 * For non-contiguous range, driver return -ENOENT as indication
+		 * to skip this ID when listing all errors.
+		 */
+		if (ret == -ENOENT)
+			continue;
+		if (ret)
+			return ret;
+
+		hdr = genlmsg_iput(skb, info);
+
+		if (!hdr) {
+			ret = -EMSGSIZE;
+			break;
+		}
+
+		ret = msg_reply_value(skb, error_id, error_name, value);
+		if (ret) {
+			genlmsg_cancel(skb, hdr);
+			break;
+		}
+
+		genlmsg_end(skb, hdr);
+	}
+
+	if (ret == -EMSGSIZE)
+		ctx->restart = error_id;
+
+	return ret;
+}
+
+/**
+ * drm_ras_nl_query_error_counter_doit() - Query an error counter of a node
+ * @skb: Netlink message buffer
+ * @info: Generic Netlink info containing attributes of the request
+ *
+ * Extracts the node ID and error ID from the netlink attributes and
+ * retrieves the current value of the corresponding error counter. Sends the
+ * result back to the requesting user via the standard Genl reply.
+ *
+ * Return: 0 on success, or negative errno on failure.
+ */
+int drm_ras_nl_query_error_counter_doit(struct sk_buff *skb,
+					struct genl_info *info)
+{
+	u32 node_id, error_id;
+
+	if (!info->attrs ||
+	    !info->attrs[DRM_RAS_A_ERROR_COUNTER_ATTRS_NODE_ID] ||
+	    !info->attrs[DRM_RAS_A_ERROR_COUNTER_ATTRS_ERROR_ID])
+		return -EINVAL;
+
+	node_id = nla_get_u32(info->attrs[DRM_RAS_A_ERROR_COUNTER_ATTRS_NODE_ID]);
+	error_id = nla_get_u32(info->attrs[DRM_RAS_A_ERROR_COUNTER_ATTRS_ERROR_ID]);
+
+	return doit_reply_value(info, node_id, error_id);
+}
+
+/**
+ * drm_ras_node_register() - Register a new RAS node
+ * @node: Node structure to register
+ *
+ * Adds the given RAS node to the global node xarray and assigns it a unique
+ * ID. @node->device_name, @node->node_name and @node->type must be valid.
+ *
+ * Return: 0 on success, or negative errno on failure.
+ */
+int drm_ras_node_register(struct drm_ras_node *node)
+{
+	if (!node->device_name || !node->node_name)
+		return -EINVAL;
+
+	/* Currently, only Error Counter Endpoints are supported */
+	if (node->type != DRM_RAS_NODE_TYPE_ERROR_COUNTER)
+		return -EINVAL;
+
+	/* Mandatory entries for Error Counter Node */
+	if (node->type == DRM_RAS_NODE_TYPE_ERROR_COUNTER &&
+	    (!node->error_counter_range.last || !node->query_error_counter))
+		return -EINVAL;
+
+	return xa_alloc(&drm_ras_xa, &node->id, node, xa_limit_32b, GFP_KERNEL);
+}
+EXPORT_SYMBOL(drm_ras_node_register);
+
+/**
+ * drm_ras_node_unregister() - Unregister a previously registered node
+ * @node: Node structure to unregister
+ *
+ * Removes the given node from the global node xarray using its ID.
+ */
+void drm_ras_node_unregister(struct drm_ras_node *node)
+{
+	xa_erase(&drm_ras_xa, node->id);
+}
+EXPORT_SYMBOL(drm_ras_node_unregister);
diff --git a/drivers/gpu/drm/drm_ras_genl_family.c b/drivers/gpu/drm/drm_ras_genl_family.c
new file mode 100644
index 000000000000..2d818b8c3808
--- /dev/null
+++ b/drivers/gpu/drm/drm_ras_genl_family.c
@@ -0,0 +1,42 @@
+// SPDX-License-Identifier: MIT
+/*
+ * Copyright © 2025 Intel Corporation
+ */
+
+#include <drm/drm_ras_genl_family.h>
+#include <drm/drm_ras_nl.h>
+
+/* Track family registration so the drm_exit can be called at any time */
+static bool registered;
+
+/**
+ * drm_ras_genl_family_register() - Register drm-ras genl family
+ *
+ * Only to be called once at drm_drv_init()
+ */
+int drm_ras_genl_family_register(void)
+{
+	int ret;
+
+	registered = false;
+
+	ret = genl_register_family(&drm_ras_nl_family);
+	if (ret)
+		return ret;
+
+	registered = true;
+	return 0;
+}
+
+/**
+ * drm_ras_genl_family_unregister() - Unregister drm-ras genl family
+ *
+ * To be called at drm_drv_exit() at any moment, but only once.
+ */
+void drm_ras_genl_family_unregister(void)
+{
+	if (registered) {
+		genl_unregister_family(&drm_ras_nl_family);
+		registered = false;
+	}
+}
diff --git a/drivers/gpu/drm/drm_ras_nl.c b/drivers/gpu/drm/drm_ras_nl.c
new file mode 100644
index 000000000000..fcd1392410e4
--- /dev/null
+++ b/drivers/gpu/drm/drm_ras_nl.c
@@ -0,0 +1,54 @@
+// SPDX-License-Identifier: ((GPL-2.0 WITH Linux-syscall-note) OR BSD-3-Clause)
+/* Do not edit directly, auto-generated from: */
+/*	Documentation/netlink/specs/drm_ras.yaml */
+/* YNL-GEN kernel source */
+
+#include <net/netlink.h>
+#include <net/genetlink.h>
+
+#include <uapi/drm/drm_ras.h>
+#include <drm/drm_ras_nl.h>
+
+/* DRM_RAS_CMD_GET_ERROR_COUNTERS - dump */
+static const struct nla_policy drm_ras_get_error_counters_nl_policy[DRM_RAS_A_ERROR_COUNTER_ATTRS_NODE_ID + 1] = {
+	[DRM_RAS_A_ERROR_COUNTER_ATTRS_NODE_ID] = { .type = NLA_U32, },
+};
+
+/* DRM_RAS_CMD_QUERY_ERROR_COUNTER - do */
+static const struct nla_policy drm_ras_query_error_counter_nl_policy[DRM_RAS_A_ERROR_COUNTER_ATTRS_ERROR_ID + 1] = {
+	[DRM_RAS_A_ERROR_COUNTER_ATTRS_NODE_ID] = { .type = NLA_U32, },
+	[DRM_RAS_A_ERROR_COUNTER_ATTRS_ERROR_ID] = { .type = NLA_U32, },
+};
+
+/* Ops table for drm_ras */
+static const struct genl_split_ops drm_ras_nl_ops[] = {
+	{
+		.cmd	= DRM_RAS_CMD_LIST_NODES,
+		.dumpit	= drm_ras_nl_list_nodes_dumpit,
+		.flags	= GENL_ADMIN_PERM | GENL_CMD_CAP_DUMP,
+	},
+	{
+		.cmd		= DRM_RAS_CMD_GET_ERROR_COUNTERS,
+		.dumpit		= drm_ras_nl_get_error_counters_dumpit,
+		.policy		= drm_ras_get_error_counters_nl_policy,
+		.maxattr	= DRM_RAS_A_ERROR_COUNTER_ATTRS_NODE_ID,
+		.flags		= GENL_ADMIN_PERM | GENL_CMD_CAP_DUMP,
+	},
+	{
+		.cmd		= DRM_RAS_CMD_QUERY_ERROR_COUNTER,
+		.doit		= drm_ras_nl_query_error_counter_doit,
+		.policy		= drm_ras_query_error_counter_nl_policy,
+		.maxattr	= DRM_RAS_A_ERROR_COUNTER_ATTRS_ERROR_ID,
+		.flags		= GENL_ADMIN_PERM | GENL_CMD_CAP_DO,
+	},
+};
+
+struct genl_family drm_ras_nl_family __ro_after_init = {
+	.name		= DRM_RAS_FAMILY_NAME,
+	.version	= DRM_RAS_FAMILY_VERSION,
+	.netnsok	= true,
+	.parallel_ops	= true,
+	.module		= THIS_MODULE,
+	.split_ops	= drm_ras_nl_ops,
+	.n_split_ops	= ARRAY_SIZE(drm_ras_nl_ops),
+};
diff --git a/include/drm/drm_ras.h b/include/drm/drm_ras.h
new file mode 100644
index 000000000000..bba47a282ef8
--- /dev/null
+++ b/include/drm/drm_ras.h
@@ -0,0 +1,76 @@
+/* SPDX-License-Identifier: MIT */
+/*
+ * Copyright © 2025 Intel Corporation
+ */
+
+#ifndef __DRM_RAS_H__
+#define __DRM_RAS_H__
+
+#include "drm_ras_nl.h"
+
+/**
+ * struct drm_ras_node - A DRM RAS Node
+ */
+struct drm_ras_node {
+	/** @id: Unique identifier for the node. Dynamically assigned. */
+	u32 id;
+	/**
+	 * @device_name: Human-readable name of the device. Given by the driver.
+	 */
+	const char *device_name;
+	/** @node_name: Human-readable name of the node. Given by the driver. */
+	const char *node_name;
+	/** @type: Type of the node (enum drm_ras_node_type). */
+	enum drm_ras_node_type type;
+
+	/* Error-Counter Related Callback and Variables */
+
+	/** @error_counter_range: Range of valid Error IDs for this node. */
+	struct {
+		/** @first: First valid Error ID. */
+		u32 first;
+		/** @last: Last valid Error ID. Mandatory entry. */
+		u32 last;
+	} error_counter_range;
+
+	/**
+	 * @query_error_counter:
+	 *
+	 * This callback is used by drm-ras to query a specific error counter
+	 * supported by this node. It is used both for single-counter queries
+	 * and to iterate over all counters.
+	 *
+	 * Driver should expect query_error_counter() to be called with
+	 * error_id from `error_counter_range.first` to
+	 * `error_counter_range.last`.
+	 *
+	 * The @query_error_counter is a mandatory callback for
+	 * error_counter_node.
+	 *
+	 * Returns: 0 on success,
+	 *          -ENOENT when error_id is not supported as an indication that
+	 *                  drm_ras should silently skip this entry. Used for
+	 *                  supporting non-contiguous error ranges.
+	 *                  Driver is responsible for maintaining the list of
+	 *                  supported error IDs in the range of first to last.
+	 *          Other negative values on errors that should terminate the
+	 *          netlink query.
+	 */
+	int (*query_error_counter)(struct drm_ras_node *ep, u32 error_id,
+				   const char **name, u32 *val);
+
+	/** @priv: Driver private data */
+	void *priv;
+};
+
+struct drm_device;
+
+#if IS_ENABLED(CONFIG_DRM_RAS)
+int drm_ras_node_register(struct drm_ras_node *ep);
+void drm_ras_node_unregister(struct drm_ras_node *ep);
+#else
+static inline int drm_ras_node_register(struct drm_ras_node *ep) { return 0; }
+static inline void drm_ras_node_unregister(struct drm_ras_node *ep) { }
+#endif
+
+#endif
diff --git a/include/drm/drm_ras_genl_family.h b/include/drm/drm_ras_genl_family.h
new file mode 100644
index 000000000000..5931b53429f1
--- /dev/null
+++ b/include/drm/drm_ras_genl_family.h
@@ -0,0 +1,17 @@
+/* SPDX-License-Identifier: MIT */
+/*
+ * Copyright © 2025 Intel Corporation
+ */
+
+#ifndef __DRM_RAS_GENL_FAMILY_H__
+#define __DRM_RAS_GENL_FAMILY_H__
+
+#if IS_ENABLED(CONFIG_DRM_RAS)
+int drm_ras_genl_family_register(void);
+void drm_ras_genl_family_unregister(void);
+#else
+static inline int drm_ras_genl_family_register(void) { return 0; }
+static inline void drm_ras_genl_family_unregister(void) { }
+#endif
+
+#endif
diff --git a/include/drm/drm_ras_nl.h b/include/drm/drm_ras_nl.h
new file mode 100644
index 000000000000..9613b7d9ffdb
--- /dev/null
+++ b/include/drm/drm_ras_nl.h
@@ -0,0 +1,24 @@
+/* SPDX-License-Identifier: ((GPL-2.0 WITH Linux-syscall-note) OR BSD-3-Clause) */
+/* Do not edit directly, auto-generated from: */
+/*	Documentation/netlink/specs/drm_ras.yaml */
+/* YNL-GEN kernel header */
+
+#ifndef _LINUX_DRM_RAS_GEN_H
+#define _LINUX_DRM_RAS_GEN_H
+
+#include <net/netlink.h>
+#include <net/genetlink.h>
+
+#include <uapi/drm/drm_ras.h>
+#include <drm/drm_ras_nl.h>
+
+int drm_ras_nl_list_nodes_dumpit(struct sk_buff *skb,
+				 struct netlink_callback *cb);
+int drm_ras_nl_get_error_counters_dumpit(struct sk_buff *skb,
+					 struct netlink_callback *cb);
+int drm_ras_nl_query_error_counter_doit(struct sk_buff *skb,
+					struct genl_info *info);
+
+extern struct genl_family drm_ras_nl_family;
+
+#endif /* _LINUX_DRM_RAS_GEN_H */
diff --git a/include/uapi/drm/drm_ras.h b/include/uapi/drm/drm_ras.h
new file mode 100644
index 000000000000..3415ba345ac8
--- /dev/null
+++ b/include/uapi/drm/drm_ras.h
@@ -0,0 +1,49 @@
+/* SPDX-License-Identifier: ((GPL-2.0 WITH Linux-syscall-note) OR BSD-3-Clause) */
+/* Do not edit directly, auto-generated from: */
+/*	Documentation/netlink/specs/drm_ras.yaml */
+/* YNL-GEN uapi header */
+
+#ifndef _UAPI_LINUX_DRM_RAS_H
+#define _UAPI_LINUX_DRM_RAS_H
+
+#define DRM_RAS_FAMILY_NAME	"drm-ras"
+#define DRM_RAS_FAMILY_VERSION	1
+
+/*
+ * Type of the node. Currently, only error-counter nodes are supported, which
+ * expose reliability counters for a hardware/software component.
+ */
+enum drm_ras_node_type {
+	DRM_RAS_NODE_TYPE_ERROR_COUNTER = 1,
+};
+
+enum {
+	DRM_RAS_A_NODE_ATTRS_NODE_ID = 1,
+	DRM_RAS_A_NODE_ATTRS_DEVICE_NAME,
+	DRM_RAS_A_NODE_ATTRS_NODE_NAME,
+	DRM_RAS_A_NODE_ATTRS_NODE_TYPE,
+
+	__DRM_RAS_A_NODE_ATTRS_MAX,
+	DRM_RAS_A_NODE_ATTRS_MAX = (__DRM_RAS_A_NODE_ATTRS_MAX - 1)
+};
+
+enum {
+	DRM_RAS_A_ERROR_COUNTER_ATTRS_NODE_ID = 1,
+	DRM_RAS_A_ERROR_COUNTER_ATTRS_ERROR_ID,
+	DRM_RAS_A_ERROR_COUNTER_ATTRS_ERROR_NAME,
+	DRM_RAS_A_ERROR_COUNTER_ATTRS_ERROR_VALUE,
+
+	__DRM_RAS_A_ERROR_COUNTER_ATTRS_MAX,
+	DRM_RAS_A_ERROR_COUNTER_ATTRS_MAX = (__DRM_RAS_A_ERROR_COUNTER_ATTRS_MAX - 1)
+};
+
+enum {
+	DRM_RAS_CMD_LIST_NODES = 1,
+	DRM_RAS_CMD_GET_ERROR_COUNTERS,
+	DRM_RAS_CMD_QUERY_ERROR_COUNTER,
+
+	__DRM_RAS_CMD_MAX,
+	DRM_RAS_CMD_MAX = (__DRM_RAS_CMD_MAX - 1)
+};
+
+#endif /* _UAPI_LINUX_DRM_RAS_H */
-- 
2.47.1



* Re: [PATCH v3 1/4] drm/ras: Introduce the DRM RAS infrastructure over generic netlink
  2025-12-05  8:39 ` [PATCH v3 1/4] drm/ras: Introduce the DRM RAS infrastructure over generic netlink Riana Tauro
@ 2025-12-09 21:35   ` Rodrigo Vivi
  2026-01-08 22:36     ` Zack McKevitt
  0 siblings, 1 reply; 31+ messages in thread
From: Rodrigo Vivi @ 2025-12-09 21:35 UTC (permalink / raw)
  To: Riana Tauro, Jakub Kicinski
  Cc: intel-xe, dri-devel, aravind.iddamsetty, anshuman.gupta,
	joonas.lahtinen, lukas, simona.vetter, airlied, pratik.bari,
	joshua.santosh.ranjan, ashwin.kumar.kulkarni, shubham.kumar,
	Zack McKevitt, Lijo Lazar, Hawking Zhang, Jakub Kicinski,
	David S. Miller, Paolo Abeni, Eric Dumazet, netdev

On Fri, Dec 05, 2025 at 02:09:33PM +0530, Riana Tauro wrote:
> From: Rodrigo Vivi <rodrigo.vivi@intel.com>
> 
> Introduces the DRM RAS infrastructure over generic netlink.
> 
> The new interface allows drivers to expose RAS nodes and their
> associated error counters to userspace in a structured and extensible
> way. Each drm_ras node can register its own set of error counters, which
> are then discoverable and queryable through netlink operations. This
> lays the groundwork for reporting and managing hardware error states
> in a unified manner across different DRM drivers.
> 
> Currently it only supports error-counter nodes, but it can be
> extended later.
> 
> The registration is also not tied to any drm node, so it can be
> used by accel devices as well.
> 
> It uses the new and mandatory YAML description format stored in
> Documentation/netlink/specs/. This forces a single generic netlink
> family namespace for the entire drm: "drm-ras".
> But multiple-endpoints are supported within the single family.
> 
> Any modification to this API needs to be applied to
> Documentation/netlink/specs/drm_ras.yaml before regenerating the
> code:
> 
> $ tools/net/ynl/pyynl/ynl_gen_c.py --spec \
>  Documentation/netlink/specs/drm_ras.yaml --mode uapi --header \
>  > include/uapi/drm/drm_ras.h
> 
> $ tools/net/ynl/pyynl/ynl_gen_c.py --spec \
>  Documentation/netlink/specs/drm_ras.yaml --mode kernel --header \
>  > include/drm/drm_ras_nl.h
> 
> $ tools/net/ynl/pyynl/ynl_gen_c.py --spec \
>  Documentation/netlink/specs/drm_ras.yaml --mode kernel --source \
>  > drivers/gpu/drm/drm_ras_nl.c
> 
> Cc: Zack McKevitt <zachary.mckevitt@oss.qualcomm.com>
> Cc: Lukas Wunner <lukas@wunner.de>
> Cc: Lijo Lazar <lijo.lazar@amd.com>
> Cc: Hawking Zhang <Hawking.Zhang@amd.com>
> Cc: Jakub Kicinski <kuba@kernel.org>
> Cc: David S. Miller <davem@davemloft.net>
> Cc: Paolo Abeni <pabeni@redhat.com>
> Cc: Eric Dumazet <edumazet@google.com>
> Cc: netdev@vger.kernel.org
> Co-developed-by: Aravind Iddamsetty <aravind.iddamsetty@linux.intel.com>
> Signed-off-by: Aravind Iddamsetty <aravind.iddamsetty@linux.intel.com>
> Signed-off-by: Rodrigo Vivi <rodrigo.vivi@intel.com>
> Signed-off-by: Riana Tauro <riana.tauro@intel.com>
> ---
> v2: fix doc and memory leak
>     use xa_for_each_start
>     use standard genlmsg_iput (Jakub Kicinski)
> 
> v3: add documentation to index
>     modify documentation to mention uAPI requirements (Rodrigo)
> ---
>  Documentation/gpu/drm-ras.rst            | 109 +++++++
>  Documentation/gpu/index.rst              |   1 +
>  Documentation/netlink/specs/drm_ras.yaml | 130 +++++++++
>  drivers/gpu/drm/Kconfig                  |   9 +
>  drivers/gpu/drm/Makefile                 |   1 +
>  drivers/gpu/drm/drm_drv.c                |   6 +
>  drivers/gpu/drm/drm_ras.c                | 351 +++++++++++++++++++++++
>  drivers/gpu/drm/drm_ras_genl_family.c    |  42 +++
>  drivers/gpu/drm/drm_ras_nl.c             |  54 ++++
>  include/drm/drm_ras.h                    |  76 +++++
>  include/drm/drm_ras_genl_family.h        |  17 ++
>  include/drm/drm_ras_nl.h                 |  24 ++
>  include/uapi/drm/drm_ras.h               |  49 ++++
>  13 files changed, 869 insertions(+)
>  create mode 100644 Documentation/gpu/drm-ras.rst
>  create mode 100644 Documentation/netlink/specs/drm_ras.yaml
>  create mode 100644 drivers/gpu/drm/drm_ras.c
>  create mode 100644 drivers/gpu/drm/drm_ras_genl_family.c
>  create mode 100644 drivers/gpu/drm/drm_ras_nl.c
>  create mode 100644 include/drm/drm_ras.h
>  create mode 100644 include/drm/drm_ras_genl_family.h
>  create mode 100644 include/drm/drm_ras_nl.h
>  create mode 100644 include/uapi/drm/drm_ras.h
> 
> diff --git a/Documentation/gpu/drm-ras.rst b/Documentation/gpu/drm-ras.rst
> new file mode 100644
> index 000000000000..cec60cf5d17d
> --- /dev/null
> +++ b/Documentation/gpu/drm-ras.rst
> @@ -0,0 +1,109 @@
> +.. SPDX-License-Identifier: GPL-2.0+
> +
> +============================
> +DRM RAS over Generic Netlink
> +============================
> +
> +The DRM RAS (Reliability, Availability, Serviceability) interface provides a
> +standardized way for GPU/accelerator drivers to expose error counters and
> +other reliability nodes to user space via Generic Netlink. This allows
> +diagnostic tools, monitoring daemons, or test infrastructure to query hardware
> +health in a uniform way across different DRM drivers.
> +
> +Key Goals:
> +
> +* Provide a standardized RAS solution for GPU and accelerator drivers, enabling
> +  data center monitoring and reliability operations.
> +* Implement a single drm-ras Generic Netlink family to meet modern Netlink YAML
> +  specifications and centralize all RAS-related communication in one namespace.
> +* Support a basic error counter interface, addressing the immediate, essential
> +  monitoring needs.
> +* Offer a flexible, future-proof interface that can be extended to support
> +  additional types of RAS data in the future.
> +* Allow multiple nodes per driver, enabling drivers to register separate
> +  nodes for different IP blocks, sub-blocks, or other logical subdivisions
> +  as applicable.
> +
> +Nodes
> +=====
> +
> +Nodes are logical abstractions representing an error source or block within
> +the device. Currently, only error counter nodes are supported.
> +
> +Drivers are responsible for registering and unregistering nodes via the
> +`drm_ras_node_register()` and `drm_ras_node_unregister()` APIs.
> +
> +Node Management
> +-------------------
> +
> +.. kernel-doc:: drivers/gpu/drm/drm_ras.c
> +   :doc: DRM RAS Node Management
> +.. kernel-doc:: drivers/gpu/drm/drm_ras.c
> +   :internal:
> +
> +Generic Netlink Usage
> +=====================
> +
> +The interface is implemented as a Generic Netlink family named ``drm-ras``.
> +User space tools can:
> +
> +* List registered nodes with the ``list-nodes`` command.
> +* List all error counters in a node with the ``get-error-counters`` command.
> +* Query error counters using the ``query-error-counter`` command.
> +
> +YAML-based Interface
> +--------------------
> +
> +The interface is described in a YAML specification:
> +
> +:ref:`Documentation/netlink/specs/drm_ras.yaml`
> +
> +This YAML is used to auto-generate user space bindings via
> +``tools/net/ynl/pyynl/ynl_gen_c.py``, and drives the structure of netlink
> +attributes and operations.
> +
> +Usage Notes
> +-----------
> +
> +* User space must first enumerate nodes to obtain their IDs.
> +* Node IDs or Node names can be used for all further queries, such as error counters.
> +* Error counters can be queried by either the Error ID or Error name.
> +* Query Parameters should be defined as part of the uAPI to ensure user interface stability.
> +* The interface supports future extension by adding new node types and
> +  additional attributes.
> +
> +Example: List nodes using ynl
> +
> +.. code-block:: bash
> +
> +    sudo ynl --family drm_ras  --dump list-nodes
> +    [{'device-name': '0000:03:00.0',
> +      'node-id': 0,
> +      'node-name': 'correctable-errors',
> +      'node-type': 'error-counter'},
> +     {'device-name': '0000:03:00.0',
> +      'node-id': 1,
> +      'node-name': 'nonfatal-errors',
> +      'node-type': 'error-counter'},
> +     {'device-name': '0000:03:00.0',
> +      'node-id': 2,
> +      'node-name': 'fatal-errors',
> +      'node-type': 'error-counter'}]
> +
> +Example: List all error counters using ynl
> +
> +.. code-block:: bash
> +
> +
> +   sudo ynl --family drm_ras  --dump get-error-counters --json '{"node-id":1}'
> +   [{'error-id': 1, 'error-name': 'error_name_1', 'error-value': 0},
> +    {'error-id': 2, 'error-name': 'error_name_2', 'error-value': 0}]
> +
> +
> +Example: Query an error counter for a given node
> +
> +.. code-block:: bash
> +
> +   sudo ynl --family drm_ras --do query-error-counter  --json '{"node-id":2, "error-id":1}'
> +   {'error-id': 1, 'error-name': 'error_name_1', 'error-value': 0}
> +
> diff --git a/Documentation/gpu/index.rst b/Documentation/gpu/index.rst
> index 7dcb15850afd..60c73fdcfeed 100644
> --- a/Documentation/gpu/index.rst
> +++ b/Documentation/gpu/index.rst
> @@ -9,6 +9,7 @@ GPU Driver Developer's Guide
>     drm-mm
>     drm-kms
>     drm-kms-helpers
> +   drm-ras
>     drm-uapi
>     drm-usage-stats
>     driver-uapi
> diff --git a/Documentation/netlink/specs/drm_ras.yaml b/Documentation/netlink/specs/drm_ras.yaml
> new file mode 100644
> index 000000000000..be0e379c5bc9
> --- /dev/null
> +++ b/Documentation/netlink/specs/drm_ras.yaml
> @@ -0,0 +1,130 @@
> +# SPDX-License-Identifier: ((GPL-2.0 WITH Linux-syscall-note) OR BSD-3-Clause)
> +---
> +name: drm-ras
> +protocol: genetlink
> +uapi-header: drm/drm_ras.h
> +
> +doc: >-
> +  DRM RAS (Reliability, Availability, Serviceability) over Generic Netlink.
> +  Provides a standardized mechanism for DRM drivers to register "nodes"
> +  representing hardware/software components capable of reporting error counters.
> +  Userspace tools can query the list of nodes or individual error counters
> +  via the Generic Netlink interface.
> +
> +definitions:
> +  -
> +    type: enum
> +    name: node-type
> +    value-start: 1
> +    entries: [error-counter]
> +    doc: >-
> +         Type of the node. Currently, only error-counter nodes are
> +         supported, which expose reliability counters for a hardware/software
> +         component.
> +
> +attribute-sets:
> +  -
> +    name: node-attrs
> +    attributes:
> +      -
> +        name: node-id
> +        type: u32
> +        doc: >-
> +             Unique identifier for the node.
> +             Assigned dynamically by the DRM RAS core upon registration.
> +      -
> +        name: device-name
> +        type: string
> +        doc: >-
> +             Device name chosen by the driver at registration.
> +             Can be a PCI BDF, UUID, or module name if unique.
> +      -
> +        name: node-name
> +        type: string
> +        doc: >-
> +             Node name chosen by the driver at registration.
> +             Can be an IP block name, or any name that identifies the
> +             RAS node inside the device.
> +      -
> +        name: node-type
> +        type: u32
> +        doc: Type of this node, identifying its function.
> +        enum: node-type
> +  -
> +    name: error-counter-attrs
> +    attributes:
> +      -
> +        name: node-id
> +        type: u32
> +        doc:  Node ID targeted by this error counter operation.
> +      -
> +        name: error-id
> +        type: u32
> +        doc: Unique identifier for a specific error counter within a node.
> +      -
> +        name: error-name
> +        type: string
> +        doc: Name of the error.
> +      -
> +        name: error-value
> +        type: u32
> +        doc: Current value of the requested error counter.
> +
> +operations:
> +  list:
> +    -
> +      name: list-nodes
> +      doc: >-
> +           Retrieve the full list of currently registered DRM RAS nodes.
> +           Each node includes its dynamically assigned ID, name, and type.
> +           **Important:** User space must call this operation first to obtain
> +           the node IDs. These IDs are required for all subsequent
> +           operations on nodes, such as querying error counters.
> +      attribute-set: node-attrs
> +      flags: [admin-perm]
> +      dump:
> +        reply:
> +          attributes:
> +            - node-id
> +            - device-name
> +            - node-name
> +            - node-type
> +    -
> +      name: get-error-counters
> +      doc: >-
> +           Retrieve the full list of error counters for a given node.
> +           The response includes the id, the name, and even the current
> +           value of each counter.
> +      attribute-set: error-counter-attrs
> +      flags: [admin-perm]
> +      dump:
> +        request:
> +          attributes:
> +            - node-id
> +        reply:
> +          attributes:
> +            - error-id
> +            - error-name
> +            - error-value
> +    -
> +      name: query-error-counter
> +      doc: >-
> +           Query the information of a specific error counter for a given node.
> +           Users must provide the node ID and the error counter ID.
> +           The response contains the id, the name, and the current value
> +           of the counter.
> +      attribute-set: error-counter-attrs
> +      flags: [admin-perm]
> +      do:
> +        request:
> +          attributes:
> +            - node-id
> +            - error-id
> +        reply:
> +          attributes:
> +            - error-id
> +            - error-name
> +            - error-value
> +
> +kernel-family:
> +  headers: ["drm/drm_ras_nl.h"]
> diff --git a/drivers/gpu/drm/Kconfig b/drivers/gpu/drm/Kconfig
> index 7e6bc0b3a589..5cfb23b80441 100644
> --- a/drivers/gpu/drm/Kconfig
> +++ b/drivers/gpu/drm/Kconfig
> @@ -130,6 +130,15 @@ config DRM_PANIC_SCREEN_QR_VERSION
>  	  Smaller QR code are easier to read, but will contain less debugging
>  	  data. Default is 40.
>  
> +config DRM_RAS
> +	bool "DRM RAS support"
> +	depends on DRM
> +	help
> +	  Enables the DRM RAS (Reliability, Availability and Serviceability)
> +	  support for DRM drivers. This provides a Generic Netlink interface
> +	  for error reporting and queries.
> +	  If in doubt, say "N".
> +
>  config DRM_DEBUG_DP_MST_TOPOLOGY_REFS
>          bool "Enable refcount backtrace history in the DP MST helpers"
>  	depends on STACKTRACE_SUPPORT
> diff --git a/drivers/gpu/drm/Makefile b/drivers/gpu/drm/Makefile
> index 4b3f3ad5058a..cd19573b2d9f 100644
> --- a/drivers/gpu/drm/Makefile
> +++ b/drivers/gpu/drm/Makefile
> @@ -95,6 +95,7 @@ drm-$(CONFIG_DRM_ACCEL) += ../../accel/drm_accel.o
>  drm-$(CONFIG_DRM_PANIC) += drm_panic.o
>  drm-$(CONFIG_DRM_DRAW) += drm_draw.o
>  drm-$(CONFIG_DRM_PANIC_SCREEN_QR_CODE) += drm_panic_qr.o
> +drm-$(CONFIG_DRM_RAS) += drm_ras.o drm_ras_nl.o drm_ras_genl_family.o
>  obj-$(CONFIG_DRM)	+= drm.o
>  
>  obj-$(CONFIG_DRM_PANEL_ORIENTATION_QUIRKS) += drm_panel_orientation_quirks.o
> diff --git a/drivers/gpu/drm/drm_drv.c b/drivers/gpu/drm/drm_drv.c
> index 2915118436ce..6b965c3d3307 100644
> --- a/drivers/gpu/drm/drm_drv.c
> +++ b/drivers/gpu/drm/drm_drv.c
> @@ -53,6 +53,7 @@
>  #include <drm/drm_panic.h>
>  #include <drm/drm_print.h>
>  #include <drm/drm_privacy_screen_machine.h>
> +#include <drm/drm_ras_genl_family.h>
>  
>  #include "drm_crtc_internal.h"
>  #include "drm_internal.h"
> @@ -1223,6 +1224,7 @@ static const struct file_operations drm_stub_fops = {
>  
>  static void drm_core_exit(void)
>  {
> +	drm_ras_genl_family_unregister();
>  	drm_privacy_screen_lookup_exit();
>  	drm_panic_exit();
>  	accel_core_exit();
> @@ -1261,6 +1263,10 @@ static int __init drm_core_init(void)
>  
>  	drm_privacy_screen_lookup_init();
>  
> +	ret = drm_ras_genl_family_register();
> +	if (ret < 0)
> +		goto error;
> +
>  	drm_core_init_complete = true;
>  
>  	DRM_DEBUG("Initialized\n");
> diff --git a/drivers/gpu/drm/drm_ras.c b/drivers/gpu/drm/drm_ras.c
> new file mode 100644
> index 000000000000..32f3897ce580
> --- /dev/null
> +++ b/drivers/gpu/drm/drm_ras.c
> @@ -0,0 +1,351 @@
> +// SPDX-License-Identifier: MIT
> +/*
> + * Copyright © 2025 Intel Corporation
> + */
> +
> +#include <linux/module.h>
> +#include <linux/kernel.h>
> +#include <linux/netdevice.h>
> +#include <linux/xarray.h>
> +#include <net/genetlink.h>
> +
> +#include <drm/drm_ras.h>
> +
> +/**
> + * DOC: DRM RAS Node Management
> + *
> + * This module provides the infrastructure to manage RAS (Reliability,
> + * Availability, and Serviceability) nodes for DRM drivers. Each
> + * DRM driver may register one or more RAS nodes, which represent
> + * logical components capable of reporting error counters and other
> + * reliability metrics.
> + *
> + * The nodes are stored in a global xarray `drm_ras_xa` to allow
> + * efficient lookup by ID. Nodes can be registered or unregistered
> + * dynamically at runtime.
> + *
> + * A Generic Netlink family `drm-ras` exposes three main operations to
> + * userspace:
> + *
> + * 1. LIST_NODES: Dump all currently registered RAS nodes.
> + *    The user receives an array of node IDs, names, and types.
> + *
> + * 2. GET_ERROR_COUNTERS: Dump all error counters of a given node.
> + *    The user receives an array of error IDs, names, and current value.
> + *
> + * 3. QUERY_ERROR_COUNTER: Query a specific error counter for a given node.
> + *    Userspace must provide the node ID and the counter ID, and
> + *    receives the ID, the error name, and its current value.
> + *
> + * Node registration:
> + * - drm_ras_node_register(): Registers a new node and assigns
> + *   it a unique ID in the xarray.
> + * - drm_ras_node_unregister(): Removes a previously registered
> + *   node from the xarray.
> + *
> + * Node type:
> + * - ERROR_COUNTER:
> + *     + Currently, only error counters are supported.
> + *     + The driver must implement the query_error_counter() callback to provide
> + *       the name and the value of the error counter.
> + *     + The driver must provide an error_counter_range.last value informing the
> + *       last valid error ID.
> + *     + The driver can provide an error_counter_range.first value informing the
> + *       first valid error ID.
> + *     + The error counters in the driver don't need to be contiguous, but the
> + *       driver must return -ENOENT from query_error_counter() as an indication
> + *       that the ID should be skipped and not listed in the netlink API.
> + *
> + * Netlink handlers:
> + * - drm_ras_nl_list_nodes_dumpit(): Implements the LIST_NODES
> + *   operation, iterating over the xarray.
> + * - drm_ras_nl_get_error_counters_dumpit(): Implements the GET_ERROR_COUNTERS
> + *   operation, iterating over the known valid error_counter_range.
> + * - drm_ras_nl_query_error_counter_doit(): Implements the QUERY_ERROR_COUNTER
> + *   operation, fetching a counter value from a specific node.
> + */
> +
> +static DEFINE_XARRAY_ALLOC(drm_ras_xa);
> +
> +/*
> + * The netlink callback context carries dump state across multiple dumpit calls
> + */
> +struct drm_ras_ctx {
> +	/* Which xarray id to restart the dump from */
> +	unsigned long restart;
> +};
> +
> +/**
> + * drm_ras_nl_list_nodes_dumpit() - Dump all registered RAS nodes
> + * @skb: Netlink message buffer
> + * @cb: Callback context for multi-part dumps
> + *
> + * Iterates over all registered RAS nodes in the global xarray and appends
> + * their attributes (ID, name, type) to the given netlink message buffer.
> + * Uses @cb->ctx to track progress in case the message buffer fills up, allowing
> + * multi-part dump support. On buffer overflow, updates the context to resume
> + * from the last node on the next invocation.
> + *
> + * Return: 0 if all nodes fit in @skb, number of bytes added to @skb if
> + *          the buffer filled up (requires multi-part continuation), or
> + *          a negative error code on failure.
> + */
> +int drm_ras_nl_list_nodes_dumpit(struct sk_buff *skb,
> +				 struct netlink_callback *cb)
> +{
> +	const struct genl_info *info = genl_info_dump(cb);
> +	struct drm_ras_ctx *ctx = (void *)cb->ctx;
> +	struct drm_ras_node *node;
> +	struct nlattr *hdr;
> +	unsigned long id;
> +	int ret = 0;
> +
> +	xa_for_each_start(&drm_ras_xa, id, node, ctx->restart) {
> +		hdr = genlmsg_iput(skb, info);
> +		if (!hdr) {
> +			ret = -EMSGSIZE;
> +			break;
> +		}
> +
> +		ret = nla_put_u32(skb, DRM_RAS_A_NODE_ATTRS_NODE_ID, node->id);
> +		if (ret) {
> +			genlmsg_cancel(skb, hdr);
> +			break;
> +		}
> +
> +		ret = nla_put_string(skb, DRM_RAS_A_NODE_ATTRS_DEVICE_NAME,
> +				     node->device_name);
> +		if (ret) {
> +			genlmsg_cancel(skb, hdr);
> +			break;
> +		}
> +
> +		ret = nla_put_string(skb, DRM_RAS_A_NODE_ATTRS_NODE_NAME,
> +				     node->node_name);
> +		if (ret) {
> +			genlmsg_cancel(skb, hdr);
> +			break;
> +		}
> +
> +		ret = nla_put_u32(skb, DRM_RAS_A_NODE_ATTRS_NODE_TYPE,
> +				  node->type);
> +		if (ret) {
> +			genlmsg_cancel(skb, hdr);
> +			break;
> +		}
> +
> +		genlmsg_end(skb, hdr);
> +	}
> +
> +	if (ret == -EMSGSIZE)
> +		ctx->restart = id;

Jakub had mentioned that we don't need this special handling
of the -EMSGSIZE, but then I'm not sure what to use in the
xa_for_each_start, so

Cc: Jakub Kicinski <kuba@kernel.org>

to ensure that we are in the right path here.
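
For reference while discussing the restart handling, here is a minimal
userspace sketch of one way to keep a resume cursor without checking for a
specific error code. It is only a model under stated assumptions: the sparse
table stands in for the xarray, the fixed-size buffer stands in for the skb,
and every sketch_* name is hypothetical and not part of the patch. It is not
confirmed to be what was suggested in the earlier review.

```c
#include <assert.h>
#include <stddef.h>

#define MAX_NODES 8	/* hypothetical table size, not from the patch */

/* Hypothetical stand-in for an entry stored in the xarray. */
struct sketch_node {
	const char *name;
};

/* Hypothetical stand-in for the skb: room for two entries per message. */
struct sketch_msg {
	unsigned long ids[2];
	size_t len;
};

/* Dump state carried across calls, similar in role to struct drm_ras_ctx. */
struct sketch_ctx {
	unsigned long restart;
};

/*
 * Emit entries starting at ctx->restart until the table ends or msg fills up.
 * The cursor is updated before every emit attempt, so it always names the
 * first entry that has not been written yet.
 * Returns the number of entries written; 0 means there was nothing left.
 */
static size_t sketch_dump(struct sketch_node **table,
			  struct sketch_ctx *ctx, struct sketch_msg *msg)
{
	unsigned long id;

	assert(table && ctx && msg);

	msg->len = 0;
	for (id = ctx->restart; id < MAX_NODES; id++) {
		if (!table[id])
			continue;	/* hole in the sparse table */

		ctx->restart = id;	/* resume here if this emit fails */
		if (msg->len == sizeof(msg->ids) / sizeof(msg->ids[0]))
			return msg->len;	/* buffer full, call again */

		msg->ids[msg->len++] = id;
	}

	ctx->restart = MAX_NODES;	/* nothing left to dump */
	return msg->len;
}
```

A caller would invoke sketch_dump() repeatedly with the same ctx until it
returns 0. Because the cursor is written unconditionally, the function never
needs to distinguish a full buffer from any other reason for stopping early.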

Riana, thank you so much for picking up this and addressing all
the comments. Patch looks good to me.

Thanks,
Rodrigo.

> +
> +	return ret;
> +}
> +
> +static int get_node_error_counter(u32 node_id, u32 error_id,
> +				  const char **name, u32 *value)
> +{
> +	struct drm_ras_node *node;
> +
> +	node = xa_load(&drm_ras_xa, node_id);
> +	if (!node || !node->query_error_counter)
> +		return -ENOENT;
> +
> +	if (error_id < node->error_counter_range.first ||
> +	    error_id > node->error_counter_range.last)
> +		return -EINVAL;
> +
> +	return node->query_error_counter(node, error_id, name, value);
> +}
> +
> +static int msg_reply_value(struct sk_buff *msg, u32 error_id,
> +			   const char *error_name, u32 value)
> +{
> +	int ret;
> +
> +	ret = nla_put_u32(msg, DRM_RAS_A_ERROR_COUNTER_ATTRS_ERROR_ID, error_id);
> +	if (ret)
> +		return ret;
> +
> +	ret = nla_put_string(msg, DRM_RAS_A_ERROR_COUNTER_ATTRS_ERROR_NAME,
> +			     error_name);
> +	if (ret)
> +		return ret;
> +
> +	return nla_put_u32(msg, DRM_RAS_A_ERROR_COUNTER_ATTRS_ERROR_VALUE,
> +			   value);
> +}
> +
> +static int doit_reply_value(struct genl_info *info, u32 node_id,
> +			    u32 error_id)
> +{
> +	struct sk_buff *msg;
> +	struct nlattr *hdr;
> +	const char *error_name;
> +	u32 value;
> +	int ret;
> +
> +	ret = get_node_error_counter(node_id, error_id,
> +				     &error_name, &value);
> +	if (ret)
> +		return ret;
> +
> +	msg = genlmsg_new(NLMSG_GOODSIZE, GFP_KERNEL);
> +	if (!msg)
> +		return -ENOMEM;
> +
> +	hdr = genlmsg_iput(msg, info);
> +	if (!hdr) {
> +		nlmsg_free(msg);
> +		return -EMSGSIZE;
> +	}
> +
> +	ret = msg_reply_value(msg, error_id, error_name, value);
> +	if (ret) {
> +		genlmsg_cancel(msg, hdr);
> +		nlmsg_free(msg);
> +		return ret;
> +	}
> +
> +	genlmsg_end(msg, hdr);
> +
> +	return genlmsg_reply(msg, info);
> +}
> +
> +/**
> + * drm_ras_nl_get_error_counters_dumpit() - Dump all Error Counters
> + * @skb: Netlink message buffer
> + * @cb: Callback context for multi-part dumps
> + *
> + * Iterates over all error counters in a given Node and appends
> + * their attributes (ID, name, value) to the given netlink message buffer.
> + * Uses @cb->ctx to track progress in case the message buffer fills up, allowing
> + * multi-part dump support. On buffer overflow, updates the context to resume
> + * from the last error counter on the next invocation.
> + *
> + * Return: 0 if all errors fit in @skb, number of bytes added to @skb if
> + *          the buffer filled up (requires multi-part continuation), or
> + *          a negative error code on failure.
> + */
> +int drm_ras_nl_get_error_counters_dumpit(struct sk_buff *skb,
> +					 struct netlink_callback *cb)
> +{
> +	const struct genl_info *info = genl_info_dump(cb);
> +	struct drm_ras_ctx *ctx = (void *)cb->ctx;
> +	struct drm_ras_node *node;
> +	struct nlattr *hdr;
> +	const char *error_name;
> +	u32 node_id, error_id, value;
> +	int ret = 0;
> +
> +	if (!info->attrs || !info->attrs[DRM_RAS_A_ERROR_COUNTER_ATTRS_NODE_ID])
> +		return -EINVAL;
> +
> +	node_id = nla_get_u32(info->attrs[DRM_RAS_A_ERROR_COUNTER_ATTRS_NODE_ID]);
> +
> +	node = xa_load(&drm_ras_xa, node_id);
> +	if (!node)
> +		return -ENOENT;
> +
> +	for (error_id = max(node->error_counter_range.first, ctx->restart);
> +	     error_id <= node->error_counter_range.last;
> +	     error_id++) {
> +		ret = get_node_error_counter(node_id, error_id,
> +					     &error_name, &value);
> +		/*
> +		 * For a non-contiguous range, the driver returns -ENOENT as an
> +		 * indication to skip this ID when listing all errors.
> +		 */
> +		if (ret == -ENOENT) {
> +			ret = 0;
> +			continue;
> +		}
> +		if (ret)
> +			return ret;
> +
> +		hdr = genlmsg_iput(skb, info);
> +
> +		if (!hdr) {
> +			ret = -EMSGSIZE;
> +			break;
> +		}
> +
> +		ret = msg_reply_value(skb, error_id, error_name, value);
> +		if (ret) {
> +			genlmsg_cancel(skb, hdr);
> +			break;
> +		}
> +
> +		genlmsg_end(skb, hdr);
> +	}
> +
> +	if (ret == -EMSGSIZE)
> +		ctx->restart = error_id;
> +
> +	return ret;
> +}
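
The resume behaviour of this dump handler can be illustrated with a small userspace model. This is only a sketch under stated assumptions, not kernel code: `struct dump_ctx`, `struct fake_buf`, `query_counter()` and `dump_counters()` are hypothetical stand-ins for `cb->ctx`, the skb, the driver callback and the dumpit loop. The model also clears the return value when it skips a hole, so that a hole at the end of the range does not surface as an error; that detail is a choice made in the sketch.

```c
#include <errno.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical stand-in for the per-dump state kept in cb->ctx. */
struct dump_ctx {
	uint32_t restart;	/* first error ID to emit on the next pass */
};

/* Hypothetical stand-in for an skb: a small fixed number of record slots. */
struct fake_buf {
	uint32_t ids[4];
	uint32_t values[4];
	size_t used;
	size_t capacity;	/* must not exceed the size of ids[] */
};

/* Hypothetical driver callback: IDs divisible by 3 are holes in the range. */
static int query_counter(uint32_t error_id, uint32_t *value)
{
	if (error_id % 3 == 0)
		return -ENOENT;
	*value = error_id * 10;
	return 0;
}

/* One dump pass: emit counters until the range ends or the buffer is full. */
static int dump_counters(struct fake_buf *buf, struct dump_ctx *ctx,
			 uint32_t first, uint32_t last)
{
	uint32_t id = first > ctx->restart ? first : ctx->restart;
	uint32_t value;
	int ret = 0;

	for (; id <= last; id++) {
		ret = query_counter(id, &value);
		if (ret == -ENOENT) {
			ret = 0;	/* a hole is skipped, not reported */
			continue;
		}
		if (ret)
			return ret;
		if (buf->used == buf->capacity) {
			ctx->restart = id;	/* resume here on the next pass */
			return -EMSGSIZE;
		}
		buf->ids[buf->used] = id;
		buf->values[buf->used] = value;
		buf->used++;
	}
	return ret;
}
```

A caller would invoke `dump_counters()` repeatedly with a fresh buffer and the same `struct dump_ctx` until it returns 0; each pass that returns -EMSGSIZE leaves `restart` pointing at the first counter that did not fit.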
> +
> +/**
> + * drm_ras_nl_query_error_counter_doit() - Query an error counter of a node
> + * @skb: Netlink message buffer
> + * @info: Generic Netlink info containing attributes of the request
> + *
> + * Extracts the node ID and error ID from the netlink attributes and
> + * retrieves the current value of the corresponding error counter. Sends the
> + * result back to the requesting user via the standard Genl reply.
> + *
> + * Return: 0 on success, or negative errno on failure.
> + */
> +int drm_ras_nl_query_error_counter_doit(struct sk_buff *skb,
> +					struct genl_info *info)
> +{
> +	u32 node_id, error_id;
> +
> +	if (!info->attrs ||
> +	    !info->attrs[DRM_RAS_A_ERROR_COUNTER_ATTRS_NODE_ID] ||
> +	    !info->attrs[DRM_RAS_A_ERROR_COUNTER_ATTRS_ERROR_ID])
> +		return -EINVAL;
> +
> +	node_id = nla_get_u32(info->attrs[DRM_RAS_A_ERROR_COUNTER_ATTRS_NODE_ID]);
> +	error_id = nla_get_u32(info->attrs[DRM_RAS_A_ERROR_COUNTER_ATTRS_ERROR_ID]);
> +
> +	return doit_reply_value(info, node_id, error_id);
> +}
> +
> +/**
> + * drm_ras_node_register() - Register a new RAS node
> + * @node: Node structure to register
> + *
> + * Adds the given RAS node to the global node xarray and assigns it
> + * a unique ID. @node->device_name, @node->node_name and @node->type
> + * must be valid.
> + *
> + * Return: 0 on success, or negative errno on failure.
> + */
> +int drm_ras_node_register(struct drm_ras_node *node)
> +{
> +	if (!node->device_name || !node->node_name)
> +		return -EINVAL;
> +
> +	/* Currently, only Error Counter Endpoints are supported */
> +	if (node->type != DRM_RAS_NODE_TYPE_ERROR_COUNTER)
> +		return -EINVAL;
> +
> +	/* Mandatory entries for Error Counter Node */
> +	if (node->type == DRM_RAS_NODE_TYPE_ERROR_COUNTER &&
> +	    (!node->error_counter_range.last || !node->query_error_counter))
> +		return -EINVAL;
> +
> +	return xa_alloc(&drm_ras_xa, &node->id, node, xa_limit_32b, GFP_KERNEL);
> +}
> +EXPORT_SYMBOL(drm_ras_node_register);
> +
> +/**
> + * drm_ras_node_unregister() - Unregister a previously registered node
> + * @node: Node structure to unregister
> + *
> + * Removes the given node from the global node xarray using its ID.
> + */
> +void drm_ras_node_unregister(struct drm_ras_node *node)
> +{
> +	xa_erase(&drm_ras_xa, node->id);
> +}
> +EXPORT_SYMBOL(drm_ras_node_unregister);
> diff --git a/drivers/gpu/drm/drm_ras_genl_family.c b/drivers/gpu/drm/drm_ras_genl_family.c
> new file mode 100644
> index 000000000000..2d818b8c3808
> --- /dev/null
> +++ b/drivers/gpu/drm/drm_ras_genl_family.c
> @@ -0,0 +1,42 @@
> +// SPDX-License-Identifier: MIT
> +/*
> + * Copyright © 2025 Intel Corporation
> + */
> +
> +#include <drm/drm_ras_genl_family.h>
> +#include <drm/drm_ras_nl.h>
> +
> +/* Track family registration so that drm_core_exit() can be called at any time */
> +static bool registered;
> +
> +/**
> + * drm_ras_genl_family_register() - Register drm-ras genl family
> + *
> + * Only to be called once, from drm_core_init().
> + */
> +int drm_ras_genl_family_register(void)
> +{
> +	int ret;
> +
> +	registered = false;
> +
> +	ret = genl_register_family(&drm_ras_nl_family);
> +	if (ret)
> +		return ret;
> +
> +	registered = true;
> +	return 0;
> +}
> +
> +/**
> + * drm_ras_genl_family_unregister() - Unregister drm-ras genl family
> + *
> + * May be called from drm_core_exit() at any moment, but only once.
> + */
> +void drm_ras_genl_family_unregister(void)
> +{
> +	if (registered) {
> +		genl_unregister_family(&drm_ras_nl_family);
> +		registered = false;
> +	}
> +}
> diff --git a/drivers/gpu/drm/drm_ras_nl.c b/drivers/gpu/drm/drm_ras_nl.c
> new file mode 100644
> index 000000000000..fcd1392410e4
> --- /dev/null
> +++ b/drivers/gpu/drm/drm_ras_nl.c
> @@ -0,0 +1,54 @@
> +// SPDX-License-Identifier: ((GPL-2.0 WITH Linux-syscall-note) OR BSD-3-Clause)
> +/* Do not edit directly, auto-generated from: */
> +/*	Documentation/netlink/specs/drm_ras.yaml */
> +/* YNL-GEN kernel source */
> +
> +#include <net/netlink.h>
> +#include <net/genetlink.h>
> +
> +#include <uapi/drm/drm_ras.h>
> +#include <drm/drm_ras_nl.h>
> +
> +/* DRM_RAS_CMD_GET_ERROR_COUNTERS - dump */
> +static const struct nla_policy drm_ras_get_error_counters_nl_policy[DRM_RAS_A_ERROR_COUNTER_ATTRS_NODE_ID + 1] = {
> +	[DRM_RAS_A_ERROR_COUNTER_ATTRS_NODE_ID] = { .type = NLA_U32, },
> +};
> +
> +/* DRM_RAS_CMD_QUERY_ERROR_COUNTER - do */
> +static const struct nla_policy drm_ras_query_error_counter_nl_policy[DRM_RAS_A_ERROR_COUNTER_ATTRS_ERROR_ID + 1] = {
> +	[DRM_RAS_A_ERROR_COUNTER_ATTRS_NODE_ID] = { .type = NLA_U32, },
> +	[DRM_RAS_A_ERROR_COUNTER_ATTRS_ERROR_ID] = { .type = NLA_U32, },
> +};
> +
> +/* Ops table for drm_ras */
> +static const struct genl_split_ops drm_ras_nl_ops[] = {
> +	{
> +		.cmd	= DRM_RAS_CMD_LIST_NODES,
> +		.dumpit	= drm_ras_nl_list_nodes_dumpit,
> +		.flags	= GENL_ADMIN_PERM | GENL_CMD_CAP_DUMP,
> +	},
> +	{
> +		.cmd		= DRM_RAS_CMD_GET_ERROR_COUNTERS,
> +		.dumpit		= drm_ras_nl_get_error_counters_dumpit,
> +		.policy		= drm_ras_get_error_counters_nl_policy,
> +		.maxattr	= DRM_RAS_A_ERROR_COUNTER_ATTRS_NODE_ID,
> +		.flags		= GENL_ADMIN_PERM | GENL_CMD_CAP_DUMP,
> +	},
> +	{
> +		.cmd		= DRM_RAS_CMD_QUERY_ERROR_COUNTER,
> +		.doit		= drm_ras_nl_query_error_counter_doit,
> +		.policy		= drm_ras_query_error_counter_nl_policy,
> +		.maxattr	= DRM_RAS_A_ERROR_COUNTER_ATTRS_ERROR_ID,
> +		.flags		= GENL_ADMIN_PERM | GENL_CMD_CAP_DO,
> +	},
> +};
> +
> +struct genl_family drm_ras_nl_family __ro_after_init = {
> +	.name		= DRM_RAS_FAMILY_NAME,
> +	.version	= DRM_RAS_FAMILY_VERSION,
> +	.netnsok	= true,
> +	.parallel_ops	= true,
> +	.module		= THIS_MODULE,
> +	.split_ops	= drm_ras_nl_ops,
> +	.n_split_ops	= ARRAY_SIZE(drm_ras_nl_ops),
> +};
> diff --git a/include/drm/drm_ras.h b/include/drm/drm_ras.h
> new file mode 100644
> index 000000000000..bba47a282ef8
> --- /dev/null
> +++ b/include/drm/drm_ras.h
> @@ -0,0 +1,76 @@
> +/* SPDX-License-Identifier: MIT */
> +/*
> + * Copyright © 2025 Intel Corporation
> + */
> +
> +#ifndef __DRM_RAS_H__
> +#define __DRM_RAS_H__
> +
> +#include "drm_ras_nl.h"
> +
> +/**
> + * struct drm_ras_node - A DRM RAS Node
> + */
> +struct drm_ras_node {
> +	/** @id: Unique identifier for the node. Dynamically assigned. */
> +	u32 id;
> +	/**
> +	 * @device_name: Human-readable name of the device. Given by the driver.
> +	 */
> +	const char *device_name;
> +	/** @node_name: Human-readable name of the node. Given by the driver. */
> +	const char *node_name;
> +	/** @type: Type of the node (enum drm_ras_node_type). */
> +	enum drm_ras_node_type type;
> +
> +	/* Error-Counter Related Callback and Variables */
> +
> +	/** @error_counter_range: Range of valid Error IDs for this node. */
> +	struct {
> +		/** @first: First valid Error ID. */
> +		u32 first;
> +		/** @last: Last valid Error ID. Mandatory entry. */
> +		u32 last;
> +	} error_counter_range;
> +
> +	/**
> +	 * @query_error_counter:
> +	 *
> +	 * This callback is used by drm-ras to query a specific error counter
> +	 * supported by this node. It is called both for single-counter
> +	 * queries and while iterating over all counters.
> +	 *
> +	 * Driver should expect query_error_counter() to be called with
> +	 * error_id from `error_counter_range.first` to
> +	 * `error_counter_range.last`.
> +	 *
> +	 * The @query_error_counter is a mandatory callback for
> +	 * error_counter_node.
> +	 *
> +	 * Returns: 0 on success,
> +	 *          -ENOENT when error_id is not supported as an indication that
> +	 *                  drm_ras should silently skip this entry. Used for
> +	 *                  supporting non-contiguous error ranges.
> +	 *                  Driver is responsible for maintaining the list of
> +	 *                  supported error IDs in the range of first to last.
> +	 *          Other negative values on errors that should terminate the
> +	 *          netlink query.
> +	 */
> +	int (*query_error_counter)(struct drm_ras_node *ep, u32 error_id,
> +				   const char **name, u32 *val);
> +
> +	/** @priv: Driver private data */
> +	void *priv;
> +};
> +
> +struct drm_device;
> +
> +#if IS_ENABLED(CONFIG_DRM_RAS)
> +int drm_ras_node_register(struct drm_ras_node *ep);
> +void drm_ras_node_unregister(struct drm_ras_node *ep);
> +#else
> +static inline int drm_ras_node_register(struct drm_ras_node *ep) { return 0; }
> +static inline void drm_ras_node_unregister(struct drm_ras_node *ep) { }
> +#endif
> +
> +#endif
> diff --git a/include/drm/drm_ras_genl_family.h b/include/drm/drm_ras_genl_family.h
> new file mode 100644
> index 000000000000..5931b53429f1
> --- /dev/null
> +++ b/include/drm/drm_ras_genl_family.h
> @@ -0,0 +1,17 @@
> +/* SPDX-License-Identifier: MIT */
> +/*
> + * Copyright © 2025 Intel Corporation
> + */
> +
> +#ifndef __DRM_RAS_GENL_FAMILY_H__
> +#define __DRM_RAS_GENL_FAMILY_H__
> +
> +#if IS_ENABLED(CONFIG_DRM_RAS)
> +int drm_ras_genl_family_register(void);
> +void drm_ras_genl_family_unregister(void);
> +#else
> +static inline int drm_ras_genl_family_register(void) { return 0; }
> +static inline void drm_ras_genl_family_unregister(void) { }
> +#endif
> +
> +#endif
> diff --git a/include/drm/drm_ras_nl.h b/include/drm/drm_ras_nl.h
> new file mode 100644
> index 000000000000..9613b7d9ffdb
> --- /dev/null
> +++ b/include/drm/drm_ras_nl.h
> @@ -0,0 +1,24 @@
> +/* SPDX-License-Identifier: ((GPL-2.0 WITH Linux-syscall-note) OR BSD-3-Clause) */
> +/* Do not edit directly, auto-generated from: */
> +/*	Documentation/netlink/specs/drm_ras.yaml */
> +/* YNL-GEN kernel header */
> +
> +#ifndef _LINUX_DRM_RAS_GEN_H
> +#define _LINUX_DRM_RAS_GEN_H
> +
> +#include <net/netlink.h>
> +#include <net/genetlink.h>
> +
> +#include <uapi/drm/drm_ras.h>
> +#include <drm/drm_ras_nl.h>
> +
> +int drm_ras_nl_list_nodes_dumpit(struct sk_buff *skb,
> +				 struct netlink_callback *cb);
> +int drm_ras_nl_get_error_counters_dumpit(struct sk_buff *skb,
> +					 struct netlink_callback *cb);
> +int drm_ras_nl_query_error_counter_doit(struct sk_buff *skb,
> +					struct genl_info *info);
> +
> +extern struct genl_family drm_ras_nl_family;
> +
> +#endif /* _LINUX_DRM_RAS_GEN_H */
> diff --git a/include/uapi/drm/drm_ras.h b/include/uapi/drm/drm_ras.h
> new file mode 100644
> index 000000000000..3415ba345ac8
> --- /dev/null
> +++ b/include/uapi/drm/drm_ras.h
> @@ -0,0 +1,49 @@
> +/* SPDX-License-Identifier: ((GPL-2.0 WITH Linux-syscall-note) OR BSD-3-Clause) */
> +/* Do not edit directly, auto-generated from: */
> +/*	Documentation/netlink/specs/drm_ras.yaml */
> +/* YNL-GEN uapi header */
> +
> +#ifndef _UAPI_LINUX_DRM_RAS_H
> +#define _UAPI_LINUX_DRM_RAS_H
> +
> +#define DRM_RAS_FAMILY_NAME	"drm-ras"
> +#define DRM_RAS_FAMILY_VERSION	1
> +
> +/*
> + * Type of the node. Currently, only error-counter nodes are supported, which
> + * expose reliability counters for a hardware/software component.
> + */
> +enum drm_ras_node_type {
> +	DRM_RAS_NODE_TYPE_ERROR_COUNTER = 1,
> +};
> +
> +enum {
> +	DRM_RAS_A_NODE_ATTRS_NODE_ID = 1,
> +	DRM_RAS_A_NODE_ATTRS_DEVICE_NAME,
> +	DRM_RAS_A_NODE_ATTRS_NODE_NAME,
> +	DRM_RAS_A_NODE_ATTRS_NODE_TYPE,
> +
> +	__DRM_RAS_A_NODE_ATTRS_MAX,
> +	DRM_RAS_A_NODE_ATTRS_MAX = (__DRM_RAS_A_NODE_ATTRS_MAX - 1)
> +};
> +
> +enum {
> +	DRM_RAS_A_ERROR_COUNTER_ATTRS_NODE_ID = 1,
> +	DRM_RAS_A_ERROR_COUNTER_ATTRS_ERROR_ID,
> +	DRM_RAS_A_ERROR_COUNTER_ATTRS_ERROR_NAME,
> +	DRM_RAS_A_ERROR_COUNTER_ATTRS_ERROR_VALUE,
> +
> +	__DRM_RAS_A_ERROR_COUNTER_ATTRS_MAX,
> +	DRM_RAS_A_ERROR_COUNTER_ATTRS_MAX = (__DRM_RAS_A_ERROR_COUNTER_ATTRS_MAX - 1)
> +};
> +
> +enum {
> +	DRM_RAS_CMD_LIST_NODES = 1,
> +	DRM_RAS_CMD_GET_ERROR_COUNTERS,
> +	DRM_RAS_CMD_QUERY_ERROR_COUNTER,
> +
> +	__DRM_RAS_CMD_MAX,
> +	DRM_RAS_CMD_MAX = (__DRM_RAS_CMD_MAX - 1)
> +};
> +
> +#endif /* _UAPI_LINUX_DRM_RAS_H */
> -- 
> 2.47.1
> 


* Re: [PATCH v3 1/4] drm/ras: Introduce the DRM RAS infrastructure over generic netlink
  2025-12-09 21:35   ` Rodrigo Vivi
@ 2026-01-08 22:36     ` Zack McKevitt
  2026-01-09 20:57       ` Rodrigo Vivi
  0 siblings, 1 reply; 31+ messages in thread
From: Zack McKevitt @ 2026-01-08 22:36 UTC (permalink / raw)
  To: Rodrigo Vivi, Riana Tauro, Jakub Kicinski
  Cc: intel-xe, dri-devel, aravind.iddamsetty, anshuman.gupta,
	joonas.lahtinen, lukas, simona.vetter, airlied, pratik.bari,
	joshua.santosh.ranjan, ashwin.kumar.kulkarni, shubham.kumar,
	Lijo Lazar, Hawking Zhang, David S. Miller, Paolo Abeni,
	Eric Dumazet, netdev



On 12/9/2025 2:35 PM, Rodrigo Vivi wrote:

Apologies for the delay getting back to this. We are still supportive of 
this functionality making it into the DRM subsystem but have a couple of 
questions.

> On Fri, Dec 05, 2025 at 02:09:33PM +0530, Riana Tauro wrote:
>> From: Rodrigo Vivi <rodrigo.vivi@intel.com>
>>
>> Introduces the DRM RAS infrastructure over generic netlink.
>>
>> The new interface allows drivers to expose RAS nodes and their
>> associated error counters to userspace in a structured and extensible
>> way. Each drm_ras node can register its own set of error counters, which
>> are then discoverable and queryable through netlink operations. This
>> lays the groundwork for reporting and managing hardware error states
>> in a unified manner across different DRM drivers.
>>
>> Currently it only supports error-counter nodes. But it can be
>> extended later.
>>
>> The registration is also not tied to any drm node, so it can be
>> used by accel devices as well.

Thank you for including the userspace reference implementation. I have
begun prototyping an extension for our qaic accel driver to incorporate
telemetry functionality by adding a new node type to drm_ras. Overall, 
extending the interface is intuitive.

>>
>> It uses the new and mandatory YAML description format stored in
>> Documentation/netlink/specs/. This forces a single generic netlink
>> family namespace for the entire drm: "drm-ras".
>> But multiple-endpoints are supported within the single family.
>>
>> Any modification to this API needs to be applied to
>> Documentation/netlink/specs/drm_ras.yaml before regenerating the
>> code:
>>
>> $ tools/net/ynl/pyynl/ynl_gen_c.py --spec \
>>   Documentation/netlink/specs/drm_ras.yaml --mode uapi --header \
>>   > include/uapi/drm/drm_ras.h
>>
>> $ tools/net/ynl/pyynl/ynl_gen_c.py --spec \
>>   Documentation/netlink/specs/drm_ras.yaml --mode kernel --header \
>>   > include/drm/drm_ras_nl.h
>>
>> $ tools/net/ynl/pyynl/ynl_gen_c.py --spec \
>>   Documentation/netlink/specs/drm_ras.yaml --mode kernel --source \
>>   > drivers/gpu/drm/drm_ras_nl.c
>>
>> Cc: Zack McKevitt <zachary.mckevitt@oss.qualcomm.com>
>> Cc: Lukas Wunner <lukas@wunner.de>
>> Cc: Lijo Lazar <lijo.lazar@amd.com>
>> Cc: Hawking Zhang <Hawking.Zhang@amd.com>
>> Cc: Jakub Kicinski <kuba@kernel.org>
>> Cc: David S. Miller <davem@davemloft.net>
>> Cc: Paolo Abeni <pabeni@redhat.com>
>> Cc: Eric Dumazet <edumazet@google.com>
>> Cc: netdev@vger.kernel.org
>> Co-developed-by: Aravind Iddamsetty <aravind.iddamsetty@linux.intel.com>
>> Signed-off-by: Aravind Iddamsetty <aravind.iddamsetty@linux.intel.com>
>> Signed-off-by: Rodrigo Vivi <rodrigo.vivi@intel.com>
>> Signed-off-by: Riana Tauro <riana.tauro@intel.com>
>> ---
>> v2: fix doc and memory leak
>>      use xe_for_each_start
>>      use standard genlmsg_iput (Jakub Kicinski)
>>
>> v3: add documentation to index
>>      modify documentation to mention uAPI requirements (Rodrigo)
>> ---
>>   Documentation/gpu/drm-ras.rst            | 109 +++++++
>>   Documentation/gpu/index.rst              |   1 +
>>   Documentation/netlink/specs/drm_ras.yaml | 130 +++++++++
>>   drivers/gpu/drm/Kconfig                  |   9 +
>>   drivers/gpu/drm/Makefile                 |   1 +
>>   drivers/gpu/drm/drm_drv.c                |   6 +
>>   drivers/gpu/drm/drm_ras.c                | 351 +++++++++++++++++++++++
>>   drivers/gpu/drm/drm_ras_genl_family.c    |  42 +++
>>   drivers/gpu/drm/drm_ras_nl.c             |  54 ++++
>>   include/drm/drm_ras.h                    |  76 +++++
>>   include/drm/drm_ras_genl_family.h        |  17 ++
>>   include/drm/drm_ras_nl.h                 |  24 ++
>>   include/uapi/drm/drm_ras.h               |  49 ++++
>>   13 files changed, 869 insertions(+)
>>   create mode 100644 Documentation/gpu/drm-ras.rst
>>   create mode 100644 Documentation/netlink/specs/drm_ras.yaml
>>   create mode 100644 drivers/gpu/drm/drm_ras.c
>>   create mode 100644 drivers/gpu/drm/drm_ras_genl_family.c
>>   create mode 100644 drivers/gpu/drm/drm_ras_nl.c
>>   create mode 100644 include/drm/drm_ras.h
>>   create mode 100644 include/drm/drm_ras_genl_family.h
>>   create mode 100644 include/drm/drm_ras_nl.h
>>   create mode 100644 include/uapi/drm/drm_ras.h
>>
>> diff --git a/Documentation/gpu/drm-ras.rst b/Documentation/gpu/drm-ras.rst
>> new file mode 100644
>> index 000000000000..cec60cf5d17d
>> --- /dev/null
>> +++ b/Documentation/gpu/drm-ras.rst
>> @@ -0,0 +1,109 @@
>> +.. SPDX-License-Identifier: GPL-2.0+
>> +
>> +============================
>> +DRM RAS over Generic Netlink
>> +============================
>> +
>> +The DRM RAS (Reliability, Availability, Serviceability) interface provides a
>> +standardized way for GPU/accelerator drivers to expose error counters and
>> +other reliability nodes to user space via Generic Netlink. This allows
>> +diagnostic tools, monitoring daemons, or test infrastructure to query hardware
>> +health in a uniform way across different DRM drivers.
>> +
>> +Key Goals:
>> +
>> +* Provide a standardized RAS solution for GPU and accelerator drivers, enabling
>> +  data center monitoring and reliability operations.
>> +* Implement a single drm-ras Generic Netlink family to meet modern Netlink YAML
>> +  specifications and centralize all RAS-related communication in one namespace.
>> +* Support a basic error counter interface, addressing the immediate, essential
>> +  monitoring needs.
>> +* Offer a flexible, future-proof interface that can be extended to support
>> +  additional types of RAS data in the future.
>> +* Allow multiple nodes per driver, enabling drivers to register separate
>> +  nodes for different IP blocks, sub-blocks, or other logical subdivisions
>> +  as applicable.
>> +
>> +Nodes
>> +=====
>> +
>> +Nodes are logical abstractions representing an error source or block within
>> +the device. Currently, only error counter nodes are supported.
>> +
>> +Drivers are responsible for registering and unregistering nodes via the
>> +`drm_ras_node_register()` and `drm_ras_node_unregister()` APIs.
>> +
>> +Node Management
>> +-------------------
>> +
>> +.. kernel-doc:: drivers/gpu/drm/drm_ras.c
>> +   :doc: DRM RAS Node Management
>> +.. kernel-doc:: drivers/gpu/drm/drm_ras.c
>> +   :internal:
>> +
>> +Generic Netlink Usage
>> +=====================
>> +
>> +The interface is implemented as a Generic Netlink family named ``drm-ras``.
>> +User space tools can:
>> +
>> +* List registered nodes with the ``list-nodes`` command.
>> +* List all error counters in a node with the ``get-error-counters`` command.
>> +* Query error counters using the ``query-error-counter`` command.
>> +
>> +YAML-based Interface
>> +--------------------
>> +
>> +The interface is described in a YAML specification:
>> +
>> +:ref:`Documentation/netlink/specs/drm_ras.yaml`
>> +
>> +This YAML is used to auto-generate user space bindings via
>> +``tools/net/ynl/pyynl/ynl_gen_c.py``, and drives the structure of netlink
>> +attributes and operations.
>> +
>> +Usage Notes
>> +-----------
>> +
>> +* User space must first enumerate nodes to obtain their IDs.
>> +* Node IDs or Node names can be used for all further queries, such as error counters.
>> +* Error counters can be queried by either the Error ID or Error name.
>> +* Query Parameters should be defined as part of the uAPI to ensure user interface stability.
>> +* The interface supports future extension by adding new node types and
>> +  additional attributes.
>> +
>> +Example: List nodes using ynl
>> +
>> +.. code-block:: bash
>> +
>> +    sudo ynl --family drm_ras  --dump list-nodes
>> +    [{'device-name': '0000:03:00.0',
>> +      'node-id': 0,
>> +      'node-name': 'correctable-errors',
>> +      'node-type': 'error-counter'},
>> +     {'device-name': '0000:03:00.0',
>> +      'node-id': 1,
>> +      'node-name': 'nonfatal-errors',
>> +      'node-type': 'error-counter'},
>> +     {'device-name': '0000:03:00.0',
>> +      'node-id': 2,
>> +      'node-name': 'fatal-errors',
>> +      'node-type': 'error-counter'}]
>> +
>> +Example: List all error counters using ynl
>> +
>> +.. code-block:: bash
>> +
>> +
>> +   sudo ynl --family drm_ras  --dump get-error-counters --json '{"node-id":1}'
>> +   [{'error-id': 1, 'error-name': 'error_name_1', 'error-value': 0},
>> +   {'error-id': 2, 'error-name': 'error_name_2', 'error-value': 0}]
>> +
>> +
>> +Example: Query an error counter for a given node
>> +
>> +.. code-block:: bash
>> +
>> +   sudo ynl --family drm_ras --do query-error-counter  --json '{"node-id":2, "error-id":1}'
>> +   {'error-id': 1, 'error-name': 'error_name_1', 'error-value': 0}
>> +
>> diff --git a/Documentation/gpu/index.rst b/Documentation/gpu/index.rst
>> index 7dcb15850afd..60c73fdcfeed 100644
>> --- a/Documentation/gpu/index.rst
>> +++ b/Documentation/gpu/index.rst
>> @@ -9,6 +9,7 @@ GPU Driver Developer's Guide
>>      drm-mm
>>      drm-kms
>>      drm-kms-helpers
>> +   drm-ras
>>      drm-uapi
>>      drm-usage-stats
>>      driver-uapi
>> diff --git a/Documentation/netlink/specs/drm_ras.yaml b/Documentation/netlink/specs/drm_ras.yaml
>> new file mode 100644
>> index 000000000000..be0e379c5bc9
>> --- /dev/null
>> +++ b/Documentation/netlink/specs/drm_ras.yaml
>> @@ -0,0 +1,130 @@
>> +# SPDX-License-Identifier: ((GPL-2.0 WITH Linux-syscall-note) OR BSD-3-Clause)
>> +---
>> +name: drm-ras
>> +protocol: genetlink
>> +uapi-header: drm/drm_ras.h
>> +
>> +doc: >-
>> +  DRM RAS (Reliability, Availability, Serviceability) over Generic Netlink.
>> +  Provides a standardized mechanism for DRM drivers to register "nodes"
>> +  representing hardware/software components capable of reporting error counters.
>> +  Userspace tools can query the list of nodes or individual error counters
>> +  via the Generic Netlink interface.
>> +
>> +definitions:
>> +  -
>> +    type: enum
>> +    name: node-type
>> +    value-start: 1
>> +    entries: [error-counter]
>> +    doc: >-
>> +         Type of the node. Currently, only error-counter nodes are
>> +         supported, which expose reliability counters for a hardware/software
>> +         component.
>> +
>> +attribute-sets:
>> +  -
>> +    name: node-attrs
>> +    attributes:
>> +      -
>> +        name: node-id
>> +        type: u32
>> +        doc: >-
>> +             Unique identifier for the node.
>> +             Assigned dynamically by the DRM RAS core upon registration.
>> +      -
>> +        name: device-name
>> +        type: string
>> +        doc: >-
>> +             Device name chosen by the driver at registration.
>> +             Can be a PCI BDF, UUID, or module name if unique.
>> +      -
>> +        name: node-name
>> +        type: string
>> +        doc: >-
>> +             Node name chosen by the driver at registration.
>> +             Can be an IP block name, or any name that identifies the
>> +             RAS node inside the device.
>> +      -
>> +        name: node-type
>> +        type: u32
>> +        doc: Type of this node, identifying its function.
>> +        enum: node-type
>> +  -
>> +    name: error-counter-attrs
>> +    attributes:
>> +      -
>> +        name: node-id
>> +        type: u32
>> +        doc:  Node ID targeted by this error counter operation.
>> +      -
>> +        name: error-id
>> +        type: u32
>> +        doc: Unique identifier for a specific error counter within a node.
>> +      -
>> +        name: error-name
>> +        type: string
>> +        doc: Name of the error.
>> +      -
>> +        name: error-value
>> +        type: u32
>> +        doc: Current value of the requested error counter.
>> +
>> +operations:
>> +  list:
>> +    -
>> +      name: list-nodes
>> +      doc: >-
>> +           Retrieve the full list of currently registered DRM RAS nodes.
>> +           Each node includes its dynamically assigned ID, name, and type.
>> +           **Important:** User space must call this operation first to obtain
>> +           the node IDs. These IDs are required for all subsequent
>> +           operations on nodes, such as querying error counters.

I am curious about security implications of this design. If the complete 
list of RAS nodes is visible for any process on the system (and one 
wants to avoid requiring CAP_NET_ADMIN), there should be some way to 
enforce permission checks when performing these operations if desired.

For example, this might be implemented in the driver's definition of 
callback functions like query_error_counter; some drivers may want to 
ensure that the process can in fact open the file descriptor 
corresponding to the queried device before serving a netlink request. Is 
it enough for a driver to simply return -EPERM in this case? Any driver 
that doesn't wish to protect its RAS nodes need not implement checks in 
its callbacks.

I don't see any such permissions checks in your driver implementation 
which is understandable given that it may not be necessary for your use 
cases. However, this would be a concern for our driver if we were to 
adopt this interface.
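
To make the question concrete, here is a minimal userspace model of what I have in mind. It is a sketch under assumptions, not a proposal for the final code: `struct model_node`, `guarded_query()` and `model_dump()` are hypothetical names, and the `caller_may_open_device` flag stands in for whatever check a driver would really perform. The loop follows the error handling in the quoted dump handler, where only -ENOENT is skipped, so a callback returning -EPERM would end the whole dump rather than hide a single counter.

```c
#include <errno.h>
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical stand-in for struct drm_ras_node; only what the model needs. */
struct model_node {
	uint32_t first, last;
	bool caller_may_open_device;	/* assumed result of a driver-side check */
	int (*query)(struct model_node *node, uint32_t error_id, uint32_t *val);
};

/* Hypothetical driver callback that refuses callers failing its own check. */
static int guarded_query(struct model_node *node, uint32_t error_id,
			 uint32_t *val)
{
	if (!node->caller_may_open_device)
		return -EPERM;
	if (error_id == 2)
		return -ENOENT;	/* hole in the ID range */
	*val = 100 + error_id;
	return 0;
}

/* Same error policy as the dump loop: skip -ENOENT, stop on anything else. */
static int model_dump(struct model_node *node, unsigned int *emitted)
{
	uint32_t id, val;
	int ret;

	*emitted = 0;
	for (id = node->first; id <= node->last; id++) {
		ret = node->query(node, id, &val);
		if (ret == -ENOENT)
			continue;
		if (ret)
			return ret;
		(*emitted)++;
	}
	return 0;
}
```

With the flag set, `model_dump()` over IDs 1 to 3 reports two counters; with it cleared, the first callback invocation fails and the caller sees -EPERM with nothing emitted. How a real callback would learn the identity of the requesting process is left open here, since the callback signature in this patch does not carry the netlink request.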

>> +      attribute-set: node-attrs
>> +      flags: [admin-perm]
>> +      dump:
>> +        reply:
>> +          attributes:
>> +            - node-id
>> +            - device-name
>> +            - node-name
>> +            - node-type
>> +    -
>> +      name: get-error-counters
>> +      doc: >-
>> +           Retrieve the full list of error counters for a given node.
>> +           The response includes the id, the name, and even the current
>> +           value of each counter.
>> +      attribute-set: error-counter-attrs
>> +      flags: [admin-perm]
>> +      dump:
>> +        request:
>> +          attributes:
>> +            - node-id
>> +        reply:
>> +          attributes:
>> +            - error-id
>> +            - error-name
>> +            - error-value
>> +    -
>> +      name: query-error-counter
>> +      doc: >-
>> +           Query the information of a specific error counter for a given node.
>> +           Users must provide the node ID and the error counter ID.
>> +           The response contains the id, the name, and the current value
>> +           of the counter.
>> +      attribute-set: error-counter-attrs
>> +      flags: [admin-perm]
>> +      do:
>> +        request:
>> +          attributes:
>> +            - node-id
>> +            - error-id
>> +        reply:
>> +          attributes:
>> +            - error-id
>> +            - error-name
>> +            - error-value
>> +
>> +kernel-family:
>> +  headers: ["drm/drm_ras_nl.h"]
>> diff --git a/drivers/gpu/drm/Kconfig b/drivers/gpu/drm/Kconfig
>> index 7e6bc0b3a589..5cfb23b80441 100644
>> --- a/drivers/gpu/drm/Kconfig
>> +++ b/drivers/gpu/drm/Kconfig
>> @@ -130,6 +130,15 @@ config DRM_PANIC_SCREEN_QR_VERSION
>>   	  Smaller QR code are easier to read, but will contain less debugging
>>   	  data. Default is 40.
>>   
>> +config DRM_RAS
>> +	bool "DRM RAS support"
>> +	depends on DRM
>> +	help
>> +	  Enables the DRM RAS (Reliability, Availability and Serviceability)
>> +	  support for DRM drivers. This provides a Generic Netlink interface
>> +	  for error reporting and queries.
>> +	  If in doubt, say "N".
>> +
>>   config DRM_DEBUG_DP_MST_TOPOLOGY_REFS
>>           bool "Enable refcount backtrace history in the DP MST helpers"
>>   	depends on STACKTRACE_SUPPORT
>> diff --git a/drivers/gpu/drm/Makefile b/drivers/gpu/drm/Makefile
>> index 4b3f3ad5058a..cd19573b2d9f 100644
>> --- a/drivers/gpu/drm/Makefile
>> +++ b/drivers/gpu/drm/Makefile
>> @@ -95,6 +95,7 @@ drm-$(CONFIG_DRM_ACCEL) += ../../accel/drm_accel.o
>>   drm-$(CONFIG_DRM_PANIC) += drm_panic.o
>>   drm-$(CONFIG_DRM_DRAW) += drm_draw.o
>>   drm-$(CONFIG_DRM_PANIC_SCREEN_QR_CODE) += drm_panic_qr.o
>> +drm-$(CONFIG_DRM_RAS) += drm_ras.o drm_ras_nl.o drm_ras_genl_family.o
>>   obj-$(CONFIG_DRM)	+= drm.o
>>   
>>   obj-$(CONFIG_DRM_PANEL_ORIENTATION_QUIRKS) += drm_panel_orientation_quirks.o
>> diff --git a/drivers/gpu/drm/drm_drv.c b/drivers/gpu/drm/drm_drv.c
>> index 2915118436ce..6b965c3d3307 100644
>> --- a/drivers/gpu/drm/drm_drv.c
>> +++ b/drivers/gpu/drm/drm_drv.c
>> @@ -53,6 +53,7 @@
>>   #include <drm/drm_panic.h>
>>   #include <drm/drm_print.h>
>>   #include <drm/drm_privacy_screen_machine.h>
>> +#include <drm/drm_ras_genl_family.h>
>>   
>>   #include "drm_crtc_internal.h"
>>   #include "drm_internal.h"
>> @@ -1223,6 +1224,7 @@ static const struct file_operations drm_stub_fops = {
>>   
>>   static void drm_core_exit(void)
>>   {
>> +	drm_ras_genl_family_unregister();
>>   	drm_privacy_screen_lookup_exit();
>>   	drm_panic_exit();
>>   	accel_core_exit();
>> @@ -1261,6 +1263,10 @@ static int __init drm_core_init(void)
>>   
>>   	drm_privacy_screen_lookup_init();
>>   
>> +	ret = drm_ras_genl_family_register();
>> +	if (ret < 0)
>> +		goto error;
>> +
>>   	drm_core_init_complete = true;
>>   
>>   	DRM_DEBUG("Initialized\n");
>> diff --git a/drivers/gpu/drm/drm_ras.c b/drivers/gpu/drm/drm_ras.c
>> new file mode 100644
>> index 000000000000..32f3897ce580
>> --- /dev/null
>> +++ b/drivers/gpu/drm/drm_ras.c
>> @@ -0,0 +1,351 @@
>> +// SPDX-License-Identifier: MIT
>> +/*
>> + * Copyright © 2025 Intel Corporation
>> + */
>> +
>> +#include <linux/module.h>
>> +#include <linux/kernel.h>
>> +#include <linux/netdevice.h>
>> +#include <linux/xarray.h>
>> +#include <net/genetlink.h>
>> +
>> +#include <drm/drm_ras.h>
>> +
>> +/**
>> + * DOC: DRM RAS Node Management
>> + *
>> + * This module provides the infrastructure to manage RAS (Reliability,
>> + * Availability, and Serviceability) nodes for DRM drivers. Each
>> + * DRM driver may register one or more RAS nodes, which represent
>> + * logical components capable of reporting error counters and other
>> + * reliability metrics.
>> + *
>> + * The nodes are stored in a global xarray `drm_ras_xa` to allow
>> + * efficient lookup by ID. Nodes can be registered or unregistered
>> + * dynamically at runtime.
>> + *
>> + * A Generic Netlink family `drm_ras` exposes two main operations to
>> + * userspace:

Nit: Three main operations.

>> + *
>> + * 1. LIST_NODES: Dump all currently registered RAS nodes.
>> + *    The user receives an array of node IDs, names, and types.
>> + *
>> + * 2. GET_ERROR_COUNTERS: Dump all error counters of a given node.
>> + *    The user receives an array of error IDs, names, and current value.
>> + *
>> + * 3. QUERY_ERROR_COUNTER: Query a specific error counter for a given node.
>> + *    Userspace must provide the node ID and the counter ID, and
>> + *    receives the ID, the error name, and its current value.
>> + *
>> + * Node registration:
>> + * - drm_ras_node_register(): Registers a new node and assigns
>> + *   it a unique ID in the xarray.
>> + * - drm_ras_node_unregister(): Removes a previously registered
>> + *   node from the xarray.
>> + *
>> + * Node type:
>> + * - ERROR_COUNTER:
>> + *     + Currently, only error counters are supported.
>> + *     + The driver must implement the query_error_counter() callback to provide
>> + *       the name and the value of the error counter.
>> + *     + The driver must provide an error_counter_range.last value informing the
>> + *       last valid error ID.
>> + *     + The driver can provide an error_counter_range.first value informing the
>> + *       first valid error ID.
>> + *     + The error counters in the driver don't need to be contiguous, but the
>> + *       driver must return -ENOENT to the query_error_counter as an indication
>> + *       that the ID should be skipped and not listed in the netlink API.
>> + *
>> + * Netlink handlers:
>> + * - drm_ras_nl_list_nodes_dumpit(): Implements the LIST_NODES
>> + *   operation, iterating over the xarray.
>> + * - drm_ras_nl_get_error_counters_dumpit(): Implements the GET_ERROR_COUNTERS
>> + *   operation, iterating over the known valid error_counter_range.
>> + * - drm_ras_nl_query_error_counter_doit(): Implements the QUERY_ERROR_COUNTER
>> + *   operation, fetching a counter value from a specific node.
>> + */
>> +
>> +static DEFINE_XARRAY_ALLOC(drm_ras_xa);
>> +
>> +/*
>> + * The netlink callback context carries dump state across multiple dumpit calls
>> + */
>> +struct drm_ras_ctx {
>> +	/* Which xarray id to restart the dump from */
>> +	unsigned long restart;
>> +};
>> +
>> +/**
>> + * drm_ras_nl_list_nodes_dumpit() - Dump all registered RAS nodes
>> + * @skb: Netlink message buffer
>> + * @cb: Callback context for multi-part dumps
>> + *
>> + * Iterates over all registered RAS nodes in the global xarray and appends
>> + * their attributes (ID, name, type) to the given netlink message buffer.
>> + * Uses @cb->ctx to track progress in case the message buffer fills up, allowing
>> + * multi-part dump support. On buffer overflow, updates the context to resume
>> + * from the last node on the next invocation.
>> + *
>> + * Return: 0 if all nodes fit in @skb, number of bytes added to @skb if
>> + *          the buffer filled up (requires multi-part continuation), or
>> + *          a negative error code on failure.
>> + */
>> +int drm_ras_nl_list_nodes_dumpit(struct sk_buff *skb,
>> +				 struct netlink_callback *cb)
>> +{
>> +	const struct genl_info *info = genl_info_dump(cb);
>> +	struct drm_ras_ctx *ctx = (void *)cb->ctx;
>> +	struct drm_ras_node *node;
>> +	struct nlattr *hdr;
>> +	unsigned long id;
>> +	int ret = 0;
>> +
>> +	xa_for_each_start(&drm_ras_xa, id, node, ctx->restart) {
>> +		hdr = genlmsg_iput(skb, info);
>> +		if (!hdr) {
>> +			ret = -EMSGSIZE;
>> +			break;
>> +		}
>> +
>> +		ret = nla_put_u32(skb, DRM_RAS_A_NODE_ATTRS_NODE_ID, node->id);
>> +		if (ret) {
>> +			genlmsg_cancel(skb, hdr);
>> +			break;
>> +		}
>> +
>> +		ret = nla_put_string(skb, DRM_RAS_A_NODE_ATTRS_DEVICE_NAME,
>> +				     node->device_name);
>> +		if (ret) {
>> +			genlmsg_cancel(skb, hdr);
>> +			break;
>> +		}
>> +
>> +		ret = nla_put_string(skb, DRM_RAS_A_NODE_ATTRS_NODE_NAME,
>> +				     node->node_name);
>> +		if (ret) {
>> +			genlmsg_cancel(skb, hdr);
>> +			break;
>> +		}
>> +
>> +		ret = nla_put_u32(skb, DRM_RAS_A_NODE_ATTRS_NODE_TYPE,
>> +				  node->type);
>> +		if (ret) {
>> +			genlmsg_cancel(skb, hdr);
>> +			break;
>> +		}
>> +
>> +		genlmsg_end(skb, hdr);
>> +	}
>> +
>> +	if (ret == -EMSGSIZE)
>> +		ctx->restart = id;
> 
> Jakub had mentioned that we don't need this special handling
> of the -EMSGSIZE, but then I'm not sure what to use in the
> xa_for_each_start, so
> 
> Cc: Jakub Kicinski <kuba@kernel.org>
> 
> to ensure that we are in the right path here.
> 
> Riana, thank you so much for picking up this and addressing all
> the comments. Patch looks good to me.
> 
> Thanks,
> Rodrigo.
> 
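To make sure I am reading the resume logic correctly, here is a small userspace model of the bookkeeping only. It is a sketch under assumptions: `demo_ctx` and `demo_dump()` are hypothetical names standing in for `struct drm_ras_ctx` and the dumpit loop, a fixed-size output array stands in for the skb, and none of netlink's own return conventions are modelled.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>

/* Hypothetical stand-in for struct drm_ras_ctx in the patch. */
struct demo_ctx {
	unsigned long restart;	/* first index not yet emitted */
};

/*
 * Copy ids[ctx->restart .. n_ids) into out[], at most out_cap entries per
 * call. Returns 0 once everything has been emitted, or -EMSGSIZE when out[]
 * filled up first; in that case ctx->restart is updated so that the next
 * call resumes at the first entry that did not fit.
 */
static int demo_dump(const unsigned int *ids, size_t n_ids,
		     unsigned int *out, size_t out_cap, size_t *out_len,
		     struct demo_ctx *ctx)
{
	unsigned long i;
	int ret = 0;

	assert(ids && out && out_len && ctx);
	*out_len = 0;

	for (i = ctx->restart; i < n_ids; i++) {
		if (*out_len == out_cap) {
			ret = -EMSGSIZE;
			break;
		}
		out[(*out_len)++] = ids[i];
	}

	if (ret == -EMSGSIZE)
		ctx->restart = i;

	return ret;
}
```

In the sketch `ret` starts at 0, so a call with nothing left to emit returns 0 rather than an indeterminate value.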
>> +
>> +	return ret;
>> +}
>> +
>> +static int get_node_error_counter(u32 node_id, u32 error_id,
>> +				  const char **name, u32 *value)
>> +{
>> +	struct drm_ras_node *node;
>> +
>> +	node = xa_load(&drm_ras_xa, node_id);
>> +	if (!node || !node->query_error_counter)
>> +		return -ENOENT;
>> +
>> +	if (error_id < node->error_counter_range.first ||
>> +	    error_id > node->error_counter_range.last)
>> +		return -EINVAL;
>> +
>> +	return node->query_error_counter(node, error_id, name, value);
>> +}

Regarding the permission check, node->query_error_counter could be
implemented to return -EPERM in this case by checking driver-specified
fields in node->priv. Thoughts?
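Roughly what I have in mind, as a userspace sketch rather than real driver code. Everything prefixed `demo_` is hypothetical (the private struct, its `counters_restricted` flag and the counter names), and `struct drm_ras_node` below is a cut-down stand-in that keeps only the callback and `priv` from the patch.

```c
#include <assert.h>
#include <errno.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Cut-down stand-in for struct drm_ras_node: only the members used here. */
struct drm_ras_node {
	int (*query_error_counter)(struct drm_ras_node *node, uint32_t error_id,
				   const char **name, uint32_t *val);
	void *priv;
};

/* Hypothetical driver-private data; not part of the patch. */
struct demo_priv {
	bool counters_restricted;	/* assumed driver policy flag */
	uint32_t counters[4];
};

static const char * const demo_names[] = { "err0", "err1", "err2", "err3" };

/* Driver callback: refuse with -EPERM based on driver-owned state in priv. */
static int demo_query_error_counter(struct drm_ras_node *node, uint32_t error_id,
				    const char **name, uint32_t *val)
{
	struct demo_priv *priv = node->priv;

	assert(priv && name && val);

	if (priv->counters_restricted)
		return -EPERM;

	/* Unknown IDs are reported as -ENOENT so that dumps can skip them. */
	if (error_id >= 4)
		return -ENOENT;

	*name = demo_names[error_id];
	*val = priv->counters[error_id];
	return 0;
}
```

With this shape the core would not need to know about the policy at all; it would simply propagate whatever the callback returns.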

>> +
>> +static int msg_reply_value(struct sk_buff *msg, u32 error_id,
>> +			   const char *error_name, u32 value)
>> +{
>> +	int ret;
>> +
>> +	ret = nla_put_u32(msg, DRM_RAS_A_ERROR_COUNTER_ATTRS_ERROR_ID, error_id);
>> +	if (ret)
>> +		return ret;
>> +
>> +	ret = nla_put_string(msg, DRM_RAS_A_ERROR_COUNTER_ATTRS_ERROR_NAME,
>> +			     error_name);
>> +	if (ret)
>> +		return ret;
>> +
>> +	return nla_put_u32(msg, DRM_RAS_A_ERROR_COUNTER_ATTRS_ERROR_VALUE,
>> +			   value);
>> +}
>> +
>> +static int doit_reply_value(struct genl_info *info, u32 node_id,
>> +			    u32 error_id)
>> +{
>> +	struct sk_buff *msg;
>> +	struct nlattr *hdr;
>> +	const char *error_name;
>> +	u32 value;
>> +	int ret;
>> +
>> +	msg = genlmsg_new(NLMSG_GOODSIZE, GFP_KERNEL);
>> +	if (!msg)
>> +		return -ENOMEM;
>> +
>> +	hdr = genlmsg_iput(msg, info);
>> +	if (!hdr) {
>> +		nlmsg_free(msg);
>> +		return -EMSGSIZE;
>> +	}
>> +
>> +	ret = get_node_error_counter(node_id, error_id,
>> +				     &error_name, &value);
>> +	if (ret)
>> +		return ret;
>> +
>> +	ret = msg_reply_value(msg, error_id, error_name, value);
>> +	if (ret) {
>> +		genlmsg_cancel(msg, hdr);
>> +		nlmsg_free(msg);
>> +		return ret;
>> +	}
>> +
>> +	genlmsg_end(msg, hdr);
>> +
>> +	return genlmsg_reply(msg, info);
>> +}
>> +
>> +/**
>> + * drm_ras_nl_get_error_counters_dumpit() - Dump all Error Counters
>> + * @skb: Netlink message buffer
>> + * @cb: Callback context for multi-part dumps
>> + *
>> + * Iterates over all error counters in a given Node and appends
>> + * their attributes (ID, name, value) to the given netlink message buffer.
>> + * Uses @cb->ctx to track progress in case the message buffer fills up, allowing
>> + * multi-part dump support. On buffer overflow, updates the context to resume
>> + * from the last node on the next invocation.
>> + *
>> + * Return: 0 if all errors fit in @skb, number of bytes added to @skb if
>> + *          the buffer filled up (requires multi-part continuation), or
>> + *          a negative error code on failure.
>> + */
>> +int drm_ras_nl_get_error_counters_dumpit(struct sk_buff *skb,
>> +					 struct netlink_callback *cb)
>> +{
>> +	const struct genl_info *info = genl_info_dump(cb);
>> +	struct drm_ras_ctx *ctx = (void *)cb->ctx;
>> +	struct drm_ras_node *node;
>> +	struct nlattr *hdr;
>> +	const char *error_name;
>> +	u32 node_id, error_id, value;
>> +	int ret = 0;
>> +
>> +	if (!info->attrs || !info->attrs[DRM_RAS_A_ERROR_COUNTER_ATTRS_NODE_ID])
>> +		return -EINVAL;
>> +
>> +	node_id = nla_get_u32(info->attrs[DRM_RAS_A_ERROR_COUNTER_ATTRS_NODE_ID]);
>> +
>> +	node = xa_load(&drm_ras_xa, node_id);
>> +	if (!node)
>> +		return -ENOENT;
>> +
>> +	for (error_id = max(node->error_counter_range.first, ctx->restart);
>> +	     error_id <= node->error_counter_range.last;
>> +	     error_id++) {
>> +		ret = get_node_error_counter(node_id, error_id,
>> +					     &error_name, &value);
>> +		/*
>> +		 * For non-contiguous range, driver return -ENOENT as indication
>> +		 * to skip this ID when listing all errors.
>> +		 */
>> +		if (ret == -ENOENT)
>> +			continue;
>> +		if (ret)
>> +			return ret;
>> +
>> +		hdr = genlmsg_iput(skb, info);
>> +
>> +		if (!hdr) {
>> +			ret = -EMSGSIZE;
>> +			break;
>> +		}
>> +
>> +		ret = msg_reply_value(skb, error_id, error_name, value);
>> +		if (ret) {
>> +			genlmsg_cancel(skb, hdr);
>> +			break;
>> +		}
>> +
>> +		genlmsg_end(skb, hdr);
>> +	}
>> +
>> +	if (ret == -EMSGSIZE)
>> +		ctx->restart = error_id;
>> +
>> +	return ret;
>> +}
>> +
>> +/**
>> + * drm_ras_nl_query_error_counter_doit() - Query an error counter of a node
>> + * @skb: Netlink message buffer
>> + * @info: Generic Netlink info containing attributes of the request
>> + *
>> + * Extracts the node ID and error ID from the netlink attributes and
>> + * retrieves the current value of the corresponding error counter. Sends the
>> + * result back to the requesting user via the standard Genl reply.
>> + *
>> + * Return: 0 on success, or negative errno on failure.
>> + */
>> +int drm_ras_nl_query_error_counter_doit(struct sk_buff *skb,
>> +					struct genl_info *info)
>> +{
>> +	u32 node_id, error_id;
>> +
>> +	if (!info->attrs ||
>> +	    !info->attrs[DRM_RAS_A_ERROR_COUNTER_ATTRS_NODE_ID] ||
>> +	    !info->attrs[DRM_RAS_A_ERROR_COUNTER_ATTRS_ERROR_ID])
>> +		return -EINVAL;
>> +
>> +	node_id = nla_get_u32(info->attrs[DRM_RAS_A_ERROR_COUNTER_ATTRS_NODE_ID]);
>> +	error_id = nla_get_u32(info->attrs[DRM_RAS_A_ERROR_COUNTER_ATTRS_ERROR_ID]);
>> +
>> +	return doit_reply_value(info, node_id, error_id);
>> +}
>> +
>> +/**
>> + * drm_ras_node_register() - Register a new RAS node
>> + * @node: Node structure to register
>> + *
>> + * Adds the given RAS node to the global node xarray and assigns it a unique
>> + * ID. @node->device_name, @node->node_name and @node->type must be valid.
>> + *
>> + * Return: 0 on success, or negative errno on failure.
>> + */
>> +int drm_ras_node_register(struct drm_ras_node *node)
>> +{
>> +	if (!node->device_name || !node->node_name)
>> +		return -EINVAL;
>> +
>> +	/* Currently, only Error Counter Endpoints are supported */
>> +	if (node->type != DRM_RAS_NODE_TYPE_ERROR_COUNTER)
>> +		return -EINVAL;
>> +
>> +	/* Mandatory entries for Error Counter Node */
>> +	if (node->type == DRM_RAS_NODE_TYPE_ERROR_COUNTER &&
>> +	    (!node->error_counter_range.last || !node->query_error_counter))
>> +		return -EINVAL;
>> +
>> +	return xa_alloc(&drm_ras_xa, &node->id, node, xa_limit_32b, GFP_KERNEL);
>> +}
>> +EXPORT_SYMBOL(drm_ras_node_register);
>> +
>> +/**
>> + * drm_ras_node_unregister() - Unregister a previously registered node
>> + * @node: Node structure to unregister
>> + *
>> + * Removes the given node from the global node xarray using its ID.
>> + */
>> +void drm_ras_node_unregister(struct drm_ras_node *node)
>> +{
>> +	xa_erase(&drm_ras_xa, node->id);
>> +}
>> +EXPORT_SYMBOL(drm_ras_node_unregister);
>> diff --git a/drivers/gpu/drm/drm_ras_genl_family.c b/drivers/gpu/drm/drm_ras_genl_family.c
>> new file mode 100644
>> index 000000000000..2d818b8c3808
>> --- /dev/null
>> +++ b/drivers/gpu/drm/drm_ras_genl_family.c
>> @@ -0,0 +1,42 @@
>> +// SPDX-License-Identifier: MIT
>> +/*
>> + * Copyright © 2025 Intel Corporation
>> + */
>> +
>> +#include <drm/drm_ras_genl_family.h>
>> +#include <drm/drm_ras_nl.h>
>> +
>> +/* Track family registration so that drm_core_exit() can be called at any time */
>> +static bool registered;
>> +
>> +/**
>> + * drm_ras_genl_family_register() - Register drm-ras genl family
>> + *
>> + * Only to be called once, from drm_core_init().
>> + */
>> +int drm_ras_genl_family_register(void)
>> +{
>> +	int ret;
>> +
>> +	registered = false;
>> +
>> +	ret = genl_register_family(&drm_ras_nl_family);
>> +	if (ret)
>> +		return ret;
>> +
>> +	registered = true;
>> +	return 0;
>> +}
>> +
>> +/**
>> + * drm_ras_genl_family_unregister() - Unregister drm-ras genl family
>> + *
>> + * To be called from drm_core_exit() at any moment, but only once.
>> + */
>> +void drm_ras_genl_family_unregister(void)
>> +{
>> +	if (registered) {
>> +		genl_unregister_family(&drm_ras_nl_family);
>> +		registered = false;
>> +	}
>> +}
>> diff --git a/drivers/gpu/drm/drm_ras_nl.c b/drivers/gpu/drm/drm_ras_nl.c
>> new file mode 100644
>> index 000000000000..fcd1392410e4
>> --- /dev/null
>> +++ b/drivers/gpu/drm/drm_ras_nl.c
>> @@ -0,0 +1,54 @@
>> +// SPDX-License-Identifier: ((GPL-2.0 WITH Linux-syscall-note) OR BSD-3-Clause)
>> +/* Do not edit directly, auto-generated from: */
>> +/*	Documentation/netlink/specs/drm_ras.yaml */
>> +/* YNL-GEN kernel source */
>> +
>> +#include <net/netlink.h>
>> +#include <net/genetlink.h>
>> +
>> +#include <uapi/drm/drm_ras.h>
>> +#include <drm/drm_ras_nl.h>
>> +
>> +/* DRM_RAS_CMD_GET_ERROR_COUNTERS - dump */
>> +static const struct nla_policy drm_ras_get_error_counters_nl_policy[DRM_RAS_A_ERROR_COUNTER_ATTRS_NODE_ID + 1] = {
>> +	[DRM_RAS_A_ERROR_COUNTER_ATTRS_NODE_ID] = { .type = NLA_U32, },
>> +};
>> +
>> +/* DRM_RAS_CMD_QUERY_ERROR_COUNTER - do */
>> +static const struct nla_policy drm_ras_query_error_counter_nl_policy[DRM_RAS_A_ERROR_COUNTER_ATTRS_ERROR_ID + 1] = {
>> +	[DRM_RAS_A_ERROR_COUNTER_ATTRS_NODE_ID] = { .type = NLA_U32, },
>> +	[DRM_RAS_A_ERROR_COUNTER_ATTRS_ERROR_ID] = { .type = NLA_U32, },
>> +};
>> +
>> +/* Ops table for drm_ras */
>> +static const struct genl_split_ops drm_ras_nl_ops[] = {
>> +	{
>> +		.cmd	= DRM_RAS_CMD_LIST_NODES,
>> +		.dumpit	= drm_ras_nl_list_nodes_dumpit,
>> +		.flags	= GENL_ADMIN_PERM | GENL_CMD_CAP_DUMP,
>> +	},
>> +	{
>> +		.cmd		= DRM_RAS_CMD_GET_ERROR_COUNTERS,
>> +		.dumpit		= drm_ras_nl_get_error_counters_dumpit,
>> +		.policy		= drm_ras_get_error_counters_nl_policy,
>> +		.maxattr	= DRM_RAS_A_ERROR_COUNTER_ATTRS_NODE_ID,
>> +		.flags		= GENL_ADMIN_PERM | GENL_CMD_CAP_DUMP,
>> +	},
>> +	{
>> +		.cmd		= DRM_RAS_CMD_QUERY_ERROR_COUNTER,
>> +		.doit		= drm_ras_nl_query_error_counter_doit,
>> +		.policy		= drm_ras_query_error_counter_nl_policy,
>> +		.maxattr	= DRM_RAS_A_ERROR_COUNTER_ATTRS_ERROR_ID,
>> +		.flags		= GENL_ADMIN_PERM | GENL_CMD_CAP_DO,
>> +	},
>> +};
>> +
>> +struct genl_family drm_ras_nl_family __ro_after_init = {
>> +	.name		= DRM_RAS_FAMILY_NAME,
>> +	.version	= DRM_RAS_FAMILY_VERSION,
>> +	.netnsok	= true,
>> +	.parallel_ops	= true,
>> +	.module		= THIS_MODULE,
>> +	.split_ops	= drm_ras_nl_ops,
>> +	.n_split_ops	= ARRAY_SIZE(drm_ras_nl_ops),
>> +};
>> diff --git a/include/drm/drm_ras.h b/include/drm/drm_ras.h
>> new file mode 100644
>> index 000000000000..bba47a282ef8
>> --- /dev/null
>> +++ b/include/drm/drm_ras.h
>> @@ -0,0 +1,76 @@
>> +/* SPDX-License-Identifier: MIT */
>> +/*
>> + * Copyright © 2025 Intel Corporation
>> + */
>> +
>> +#ifndef __DRM_RAS_H__
>> +#define __DRM_RAS_H__
>> +
>> +#include "drm_ras_nl.h"
>> +
>> +/**
>> + * struct drm_ras_node - A DRM RAS Node
>> + */
>> +struct drm_ras_node {
>> +	/** @id: Unique identifier for the node. Dynamically assigned. */
>> +	u32 id;
>> +	/**
>> +	 * @device_name: Human-readable name of the device. Given by the driver.
>> +	 */
>> +	const char *device_name;
>> +	/** @node_name: Human-readable name of the node. Given by the driver. */
>> +	const char *node_name;
>> +	/** @type: Type of the node (enum drm_ras_node_type). */
>> +	enum drm_ras_node_type type;
>> +
>> +	/* Error-Counter Related Callback and Variables */
>> +
>> +	/** @error_counter_range: Range of valid Error IDs for this node. */
>> +	struct {
>> +		/** @first: First valid Error ID. */
>> +		u32 first;
>> +		/** @last: Last valid Error ID. Mandatory entry. */
>> +		u32 last;
>> +	} error_counter_range;
>> +
>> +	/**
>> +	 * @query_error_counter:
>> +	 *
>> +	 * This callback is used by drm-ras to query a specific error counter
>> +	 * supported by this node, both for single-counter queries and when
>> +	 * iterating over all counters.
>> +	 *
>> +	 * Driver should expect query_error_counter() to be called with
>> +	 * error_id from `error_counter_range.first` to
>> +	 * `error_counter_range.last`.
>> +	 *
>> +	 * The @query_error_counter is a mandatory callback for
>> +	 * error_counter_node.
>> +	 *
>> +	 * Returns: 0 on success,
>> +	 *          -ENOENT when error_id is not supported as an indication that
>> +	 *                  drm_ras should silently skip this entry. Used for
>> +	 *                  supporting non-contiguous error ranges.
>> +	 *                  Driver is responsible for maintaining the list of
>> +	 *                  supported error IDs in the range of first to last.
>> +	 *          Other negative values on errors that should terminate the
>> +	 *          netlink query.
>> +	 */
>> +	int (*query_error_counter)(struct drm_ras_node *ep, u32 error_id,
>> +				   const char **name, u32 *val);
>> +
>> +	/** @priv: Driver private data */
>> +	void *priv;
>> +};
>> +

If new node types are frequently added, this struct may contain many
unused fields. It seems like the necessary members for any given node
type are: id, device_name, node_name, type, and priv. However, since
this functionality is designed specifically for RAS, I think it's OK.
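For reference only, if the unused fields ever did become a concern, one option would be to keep the common members at the top and move the per-type members into a union keyed by the node type. The sketch below is a userspace illustration with assumed names throughout: `demo_node`, the `DEMO_NODE_TYPE_TELEMETRY` type and its `read` callback are hypothetical and not part of this series.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>
#include <stdint.h>

enum demo_node_type {
	DEMO_NODE_TYPE_ERROR_COUNTER = 1,
	DEMO_NODE_TYPE_TELEMETRY,	/* hypothetical future type */
};

struct demo_node {
	/* Members every node type needs. */
	uint32_t id;
	const char *device_name;
	const char *node_name;
	enum demo_node_type type;
	void *priv;

	/* Per-type members; only the one matching @type is meaningful. */
	union {
		struct {
			uint32_t first, last;
			int (*query)(struct demo_node *node, uint32_t error_id,
				     const char **name, uint32_t *val);
		} error_counter;
		struct {
			int (*read)(struct demo_node *node, uint32_t *val);
		} telemetry;
	};
};

/* Example telemetry callback, used only to exercise the validation. */
static int demo_read_zero(struct demo_node *node, uint32_t *val)
{
	(void)node;
	*val = 0;
	return 0;
}

/* Registration-time check: validate only the members relevant to @type. */
static int demo_node_validate(const struct demo_node *node)
{
	assert(node);

	if (!node->device_name || !node->node_name)
		return -EINVAL;

	switch (node->type) {
	case DEMO_NODE_TYPE_ERROR_COUNTER:
		if (!node->error_counter.last || !node->error_counter.query)
			return -EINVAL;
		return 0;
	case DEMO_NODE_TYPE_TELEMETRY:
		return node->telemetry.read ? 0 : -EINVAL;
	default:
		return -EINVAL;
	}
}
```

The trade-off is that every access to a per-type member then has to be guarded by a type check, which may not be worth it while there is only one node type.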

>> +struct drm_device;
>> +
>> +#if IS_ENABLED(CONFIG_DRM_RAS)
>> +int drm_ras_node_register(struct drm_ras_node *ep);
>> +void drm_ras_node_unregister(struct drm_ras_node *ep);
>> +#else
>> +static inline int drm_ras_node_register(struct drm_ras_node *ep) { return 0; }
>> +static inline void drm_ras_node_unregister(struct drm_ras_node *ep) { }
>> +#endif
>> +
>> +#endif
>> diff --git a/include/drm/drm_ras_genl_family.h b/include/drm/drm_ras_genl_family.h
>> new file mode 100644
>> index 000000000000..5931b53429f1
>> --- /dev/null
>> +++ b/include/drm/drm_ras_genl_family.h
>> @@ -0,0 +1,17 @@
>> +/* SPDX-License-Identifier: MIT */
>> +/*
>> + * Copyright © 2025 Intel Corporation
>> + */
>> +
>> +#ifndef __DRM_RAS_GENL_FAMILY_H__
>> +#define __DRM_RAS_GENL_FAMILY_H__
>> +
>> +#if IS_ENABLED(CONFIG_DRM_RAS)
>> +int drm_ras_genl_family_register(void);
>> +void drm_ras_genl_family_unregister(void);
>> +#else
>> +static inline int drm_ras_genl_family_register(void) { return 0; }
>> +static inline void drm_ras_genl_family_unregister(void) { }
>> +#endif
>> +
>> +#endif
>> diff --git a/include/drm/drm_ras_nl.h b/include/drm/drm_ras_nl.h
>> new file mode 100644
>> index 000000000000..9613b7d9ffdb
>> --- /dev/null
>> +++ b/include/drm/drm_ras_nl.h
>> @@ -0,0 +1,24 @@
>> +/* SPDX-License-Identifier: ((GPL-2.0 WITH Linux-syscall-note) OR BSD-3-Clause) */
>> +/* Do not edit directly, auto-generated from: */
>> +/*	Documentation/netlink/specs/drm_ras.yaml */
>> +/* YNL-GEN kernel header */
>> +
>> +#ifndef _LINUX_DRM_RAS_GEN_H
>> +#define _LINUX_DRM_RAS_GEN_H
>> +
>> +#include <net/netlink.h>
>> +#include <net/genetlink.h>
>> +
>> +#include <uapi/drm/drm_ras.h>
>> +#include <drm/drm_ras_nl.h>
>> +
>> +int drm_ras_nl_list_nodes_dumpit(struct sk_buff *skb,
>> +				 struct netlink_callback *cb);
>> +int drm_ras_nl_get_error_counters_dumpit(struct sk_buff *skb,
>> +					 struct netlink_callback *cb);
>> +int drm_ras_nl_query_error_counter_doit(struct sk_buff *skb,
>> +					struct genl_info *info);
>> +
>> +extern struct genl_family drm_ras_nl_family;
>> +
>> +#endif /* _LINUX_DRM_RAS_GEN_H */
>> diff --git a/include/uapi/drm/drm_ras.h b/include/uapi/drm/drm_ras.h
>> new file mode 100644
>> index 000000000000..3415ba345ac8
>> --- /dev/null
>> +++ b/include/uapi/drm/drm_ras.h
>> @@ -0,0 +1,49 @@
>> +/* SPDX-License-Identifier: ((GPL-2.0 WITH Linux-syscall-note) OR BSD-3-Clause) */
>> +/* Do not edit directly, auto-generated from: */
>> +/*	Documentation/netlink/specs/drm_ras.yaml */
>> +/* YNL-GEN uapi header */
>> +
>> +#ifndef _UAPI_LINUX_DRM_RAS_H
>> +#define _UAPI_LINUX_DRM_RAS_H
>> +
>> +#define DRM_RAS_FAMILY_NAME	"drm-ras"
>> +#define DRM_RAS_FAMILY_VERSION	1
>> +
>> +/*
>> + * Type of the node. Currently, only error-counter nodes are supported, which
>> + * expose reliability counters for a hardware/software component.
>> + */
>> +enum drm_ras_node_type {
>> +	DRM_RAS_NODE_TYPE_ERROR_COUNTER = 1,
>> +};
>> +
>> +enum {
>> +	DRM_RAS_A_NODE_ATTRS_NODE_ID = 1,
>> +	DRM_RAS_A_NODE_ATTRS_DEVICE_NAME,
>> +	DRM_RAS_A_NODE_ATTRS_NODE_NAME,
>> +	DRM_RAS_A_NODE_ATTRS_NODE_TYPE,
>> +
>> +	__DRM_RAS_A_NODE_ATTRS_MAX,
>> +	DRM_RAS_A_NODE_ATTRS_MAX = (__DRM_RAS_A_NODE_ATTRS_MAX - 1)
>> +};
>> +
>> +enum {
>> +	DRM_RAS_A_ERROR_COUNTER_ATTRS_NODE_ID = 1,
>> +	DRM_RAS_A_ERROR_COUNTER_ATTRS_ERROR_ID,
>> +	DRM_RAS_A_ERROR_COUNTER_ATTRS_ERROR_NAME,
>> +	DRM_RAS_A_ERROR_COUNTER_ATTRS_ERROR_VALUE,
>> +
>> +	__DRM_RAS_A_ERROR_COUNTER_ATTRS_MAX,
>> +	DRM_RAS_A_ERROR_COUNTER_ATTRS_MAX = (__DRM_RAS_A_ERROR_COUNTER_ATTRS_MAX - 1)
>> +};
>> +
>> +enum {
>> +	DRM_RAS_CMD_LIST_NODES = 1,
>> +	DRM_RAS_CMD_GET_ERROR_COUNTERS,
>> +	DRM_RAS_CMD_QUERY_ERROR_COUNTER,
>> +
>> +	__DRM_RAS_CMD_MAX,
>> +	DRM_RAS_CMD_MAX = (__DRM_RAS_CMD_MAX - 1)
>> +};
>> +
>> +#endif /* _UAPI_LINUX_DRM_RAS_H */
>> -- 
>> 2.47.1
>>

Thanks,

Zack


^ permalink raw reply	[flat|nested] 31+ messages in thread

* Re: [PATCH v3 1/4] drm/ras: Introduce the DRM RAS infrastructure over generic netlink
  2026-01-08 22:36     ` Zack McKevitt
@ 2026-01-09 20:57       ` Rodrigo Vivi
  2026-01-13  8:20         ` Riana Tauro
  0 siblings, 1 reply; 31+ messages in thread
From: Rodrigo Vivi @ 2026-01-09 20:57 UTC (permalink / raw)
  To: Zack McKevitt
  Cc: Riana Tauro, Jakub Kicinski, intel-xe, dri-devel,
	aravind.iddamsetty, anshuman.gupta, joonas.lahtinen, lukas,
	simona.vetter, airlied, pratik.bari, joshua.santosh.ranjan,
	ashwin.kumar.kulkarni, shubham.kumar, Lijo Lazar, Hawking Zhang,
	David S. Miller, Paolo Abeni, Eric Dumazet, netdev

On Thu, Jan 08, 2026 at 03:36:45PM -0700, Zack McKevitt wrote:
> 
> 
> On 12/9/2025 2:35 PM, Rodrigo Vivi wrote:
> 
> Apologies for the delay getting back to this. We are still supportive of
> this functionality making it into the DRM subsystem but have a couple of
> questions.
> 
> > On Fri, Dec 05, 2025 at 02:09:33PM +0530, Riana Tauro wrote:
> > > From: Rodrigo Vivi <rodrigo.vivi@intel.com>
> > > 
> > > Introduces the DRM RAS infrastructure over generic netlink.
> > > 
> > > The new interface allows drivers to expose RAS nodes and their
> > > associated error counters to userspace in a structured and extensible
> > > way. Each drm_ras node can register its own set of error counters, which
> > > are then discoverable and queryable through netlink operations. This
> > > lays the groundwork for reporting and managing hardware error states
> > > in a unified manner across different DRM drivers.
> > > 
> > > Currently it only supports error-counter nodes, but it can be
> > > extended later.
> > > 
> > > The registration is also not tied to any drm node, so it can be
> > > used by accel devices as well.
> 
> Thank you for including the userspace reference implementation. I have
> begun prototyping an extension for our qaic accel driver to incorporate
> telemetry functionality by adding a new node type to drm_ras. Overall,
> extending the interface is intuitive.

making it extensible was one of the main goals here...

> 
> > > 
> > > It uses the new and mandatory YAML description format stored in
> > > Documentation/netlink/specs/. This forces a single generic netlink
> > > family namespace for the entire drm: "drm-ras".
> > > But multiple-endpoints are supported within the single family.
> > > 
> > > Any modification to this API needs to be applied to
> > > Documentation/netlink/specs/drm_ras.yaml before regenerating the
> > > code:
> > > 
> > > $ tools/net/ynl/pyynl/ynl_gen_c.py --spec \
> > >   Documentation/netlink/specs/drm_ras.yaml --mode uapi --header \
> > >   > include/uapi/drm/drm_ras.h
> > > 
> > > $ tools/net/ynl/pyynl/ynl_gen_c.py --spec \
> > >   Documentation/netlink/specs/drm_ras.yaml --mode kernel --header \
> > >   > include/drm/drm_ras_nl.h
> > > 
> > > $ tools/net/ynl/pyynl/ynl_gen_c.py --spec \
> > >   Documentation/netlink/specs/drm_ras.yaml --mode kernel --source \
> > >   > drivers/gpu/drm/drm_ras_nl.c
> > > 
> > > Cc: Zack McKevitt <zachary.mckevitt@oss.qualcomm.com>
> > > Cc: Lukas Wunner <lukas@wunner.de>
> > > Cc: Lijo Lazar <lijo.lazar@amd.com>
> > > Cc: Hawking Zhang <Hawking.Zhang@amd.com>
> > > Cc: Jakub Kicinski <kuba@kernel.org>
> > > Cc: David S. Miller <davem@davemloft.net>
> > > Cc: Paolo Abeni <pabeni@redhat.com>
> > > Cc: Eric Dumazet <edumazet@google.com>
> > > Cc: netdev@vger.kernel.org
> > > Co-developed-by: Aravind Iddamsetty <aravind.iddamsetty@linux.intel.com>
> > > Signed-off-by: Aravind Iddamsetty <aravind.iddamsetty@linux.intel.com>
> > > Signed-off-by: Rodrigo Vivi <rodrigo.vivi@intel.com>
> > > Signed-off-by: Riana Tauro <riana.tauro@intel.com>
> > > ---
> > > v2: fix doc and memory leak
> > >      use xa_for_each_start
> > >      use standard genlmsg_iput (Jakub Kicinski)
> > > 
> > > v3: add documentation to index
> > >      modify documentation to mention uAPI requirements (Rodrigo)
> > > ---
> > >   Documentation/gpu/drm-ras.rst            | 109 +++++++
> > >   Documentation/gpu/index.rst              |   1 +
> > >   Documentation/netlink/specs/drm_ras.yaml | 130 +++++++++
> > >   drivers/gpu/drm/Kconfig                  |   9 +
> > >   drivers/gpu/drm/Makefile                 |   1 +
> > >   drivers/gpu/drm/drm_drv.c                |   6 +
> > >   drivers/gpu/drm/drm_ras.c                | 351 +++++++++++++++++++++++
> > >   drivers/gpu/drm/drm_ras_genl_family.c    |  42 +++
> > >   drivers/gpu/drm/drm_ras_nl.c             |  54 ++++
> > >   include/drm/drm_ras.h                    |  76 +++++
> > >   include/drm/drm_ras_genl_family.h        |  17 ++
> > >   include/drm/drm_ras_nl.h                 |  24 ++
> > >   include/uapi/drm/drm_ras.h               |  49 ++++
> > >   13 files changed, 869 insertions(+)
> > >   create mode 100644 Documentation/gpu/drm-ras.rst
> > >   create mode 100644 Documentation/netlink/specs/drm_ras.yaml
> > >   create mode 100644 drivers/gpu/drm/drm_ras.c
> > >   create mode 100644 drivers/gpu/drm/drm_ras_genl_family.c
> > >   create mode 100644 drivers/gpu/drm/drm_ras_nl.c
> > >   create mode 100644 include/drm/drm_ras.h
> > >   create mode 100644 include/drm/drm_ras_genl_family.h
> > >   create mode 100644 include/drm/drm_ras_nl.h
> > >   create mode 100644 include/uapi/drm/drm_ras.h
> > > 
> > > diff --git a/Documentation/gpu/drm-ras.rst b/Documentation/gpu/drm-ras.rst
> > > new file mode 100644
> > > index 000000000000..cec60cf5d17d
> > > --- /dev/null
> > > +++ b/Documentation/gpu/drm-ras.rst
> > > @@ -0,0 +1,109 @@
> > > +.. SPDX-License-Identifier: GPL-2.0+
> > > +
> > > +============================
> > > +DRM RAS over Generic Netlink
> > > +============================
> > > +
> > > +The DRM RAS (Reliability, Availability, Serviceability) interface provides a
> > > +standardized way for GPU/accelerator drivers to expose error counters and
> > > +other reliability nodes to user space via Generic Netlink. This allows
> > > +diagnostic tools, monitoring daemons, or test infrastructure to query hardware
> > > +health in a uniform way across different DRM drivers.
> > > +
> > > +Key Goals:
> > > +
> > > +* Provide a standardized RAS solution for GPU and accelerator drivers, enabling
> > > +  data center monitoring and reliability operations.
> > > +* Implement a single drm-ras Generic Netlink family to meet modern Netlink YAML
> > > +  specifications and centralize all RAS-related communication in one namespace.
> > > +* Support a basic error counter interface, addressing the immediate, essential
> > > +  monitoring needs.
> > > +* Offer a flexible, future-proof interface that can be extended to support
> > > +  additional types of RAS data in the future.
> > > +* Allow multiple nodes per driver, enabling drivers to register separate
> > > +  nodes for different IP blocks, sub-blocks, or other logical subdivisions
> > > +  as applicable.
> > > +
> > > +Nodes
> > > +=====
> > > +
> > > +Nodes are logical abstractions representing an error source or block within
> > > +the device. Currently, only error counter nodes are supported.
> > > +
> > > +Drivers are responsible for registering and unregistering nodes via the
> > > +`drm_ras_node_register()` and `drm_ras_node_unregister()` APIs.
> > > +
> > > +Node Management
> > > +-------------------
> > > +
> > > +.. kernel-doc:: drivers/gpu/drm/drm_ras.c
> > > +   :doc: DRM RAS Node Management
> > > +.. kernel-doc:: drivers/gpu/drm/drm_ras.c
> > > +   :internal:
> > > +
> > > +Generic Netlink Usage
> > > +=====================
> > > +
> > > +The interface is implemented as a Generic Netlink family named ``drm-ras``.
> > > +User space tools can:
> > > +
> > > +* List registered nodes with the ``list-nodes`` command.
> > > +* List all error counters in a node with the ``get-error-counters`` command.
> > > +* Query error counters using the ``query-error-counter`` command.
> > > +
> > > +YAML-based Interface
> > > +--------------------
> > > +
> > > +The interface is described in a YAML specification:
> > > +
> > > +:ref:`Documentation/netlink/specs/drm_ras.yaml`
> > > +
> > > +This YAML is used to auto-generate user space bindings via
> > > +``tools/net/ynl/pyynl/ynl_gen_c.py``, and drives the structure of netlink
> > > +attributes and operations.
> > > +
> > > +Usage Notes
> > > +-----------
> > > +
> > > +* User space must first enumerate nodes to obtain their IDs.
> > > +* Node IDs or Node names can be used for all further queries, such as error counters.
> > > +* Error counters can be queried by either the Error ID or Error name.
> > > +* Query Parameters should be defined as part of the uAPI to ensure user interface stability.
> > > +* The interface supports future extension by adding new node types and
> > > +  additional attributes.
> > > +
> > > +Example: List nodes using ynl
> > > +
> > > +.. code-block:: bash
> > > +
> > > +    sudo ynl --family drm_ras  --dump list-nodes
> > > +    [{'device-name': '0000:03:00.0',
> > > +    'node-id': 0,
> > > +    'node-name': 'correctable-errors',
> > > +    'node-type': 'error-counter'},
> > > +    {'device-name': '0000:03:00.0',
> > > +     'node-id': 1,
> > > +    'node-name': 'nonfatal-errors',
> > > +    'node-type': 'error-counter'},
> > > +    {'device-name': '0000:03:00.0',
> > > +    'node-id': 2,
> > > +    'node-name': 'fatal-errors',
> > > +    'node-type': 'error-counter'}]
> > > +
> > > +Example: List all error counters using ynl
> > > +
> > > +.. code-block:: bash
> > > +
> > > +
> > > +   sudo ynl --family drm_ras  --dump get-error-counters --json '{"node-id":1}'
> > > +   [{'error-id': 1, 'error-name': 'error_name_1', 'error-value': 0},
> > > +   {'error-id': 2, 'error-name': 'error_name_2', 'error-value': 0}]
> > > +
> > > +
> > > +Example: Query an error counter for a given node
> > > +
> > > +.. code-block:: bash
> > > +
> > > +   sudo ynl --family drm_ras --do query-error-counter  --json '{"node-id":2, "error-id":1}'
> > > +   {'error-id': 1, 'error-name': 'error_name_1', 'error-value': 0}
> > > +
> > > diff --git a/Documentation/gpu/index.rst b/Documentation/gpu/index.rst
> > > index 7dcb15850afd..60c73fdcfeed 100644
> > > --- a/Documentation/gpu/index.rst
> > > +++ b/Documentation/gpu/index.rst
> > > @@ -9,6 +9,7 @@ GPU Driver Developer's Guide
> > >      drm-mm
> > >      drm-kms
> > >      drm-kms-helpers
> > > +   drm-ras
> > >      drm-uapi
> > >      drm-usage-stats
> > >      driver-uapi
> > > diff --git a/Documentation/netlink/specs/drm_ras.yaml b/Documentation/netlink/specs/drm_ras.yaml
> > > new file mode 100644
> > > index 000000000000..be0e379c5bc9
> > > --- /dev/null
> > > +++ b/Documentation/netlink/specs/drm_ras.yaml
> > > @@ -0,0 +1,130 @@
> > > +# SPDX-License-Identifier: ((GPL-2.0 WITH Linux-syscall-note) OR BSD-3-Clause)
> > > +---
> > > +name: drm-ras
> > > +protocol: genetlink
> > > +uapi-header: drm/drm_ras.h
> > > +
> > > +doc: >-
> > > +  DRM RAS (Reliability, Availability, Serviceability) over Generic Netlink.
> > > +  Provides a standardized mechanism for DRM drivers to register "nodes"
> > > +  representing hardware/software components capable of reporting error counters.
> > > +  Userspace tools can query the list of nodes or individual error counters
> > > +  via the Generic Netlink interface.
> > > +
> > > +definitions:
> > > +  -
> > > +    type: enum
> > > +    name: node-type
> > > +    value-start: 1
> > > +    entries: [error-counter]
> > > +    doc: >-
> > > +         Type of the node. Currently, only error-counter nodes are
> > > +         supported, which expose reliability counters for a hardware/software
> > > +         component.
> > > +
> > > +attribute-sets:
> > > +  -
> > > +    name: node-attrs
> > > +    attributes:
> > > +      -
> > > +        name: node-id
> > > +        type: u32
> > > +        doc: >-
> > > +             Unique identifier for the node.
> > > +             Assigned dynamically by the DRM RAS core upon registration.
> > > +      -
> > > +        name: device-name
> > > +        type: string
> > > +        doc: >-
> > > +             Device name chosen by the driver at registration.
> > > +             Can be a PCI BDF, UUID, or module name if unique.
> > > +      -
> > > +        name: node-name
> > > +        type: string
> > > +        doc: >-
> > > +             Node name chosen by the driver at registration.
> > > +             Can be an IP block name, or any name that identifies the
> > > +             RAS node inside the device.
> > > +      -
> > > +        name: node-type
> > > +        type: u32
> > > +        doc: Type of this node, identifying its function.
> > > +        enum: node-type
> > > +  -
> > > +    name: error-counter-attrs
> > > +    attributes:
> > > +      -
> > > +        name: node-id
> > > +        type: u32
> > > +        doc:  Node ID targeted by this error counter operation.
> > > +      -
> > > +        name: error-id
> > > +        type: u32
> > > +        doc: Unique identifier for a specific error counter within a node.
> > > +      -
> > > +        name: error-name
> > > +        type: string
> > > +        doc: Name of the error.
> > > +      -
> > > +        name: error-value
> > > +        type: u32
> > > +        doc: Current value of the requested error counter.
> > > +
> > > +operations:
> > > +  list:
> > > +    -
> > > +      name: list-nodes
> > > +      doc: >-
> > > +           Retrieve the full list of currently registered DRM RAS nodes.
> > > +           Each node includes its dynamically assigned ID, name, and type.
> > > +           **Important:** User space must call this operation first to obtain
> > > +           the node IDs. These IDs are required for all subsequent
> > > +           operations on nodes, such as querying error counters.
> 
> I am curious about security implications of this design.

hmm... very good point you are raising here.

This current design relies entirely on CAP_NET_ADMIN.
No driver would have the flexibility to choose anything differently.
Please notice that the flag admin-perm is hardcoded in this yaml file.

> If the complete
> list of RAS nodes is visible for any process on the system (and one wants to
> avoid requiring CAP_NET_ADMIN), there should be some way to enforce
> permission checks when performing these operations if desired.

Right now, there's no way for a driver to opt out of requiring
CAP_NET_ADMIN...

The only way would be for the admin to grant cap_net_admin to the tool with:

$ sudo setcap cap_net_admin+ep /bin/drm_ras_tool

but that is not ideal and not granular anyway...

> 
> For example, this might be implemented in the driver's definition of
> callback functions like query_error_counter; some drivers may want to ensure
> that the process can in fact open the file descriptor corresponding to the
> queried device before serving a netlink request. Is it enough for a driver
> to simply return -EPERM in this case? Any driver that doesn't wish to protect
> its RAS nodes need not implement checks in their callbacks.

Fair enough. If we want to give the option to the drivers, then we need:

1. to first remove all the admin-perm flags below and let each driver
pick its own policy on when to return data or -EPERM.
2. Document this security responsibility and list a few possibilities.
3. In our Xe case here I believe the easiest option is to use something like:

struct scm_creds *creds = NETLINK_CREDS(cb->skb);
if (!gid_eq(creds->gid, GLOBAL_ROOT_GID))
    return -EPERM;

or something like that?!

Perhaps drivers could implement some form of cookie or pre-authorization via
ioctls or sysfs, and then store the result in the node's priv?
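
To make that last idea a bit more concrete, here is a rough userspace model in C (not kernel code). It assumes the driver keeps an `authorized` flag in the node's priv, set through some driver-specific ioctl or sysfs path, and checks it in its query_error_counter() callback. `struct ras_node_model`, `struct demo_ras_priv` and `demo_query_error_counter()` are made-up names for illustration only; just the callback shape is taken from the patch.

```c
#include <errno.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Userspace stand-in for the fields of struct drm_ras_node used here. */
struct ras_node_model {
	int (*query_error_counter)(struct ras_node_model *node, uint32_t error_id,
				   const char **name, uint32_t *val);
	void *priv;
};

/* Hypothetical driver-private state hung off node->priv. */
struct demo_ras_priv {
	bool authorized;	/* assumed to be set via a driver ioctl/sysfs path */
	uint32_t counters[2];
};

static const char * const demo_names[] = { "error_name_1", "error_name_2" };

/* Hypothetical callback: driver policy check first, then the usual lookup. */
static int demo_query_error_counter(struct ras_node_model *node, uint32_t error_id,
				    const char **name, uint32_t *val)
{
	struct demo_ras_priv *priv = node->priv;

	if (!priv->authorized)
		return -EPERM;
	if (error_id < 1 || error_id > 2)
		return -ENOENT;

	*name = demo_names[error_id - 1];
	*val = priv->counters[error_id - 1];
	return 0;
}
```

The obvious limitation is that the flag is per node, not per requesting process, since the callback as defined in this patch does not see who sent the netlink request.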

Thoughts?
Other options?

> 
> I don't see any such permissions checks in your driver implementation which
> is understandable given that it may not be necessary for your use cases.
> However, this would be a concern for our driver if we were to adopt this
> interface.

yeap, this case was entirely with admin-perm, so not needed at all...
But I see your point and this is really not giving any flexibility to
other drivers.

> 
> > > +      attribute-set: node-attrs
> > > +      flags: [admin-perm]
> > > +      dump:
> > > +        reply:
> > > +          attributes:
> > > +            - node-id
> > > +            - device-name
> > > +            - node-name
> > > +            - node-type
> > > +    -
> > > +      name: get-error-counters
> > > +      doc: >-
> > > +           Retrieve the full list of error counters for a given node.
> > > +           The response includes the id, the name, and even the current
> > > +           value of each counter.
> > > +      attribute-set: error-counter-attrs
> > > +      flags: [admin-perm]
> > > +      dump:
> > > +        request:
> > > +          attributes:
> > > +            - node-id
> > > +        reply:
> > > +          attributes:
> > > +            - error-id
> > > +            - error-name
> > > +            - error-value
> > > +    -
> > > +      name: query-error-counter
> > > +      doc: >-
> > > +           Query the information of a specific error counter for a given node.
> > > +           Users must provide the node ID and the error counter ID.
> > > +           The response contains the id, the name, and the current value
> > > +           of the counter.
> > > +      attribute-set: error-counter-attrs
> > > +      flags: [admin-perm]
> > > +      do:
> > > +        request:
> > > +          attributes:
> > > +            - node-id
> > > +            - error-id
> > > +        reply:
> > > +          attributes:
> > > +            - error-id
> > > +            - error-name
> > > +            - error-value
> > > +
> > > +kernel-family:
> > > +  headers: ["drm/drm_ras_nl.h"]
> > > diff --git a/drivers/gpu/drm/Kconfig b/drivers/gpu/drm/Kconfig
> > > index 7e6bc0b3a589..5cfb23b80441 100644
> > > --- a/drivers/gpu/drm/Kconfig
> > > +++ b/drivers/gpu/drm/Kconfig
> > > @@ -130,6 +130,15 @@ config DRM_PANIC_SCREEN_QR_VERSION
> > >   	  Smaller QR code are easier to read, but will contain less debugging
> > >   	  data. Default is 40.
> > > +config DRM_RAS
> > > +	bool "DRM RAS support"
> > > +	depends on DRM
> > > +	help
> > > +	  Enables the DRM RAS (Reliability, Availability and Serviceability)
> > > +	  support for DRM drivers. This provides a Generic Netlink interface
> > > +	  for error reporting and queries.
> > > +	  If in doubt, say "N".
> > > +
> > >   config DRM_DEBUG_DP_MST_TOPOLOGY_REFS
> > >           bool "Enable refcount backtrace history in the DP MST helpers"
> > >   	depends on STACKTRACE_SUPPORT
> > > diff --git a/drivers/gpu/drm/Makefile b/drivers/gpu/drm/Makefile
> > > index 4b3f3ad5058a..cd19573b2d9f 100644
> > > --- a/drivers/gpu/drm/Makefile
> > > +++ b/drivers/gpu/drm/Makefile
> > > @@ -95,6 +95,7 @@ drm-$(CONFIG_DRM_ACCEL) += ../../accel/drm_accel.o
> > >   drm-$(CONFIG_DRM_PANIC) += drm_panic.o
> > >   drm-$(CONFIG_DRM_DRAW) += drm_draw.o
> > >   drm-$(CONFIG_DRM_PANIC_SCREEN_QR_CODE) += drm_panic_qr.o
> > > +drm-$(CONFIG_DRM_RAS) += drm_ras.o drm_ras_nl.o drm_ras_genl_family.o
> > >   obj-$(CONFIG_DRM)	+= drm.o
> > >   obj-$(CONFIG_DRM_PANEL_ORIENTATION_QUIRKS) += drm_panel_orientation_quirks.o
> > > diff --git a/drivers/gpu/drm/drm_drv.c b/drivers/gpu/drm/drm_drv.c
> > > index 2915118436ce..6b965c3d3307 100644
> > > --- a/drivers/gpu/drm/drm_drv.c
> > > +++ b/drivers/gpu/drm/drm_drv.c
> > > @@ -53,6 +53,7 @@
> > >   #include <drm/drm_panic.h>
> > >   #include <drm/drm_print.h>
> > >   #include <drm/drm_privacy_screen_machine.h>
> > > +#include <drm/drm_ras_genl_family.h>
> > >   #include "drm_crtc_internal.h"
> > >   #include "drm_internal.h"
> > > @@ -1223,6 +1224,7 @@ static const struct file_operations drm_stub_fops = {
> > >   static void drm_core_exit(void)
> > >   {
> > > +	drm_ras_genl_family_unregister();
> > >   	drm_privacy_screen_lookup_exit();
> > >   	drm_panic_exit();
> > >   	accel_core_exit();
> > > @@ -1261,6 +1263,10 @@ static int __init drm_core_init(void)
> > >   	drm_privacy_screen_lookup_init();
> > > +	ret = drm_ras_genl_family_register();
> > > +	if (ret < 0)
> > > +		goto error;
> > > +
> > >   	drm_core_init_complete = true;
> > >   	DRM_DEBUG("Initialized\n");
> > > diff --git a/drivers/gpu/drm/drm_ras.c b/drivers/gpu/drm/drm_ras.c
> > > new file mode 100644
> > > index 000000000000..32f3897ce580
> > > --- /dev/null
> > > +++ b/drivers/gpu/drm/drm_ras.c
> > > @@ -0,0 +1,351 @@
> > > +// SPDX-License-Identifier: MIT
> > > +/*
> > > + * Copyright © 2025 Intel Corporation
> > > + */
> > > +
> > > +#include <linux/module.h>
> > > +#include <linux/kernel.h>
> > > +#include <linux/netdevice.h>
> > > +#include <linux/xarray.h>
> > > +#include <net/genetlink.h>
> > > +
> > > +#include <drm/drm_ras.h>
> > > +
> > > +/**
> > > + * DOC: DRM RAS Node Management
> > > + *
> > > + * This module provides the infrastructure to manage RAS (Reliability,
> > > + * Availability, and Serviceability) nodes for DRM drivers. Each
> > > + * DRM driver may register one or more RAS nodes, which represent
> > > + * logical components capable of reporting error counters and other
> > > + * reliability metrics.
> > > + *
> > > + * The nodes are stored in a global xarray `drm_ras_xa` to allow
> > > + * efficient lookup by ID. Nodes can be registered or unregistered
> > > + * dynamically at runtime.
> > > + *
> > > + * A Generic Netlink family `drm_ras` exposes two main operations to
> > > + * userspace:
> 
> Nit: Three main operations.

oops, my bad, sorry

> 
> > > + *
> > > + * 1. LIST_NODES: Dump all currently registered RAS nodes.
> > > + *    The user receives an array of node IDs, names, and types.
> > > + *
> > > + * 2. GET_ERROR_COUNTERS: Dump all error counters of a given node.
> > > + *    The user receives an array of error IDs, names, and current value.
> > > + *
> > > + * 3. QUERY_ERROR_COUNTER: Query a specific error counter for a given node.
> > > + *    Userspace must provide the node ID and the counter ID, and
> > > + *    receives the ID, the error name, and its current value.
> > > + *
> > > + * Node registration:
> > > + * - drm_ras_node_register(): Registers a new node and assigns
> > > + *   it a unique ID in the xarray.
> > > + * - drm_ras_node_unregister(): Removes a previously registered
> > > + *   node from the xarray.
> > > + *
> > > + * Node type:
> > > + * - ERROR_COUNTER:
> > > + *     + Currently, only error counters are supported.
> > > + *     + The driver must implement the query_error_counter() callback to provide
> > > + *       the name and the value of the error counter.
> > > + *     + The driver must provide an error_counter_range.last value informing the
> > > + *       last valid error ID.
> > > + *     + The driver can provide an error_counter_range.first value informing the
> > > + *       first valid error ID.
> > > + *     + The error counters in the driver don't need to be contiguous, but the
> > > + *       driver must return -ENOENT to the query_error_counter as an indication
> > > + *       that the ID should be skipped and not listed in the netlink API.
> > > + *
> > > + * Netlink handlers:
> > > + * - drm_ras_nl_list_nodes_dumpit(): Implements the LIST_NODES
> > > + *   operation, iterating over the xarray.
> > > + * - drm_ras_nl_get_error_counters_dumpit(): Implements the GET_ERROR_COUNTERS
> > > + *   operation, iterating over the known valid error_counter_range.
> > > + * - drm_ras_nl_query_error_counter_doit(): Implements the QUERY_ERROR_COUNTER
> > > + *   operation, fetching a counter value from a specific node.
> > > + */
> > > +
> > > +static DEFINE_XARRAY_ALLOC(drm_ras_xa);
> > > +
> > > +/*
> > > + * The netlink callback context carries dump state across multiple dumpit calls
> > > + */
> > > +struct drm_ras_ctx {
> > > +	/* Which xarray id to restart the dump from */
> > > +	unsigned long restart;
> > > +};
> > > +
> > > +/**
> > > + * drm_ras_nl_list_nodes_dumpit() - Dump all registered RAS nodes
> > > + * @skb: Netlink message buffer
> > > + * @cb: Callback context for multi-part dumps
> > > + *
> > > + * Iterates over all registered RAS nodes in the global xarray and appends
> > > + * their attributes (ID, name, type) to the given netlink message buffer.
> > > + * Uses @cb->ctx to track progress in case the message buffer fills up, allowing
> > > + * multi-part dump support. On buffer overflow, updates the context to resume
> > > + * from the last node on the next invocation.
> > > + *
> > > + * Return: 0 if all nodes fit in @skb, number of bytes added to @skb if
> > > + *          the buffer filled up (requires multi-part continuation), or
> > > + *          a negative error code on failure.
> > > + */
> > > +int drm_ras_nl_list_nodes_dumpit(struct sk_buff *skb,
> > > +				 struct netlink_callback *cb)
> > > +{
> > > +	const struct genl_info *info = genl_info_dump(cb);
> > > +	struct drm_ras_ctx *ctx = (void *)cb->ctx;
> > > +	struct drm_ras_node *node;
> > > +	struct nlattr *hdr;
> > > +	unsigned long id;
> > > +	int ret;
> > > +
> > > +	xa_for_each_start(&drm_ras_xa, id, node, ctx->restart) {
> > > +		hdr = genlmsg_iput(skb, info);
> > > +		if (!hdr) {
> > > +			ret = -EMSGSIZE;
> > > +			break;
> > > +		}
> > > +
> > > +		ret = nla_put_u32(skb, DRM_RAS_A_NODE_ATTRS_NODE_ID, node->id);
> > > +		if (ret) {
> > > +			genlmsg_cancel(skb, hdr);
> > > +			break;
> > > +		}
> > > +
> > > +		ret = nla_put_string(skb, DRM_RAS_A_NODE_ATTRS_DEVICE_NAME,
> > > +				     node->device_name);
> > > +		if (ret) {
> > > +			genlmsg_cancel(skb, hdr);
> > > +			break;
> > > +		}
> > > +
> > > +		ret = nla_put_string(skb, DRM_RAS_A_NODE_ATTRS_NODE_NAME,
> > > +				     node->node_name);
> > > +		if (ret) {
> > > +			genlmsg_cancel(skb, hdr);
> > > +			break;
> > > +		}
> > > +
> > > +		ret = nla_put_u32(skb, DRM_RAS_A_NODE_ATTRS_NODE_TYPE,
> > > +				  node->type);
> > > +		if (ret) {
> > > +			genlmsg_cancel(skb, hdr);
> > > +			break;
> > > +		}
> > > +
> > > +		genlmsg_end(skb, hdr);
> > > +	}
> > > +
> > > +	if (ret == -EMSGSIZE)
> > > +		ctx->restart = id;
> > 
> > Jakub had mentioned that we don't need this special handling
> > of the -EMSGSIZE, but then I'm not sure what to use in the
> > xa_for_each_start, so
> > 
> > Cc: Jakub Kicinski <kuba@kernel.org>
> > 
> > to ensure that we are in the right path here.
> > 
> > Riana, thank you so much for picking up this and addressing all
> > the comments. Patch looks good to me.
> > 
> > Thanks,
> > Rodrigo.
> > 
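
To keep the restart question concrete, this is roughly the cursor handling being discussed, modelled in plain userspace C (not kernel code, and not a statement about what the netlink core expects). `struct dump_ctx` stands in for cb->ctx and `demo_dump()` for a dumpit callback; both names, and the fixed `DEMO_NODES` count, are made up for the sketch.

```c
#include <errno.h>
#include <stddef.h>

#define DEMO_NODES 5	/* hypothetical number of registered nodes */

/* Stand-in for cb->ctx: where the next dump call should resume. */
struct dump_ctx {
	unsigned long restart;
};

/*
 * Copy node ids into out[] until it is full. Returns 0 when the walk
 * completed, or -EMSGSIZE after saving the id to resume from.
 */
static int demo_dump(struct dump_ctx *ctx, unsigned int *out, size_t cap,
		     size_t *n)
{
	unsigned long id;

	*n = 0;
	for (id = ctx->restart; id < DEMO_NODES; id++) {
		if (*n == cap) {
			/* Buffer full: remember the first id not yet emitted. */
			ctx->restart = id;
			return -EMSGSIZE;
		}
		out[(*n)++] = (unsigned int)id;
	}

	ctx->restart = id;
	return 0;
}
```

Calling demo_dump() repeatedly with the same ctx and a two-entry buffer walks all five ids in three calls, each call picking up at the id where the previous one ran out of room.
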
> > > +
> > > +	return ret;
> > > +}
> > > +
> > > +static int get_node_error_counter(u32 node_id, u32 error_id,
> > > +				  const char **name, u32 *value)
> > > +{
> > > +	struct drm_ras_node *node;
> > > +
> > > +	node = xa_load(&drm_ras_xa, node_id);
> > > +	if (!node || !node->query_error_counter)
> > > +		return -ENOENT;
> > > +
> > > +	if (error_id < node->error_counter_range.first ||
> > > +	    error_id > node->error_counter_range.last)
> > > +		return -EINVAL;
> > > +
> > > +	return node->query_error_counter(node, error_id, name, value);
> > > +}
> 
> Regarding the permission check, node->query_error_counter could be
> implemented to return -EPERM in this case by checking driver specified
> fields in node->priv. Thoughts?

Yeap, please let me know your thoughts on the options above for how drivers
could do the check and return -EPERM here, and let's converge on a flexible
but secure design.
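
One thing to keep in mind for that design: in the get-error-counters dump loop of this patch, only -ENOENT is skipped, so any other error from the callback ends the whole dump. A small userspace model of that behaviour follows (plain C, not kernel code; `demo_walk_counters()` and the two `demo_query_*()` callbacks are hypothetical names, and the callback signature is reduced to what the loop needs).

```c
#include <errno.h>
#include <stdint.h>

/* Hypothetical driver callback type, reduced to what the loop needs. */
typedef int (*demo_query_fn)(uint32_t error_id, uint32_t *val);

/*
 * Walk [first, last] like the get-error-counters dump: -ENOENT skips the
 * id, any other error (e.g. -EPERM) ends the walk. *listed counts the
 * entries that would have been emitted before returning.
 */
static int demo_walk_counters(demo_query_fn query, uint32_t first,
			      uint32_t last, unsigned int *listed)
{
	uint32_t id, val;
	int ret;

	*listed = 0;
	for (id = first; id <= last; id++) {
		ret = query(id, &val);
		if (ret == -ENOENT)
			continue;
		if (ret)
			return ret;
		(*listed)++;
	}
	return 0;
}

/* Sparse range: id 2 does not exist and is silently skipped. */
static int demo_query_sparse(uint32_t error_id, uint32_t *val)
{
	if (error_id == 2)
		return -ENOENT;
	*val = 0;
	return 0;
}

/* Policy refusal: id 3 is protected, which aborts the walk. */
static int demo_query_protected(uint32_t error_id, uint32_t *val)
{
	if (error_id == 3)
		return -EPERM;
	*val = 0;
	return 0;
}
```

So with a per-counter -EPERM, a caller dumping ids 1 to 4 would get the error after two entries instead of a shorter list. Whether that is the desired behaviour is part of what we need to agree on.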

> 
> > > +
> > > +static int msg_reply_value(struct sk_buff *msg, u32 error_id,
> > > +			   const char *error_name, u32 value)
> > > +{
> > > +	int ret;
> > > +
> > > +	ret = nla_put_u32(msg, DRM_RAS_A_ERROR_COUNTER_ATTRS_ERROR_ID, error_id);
> > > +	if (ret)
> > > +		return ret;
> > > +
> > > +	ret = nla_put_string(msg, DRM_RAS_A_ERROR_COUNTER_ATTRS_ERROR_NAME,
> > > +			     error_name);
> > > +	if (ret)
> > > +		return ret;
> > > +
> > > +	return nla_put_u32(msg, DRM_RAS_A_ERROR_COUNTER_ATTRS_ERROR_VALUE,
> > > +			   value);
> > > +}
> > > +
> > > +static int doit_reply_value(struct genl_info *info, u32 node_id,
> > > +			    u32 error_id)
> > > +{
> > > +	struct sk_buff *msg;
> > > +	struct nlattr *hdr;
> > > +	const char *error_name;
> > > +	u32 value;
> > > +	int ret;
> > > +
> > > +	msg = genlmsg_new(NLMSG_GOODSIZE, GFP_KERNEL);
> > > +	if (!msg)
> > > +		return -ENOMEM;
> > > +
> > > +	hdr = genlmsg_iput(msg, info);
> > > +	if (!hdr) {
> > > +		nlmsg_free(msg);
> > > +		return -EMSGSIZE;
> > > +	}
> > > +
> > > +	ret = get_node_error_counter(node_id, error_id,
> > > +				     &error_name, &value);
> > > +	if (ret)
> > > +		return ret;
> > > +
> > > +	ret = msg_reply_value(msg, error_id, error_name, value);
> > > +	if (ret) {
> > > +		genlmsg_cancel(msg, hdr);
> > > +		nlmsg_free(msg);
> > > +		return ret;
> > > +	}
> > > +
> > > +	genlmsg_end(msg, hdr);
> > > +
> > > +	return genlmsg_reply(msg, info);
> > > +}
> > > +
> > > +/**
> > > + * drm_ras_nl_get_error_counters_dumpit() - Dump all Error Counters
> > > + * @skb: Netlink message buffer
> > > + * @cb: Callback context for multi-part dumps
> > > + *
> > > + * Iterates over all error counters in a given Node and appends
> > > + * their attributes (ID, name, value) to the given netlink message buffer.
> > > + * Uses @cb->ctx to track progress in case the message buffer fills up, allowing
> > > + * multi-part dump support. On buffer overflow, updates the context to resume
> > > + * from the last node on the next invocation.
> > > + *
> > > + * Return: 0 if all errors fit in @skb, number of bytes added to @skb if
> > > + *          the buffer filled up (requires multi-part continuation), or
> > > + *          a negative error code on failure.
> > > + */
> > > +int drm_ras_nl_get_error_counters_dumpit(struct sk_buff *skb,
> > > +					 struct netlink_callback *cb)
> > > +{
> > > +	const struct genl_info *info = genl_info_dump(cb);
> > > +	struct drm_ras_ctx *ctx = (void *)cb->ctx;
> > > +	struct drm_ras_node *node;
> > > +	struct nlattr *hdr;
> > > +	const char *error_name;
> > > +	u32 node_id, error_id, value;
> > > +	int ret;
> > > +
> > > +	if (!info->attrs || !info->attrs[DRM_RAS_A_ERROR_COUNTER_ATTRS_NODE_ID])
> > > +		return -EINVAL;
> > > +
> > > +	node_id = nla_get_u32(info->attrs[DRM_RAS_A_ERROR_COUNTER_ATTRS_NODE_ID]);
> > > +
> > > +	node = xa_load(&drm_ras_xa, node_id);
> > > +	if (!node)
> > > +		return -ENOENT;
> > > +
> > > +	for (error_id = max(node->error_counter_range.first, ctx->restart);
> > > +	     error_id <= node->error_counter_range.last;
> > > +	     error_id++) {
> > > +		ret = get_node_error_counter(node_id, error_id,
> > > +					     &error_name, &value);
> > > +		/*
> > > +		 * For non-contiguous range, driver return -ENOENT as indication
> > > +		 * to skip this ID when listing all errors.
> > > +		 */
> > > +		if (ret == -ENOENT)
> > > +			continue;
> > > +		if (ret)
> > > +			return ret;
> > > +
> > > +		hdr = genlmsg_iput(skb, info);
> > > +
> > > +		if (!hdr) {
> > > +			ret = -EMSGSIZE;
> > > +			break;
> > > +		}
> > > +
> > > +		ret = msg_reply_value(skb, error_id, error_name, value);
> > > +		if (ret) {
> > > +			genlmsg_cancel(skb, hdr);
> > > +			break;
> > > +		}
> > > +
> > > +		genlmsg_end(skb, hdr);
> > > +	}
> > > +
> > > +	if (ret == -EMSGSIZE)
> > > +		ctx->restart = error_id;
> > > +
> > > +	return ret;
> > > +}
> > > +
> > > +/**
> > > + * drm_ras_nl_query_error_counter_doit() - Query an error counter of a node
> > > + * @skb: Netlink message buffer
> > > + * @info: Generic Netlink info containing attributes of the request
> > > + *
> > > + * Extracts the node ID and error ID from the netlink attributes and
> > > + * retrieves the current value of the corresponding error counter. Sends the
> > > + * result back to the requesting user via the standard Genl reply.
> > > + *
> > > + * Return: 0 on success, or negative errno on failure.
> > > + */
> > > +int drm_ras_nl_query_error_counter_doit(struct sk_buff *skb,
> > > +					struct genl_info *info)
> > > +{
> > > +	u32 node_id, error_id;
> > > +
> > > +	if (!info->attrs ||
> > > +	    !info->attrs[DRM_RAS_A_ERROR_COUNTER_ATTRS_NODE_ID] ||
> > > +	    !info->attrs[DRM_RAS_A_ERROR_COUNTER_ATTRS_ERROR_ID])
> > > +		return -EINVAL;
> > > +
> > > +	node_id = nla_get_u32(info->attrs[DRM_RAS_A_ERROR_COUNTER_ATTRS_NODE_ID]);
> > > +	error_id = nla_get_u32(info->attrs[DRM_RAS_A_ERROR_COUNTER_ATTRS_ERROR_ID]);
> > > +
> > > +	return doit_reply_value(info, node_id, error_id);
> > > +}
> > > +
> > > +/**
> > > + * drm_ras_node_register() - Register a new RAS node
> > > + * @node: Node structure to register
> > > + *
> > > + * Adds the given RAS node to the global node xarray and assigns it
> > > + * a unique ID. @node->device_name, @node->node_name and @node->type must be valid.
> > > + *
> > > + * Return: 0 on success, or negative errno on failure.
> > > + */
> > > +int drm_ras_node_register(struct drm_ras_node *node)
> > > +{
> > > +	if (!node->device_name || !node->node_name)
> > > +		return -EINVAL;
> > > +
> > > +	/* Currently, only Error Counter Endpoints are supported */
> > > +	if (node->type != DRM_RAS_NODE_TYPE_ERROR_COUNTER)
> > > +		return -EINVAL;
> > > +
> > > +	/* Mandatory entries for Error Counter Node */
> > > +	if (node->type == DRM_RAS_NODE_TYPE_ERROR_COUNTER &&
> > > +	    (!node->error_counter_range.last || !node->query_error_counter))
> > > +		return -EINVAL;
> > > +
> > > +	return xa_alloc(&drm_ras_xa, &node->id, node, xa_limit_32b, GFP_KERNEL);
> > > +}
> > > +EXPORT_SYMBOL(drm_ras_node_register);
> > > +
> > > +/**
> > > + * drm_ras_node_unregister() - Unregister a previously registered node
> > > + * @node: Node structure to unregister
> > > + *
> > > + * Removes the given node from the global node xarray using its ID.
> > > + */
> > > +void drm_ras_node_unregister(struct drm_ras_node *node)
> > > +{
> > > +	xa_erase(&drm_ras_xa, node->id);
> > > +}
> > > +EXPORT_SYMBOL(drm_ras_node_unregister);
> > > diff --git a/drivers/gpu/drm/drm_ras_genl_family.c b/drivers/gpu/drm/drm_ras_genl_family.c
> > > new file mode 100644
> > > index 000000000000..2d818b8c3808
> > > --- /dev/null
> > > +++ b/drivers/gpu/drm/drm_ras_genl_family.c
> > > @@ -0,0 +1,42 @@
> > > +// SPDX-License-Identifier: MIT
> > > +/*
> > > + * Copyright © 2025 Intel Corporation
> > > + */
> > > +
> > > +#include <drm/drm_ras_genl_family.h>
> > > +#include <drm/drm_ras_nl.h>
> > > +
> > > +/* Track family registration so the drm_exit can be called at any time */
> > > +static bool registered;
> > > +
> > > +/**
> > > + * drm_ras_genl_family_register() - Register drm-ras genl family
> > > + *
> > > + * Only to be called once at drm_drv_init()
> > > + */
> > > +int drm_ras_genl_family_register(void)
> > > +{
> > > +	int ret;
> > > +
> > > +	registered = false;
> > > +
> > > +	ret = genl_register_family(&drm_ras_nl_family);
> > > +	if (ret)
> > > +		return ret;
> > > +
> > > +	registered = true;
> > > +	return 0;
> > > +}
> > > +
> > > +/**
> > > + * drm_ras_genl_family_unregister() - Unregister drm-ras genl family
> > > + *
> > > + * To be called at drm_drv_exit() at any moment, but only once.
> > > + */
> > > +void drm_ras_genl_family_unregister(void)
> > > +{
> > > +	if (registered) {
> > > +		genl_unregister_family(&drm_ras_nl_family);
> > > +		registered = false;
> > > +	}
> > > +}
> > > diff --git a/drivers/gpu/drm/drm_ras_nl.c b/drivers/gpu/drm/drm_ras_nl.c
> > > new file mode 100644
> > > index 000000000000..fcd1392410e4
> > > --- /dev/null
> > > +++ b/drivers/gpu/drm/drm_ras_nl.c
> > > @@ -0,0 +1,54 @@
> > > +// SPDX-License-Identifier: ((GPL-2.0 WITH Linux-syscall-note) OR BSD-3-Clause)
> > > +/* Do not edit directly, auto-generated from: */
> > > +/*	Documentation/netlink/specs/drm_ras.yaml */
> > > +/* YNL-GEN kernel source */
> > > +
> > > +#include <net/netlink.h>
> > > +#include <net/genetlink.h>
> > > +
> > > +#include <uapi/drm/drm_ras.h>
> > > +#include <drm/drm_ras_nl.h>
> > > +
> > > +/* DRM_RAS_CMD_GET_ERROR_COUNTERS - dump */
> > > +static const struct nla_policy drm_ras_get_error_counters_nl_policy[DRM_RAS_A_ERROR_COUNTER_ATTRS_NODE_ID + 1] = {
> > > +	[DRM_RAS_A_ERROR_COUNTER_ATTRS_NODE_ID] = { .type = NLA_U32, },
> > > +};
> > > +
> > > +/* DRM_RAS_CMD_QUERY_ERROR_COUNTER - do */
> > > +static const struct nla_policy drm_ras_query_error_counter_nl_policy[DRM_RAS_A_ERROR_COUNTER_ATTRS_ERROR_ID + 1] = {
> > > +	[DRM_RAS_A_ERROR_COUNTER_ATTRS_NODE_ID] = { .type = NLA_U32, },
> > > +	[DRM_RAS_A_ERROR_COUNTER_ATTRS_ERROR_ID] = { .type = NLA_U32, },
> > > +};
> > > +
> > > +/* Ops table for drm_ras */
> > > +static const struct genl_split_ops drm_ras_nl_ops[] = {
> > > +	{
> > > +		.cmd	= DRM_RAS_CMD_LIST_NODES,
> > > +		.dumpit	= drm_ras_nl_list_nodes_dumpit,
> > > +		.flags	= GENL_ADMIN_PERM | GENL_CMD_CAP_DUMP,
> > > +	},
> > > +	{
> > > +		.cmd		= DRM_RAS_CMD_GET_ERROR_COUNTERS,
> > > +		.dumpit		= drm_ras_nl_get_error_counters_dumpit,
> > > +		.policy		= drm_ras_get_error_counters_nl_policy,
> > > +		.maxattr	= DRM_RAS_A_ERROR_COUNTER_ATTRS_NODE_ID,
> > > +		.flags		= GENL_ADMIN_PERM | GENL_CMD_CAP_DUMP,
> > > +	},
> > > +	{
> > > +		.cmd		= DRM_RAS_CMD_QUERY_ERROR_COUNTER,
> > > +		.doit		= drm_ras_nl_query_error_counter_doit,
> > > +		.policy		= drm_ras_query_error_counter_nl_policy,
> > > +		.maxattr	= DRM_RAS_A_ERROR_COUNTER_ATTRS_ERROR_ID,
> > > +		.flags		= GENL_ADMIN_PERM | GENL_CMD_CAP_DO,
> > > +	},
> > > +};
> > > +
> > > +struct genl_family drm_ras_nl_family __ro_after_init = {
> > > +	.name		= DRM_RAS_FAMILY_NAME,
> > > +	.version	= DRM_RAS_FAMILY_VERSION,
> > > +	.netnsok	= true,
> > > +	.parallel_ops	= true,
> > > +	.module		= THIS_MODULE,
> > > +	.split_ops	= drm_ras_nl_ops,
> > > +	.n_split_ops	= ARRAY_SIZE(drm_ras_nl_ops),
> > > +};
> > > diff --git a/include/drm/drm_ras.h b/include/drm/drm_ras.h
> > > new file mode 100644
> > > index 000000000000..bba47a282ef8
> > > --- /dev/null
> > > +++ b/include/drm/drm_ras.h
> > > @@ -0,0 +1,76 @@
> > > +/* SPDX-License-Identifier: MIT */
> > > +/*
> > > + * Copyright © 2025 Intel Corporation
> > > + */
> > > +
> > > +#ifndef __DRM_RAS_H__
> > > +#define __DRM_RAS_H__
> > > +
> > > +#include "drm_ras_nl.h"
> > > +
> > > +/**
> > > + * struct drm_ras_node - A DRM RAS Node
> > > + */
> > > +struct drm_ras_node {
> > > +	/** @id: Unique identifier for the node. Dynamically assigned. */
> > > +	u32 id;
> > > +	/**
> > > +	 * @device_name: Human-readable name of the device. Given by the driver.
> > > +	 */
> > > +	const char *device_name;
> > > +	/** @node_name: Human-readable name of the node. Given by the driver. */
> > > +	const char *node_name;
> > > +	/** @type: Type of the node (enum drm_ras_node_type). */
> > > +	enum drm_ras_node_type type;
> > > +
> > > +	/* Error-Counter Related Callback and Variables */
> > > +
> > > +	/** @error_counter_range: Range of valid Error IDs for this node. */
> > > +	struct {
> > > +		/** @first: First valid Error ID. */
> > > +		u32 first;
> > > +		/** @last: Last valid Error ID. Mandatory entry. */
> > > +		u32 last;
> > > +	} error_counter_range;
> > > +
> > > +	/**
> > > +	 * @query_error_counter:
> > > +	 *
> > > +	 * This callback is used by drm-ras to query a specific error counter.
> > > +	 * counters supported by this node. Used for input check and to
> > > +	 * iterate in all counters.
> > > +	 *
> > > +	 * Driver should expect query_error_counters() to be called with
> > > +	 * error_id from `error_counter_range.first` to
> > > +	 * `error_counter_range.last`.
> > > +	 *
> > > +	 * The @query_error_counter is a mandatory callback for
> > > +	 * error_counter_node.
> > > +	 *
> > > +	 * Returns: 0 on success,
> > > +	 *          -ENOENT when error_id is not supported as an indication that
> > > +	 *                  drm_ras should silently skip this entry. Used for
> > > +	 *                  supporting non-contiguous error ranges.
> > > +	 *                  Driver is responsible for maintaining the list of
> > > +	 *                  supported error IDs in the range of first to last.
> > > +	 *          Other negative values on errors that should terminate the
> > > +	 *          netlink query.
> > > +	 */
> > > +	int (*query_error_counter)(struct drm_ras_node *ep, u32 error_id,
> > > +				   const char **name, u32 *val);
> > > +
> > > +	/** @priv: Driver private data */
> > > +	void *priv;
> > > +};
> > > +
> 
> If new node types are frequently added, this struct may contain many
> unused fields. It seems like the necessary members for any given node
> type are: id, device_name, node_name, type, and priv. However, since
> this functionality is designed specifically for RAS, I think it's ok.

Yeap, that was the thought.

Thank you so much for the review and thoughts here,
Rodrigo.

> 
> > > +struct drm_device;
> > > +
> > > +#if IS_ENABLED(CONFIG_DRM_RAS)
> > > +int drm_ras_node_register(struct drm_ras_node *ep);
> > > +void drm_ras_node_unregister(struct drm_ras_node *ep);
> > > +#else
> > > +static inline int drm_ras_node_register(struct drm_ras_node *ep) { return 0; }
> > > +static inline void drm_ras_node_unregister(struct drm_ras_node *ep) { }
> > > +#endif
> > > +
> > > +#endif
> > > diff --git a/include/drm/drm_ras_genl_family.h b/include/drm/drm_ras_genl_family.h
> > > new file mode 100644
> > > index 000000000000..5931b53429f1
> > > --- /dev/null
> > > +++ b/include/drm/drm_ras_genl_family.h
> > > @@ -0,0 +1,17 @@
> > > +/* SPDX-License-Identifier: MIT */
> > > +/*
> > > + * Copyright © 2025 Intel Corporation
> > > + */
> > > +
> > > +#ifndef __DRM_RAS_GENL_FAMILY_H__
> > > +#define __DRM_RAS_GENL_FAMILY_H__
> > > +
> > > +#if IS_ENABLED(CONFIG_DRM_RAS)
> > > +int drm_ras_genl_family_register(void);
> > > +void drm_ras_genl_family_unregister(void);
> > > +#else
> > > +static inline int drm_ras_genl_family_register(void) { return 0; }
> > > +static inline void drm_ras_genl_family_unregister(void) { }
> > > +#endif
> > > +
> > > +#endif
> > > diff --git a/include/drm/drm_ras_nl.h b/include/drm/drm_ras_nl.h
> > > new file mode 100644
> > > index 000000000000..9613b7d9ffdb
> > > --- /dev/null
> > > +++ b/include/drm/drm_ras_nl.h
> > > @@ -0,0 +1,24 @@
> > > +/* SPDX-License-Identifier: ((GPL-2.0 WITH Linux-syscall-note) OR BSD-3-Clause) */
> > > +/* Do not edit directly, auto-generated from: */
> > > +/*	Documentation/netlink/specs/drm_ras.yaml */
> > > +/* YNL-GEN kernel header */
> > > +
> > > +#ifndef _LINUX_DRM_RAS_GEN_H
> > > +#define _LINUX_DRM_RAS_GEN_H
> > > +
> > > +#include <net/netlink.h>
> > > +#include <net/genetlink.h>
> > > +
> > > +#include <uapi/drm/drm_ras.h>
> > > +#include <drm/drm_ras_nl.h>
> > > +
> > > +int drm_ras_nl_list_nodes_dumpit(struct sk_buff *skb,
> > > +				 struct netlink_callback *cb);
> > > +int drm_ras_nl_get_error_counters_dumpit(struct sk_buff *skb,
> > > +					 struct netlink_callback *cb);
> > > +int drm_ras_nl_query_error_counter_doit(struct sk_buff *skb,
> > > +					struct genl_info *info);
> > > +
> > > +extern struct genl_family drm_ras_nl_family;
> > > +
> > > +#endif /* _LINUX_DRM_RAS_GEN_H */
> > > diff --git a/include/uapi/drm/drm_ras.h b/include/uapi/drm/drm_ras.h
> > > new file mode 100644
> > > index 000000000000..3415ba345ac8
> > > --- /dev/null
> > > +++ b/include/uapi/drm/drm_ras.h
> > > @@ -0,0 +1,49 @@
> > > +/* SPDX-License-Identifier: ((GPL-2.0 WITH Linux-syscall-note) OR BSD-3-Clause) */
> > > +/* Do not edit directly, auto-generated from: */
> > > +/*	Documentation/netlink/specs/drm_ras.yaml */
> > > +/* YNL-GEN uapi header */
> > > +
> > > +#ifndef _UAPI_LINUX_DRM_RAS_H
> > > +#define _UAPI_LINUX_DRM_RAS_H
> > > +
> > > +#define DRM_RAS_FAMILY_NAME	"drm-ras"
> > > +#define DRM_RAS_FAMILY_VERSION	1
> > > +
> > > +/*
> > > + * Type of the node. Currently, only error-counter nodes are supported, which
> > > + * expose reliability counters for a hardware/software component.
> > > + */
> > > +enum drm_ras_node_type {
> > > +	DRM_RAS_NODE_TYPE_ERROR_COUNTER = 1,
> > > +};
> > > +
> > > +enum {
> > > +	DRM_RAS_A_NODE_ATTRS_NODE_ID = 1,
> > > +	DRM_RAS_A_NODE_ATTRS_DEVICE_NAME,
> > > +	DRM_RAS_A_NODE_ATTRS_NODE_NAME,
> > > +	DRM_RAS_A_NODE_ATTRS_NODE_TYPE,
> > > +
> > > +	__DRM_RAS_A_NODE_ATTRS_MAX,
> > > +	DRM_RAS_A_NODE_ATTRS_MAX = (__DRM_RAS_A_NODE_ATTRS_MAX - 1)
> > > +};
> > > +
> > > +enum {
> > > +	DRM_RAS_A_ERROR_COUNTER_ATTRS_NODE_ID = 1,
> > > +	DRM_RAS_A_ERROR_COUNTER_ATTRS_ERROR_ID,
> > > +	DRM_RAS_A_ERROR_COUNTER_ATTRS_ERROR_NAME,
> > > +	DRM_RAS_A_ERROR_COUNTER_ATTRS_ERROR_VALUE,
> > > +
> > > +	__DRM_RAS_A_ERROR_COUNTER_ATTRS_MAX,
> > > +	DRM_RAS_A_ERROR_COUNTER_ATTRS_MAX = (__DRM_RAS_A_ERROR_COUNTER_ATTRS_MAX - 1)
> > > +};
> > > +
> > > +enum {
> > > +	DRM_RAS_CMD_LIST_NODES = 1,
> > > +	DRM_RAS_CMD_GET_ERROR_COUNTERS,
> > > +	DRM_RAS_CMD_QUERY_ERROR_COUNTER,
> > > +
> > > +	__DRM_RAS_CMD_MAX,
> > > +	DRM_RAS_CMD_MAX = (__DRM_RAS_CMD_MAX - 1)
> > > +};
> > > +
> > > +#endif /* _UAPI_LINUX_DRM_RAS_H */
> > > -- 
> > > 2.47.1
> > > 
> 
> Thanks,
> 
> Zack
> 

* Re: [PATCH v3 1/4] drm/ras: Introduce the DRM RAS infrastructure over generic netlink
  2026-01-09 20:57       ` Rodrigo Vivi
@ 2026-01-13  8:20         ` Riana Tauro
  2026-01-15 23:39           ` Zack McKevitt
  0 siblings, 1 reply; 31+ messages in thread
From: Riana Tauro @ 2026-01-13  8:20 UTC (permalink / raw)
  To: Rodrigo Vivi, Zack McKevitt
  Cc: Jakub Kicinski, intel-xe, dri-devel, aravind.iddamsetty,
	anshuman.gupta, joonas.lahtinen, lukas, simona.vetter, airlied,
	pratik.bari, joshua.santosh.ranjan, ashwin.kumar.kulkarni,
	shubham.kumar, Lijo Lazar, Hawking Zhang, David S. Miller,
	Paolo Abeni, Eric Dumazet, netdev

Hi Rodrigo/Zack

On 1/10/2026 2:27 AM, Rodrigo Vivi wrote:
> On Thu, Jan 08, 2026 at 03:36:45PM -0700, Zack McKevitt wrote:
>>
>>
>> On 12/9/2025 2:35 PM, Rodrigo Vivi wrote:
>>
>> Apologies for the delay getting back to this. We are still supportive of
>> this functionality making it into the DRM subsystem but have a couple of
>> questions.
>>
>>> On Fri, Dec 05, 2025 at 02:09:33PM +0530, Riana Tauro wrote:
>>>> From: Rodrigo Vivi <rodrigo.vivi@intel.com>
>>>>
>>>> Introduces the DRM RAS infrastructure over generic netlink.
>>>>
>>>> The new interface allows drivers to expose RAS nodes and their
>>>> associated error counters to userspace in a structured and extensible
>>>> way. Each drm_ras node can register its own set of error counters, which
>>>> are then discoverable and queryable through netlink operations. This
>>>> lays the groundwork for reporting and managing hardware error states
>>>> in a unified manner across different DRM drivers.
>>>>
>>>> Currently it only supports error-counter nodes. But it can be
>>>> extended later.
>>>>
>>>> The registration is also not tied to any drm node, so it can be
>>>> used by accel devices as well.
>>
>> Thank you for including the userspace reference implementation. I have
>> begun prototyping an extension for our qaic accel driver to incorporate
>> telemetry functionality by adding a new node type to drm_ras. Overall,
>> extending the interface is intuitive.
> 
> making it extensible was one of the main goals here...
> 
>>
>>>>
>>>> It uses the new and mandatory YAML description format stored in
>>>> Documentation/netlink/specs/. This forces a single generic netlink
>>>> family namespace for the entire drm: "drm-ras".
>>>> But multiple-endpoints are supported within the single family.
>>>>
>>>> Any modification to this API needs to be applied to
>>>> Documentation/netlink/specs/drm_ras.yaml before regenerating the
>>>> code:
>>>>
>>>> $ tools/net/ynl/pyynl/ynl_gen_c.py --spec \
>>>>    Documentation/netlink/specs/drm_ras.yaml --mode uapi --header \
>>>>    > include/uapi/drm/drm_ras.h
>>>>
>>>> $ tools/net/ynl/pyynl/ynl_gen_c.py --spec \
>>>>    Documentation/netlink/specs/drm_ras.yaml --mode kernel --header \
>>>>    > include/drm/drm_ras_nl.h
>>>>
>>>> $ tools/net/ynl/pyynl/ynl_gen_c.py --spec \
>>>>    Documentation/netlink/specs/drm_ras.yaml --mode kernel --source \
>>>>    > drivers/gpu/drm/drm_ras_nl.c
>>>>
>>>> Cc: Zack McKevitt <zachary.mckevitt@oss.qualcomm.com>
>>>> Cc: Lukas Wunner <lukas@wunner.de>
>>>> Cc: Lijo Lazar <lijo.lazar@amd.com>
>>>> Cc: Hawking Zhang <Hawking.Zhang@amd.com>
>>>> Cc: Jakub Kicinski <kuba@kernel.org>
>>>> Cc: David S. Miller <davem@davemloft.net>
>>>> Cc: Paolo Abeni <pabeni@redhat.com>
>>>> Cc: Eric Dumazet <edumazet@google.com>
>>>> Cc: netdev@vger.kernel.org
>>>> Co-developed-by: Aravind Iddamsetty <aravind.iddamsetty@linux.intel.com>
>>>> Signed-off-by: Aravind Iddamsetty <aravind.iddamsetty@linux.intel.com>
>>>> Signed-off-by: Rodrigo Vivi <rodrigo.vivi@intel.com>
>>>> Signed-off-by: Riana Tauro <riana.tauro@intel.com>
>>>> ---
>>>> v2: fix doc and memory leak
>>>>       use xe_for_each_start
>>>>       use standard genlmsg_iput (Jakub Kicinski)
>>>>
>>>> v3: add documentation to index
>>>>       modify documentation to mention uAPI requirements (Rodrigo)
>>>> ---
>>>>    Documentation/gpu/drm-ras.rst            | 109 +++++++
>>>>    Documentation/gpu/index.rst              |   1 +
>>>>    Documentation/netlink/specs/drm_ras.yaml | 130 +++++++++
>>>>    drivers/gpu/drm/Kconfig                  |   9 +
>>>>    drivers/gpu/drm/Makefile                 |   1 +
>>>>    drivers/gpu/drm/drm_drv.c                |   6 +
>>>>    drivers/gpu/drm/drm_ras.c                | 351 +++++++++++++++++++++++
>>>>    drivers/gpu/drm/drm_ras_genl_family.c    |  42 +++
>>>>    drivers/gpu/drm/drm_ras_nl.c             |  54 ++++
>>>>    include/drm/drm_ras.h                    |  76 +++++
>>>>    include/drm/drm_ras_genl_family.h        |  17 ++
>>>>    include/drm/drm_ras_nl.h                 |  24 ++
>>>>    include/uapi/drm/drm_ras.h               |  49 ++++
>>>>    13 files changed, 869 insertions(+)
>>>>    create mode 100644 Documentation/gpu/drm-ras.rst
>>>>    create mode 100644 Documentation/netlink/specs/drm_ras.yaml
>>>>    create mode 100644 drivers/gpu/drm/drm_ras.c
>>>>    create mode 100644 drivers/gpu/drm/drm_ras_genl_family.c
>>>>    create mode 100644 drivers/gpu/drm/drm_ras_nl.c
>>>>    create mode 100644 include/drm/drm_ras.h
>>>>    create mode 100644 include/drm/drm_ras_genl_family.h
>>>>    create mode 100644 include/drm/drm_ras_nl.h
>>>>    create mode 100644 include/uapi/drm/drm_ras.h
>>>>
>>>> diff --git a/Documentation/gpu/drm-ras.rst b/Documentation/gpu/drm-ras.rst
>>>> new file mode 100644
>>>> index 000000000000..cec60cf5d17d
>>>> --- /dev/null
>>>> +++ b/Documentation/gpu/drm-ras.rst
>>>> @@ -0,0 +1,109 @@
>>>> +.. SPDX-License-Identifier: GPL-2.0+
>>>> +
>>>> +============================
>>>> +DRM RAS over Generic Netlink
>>>> +============================
>>>> +
>>>> +The DRM RAS (Reliability, Availability, Serviceability) interface provides a
>>>> +standardized way for GPU/accelerator drivers to expose error counters and
>>>> +other reliability nodes to user space via Generic Netlink. This allows
>>>> +diagnostic tools, monitoring daemons, or test infrastructure to query hardware
>>>> +health in a uniform way across different DRM drivers.
>>>> +
>>>> +Key Goals:
>>>> +
>>>> +* Provide a standardized RAS solution for GPU and accelerator drivers, enabling
>>>> +  data center monitoring and reliability operations.
>>>> +* Implement a single drm-ras Generic Netlink family to meet modern Netlink YAML
>>>> +  specifications and centralize all RAS-related communication in one namespace.
>>>> +* Support a basic error counter interface, addressing the immediate, essential
>>>> +  monitoring needs.
>>>> +* Offer a flexible, future-proof interface that can be extended to support
>>>> +  additional types of RAS data in the future.
>>>> +* Allow multiple nodes per driver, enabling drivers to register separate
>>>> +  nodes for different IP blocks, sub-blocks, or other logical subdivisions
>>>> +  as applicable.
>>>> +
>>>> +Nodes
>>>> +=====
>>>> +
>>>> +Nodes are logical abstractions representing an error source or block within
>>>> +the device. Currently, only error counter nodes are supported.
>>>> +
>>>> +Drivers are responsible for registering and unregistering nodes via the
>>>> +`drm_ras_node_register()` and `drm_ras_node_unregister()` APIs.
>>>> +
>>>> +Node Management
>>>> +-------------------
>>>> +
>>>> +.. kernel-doc:: drivers/gpu/drm/drm_ras.c
>>>> +   :doc: DRM RAS Node Management
>>>> +.. kernel-doc:: drivers/gpu/drm/drm_ras.c
>>>> +   :internal:
>>>> +
>>>> +Generic Netlink Usage
>>>> +=====================
>>>> +
>>>> +The interface is implemented as a Generic Netlink family named ``drm-ras``.
>>>> +User space tools can:
>>>> +
>>>> +* List registered nodes with the ``list-nodes`` command.
>>>> +* List all error counters in a node with the ``get-error-counters`` command.
>>>> +* Query error counters using the ``query-error-counter`` command.
>>>> +
>>>> +YAML-based Interface
>>>> +--------------------
>>>> +
>>>> +The interface is described in a YAML specification:
>>>> +
>>>> +:ref:`Documentation/netlink/specs/drm_ras.yaml`
>>>> +
>>>> +This YAML is used to auto-generate user space bindings via
>>>> +``tools/net/ynl/pyynl/ynl_gen_c.py``, and drives the structure of netlink
>>>> +attributes and operations.
>>>> +
>>>> +Usage Notes
>>>> +-----------
>>>> +
>>>> +* User space must first enumerate nodes to obtain their IDs.
>>>> +* Node IDs or Node names can be used for all further queries, such as error counters.
>>>> +* Error counters can be queried by either the Error ID or Error name.
>>>> +* Query Parameters should be defined as part of the uAPI to ensure user interface stability.
>>>> +* The interface supports future extension by adding new node types and
>>>> +  additional attributes.
>>>> +
>>>> +Example: List nodes using ynl
>>>> +
>>>> +.. code-block:: bash
>>>> +
>>>> +    sudo ynl --family drm_ras  --dump list-nodes
>>>> +    [{'device-name': '0000:03:00.0',
>>>> +    'node-id': 0,
>>>> +    'node-name': 'correctable-errors',
>>>> +    'node-type': 'error-counter'},
>>>> +    {'device-name': '0000:03:00.0',
>>>> +     'node-id': 1,
>>>> +    'node-name': 'nonfatal-errors',
>>>> +    'node-type': 'error-counter'},
>>>> +    {'device-name': '0000:03:00.0',
>>>> +    'node-id': 2,
>>>> +    'node-name': 'fatal-errors',
>>>> +    'node-type': 'error-counter'}]
>>>> +
>>>> +Example: List all error counters using ynl
>>>> +
>>>> +.. code-block:: bash
>>>> +
>>>> +
>>>> +   sudo ynl --family drm_ras  --dump get-error-counters --json '{"node-id":1}'
>>>> +   [{'error-id': 1, 'error-name': 'error_name_1', 'error-value': 0},
>>>> +   {'error-id': 2, 'error-name': 'error_name_2', 'error-value': 0}]
>>>> +
>>>> +
>>>> +Example: Query an error counter for a given node
>>>> +
>>>> +.. code-block:: bash
>>>> +
>>>> +   sudo ynl --family drm_ras --do query-error-counter  --json '{"node-id":2, "error-id":1}'
>>>> +   {'error-id': 1, 'error-name': 'error_name_1', 'error-value': 0}
>>>> +
>>>> diff --git a/Documentation/gpu/index.rst b/Documentation/gpu/index.rst
>>>> index 7dcb15850afd..60c73fdcfeed 100644
>>>> --- a/Documentation/gpu/index.rst
>>>> +++ b/Documentation/gpu/index.rst
>>>> @@ -9,6 +9,7 @@ GPU Driver Developer's Guide
>>>>       drm-mm
>>>>       drm-kms
>>>>       drm-kms-helpers
>>>> +   drm-ras
>>>>       drm-uapi
>>>>       drm-usage-stats
>>>>       driver-uapi
>>>> diff --git a/Documentation/netlink/specs/drm_ras.yaml b/Documentation/netlink/specs/drm_ras.yaml
>>>> new file mode 100644
>>>> index 000000000000..be0e379c5bc9
>>>> --- /dev/null
>>>> +++ b/Documentation/netlink/specs/drm_ras.yaml
>>>> @@ -0,0 +1,130 @@
>>>> +# SPDX-License-Identifier: ((GPL-2.0 WITH Linux-syscall-note) OR BSD-3-Clause)
>>>> +---
>>>> +name: drm-ras
>>>> +protocol: genetlink
>>>> +uapi-header: drm/drm_ras.h
>>>> +
>>>> +doc: >-
>>>> +  DRM RAS (Reliability, Availability, Serviceability) over Generic Netlink.
>>>> +  Provides a standardized mechanism for DRM drivers to register "nodes"
>>>> +  representing hardware/software components capable of reporting error counters.
>>>> +  Userspace tools can query the list of nodes or individual error counters
>>>> +  via the Generic Netlink interface.
>>>> +
>>>> +definitions:
>>>> +  -
>>>> +    type: enum
>>>> +    name: node-type
>>>> +    value-start: 1
>>>> +    entries: [error-counter]
>>>> +    doc: >-
>>>> +         Type of the node. Currently, only error-counter nodes are
>>>> +         supported, which expose reliability counters for a hardware/software
>>>> +         component.
>>>> +
>>>> +attribute-sets:
>>>> +  -
>>>> +    name: node-attrs
>>>> +    attributes:
>>>> +      -
>>>> +        name: node-id
>>>> +        type: u32
>>>> +        doc: >-
>>>> +             Unique identifier for the node.
>>>> +             Assigned dynamically by the DRM RAS core upon registration.
>>>> +      -
>>>> +        name: device-name
>>>> +        type: string
>>>> +        doc: >-
>>>> +             Device name chosen by the driver at registration.
>>>> +             Can be a PCI BDF, UUID, or module name if unique.
>>>> +      -
>>>> +        name: node-name
>>>> +        type: string
>>>> +        doc: >-
>>>> +             Node name chosen by the driver at registration.
>>>> +             Can be an IP block name, or any name that identifies the
>>>> +             RAS node inside the device.
>>>> +      -
>>>> +        name: node-type
>>>> +        type: u32
>>>> +        doc: Type of this node, identifying its function.
>>>> +        enum: node-type
>>>> +  -
>>>> +    name: error-counter-attrs
>>>> +    attributes:
>>>> +      -
>>>> +        name: node-id
>>>> +        type: u32
>>>> +        doc:  Node ID targeted by this error counter operation.
>>>> +      -
>>>> +        name: error-id
>>>> +        type: u32
>>>> +        doc: Unique identifier for a specific error counter within a node.
>>>> +      -
>>>> +        name: error-name
>>>> +        type: string
>>>> +        doc: Name of the error.
>>>> +      -
>>>> +        name: error-value
>>>> +        type: u32
>>>> +        doc: Current value of the requested error counter.
>>>> +
>>>> +operations:
>>>> +  list:
>>>> +    -
>>>> +      name: list-nodes
>>>> +      doc: >-
>>>> +           Retrieve the full list of currently registered DRM RAS nodes.
>>>> +           Each node includes its dynamically assigned ID, name, and type.
>>>> +           **Important:** User space must call this operation first to obtain
>>>> +           the node IDs. These IDs are required for all subsequent
>>>> +           operations on nodes, such as querying error counters.
>>
>> I am curious about security implications of this design.
> 
> hmm... very good point you are raising here.
> 
> This current design relies entirely on CAP_NET_ADMIN.
> No driver would have the flexibility to choose anything differently.
> Please notice that the flag admin-perm is hardcoded in this yaml file.
> 
>> If the complete
>> list of RAS nodes is visible for any process on the system (and one wants to
>> avoid requiring CAP_NET_ADMIN), there should be some way to enforce
>> permission checks when performing these operations if desired.
> 
> Right now, there's no way for the driver to avoid requiring
> CAP_NET_ADMIN...
> 
> The only way would be for the admin to give cap_net_admin to the tool with:
> 
> $ sudo setcap cap_net_admin+ep /bin/drm_ras_tool
> 
> but not ideal and not granular anyway...
> 
>>
>> For example, this might be implemented in the driver's definition of
>> callback functions like query_error_counter; some drivers may want to ensure
>> that the process can in fact open the file descriptor corresponding to the
>> queried device before serving a netlink request. Is it enough for a driver
>> to simply return -EPERM in this case? Any driver that doesn't wish to protect
>> its RAS nodes need not implement checks in their callbacks.
> 
> Fair enough. If we want to give the option to the drivers, then we need:
> 
> 1. to first remove all the admin-perm flags below and leave the driver to
> pick up their policy on when to return something or -EPERM.
> 2. Document this security responsibility and list a few possibilities.
> 3. In our Xe case here I believe the easiest option is to use something like:
> 
> struct scm_creds *creds = NETLINK_CREDS(cb->skb);
> if (!gid_eq(creds->gid, GLOBAL_ROOT_GID))
>      return -EPERM;

The driver currently does not have access to the netlink callback or the
sk_buff. Passing these to the driver as parameters wouldn't be right, since
drm_ras needs to handle all the netlink buffers.

How about using the pre_doit and start calls? If the driver has a pre callback,
it's the driver's responsibility to check permissions or any other
preconditions; otherwise the CAP_NET_ADMIN permission will be checked.

@Zack / @Rodrigo thoughts?
@Zack, will this work for your use case?

yaml
+	dump:
+        pre: drm-ras-nl-pre-list-nodes


drm_ras.c :

+       if (node->pre_list_nodes)
+                return node->pre_list_nodes(node);
+
+        return check_permissions(cb->skb);  //Checks creds
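
To make the fallback rule concrete, here is a minimal userspace sketch (not
kernel code) of the dispatch I have in mind. Everything in it is an assumption
for illustration: struct caller stands in for whatever credentials the core
would derive from the request, the callback signature is hypothetical, and
driver_pre_list_nodes() is a made-up driver policy.

```c
#include <assert.h>
#include <errno.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical stand-in for the requester's credentials (not a kernel type). */
struct caller {
	bool has_net_admin;	/* models holding CAP_NET_ADMIN */
	bool can_open_device;	/* models a driver-specific access check */
};

struct ras_node;

/* Hypothetical optional pre callback; NULL means "use the core default". */
typedef int (*ras_pre_cb)(const struct ras_node *node,
			  const struct caller *who);

struct ras_node {
	const char *name;
	ras_pre_cb pre_list_nodes;
};

/* Core default policy: only callers with the admin capability pass. */
static int check_permissions(const struct caller *who)
{
	return who->has_net_admin ? 0 : -EPERM;
}

/*
 * Core dispatch: a driver-provided pre callback fully replaces the default
 * check; without one, the core falls back to the capability check.
 */
static int ras_pre_list_nodes(const struct ras_node *node,
			      const struct caller *who)
{
	assert(node != NULL && who != NULL);

	if (node->pre_list_nodes)
		return node->pre_list_nodes(node, who);

	return check_permissions(who);
}

/* Example driver policy: allow anyone who could open the device. */
static int driver_pre_list_nodes(const struct ras_node *node,
				 const struct caller *who)
{
	(void)node;
	return who->can_open_device ? 0 : -EPERM;
}
```

Note that in this sketch a driver callback overrides the default entirely, so
a driver that installs one takes over the whole security decision for its
node, including for callers that do hold CAP_NET_ADMIN.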

Thanks
Riana

> 
> or something like that?!
> 
> perhaps drivers could implement some form of cookie or pre-authorization with
> ioctls or sysfs, and then store in the priv?
> 
> Thoughts?
> Other options?
> 
>>
>> I don't see any such permissions checks in your driver implementation which
>> is understandable given that it may not be necessary for your use cases.
>> However, this would be a concern for our driver if we were to adopt this
>> interface.
> 
> yeap, this case was entirely with admin-perm, so not needed at all...
> But I see your point and this is really not giving any flexibility to
> other drivers.
> 
>>
>>>> +      attribute-set: node-attrs
>>>> +      flags: [admin-perm]
>>>> +      dump:
>>>> +        reply:
>>>> +          attributes:
>>>> +            - node-id
>>>> +            - device-name
>>>> +            - node-name
>>>> +            - node-type
>>>> +    -
>>>> +      name: get-error-counters
>>>> +      doc: >-
>>>> +           Retrieve the full list of error counters for a given node.
>>>> +           The response includes the id, the name, and even the current
>>>> +           value of each counter.
>>>> +      attribute-set: error-counter-attrs
>>>> +      flags: [admin-perm]
>>>> +      dump:
>>>> +        request:
>>>> +          attributes:
>>>> +            - node-id
>>>> +        reply:
>>>> +          attributes:
>>>> +            - error-id
>>>> +            - error-name
>>>> +            - error-value
>>>> +    -
>>>> +      name: query-error-counter
>>>> +      doc: >-
>>>> +           Query the information of a specific error counter for a given node.
>>>> +           Users must provide the node ID and the error counter ID.
>>>> +           The response contains the id, the name, and the current value
>>>> +           of the counter.
>>>> +      attribute-set: error-counter-attrs
>>>> +      flags: [admin-perm]
>>>> +      do:
>>>> +        request:
>>>> +          attributes:
>>>> +            - node-id
>>>> +            - error-id
>>>> +        reply:
>>>> +          attributes:
>>>> +            - error-id
>>>> +            - error-name
>>>> +            - error-value
>>>> +
>>>> +kernel-family:
>>>> +  headers: ["drm/drm_ras_nl.h"]
>>>> diff --git a/drivers/gpu/drm/Kconfig b/drivers/gpu/drm/Kconfig
>>>> index 7e6bc0b3a589..5cfb23b80441 100644
>>>> --- a/drivers/gpu/drm/Kconfig
>>>> +++ b/drivers/gpu/drm/Kconfig
>>>> @@ -130,6 +130,15 @@ config DRM_PANIC_SCREEN_QR_VERSION
>>>>    	  Smaller QR code are easier to read, but will contain less debugging
>>>>    	  data. Default is 40.
>>>> +config DRM_RAS
>>>> +	bool "DRM RAS support"
>>>> +	depends on DRM
>>>> +	help
>>>> +	  Enables the DRM RAS (Reliability, Availability and Serviceability)
>>>> +	  support for DRM drivers. This provides a Generic Netlink interface
>>>> +	  for error reporting and queries.
>>>> +	  If in doubt, say "N".
>>>> +
>>>>    config DRM_DEBUG_DP_MST_TOPOLOGY_REFS
>>>>            bool "Enable refcount backtrace history in the DP MST helpers"
>>>>    	depends on STACKTRACE_SUPPORT
>>>> diff --git a/drivers/gpu/drm/Makefile b/drivers/gpu/drm/Makefile
>>>> index 4b3f3ad5058a..cd19573b2d9f 100644
>>>> --- a/drivers/gpu/drm/Makefile
>>>> +++ b/drivers/gpu/drm/Makefile
>>>> @@ -95,6 +95,7 @@ drm-$(CONFIG_DRM_ACCEL) += ../../accel/drm_accel.o
>>>>    drm-$(CONFIG_DRM_PANIC) += drm_panic.o
>>>>    drm-$(CONFIG_DRM_DRAW) += drm_draw.o
>>>>    drm-$(CONFIG_DRM_PANIC_SCREEN_QR_CODE) += drm_panic_qr.o
>>>> +drm-$(CONFIG_DRM_RAS) += drm_ras.o drm_ras_nl.o drm_ras_genl_family.o
>>>>    obj-$(CONFIG_DRM)	+= drm.o
>>>>    obj-$(CONFIG_DRM_PANEL_ORIENTATION_QUIRKS) += drm_panel_orientation_quirks.o
>>>> diff --git a/drivers/gpu/drm/drm_drv.c b/drivers/gpu/drm/drm_drv.c
>>>> index 2915118436ce..6b965c3d3307 100644
>>>> --- a/drivers/gpu/drm/drm_drv.c
>>>> +++ b/drivers/gpu/drm/drm_drv.c
>>>> @@ -53,6 +53,7 @@
>>>>    #include <drm/drm_panic.h>
>>>>    #include <drm/drm_print.h>
>>>>    #include <drm/drm_privacy_screen_machine.h>
>>>> +#include <drm/drm_ras_genl_family.h>
>>>>    #include "drm_crtc_internal.h"
>>>>    #include "drm_internal.h"
>>>> @@ -1223,6 +1224,7 @@ static const struct file_operations drm_stub_fops = {
>>>>    static void drm_core_exit(void)
>>>>    {
>>>> +	drm_ras_genl_family_unregister();
>>>>    	drm_privacy_screen_lookup_exit();
>>>>    	drm_panic_exit();
>>>>    	accel_core_exit();
>>>> @@ -1261,6 +1263,10 @@ static int __init drm_core_init(void)
>>>>    	drm_privacy_screen_lookup_init();
>>>> +	ret = drm_ras_genl_family_register();
>>>> +	if (ret < 0)
>>>> +		goto error;
>>>> +
>>>>    	drm_core_init_complete = true;
>>>>    	DRM_DEBUG("Initialized\n");
>>>> diff --git a/drivers/gpu/drm/drm_ras.c b/drivers/gpu/drm/drm_ras.c
>>>> new file mode 100644
>>>> index 000000000000..32f3897ce580
>>>> --- /dev/null
>>>> +++ b/drivers/gpu/drm/drm_ras.c
>>>> @@ -0,0 +1,351 @@
>>>> +// SPDX-License-Identifier: MIT
>>>> +/*
>>>> + * Copyright © 2025 Intel Corporation
>>>> + */
>>>> +
>>>> +#include <linux/module.h>
>>>> +#include <linux/kernel.h>
>>>> +#include <linux/netdevice.h>
>>>> +#include <linux/xarray.h>
>>>> +#include <net/genetlink.h>
>>>> +
>>>> +#include <drm/drm_ras.h>
>>>> +
>>>> +/**
>>>> + * DOC: DRM RAS Node Management
>>>> + *
>>>> + * This module provides the infrastructure to manage RAS (Reliability,
>>>> + * Availability, and Serviceability) nodes for DRM drivers. Each
>>>> + * DRM driver may register one or more RAS nodes, which represent
>>>> + * logical components capable of reporting error counters and other
>>>> + * reliability metrics.
>>>> + *
>>>> + * The nodes are stored in a global xarray `drm_ras_xa` to allow
>>>> + * efficient lookup by ID. Nodes can be registered or unregistered
>>>> + * dynamically at runtime.
>>>> + *
>>>> + * A Generic Netlink family `drm_ras` exposes two main operations to
>>>> + * userspace:
>>
>> Nit: Three main operations.
> 
> oops, my bad, sorry
> 
>>
>>>> + *
>>>> + * 1. LIST_NODES: Dump all currently registered RAS nodes.
>>>> + *    The user receives an array of node IDs, names, and types.
>>>> + *
>>>> + * 2. GET_ERROR_COUNTERS: Dump all error counters of a given node.
>>>> + *    The user receives an array of error IDs, names, and current value.
>>>> + *
>>>> + * 3. QUERY_ERROR_COUNTER: Query a specific error counter for a given node.
>>>> + *    Userspace must provide the node ID and the counter ID, and
>>>> + *    receives the ID, the error name, and its current value.
>>>> + *
>>>> + * Node registration:
>>>> + * - drm_ras_node_register(): Registers a new node and assigns
>>>> + *   it a unique ID in the xarray.
>>>> + * - drm_ras_node_unregister(): Removes a previously registered
>>>> + *   node from the xarray.
>>>> + *
>>>> + * Node type:
>>>> + * - ERROR_COUNTER:
>>>> + *     + Currently, only error counters are supported.
>>>> + *     + The driver must implement the query_error_counter() callback to provide
>>>> + *       the name and the value of the error counter.
>>>> + *     + The driver must provide an error_counter_range.last value informing the
>>>> + *       last valid error ID.
>>>> + *     + The driver can provide an error_counter_range.first value informing the
>>>> + *       first valid error ID.
>>>> + *     + The error counters in the driver don't need to be contiguous, but the
>>>> + *       driver must return -ENOENT to the query_error_counter as an indication
>>>> + *       that the ID should be skipped and not listed in the netlink API.
>>>> + *
>>>> + * Netlink handlers:
>>>> + * - drm_ras_nl_list_nodes_dumpit(): Implements the LIST_NODES
>>>> + *   operation, iterating over the xarray.
>>>> + * - drm_ras_nl_get_error_counters_dumpit(): Implements the GET_ERROR_COUNTERS
>>>> + *   operation, iterating over the known valid error_counter_range.
>>>> + * - drm_ras_nl_query_error_counter_doit(): Implements the QUERY_ERROR_COUNTER
>>>> + *   operation, fetching a counter value from a specific node.
>>>> + */
>>>> +
>>>> +static DEFINE_XARRAY_ALLOC(drm_ras_xa);
>>>> +
>>>> +/*
>>>> + * The netlink callback context carries dump state across multiple dumpit calls
>>>> + */
>>>> +struct drm_ras_ctx {
>>>> +	/* Which xarray id to restart the dump from */
>>>> +	unsigned long restart;
>>>> +};
>>>> +
>>>> +/**
>>>> + * drm_ras_nl_list_nodes_dumpit() - Dump all registered RAS nodes
>>>> + * @skb: Netlink message buffer
>>>> + * @cb: Callback context for multi-part dumps
>>>> + *
>>>> + * Iterates over all registered RAS nodes in the global xarray and appends
>>>> + * their attributes (ID, name, type) to the given netlink message buffer.
>>>> + * Uses @cb->ctx to track progress in case the message buffer fills up, allowing
>>>> + * multi-part dump support. On buffer overflow, updates the context to resume
>>>> + * from the last node on the next invocation.
>>>> + *
>>>> + * Return: 0 if all nodes fit in @skb, number of bytes added to @skb if
>>>> + *          the buffer filled up (requires multi-part continuation), or
>>>> + *          a negative error code on failure.
>>>> + */
>>>> +int drm_ras_nl_list_nodes_dumpit(struct sk_buff *skb,
>>>> +				 struct netlink_callback *cb)
>>>> +{
>>>> +	const struct genl_info *info = genl_info_dump(cb);
>>>> +	struct drm_ras_ctx *ctx = (void *)cb->ctx;
>>>> +	struct drm_ras_node *node;
>>>> +	struct nlattr *hdr;
>>>> +	unsigned long id;
>>>> +	int ret = 0;
>>>> +
>>>> +	xa_for_each_start(&drm_ras_xa, id, node, ctx->restart) {
>>>> +		hdr = genlmsg_iput(skb, info);
>>>> +		if (!hdr) {
>>>> +			ret = -EMSGSIZE;
>>>> +			break;
>>>> +		}
>>>> +
>>>> +		ret = nla_put_u32(skb, DRM_RAS_A_NODE_ATTRS_NODE_ID, node->id);
>>>> +		if (ret) {
>>>> +			genlmsg_cancel(skb, hdr);
>>>> +			break;
>>>> +		}
>>>> +
>>>> +		ret = nla_put_string(skb, DRM_RAS_A_NODE_ATTRS_DEVICE_NAME,
>>>> +				     node->device_name);
>>>> +		if (ret) {
>>>> +			genlmsg_cancel(skb, hdr);
>>>> +			break;
>>>> +		}
>>>> +
>>>> +		ret = nla_put_string(skb, DRM_RAS_A_NODE_ATTRS_NODE_NAME,
>>>> +				     node->node_name);
>>>> +		if (ret) {
>>>> +			genlmsg_cancel(skb, hdr);
>>>> +			break;
>>>> +		}
>>>> +
>>>> +		ret = nla_put_u32(skb, DRM_RAS_A_NODE_ATTRS_NODE_TYPE,
>>>> +				  node->type);
>>>> +		if (ret) {
>>>> +			genlmsg_cancel(skb, hdr);
>>>> +			break;
>>>> +		}
>>>> +
>>>> +		genlmsg_end(skb, hdr);
>>>> +	}
>>>> +
>>>> +	if (ret == -EMSGSIZE)
>>>> +		ctx->restart = id;
>>>
>>> Jakub had mentioned that we don't need this special handling
>>> of the -EMSGSIZE, but then I'm not sure what to use in the
>>> xa_for_each_start, so
>>>
>>> Cc: Jakub Kicinski <kuba@kernel.org>
>>>
>>> to ensure that we are on the right path here.
>>>
>>> Riana, thank you so much for picking up this and addressing all
>>> the comments. Patch looks good to me.
>>>
>>> Thanks,
>>> Rodrigo.
>>>
>>>> +
>>>> +	return ret;
>>>> +}
>>>> +
>>>> +static int get_node_error_counter(u32 node_id, u32 error_id,
>>>> +				  const char **name, u32 *value)
>>>> +{
>>>> +	struct drm_ras_node *node;
>>>> +
>>>> +	node = xa_load(&drm_ras_xa, node_id);
>>>> +	if (!node || !node->query_error_counter)
>>>> +		return -ENOENT;
>>>> +
>>>> +	if (error_id < node->error_counter_range.first ||
>>>> +	    error_id > node->error_counter_range.last)
>>>> +		return -EINVAL;
>>>> +
>>>> +	return node->query_error_counter(node, error_id, name, value);
>>>> +}
>>
>> Regarding the permission check, node->query_error_counter could be
>> implemented to return -EPERM in this case by checking driver specified
>> fields in node->priv. Thoughts?
> 
> Yeap, please let me know your thoughts above on how drivers could check
> and then return here and let's come to a flexible but secure design.
> 
>>
>>>> +
>>>> +static int msg_reply_value(struct sk_buff *msg, u32 error_id,
>>>> +			   const char *error_name, u32 value)
>>>> +{
>>>> +	int ret;
>>>> +
>>>> +	ret = nla_put_u32(msg, DRM_RAS_A_ERROR_COUNTER_ATTRS_ERROR_ID, error_id);
>>>> +	if (ret)
>>>> +		return ret;
>>>> +
>>>> +	ret = nla_put_string(msg, DRM_RAS_A_ERROR_COUNTER_ATTRS_ERROR_NAME,
>>>> +			     error_name);
>>>> +	if (ret)
>>>> +		return ret;
>>>> +
>>>> +	return nla_put_u32(msg, DRM_RAS_A_ERROR_COUNTER_ATTRS_ERROR_VALUE,
>>>> +			   value);
>>>> +}
>>>> +
>>>> +static int doit_reply_value(struct genl_info *info, u32 node_id,
>>>> +			    u32 error_id)
>>>> +{
>>>> +	struct sk_buff *msg;
>>>> +	struct nlattr *hdr;
>>>> +	const char *error_name;
>>>> +	u32 value;
>>>> +	int ret;
>>>> +
>>>> +	msg = genlmsg_new(NLMSG_GOODSIZE, GFP_KERNEL);
>>>> +	if (!msg)
>>>> +		return -ENOMEM;
>>>> +
>>>> +	hdr = genlmsg_iput(msg, info);
>>>> +	if (!hdr) {
>>>> +		nlmsg_free(msg);
>>>> +		return -EMSGSIZE;
>>>> +	}
>>>> +
>>>> +	ret = get_node_error_counter(node_id, error_id,
>>>> +				     &error_name, &value);
>>>> +	if (ret)
>>>> +		return ret;
>>>> +
>>>> +	ret = msg_reply_value(msg, error_id, error_name, value);
>>>> +	if (ret) {
>>>> +		genlmsg_cancel(msg, hdr);
>>>> +		nlmsg_free(msg);
>>>> +		return ret;
>>>> +	}
>>>> +
>>>> +	genlmsg_end(msg, hdr);
>>>> +
>>>> +	return genlmsg_reply(msg, info);
>>>> +}
>>>> +
>>>> +/**
>>>> + * drm_ras_nl_get_error_counters_dumpit() - Dump all Error Counters
>>>> + * @skb: Netlink message buffer
>>>> + * @cb: Callback context for multi-part dumps
>>>> + *
>>>> + * Iterates over all error counters in a given Node and appends
>>>> + * their attributes (ID, name, value) to the given netlink message buffer.
>>>> + * Uses @cb->ctx to track progress in case the message buffer fills up, allowing
>>>> + * multi-part dump support. On buffer overflow, updates the context to resume
>>>> + * from the last node on the next invocation.
>>>> + *
>>>> + * Return: 0 if all errors fit in @skb, number of bytes added to @skb if
>>>> + *          the buffer filled up (requires multi-part continuation), or
>>>> + *          a negative error code on failure.
>>>> + */
>>>> +int drm_ras_nl_get_error_counters_dumpit(struct sk_buff *skb,
>>>> +					 struct netlink_callback *cb)
>>>> +{
>>>> +	const struct genl_info *info = genl_info_dump(cb);
>>>> +	struct drm_ras_ctx *ctx = (void *)cb->ctx;
>>>> +	struct drm_ras_node *node;
>>>> +	struct nlattr *hdr;
>>>> +	const char *error_name;
>>>> +	u32 node_id, error_id, value;
>>>> +	int ret = 0;
>>>> +
>>>> +	if (!info->attrs || !info->attrs[DRM_RAS_A_ERROR_COUNTER_ATTRS_NODE_ID])
>>>> +		return -EINVAL;
>>>> +
>>>> +	node_id = nla_get_u32(info->attrs[DRM_RAS_A_ERROR_COUNTER_ATTRS_NODE_ID]);
>>>> +
>>>> +	node = xa_load(&drm_ras_xa, node_id);
>>>> +	if (!node)
>>>> +		return -ENOENT;
>>>> +
>>>> +	for (error_id = max(node->error_counter_range.first, ctx->restart);
>>>> +	     error_id <= node->error_counter_range.last;
>>>> +	     error_id++) {
>>>> +		ret = get_node_error_counter(node_id, error_id,
>>>> +					     &error_name, &value);
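>>>> +		/*
>>>> +		 * For a non-contiguous range, the driver returns -ENOENT to
>>>> +		 * indicate that this ID should be skipped when listing all errors.
>>>> +		 */
>>>> +		if (ret == -ENOENT) {
>>>> +			/* A trailing hole must not surface as a dump error */
>>>> +			ret = 0;
>>>> +			continue;
>>>> +		}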
>>>> +		if (ret)
>>>> +			return ret;
>>>> +
>>>> +		hdr = genlmsg_iput(skb, info);
>>>> +
>>>> +		if (!hdr) {
>>>> +			ret = -EMSGSIZE;
>>>> +			break;
>>>> +		}
>>>> +
>>>> +		ret = msg_reply_value(skb, error_id, error_name, value);
>>>> +		if (ret) {
>>>> +			genlmsg_cancel(skb, hdr);
>>>> +			break;
>>>> +		}
>>>> +
>>>> +		genlmsg_end(skb, hdr);
>>>> +	}
>>>> +
>>>> +	if (ret == -EMSGSIZE)
>>>> +		ctx->restart = error_id;
>>>> +
>>>> +	return ret;
>>>> +}
>>>> +
>>>> +/**
>>>> + * drm_ras_nl_query_error_counter_doit() - Query an error counter of a node
>>>> + * @skb: Netlink message buffer
>>>> + * @info: Generic Netlink info containing attributes of the request
>>>> + *
>>>> + * Extracts the node ID and error ID from the netlink attributes and
>>>> + * retrieves the current value of the corresponding error counter. Sends the
>>>> + * result back to the requesting user via the standard Genl reply.
>>>> + *
>>>> + * Return: 0 on success, or negative errno on failure.
>>>> + */
>>>> +int drm_ras_nl_query_error_counter_doit(struct sk_buff *skb,
>>>> +					struct genl_info *info)
>>>> +{
>>>> +	u32 node_id, error_id;
>>>> +
>>>> +	if (!info->attrs ||
>>>> +	    !info->attrs[DRM_RAS_A_ERROR_COUNTER_ATTRS_NODE_ID] ||
>>>> +	    !info->attrs[DRM_RAS_A_ERROR_COUNTER_ATTRS_ERROR_ID])
>>>> +		return -EINVAL;
>>>> +
>>>> +	node_id = nla_get_u32(info->attrs[DRM_RAS_A_ERROR_COUNTER_ATTRS_NODE_ID]);
>>>> +	error_id = nla_get_u32(info->attrs[DRM_RAS_A_ERROR_COUNTER_ATTRS_ERROR_ID]);
>>>> +
>>>> +	return doit_reply_value(info, node_id, error_id);
>>>> +}
>>>> +
>>>> +/**
>>>> + * drm_ras_node_register() - Register a new RAS node
>>>> + * @node: Node structure to register
>>>> + *
>>>> + * Adds the given RAS node to the global node xarray and assigns it
>>>> + * a unique ID. Both @node->device_name and @node->node_name must be set,
>>>> + * and @node->type must be a supported node type.
>>>> + *
>>>> + * Return: 0 on success, or negative errno on failure.
>>>> + */
>>>> +int drm_ras_node_register(struct drm_ras_node *node)
>>>> +{
>>>> +	if (!node->device_name || !node->node_name)
>>>> +		return -EINVAL;
>>>> +
>>>> +	/* Currently, only Error Counter Endpoints are supported */
>>>> +	if (node->type != DRM_RAS_NODE_TYPE_ERROR_COUNTER)
>>>> +		return -EINVAL;
>>>> +
>>>> +	/* Mandatory entries for Error Counter Node */
>>>> +	if (node->type == DRM_RAS_NODE_TYPE_ERROR_COUNTER &&
>>>> +	    (!node->error_counter_range.last || !node->query_error_counter))
>>>> +		return -EINVAL;
>>>> +
>>>> +	return xa_alloc(&drm_ras_xa, &node->id, node, xa_limit_32b, GFP_KERNEL);
>>>> +}
>>>> +EXPORT_SYMBOL(drm_ras_node_register);
>>>> +
>>>> +/**
>>>> + * drm_ras_node_unregister() - Unregister a previously registered node
>>>> + * @node: Node structure to unregister
>>>> + *
>>>> + * Removes the given node from the global node xarray using its ID.
>>>> + */
>>>> +void drm_ras_node_unregister(struct drm_ras_node *node)
>>>> +{
>>>> +	xa_erase(&drm_ras_xa, node->id);
>>>> +}
>>>> +EXPORT_SYMBOL(drm_ras_node_unregister);
>>>> diff --git a/drivers/gpu/drm/drm_ras_genl_family.c b/drivers/gpu/drm/drm_ras_genl_family.c
>>>> new file mode 100644
>>>> index 000000000000..2d818b8c3808
>>>> --- /dev/null
>>>> +++ b/drivers/gpu/drm/drm_ras_genl_family.c
>>>> @@ -0,0 +1,42 @@
>>>> +// SPDX-License-Identifier: MIT
>>>> +/*
>>>> + * Copyright © 2025 Intel Corporation
>>>> + */
>>>> +
>>>> +#include <drm/drm_ras_genl_family.h>
>>>> +#include <drm/drm_ras_nl.h>
>>>> +
>>>> +/* Track family registration so the drm_exit can be called at any time */
>>>> +static bool registered;
>>>> +
>>>> +/**
>>>> + * drm_ras_genl_family_register() - Register drm-ras genl family
>>>> + *
>>>> + * Only to be called once at drm_drv_init()
>>>> + */
>>>> +int drm_ras_genl_family_register(void)
>>>> +{
>>>> +	int ret;
>>>> +
>>>> +	registered = false;
>>>> +
>>>> +	ret = genl_register_family(&drm_ras_nl_family);
>>>> +	if (ret)
>>>> +		return ret;
>>>> +
>>>> +	registered = true;
>>>> +	return 0;
>>>> +}
>>>> +
>>>> +/**
>>>> + * drm_ras_genl_family_unregister() - Unregister drm-ras genl family
>>>> + *
>>>> + * To be called from drm_drv_exit() at any moment, but only once.
>>>> + */
>>>> +void drm_ras_genl_family_unregister(void)
>>>> +{
>>>> +	if (registered) {
>>>> +		genl_unregister_family(&drm_ras_nl_family);
>>>> +		registered = false;
>>>> +	}
>>>> +}
>>>> diff --git a/drivers/gpu/drm/drm_ras_nl.c b/drivers/gpu/drm/drm_ras_nl.c
>>>> new file mode 100644
>>>> index 000000000000..fcd1392410e4
>>>> --- /dev/null
>>>> +++ b/drivers/gpu/drm/drm_ras_nl.c
>>>> @@ -0,0 +1,54 @@
>>>> +// SPDX-License-Identifier: ((GPL-2.0 WITH Linux-syscall-note) OR BSD-3-Clause)
>>>> +/* Do not edit directly, auto-generated from: */
>>>> +/*	Documentation/netlink/specs/drm_ras.yaml */
>>>> +/* YNL-GEN kernel source */
>>>> +
>>>> +#include <net/netlink.h>
>>>> +#include <net/genetlink.h>
>>>> +
>>>> +#include <uapi/drm/drm_ras.h>
>>>> +#include <drm/drm_ras_nl.h>
>>>> +
>>>> +/* DRM_RAS_CMD_GET_ERROR_COUNTERS - dump */
>>>> +static const struct nla_policy drm_ras_get_error_counters_nl_policy[DRM_RAS_A_ERROR_COUNTER_ATTRS_NODE_ID + 1] = {
>>>> +	[DRM_RAS_A_ERROR_COUNTER_ATTRS_NODE_ID] = { .type = NLA_U32, },
>>>> +};
>>>> +
>>>> +/* DRM_RAS_CMD_QUERY_ERROR_COUNTER - do */
>>>> +static const struct nla_policy drm_ras_query_error_counter_nl_policy[DRM_RAS_A_ERROR_COUNTER_ATTRS_ERROR_ID + 1] = {
>>>> +	[DRM_RAS_A_ERROR_COUNTER_ATTRS_NODE_ID] = { .type = NLA_U32, },
>>>> +	[DRM_RAS_A_ERROR_COUNTER_ATTRS_ERROR_ID] = { .type = NLA_U32, },
>>>> +};
>>>> +
>>>> +/* Ops table for drm_ras */
>>>> +static const struct genl_split_ops drm_ras_nl_ops[] = {
>>>> +	{
>>>> +		.cmd	= DRM_RAS_CMD_LIST_NODES,
>>>> +		.dumpit	= drm_ras_nl_list_nodes_dumpit,
>>>> +		.flags	= GENL_ADMIN_PERM | GENL_CMD_CAP_DUMP,
>>>> +	},
>>>> +	{
>>>> +		.cmd		= DRM_RAS_CMD_GET_ERROR_COUNTERS,
>>>> +		.dumpit		= drm_ras_nl_get_error_counters_dumpit,
>>>> +		.policy		= drm_ras_get_error_counters_nl_policy,
>>>> +		.maxattr	= DRM_RAS_A_ERROR_COUNTER_ATTRS_NODE_ID,
>>>> +		.flags		= GENL_ADMIN_PERM | GENL_CMD_CAP_DUMP,
>>>> +	},
>>>> +	{
>>>> +		.cmd		= DRM_RAS_CMD_QUERY_ERROR_COUNTER,
>>>> +		.doit		= drm_ras_nl_query_error_counter_doit,
>>>> +		.policy		= drm_ras_query_error_counter_nl_policy,
>>>> +		.maxattr	= DRM_RAS_A_ERROR_COUNTER_ATTRS_ERROR_ID,
>>>> +		.flags		= GENL_ADMIN_PERM | GENL_CMD_CAP_DO,
>>>> +	},
>>>> +};
>>>> +
>>>> +struct genl_family drm_ras_nl_family __ro_after_init = {
>>>> +	.name		= DRM_RAS_FAMILY_NAME,
>>>> +	.version	= DRM_RAS_FAMILY_VERSION,
>>>> +	.netnsok	= true,
>>>> +	.parallel_ops	= true,
>>>> +	.module		= THIS_MODULE,
>>>> +	.split_ops	= drm_ras_nl_ops,
>>>> +	.n_split_ops	= ARRAY_SIZE(drm_ras_nl_ops),
>>>> +};
>>>> diff --git a/include/drm/drm_ras.h b/include/drm/drm_ras.h
>>>> new file mode 100644
>>>> index 000000000000..bba47a282ef8
>>>> --- /dev/null
>>>> +++ b/include/drm/drm_ras.h
>>>> @@ -0,0 +1,76 @@
>>>> +/* SPDX-License-Identifier: MIT */
>>>> +/*
>>>> + * Copyright © 2025 Intel Corporation
>>>> + */
>>>> +
>>>> +#ifndef __DRM_RAS_H__
>>>> +#define __DRM_RAS_H__
>>>> +
>>>> +#include "drm_ras_nl.h"
>>>> +
>>>> +/**
>>>> + * struct drm_ras_node - A DRM RAS Node
>>>> + */
>>>> +struct drm_ras_node {
>>>> +	/** @id: Unique identifier for the node. Dynamically assigned. */
>>>> +	u32 id;
>>>> +	/**
>>>> +	 * @device_name: Human-readable name of the device. Given by the driver.
>>>> +	 */
>>>> +	const char *device_name;
>>>> +	/** @node_name: Human-readable name of the node. Given by the driver. */
>>>> +	const char *node_name;
>>>> +	/** @type: Type of the node (enum drm_ras_node_type). */
>>>> +	enum drm_ras_node_type type;
>>>> +
>>>> +	/* Error-Counter Related Callback and Variables */
>>>> +
>>>> +	/** @error_counter_range: Range of valid Error IDs for this node. */
>>>> +	struct {
>>>> +		/** @first: First valid Error ID. */
>>>> +		u32 first;
>>>> +		/** @last: Last valid Error ID. Mandatory entry. */
>>>> +		u32 last;
>>>> +	} error_counter_range;
>>>> +
>>>> +	/**
>>>> +	 * @query_error_counter:
>>>> +	 *
>>>> +	 * This callback is used by drm-ras to query a specific error counter
>>>> +	 * supported by this node. Used for input check and to iterate over
>>>> +	 * all counters.
>>>> +	 *
>>>> +	 * Driver should expect query_error_counter() to be called with
>>>> +	 * error_id from `error_counter_range.first` to
>>>> +	 * `error_counter_range.last`.
>>>> +	 *
>>>> +	 * The @query_error_counter is a mandatory callback for
>>>> +	 * error_counter_node.
>>>> +	 *
>>>> +	 * Returns: 0 on success,
>>>> +	 *          -ENOENT when error_id is not supported as an indication that
>>>> +	 *                  drm_ras should silently skip this entry. Used for
>>>> +	 *                  supporting non-contiguous error ranges.
>>>> +	 *                  Driver is responsible for maintaining the list of
>>>> +	 *                  supported error IDs in the range of first to last.
>>>> +	 *          Other negative values on errors that should terminate the
>>>> +	 *          netlink query.
>>>> +	 */
>>>> +	int (*query_error_counter)(struct drm_ras_node *ep, u32 error_id,
>>>> +				   const char **name, u32 *val);
>>>> +
>>>> +	/** @priv: Driver private data */
>>>> +	void *priv;
>>>> +};
>>>> +
>>
>> If new node types are frequently added, this struct may contain many
>> unused fields. It seems like the necessary members for any given node
>> type are: id, device_name, node_name, type, and priv. However, since
>> this functionality is designed specifically for RAS, I think it's OK.
> 
> Yeap, that was the thought.
> 
> Thank you so much for the review and thoughts here,
> Rodrigo.
> 
>>
>>>> +struct drm_device;
>>>> +
>>>> +#if IS_ENABLED(CONFIG_DRM_RAS)
>>>> +int drm_ras_node_register(struct drm_ras_node *ep);
>>>> +void drm_ras_node_unregister(struct drm_ras_node *ep);
>>>> +#else
>>>> +static inline int drm_ras_node_register(struct drm_ras_node *ep) { return 0; }
>>>> +static inline void drm_ras_node_unregister(struct drm_ras_node *ep) { }
>>>> +#endif
>>>> +
>>>> +#endif
>>>> diff --git a/include/drm/drm_ras_genl_family.h b/include/drm/drm_ras_genl_family.h
>>>> new file mode 100644
>>>> index 000000000000..5931b53429f1
>>>> --- /dev/null
>>>> +++ b/include/drm/drm_ras_genl_family.h
>>>> @@ -0,0 +1,17 @@
>>>> +/* SPDX-License-Identifier: MIT */
>>>> +/*
>>>> + * Copyright © 2025 Intel Corporation
>>>> + */
>>>> +
>>>> +#ifndef __DRM_RAS_GENL_FAMILY_H__
>>>> +#define __DRM_RAS_GENL_FAMILY_H__
>>>> +
>>>> +#if IS_ENABLED(CONFIG_DRM_RAS)
>>>> +int drm_ras_genl_family_register(void);
>>>> +void drm_ras_genl_family_unregister(void);
>>>> +#else
>>>> +static inline int drm_ras_genl_family_register(void) { return 0; }
>>>> +static inline void drm_ras_genl_family_unregister(void) { }
>>>> +#endif
>>>> +
>>>> +#endif
>>>> diff --git a/include/drm/drm_ras_nl.h b/include/drm/drm_ras_nl.h
>>>> new file mode 100644
>>>> index 000000000000..9613b7d9ffdb
>>>> --- /dev/null
>>>> +++ b/include/drm/drm_ras_nl.h
>>>> @@ -0,0 +1,24 @@
>>>> +/* SPDX-License-Identifier: ((GPL-2.0 WITH Linux-syscall-note) OR BSD-3-Clause) */
>>>> +/* Do not edit directly, auto-generated from: */
>>>> +/*	Documentation/netlink/specs/drm_ras.yaml */
>>>> +/* YNL-GEN kernel header */
>>>> +
>>>> +#ifndef _LINUX_DRM_RAS_GEN_H
>>>> +#define _LINUX_DRM_RAS_GEN_H
>>>> +
>>>> +#include <net/netlink.h>
>>>> +#include <net/genetlink.h>
>>>> +
>>>> +#include <uapi/drm/drm_ras.h>
>>>> +#include <drm/drm_ras_nl.h>
>>>> +
>>>> +int drm_ras_nl_list_nodes_dumpit(struct sk_buff *skb,
>>>> +				 struct netlink_callback *cb);
>>>> +int drm_ras_nl_get_error_counters_dumpit(struct sk_buff *skb,
>>>> +					 struct netlink_callback *cb);
>>>> +int drm_ras_nl_query_error_counter_doit(struct sk_buff *skb,
>>>> +					struct genl_info *info);
>>>> +
>>>> +extern struct genl_family drm_ras_nl_family;
>>>> +
>>>> +#endif /* _LINUX_DRM_RAS_GEN_H */
>>>> diff --git a/include/uapi/drm/drm_ras.h b/include/uapi/drm/drm_ras.h
>>>> new file mode 100644
>>>> index 000000000000..3415ba345ac8
>>>> --- /dev/null
>>>> +++ b/include/uapi/drm/drm_ras.h
>>>> @@ -0,0 +1,49 @@
>>>> +/* SPDX-License-Identifier: ((GPL-2.0 WITH Linux-syscall-note) OR BSD-3-Clause) */
>>>> +/* Do not edit directly, auto-generated from: */
>>>> +/*	Documentation/netlink/specs/drm_ras.yaml */
>>>> +/* YNL-GEN uapi header */
>>>> +
>>>> +#ifndef _UAPI_LINUX_DRM_RAS_H
>>>> +#define _UAPI_LINUX_DRM_RAS_H
>>>> +
>>>> +#define DRM_RAS_FAMILY_NAME	"drm-ras"
>>>> +#define DRM_RAS_FAMILY_VERSION	1
>>>> +
>>>> +/*
>>>> + * Type of the node. Currently, only error-counter nodes are supported, which
>>>> + * expose reliability counters for a hardware/software component.
>>>> + */
>>>> +enum drm_ras_node_type {
>>>> +	DRM_RAS_NODE_TYPE_ERROR_COUNTER = 1,
>>>> +};
>>>> +
>>>> +enum {
>>>> +	DRM_RAS_A_NODE_ATTRS_NODE_ID = 1,
>>>> +	DRM_RAS_A_NODE_ATTRS_DEVICE_NAME,
>>>> +	DRM_RAS_A_NODE_ATTRS_NODE_NAME,
>>>> +	DRM_RAS_A_NODE_ATTRS_NODE_TYPE,
>>>> +
>>>> +	__DRM_RAS_A_NODE_ATTRS_MAX,
>>>> +	DRM_RAS_A_NODE_ATTRS_MAX = (__DRM_RAS_A_NODE_ATTRS_MAX - 1)
>>>> +};
>>>> +
>>>> +enum {
>>>> +	DRM_RAS_A_ERROR_COUNTER_ATTRS_NODE_ID = 1,
>>>> +	DRM_RAS_A_ERROR_COUNTER_ATTRS_ERROR_ID,
>>>> +	DRM_RAS_A_ERROR_COUNTER_ATTRS_ERROR_NAME,
>>>> +	DRM_RAS_A_ERROR_COUNTER_ATTRS_ERROR_VALUE,
>>>> +
>>>> +	__DRM_RAS_A_ERROR_COUNTER_ATTRS_MAX,
>>>> +	DRM_RAS_A_ERROR_COUNTER_ATTRS_MAX = (__DRM_RAS_A_ERROR_COUNTER_ATTRS_MAX - 1)
>>>> +};
>>>> +
>>>> +enum {
>>>> +	DRM_RAS_CMD_LIST_NODES = 1,
>>>> +	DRM_RAS_CMD_GET_ERROR_COUNTERS,
>>>> +	DRM_RAS_CMD_QUERY_ERROR_COUNTER,
>>>> +
>>>> +	__DRM_RAS_CMD_MAX,
>>>> +	DRM_RAS_CMD_MAX = (__DRM_RAS_CMD_MAX - 1)
>>>> +};
>>>> +
>>>> +#endif /* _UAPI_LINUX_DRM_RAS_H */
>>>> -- 
>>>> 2.47.1
>>>>
>>
>> Thanks,
>>
>> Zack
>>
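
The resume logic of drm_ras_nl_get_error_counters_dumpit() quoted above can be
illustrated outside the kernel. What follows is only a user-space sketch under
stated assumptions: struct node, struct buf, demo_query() and dump_pass() are
hypothetical stand-ins (a fixed-size record array replaces the skb, and a plain
cursor replaces cb->ctx). None of them is part of the patch.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical stand-in for struct drm_ras_node, reduced to what a dump needs. */
struct node {
	uint32_t first, last;
	int (*query)(uint32_t error_id, const char **name, uint32_t *val);
};

/* Hypothetical stand-in for the netlink skb: a fixed number of record slots. */
struct buf {
	uint32_t ids[4];
	uint32_t vals[4];
	size_t used, cap;
};

/* Example driver callback: IDs 1, 2 and 4 exist, ID 3 is a hole in the range. */
static int demo_query(uint32_t error_id, const char **name, uint32_t *val)
{
	if (error_id == 3)
		return -ENOENT;
	*name = "demo-error";
	*val = error_id * 10;
	return 0;
}

/*
 * One dump pass. Returns 0 when the whole range has been emitted, -EMSGSIZE
 * when the buffer filled up (*restart then holds the ID to resume from), or
 * another negative errno reported by the callback.
 */
static int dump_pass(const struct node *n, struct buf *b, uint32_t *restart)
{
	uint32_t id = n->first > *restart ? n->first : *restart;

	assert(b->cap <= 4);
	for (; id <= n->last; id++) {
		const char *name;
		uint32_t val;
		int ret = n->query(id, &name, &val);

		if (ret == -ENOENT)
			continue;	/* hole in a non-contiguous range */
		if (ret)
			return ret;
		if (b->used == b->cap) {
			*restart = id;	/* resume here on the next pass */
			return -EMSGSIZE;
		}
		b->ids[b->used] = id;
		b->vals[b->used] = val;
		b->used++;
	}
	return 0;
}
```

A caller would invoke dump_pass() repeatedly with a fresh buffer and the same
restart cursor until it returns 0. Because the return value is local to each
iteration, a hole at the end of the range cannot leak -ENOENT to the caller.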



* Re: [PATCH v3 1/4] drm/ras: Introduce the DRM RAS infrastructure over generic netlink
  2026-01-13  8:20         ` Riana Tauro
@ 2026-01-15 23:39           ` Zack McKevitt
  2026-01-16  5:56             ` Riana Tauro
  0 siblings, 1 reply; 31+ messages in thread
From: Zack McKevitt @ 2026-01-15 23:39 UTC (permalink / raw)
  To: Riana Tauro, Rodrigo Vivi
  Cc: Jakub Kicinski, intel-xe, dri-devel, aravind.iddamsetty,
	anshuman.gupta, joonas.lahtinen, lukas, simona.vetter, airlied,
	pratik.bari, joshua.santosh.ranjan, ashwin.kumar.kulkarni,
	shubham.kumar, Lijo Lazar, Hawking Zhang, David S. Miller,
	Paolo Abeni, Eric Dumazet, netdev, Jeff Hugo



On 1/13/2026 1:20 AM, Riana Tauro wrote:
>>>>> diff --git a/Documentation/netlink/specs/drm_ras.yaml b/ 
>>>>> Documentation/netlink/specs/drm_ras.yaml
>>>>> new file mode 100644
>>>>> index 000000000000..be0e379c5bc9
>>>>> --- /dev/null
>>>>> +++ b/Documentation/netlink/specs/drm_ras.yaml
>>>>> @@ -0,0 +1,130 @@
>>>>> +# SPDX-License-Identifier: ((GPL-2.0 WITH Linux-syscall-note) OR 
>>>>> BSD-3-Clause)
>>>>> +---
>>>>> +name: drm-ras
>>>>> +protocol: genetlink
>>>>> +uapi-header: drm/drm_ras.h
>>>>> +
>>>>> +doc: >-
>>>>> +  DRM RAS (Reliability, Availability, Serviceability) over Generic 
>>>>> Netlink.
>>>>> +  Provides a standardized mechanism for DRM drivers to register 
>>>>> "nodes"
>>>>> +  representing hardware/software components capable of reporting 
>>>>> error counters.
>>>>> +  Userspace tools can query the list of nodes or individual error 
>>>>> counters
>>>>> +  via the Generic Netlink interface.
>>>>> +
>>>>> +definitions:
>>>>> +  -
>>>>> +    type: enum
>>>>> +    name: node-type
>>>>> +    value-start: 1
>>>>> +    entries: [error-counter]
>>>>> +    doc: >-
>>>>> +         Type of the node. Currently, only error-counter nodes are
>>>>> +         supported, which expose reliability counters for a 
>>>>> hardware/software
>>>>> +         component.
>>>>> +
>>>>> +attribute-sets:
>>>>> +  -
>>>>> +    name: node-attrs
>>>>> +    attributes:
>>>>> +      -
>>>>> +        name: node-id
>>>>> +        type: u32
>>>>> +        doc: >-
>>>>> +             Unique identifier for the node.
>>>>> +             Assigned dynamically by the DRM RAS core upon 
>>>>> registration.
>>>>> +      -
>>>>> +        name: device-name
>>>>> +        type: string
>>>>> +        doc: >-
>>>>> +             Device name chosen by the driver at registration.
>>>>> +             Can be a PCI BDF, UUID, or module name if unique.
>>>>> +      -
>>>>> +        name: node-name
>>>>> +        type: string
>>>>> +        doc: >-
>>>>> +             Node name chosen by the driver at registration.
>>>>> +             Can be an IP block name, or any name that identifies the
>>>>> +             RAS node inside the device.
>>>>> +      -
>>>>> +        name: node-type
>>>>> +        type: u32
>>>>> +        doc: Type of this node, identifying its function.
>>>>> +        enum: node-type
>>>>> +  -
>>>>> +    name: error-counter-attrs
>>>>> +    attributes:
>>>>> +      -
>>>>> +        name: node-id
>>>>> +        type: u32
>>>>> +        doc:  Node ID targeted by this error counter operation.
>>>>> +      -
>>>>> +        name: error-id
>>>>> +        type: u32
>>>>> +        doc: Unique identifier for a specific error counter within 
>>>>> a node.
>>>>> +      -
>>>>> +        name: error-name
>>>>> +        type: string
>>>>> +        doc: Name of the error.
>>>>> +      -
>>>>> +        name: error-value
>>>>> +        type: u32
>>>>> +        doc: Current value of the requested error counter.
>>>>> +
>>>>> +operations:
>>>>> +  list:
>>>>> +    -
>>>>> +      name: list-nodes
>>>>> +      doc: >-
>>>>> +           Retrieve the full list of currently registered DRM RAS 
>>>>> nodes.
>>>>> +           Each node includes its dynamically assigned ID, name, 
>>>>> and type.
>>>>> +           **Important:** User space must call this operation 
>>>>> first to obtain
>>>>> +           the node IDs. These IDs are required for all subsequent
>>>>> +           operations on nodes, such as querying error counters.
>>>
>>> I am curious about security implications of this design.
>>
>> hmm... very good point you are raising here.
>>
>> This current design relies entirely on CAP_NET_ADMIN.
>> No driver would have the flexibility to choose anything differently.
>> Please notice that the flag admin-perm is hardcoded in this yaml file.
>>
>>> If the complete
>>> list of RAS nodes is visible for any process on the system (and one 
>>> wants to
>>> avoid requiring CAP_NET_ADMIN), there should be some way to enforce
>>> permission checks when performing these operations if desired.
>>
>> Right now, there's no way for a driver to avoid requiring
>> CAP_NET_ADMIN...
>>
>> The only way would be for the admin to grant cap_net_admin to the tool with:
>>
>> $ sudo setcap cap_net_admin+ep /bin/drm_ras_tool
>>
>> but not ideal and not granular anyway...
>>
>>>
>>> For example, this might be implemented in the driver's definition of
>>> callback functions like query_error_counter; some drivers may want to 
>>> ensure
>>> that the process can in fact open the file descriptor corresponding 
>>> to the
>>> queried device before serving a netlink request. Is it enough for a 
>>> driver
>>> to simply return -EPERM in this case? Any driver that doesn't wish to 
>>> protect
>>> its RAS nodes need not implement checks in their callbacks.
>>
>> Fair enough. If we want to give the option to the drivers, then we need:
>>
>> 1. to first remove all the admin-perm flags below and leave the driver to
>> pick up their policy on when to return something or -EPERM.
>> 2. Document this security responsibility and list a few possibilities.
>> 3. In our Xe case here I believe the easiest option is to use 
>> something like:
>>
>> struct scm_creds *creds = NETLINK_CREDS(cb->skb);
>> if (!gid_eq(creds->gid, GLOBAL_ROOT_GID))
>>      return -EPERM;
> 
> The driver currently does not have access to the callback or the 
> skbuffer. Sending these details as param to driver won't be right as
> drm_ras needs to handle all the netlink buffers.
> 
> How about using pre_doit and start callbacks? If the driver has a pre callback,
> it is the driver's responsibility to check permissions and any preconditions;
> otherwise the CAP_NET_ADMIN permission will be checked.
> 
> @Zack / @Rodrigo thoughts?
> @Zack Will this work for your usecase?
> 
> yaml
> +    dump:
> +        pre: drm-ras-nl-pre-list-nodes
> 
> 
> drm_ras.c :
> 
> +       if (node->pre_list_nodes)
> +                return node->pre_list_nodes(node);
> +
> +        return check_permissions(cb->skb);  //Checks creds
> 
> Thanks
> Riana
> 

I agree that a pre_doit is probably the best solution for this.

Not entirely sure what a driver specific implementation would look like 
yet, but I think that as long as the driver callback has a way to access 
the 'current' task_struct pointer corresponding to the user process then 
this approach seems like the best option from the netlink side.

Since this is mostly a concern for our specific use case, perhaps we can 
incorporate this functionality in our change down the road when we 
expand the interface for telemetry?

Let me know what you think.

Zack

>>
>> or something like that?!
>>
>> perhaps drivers could implement some form of cookie or pre- 
>> authorization with
>> ioctls or sysfs, and then store in the priv?
>>
>> Thoughts?
>> Other options?
>>
>>>
>>> I don't see any such permissions checks in your driver implementation 
>>> which
>>> is understandable given that it may not be necessary for your use cases.
>>> However, this would be a concern for our driver if we were to adopt this
>>> interface.
>>
>> yeap, this case was entirely with admin-perm, so not needed at all...
>> But I see your point and this is really not giving any flexibility to
>> other drivers.
>>

>>>>>
>>>
>>> Thanks,
>>>
>>> Zack
>>>
> 



* Re: [PATCH v3 1/4] drm/ras: Introduce the DRM RAS infrastructure over generic netlink
  2026-01-15 23:39           ` Zack McKevitt
@ 2026-01-16  5:56             ` Riana Tauro
  2026-01-16 20:26               ` Rodrigo Vivi
  0 siblings, 1 reply; 31+ messages in thread
From: Riana Tauro @ 2026-01-16  5:56 UTC (permalink / raw)
  To: Zack McKevitt, Rodrigo Vivi
  Cc: Jakub Kicinski, intel-xe, dri-devel, aravind.iddamsetty,
	anshuman.gupta, joonas.lahtinen, lukas, simona.vetter, airlied,
	pratik.bari, joshua.santosh.ranjan, ashwin.kumar.kulkarni,
	shubham.kumar, Lijo Lazar, Hawking Zhang, David S. Miller,
	Paolo Abeni, Eric Dumazet, netdev, Jeff Hugo



On 1/16/2026 5:09 AM, Zack McKevitt wrote:
> 
> 
> On 1/13/2026 1:20 AM, Riana Tauro wrote:
>>>>>> diff --git a/Documentation/netlink/specs/drm_ras.yaml b/ 
>>>>>> Documentation/netlink/specs/drm_ras.yaml
>>>>>> new file mode 100644
>>>>>> index 000000000000..be0e379c5bc9
>>>>>> --- /dev/null
>>>>>> +++ b/Documentation/netlink/specs/drm_ras.yaml
>>>>>> @@ -0,0 +1,130 @@
>>>>>> +# SPDX-License-Identifier: ((GPL-2.0 WITH Linux-syscall-note) OR 
>>>>>> BSD-3-Clause)
>>>>>> +---
>>>>>> +name: drm-ras
>>>>>> +protocol: genetlink
>>>>>> +uapi-header: drm/drm_ras.h
>>>>>> +
>>>>>> +doc: >-
>>>>>> +  DRM RAS (Reliability, Availability, Serviceability) over 
>>>>>> Generic Netlink.
>>>>>> +  Provides a standardized mechanism for DRM drivers to register 
>>>>>> "nodes"
>>>>>> +  representing hardware/software components capable of reporting 
>>>>>> error counters.
>>>>>> +  Userspace tools can query the list of nodes or individual error 
>>>>>> counters
>>>>>> +  via the Generic Netlink interface.
>>>>>> +
>>>>>> +definitions:
>>>>>> +  -
>>>>>> +    type: enum
>>>>>> +    name: node-type
>>>>>> +    value-start: 1
>>>>>> +    entries: [error-counter]
>>>>>> +    doc: >-
>>>>>> +         Type of the node. Currently, only error-counter nodes are
>>>>>> +         supported, which expose reliability counters for a 
>>>>>> hardware/software
>>>>>> +         component.
>>>>>> +
>>>>>> +attribute-sets:
>>>>>> +  -
>>>>>> +    name: node-attrs
>>>>>> +    attributes:
>>>>>> +      -
>>>>>> +        name: node-id
>>>>>> +        type: u32
>>>>>> +        doc: >-
>>>>>> +             Unique identifier for the node.
>>>>>> +             Assigned dynamically by the DRM RAS core upon 
>>>>>> registration.
>>>>>> +      -
>>>>>> +        name: device-name
>>>>>> +        type: string
>>>>>> +        doc: >-
>>>>>> +             Device name chosen by the driver at registration.
>>>>>> +             Can be a PCI BDF, UUID, or module name if unique.
>>>>>> +      -
>>>>>> +        name: node-name
>>>>>> +        type: string
>>>>>> +        doc: >-
>>>>>> +             Node name chosen by the driver at registration.
>>>>>> +             Can be an IP block name, or any name that identifies 
>>>>>> the
>>>>>> +             RAS node inside the device.
>>>>>> +      -
>>>>>> +        name: node-type
>>>>>> +        type: u32
>>>>>> +        doc: Type of this node, identifying its function.
>>>>>> +        enum: node-type
>>>>>> +  -
>>>>>> +    name: error-counter-attrs
>>>>>> +    attributes:
>>>>>> +      -
>>>>>> +        name: node-id
>>>>>> +        type: u32
>>>>>> +        doc:  Node ID targeted by this error counter operation.
>>>>>> +      -
>>>>>> +        name: error-id
>>>>>> +        type: u32
>>>>>> +        doc: Unique identifier for a specific error counter 
>>>>>> within a node.
>>>>>> +      -
>>>>>> +        name: error-name
>>>>>> +        type: string
>>>>>> +        doc: Name of the error.
>>>>>> +      -
>>>>>> +        name: error-value
>>>>>> +        type: u32
>>>>>> +        doc: Current value of the requested error counter.
>>>>>> +
>>>>>> +operations:
>>>>>> +  list:
>>>>>> +    -
>>>>>> +      name: list-nodes
>>>>>> +      doc: >-
>>>>>> +           Retrieve the full list of currently registered DRM RAS 
>>>>>> nodes.
>>>>>> +           Each node includes its dynamically assigned ID, name, 
>>>>>> and type.
>>>>>> +           **Important:** User space must call this operation 
>>>>>> first to obtain
>>>>>> +           the node IDs. These IDs are required for all subsequent
>>>>>> +           operations on nodes, such as querying error counters.
>>>>
>>>> I am curious about security implications of this design.
>>>
>>> hmm... very good point you are raising here.
>>>
>>> This current design relies entirely on CAP_NET_ADMIN.
>>> No driver would have the flexibility to choose anything differently.
>>> Please notice that the flag admin-perm is hardcoded in this yaml file.
>>>
>>>> If the complete
>>>> list of RAS nodes is visible for any process on the system (and one 
>>>> wants to
>>>> avoid requiring CAP_NET_ADMIN), there should be some way to enforce
>>>> permission checks when performing these operations if desired.
>>>
>>> Right now, there's no way for a driver to avoid requiring
>>> CAP_NET_ADMIN...
>>>
>>> The only way would be for the admin to grant cap_net_admin to the tool with:
>>>
>>> $ sudo setcap cap_net_admin+ep /bin/drm_ras_tool
>>>
>>> but not ideal and not granular anyway...
>>>
>>>>
>>>> For example, this might be implemented in the driver's definition of
>>>> callback functions like query_error_counter; some drivers may want 
>>>> to ensure
>>>> that the process can in fact open the file descriptor corresponding 
>>>> to the
>>>> queried device before serving a netlink request. Is it enough for a 
>>>> driver
>>>> to simply return -EPERM in this case? Any driver that doesn't wish to 
>>>> protect
>>>> its RAS nodes need not implement checks in their callbacks.
>>>
>>> Fair enough. If we want to give the option to the drivers, then we need:
>>>
>>> 1. to first remove all the admin-perm flags below and leave the 
>>> driver to
>>> pick up their policy on when to return something or -EPERM.
>>> 2. Document this security responsibility and list a few possibilities.
>>> 3. In our Xe case here I believe the easiest option is to use 
>>> something like:
>>>
>>> struct scm_creds *creds = NETLINK_CREDS(cb->skb);
>>> if (!gid_eq(creds->gid, GLOBAL_ROOT_GID))
>>>      return -EPERM;
>>
>> The driver currently does not have access to the callback or the 
>> skbuffer. Sending these details as param to driver won't be right as
>> drm_ras needs to handle all the netlink buffers.
>>
>> How about using pre_doit and start callbacks? If the driver has a pre callback,
>> it is the driver's responsibility to check permissions and any preconditions;
>> otherwise the CAP_NET_ADMIN permission will be checked.
>>
>> @Zack / @Rodrigo thoughts?
>> @Zack Will this work for your usecase?
>>
>> yaml
>> +    dump:
>> +        pre: drm-ras-nl-pre-list-nodes
>>
>>
>> drm_ras.c :
>>
>> +       if (node->pre_list_nodes)
>> +                return node->pre_list_nodes(node);
>> +
>> +        return check_permissions(cb->skb);  //Checks creds
>>
>> Thanks
>> Riana
>>
> 
> I agree that a pre_doit is probably the best solution for this.
> 
> Not entirely sure what a driver specific implementation would look like 
> yet, but I think that as long as the driver callback has a way to access 
> the 'current' task_struct pointer corresponding to the user process then 
> this approach seems like the best option from the netlink side.
> 
> Since this is mostly a concern for our specific use case, perhaps we can 
> incorporate this functionality in our change down the road when we 
> expand the interface for telemetry?


Yeah, using pre_doit we can allow the driver to make decisions based on
the private data or driver-specific permissions. However, the netlink layer
will need to check CAP_NET_ADMIN by default when no driver callback is
implemented.
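
To make the fallback concrete, here is a rough user-space sketch, not kernel
code. It assumes a hypothetical per-node pre_check hook; struct caller,
struct ras_node, owner_only() and ras_check_access() are made-up names, and the
has_net_admin flag merely stands in for a real CAP_NET_ADMIN check.

```c
#include <assert.h>
#include <errno.h>
#include <stdbool.h>
#include <stddef.h>

/*
 * Hypothetical description of the requesting process. In the kernel this
 * information would come from the netlink request, not from a plain struct.
 */
struct caller {
	unsigned int uid;
	bool has_net_admin;	/* stand-in for a CAP_NET_ADMIN check */
};

/* Hypothetical node with an optional driver-provided permission hook. */
struct ras_node {
	unsigned int owner_uid;	/* example input for a driver-private policy */
	int (*pre_check)(const struct ras_node *node, const struct caller *c);
};

/* Example driver policy: only the device owner may query this node. */
static int owner_only(const struct ras_node *node, const struct caller *c)
{
	return c->uid == node->owner_uid ? 0 : -EPERM;
}

/* Core-side check: defer to the driver hook if present, else require admin. */
static int ras_check_access(const struct ras_node *node, const struct caller *c)
{
	assert(node && c);
	if (node->pre_check)
		return node->pre_check(node, c);
	return c->has_net_admin ? 0 : -EPERM;
}
```

Note that in this sketch a driver hook fully replaces the default: a caller
with the admin capability is still rejected if the driver policy says so.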

Thank you, you can incorporate the changes when you add telemetry nodes.

For now, I will retain the admin-perm in flags.

I will address the rest of the review comments and send out a new 
revision shortly.

Thank you
Riana


> 
> Let me know what you think.
> 
> Zack
> 
>>>
>>> or something like that?!
>>>
>>> perhaps drivers could implement some form of cookie or pre- 
>>> authorization with
>>> ioctls or sysfs, and then store in the priv?
>>>
>>> Thoughts?
>>> Other options?
>>>
>>>>
>>>> I don't see any such permissions checks in your driver implementation 
>>>> which
>>>> is understandable given that it may not be necessary for your use 
>>>> cases.
>>>> However, this would be a concern for our driver if we were to adopt 
>>>> this
>>>> interface.
>>>
>>> yeap, this case was entirely with admin-perm, so not needed at all...
>>> But I see your point and this is really not giving any flexibility to
>>> other drivers.
>>>
> 
>>>>>>
>>>>
>>>> Thanks,
>>>>
>>>> Zack
>>>>
>>
> 



* Re: [PATCH v3 1/4] drm/ras: Introduce the DRM RAS infrastructure over generic netlink
  2026-01-16  5:56             ` Riana Tauro
@ 2026-01-16 20:26               ` Rodrigo Vivi
  0 siblings, 0 replies; 31+ messages in thread
From: Rodrigo Vivi @ 2026-01-16 20:26 UTC (permalink / raw)
  To: Riana Tauro
  Cc: Zack McKevitt, Jakub Kicinski, intel-xe, dri-devel,
	aravind.iddamsetty, anshuman.gupta, joonas.lahtinen, lukas,
	simona.vetter, airlied, pratik.bari, joshua.santosh.ranjan,
	ashwin.kumar.kulkarni, shubham.kumar, Lijo Lazar, Hawking Zhang,
	David S. Miller, Paolo Abeni, Eric Dumazet, netdev, Jeff Hugo

On Fri, Jan 16, 2026 at 11:26:36AM +0530, Riana Tauro wrote:
> 
> 
> On 1/16/2026 5:09 AM, Zack McKevitt wrote:
> > 
> > 
> > On 1/13/2026 1:20 AM, Riana Tauro wrote:
> > > > > > > diff --git
> > > > > > > a/Documentation/netlink/specs/drm_ras.yaml b/
> > > > > > > Documentation/netlink/specs/drm_ras.yaml
> > > > > > > new file mode 100644
> > > > > > > index 000000000000..be0e379c5bc9
> > > > > > > --- /dev/null
> > > > > > > +++ b/Documentation/netlink/specs/drm_ras.yaml
> > > > > > > @@ -0,0 +1,130 @@
> > > > > > > +# SPDX-License-Identifier: ((GPL-2.0 WITH
> > > > > > > Linux-syscall-note) OR BSD-3-Clause)
> > > > > > > +---
> > > > > > > +name: drm-ras
> > > > > > > +protocol: genetlink
> > > > > > > +uapi-header: drm/drm_ras.h
> > > > > > > +
> > > > > > > +doc: >-
> > > > > > > +  DRM RAS (Reliability, Availability,
> > > > > > > Serviceability) over Generic Netlink.
> > > > > > > +  Provides a standardized mechanism for DRM drivers
> > > > > > > to register "nodes"
> > > > > > > +  representing hardware/software components capable
> > > > > > > of reporting error counters.
> > > > > > > +  Userspace tools can query the list of nodes or
> > > > > > > individual error counters
> > > > > > > +  via the Generic Netlink interface.
> > > > > > > +
> > > > > > > +definitions:
> > > > > > > +  -
> > > > > > > +    type: enum
> > > > > > > +    name: node-type
> > > > > > > +    value-start: 1
> > > > > > > +    entries: [error-counter]
> > > > > > > +    doc: >-
> > > > > > > +         Type of the node. Currently, only error-counter nodes are
> > > > > > > +         supported, which expose reliability
> > > > > > > counters for a hardware/software
> > > > > > > +         component.
> > > > > > > +
> > > > > > > +attribute-sets:
> > > > > > > +  -
> > > > > > > +    name: node-attrs
> > > > > > > +    attributes:
> > > > > > > +      -
> > > > > > > +        name: node-id
> > > > > > > +        type: u32
> > > > > > > +        doc: >-
> > > > > > > +             Unique identifier for the node.
> > > > > > > +             Assigned dynamically by the DRM RAS
> > > > > > > core upon registration.
> > > > > > > +      -
> > > > > > > +        name: device-name
> > > > > > > +        type: string
> > > > > > > +        doc: >-
> > > > > > > +             Device name chosen by the driver at registration.
> > > > > > > +             Can be a PCI BDF, UUID, or module name if unique.
> > > > > > > +      -
> > > > > > > +        name: node-name
> > > > > > > +        type: string
> > > > > > > +        doc: >-
> > > > > > > +             Node name chosen by the driver at registration.
> > > > > > > +             Can be an IP block name, or any name
> > > > > > > that identifies the
> > > > > > > +             RAS node inside the device.
> > > > > > > +      -
> > > > > > > +        name: node-type
> > > > > > > +        type: u32
> > > > > > > +        doc: Type of this node, identifying its function.
> > > > > > > +        enum: node-type
> > > > > > > +  -
> > > > > > > +    name: error-counter-attrs
> > > > > > > +    attributes:
> > > > > > > +      -
> > > > > > > +        name: node-id
> > > > > > > +        type: u32
> > > > > > > +        doc:  Node ID targeted by this error counter operation.
> > > > > > > +      -
> > > > > > > +        name: error-id
> > > > > > > +        type: u32
> > > > > > > +        doc: Unique identifier for a specific error
> > > > > > > counter within a node.
> > > > > > > +      -
> > > > > > > +        name: error-name
> > > > > > > +        type: string
> > > > > > > +        doc: Name of the error.
> > > > > > > +      -
> > > > > > > +        name: error-value
> > > > > > > +        type: u32
> > > > > > > +        doc: Current value of the requested error counter.
> > > > > > > +
> > > > > > > +operations:
> > > > > > > +  list:
> > > > > > > +    -
> > > > > > > +      name: list-nodes
> > > > > > > +      doc: >-
> > > > > > > +           Retrieve the full list of currently
> > > > > > > registered DRM RAS nodes.
> > > > > > > +           Each node includes its dynamically
> > > > > > > assigned ID, name, and type.
> > > > > > > +           **Important:** User space must call this
> > > > > > > operation first to obtain
> > > > > > > +           the node IDs. These IDs are required for all subsequent
> > > > > > > +           operations on nodes, such as querying error counters.
> > > > > 
> > > > > I am curious about security implications of this design.
> > > > 
> > > > hmm... very good point you are raising here.
> > > > 
> > > > This current design relies entirely on CAP_NET_ADMIN.
> > > > No driver would have the flexibility to choose anything differently.
> > > > Please notice that the flag admin-perm is hardcoded in this yaml file.
> > > > 
> > > > > If the complete
> > > > > list of RAS nodes is visible for any process on the system
> > > > > (and one wants to
> > > > > avoid requiring CAP_NET_ADMIN), there should be some way to enforce
> > > > > permission checks when performing these operations if desired.
> > > > 
> > > > Right now, there's no way for the driver to choose to avoid
> > > > requiring CAP_NET_ADMIN...
> > > > 
> > > > The only way would be for the admin to give cap_net_admin to the tool with:
> > > > 
> > > > $ sudo setcap cap_net_admin+ep /bin/drm_ras_tool
> > > > 
> > > > but not ideal and not granular anyway...
> > > > 
> > > > > 
> > > > > For example, this might be implemented in the driver's definition of
> > > > > callback functions like query_error_counter; some drivers
> > > > > may want to ensure
> > > > > that the process can in fact open the file descriptor
> > > > > corresponding to the
> > > > > queried device before serving a netlink request. Is it
> > > > > enough for a driver
> > > > > to simply return -EPERM in this case? Any driver that doesnt
> > > > > wish to protect
> > > > > its RAS nodes need not implement checks in their callbacks.
> > > > 
> > > > Fair enough. If we want to give the option to the drivers, then we need:
> > > > 
> > > > 1. to first remove all the admin-perm flags below and leave the
> > > > driver to
> > > > pick up their policy on when to return something or -EPERM.
> > > > 2. Document this security responsibility and list a few possibilities.
> > > > 3. In our Xe case here I believe the easiest option is to use
> > > > something like:
> > > > 
> > > > struct scm_creds *creds = NETLINK_CREDS(cb->skb);
> > > > if (!gid_eq(creds->gid, GLOBAL_ROOT_GID))
> > > >      return -EPERM
> > > 
> > > The driver currently does not have access to the callback or the
> > > skbuffer. Sending these details as param to driver won't be right as
> > > drm_ras needs to handle all the netlink buffers.
> > > 
> > > How about using pre_doit & start calls? If driver has a pre callback
> > > , it's the responsibility of the driver to check permissions/any-pre
> > > conditions, else the CAP_NET_ADMIN permission will be checked.
> > > 
> > > @Zack / @Rodrigo thoughts?
> > > @Zack Will this work for your usecase?
> > > 
> > > yaml
> > > +    dump:
> > > +        pre: drm-ras-nl-pre-list-nodes
> > > 
> > > 
> > > drm_ras.c :
> > > 
> > > +       if (node->pre_list_nodes)
> > > +                return node->pre_list_nodes(node);
> > > +
> > > +        return check_permissions(cb->skb);  //Checks creds
> > > 
> > > Thanks
> > > Riana
> > > 
> > 
> > I agree that a pre_doit is probably the best solution for this.
> > 
> > Not entirely sure what a driver specific implementation would look like
> > yet, but I think that as long as the driver callback has a way to access
> > the 'current' task_struct pointer corresponding to the user process then
> > this approach seems like the best option from the netlink side.
> > 
> > Since this is mostly a concern for our specific use case, perhaps we can
> > incorporate this functionality in our change down the road when we
> > expand the interface for telemetry?

Yes, as it can be changed transparently, let's do that...

> 
> 
> Yeah using pre_doit we can allow driver to make decisions based on
> the private data or driver specific permissions. However we will need to
> check the CAP_NET_ADMIN when no driver callback is implemented in the
> netlink layer as a default .
> 
> Thank you, you can incorporate the changes when you add telemetry nodes.
> 
> For now, I will retain the admin-perm in flags.

Cool then. When they come with their use case, we remove it and enforce
the check in the pre_doit as well.

ack..

> 
> I will address the rest of the review comments and send out a new revision
> shortly.
> 
> Thank you
> Riana
> 
> 
> > 
> > Let me know what you think.
> > 
> > Zack
> > 
> > > > 
> > > > or something like that?!
> > > > 
> > > > perhaps drivers could implement some form of cookie or pre-
> > > > authorization with
> > > > ioctls or sysfs, and then store in the priv?
> > > > 
> > > > Thoughts?
> > > > Other options?
> > > > 
> > > > > 
> > > > > I dont see any such permissions checks in your driver
> > > > > implementation which
> > > > > is understandable given that it may not be necessary for
> > > > > your use cases.
> > > > > However, this would be a concern for our driver if we were
> > > > > to adopt this
> > > > > interface.
> > > > 
> > > > yeap, this case was entirely with admin-perm, so not needed at all...
> > > > But I see your point and this is really not giving any flexibility to
> > > > other drivers.
> > > > 
> > 
> > > > > > > 
> > > > > 
> > > > > Thanks,
> > > > > 
> > > > > Zack
> > > > > 
> > > 
> > 
> 


* [PATCH v3 2/4] drm/xe/xe_drm_ras: Add support for drm ras
  2025-12-05  8:39 [PATCH v3 0/4] Introduce DRM_RAS using generic netlink for RAS Riana Tauro
  2025-12-05  8:39 ` [PATCH v3 1/4] drm/ras: Introduce the DRM RAS infrastructure over generic netlink Riana Tauro
@ 2025-12-05  8:39 ` Riana Tauro
  2025-12-09  8:22   ` Raag Jadav
  2025-12-09 21:57   ` Rodrigo Vivi
  2025-12-05  8:39 ` [PATCH v3 3/4] drm/xe/xe_hw_error: Add support for GT hardware errors Riana Tauro
                   ` (6 subsequent siblings)
  8 siblings, 2 replies; 31+ messages in thread
From: Riana Tauro @ 2025-12-05  8:39 UTC (permalink / raw)
  To: intel-xe, dri-devel
  Cc: aravind.iddamsetty, anshuman.gupta, rodrigo.vivi, joonas.lahtinen,
	lukas, simona.vetter, airlied, pratik.bari, joshua.santosh.ranjan,
	ashwin.kumar.kulkarni, shubham.kumar, Riana Tauro

Allocate correctable, nonfatal and fatal nodes per xe device.
Each node contains error classes, counters and respective
query counter functions.

Add basic functionality to create and register drm nodes.
The operations below can be performed using the Generic Netlink DRM RAS interface:

List Nodes:

$ sudo ynl --family drm_ras  --dump list-nodes
[{'device-name': '0000:03:00.0',
  'node-id': 0,
  'node-name': 'correctable-errors',
  'node-type': 'error-counter'},
 {'device-name': '0000:03:00.0',
  'node-id': 1,
  'node-name': 'nonfatal-errors',
  'node-type': 'error-counter'},
 {'device-name': '0000:03:00.0',
  'node-id': 2,
  'node-name': 'fatal-errors',
  'node-type': 'error-counter'}]

Get Error counters:

$ sudo ynl --family drm_ras  --dump get-error-counters --json '{"node-id":1}'
[{'error-id': 1, 'error-name': 'Core Compute Error', 'error-value': 0},
 {'error-id': 2, 'error-name': 'SOC Internal Error', 'error-value': 0}]

Query Error counter:

$ sudo ynl --family drm_ras --do query-error-counter  --json '{"node-id":1, "error-id":1}'
{'error-id': 1, 'error-name': 'Core Compute Error', 'error-value': 0}

Signed-off-by: Riana Tauro <riana.tauro@intel.com>
---
v2: Add IDs and names as uAPI (Rodrigo)
    Add documentation
    Modify commit message
---
 drivers/gpu/drm/xe/Makefile           |   1 +
 drivers/gpu/drm/xe/xe_device_types.h  |   4 +
 drivers/gpu/drm/xe/xe_drm_ras.c       | 199 ++++++++++++++++++++++++++
 drivers/gpu/drm/xe/xe_drm_ras.h       |  12 ++
 drivers/gpu/drm/xe/xe_drm_ras_types.h |  40 ++++++
 drivers/gpu/drm/xe/xe_hw_error.c      |  64 ++++-----
 include/uapi/drm/xe_drm.h             |  82 +++++++++++
 7 files changed, 368 insertions(+), 34 deletions(-)
 create mode 100644 drivers/gpu/drm/xe/xe_drm_ras.c
 create mode 100644 drivers/gpu/drm/xe/xe_drm_ras.h
 create mode 100644 drivers/gpu/drm/xe/xe_drm_ras_types.h

diff --git a/drivers/gpu/drm/xe/Makefile b/drivers/gpu/drm/xe/Makefile
index a7e13a676f7d..bc417ef19280 100644
--- a/drivers/gpu/drm/xe/Makefile
+++ b/drivers/gpu/drm/xe/Makefile
@@ -41,6 +41,7 @@ xe-y += xe_bb.o \
 	xe_device_sysfs.o \
 	xe_dma_buf.o \
 	xe_drm_client.o \
+	xe_drm_ras.o \
 	xe_eu_stall.o \
 	xe_exec.o \
 	xe_exec_queue.o \
diff --git a/drivers/gpu/drm/xe/xe_device_types.h b/drivers/gpu/drm/xe/xe_device_types.h
index 9de73353223f..d6ea275700e1 100644
--- a/drivers/gpu/drm/xe/xe_device_types.h
+++ b/drivers/gpu/drm/xe/xe_device_types.h
@@ -13,6 +13,7 @@
 #include <drm/ttm/ttm_device.h>
 
 #include "xe_devcoredump_types.h"
+#include "xe_drm_ras_types.h"
 #include "xe_heci_gsc.h"
 #include "xe_late_bind_fw_types.h"
 #include "xe_lmtt_types.h"
@@ -361,6 +362,9 @@ struct xe_device {
 		bool oob_initialized;
 	} wa_active;
 
+	/** @ras: ras structure for device */
+	struct xe_drm_ras ras;
+
 	/** @survivability: survivability information for device */
 	struct xe_survivability survivability;
 
diff --git a/drivers/gpu/drm/xe/xe_drm_ras.c b/drivers/gpu/drm/xe/xe_drm_ras.c
new file mode 100644
index 000000000000..764b14b1edf8
--- /dev/null
+++ b/drivers/gpu/drm/xe/xe_drm_ras.c
@@ -0,0 +1,199 @@
+// SPDX-License-Identifier: MIT
+/*
+ * Copyright © 2025 Intel Corporation
+ */
+
+#include <drm/drm_managed.h>
+#include <drm/drm_ras.h>
+#include <linux/bitmap.h>
+
+#include "xe_device.h"
+#include "xe_drm_ras.h"
+
+static const char * const errors[] = DRM_XE_RAS_ERROR_CLASS_NAMES;
+static const char * const error_severity[] = DRM_XE_RAS_ERROR_SEVERITY_NAMES;
+
+static int hw_query_error_counter(struct xe_drm_ras_counter *info,
+				  u32 error_id, const char **name, u32 *val)
+{
+	if (error_id >= DRM_XE_RAS_ERROR_CLASS_MAX)
+		return -EINVAL;
+
+	if (!info[error_id].name)
+		return -ENOENT;
+
+	*name = info[error_id].name;
+	*val = atomic64_read(&info[error_id].counter);
+
+	return 0;
+}
+
+static int query_non_fatal_error_counters(struct drm_ras_node *ep,
+					  u32 error_id, const char **name,
+					  u32 *val)
+{
+	struct xe_device *xe = ep->priv;
+	struct xe_drm_ras *ras = &xe->ras;
+	struct xe_drm_ras_counter *info = ras->info[DRM_XE_RAS_ERROR_NONFATAL];
+
+	return hw_query_error_counter(info, error_id, name, val);
+}
+
+static int query_fatal_error_counters(struct drm_ras_node *ep,
+				      u32 error_id, const char **name,
+				      u32 *val)
+{
+	struct xe_device *xe = ep->priv;
+	struct xe_drm_ras *ras = &xe->ras;
+	struct xe_drm_ras_counter *info = ras->info[DRM_XE_RAS_ERROR_FATAL];
+
+	return hw_query_error_counter(info, error_id, name, val);
+}
+
+static int query_correctable_error_counters(struct drm_ras_node *ep,
+					    u32 error_id, const char **name,
+					    u32 *val)
+{
+	struct xe_device *xe = ep->priv;
+	struct xe_drm_ras *ras = &xe->ras;
+	struct xe_drm_ras_counter *info = ras->info[DRM_XE_RAS_ERROR_CORRECTABLE];
+
+	return hw_query_error_counter(info, error_id, name, val);
+}
+
+static struct xe_drm_ras_counter *allocate_and_copy_counters(struct xe_device *xe,
+							     int count)
+{
+	struct xe_drm_ras_counter *counter;
+	int i;
+
+	counter = drmm_kzalloc(&xe->drm, count * sizeof(struct xe_drm_ras_counter), GFP_KERNEL);
+	if (!counter)
+		return ERR_PTR(-ENOMEM);
+
+	for (i = 0; i < count; i++) {
+		if (!errors[i])
+			continue;
+
+		counter[i].name = errors[i];
+		atomic64_set(&counter[i].counter, 0);
+	}
+
+	return counter;
+}
+
+static int assign_node_params(struct xe_device *xe, struct drm_ras_node *node,
+			      const enum drm_xe_ras_error_severity severity)
+{
+	struct xe_drm_ras *ras = &xe->ras;
+	int count = 0, ret = 0;
+
+	count = DRM_XE_RAS_ERROR_CLASS_MAX;
+	node->error_counter_range.first = DRM_XE_RAS_ERROR_CORE_COMPUTE;
+	node->error_counter_range.last = DRM_XE_RAS_ERROR_CLASS_MAX - 1;
+
+	ras->info[severity] = allocate_and_copy_counters(xe, count);
+	if (IS_ERR(ras->info[severity]))
+		return PTR_ERR(ras->info[severity]);
+
+	switch (severity) {
+	case DRM_XE_RAS_ERROR_CORRECTABLE:
+		node->query_error_counter = query_correctable_error_counters;
+		break;
+	case DRM_XE_RAS_ERROR_NONFATAL:
+		node->query_error_counter = query_non_fatal_error_counters;
+		break;
+	case DRM_XE_RAS_ERROR_FATAL:
+		node->query_error_counter = query_fatal_error_counters;
+		break;
+	default:
+		break;
+	}
+
+	return ret;
+}
+
+static int register_nodes(struct xe_device *xe)
+{
+	struct pci_dev *pdev = to_pci_dev(xe->drm.dev);
+	struct xe_drm_ras *ras = &xe->ras;
+	const char *device_name;
+	int i = 0, ret;
+
+	device_name = kasprintf(GFP_KERNEL, "%04x:%02x:%02x.%d",
+				pci_domain_nr(pdev->bus), pdev->bus->number,
+				PCI_SLOT(pdev->devfn), PCI_FUNC(pdev->devfn));
+
+	for (i = 0; i < DRM_XE_RAS_ERROR_SEVERITY_MAX; i++) {
+		struct drm_ras_node *node = &ras->node[i];
+
+		node->device_name = device_name;
+		node->node_name = error_severity[i];
+		node->type = DRM_RAS_NODE_TYPE_ERROR_COUNTER;
+		node->priv = xe;
+
+		ret = assign_node_params(xe, node, i);
+		if (ret)
+			return ret;
+
+		ret = drm_ras_node_register(node);
+		if (ret) {
+			drm_err(&xe->drm, "Failed to register drm ras tile node\n");
+			return ret;
+		}
+	}
+
+	return 0;
+}
+
+static void xe_drm_ras_unregister_nodes(void *arg)
+{
+	struct xe_device *xe = arg;
+	struct xe_drm_ras *ras = &xe->ras;
+	int i = 0;
+
+	for (i = 0; i < DRM_XE_RAS_ERROR_SEVERITY_MAX; i++) {
+		struct drm_ras_node *node = &ras->node[i];
+
+		drm_ras_node_unregister(node);
+
+		if (i == 0)
+			kfree(node->device_name);
+	}
+}
+
+/**
+ * xe_drm_ras_allocate_nodes - Allocate drm ras nodes
+ * @xe: xe device instance
+ *
+ * Allocate xe drm ras nodes for all error severities per device
+ *
+ * Return: 0 on success, error code on failure
+ */
+int xe_drm_ras_allocate_nodes(struct xe_device *xe)
+{
+	struct xe_drm_ras *ras = &xe->ras;
+	struct drm_ras_node *node;
+	int err;
+
+	node = drmm_kzalloc(&xe->drm, DRM_XE_RAS_ERROR_SEVERITY_MAX * sizeof(struct drm_ras_node),
+			    GFP_KERNEL);
+	if (!node)
+		return -ENOMEM;
+
+	ras->node = node;
+
+	err = register_nodes(xe);
+	if (err) {
+		drm_err(&xe->drm, "Failed to register drm ras node\n");
+		return err;
+	}
+
+	err = devm_add_action_or_reset(xe->drm.dev, xe_drm_ras_unregister_nodes, xe);
+	if (err) {
+		drm_err(&xe->drm, "Failed to add action for xe drm_ras\n");
+		return err;
+	}
+
+	return 0;
+}
diff --git a/drivers/gpu/drm/xe/xe_drm_ras.h b/drivers/gpu/drm/xe/xe_drm_ras.h
new file mode 100644
index 000000000000..6272b5da4e6d
--- /dev/null
+++ b/drivers/gpu/drm/xe/xe_drm_ras.h
@@ -0,0 +1,12 @@
+/* SPDX-License-Identifier: MIT */
+/*
+ * Copyright © 2025 Intel Corporation
+ */
+#ifndef XE_DRM_RAS_H_
+#define XE_DRM_RAS_H_
+
+struct xe_device;
+
+int xe_drm_ras_allocate_nodes(struct xe_device *xe);
+
+#endif
diff --git a/drivers/gpu/drm/xe/xe_drm_ras_types.h b/drivers/gpu/drm/xe/xe_drm_ras_types.h
new file mode 100644
index 000000000000..409d6fa54a23
--- /dev/null
+++ b/drivers/gpu/drm/xe/xe_drm_ras_types.h
@@ -0,0 +1,40 @@
+/* SPDX-License-Identifier: MIT */
+/*
+ * Copyright © 2025 Intel Corporation
+ */
+
+#ifndef _XE_DRM_RAS_TYPES_H_
+#define _XE_DRM_RAS_TYPES_H_
+
+#include <drm/xe_drm.h>
+#include <linux/atomic.h>
+
+struct drm_ras_node;
+
+/**
+ * struct xe_drm_ras_counter - xe ras counter
+ *
+ * This structure contains error class and counter information
+ */
+struct xe_drm_ras_counter {
+	/** @name: error class name */
+	const char *name;
+	/** @counter: count of error */
+	atomic64_t counter;
+};
+
+/**
+ * struct xe_drm_ras - xe drm ras structure
+ *
+ * This structure has details of error counters
+ */
+struct xe_drm_ras {
+	/** @node: DRM RAS node */
+	struct drm_ras_node *node;
+
+	/** @info: info array for all types of errors */
+	struct xe_drm_ras_counter *info[DRM_XE_RAS_ERROR_SEVERITY_MAX];
+
+};
+
+#endif
diff --git a/drivers/gpu/drm/xe/xe_hw_error.c b/drivers/gpu/drm/xe/xe_hw_error.c
index 8c65291f36fc..d63078d00b56 100644
--- a/drivers/gpu/drm/xe/xe_hw_error.c
+++ b/drivers/gpu/drm/xe/xe_hw_error.c
@@ -10,20 +10,14 @@
 #include "regs/xe_irq_regs.h"
 
 #include "xe_device.h"
+#include "xe_drm_ras.h"
 #include "xe_hw_error.h"
 #include "xe_mmio.h"
 #include "xe_survivability_mode.h"
 
 #define  HEC_UNCORR_FW_ERR_BITS 4
 extern struct fault_attr inject_csc_hw_error;
-
-/* Error categories reported by hardware */
-enum hardware_error {
-	HARDWARE_ERROR_CORRECTABLE = 0,
-	HARDWARE_ERROR_NONFATAL = 1,
-	HARDWARE_ERROR_FATAL = 2,
-	HARDWARE_ERROR_MAX,
-};
+static const char * const error_severity[] = DRM_XE_RAS_ERROR_SEVERITY_NAMES;
 
 static const char * const hec_uncorrected_fw_errors[] = {
 	"Fatal",
@@ -32,20 +26,6 @@ static const char * const hec_uncorrected_fw_errors[] = {
 	"Data Corruption"
 };
 
-static const char *hw_error_to_str(const enum hardware_error hw_err)
-{
-	switch (hw_err) {
-	case HARDWARE_ERROR_CORRECTABLE:
-		return "CORRECTABLE";
-	case HARDWARE_ERROR_NONFATAL:
-		return "NONFATAL";
-	case HARDWARE_ERROR_FATAL:
-		return "FATAL";
-	default:
-		return "UNKNOWN";
-	}
-}
-
 static bool fault_inject_csc_hw_error(void)
 {
 	return IS_ENABLED(CONFIG_DEBUG_FS) && should_fail(&inject_csc_hw_error, 1);
@@ -62,9 +42,10 @@ static void csc_hw_error_work(struct work_struct *work)
 		drm_err(&xe->drm, "Failed to enable runtime survivability mode\n");
 }
 
-static void csc_hw_error_handler(struct xe_tile *tile, const enum hardware_error hw_err)
+static void csc_hw_error_handler(struct xe_tile *tile,
+				 const enum drm_xe_ras_error_severity severity)
 {
-	const char *hw_err_str = hw_error_to_str(hw_err);
+	const char *severity_str = error_severity[severity];
 	struct xe_device *xe = tile_to_xe(tile);
 	struct xe_mmio *mmio = &tile->mmio;
 	u32 base, err_bit, err_src;
@@ -78,7 +59,7 @@ static void csc_hw_error_handler(struct xe_tile *tile, const enum hardware_error
 	err_src = xe_mmio_read32(mmio, HEC_UNCORR_ERR_STATUS(base));
 	if (!err_src) {
 		drm_err_ratelimited(&xe->drm, HW_ERR "Tile%d reported HEC_ERR_STATUS_%s blank\n",
-				    tile->id, hw_err_str);
+				    tile->id, severity_str);
 		return;
 	}
 
@@ -87,7 +68,7 @@ static void csc_hw_error_handler(struct xe_tile *tile, const enum hardware_error
 		for_each_set_bit(err_bit, &fw_err, HEC_UNCORR_FW_ERR_BITS) {
 			drm_err_ratelimited(&xe->drm, HW_ERR
 					    "%s: HEC Uncorrected FW %s error reported, bit[%d] is set\n",
-					     hw_err_str, hec_uncorrected_fw_errors[err_bit],
+					     severity_str, hec_uncorrected_fw_errors[err_bit],
 					     err_bit);
 
 			schedule_work(&tile->csc_hw_error_work);
@@ -97,9 +78,9 @@ static void csc_hw_error_handler(struct xe_tile *tile, const enum hardware_error
 	xe_mmio_write32(mmio, HEC_UNCORR_ERR_STATUS(base), err_src);
 }
 
-static void hw_error_source_handler(struct xe_tile *tile, const enum hardware_error hw_err)
+static void hw_error_source_handler(struct xe_tile *tile, enum drm_xe_ras_error_severity severity)
 {
-	const char *hw_err_str = hw_error_to_str(hw_err);
+	const char *severity_str = error_severity[severity];
 	struct xe_device *xe = tile_to_xe(tile);
 	unsigned long flags;
 	u32 err_src;
@@ -108,17 +89,17 @@ static void hw_error_source_handler(struct xe_tile *tile, const enum hardware_er
 		return;
 
 	spin_lock_irqsave(&xe->irq.lock, flags);
-	err_src = xe_mmio_read32(&tile->mmio, DEV_ERR_STAT_REG(hw_err));
+	err_src = xe_mmio_read32(&tile->mmio, DEV_ERR_STAT_REG(severity));
 	if (!err_src) {
 		drm_err_ratelimited(&xe->drm, HW_ERR "Tile%d reported DEV_ERR_STAT_%s blank!\n",
-				    tile->id, hw_err_str);
+				    tile->id, severity_str);
 		goto unlock;
 	}
 
 	if (err_src & XE_CSC_ERROR)
-		csc_hw_error_handler(tile, hw_err);
+		csc_hw_error_handler(tile, severity);
 
-	xe_mmio_write32(&tile->mmio, DEV_ERR_STAT_REG(hw_err), err_src);
+	xe_mmio_write32(&tile->mmio, DEV_ERR_STAT_REG(severity), err_src);
 
 unlock:
 	spin_unlock_irqrestore(&xe->irq.lock, flags);
@@ -136,16 +117,30 @@ static void hw_error_source_handler(struct xe_tile *tile, const enum hardware_er
  */
 void xe_hw_error_irq_handler(struct xe_tile *tile, const u32 master_ctl)
 {
-	enum hardware_error hw_err;
+	u32 hw_err;
 
 	if (fault_inject_csc_hw_error())
 		schedule_work(&tile->csc_hw_error_work);
 
-	for (hw_err = 0; hw_err < HARDWARE_ERROR_MAX; hw_err++)
+	for (hw_err = 0; hw_err < DRM_XE_RAS_ERROR_SEVERITY_MAX; hw_err++)
 		if (master_ctl & ERROR_IRQ(hw_err))
 			hw_error_source_handler(tile, hw_err);
 }
 
+static int hw_error_info_init(struct xe_device *xe)
+{
+	int ret;
+
+	if (xe->info.platform != XE_PVC)
+		return 0;
+
+	ret = xe_drm_ras_allocate_nodes(xe);
+	if (ret)
+		return ret;
+
+	return 0;
+}
+
 /*
  * Process hardware errors during boot
  */
@@ -178,5 +173,6 @@ void xe_hw_error_init(struct xe_device *xe)
 
 	INIT_WORK(&tile->csc_hw_error_work, csc_hw_error_work);
 
+	hw_error_info_init(xe);
 	process_hw_errors(xe);
 }
diff --git a/include/uapi/drm/xe_drm.h b/include/uapi/drm/xe_drm.h
index 0d99bb0cd20a..3f6c38908b70 100644
--- a/include/uapi/drm/xe_drm.h
+++ b/include/uapi/drm/xe_drm.h
@@ -2294,6 +2294,88 @@ struct drm_xe_vm_query_mem_range_attr {
 
 };
 
+/**
+ * DOC: Xe DRM RAS
+ *
+ * The enums and strings defined below map to the attributes of the DRM RAS Netlink Interface.
+ * Refer to Documentation/netlink/specs/drm_ras.yaml for complete interface specification.
+ *
+ * Node Registration
+ * =================
+ *
+ * The driver registers DRM RAS nodes for each error severity level.
+ * enum drm_xe_ras_error_severity defines the node-id, while DRM_XE_RAS_ERROR_SEVERITY_NAMES maps
+ * node-id to node-name.
+ *
+ * Error Classification
+ * ====================
+ *
+ * Each node contains a list of error counters. Each error is identified by an error-id and
+ * an error-name. enum drm_xe_ras_error_class defines the error-id, while
+ * DRM_XE_RAS_ERROR_CLASS_NAMES maps error-id to error-name.
+ *
+ * User Interface
+ * ==============
+ *
+ * To retrieve error values of an error counter, userspace applications should
+ * follow the below steps:
+ *
+ * 1. Use command LIST_NODES to enumerate all available nodes
+ * 2. Select node by node-id or node-name
+ * 3. Use command GET_ERROR_COUNTERS to list errors of specific node
+ * 4. Query specific error values using either error-id or error-name
+ *
+ * .. code-block:: C
+ *
+ *	// Lookup tables for ID-to-name resolution
+ *	static const char *nodes[] = DRM_XE_RAS_ERROR_SEVERITY_NAMES;
+ *	static const char *errors[] = DRM_XE_RAS_ERROR_CLASS_NAMES;
+ *
+ */
+
+/**
+ * enum drm_xe_ras_error_severity - Supported drm ras error severity.
+ */
+enum drm_xe_ras_error_severity {
+	/** @DRM_XE_RAS_ERROR_CORRECTABLE: Correctable Error */
+	DRM_XE_RAS_ERROR_CORRECTABLE = 0,
+	/** @DRM_XE_RAS_ERROR_NONFATAL: Non fatal Error */
+	DRM_XE_RAS_ERROR_NONFATAL,
+	/** @DRM_XE_RAS_ERROR_FATAL: Fatal error */
+	DRM_XE_RAS_ERROR_FATAL,
+	/** @DRM_XE_RAS_ERROR_SEVERITY_MAX: Max severity */
+	DRM_XE_RAS_ERROR_SEVERITY_MAX, /* non-ABI */
+};
+
+/**
+ * enum drm_xe_ras_error_class - Supported drm ras error classes.
+ */
+enum drm_xe_ras_error_class {
+	/** @DRM_XE_RAS_ERROR_CORE_COMPUTE: GT and Media Error */
+	DRM_XE_RAS_ERROR_CORE_COMPUTE = 1,
+	/** @DRM_XE_RAS_ERROR_SOC_INTERNAL: SOC Error */
+	DRM_XE_RAS_ERROR_SOC_INTERNAL,
+	/** @DRM_XE_RAS_ERROR_CLASS_MAX: Max Error */
+	DRM_XE_RAS_ERROR_CLASS_MAX,	/* non-ABI */
+};
+
+/*
+ * Error severity to name mapping.
+ */
+#define DRM_XE_RAS_ERROR_SEVERITY_NAMES {				\
+	[DRM_XE_RAS_ERROR_CORRECTABLE] = "correctable-errors",		\
+	[DRM_XE_RAS_ERROR_NONFATAL] = "nonfatal-errors",		\
+	[DRM_XE_RAS_ERROR_FATAL] = "fatal-errors",			\
+}
+
+/*
+ * Error class to name mapping.
+ */
+#define DRM_XE_RAS_ERROR_CLASS_NAMES {					\
+	[DRM_XE_RAS_ERROR_CORE_COMPUTE] =  "Core Compute Error",	\
+	[DRM_XE_RAS_ERROR_SOC_INTERNAL] =  "SOC Internal Error",	\
+}
+
 #if defined(__cplusplus)
 }
 #endif
-- 
2.47.1



* Re: [PATCH v3 2/4] drm/xe/xe_drm_ras: Add support for drm ras
  2025-12-05  8:39 ` [PATCH v3 2/4] drm/xe/xe_drm_ras: Add support for drm ras Riana Tauro
@ 2025-12-09  8:22   ` Raag Jadav
  2026-01-09  8:08     ` Riana Tauro
  2025-12-09 21:57   ` Rodrigo Vivi
  1 sibling, 1 reply; 31+ messages in thread
From: Raag Jadav @ 2025-12-09  8:22 UTC (permalink / raw)
  To: Riana Tauro
  Cc: intel-xe, dri-devel, aravind.iddamsetty, anshuman.gupta,
	rodrigo.vivi, joonas.lahtinen, lukas, simona.vetter, airlied,
	pratik.bari, joshua.santosh.ranjan, ashwin.kumar.kulkarni,
	shubham.kumar

On Fri, Dec 05, 2025 at 02:09:34PM +0530, Riana Tauro wrote:
> Allocate correctable, nonfatal and fatal nodes per xe device.
> Each node contains error classes, counters and respective
> query counter functions.
> 
> Add basic functionality to create and register drm nodes.
> Below operations can be performed using Generic netlink DRM RAS interface
> 
> List Nodes:
> 
> $ sudo ynl --family drm_ras  --dump list-nodes
> [{'device-name': '0000:03:00.0',
>   'node-id': 0,
>   'node-name': 'correctable-errors',
>   'node-type': 'error-counter'},
>  {'device-name': '0000:03:00.0',
>   'node-id': 1,
>   'node-name': 'nonfatal-errors',
>   'node-type': 'error-counter'},
>  {'device-name': '0000:03:00.0',
>   'node-id': 2,
>   'node-name': 'fatal-errors',
>   'node-type': 'error-counter'}]
> 
> Get Error counters:
> 
> $ sudo ynl --family drm_ras  --dump get-error-counters --json '{"node-id":1}'
> [{'error-id': 1, 'error-name': 'Core Compute Error', 'error-value': 0},
>  {'error-id': 2, 'error-name': 'SOC Internal Error', 'error-value': 0}]

Minor bikeshedding: Is there anything like 'SOC External'? If not, perhaps
simply 'SOC' would be sufficient.

> Query Error counter:
> 
> $ sudo ynl --family drm_ras --do query-error-counter  --json '{"node-id":1, "error-id":1}'
> {'error-id': 1, 'error-name': 'Core Compute Error', 'error-value': 0}

One more (sorry): So this means graphics will be a different id? Or do they
overlap? How does it work?


Also,

[*] I'm not well informed about the history here, but the 'error' term
seems slapped onto almost everything. We already know it's RAS, so perhaps
we add it only where it makes sense and try to simplify some of the naming?

...

> diff --git a/drivers/gpu/drm/xe/xe_drm_ras.c b/drivers/gpu/drm/xe/xe_drm_ras.c
> new file mode 100644
> index 000000000000..764b14b1edf8
> --- /dev/null
> +++ b/drivers/gpu/drm/xe/xe_drm_ras.c
> @@ -0,0 +1,199 @@
> +// SPDX-License-Identifier: MIT
> +/*
> + * Copyright © 2025 Intel Corporation
> + */
> +
> +#include <drm/drm_managed.h>
> +#include <drm/drm_ras.h>
> +#include <linux/bitmap.h>
> +
> +#include "xe_device.h"

This inherits some of the debt that should not be there, so let's try to
get away with xe_device_types.h where possible. But please double check.

> +#include "xe_drm_ras.h"

...

> +static struct xe_drm_ras_counter *allocate_and_copy_counters(struct xe_device *xe,
> +							     int count)
> +{
> +	struct xe_drm_ras_counter *counter;
> +	int i;
> +
> +	counter = drmm_kzalloc(&xe->drm, count * sizeof(struct xe_drm_ras_counter), GFP_KERNEL);

Why not drmm_kcalloc()? We get a bonus overflow protection.
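
To illustrate the overflow point: an array-style allocator can reject a
request whose n * size product would wrap, whereas an open-coded
multiplication passes the wrapped (too small) size straight through. The
sketch below is a userspace analogue only; checked_zalloc_array is a
hypothetical name and this is not the drmm_kcalloc() implementation.

```c
#include <assert.h>
#include <stdint.h>
#include <stdlib.h>
#include <string.h>

/*
 * Userspace analogue of a zeroing array allocator: refuse the request
 * when n * size would overflow size_t instead of allocating too little.
 */
static void *checked_zalloc_array(size_t n, size_t size)
{
	void *p;

	if (size != 0 && n > SIZE_MAX / size)
		return NULL;	/* the multiplication would overflow */

	p = malloc(n * size);
	if (p)
		memset(p, 0, n * size);

	return p;
}
```

With a small constant count like the one in this patch the check never
fires, so the gain here is mostly about using the idiomatic helper.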

> +	if (!counter)
> +		return ERR_PTR(-ENOMEM);
> +
> +	for (i = 0; i < count; i++) {
> +		if (!errors[i])
> +			continue;
> +
> +		counter[i].name = errors[i];
> +		atomic64_set(&counter[i].counter, 0);
> +	}
> +
> +	return counter;
> +}
> +
> +static int assign_node_params(struct xe_device *xe, struct drm_ras_node *node,
> +			      const enum drm_xe_ras_error_severity severity)
> +{
> +	struct xe_drm_ras *ras = &xe->ras;
> +	int count = 0, ret = 0;

Redundant initializations, also why do we need them?

> +	count = DRM_XE_RAS_ERROR_CLASS_MAX;
> +	node->error_counter_range.first = DRM_XE_RAS_ERROR_CORE_COMPUTE;
> +	node->error_counter_range.last = DRM_XE_RAS_ERROR_CLASS_MAX - 1;
> +
> +	ras->info[severity] = allocate_and_copy_counters(xe, count);

This looks like count should be derived from first and last, or did I
miss something?
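
A rough sketch of what deriving the count from the range could look like. The struct and helper names below are hypothetical stand-ins, not the drm_ras API. Note that the quoted code appears to allocate DRM_XE_RAS_ERROR_CLASS_MAX entries and index them by error id (leaving slot 0 unused), so a derived count would also need the index offset shown here.

```c
#include <stddef.h>

/* Hypothetical stand-in for the inclusive range stored in the node. */
struct counter_range {
	unsigned int first;
	unsigned int last;
};

/* Number of counters in an inclusive [first, last] range. */
static size_t range_count(const struct counter_range *r)
{
	return (size_t)(r->last - r->first) + 1;
}

/* Map an error id to a zero-based slot in an array of range_count() entries. */
static size_t range_index(const struct counter_range *r, unsigned int id)
{
	return id - r->first;
}
```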

> +	if (IS_ERR(ras->info[severity]))
> +		return PTR_ERR(ras->info[severity]);
> +
> +	switch (severity) {
> +	case DRM_XE_RAS_ERROR_CORRECTABLE:
> +		node->query_error_counter = query_correctable_error_counters;
> +		break;
> +	case DRM_XE_RAS_ERROR_NONFATAL:
> +		node->query_error_counter = query_non_fatal_error_counters;
> +		break;
> +	case DRM_XE_RAS_ERROR_FATAL:
> +		node->query_error_counter = query_fatal_error_counters;
> +		break;
> +	default:

Do we need this?

> +		break;
> +	}
> +
> +	return ret;
> +}
> +
> +static int register_nodes(struct xe_device *xe)
> +{
> +	struct pci_dev *pdev = to_pci_dev(xe->drm.dev);
> +	struct xe_drm_ras *ras = &xe->ras;
> +	const char *device_name;
> +	int i = 0, ret;

Redundant initialization. Also, ret belongs to the loop below.

> +	device_name = kasprintf(GFP_KERNEL, "%04x:%02x:%02x.%d",
> +				pci_domain_nr(pdev->bus), pdev->bus->number,
> +				PCI_SLOT(pdev->devfn), PCI_FUNC(pdev->devfn));
> +
> +	for (i = 0; i < DRM_XE_RAS_ERROR_SEVERITY_MAX; i++) {

We could potentially have a nice for_each_error_severity() now ;)
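
Something along these lines, as a standalone sketch. The enum is copied from the quoted diff, while the macro body and the count_pending_severities() user are assumptions made up for illustration (the bit layout in particular is not taken from the driver).

```c
/* Stand-in for the enum in the patch; values copied from the quoted diff. */
enum drm_xe_ras_error_severity {
	DRM_XE_RAS_ERROR_CORRECTABLE = 0,
	DRM_XE_RAS_ERROR_NONFATAL,
	DRM_XE_RAS_ERROR_FATAL,
	DRM_XE_RAS_ERROR_SEVERITY_MAX
};

/* Hypothetical iterator over all severities. */
#define for_each_error_severity(sev) \
	for ((sev) = 0; (sev) < DRM_XE_RAS_ERROR_SEVERITY_MAX; (sev)++)

/* Example user: count how many severities have their bit set in a mask. */
static unsigned int count_pending_severities(unsigned int mask)
{
	unsigned int sev, pending = 0;

	for_each_error_severity(sev)
		if (mask & (1u << sev))
			pending++;

	return pending;
}
```

The same macro would then cover both the registration loop and the IRQ handler loop further down.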

> +		struct drm_ras_node *node = &ras->node[i];
> +
> +		node->device_name = device_name;
> +		node->node_name = error_severity[i];
> +		node->type = DRM_RAS_NODE_TYPE_ERROR_COUNTER;
> +		node->priv = xe;
> +
> +		ret = assign_node_params(xe, node, i);
> +		if (ret)
> +			return ret;
> +
> +		ret = drm_ras_node_register(node);
> +		if (ret) {
> +			drm_err(&xe->drm, "Failed to register drm ras tile node\n");
> +			return ret;
> +		}
> +	}
> +
> +	return 0;
> +}

...

> +int xe_drm_ras_allocate_nodes(struct xe_device *xe)
> +{
> +	struct xe_drm_ras *ras = &xe->ras;
> +	struct drm_ras_node *node;
> +	int err;
> +
> +	node = drmm_kzalloc(&xe->drm, DRM_XE_RAS_ERROR_SEVERITY_MAX * sizeof(struct drm_ras_node),
> +			    GFP_KERNEL);

Ditto for drmm_kcalloc().

> +	if (!node)
> +		return -ENOMEM;
> +
> +	ras->node = node;
> +
> +	err = register_nodes(xe);
> +	if (err) {
> +		drm_err(&xe->drm, "Failed to register drm ras node\n");

You wanted to call drm_err_probe(), didn't you ...?

Ah, not upstream yet :(
But perhaps an opportunity worth considering.

> +		return err;
> +	}
> +
> +	err = devm_add_action_or_reset(xe->drm.dev, xe_drm_ras_unregister_nodes, xe);
> +	if (err) {
> +		drm_err(&xe->drm, "Failed to add action for xe drm_ras\n");

Ditto.

> +		return err;
> +	}
> +
> +	return 0;

...

> +/**
> + * struct xe_drm_ras_counter - xe ras counter
> + *
> + * This structure contains error class and counter information
> + */
> +struct xe_drm_ras_counter {
> +	/** @name: error class name */
> +	const char *name;
> +	/** @counter: count of error */
> +	atomic64_t counter;
> +};
> +
> +/**
> + * struct xe_drm_ras - xe drm ras structure
> + *
> + * This structure has details of error counters
> + */
> +struct xe_drm_ras {
> +	/** @node: DRM RAS node */
> +	struct drm_ras_node *node;
> +
> +	/** @info: info array for all types of errors */
> +	struct xe_drm_ras_counter *info[DRM_XE_RAS_ERROR_SEVERITY_MAX];
> +
> +};

Either separate the members with blank lines or don't, but let's be
consistent.

...

>  void xe_hw_error_irq_handler(struct xe_tile *tile, const u32 master_ctl)
>  {
> -	enum hardware_error hw_err;
> +	u32 hw_err;
>  
>  	if (fault_inject_csc_hw_error())
>  		schedule_work(&tile->csc_hw_error_work);
>  
> -	for (hw_err = 0; hw_err < HARDWARE_ERROR_MAX; hw_err++)
> +	for (hw_err = 0; hw_err < DRM_XE_RAS_ERROR_SEVERITY_MAX; hw_err++)

for_each_error_severity()

>  		if (master_ctl & ERROR_IRQ(hw_err))
>  			hw_error_source_handler(tile, hw_err);
>  }

...

> +/**
> + * enum drm_xe_ras_error_severity - Supported drm ras error severity.
> + */
> +enum drm_xe_ras_error_severity {
> +	/** @DRM_XE_RAS_ERROR_CORRECTABLE: Correctable Error */
> +	DRM_XE_RAS_ERROR_CORRECTABLE = 0,
> +	/** @DRM_XE_RAS_ERROR_NONFATAL: Non fatal Error */
> +	DRM_XE_RAS_ERROR_NONFATAL,
> +	/** @DRM_XE_RAS_ERROR_FATAL: Fatal error */
> +	DRM_XE_RAS_ERROR_FATAL,
> +	/** @DRM_XE_RAS_ERROR_SEVERITY_MAX: Max severity */
> +	DRM_XE_RAS_ERROR_SEVERITY_MAX, /* non-ABI */

This is guaranteed last member, so redundant comma.

> +};
> +
> +/**
> + * enum drm_xe_ras_error_class - Supported drm ras error classes.
> + */
> +enum drm_xe_ras_error_class {
> +	/** @DRM_XE_RAS_ERROR_CORE_COMPUTE: GT and Media Error */
> +	DRM_XE_RAS_ERROR_CORE_COMPUTE = 1,
> +	/** @DRM_XE_RAS_ERROR_SOC_INTERNAL: SOC Error */
> +	DRM_XE_RAS_ERROR_SOC_INTERNAL,
> +	/** @DRM_XE_RAS_ERROR_CLASS_MAX: Max Error */
> +	DRM_XE_RAS_ERROR_CLASS_MAX,	/* non-ABI */

Ditto.

> +};

Also, all of the enums share the same DRM_XE_RAS_ERROR_* prefix, so let's try
to have distinguishable naming. Perhaps [*] would be useful here as well ;)

Raag

^ permalink raw reply	[flat|nested] 31+ messages in thread

* Re: [PATCH v3 2/4] drm/xe/xe_drm_ras: Add support for drm ras
  2025-12-09  8:22   ` Raag Jadav
@ 2026-01-09  8:08     ` Riana Tauro
  2026-01-09 14:13       ` Rodrigo Vivi
  0 siblings, 1 reply; 31+ messages in thread
From: Riana Tauro @ 2026-01-09  8:08 UTC (permalink / raw)
  To: Raag Jadav
  Cc: intel-xe, dri-devel, aravind.iddamsetty, anshuman.gupta,
	rodrigo.vivi, joonas.lahtinen, simona.vetter, airlied,
	pratik.bari, joshua.santosh.ranjan, ashwin.kumar.kulkarni,
	shubham.kumar

Hi Raag

Thank you for the review

On 12/9/2025 1:52 PM, Raag Jadav wrote:
> On Fri, Dec 05, 2025 at 02:09:34PM +0530, Riana Tauro wrote:
>> Allocate correctable, nonfatal and fatal nodes per xe device.
>> Each node contains error classes, counters and respective
>> query counter functions.
>>
>> Add basic functionality to create and register drm nodes.
>> Below operations can be performed using Generic netlink DRM RAS interface
>>
>> List Nodes:
>>
>> $ sudo ynl --family drm_ras  --dump list-nodes
>> [{'device-name': '0000:03:00.0',
>>    'node-id': 0,
>>    'node-name': 'correctable-errors',
>>    'node-type': 'error-counter'},
>>   {'device-name': '0000:03:00.0',
>>    'node-id': 1,
>>    'node-name': 'nonfatal-errors',
>>    'node-type': 'error-counter'},
>>   {'device-name': '0000:03:00.0',
>>    'node-id': 2,
>>    'node-name': 'fatal-errors',
>>    'node-type': 'error-counter'}]
>>
>> Get Error counters:
>>
>> $ sudo ynl --family drm_ras  --dump get-error-counters --json '{"node-id":1}'
>> [{'error-id': 1, 'error-name': 'Core Compute Error', 'error-value': 0},
>>   {'error-id': 2, 'error-name': 'SOC Internal Error', 'error-value': 0}]
> 
> Minor bikeshedding: Is there anything like 'SOC External'? If not, perhaps
> simply 'SOC' would be sufficient.

Agree. SoC should be sufficient

> 
>> Query Error counter:
>>
>> $ sudo ynl --family drm_ras --do query-error-counter  --json '{"node-id":1, "error-id":1}'
>> {'error-id': 1, 'error-name': 'Core Compute Error', 'error-value': 0}
> 
> One more (sorry): So this means graphics will be a different id? Or do they
> overlap? How does it work?
> 

Did not get this question.

> 
> Also,
> 
> [*] I'm not much informed about the history here but the 'error' term
> seems slapped onto almost everything. We already know it's RAS so perhaps
> we add it only where make sense and try to simplify some of the naming?

Let's keep the errors in the node-names. Removing it from error-name
should be okay. Will fix this.


> 
> ...
> 
>> diff --git a/drivers/gpu/drm/xe/xe_drm_ras.c b/drivers/gpu/drm/xe/xe_drm_ras.c
>> new file mode 100644
>> index 000000000000..764b14b1edf8
>> --- /dev/null
>> +++ b/drivers/gpu/drm/xe/xe_drm_ras.c
>> @@ -0,0 +1,199 @@
>> +// SPDX-License-Identifier: MIT
>> +/*
>> + * Copyright © 2025 Intel Corporation
>> + */
>> +
>> +#include <drm/drm_managed.h>
>> +#include <drm/drm_ras.h>
>> +#include <linux/bitmap.h>
>> +
>> +#include "xe_device.h"
> 
> This inherits some of the debt that should not be there, so let's try to
> get away with xe_device_types.h where possible. But please double check.
> 
>> +#include "xe_drm_ras.h"
> 
> ...
> 
>> +static struct xe_drm_ras_counter *allocate_and_copy_counters(struct xe_device *xe,
>> +							     int count)
>> +{
>> +	struct xe_drm_ras_counter *counter;
>> +	int i;
>> +
>> +	counter = drmm_kzalloc(&xe->drm, count * sizeof(struct xe_drm_ras_counter), GFP_KERNEL);
> 
> Why not drmm_kcalloc()? We get a bonus overflow protection.

Will check

> 
>> +	if (!counter)
>> +		return ERR_PTR(-ENOMEM);
>> +
>> +	for (i = 0; i < count; i++) {
>> +		if (!errors[i])
>> +			continue;
>> +
>> +		counter[i].name = errors[i];
>> +		atomic64_set(&counter[i].counter, 0);
>> +	}
>> +
>> +	return counter;
>> +}
>> +
>> +static int assign_node_params(struct xe_device *xe, struct drm_ras_node *node,
>> +			      const enum drm_xe_ras_error_severity severity)
>> +{
>> +	struct xe_drm_ras *ras = &xe->ras;
>> +	int count = 0, ret = 0;
> 
> Redundant initializations, also why do we need them?

redundant code from previous rev. Will remove

> 
>> +	count = DRM_XE_RAS_ERROR_CLASS_MAX;
>> +	node->error_counter_range.first = DRM_XE_RAS_ERROR_CORE_COMPUTE;
>> +	node->error_counter_range.last = DRM_XE_RAS_ERROR_CLASS_MAX - 1;
>> +
>> +	ras->info[severity] = allocate_and_copy_counters(xe, count);
> 
> This looks like count should be derived from first and last, or did I
> miss something?

assigned it directly. Can be done

> 
>> +	if (IS_ERR(ras->info[severity]))
>> +		return PTR_ERR(ras->info[severity]);
>> +
>> +	switch (severity) {
>> +	case DRM_XE_RAS_ERROR_CORRECTABLE:
>> +		node->query_error_counter = query_correctable_error_counters;
>> +		break;
>> +	case DRM_XE_RAS_ERROR_NONFATAL:
>> +		node->query_error_counter = query_non_fatal_error_counters;
>> +		break;
>> +	case DRM_XE_RAS_ERROR_FATAL:
>> +		node->query_error_counter = query_fatal_error_counters;
>> +		break;
>> +	default:
> 
> Do we need this?

yes.

> 
>> +		break;
>> +	}
>> +
>> +	return ret;
>> +}
>> +
>> +static int register_nodes(struct xe_device *xe)
>> +{
>> +	struct pci_dev *pdev = to_pci_dev(xe->drm.dev);
>> +	struct xe_drm_ras *ras = &xe->ras;
>> +	const char *device_name;
>> +	int i = 0, ret;
> 
> Redundant initialization. Also, ret belongs to the loop below.
> 
>> +	device_name = kasprintf(GFP_KERNEL, "%04x:%02x:%02x.%d",
>> +				pci_domain_nr(pdev->bus), pdev->bus->number,
>> +				PCI_SLOT(pdev->devfn), PCI_FUNC(pdev->devfn));
>> +
>> +	for (i = 0; i < DRM_XE_RAS_ERROR_SEVERITY_MAX; i++) {
> 
> We could potentially have a nice for_each_error_severity() now ;)

Sure. Will check. If it's used in multiple places then it's worth having a
helper.

> 
>> +		struct drm_ras_node *node = &ras->node[i];
>> +
>> +		node->device_name = device_name;
>> +		node->node_name = error_severity[i];
>> +		node->type = DRM_RAS_NODE_TYPE_ERROR_COUNTER;
>> +		node->priv = xe;
>> +
>> +		ret = assign_node_params(xe, node, i);
>> +		if (ret)
>> +			return ret;
>> +
>> +		ret = drm_ras_node_register(node);
>> +		if (ret) {
>> +			drm_err(&xe->drm, "Failed to register drm ras tile node\n");
>> +			return ret;
>> +		}
>> +	}
>> +
>> +	return 0;
>> +}
> 
> ...
> 
>> +int xe_drm_ras_allocate_nodes(struct xe_device *xe)
>> +{
>> +	struct xe_drm_ras *ras = &xe->ras;
>> +	struct drm_ras_node *node;
>> +	int err;
>> +
>> +	node = drmm_kzalloc(&xe->drm, DRM_XE_RAS_ERROR_SEVERITY_MAX * sizeof(struct drm_ras_node),
>> +			    GFP_KERNEL);
> 
> Ditto for drmm_kcalloc().
> 
>> +	if (!node)
>> +		return -ENOMEM;
>> +
>> +	ras->node = node;
>> +
>> +	err = register_nodes(xe);
>> +	if (err) {
>> +		drm_err(&xe->drm, "Failed to register drm ras node\n");
> 
> You wanted to call drm_err_probe(), didn't you ...?
> 
> Ah, not upstream yet :(
> But perhaps an opportunity worth considering.
> 
>> +		return err;
>> +	}
>> +
>> +	err = devm_add_action_or_reset(xe->drm.dev, xe_drm_ras_unregister_nodes, xe);
>> +	if (err) {
>> +		drm_err(&xe->drm, "Failed to add action for xe drm_ras\n");
> 
> Ditto.
> 
>> +		return err;
>> +	}
>> +
>> +	return 0;
> 
> ...
> 
>> +/**
>> + * struct xe_drm_ras_counter - xe ras counter
>> + *
>> + * This structure contains error class and counter information
>> + */
>> +struct xe_drm_ras_counter {
>> +	/** @name: error class name */
>> +	const char *name;
>> +	/** @counter: count of error */
>> +	atomic64_t counter;
>> +};
>> +
>> +/**
>> + * struct xe_drm_ras - xe drm ras structure
>> + *
>> + * This structure has details of error counters
>> + */
>> +struct xe_drm_ras {
>> +	/** @node: DRM RAS node */
>> +	struct drm_ras_node *node;
>> +
>> +	/** @info: info array for all types of errors */
>> +	struct xe_drm_ras_counter *info[DRM_XE_RAS_ERROR_SEVERITY_MAX];
>> +
>> +};
> 
> Either separate the members with blank lines or don't, but let's be
> consistent.

Will fix

> 
> ...
> 
>>   void xe_hw_error_irq_handler(struct xe_tile *tile, const u32 master_ctl)
>>   {
>> -	enum hardware_error hw_err;
>> +	u32 hw_err;
>>   
>>   	if (fault_inject_csc_hw_error())
>>   		schedule_work(&tile->csc_hw_error_work);
>>   
>> -	for (hw_err = 0; hw_err < HARDWARE_ERROR_MAX; hw_err++)
>> +	for (hw_err = 0; hw_err < DRM_XE_RAS_ERROR_SEVERITY_MAX; hw_err++)
> 
> for_each_error_severity()
> 
>>   		if (master_ctl & ERROR_IRQ(hw_err))
>>   			hw_error_source_handler(tile, hw_err);
>>   }
> 
> ...
> 
>> +/**
>> + * enum drm_xe_ras_error_severity - Supported drm ras error severity.
>> + */
>> +enum drm_xe_ras_error_severity {
>> +	/** @DRM_XE_RAS_ERROR_CORRECTABLE: Correctable Error */
>> +	DRM_XE_RAS_ERROR_CORRECTABLE = 0,
>> +	/** @DRM_XE_RAS_ERROR_NONFATAL: Non fatal Error */
>> +	DRM_XE_RAS_ERROR_NONFATAL,
>> +	/** @DRM_XE_RAS_ERROR_FATAL: Fatal error */
>> +	DRM_XE_RAS_ERROR_FATAL,
>> +	/** @DRM_XE_RAS_ERROR_SEVERITY_MAX: Max severity */
>> +	DRM_XE_RAS_ERROR_SEVERITY_MAX, /* non-ABI */
> 
> This is guaranteed last member, so redundant comma.

ok

> 
>> +};
>> +
>> +/**
>> + * enum drm_xe_ras_error_class - Supported drm ras error classes.
>> + */
>> +enum drm_xe_ras_error_class {
>> +	/** @DRM_XE_RAS_ERROR_CORE_COMPUTE: GT and Media Error */
>> +	DRM_XE_RAS_ERROR_CORE_COMPUTE = 1,
>> +	/** @DRM_XE_RAS_ERROR_SOC_INTERNAL: SOC Error */
>> +	DRM_XE_RAS_ERROR_SOC_INTERNAL,
>> +	/** @DRM_XE_RAS_ERROR_CLASS_MAX: Max Error */
>> +	DRM_XE_RAS_ERROR_CLASS_MAX,	/* non-ABI */
> 
> Ditto.
> 
>> +};
> 
> Also, all of the enums share the same DRM_XE_RAS_ERROR_* prefix, so let's try
> to have distinguishable naming. Perhaps [*] would be useful here as well ;)

DRM_XE_RAS_ERROR_SEVERITY_* will cause longer names. Any suggestions?

Thanks
Riana

> 
> Raag



* Re: [PATCH v3 2/4] drm/xe/xe_drm_ras: Add support for drm ras
  2026-01-09  8:08     ` Riana Tauro
@ 2026-01-09 14:13       ` Rodrigo Vivi
  2026-01-09 15:58         ` Raag Jadav
  0 siblings, 1 reply; 31+ messages in thread
From: Rodrigo Vivi @ 2026-01-09 14:13 UTC (permalink / raw)
  To: Riana Tauro
  Cc: Raag Jadav, intel-xe, dri-devel, aravind.iddamsetty,
	anshuman.gupta, joonas.lahtinen, simona.vetter, airlied,
	pratik.bari, joshua.santosh.ranjan, ashwin.kumar.kulkarni,
	shubham.kumar

On Fri, Jan 09, 2026 at 01:38:44PM +0530, Riana Tauro wrote:
> Hi Raag
> 
> Thank you for the review
> 
> On 12/9/2025 1:52 PM, Raag Jadav wrote:
> > On Fri, Dec 05, 2025 at 02:09:34PM +0530, Riana Tauro wrote:
> > > Allocate correctable, nonfatal and fatal nodes per xe device.
> > > Each node contains error classes, counters and respective
> > > query counter functions.
> > > 
> > > Add basic functionality to create and register drm nodes.
> > > Below operations can be performed using Generic netlink DRM RAS interface
> > > 
> > > List Nodes:
> > > 
> > > $ sudo ynl --family drm_ras  --dump list-nodes
> > > [{'device-name': '0000:03:00.0',
> > >    'node-id': 0,
> > >    'node-name': 'correctable-errors',
> > >    'node-type': 'error-counter'},
> > >   {'device-name': '0000:03:00.0',
> > >    'node-id': 1,
> > >    'node-name': 'nonfatal-errors',
> > >    'node-type': 'error-counter'},
> > >   {'device-name': '0000:03:00.0',
> > >    'node-id': 2,
> > >    'node-name': 'fatal-errors',
> > >    'node-type': 'error-counter'}]
> > > 
> > > Get Error counters:
> > > 
> > > $ sudo ynl --family drm_ras  --dump get-error-counters --json '{"node-id":1}'
> > > [{'error-id': 1, 'error-name': 'Core Compute Error', 'error-value': 0},
> > >   {'error-id': 2, 'error-name': 'SOC Internal Error', 'error-value': 0}]
> > 
> > Minor bikeshedding: Is there anything like 'SOC External'? If not, perhaps
> > simply 'SOC' would be sufficient.
> 
> Agree. SoC should be sufficient
> 
> > 
> > > Query Error counter:
> > > 
> > > $ sudo ynl --family drm_ras --do query-error-counter  --json '{"node-id":1, "error-id":1}'
> > > {'error-id': 1, 'error-name': 'Core Compute Error', 'error-value': 0}
> > 
> > One more (sorry): So this means graphics will be a different id? Or do they
> > overlap? How does it work?
> > 
> 
> Did not get this question.
> 
> > 
> > Also,
> > 
> > [*] I'm not much informed about the history here but the 'error' term
> > seems slapped onto almost everything. We already know it's RAS so perhaps
> > we add it only where make sense and try to simplify some of the naming?
> 
> Let's keep the errors in the node-names. Removing it from error-name should
> be okay. Wil fix ths
> 
> 
> > 
> > ...
> > 
> > > diff --git a/drivers/gpu/drm/xe/xe_drm_ras.c b/drivers/gpu/drm/xe/xe_drm_ras.c
> > > new file mode 100644
> > > index 000000000000..764b14b1edf8
> > > --- /dev/null
> > > +++ b/drivers/gpu/drm/xe/xe_drm_ras.c
> > > @@ -0,0 +1,199 @@
> > > +// SPDX-License-Identifier: MIT
> > > +/*
> > > + * Copyright © 2025 Intel Corporation
> > > + */
> > > +
> > > +#include <drm/drm_managed.h>
> > > +#include <drm/drm_ras.h>
> > > +#include <linux/bitmap.h>
> > > +
> > > +#include "xe_device.h"
> > 
> > This inherits some of the debt that should not be there, so let's try to
> > get away with xe_device_types.h where possible. But please double check.
> > 
> > > +#include "xe_drm_ras.h"
> > 
> > ...
> > 
> > > +static struct xe_drm_ras_counter *allocate_and_copy_counters(struct xe_device *xe,
> > > +							     int count)
> > > +{
> > > +	struct xe_drm_ras_counter *counter;
> > > +	int i;
> > > +
> > > +	counter = drmm_kzalloc(&xe->drm, count * sizeof(struct xe_drm_ras_counter), GFP_KERNEL);
> > 
> > Why not drmm_kcalloc()? We get a bonus overflow protection.
> 
> Will check
> 
> > 
> > > +	if (!counter)
> > > +		return ERR_PTR(-ENOMEM);
> > > +
> > > +	for (i = 0; i < count; i++) {
> > > +		if (!errors[i])
> > > +			continue;
> > > +
> > > +		counter[i].name = errors[i];
> > > +		atomic64_set(&counter[i].counter, 0);
> > > +	}
> > > +
> > > +	return counter;
> > > +}
> > > +
> > > +static int assign_node_params(struct xe_device *xe, struct drm_ras_node *node,
> > > +			      const enum drm_xe_ras_error_severity severity)
> > > +{
> > > +	struct xe_drm_ras *ras = &xe->ras;
> > > +	int count = 0, ret = 0;
> > 
> > Redundant initializations, also why do we need them?
> 
> redundant code from previous rev. Will remove
> 
> > 
> > > +	count = DRM_XE_RAS_ERROR_CLASS_MAX;
> > > +	node->error_counter_range.first = DRM_XE_RAS_ERROR_CORE_COMPUTE;
> > > +	node->error_counter_range.last = DRM_XE_RAS_ERROR_CLASS_MAX - 1;
> > > +
> > > +	ras->info[severity] = allocate_and_copy_counters(xe, count);
> > 
> > This looks like count should be derived from first and last, or did I
> > miss something?
> 
> assigned it directly. Can be done
> 
> > 
> > > +	if (IS_ERR(ras->info[severity]))
> > > +		return PTR_ERR(ras->info[severity]);
> > > +
> > > +	switch (severity) {
> > > +	case DRM_XE_RAS_ERROR_CORRECTABLE:
> > > +		node->query_error_counter = query_correctable_error_counters;
> > > +		break;
> > > +	case DRM_XE_RAS_ERROR_NONFATAL:
> > > +		node->query_error_counter = query_non_fatal_error_counters;
> > > +		break;
> > > +	case DRM_XE_RAS_ERROR_FATAL:
> > > +		node->query_error_counter = query_fatal_error_counters;
> > > +		break;
> > > +	default:
> > 
> > Do we need this?
> 
> yes.
> 
> > 
> > > +		break;
> > > +	}
> > > +
> > > +	return ret;
> > > +}
> > > +
> > > +static int register_nodes(struct xe_device *xe)
> > > +{
> > > +	struct pci_dev *pdev = to_pci_dev(xe->drm.dev);
> > > +	struct xe_drm_ras *ras = &xe->ras;
> > > +	const char *device_name;
> > > +	int i = 0, ret;
> > 
> > Redundant initialization. Also, ret belongs to the loop below.
> > 
> > > +	device_name = kasprintf(GFP_KERNEL, "%04x:%02x:%02x.%d",
> > > +				pci_domain_nr(pdev->bus), pdev->bus->number,
> > > +				PCI_SLOT(pdev->devfn), PCI_FUNC(pdev->devfn));
> > > +
> > > +	for (i = 0; i < DRM_XE_RAS_ERROR_SEVERITY_MAX; i++) {
> > 
> > We could potentially have a nice for_each_error_severity() now ;)
> 
> Sure. Will check. If its used in multiple places then its worth having a
> helper
> 
> > 
> > > +		struct drm_ras_node *node = &ras->node[i];
> > > +
> > > +		node->device_name = device_name;
> > > +		node->node_name = error_severity[i];
> > > +		node->type = DRM_RAS_NODE_TYPE_ERROR_COUNTER;
> > > +		node->priv = xe;
> > > +
> > > +		ret = assign_node_params(xe, node, i);
> > > +		if (ret)
> > > +			return ret;
> > > +
> > > +		ret = drm_ras_node_register(node);
> > > +		if (ret) {
> > > +			drm_err(&xe->drm, "Failed to register drm ras tile node\n");
> > > +			return ret;
> > > +		}
> > > +	}
> > > +
> > > +	return 0;
> > > +}
> > 
> > ...
> > 
> > > +int xe_drm_ras_allocate_nodes(struct xe_device *xe)
> > > +{
> > > +	struct xe_drm_ras *ras = &xe->ras;
> > > +	struct drm_ras_node *node;
> > > +	int err;
> > > +
> > > +	node = drmm_kzalloc(&xe->drm, DRM_XE_RAS_ERROR_SEVERITY_MAX * sizeof(struct drm_ras_node),
> > > +			    GFP_KERNEL);
> > 
> > Ditto for drmm_kcalloc().
> > 
> > > +	if (!node)
> > > +		return -ENOMEM;
> > > +
> > > +	ras->node = node;
> > > +
> > > +	err = register_nodes(xe);
> > > +	if (err) {
> > > +		drm_err(&xe->drm, "Failed to register drm ras node\n");
> > 
> > You wanted to call drm_err_probe(), didn't you ...?
> > 
> > Ah, not upstream yet :(
> > But perhaps an opportunity worth considering.
> > 
> > > +		return err;
> > > +	}
> > > +
> > > +	err = devm_add_action_or_reset(xe->drm.dev, xe_drm_ras_unregister_nodes, xe);
> > > +	if (err) {
> > > +		drm_err(&xe->drm, "Failed to add action for xe drm_ras\n");
> > 
> > Ditto.
> > 
> > > +		return err;
> > > +	}
> > > +
> > > +	return 0;
> > 
> > ...
> > 
> > > +/**
> > > + * struct xe_drm_ras_counter - xe ras counter
> > > + *
> > > + * This structure contains error class and counter information
> > > + */
> > > +struct xe_drm_ras_counter {
> > > +	/** @name: error class name */
> > > +	const char *name;
> > > +	/** @counter: count of error */
> > > +	atomic64_t counter;
> > > +};
> > > +
> > > +/**
> > > + * struct xe_drm_ras - xe drm ras structure
> > > + *
> > > + * This structure has details of error counters
> > > + */
> > > +struct xe_drm_ras {
> > > +	/** @node: DRM RAS node */
> > > +	struct drm_ras_node *node;
> > > +
> > > +	/** @info: info array for all types of errors */
> > > +	struct xe_drm_ras_counter *info[DRM_XE_RAS_ERROR_SEVERITY_MAX];
> > > +
> > > +};
> > 
> > Either separate the members with blank lines or don't, but let's be
> > consistent.
> 
> Will fix
> 
> > 
> > ...
> > 
> > >   void xe_hw_error_irq_handler(struct xe_tile *tile, const u32 master_ctl)
> > >   {
> > > -	enum hardware_error hw_err;
> > > +	u32 hw_err;
> > >   	if (fault_inject_csc_hw_error())
> > >   		schedule_work(&tile->csc_hw_error_work);
> > > -	for (hw_err = 0; hw_err < HARDWARE_ERROR_MAX; hw_err++)
> > > +	for (hw_err = 0; hw_err < DRM_XE_RAS_ERROR_SEVERITY_MAX; hw_err++)
> > 
> > for_each_error_severity()
> > 
> > >   		if (master_ctl & ERROR_IRQ(hw_err))
> > >   			hw_error_source_handler(tile, hw_err);
> > >   }
> > 
> > ...
> > 
> > > +/**
> > > + * enum drm_xe_ras_error_severity - Supported drm ras error severity.
> > > + */
> > > +enum drm_xe_ras_error_severity {
> > > +	/** @DRM_XE_RAS_ERROR_CORRECTABLE: Correctable Error */
> > > +	DRM_XE_RAS_ERROR_CORRECTABLE = 0,
> > > +	/** @DRM_XE_RAS_ERROR_NONFATAL: Non fatal Error */
> > > +	DRM_XE_RAS_ERROR_NONFATAL,
> > > +	/** @DRM_XE_RAS_ERROR_FATAL: Fatal error */
> > > +	DRM_XE_RAS_ERROR_FATAL,
> > > +	/** @DRM_XE_RAS_ERROR_SEVERITY_MAX: Max severity */
> > > +	DRM_XE_RAS_ERROR_SEVERITY_MAX, /* non-ABI */
> > 
> > This is guaranteed last member, so redundant comma.
> 
> ok
> 
> > 
> > > +};
> > > +
> > > +/**
> > > + * enum drm_xe_ras_error_class - Supported drm ras error classes.
> > > + */
> > > +enum drm_xe_ras_error_class {
> > > +	/** @DRM_XE_RAS_ERROR_CORE_COMPUTE: GT and Media Error */
> > > +	DRM_XE_RAS_ERROR_CORE_COMPUTE = 1,
> > > +	/** @DRM_XE_RAS_ERROR_SOC_INTERNAL: SOC Error */
> > > +	DRM_XE_RAS_ERROR_SOC_INTERNAL,
> > > +	/** @DRM_XE_RAS_ERROR_CLASS_MAX: Max Error */
> > > +	DRM_XE_RAS_ERROR_CLASS_MAX,	/* non-ABI */
> > 
> > Ditto.
> > 
> > > +};
> > 
> > Also, all of the enums share the same DRM_XE_RAS_ERROR_* prefix, so let's try
> > to have distinguishable naming. Perhaps [*] would be useful here as well ;)
> 
> DRM_XE_RAS_ERROR_SEVERITY_* will cause longer names. Any suggestions?

Try this full version first and see what the outcome looks like.
If we still respect the line limits without ugly cuts, then let's go with it.
Otherwise try something shorter, ERR_SEV_ or something like that.

> 
> Thanks
> Riana
> 
> > 
> > Raag
> 


* Re: [PATCH v3 2/4] drm/xe/xe_drm_ras: Add support for drm ras
  2026-01-09 14:13       ` Rodrigo Vivi
@ 2026-01-09 15:58         ` Raag Jadav
  2026-01-12  6:13           ` Riana Tauro
  0 siblings, 1 reply; 31+ messages in thread
From: Raag Jadav @ 2026-01-09 15:58 UTC (permalink / raw)
  To: Rodrigo Vivi
  Cc: Riana Tauro, intel-xe, dri-devel, aravind.iddamsetty,
	anshuman.gupta, joonas.lahtinen, simona.vetter, airlied,
	pratik.bari, joshua.santosh.ranjan, ashwin.kumar.kulkarni,
	shubham.kumar

On Fri, Jan 09, 2026 at 09:13:31AM -0500, Rodrigo Vivi wrote:
> On Fri, Jan 09, 2026 at 01:38:44PM +0530, Riana Tauro wrote:
> > Hi Raag
> > 
> > Thank you for the review
> > 
> > On 12/9/2025 1:52 PM, Raag Jadav wrote:
> > > On Fri, Dec 05, 2025 at 02:09:34PM +0530, Riana Tauro wrote:
> > > > Allocate correctable, nonfatal and fatal nodes per xe device.
> > > > Each node contains error classes, counters and respective
> > > > query counter functions.
> > > > 
> > > > Add basic functionality to create and register drm nodes.
> > > > Below operations can be performed using Generic netlink DRM RAS interface

...

> > > > Query Error counter:
> > > > 
> > > > $ sudo ynl --family drm_ras --do query-error-counter  --json '{"node-id":1, "error-id":1}'
> > > > {'error-id': 1, 'error-name': 'Core Compute Error', 'error-value': 0}
> > > 
> > > One more (sorry): So this means graphics will be a different id? Or do they
> > > overlap? How does it work?
> > > 
> > 
> > Did not get this question.

This gives the impression that it's specific to the compute engine, so I was
hoping for something more generic like "execution unit" or simply "core",
but I couldn't come up with anything better than this, so it's up to you.

> > > Also,
> > > 
> > > [*] I'm not much informed about the history here but the 'error' term
> > > seems slapped onto almost everything. We already know it's RAS so perhaps
> > > we add it only where make sense and try to simplify some of the naming?

...

> > > > +/**
> > > > + * enum drm_xe_ras_error_class - Supported drm ras error classes.
> > > > + */
> > > > +enum drm_xe_ras_error_class {
> > > > +	/** @DRM_XE_RAS_ERROR_CORE_COMPUTE: GT and Media Error */
> > > > +	DRM_XE_RAS_ERROR_CORE_COMPUTE = 1,
> > > > +	/** @DRM_XE_RAS_ERROR_SOC_INTERNAL: SOC Error */
> > > > +	DRM_XE_RAS_ERROR_SOC_INTERNAL,
> > > > +	/** @DRM_XE_RAS_ERROR_CLASS_MAX: Max Error */
> > > > +	DRM_XE_RAS_ERROR_CLASS_MAX,	/* non-ABI */
> > > > +};
> > > 
> > > Also, all of the enums share the same DRM_XE_RAS_ERROR_* prefix, so let's try
> > > to have distinguishable naming. Perhaps [*] would be useful here as well ;)
> > 
> > DRM_XE_RAS_ERROR_SEVERITY_* will cause longer names. Any suggestions?

Already mentioned above[*], the key is to not overuse 'error' ;)

DRM_XE_RAS_SEVERITY_*
DRM_XE_RAS_COMPONENT_*

and so on ...
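
As a sketch of how that could look. Only the two prefixes are from the suggestion above; the enum type names and member suffixes are assumptions carried over from the quoted patch.

```c
/* Hypothetical renaming of the severity enum from the patch. */
enum drm_xe_ras_severity {
	DRM_XE_RAS_SEVERITY_CORRECTABLE = 0,
	DRM_XE_RAS_SEVERITY_NONFATAL,
	DRM_XE_RAS_SEVERITY_FATAL,
	DRM_XE_RAS_SEVERITY_MAX	/* non-ABI */
};

/* Hypothetical renaming of the error class enum from the patch. */
enum drm_xe_ras_component {
	DRM_XE_RAS_COMPONENT_CORE_COMPUTE = 1,
	DRM_XE_RAS_COMPONENT_SOC_INTERNAL,
	DRM_XE_RAS_COMPONENT_MAX	/* non-ABI */
};
```

The numeric values are unchanged from the patch, so the two enums remain distinguishable by prefix alone.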

> Try this full version first and see how the outcome looks like...
> if we are still respecting the line limits without ugly cuts, then let's go with it.
> otherwise try something shorter ERR_SEV_ ... or something like that...

... which can be further shortened with this idea.

Side note: I'm already using these on my local branch.

Raag


* Re: [PATCH v3 2/4] drm/xe/xe_drm_ras: Add support for drm ras
  2026-01-09 15:58         ` Raag Jadav
@ 2026-01-12  6:13           ` Riana Tauro
  2026-01-12 10:27             ` Raag Jadav
  0 siblings, 1 reply; 31+ messages in thread
From: Riana Tauro @ 2026-01-12  6:13 UTC (permalink / raw)
  To: Raag Jadav, Rodrigo Vivi
  Cc: intel-xe, dri-devel, aravind.iddamsetty, anshuman.gupta,
	joonas.lahtinen, simona.vetter, airlied, pratik.bari,
	joshua.santosh.ranjan, ashwin.kumar.kulkarni, shubham.kumar



On 1/9/2026 9:28 PM, Raag Jadav wrote:
> On Fri, Jan 09, 2026 at 09:13:31AM -0500, Rodrigo Vivi wrote:
>> On Fri, Jan 09, 2026 at 01:38:44PM +0530, Riana Tauro wrote:
>>> Hi Raag
>>>
>>> Thank you for the review
>>>
>>> On 12/9/2025 1:52 PM, Raag Jadav wrote:
>>>> On Fri, Dec 05, 2025 at 02:09:34PM +0530, Riana Tauro wrote:
>>>>> Allocate correctable, nonfatal and fatal nodes per xe device.
>>>>> Each node contains error classes, counters and respective
>>>>> query counter functions.
>>>>>
>>>>> Add basic functionality to create and register drm nodes.
>>>>> Below operations can be performed using Generic netlink DRM RAS interface
> 
> ...
> 
>>>>> Query Error counter:
>>>>>
>>>>> $ sudo ynl --family drm_ras --do query-error-counter  --json '{"node-id":1, "error-id":1}'
>>>>> {'error-id': 1, 'error-name': 'Core Compute Error', 'error-value': 0}
>>>>
>>>> One more (sorry): So this means graphics will be a different id? Or do they
>>>> overlap? How does it work?
>>>>
>>>
>>> Did not get this question.
> 
> This give the impression that it's specific to compute engine, so I was
> hoping for something more generic like "execution unit" or simply "core"
> but I couldn't come up with anything better than this, so upto you.

Perhaps just GT. Let me check

> 
>>>> Also,
>>>>
>>>> [*] I'm not much informed about the history here but the 'error' term
>>>> seems slapped onto almost everything. We already know it's RAS so perhaps
>>>> we add it only where make sense and try to simplify some of the naming?
> 
> ...
> 
>>>>> +/**
>>>>> + * enum drm_xe_ras_error_class - Supported drm ras error classes.
>>>>> + */
>>>>> +enum drm_xe_ras_error_class {
>>>>> +	/** @DRM_XE_RAS_ERROR_CORE_COMPUTE: GT and Media Error */
>>>>> +	DRM_XE_RAS_ERROR_CORE_COMPUTE = 1,
>>>>> +	/** @DRM_XE_RAS_ERROR_SOC_INTERNAL: SOC Error */
>>>>> +	DRM_XE_RAS_ERROR_SOC_INTERNAL,
>>>>> +	/** @DRM_XE_RAS_ERROR_CLASS_MAX: Max Error */
>>>>> +	DRM_XE_RAS_ERROR_CLASS_MAX,	/* non-ABI */
>>>>> +};
>>>>
>>>> Also, all of the enums share the same DRM_XE_RAS_ERROR_* prefix, so let's try
>>>> to have distinguishable naming. Perhaps [*] would be useful here as well ;)
>>>
>>> DRM_XE_RAS_ERROR_SEVERITY_* will cause longer names. Any suggestions?
> 
> Already mentioned above[*], the key is to not overuse 'error' ;)
> 
> DRM_XE_RAS_SEVERITY_*
> DRM_XE_RAS_COMPONENT_*

Interest has been expressed in adding telemetry nodes as well.

https://patchwork.freedesktop.org/patch/666138/?series=118435&rev=5

I have kept the prefix (DRM_XE_RAS_ERROR) consistent with the first 
patch (type - ERROR_COUNTER) for alignment.

From my perspective, retaining the ERROR prefix would help to
differentiate if there are different node types.

Can you please have a look at the link and let me know if you still
think the same?

For differentiation, I will add SEVERITY and CLASS/COMPONENT.
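
A rough sketch of how the two enums might read with those infixes added.
The identifiers below are assumptions for illustration only (COMPONENT is
picked arbitrarily over CLASS); the next revision may well name them
differently.

```c
/*
 * Hypothetical naming sketch, not the final uAPI: the ERROR prefix is
 * kept and each enum gets its own infix so the two sets no longer
 * share one flat DRM_XE_RAS_ERROR_* namespace.
 */
enum drm_xe_ras_error_severity {
	DRM_XE_RAS_ERROR_SEVERITY_CORRECTABLE = 0,
	DRM_XE_RAS_ERROR_SEVERITY_NONFATAL,
	DRM_XE_RAS_ERROR_SEVERITY_FATAL,
	DRM_XE_RAS_ERROR_SEVERITY_MAX,	/* non-ABI */
};

enum drm_xe_ras_error_component {
	/* Starts at 1, same as the class enum in this patch. */
	DRM_XE_RAS_ERROR_COMPONENT_CORE_COMPUTE = 1,
	DRM_XE_RAS_ERROR_COMPONENT_SOC_INTERNAL,
	DRM_XE_RAS_ERROR_COMPONENT_MAX,	/* non-ABI */
};
```

The numeric values are unchanged from the current patch, so node-id and
error-id as seen over netlink would stay the same; only the C names move.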

Thanks
Riana

> 
> and so on ...
> 
>> Try this full version first and see what the outcome looks like...
>> If we are still respecting the line limits without ugly cuts, then let's go with it.
>> Otherwise try something shorter, ERR_SEV_ ... or something like that...
> 
> ... which can be further shortened with this idea.
> 
> Side note: I'm already using these on my local branch.
> 
> Raag



* Re: [PATCH v3 2/4] drm/xe/xe_drm_ras: Add support for drm ras
  2026-01-12  6:13           ` Riana Tauro
@ 2026-01-12 10:27             ` Raag Jadav
  0 siblings, 0 replies; 31+ messages in thread
From: Raag Jadav @ 2026-01-12 10:27 UTC (permalink / raw)
  To: Riana Tauro
  Cc: Rodrigo Vivi, intel-xe, dri-devel, aravind.iddamsetty,
	anshuman.gupta, joonas.lahtinen, simona.vetter, airlied,
	pratik.bari, joshua.santosh.ranjan, ashwin.kumar.kulkarni,
	shubham.kumar

On Mon, Jan 12, 2026 at 11:43:16AM +0530, Riana Tauro wrote:
> On 1/9/2026 9:28 PM, Raag Jadav wrote:
> > On Fri, Jan 09, 2026 at 09:13:31AM -0500, Rodrigo Vivi wrote:
> > > On Fri, Jan 09, 2026 at 01:38:44PM +0530, Riana Tauro wrote:
> > > > Hi Raag
> > > > 
> > > > Thank you for the review
> > > > 
> > > > On 12/9/2025 1:52 PM, Raag Jadav wrote:
> > > > > On Fri, Dec 05, 2025 at 02:09:34PM +0530, Riana Tauro wrote:
> > > > > > Allocate correctable, nonfatal and fatal nodes per xe device.
> > > > > > Each node contains error classes, counters and respective
> > > > > > query counter functions.
> > > > > > 
> > > > > > Add basic functionality to create and register drm nodes.
> > > > > > Below operations can be performed using Generic netlink DRM RAS interface

...

> > > > > > +/**
> > > > > > + * enum drm_xe_ras_error_class - Supported drm ras error classes.
> > > > > > + */
> > > > > > +enum drm_xe_ras_error_class {
> > > > > > +	/** @DRM_XE_RAS_ERROR_CORE_COMPUTE: GT and Media Error */
> > > > > > +	DRM_XE_RAS_ERROR_CORE_COMPUTE = 1,
> > > > > > +	/** @DRM_XE_RAS_ERROR_SOC_INTERNAL: SOC Error */
> > > > > > +	DRM_XE_RAS_ERROR_SOC_INTERNAL,
> > > > > > +	/** @DRM_XE_RAS_ERROR_CLASS_MAX: Max Error */
> > > > > > +	DRM_XE_RAS_ERROR_CLASS_MAX,	/* non-ABI */
> > > > > > +};
> > > > > 
> > > > > Also, all of the enums share the same DRM_XE_RAS_ERROR_* prefix, so let's try
> > > > > to have distinguishable naming. Perhaps [*] would be useful here as well ;)
> > > > 
> > > > DRM_XE_RAS_ERROR_SEVERITY_* will cause longer names. Any suggestions?
> > 
> > Already mentioned above[*], the key is to not overuse 'error' ;)
> > 
> > DRM_XE_RAS_SEVERITY_*
> > DRM_XE_RAS_COMPONENT_*
> 
> There's been an interest expressed to add telemetry nodes as well.
> 
> https://patchwork.freedesktop.org/patch/666138/?series=118435&rev=5
> 
> I have kept the prefix (DRM_XE_RAS_ERROR) consistent with the first patch
> (type - ERROR_COUNTER) for alignment.
> 
> From my perspective, retaining the ERROR prefix would help to
> differentiate if there are different node types.
> 
> Can you please have a look at the link and let me know if you still think
> the same?

Fair, whichever makes sense for the use case, and please excuse my
bikeshedding.

> For differentiation, I will add SEVERITY and CLASS/COMPONENT.

Thank you.

Raag


* Re: [PATCH v3 2/4] drm/xe/xe_drm_ras: Add support for drm ras
  2025-12-05  8:39 ` [PATCH v3 2/4] drm/xe/xe_drm_ras: Add support for drm ras Riana Tauro
  2025-12-09  8:22   ` Raag Jadav
@ 2025-12-09 21:57   ` Rodrigo Vivi
  2026-01-07  9:48     ` Aravind Iddamsetty
  1 sibling, 1 reply; 31+ messages in thread
From: Rodrigo Vivi @ 2025-12-09 21:57 UTC (permalink / raw)
  To: Riana Tauro, Aravind Iddamsetty, Joonas Lahtinen
  Cc: intel-xe, dri-devel, aravind.iddamsetty, anshuman.gupta,
	joonas.lahtinen, lukas, simona.vetter, airlied, pratik.bari,
	joshua.santosh.ranjan, ashwin.kumar.kulkarni, shubham.kumar

On Fri, Dec 05, 2025 at 02:09:34PM +0530, Riana Tauro wrote:
> Allocate correctable, nonfatal and fatal nodes per xe device.
> Each node contains error classes, counters and respective
> query counter functions.
> 
> Add basic functionality to create and register drm nodes.
> Below operations can be performed using Generic netlink DRM RAS interface
> 
> List Nodes:
> 
> $ sudo ynl --family drm_ras  --dump list-nodes
> [{'device-name': '0000:03:00.0',
>   'node-id': 0,
>   'node-name': 'correctable-errors',
>   'node-type': 'error-counter'},
>  {'device-name': '0000:03:00.0',
>   'node-id': 1,
>   'node-name': 'nonfatal-errors',
>   'node-type': 'error-counter'},
>  {'device-name': '0000:03:00.0',
>   'node-id': 2,
>   'node-name': 'fatal-errors',
>   'node-type': 'error-counter'}]
> 
> Get Error counters:
> 
> $ sudo ynl --family drm_ras  --dump get-error-counters --json '{"node-id":1}'
> [{'error-id': 1, 'error-name': 'Core Compute Error', 'error-value': 0},
>  {'error-id': 2, 'error-name': 'SOC Internal Error', 'error-value': 0}]
> 
> Query Error counter:
> 
> $ sudo ynl --family drm_ras --do query-error-counter  --json '{"node-id":1, "error-id":1}'
> {'error-id': 1, 'error-name': 'Core Compute Error', 'error-value': 0}
> 
> Signed-off-by: Riana Tauro <riana.tauro@intel.com>
> ---
> v2: Add ID's and names as uAPI (Rodrigo)
>     Add documentation
>     Modify commit message
> ---
>  drivers/gpu/drm/xe/Makefile           |   1 +
>  drivers/gpu/drm/xe/xe_device_types.h  |   4 +
>  drivers/gpu/drm/xe/xe_drm_ras.c       | 199 ++++++++++++++++++++++++++
>  drivers/gpu/drm/xe/xe_drm_ras.h       |  12 ++
>  drivers/gpu/drm/xe/xe_drm_ras_types.h |  40 ++++++
>  drivers/gpu/drm/xe/xe_hw_error.c      |  64 ++++-----
>  include/uapi/drm/xe_drm.h             |  82 +++++++++++
>  7 files changed, 368 insertions(+), 34 deletions(-)
>  create mode 100644 drivers/gpu/drm/xe/xe_drm_ras.c
>  create mode 100644 drivers/gpu/drm/xe/xe_drm_ras.h
>  create mode 100644 drivers/gpu/drm/xe/xe_drm_ras_types.h
> 
> diff --git a/drivers/gpu/drm/xe/Makefile b/drivers/gpu/drm/xe/Makefile
> index a7e13a676f7d..bc417ef19280 100644
> --- a/drivers/gpu/drm/xe/Makefile
> +++ b/drivers/gpu/drm/xe/Makefile
> @@ -41,6 +41,7 @@ xe-y += xe_bb.o \
>  	xe_device_sysfs.o \
>  	xe_dma_buf.o \
>  	xe_drm_client.o \
> +	xe_drm_ras.o \
>  	xe_eu_stall.o \
>  	xe_exec.o \
>  	xe_exec_queue.o \
> diff --git a/drivers/gpu/drm/xe/xe_device_types.h b/drivers/gpu/drm/xe/xe_device_types.h
> index 9de73353223f..d6ea275700e1 100644
> --- a/drivers/gpu/drm/xe/xe_device_types.h
> +++ b/drivers/gpu/drm/xe/xe_device_types.h
> @@ -13,6 +13,7 @@
>  #include <drm/ttm/ttm_device.h>
>  
>  #include "xe_devcoredump_types.h"
> +#include "xe_drm_ras_types.h"
>  #include "xe_heci_gsc.h"
>  #include "xe_late_bind_fw_types.h"
>  #include "xe_lmtt_types.h"
> @@ -361,6 +362,9 @@ struct xe_device {
>  		bool oob_initialized;
>  	} wa_active;
>  
> +	/** @ras: ras structure for device */
> +	struct xe_drm_ras ras;
> +
>  	/** @survivability: survivability information for device */
>  	struct xe_survivability survivability;
>  
> diff --git a/drivers/gpu/drm/xe/xe_drm_ras.c b/drivers/gpu/drm/xe/xe_drm_ras.c
> new file mode 100644
> index 000000000000..764b14b1edf8
> --- /dev/null
> +++ b/drivers/gpu/drm/xe/xe_drm_ras.c
> @@ -0,0 +1,199 @@
> +// SPDX-License-Identifier: MIT
> +/*
> + * Copyright © 2025 Intel Corporation
> + */
> +
> +#include <drm/drm_managed.h>
> +#include <drm/drm_ras.h>
> +#include <linux/bitmap.h>
> +
> +#include "xe_device.h"
> +#include "xe_drm_ras.h"
> +
> +static const char * const errors[] = DRM_XE_RAS_ERROR_CLASS_NAMES;
> +static const char * const error_severity[] = DRM_XE_RAS_ERROR_SEVERITY_NAMES;
> +
> +static int hw_query_error_counter(struct xe_drm_ras_counter *info,
> +				  u32 error_id, const char **name, u32 *val)
> +{
> +	if (error_id >= DRM_XE_RAS_ERROR_CLASS_MAX)
> +		return -EINVAL;
> +
> +	if (!info[error_id].name)
> +		return -ENOENT;
> +
> +	*name = info[error_id].name;
> +	*val = atomic64_read(&info[error_id].counter);
> +
> +	return 0;
> +}
> +
> +static int query_non_fatal_error_counters(struct drm_ras_node *ep,
> +					  u32 error_id, const char **name,
> +					  u32 *val)
> +{
> +	struct xe_device *xe = ep->priv;
> +	struct xe_drm_ras *ras = &xe->ras;
> +	struct xe_drm_ras_counter *info = ras->info[DRM_XE_RAS_ERROR_NONFATAL];
> +
> +	return hw_query_error_counter(info, error_id, name, val);
> +}
> +
> +static int query_fatal_error_counters(struct drm_ras_node *ep,
> +				      u32 error_id, const char **name,
> +				      u32 *val)
> +{
> +	struct xe_device *xe = ep->priv;
> +	struct xe_drm_ras *ras = &xe->ras;
> +	struct xe_drm_ras_counter *info = ras->info[DRM_XE_RAS_ERROR_FATAL];
> +
> +	return hw_query_error_counter(info, error_id, name, val);
> +}
> +
> +static int query_correctable_error_counters(struct drm_ras_node *ep,
> +					    u32 error_id, const char **name,
> +					    u32 *val)
> +{
> +	struct xe_device *xe = ep->priv;
> +	struct xe_drm_ras *ras = &xe->ras;
> +	struct xe_drm_ras_counter *info = ras->info[DRM_XE_RAS_ERROR_CORRECTABLE];
> +
> +	return hw_query_error_counter(info, error_id, name, val);
> +}
> +
> +static struct xe_drm_ras_counter *allocate_and_copy_counters(struct xe_device *xe,
> +							     int count)
> +{
> +	struct xe_drm_ras_counter *counter;
> +	int i;
> +
> +	counter = drmm_kzalloc(&xe->drm, count * sizeof(struct xe_drm_ras_counter), GFP_KERNEL);
> +	if (!counter)
> +		return ERR_PTR(-ENOMEM);
> +
> +	for (i = 0; i < count; i++) {
> +		if (!errors[i])
> +			continue;
> +
> +		counter[i].name = errors[i];
> +		atomic64_set(&counter[i].counter, 0);
> +	}
> +
> +	return counter;
> +}
> +
> +static int assign_node_params(struct xe_device *xe, struct drm_ras_node *node,
> +			      const enum drm_xe_ras_error_severity severity)
> +{
> +	struct xe_drm_ras *ras = &xe->ras;
> +	int count = 0, ret = 0;
> +
> +	count = DRM_XE_RAS_ERROR_CLASS_MAX;
> +	node->error_counter_range.first = DRM_XE_RAS_ERROR_CORE_COMPUTE;
> +	node->error_counter_range.last = DRM_XE_RAS_ERROR_CLASS_MAX - 1;
> +
> +	ras->info[severity] = allocate_and_copy_counters(xe, count);
> +	if (IS_ERR(ras->info[severity]))
> +		return PTR_ERR(ras->info[severity]);
> +
> +	switch (severity) {
> +	case DRM_XE_RAS_ERROR_CORRECTABLE:
> +		node->query_error_counter = query_correctable_error_counters;
> +		break;
> +	case DRM_XE_RAS_ERROR_NONFATAL:
> +		node->query_error_counter = query_non_fatal_error_counters;
> +		break;
> +	case DRM_XE_RAS_ERROR_FATAL:
> +		node->query_error_counter = query_fatal_error_counters;
> +		break;
> +	default:
> +		break;
> +	}
> +
> +	return ret;
> +}
> +
> +static int register_nodes(struct xe_device *xe)
> +{
> +	struct pci_dev *pdev = to_pci_dev(xe->drm.dev);
> +	struct xe_drm_ras *ras = &xe->ras;
> +	const char *device_name;
> +	int i = 0, ret;
> +
> +	device_name = kasprintf(GFP_KERNEL, "%04x:%02x:%02x.%d",
> +				pci_domain_nr(pdev->bus), pdev->bus->number,
> +				PCI_SLOT(pdev->devfn), PCI_FUNC(pdev->devfn));
> +
> +	for (i = 0; i < DRM_XE_RAS_ERROR_SEVERITY_MAX; i++) {
> +		struct drm_ras_node *node = &ras->node[i];
> +
> +		node->device_name = device_name;
> +		node->node_name = error_severity[i];
> +		node->type = DRM_RAS_NODE_TYPE_ERROR_COUNTER;
> +		node->priv = xe;
> +
> +		ret = assign_node_params(xe, node, i);
> +		if (ret)
> +			return ret;
> +
> +		ret = drm_ras_node_register(node);
> +		if (ret) {
> +			drm_err(&xe->drm, "Failed to register drm ras tile node\n");
> +			return ret;
> +		}
> +	}
> +
> +	return 0;
> +}
> +
> +static void xe_drm_ras_unregister_nodes(void *arg)
> +{
> +	struct xe_device *xe = arg;
> +	struct xe_drm_ras *ras = &xe->ras;
> +	int i = 0;
> +
> +	for (i = 0; i < DRM_XE_RAS_ERROR_SEVERITY_MAX; i++) {
> +		struct drm_ras_node *node = &ras->node[i];
> +
> +		drm_ras_node_unregister(node);
> +
> +		if (i == 0)
> +			kfree(node->device_name);
> +	}
> +}
> +
> +/**
> + * xe_drm_ras_allocate_nodes - Allocate drm ras nodes
> + * @xe: xe device instance
> + *
> + * Allocate xe drm ras nodes for all error severities per device
> + *
> + * Return: 0 on success, error code on failure
> + */
> +int xe_drm_ras_allocate_nodes(struct xe_device *xe)
> +{
> +	struct xe_drm_ras *ras = &xe->ras;
> +	struct drm_ras_node *node;
> +	int err;
> +
> +	node = drmm_kzalloc(&xe->drm, DRM_XE_RAS_ERROR_SEVERITY_MAX * sizeof(struct drm_ras_node),
> +			    GFP_KERNEL);
> +	if (!node)
> +		return -ENOMEM;
> +
> +	ras->node = node;
> +
> +	err = register_nodes(xe);
> +	if (err) {
> +		drm_err(&xe->drm, "Failed to register drm ras node\n");
> +		return err;
> +	}
> +
> +	err = devm_add_action_or_reset(xe->drm.dev, xe_drm_ras_unregister_nodes, xe);
> +	if (err) {
> +		drm_err(&xe->drm, "Failed to add action for xe drm_ras\n");
> +		return err;
> +	}
> +
> +	return 0;
> +}
> diff --git a/drivers/gpu/drm/xe/xe_drm_ras.h b/drivers/gpu/drm/xe/xe_drm_ras.h
> new file mode 100644
> index 000000000000..6272b5da4e6d
> --- /dev/null
> +++ b/drivers/gpu/drm/xe/xe_drm_ras.h
> @@ -0,0 +1,12 @@
> +/* SPDX-License-Identifier: MIT */
> +/*
> + * Copyright © 2025 Intel Corporation
> + */
> +#ifndef XE_DRM_RAS_H_
> +#define XE_DRM_RAS_H_
> +
> +struct xe_device;
> +
> +int xe_drm_ras_allocate_nodes(struct xe_device *xe);
> +
> +#endif
> diff --git a/drivers/gpu/drm/xe/xe_drm_ras_types.h b/drivers/gpu/drm/xe/xe_drm_ras_types.h
> new file mode 100644
> index 000000000000..409d6fa54a23
> --- /dev/null
> +++ b/drivers/gpu/drm/xe/xe_drm_ras_types.h
> @@ -0,0 +1,40 @@
> +/* SPDX-License-Identifier: MIT */
> +/*
> + * Copyright © 2025 Intel Corporation
> + */
> +
> +#ifndef _XE_DRM_RAS_TYPES_H_
> +#define _XE_DRM_RAS_TYPES_H_
> +
> +#include <drm/xe_drm.h>
> +#include <linux/atomic.h>
> +
> +struct drm_ras_node;
> +
> +/**
> + * struct xe_drm_ras_counter - xe ras counter
> + *
> + * This structure contains error class and counter information
> + */
> +struct xe_drm_ras_counter {
> +	/** @name: error class name */
> +	const char *name;
> +	/** @counter: count of error */
> +	atomic64_t counter;
> +};
> +
> +/**
> + * struct xe_drm_ras - xe drm ras structure
> + *
> + * This structure has details of error counters
> + */
> +struct xe_drm_ras {
> +	/** @node: DRM RAS node */
> +	struct drm_ras_node *node;
> +
> +	/** @info: info array for all types of errors */
> +	struct xe_drm_ras_counter *info[DRM_XE_RAS_ERROR_SEVERITY_MAX];
> +
> +};
> +
> +#endif
> diff --git a/drivers/gpu/drm/xe/xe_hw_error.c b/drivers/gpu/drm/xe/xe_hw_error.c
> index 8c65291f36fc..d63078d00b56 100644
> --- a/drivers/gpu/drm/xe/xe_hw_error.c
> +++ b/drivers/gpu/drm/xe/xe_hw_error.c
> @@ -10,20 +10,14 @@
>  #include "regs/xe_irq_regs.h"
>  
>  #include "xe_device.h"
> +#include "xe_drm_ras.h"
>  #include "xe_hw_error.h"
>  #include "xe_mmio.h"
>  #include "xe_survivability_mode.h"
>  
>  #define  HEC_UNCORR_FW_ERR_BITS 4
>  extern struct fault_attr inject_csc_hw_error;
> -
> -/* Error categories reported by hardware */
> -enum hardware_error {
> -	HARDWARE_ERROR_CORRECTABLE = 0,
> -	HARDWARE_ERROR_NONFATAL = 1,
> -	HARDWARE_ERROR_FATAL = 2,
> -	HARDWARE_ERROR_MAX,
> -};
> +static const char * const error_severity[] = DRM_XE_RAS_ERROR_SEVERITY_NAMES;
>  
>  static const char * const hec_uncorrected_fw_errors[] = {
>  	"Fatal",
> @@ -32,20 +26,6 @@ static const char * const hec_uncorrected_fw_errors[] = {
>  	"Data Corruption"
>  };
>  
> -static const char *hw_error_to_str(const enum hardware_error hw_err)
> -{
> -	switch (hw_err) {
> -	case HARDWARE_ERROR_CORRECTABLE:
> -		return "CORRECTABLE";
> -	case HARDWARE_ERROR_NONFATAL:
> -		return "NONFATAL";
> -	case HARDWARE_ERROR_FATAL:
> -		return "FATAL";
> -	default:
> -		return "UNKNOWN";
> -	}
> -}
> -
>  static bool fault_inject_csc_hw_error(void)
>  {
>  	return IS_ENABLED(CONFIG_DEBUG_FS) && should_fail(&inject_csc_hw_error, 1);
> @@ -62,9 +42,10 @@ static void csc_hw_error_work(struct work_struct *work)
>  		drm_err(&xe->drm, "Failed to enable runtime survivability mode\n");
>  }
>  
> -static void csc_hw_error_handler(struct xe_tile *tile, const enum hardware_error hw_err)
> +static void csc_hw_error_handler(struct xe_tile *tile,
> +				 const enum drm_xe_ras_error_severity severity)
>  {
> -	const char *hw_err_str = hw_error_to_str(hw_err);
> +	const char *severity_str = error_severity[severity];
>  	struct xe_device *xe = tile_to_xe(tile);
>  	struct xe_mmio *mmio = &tile->mmio;
>  	u32 base, err_bit, err_src;
> @@ -78,7 +59,7 @@ static void csc_hw_error_handler(struct xe_tile *tile, const enum hardware_error
>  	err_src = xe_mmio_read32(mmio, HEC_UNCORR_ERR_STATUS(base));
>  	if (!err_src) {
>  		drm_err_ratelimited(&xe->drm, HW_ERR "Tile%d reported HEC_ERR_STATUS_%s blank\n",
> -				    tile->id, hw_err_str);
> +				    tile->id, severity_str);
>  		return;
>  	}
>  
> @@ -87,7 +68,7 @@ static void csc_hw_error_handler(struct xe_tile *tile, const enum hardware_error
>  		for_each_set_bit(err_bit, &fw_err, HEC_UNCORR_FW_ERR_BITS) {
>  			drm_err_ratelimited(&xe->drm, HW_ERR
>  					    "%s: HEC Uncorrected FW %s error reported, bit[%d] is set\n",
> -					     hw_err_str, hec_uncorrected_fw_errors[err_bit],
> +					     severity_str, hec_uncorrected_fw_errors[err_bit],
>  					     err_bit);
>  
>  			schedule_work(&tile->csc_hw_error_work);
> @@ -97,9 +78,9 @@ static void csc_hw_error_handler(struct xe_tile *tile, const enum hardware_error
>  	xe_mmio_write32(mmio, HEC_UNCORR_ERR_STATUS(base), err_src);
>  }
>  
> -static void hw_error_source_handler(struct xe_tile *tile, const enum hardware_error hw_err)
> +static void hw_error_source_handler(struct xe_tile *tile, enum drm_xe_ras_error_severity severity)
>  {
> -	const char *hw_err_str = hw_error_to_str(hw_err);
> +	const char *severity_str = error_severity[severity];
>  	struct xe_device *xe = tile_to_xe(tile);
>  	unsigned long flags;
>  	u32 err_src;
> @@ -108,17 +89,17 @@ static void hw_error_source_handler(struct xe_tile *tile, const enum hardware_er
>  		return;
>  
>  	spin_lock_irqsave(&xe->irq.lock, flags);
> -	err_src = xe_mmio_read32(&tile->mmio, DEV_ERR_STAT_REG(hw_err));
> +	err_src = xe_mmio_read32(&tile->mmio, DEV_ERR_STAT_REG(severity));
>  	if (!err_src) {
>  		drm_err_ratelimited(&xe->drm, HW_ERR "Tile%d reported DEV_ERR_STAT_%s blank!\n",
> -				    tile->id, hw_err_str);
> +				    tile->id, severity_str);
>  		goto unlock;
>  	}
>  
>  	if (err_src & XE_CSC_ERROR)
> -		csc_hw_error_handler(tile, hw_err);
> +		csc_hw_error_handler(tile, severity);
>  
> -	xe_mmio_write32(&tile->mmio, DEV_ERR_STAT_REG(hw_err), err_src);
> +	xe_mmio_write32(&tile->mmio, DEV_ERR_STAT_REG(severity), err_src);
>  
>  unlock:
>  	spin_unlock_irqrestore(&xe->irq.lock, flags);
> @@ -136,16 +117,30 @@ static void hw_error_source_handler(struct xe_tile *tile, const enum hardware_er
>   */
>  void xe_hw_error_irq_handler(struct xe_tile *tile, const u32 master_ctl)
>  {
> -	enum hardware_error hw_err;
> +	u32 hw_err;
>  
>  	if (fault_inject_csc_hw_error())
>  		schedule_work(&tile->csc_hw_error_work);
>  
> -	for (hw_err = 0; hw_err < HARDWARE_ERROR_MAX; hw_err++)
> +	for (hw_err = 0; hw_err < DRM_XE_RAS_ERROR_SEVERITY_MAX; hw_err++)
>  		if (master_ctl & ERROR_IRQ(hw_err))
>  			hw_error_source_handler(tile, hw_err);
>  }
>  
> +static int hw_error_info_init(struct xe_device *xe)
> +{
> +	int ret;
> +
> +	if (xe->info.platform != XE_PVC)
> +		return 0;
> +
> +	ret = xe_drm_ras_allocate_nodes(xe);
> +	if (ret)
> +		return ret;
> +
> +	return 0;
> +}
> +
>  /*
>   * Process hardware errors during boot
>   */
> @@ -178,5 +173,6 @@ void xe_hw_error_init(struct xe_device *xe)
>  
>  	INIT_WORK(&tile->csc_hw_error_work, csc_hw_error_work);
>  
> +	hw_error_info_init(xe);
>  	process_hw_errors(xe);
>  }
> diff --git a/include/uapi/drm/xe_drm.h b/include/uapi/drm/xe_drm.h
> index 0d99bb0cd20a..3f6c38908b70 100644
> --- a/include/uapi/drm/xe_drm.h
> +++ b/include/uapi/drm/xe_drm.h
> @@ -2294,6 +2294,88 @@ struct drm_xe_vm_query_mem_range_attr {
>  
>  };
>  
> +/**
> + * DOC: Xe DRM RAS
> + *
> + * The enums and strings defined below map to the attributes of the DRM RAS Netlink Interface.
> + * Refer to Documentation/netlink/specs/drm_ras.yaml for complete interface specification.
> + *
> + * Node Registration
> + * =================
> + *
> + * The driver registers DRM RAS nodes for each error severity level.
> + * enum drm_xe_ras_error_severity defines the node-id, while DRM_XE_RAS_ERROR_SEVERITY_NAMES maps
> + * node-id to node-name.
> + *
> + * Error Classification
> + * ====================
> + *
> + * Each node contains a list of error counters. Each error is identified by a error-id and
> + * an error-name. enum drm_xe_ras_error_class defines the error-id, while
> + * DRM_XE_RAS_ERROR_CLASS_NAMES maps error-id to error-name.
> + *
> + * User Interface
> + * ==============
> + *
> + * To retrieve error values of a error counter, userspace applications should
> + * follow the below steps:
> + *
> + * 1. Use command LIST_NODES to enumerate all available nodes
> + * 2. Select node by node-id or node-name
> + * 3. Use command GET_ERROR_COUNTERS to list errors of specific node
> + * 4. Query specific error values using either error-id or error-name
> + *
> + * .. code-block:: C
> + *
> + *	// Lookup tables for ID-to-name resolution
> + *	static const char *nodes[] = DRM_XE_RAS_ERROR_SEVERITY_NAMES;
> + *	static const char *errors[] = DRM_XE_RAS_ERROR_CLASS_NAMES;
> + *
> + */
> +
> +/**
> + * enum drm_xe_ras_error_severity - Supported drm ras error severity.
> + */
> +enum drm_xe_ras_error_severity {
> +	/** @DRM_XE_RAS_ERROR_CORRECTABLE: Correctable Error */
> +	DRM_XE_RAS_ERROR_CORRECTABLE = 0,
> +	/** @DRM_XE_RAS_ERROR_NONFATAL: Non fatal Error */
> +	DRM_XE_RAS_ERROR_NONFATAL,
> +	/** @DRM_XE_RAS_ERROR_FATAL: Fatal error */
> +	DRM_XE_RAS_ERROR_FATAL,
> +	/** @DRM_XE_RAS_ERROR_SEVERITY_MAX: Max severity */
> +	DRM_XE_RAS_ERROR_SEVERITY_MAX, /* non-ABI */
> +};
> +
> +/**
> + * enum drm_xe_ras_error_class - Supported drm ras error classes.
> + */
> +enum drm_xe_ras_error_class {
> +	/** @DRM_XE_RAS_ERROR_CORE_COMPUTE: GT and Media Error */
> +	DRM_XE_RAS_ERROR_CORE_COMPUTE = 1,
> +	/** @DRM_XE_RAS_ERROR_SOC_INTERNAL: SOC Error */
> +	DRM_XE_RAS_ERROR_SOC_INTERNAL,
> +	/** @DRM_XE_RAS_ERROR_CLASS_MAX: Max Error */
> +	DRM_XE_RAS_ERROR_CLASS_MAX,	/* non-ABI */
> +};
> +
> +/*
> + * Error severity to name mapping.
> + */
> +#define DRM_XE_RAS_ERROR_SEVERITY_NAMES {				\
> +	[DRM_XE_RAS_ERROR_CORRECTABLE] = "correctable-errors",		\
> +	[DRM_XE_RAS_ERROR_NONFATAL] = "nonfatal-errors",		\
> +	[DRM_XE_RAS_ERROR_FATAL] = "fatal-errors",			\
> +}
> +
> +/*
> + * Error class to name mapping.
> + */
> +#define DRM_XE_RAS_ERROR_CLASS_NAMES {					\
> +	[DRM_XE_RAS_ERROR_CORE_COMPUTE] =  "Core Compute Error",	\
> +	[DRM_XE_RAS_ERROR_SOC_INTERNAL] =  "SOC Internal Error",	\


These look good to me.
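
For anyone reading along, a minimal sketch of how a userspace tool could
resolve the node-id and error-id values returned over netlink back to names
using these tables. The enums and macros are copied from the hunk above
because the header is not merged yet; xe_ras_node_name() and
xe_ras_error_name() are hypothetical helpers and not part of the patch.

```c
#include <stddef.h>

/* Copied from the proposed include/uapi/drm/xe_drm.h in this patch. */
enum drm_xe_ras_error_severity {
	DRM_XE_RAS_ERROR_CORRECTABLE = 0,
	DRM_XE_RAS_ERROR_NONFATAL,
	DRM_XE_RAS_ERROR_FATAL,
	DRM_XE_RAS_ERROR_SEVERITY_MAX,	/* non-ABI */
};

enum drm_xe_ras_error_class {
	DRM_XE_RAS_ERROR_CORE_COMPUTE = 1,
	DRM_XE_RAS_ERROR_SOC_INTERNAL,
	DRM_XE_RAS_ERROR_CLASS_MAX,	/* non-ABI */
};

#define DRM_XE_RAS_ERROR_SEVERITY_NAMES {				\
	[DRM_XE_RAS_ERROR_CORRECTABLE] = "correctable-errors",		\
	[DRM_XE_RAS_ERROR_NONFATAL] = "nonfatal-errors",		\
	[DRM_XE_RAS_ERROR_FATAL] = "fatal-errors",			\
}

#define DRM_XE_RAS_ERROR_CLASS_NAMES {					\
	[DRM_XE_RAS_ERROR_CORE_COMPUTE] =  "Core Compute Error",	\
	[DRM_XE_RAS_ERROR_SOC_INTERNAL] =  "SOC Internal Error",	\
}

/* Lookup tables for ID-to-name resolution, as in the DOC section. */
static const char * const nodes[] = DRM_XE_RAS_ERROR_SEVERITY_NAMES;
static const char * const errors[] = DRM_XE_RAS_ERROR_CLASS_NAMES;

/* Hypothetical helper: node-id to node-name, NULL when out of range. */
static const char *xe_ras_node_name(unsigned int node_id)
{
	if (node_id >= DRM_XE_RAS_ERROR_SEVERITY_MAX)
		return NULL;

	return nodes[node_id];
}

/*
 * Hypothetical helper: error-id to error-name. The class enum starts
 * at 1, so slot 0 of the table is never initialised and yields NULL.
 */
static const char *xe_ras_error_name(unsigned int error_id)
{
	if (error_id >= DRM_XE_RAS_ERROR_CLASS_MAX)
		return NULL;

	return errors[error_id];
}
```

One thing this makes visible: error-id 0 has no name, so callers need the
NULL check even for ids that are inside the table bounds.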

Cc: Joonas Lahtinen <joonas.lahtinen@linux.intel.com>
Cc: Aravind Iddamsetty <aravind.iddamsetty@linux.intel.com>

Joonas, Aravind, does this align with what you had in mind for the uAPI?

Thanks,
Rodrigo.

> +}
> +
>  #if defined(__cplusplus)
>  }
>  #endif
> -- 
> 2.47.1
> 


* Re: [PATCH v3 2/4] drm/xe/xe_drm_ras: Add support for drm ras
  2025-12-09 21:57   ` Rodrigo Vivi
@ 2026-01-07  9:48     ` Aravind Iddamsetty
  0 siblings, 0 replies; 31+ messages in thread
From: Aravind Iddamsetty @ 2026-01-07  9:48 UTC (permalink / raw)
  To: Rodrigo Vivi, Riana Tauro, Joonas Lahtinen
  Cc: intel-xe, dri-devel, anshuman.gupta, lukas, simona.vetter,
	airlied, pratik.bari, joshua.santosh.ranjan,
	ashwin.kumar.kulkarni, shubham.kumar


On 10-12-2025 03:27, Rodrigo Vivi wrote:
> On Fri, Dec 05, 2025 at 02:09:34PM +0530, Riana Tauro wrote:
>> Allocate correctable, nonfatal and fatal nodes per xe device.
>> Each node contains error classes, counters and respective
>> query counter functions.
>>
>> Add basic functionality to create and register drm nodes.
>> Below operations can be performed using Generic netlink DRM RAS interface
>>
>> List Nodes:
>>
>> $ sudo ynl --family drm_ras  --dump list-nodes
>> [{'device-name': '0000:03:00.0',
>>   'node-id': 0,
>>   'node-name': 'correctable-errors',
>>   'node-type': 'error-counter'},
>>  {'device-name': '0000:03:00.0',
>>   'node-id': 1,
>>   'node-name': 'nonfatal-errors',
>>   'node-type': 'error-counter'},
>>  {'device-name': '0000:03:00.0',
>>   'node-id': 2,
>>   'node-name': 'fatal-errors',
>>   'node-type': 'error-counter'}]
>>
>> Get Error counters:
>>
>> $ sudo ynl --family drm_ras  --dump get-error-counters --json '{"node-id":1}'
>> [{'error-id': 1, 'error-name': 'Core Compute Error', 'error-value': 0},
>>  {'error-id': 2, 'error-name': 'SOC Internal Error', 'error-value': 0}]
>>
>> Query Error counter:
>>
>> $ sudo ynl --family drm_ras --do query-error-counter  --json '{"node-id":1, "error-id":1}'
>> {'error-id': 1, 'error-name': 'Core Compute Error', 'error-value': 0}
>>
>> Signed-off-by: Riana Tauro <riana.tauro@intel.com>
>> ---
>> v2: Add ID's and names as uAPI (Rodrigo)
>>     Add documentation
>>     Modify commit message
>> ---
>>  drivers/gpu/drm/xe/Makefile           |   1 +
>>  drivers/gpu/drm/xe/xe_device_types.h  |   4 +
>>  drivers/gpu/drm/xe/xe_drm_ras.c       | 199 ++++++++++++++++++++++++++
>>  drivers/gpu/drm/xe/xe_drm_ras.h       |  12 ++
>>  drivers/gpu/drm/xe/xe_drm_ras_types.h |  40 ++++++
>>  drivers/gpu/drm/xe/xe_hw_error.c      |  64 ++++-----
>>  include/uapi/drm/xe_drm.h             |  82 +++++++++++
>>  7 files changed, 368 insertions(+), 34 deletions(-)
>>  create mode 100644 drivers/gpu/drm/xe/xe_drm_ras.c
>>  create mode 100644 drivers/gpu/drm/xe/xe_drm_ras.h
>>  create mode 100644 drivers/gpu/drm/xe/xe_drm_ras_types.h
>>
>> diff --git a/drivers/gpu/drm/xe/Makefile b/drivers/gpu/drm/xe/Makefile
>> index a7e13a676f7d..bc417ef19280 100644
>> --- a/drivers/gpu/drm/xe/Makefile
>> +++ b/drivers/gpu/drm/xe/Makefile
>> @@ -41,6 +41,7 @@ xe-y += xe_bb.o \
>>  	xe_device_sysfs.o \
>>  	xe_dma_buf.o \
>>  	xe_drm_client.o \
>> +	xe_drm_ras.o \
>>  	xe_eu_stall.o \
>>  	xe_exec.o \
>>  	xe_exec_queue.o \
>> diff --git a/drivers/gpu/drm/xe/xe_device_types.h b/drivers/gpu/drm/xe/xe_device_types.h
>> index 9de73353223f..d6ea275700e1 100644
>> --- a/drivers/gpu/drm/xe/xe_device_types.h
>> +++ b/drivers/gpu/drm/xe/xe_device_types.h
>> @@ -13,6 +13,7 @@
>>  #include <drm/ttm/ttm_device.h>
>>  
>>  #include "xe_devcoredump_types.h"
>> +#include "xe_drm_ras_types.h"
>>  #include "xe_heci_gsc.h"
>>  #include "xe_late_bind_fw_types.h"
>>  #include "xe_lmtt_types.h"
>> @@ -361,6 +362,9 @@ struct xe_device {
>>  		bool oob_initialized;
>>  	} wa_active;
>>  
>> +	/** @ras: ras structure for device */
>> +	struct xe_drm_ras ras;
>> +
>>  	/** @survivability: survivability information for device */
>>  	struct xe_survivability survivability;
>>  
>> diff --git a/drivers/gpu/drm/xe/xe_drm_ras.c b/drivers/gpu/drm/xe/xe_drm_ras.c
>> new file mode 100644
>> index 000000000000..764b14b1edf8
>> --- /dev/null
>> +++ b/drivers/gpu/drm/xe/xe_drm_ras.c
>> @@ -0,0 +1,199 @@
>> +// SPDX-License-Identifier: MIT
>> +/*
>> + * Copyright © 2025 Intel Corporation
>> + */
>> +
>> +#include <drm/drm_managed.h>
>> +#include <drm/drm_ras.h>
>> +#include <linux/bitmap.h>
>> +
>> +#include "xe_device.h"
>> +#include "xe_drm_ras.h"
>> +
>> +static const char * const errors[] = DRM_XE_RAS_ERROR_CLASS_NAMES;
>> +static const char * const error_severity[] = DRM_XE_RAS_ERROR_SEVERITY_NAMES;
>> +
>> +static int hw_query_error_counter(struct xe_drm_ras_counter *info,
>> +				  u32 error_id, const char **name, u32 *val)
>> +{
>> +	if (error_id >= DRM_XE_RAS_ERROR_CLASS_MAX)
>> +		return -EINVAL;
>> +
>> +	if (!info[error_id].name)
>> +		return -ENOENT;
>> +
>> +	*name = info[error_id].name;
>> +	*val = atomic64_read(&info[error_id].counter);
>> +
>> +	return 0;
>> +}
>> +
>> +static int query_non_fatal_error_counters(struct drm_ras_node *ep,
>> +					  u32 error_id, const char **name,
>> +					  u32 *val)
>> +{
>> +	struct xe_device *xe = ep->priv;
>> +	struct xe_drm_ras *ras = &xe->ras;
>> +	struct xe_drm_ras_counter *info = ras->info[DRM_XE_RAS_ERROR_NONFATAL];
>> +
>> +	return hw_query_error_counter(info, error_id, name, val);
>> +}
>> +
>> +static int query_fatal_error_counters(struct drm_ras_node *ep,
>> +				      u32 error_id, const char **name,
>> +				      u32 *val)
>> +{
>> +	struct xe_device *xe = ep->priv;
>> +	struct xe_drm_ras *ras = &xe->ras;
>> +	struct xe_drm_ras_counter *info = ras->info[DRM_XE_RAS_ERROR_FATAL];
>> +
>> +	return hw_query_error_counter(info, error_id, name, val);
>> +}
>> +
>> +static int query_correctable_error_counters(struct drm_ras_node *ep,
>> +					    u32 error_id, const char **name,
>> +					    u32 *val)
>> +{
>> +	struct xe_device *xe = ep->priv;
>> +	struct xe_drm_ras *ras = &xe->ras;
>> +	struct xe_drm_ras_counter *info = ras->info[DRM_XE_RAS_ERROR_CORRECTABLE];
>> +
>> +	return hw_query_error_counter(info, error_id, name, val);
>> +}
>> +
>> +static struct xe_drm_ras_counter *allocate_and_copy_counters(struct xe_device *xe,
>> +							     int count)
>> +{
>> +	struct xe_drm_ras_counter *counter;
>> +	int i;
>> +
>> +	counter = drmm_kzalloc(&xe->drm, count * sizeof(struct xe_drm_ras_counter), GFP_KERNEL);
>> +	if (!counter)
>> +		return ERR_PTR(-ENOMEM);
>> +
>> +	for (i = 0; i < count; i++) {
>> +		if (!errors[i])
>> +			continue;
>> +
>> +		counter[i].name = errors[i];
>> +		atomic64_set(&counter[i].counter, 0);
>> +	}
>> +
>> +	return counter;
>> +}
>> +
>> +static int assign_node_params(struct xe_device *xe, struct drm_ras_node *node,
>> +			      const enum drm_xe_ras_error_severity severity)
>> +{
>> +	struct xe_drm_ras *ras = &xe->ras;
>> +	int count = 0, ret = 0;
>> +
>> +	count = DRM_XE_RAS_ERROR_CLASS_MAX;
>> +	node->error_counter_range.first = DRM_XE_RAS_ERROR_CORE_COMPUTE;
>> +	node->error_counter_range.last = DRM_XE_RAS_ERROR_CLASS_MAX - 1;
>> +
>> +	ras->info[severity] = allocate_and_copy_counters(xe, count);
>> +	if (IS_ERR(ras->info[severity]))
>> +		return PTR_ERR(ras->info[severity]);
>> +
>> +	switch (severity) {
>> +	case DRM_XE_RAS_ERROR_CORRECTABLE:
>> +		node->query_error_counter = query_correctable_error_counters;
>> +		break;
>> +	case DRM_XE_RAS_ERROR_NONFATAL:
>> +		node->query_error_counter = query_non_fatal_error_counters;
>> +		break;
>> +	case DRM_XE_RAS_ERROR_FATAL:
>> +		node->query_error_counter = query_fatal_error_counters;
>> +		break;
>> +	default:
>> +		break;
>> +	}
>> +
>> +	return ret;
>> +}
>> +
>> +static int register_nodes(struct xe_device *xe)
>> +{
>> +	struct pci_dev *pdev = to_pci_dev(xe->drm.dev);
>> +	struct xe_drm_ras *ras = &xe->ras;
>> +	const char *device_name;
>> +	int i = 0, ret;
>> +
>> +	device_name = kasprintf(GFP_KERNEL, "%04x:%02x:%02x.%d",
>> +				pci_domain_nr(pdev->bus), pdev->bus->number,
>> +				PCI_SLOT(pdev->devfn), PCI_FUNC(pdev->devfn));
>> +
>> +	for (i = 0; i < DRM_XE_RAS_ERROR_SEVERITY_MAX; i++) {
>> +		struct drm_ras_node *node = &ras->node[i];
>> +
>> +		node->device_name = device_name;
>> +		node->node_name = error_severity[i];
>> +		node->type = DRM_RAS_NODE_TYPE_ERROR_COUNTER;
>> +		node->priv = xe;
>> +
>> +		ret = assign_node_params(xe, node, i);
>> +		if (ret)
>> +			return ret;
>> +
>> +		ret = drm_ras_node_register(node);
>> +		if (ret) {
>> +			drm_err(&xe->drm, "Failed to register drm ras tile node\n");
>> +			return ret;
>> +		}
>> +	}
>> +
>> +	return 0;
>> +}
>> +
>> +static void xe_drm_ras_unregister_nodes(void *arg)
>> +{
>> +	struct xe_device *xe = arg;
>> +	struct xe_drm_ras *ras = &xe->ras;
>> +	int i = 0;
>> +
>> +	for (i = 0; i < DRM_XE_RAS_ERROR_SEVERITY_MAX; i++) {
>> +		struct drm_ras_node *node = &ras->node[i];
>> +
>> +		drm_ras_node_unregister(node);
>> +
>> +		if (i == 0)
>> +			kfree(node->device_name);
>> +	}
>> +}
>> +
>> +/**
>> + * xe_drm_ras_allocate_nodes - Allocate drm ras nodes
>> + * @xe: xe device instance
>> + *
>> + * Allocate xe drm ras nodes for all error severities per device
>> + *
>> + * Return: 0 on success, error code on failure
>> + */
>> +int xe_drm_ras_allocate_nodes(struct xe_device *xe)
>> +{
>> +	struct xe_drm_ras *ras = &xe->ras;
>> +	struct drm_ras_node *node;
>> +	int err;
>> +
>> +	node = drmm_kzalloc(&xe->drm, DRM_XE_RAS_ERROR_SEVERITY_MAX * sizeof(struct drm_ras_node),
>> +			    GFP_KERNEL);
>> +	if (!node)
>> +		return -ENOMEM;
>> +
>> +	ras->node = node;
>> +
>> +	err = register_nodes(xe);
>> +	if (err) {
>> +		drm_err(&xe->drm, "Failed to register drm ras node\n");
>> +		return err;
>> +	}
>> +
>> +	err = devm_add_action_or_reset(xe->drm.dev, xe_drm_ras_unregister_nodes, xe);
>> +	if (err) {
>> +		drm_err(&xe->drm, "Failed to add action for xe drm_ras\n");
>> +		return err;
>> +	}
>> +
>> +	return 0;
>> +}
>> diff --git a/drivers/gpu/drm/xe/xe_drm_ras.h b/drivers/gpu/drm/xe/xe_drm_ras.h
>> new file mode 100644
>> index 000000000000..6272b5da4e6d
>> --- /dev/null
>> +++ b/drivers/gpu/drm/xe/xe_drm_ras.h
>> @@ -0,0 +1,12 @@
>> +/* SPDX-License-Identifier: MIT */
>> +/*
>> + * Copyright © 2025 Intel Corporation
>> + */
>> +#ifndef XE_DRM_RAS_H_
>> +#define XE_DRM_RAS_H_
>> +
>> +struct xe_device;
>> +
>> +int xe_drm_ras_allocate_nodes(struct xe_device *xe);
>> +
>> +#endif
>> diff --git a/drivers/gpu/drm/xe/xe_drm_ras_types.h b/drivers/gpu/drm/xe/xe_drm_ras_types.h
>> new file mode 100644
>> index 000000000000..409d6fa54a23
>> --- /dev/null
>> +++ b/drivers/gpu/drm/xe/xe_drm_ras_types.h
>> @@ -0,0 +1,40 @@
>> +/* SPDX-License-Identifier: MIT */
>> +/*
>> + * Copyright © 2025 Intel Corporation
>> + */
>> +
>> +#ifndef _XE_DRM_RAS_TYPES_H_
>> +#define _XE_DRM_RAS_TYPES_H_
>> +
>> +#include <drm/xe_drm.h>
>> +#include <linux/atomic.h>
>> +
>> +struct drm_ras_node;
>> +
>> +/**
>> + * struct xe_drm_ras_counter - xe ras counter
>> + *
>> + * This structure contains error class and counter information
>> + */
>> +struct xe_drm_ras_counter {
>> +	/** @name: error class name */
>> +	const char *name;
>> +	/** @counter: count of error */
>> +	atomic64_t counter;
>> +};
>> +
>> +/**
>> + * struct xe_drm_ras - xe drm ras structure
>> + *
>> + * This structure has details of error counters
>> + */
>> +struct xe_drm_ras {
>> +	/** @node: DRM RAS node */
>> +	struct drm_ras_node *node;
>> +
>> +	/** @info: info array for all types of errors */
>> +	struct xe_drm_ras_counter *info[DRM_XE_RAS_ERROR_SEVERITY_MAX];
>> +
>> +};
>> +
>> +#endif
>> diff --git a/drivers/gpu/drm/xe/xe_hw_error.c b/drivers/gpu/drm/xe/xe_hw_error.c
>> index 8c65291f36fc..d63078d00b56 100644
>> --- a/drivers/gpu/drm/xe/xe_hw_error.c
>> +++ b/drivers/gpu/drm/xe/xe_hw_error.c
>> @@ -10,20 +10,14 @@
>>  #include "regs/xe_irq_regs.h"
>>  
>>  #include "xe_device.h"
>> +#include "xe_drm_ras.h"
>>  #include "xe_hw_error.h"
>>  #include "xe_mmio.h"
>>  #include "xe_survivability_mode.h"
>>  
>>  #define  HEC_UNCORR_FW_ERR_BITS 4
>>  extern struct fault_attr inject_csc_hw_error;
>> -
>> -/* Error categories reported by hardware */
>> -enum hardware_error {
>> -	HARDWARE_ERROR_CORRECTABLE = 0,
>> -	HARDWARE_ERROR_NONFATAL = 1,
>> -	HARDWARE_ERROR_FATAL = 2,
>> -	HARDWARE_ERROR_MAX,
>> -};
>> +static const char * const error_severity[] = DRM_XE_RAS_ERROR_SEVERITY_NAMES;
>>  
>>  static const char * const hec_uncorrected_fw_errors[] = {
>>  	"Fatal",
>> @@ -32,20 +26,6 @@ static const char * const hec_uncorrected_fw_errors[] = {
>>  	"Data Corruption"
>>  };
>>  
>> -static const char *hw_error_to_str(const enum hardware_error hw_err)
>> -{
>> -	switch (hw_err) {
>> -	case HARDWARE_ERROR_CORRECTABLE:
>> -		return "CORRECTABLE";
>> -	case HARDWARE_ERROR_NONFATAL:
>> -		return "NONFATAL";
>> -	case HARDWARE_ERROR_FATAL:
>> -		return "FATAL";
>> -	default:
>> -		return "UNKNOWN";
>> -	}
>> -}
>> -
>>  static bool fault_inject_csc_hw_error(void)
>>  {
>>  	return IS_ENABLED(CONFIG_DEBUG_FS) && should_fail(&inject_csc_hw_error, 1);
>> @@ -62,9 +42,10 @@ static void csc_hw_error_work(struct work_struct *work)
>>  		drm_err(&xe->drm, "Failed to enable runtime survivability mode\n");
>>  }
>>  
>> -static void csc_hw_error_handler(struct xe_tile *tile, const enum hardware_error hw_err)
>> +static void csc_hw_error_handler(struct xe_tile *tile,
>> +				 const enum drm_xe_ras_error_severity severity)
>>  {
>> -	const char *hw_err_str = hw_error_to_str(hw_err);
>> +	const char *severity_str = error_severity[severity];
>>  	struct xe_device *xe = tile_to_xe(tile);
>>  	struct xe_mmio *mmio = &tile->mmio;
>>  	u32 base, err_bit, err_src;
>> @@ -78,7 +59,7 @@ static void csc_hw_error_handler(struct xe_tile *tile, const enum hardware_error
>>  	err_src = xe_mmio_read32(mmio, HEC_UNCORR_ERR_STATUS(base));
>>  	if (!err_src) {
>>  		drm_err_ratelimited(&xe->drm, HW_ERR "Tile%d reported HEC_ERR_STATUS_%s blank\n",
>> -				    tile->id, hw_err_str);
>> +				    tile->id, severity_str);
>>  		return;
>>  	}
>>  
>> @@ -87,7 +68,7 @@ static void csc_hw_error_handler(struct xe_tile *tile, const enum hardware_error
>>  		for_each_set_bit(err_bit, &fw_err, HEC_UNCORR_FW_ERR_BITS) {
>>  			drm_err_ratelimited(&xe->drm, HW_ERR
>>  					    "%s: HEC Uncorrected FW %s error reported, bit[%d] is set\n",
>> -					     hw_err_str, hec_uncorrected_fw_errors[err_bit],
>> +					     severity_str, hec_uncorrected_fw_errors[err_bit],
>>  					     err_bit);
>>  
>>  			schedule_work(&tile->csc_hw_error_work);
>> @@ -97,9 +78,9 @@ static void csc_hw_error_handler(struct xe_tile *tile, const enum hardware_error
>>  	xe_mmio_write32(mmio, HEC_UNCORR_ERR_STATUS(base), err_src);
>>  }
>>  
>> -static void hw_error_source_handler(struct xe_tile *tile, const enum hardware_error hw_err)
>> +static void hw_error_source_handler(struct xe_tile *tile, enum drm_xe_ras_error_severity severity)
>>  {
>> -	const char *hw_err_str = hw_error_to_str(hw_err);
>> +	const char *severity_str = error_severity[severity];
>>  	struct xe_device *xe = tile_to_xe(tile);
>>  	unsigned long flags;
>>  	u32 err_src;
>> @@ -108,17 +89,17 @@ static void hw_error_source_handler(struct xe_tile *tile, const enum hardware_er
>>  		return;
>>  
>>  	spin_lock_irqsave(&xe->irq.lock, flags);
>> -	err_src = xe_mmio_read32(&tile->mmio, DEV_ERR_STAT_REG(hw_err));
>> +	err_src = xe_mmio_read32(&tile->mmio, DEV_ERR_STAT_REG(severity));
>>  	if (!err_src) {
>>  		drm_err_ratelimited(&xe->drm, HW_ERR "Tile%d reported DEV_ERR_STAT_%s blank!\n",
>> -				    tile->id, hw_err_str);
>> +				    tile->id, severity_str);
>>  		goto unlock;
>>  	}
>>  
>>  	if (err_src & XE_CSC_ERROR)
>> -		csc_hw_error_handler(tile, hw_err);
>> +		csc_hw_error_handler(tile, severity);
>>  
>> -	xe_mmio_write32(&tile->mmio, DEV_ERR_STAT_REG(hw_err), err_src);
>> +	xe_mmio_write32(&tile->mmio, DEV_ERR_STAT_REG(severity), err_src);
>>  
>>  unlock:
>>  	spin_unlock_irqrestore(&xe->irq.lock, flags);
>> @@ -136,16 +117,30 @@ static void hw_error_source_handler(struct xe_tile *tile, const enum hardware_er
>>   */
>>  void xe_hw_error_irq_handler(struct xe_tile *tile, const u32 master_ctl)
>>  {
>> -	enum hardware_error hw_err;
>> +	u32 hw_err;
>>  
>>  	if (fault_inject_csc_hw_error())
>>  		schedule_work(&tile->csc_hw_error_work);
>>  
>> -	for (hw_err = 0; hw_err < HARDWARE_ERROR_MAX; hw_err++)
>> +	for (hw_err = 0; hw_err < DRM_XE_RAS_ERROR_SEVERITY_MAX; hw_err++)
>>  		if (master_ctl & ERROR_IRQ(hw_err))
>>  			hw_error_source_handler(tile, hw_err);
>>  }
>>  
>> +static int hw_error_info_init(struct xe_device *xe)
>> +{
>> +	int ret;
>> +
>> +	if (xe->info.platform != XE_PVC)
>> +		return 0;
>> +
>> +	ret = xe_drm_ras_allocate_nodes(xe);
>> +	if (ret)
>> +		return ret;
>> +
>> +	return 0;
>> +}
>> +
>>  /*
>>   * Process hardware errors during boot
>>   */
>> @@ -178,5 +173,6 @@ void xe_hw_error_init(struct xe_device *xe)
>>  
>>  	INIT_WORK(&tile->csc_hw_error_work, csc_hw_error_work);
>>  
>> +	hw_error_info_init(xe);
>>  	process_hw_errors(xe);
>>  }
>> diff --git a/include/uapi/drm/xe_drm.h b/include/uapi/drm/xe_drm.h
>> index 0d99bb0cd20a..3f6c38908b70 100644
>> --- a/include/uapi/drm/xe_drm.h
>> +++ b/include/uapi/drm/xe_drm.h
>> @@ -2294,6 +2294,88 @@ struct drm_xe_vm_query_mem_range_attr {
>>  
>>  };
>>  
>> +/**
>> + * DOC: Xe DRM RAS
>> + *
>> + * The enums and strings defined below map to the attributes of the DRM RAS Netlink Interface.
>> + * Refer to Documentation/netlink/specs/drm_ras.yaml for complete interface specification.
>> + *
>> + * Node Registration
>> + * =================
>> + *
>> + * The driver registers DRM RAS nodes for each error severity level.
>> + * enum drm_xe_ras_error_severity defines the node-id, while DRM_XE_RAS_ERROR_SEVERITY_NAMES maps
>> + * node-id to node-name.
>> + *
>> + * Error Classification
>> + * ====================
>> + *
>> + * Each node contains a list of error counters. Each error is identified by a error-id and
>> + * an error-name. enum drm_xe_ras_error_class defines the error-id, while
>> + * DRM_XE_RAS_ERROR_CLASS_NAMES maps error-id to error-name.
>> + *
>> + * User Interface
>> + * ==============
>> + *
>> + * To retrieve error values of a error counter, userspace applications should
>> + * follow the below steps:
>> + *
>> + * 1. Use command LIST_NODES to enumerate all available nodes
>> + * 2. Select node by node-id or node-name
>> + * 3. Use command GET_ERROR_COUNTERS to list errors of specific node
>> + * 4. Query specific error values using either error-id or error-name
>> + *
>> + * .. code-block:: C
>> + *
>> + *	// Lookup tables for ID-to-name resolution
>> + *	static const char *nodes[] = DRM_XE_RAS_ERROR_SEVERITY_NAMES;
>> + *	static const char *errors[] = DRM_XE_RAS_ERROR_CLASS_NAMES;
>> + *
>> + */
>> +
>> +/**
>> + * enum drm_xe_ras_error_severity - Supported drm ras error severity.
>> + */
>> +enum drm_xe_ras_error_severity {
>> +	/** @DRM_XE_RAS_ERROR_CORRECTABLE: Correctable Error */
>> +	DRM_XE_RAS_ERROR_CORRECTABLE = 0,
>> +	/** @DRM_XE_RAS_ERROR_NONFATAL: Non fatal Error */
>> +	DRM_XE_RAS_ERROR_NONFATAL,
>> +	/** @DRM_XE_RAS_ERROR_FATAL: Fatal error */
>> +	DRM_XE_RAS_ERROR_FATAL,
>> +	/** @DRM_XE_RAS_ERROR_SEVERITY_MAX: Max severity */
>> +	DRM_XE_RAS_ERROR_SEVERITY_MAX, /* non-ABI */
>> +};
>> +
>> +/**
>> + * enum drm_xe_ras_error_class - Supported drm ras error classes.
>> + */
>> +enum drm_xe_ras_error_class {
>> +	/** @DRM_XE_RAS_ERROR_CORE_COMPUTE: GT and Media Error */
>> +	DRM_XE_RAS_ERROR_CORE_COMPUTE = 1,
>> +	/** @DRM_XE_RAS_ERROR_SOC_INTERNAL: SOC Error */
>> +	DRM_XE_RAS_ERROR_SOC_INTERNAL,
>> +	/** @DRM_XE_RAS_ERROR_CLASS_MAX: Max Error */
>> +	DRM_XE_RAS_ERROR_CLASS_MAX,	/* non-ABI */
>> +};
>> +
>> +/*
>> + * Error severity to name mapping.
>> + */
>> +#define DRM_XE_RAS_ERROR_SEVERITY_NAMES {				\
>> +	[DRM_XE_RAS_ERROR_CORRECTABLE] = "correctable-errors",		\
>> +	[DRM_XE_RAS_ERROR_NONFATAL] = "nonfatal-errors",		\
>> +	[DRM_XE_RAS_ERROR_FATAL] = "fatal-errors",			\
>> +}
>> +
>> +/*
>> + * Error class to name mapping.
>> + */
>> +#define DRM_XE_RAS_ERROR_CLASS_NAMES {					\
>> +	[DRM_XE_RAS_ERROR_CORE_COMPUTE] =  "Core Compute Error",	\
>> +	[DRM_XE_RAS_ERROR_SOC_INTERNAL] =  "SOC Internal Error",	\
>
> These look good to me.
>
> Cc: Joonas Lahtinen <joonas.lahtinen@linux.intel.com>
> Cc: Aravind Iddamsetty <aravind.iddamsetty@linux.intel.com>
>
> Joonas, Aravind, does this align with what you had in mind for the uAPI?

Yes, the only change I would request is to avoid spaces in the names;
better to use '-' as the separator.
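
As a minimal sketch of that request: the hyphenated strings below are an
assumed spelling, not taken from the patch, and names_have_no_spaces() is
a hypothetical helper added only to show the property being asked for.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Mirrors the error-class IDs from the patch under review. */
enum drm_xe_ras_error_class {
	DRM_XE_RAS_ERROR_CORE_COMPUTE = 1,
	DRM_XE_RAS_ERROR_SOC_INTERNAL,
	DRM_XE_RAS_ERROR_CLASS_MAX,	/* non-ABI */
};

/* Assumed spelling: lower-case, hyphen-separated, no spaces. */
#define DRM_XE_RAS_ERROR_CLASS_NAMES {					\
	[DRM_XE_RAS_ERROR_CORE_COMPUTE] = "core-compute",		\
	[DRM_XE_RAS_ERROR_SOC_INTERNAL] = "soc-internal",		\
}

static const char * const error_class_names[] = DRM_XE_RAS_ERROR_CLASS_NAMES;

/* Hypothetical helper: returns 1 if no defined name contains a space. */
static int names_have_no_spaces(void)
{
	size_t i;
	size_t n = sizeof(error_class_names) / sizeof(error_class_names[0]);

	for (i = 0; i < n; i++) {
		/* Index 0 stays NULL because class IDs start at 1. */
		if (error_class_names[i] && strchr(error_class_names[i], ' '))
			return 0;
	}
	return 1;
}
```

This would also match the style of the severity names in the same patch
("correctable-errors", "nonfatal-errors", "fatal-errors").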

Thanks,
Aravind.
>
> Thanks,
> Rodrigo.
>
>> +}
>> +
>>  #if defined(__cplusplus)
>>  }
>>  #endif
>> -- 
>> 2.47.1
>>


* [PATCH v3 3/4] drm/xe/xe_hw_error: Add support for GT hardware errors
  2025-12-05  8:39 [PATCH v3 0/4] Introduce DRM_RAS using generic netlink for RAS Riana Tauro
  2025-12-05  8:39 ` [PATCH v3 1/4] drm/ras: Introduce the DRM RAS infrastructure over generic netlink Riana Tauro
  2025-12-05  8:39 ` [PATCH v3 2/4] drm/xe/xe_drm_ras: Add support for drm ras Riana Tauro
@ 2025-12-05  8:39 ` Riana Tauro
  2025-12-10 18:18   ` Raag Jadav
  2025-12-05  8:39 ` [PATCH v3 4/4] drm/xe/xe_hw_error: Add support for PVC SOC errors Riana Tauro
                   ` (5 subsequent siblings)
  8 siblings, 1 reply; 31+ messages in thread
From: Riana Tauro @ 2025-12-05  8:39 UTC (permalink / raw)
  To: intel-xe, dri-devel
  Cc: aravind.iddamsetty, anshuman.gupta, rodrigo.vivi, joonas.lahtinen,
	lukas, simona.vetter, airlied, pratik.bari, joshua.santosh.ranjan,
	ashwin.kumar.kulkarni, shubham.kumar, Riana Tauro,
	Himal Prasad Ghimiray

PVC supports GT error reporting via vector registers along with the
error status register. Add support to report these errors and update
the respective counters. In case of a Subslice error reported by a
vector register, process the error status register for applicable
bits.

Keep the counters inside the driver itself and report them through
the drm_ras generic netlink interface.

Co-developed-by: Himal Prasad Ghimiray <himal.prasad.ghimiray@intel.com>
Signed-off-by: Himal Prasad Ghimiray <himal.prasad.ghimiray@intel.com>
Signed-off-by: Riana Tauro <riana.tauro@intel.com>
---
v2: Add IDs and names as uAPI (Rodrigo)
---
 drivers/gpu/drm/xe/regs/xe_hw_error_regs.h |  44 +++++
 drivers/gpu/drm/xe/xe_hw_error.c           | 182 ++++++++++++++++++++-
 2 files changed, 221 insertions(+), 5 deletions(-)

diff --git a/drivers/gpu/drm/xe/regs/xe_hw_error_regs.h b/drivers/gpu/drm/xe/regs/xe_hw_error_regs.h
index c146b9ef44eb..b54712e893d5 100644
--- a/drivers/gpu/drm/xe/regs/xe_hw_error_regs.h
+++ b/drivers/gpu/drm/xe/regs/xe_hw_error_regs.h
@@ -11,10 +11,54 @@
 
 #define HEC_UNCORR_FW_ERR_DW0(base)                    XE_REG((base) + 0x124)
 
+#define ERR_STAT_GT_COR				0x100160
+#define ERR_STAT_GT_NONFATAL			0x100164
+#define ERR_STAT_GT_FATAL			0x100168
+#define ERR_STAT_GT_REG(x)			XE_REG(_PICK_EVEN((x), \
+								 ERR_STAT_GT_COR, \
+								 ERR_STAT_GT_NONFATAL))
+
+#define  GT_HW_ERROR_MAX_ERR_BITS		16
+#define  EU_GRF_ERR				(15)
+#define  EU_IC_ERR				(14)
+#define  SLM_ERR				(13)
+#define  GUC_COR_ERR				(1)
+
+#define  GUC_FAT_ERR				(6)
+#define  FPU_FAT_ERR				(3)
+
+#define PVC_COR_ERR_MASK			(BIT(GUC_COR_ERR) | BIT(SLM_ERR) | \
+						 BIT(EU_IC_ERR) | BIT(EU_GRF_ERR))
+
+#define PVC_FAT_ERR_MASK			(BIT(FPU_FAT_ERR) | BIT(GUC_FAT_ERR) | \
+						 BIT(EU_GRF_ERR) | BIT(SLM_ERR))
+
 #define DEV_ERR_STAT_NONFATAL			0x100178
 #define DEV_ERR_STAT_CORRECTABLE		0x10017c
 #define DEV_ERR_STAT_REG(x)			XE_REG(_PICK_EVEN((x), \
 								  DEV_ERR_STAT_CORRECTABLE, \
 								  DEV_ERR_STAT_NONFATAL))
+
 #define   XE_CSC_ERROR				BIT(17)
+#define   XE_GT_ERROR				BIT(0)
+
+#define  ERR_STAT_GT_FATAL_VECTOR_0		0x100260
+#define  ERR_STAT_GT_FATAL_VECTOR_1		0x100264
+
+#define  ERR_STAT_GT_FATAL_VECTOR_REG(x)	XE_REG(_PICK_EVEN((x), \
+								  ERR_STAT_GT_FATAL_VECTOR_0, \
+								  ERR_STAT_GT_FATAL_VECTOR_1))
+
+#define  ERR_STAT_GT_COR_VECTOR_LEN		(4)
+#define  ERR_STAT_GT_COR_VECTOR_0		0x1002a0
+#define  ERR_STAT_GT_COR_VECTOR_1		0x1002a4
+
+#define  ERR_STAT_GT_COR_VECTOR_REG(x)		XE_REG(_PICK_EVEN((x), \
+								 ERR_STAT_GT_COR_VECTOR_0,\
+								 ERR_STAT_GT_COR_VECTOR_1))
+
+#define ERR_STAT_GT_VECTOR_REG(hw_err, x)	(hw_err == DRM_XE_RAS_ERROR_CORRECTABLE ? \
+						 ERR_STAT_GT_COR_VECTOR_REG(x) : \
+						 ERR_STAT_GT_FATAL_VECTOR_REG(x))
+
 #endif
diff --git a/drivers/gpu/drm/xe/xe_hw_error.c b/drivers/gpu/drm/xe/xe_hw_error.c
index d63078d00b56..77c90f1b06fd 100644
--- a/drivers/gpu/drm/xe/xe_hw_error.c
+++ b/drivers/gpu/drm/xe/xe_hw_error.c
@@ -3,6 +3,7 @@
  * Copyright © 2025 Intel Corporation
  */
 
+#include <linux/bitmap.h>
 #include <linux/fault-inject.h>
 
 #include "regs/xe_gsc_regs.h"
@@ -16,6 +17,8 @@
 #include "xe_survivability_mode.h"
 
 #define  HEC_UNCORR_FW_ERR_BITS 4
+#define XE_RAS_REG_SIZE 32
+
 extern struct fault_attr inject_csc_hw_error;
 static const char * const error_severity[] = DRM_XE_RAS_ERROR_SEVERITY_NAMES;
 
@@ -26,6 +29,25 @@ static const char * const hec_uncorrected_fw_errors[] = {
 	"Data Corruption"
 };
 
+#define ERR_INDEX(_bit, index) \
+	[__ffs(_bit)] = index
+
+static const unsigned long xe_hw_error_map[] = {
+	ERR_INDEX(XE_GT_ERROR, DRM_XE_RAS_ERROR_CORE_COMPUTE),
+};
+
+enum gt_vector_regs {
+	ERR_STAT_GT_VECTOR0 = 0,
+	ERR_STAT_GT_VECTOR1,
+	ERR_STAT_GT_VECTOR2,
+	ERR_STAT_GT_VECTOR3,
+	ERR_STAT_GT_VECTOR4,
+	ERR_STAT_GT_VECTOR5,
+	ERR_STAT_GT_VECTOR6,
+	ERR_STAT_GT_VECTOR7,
+	ERR_STAT_GT_VECTOR_MAX,
+};
+
 static bool fault_inject_csc_hw_error(void)
 {
 	return IS_ENABLED(CONFIG_DEBUG_FS) && should_fail(&inject_csc_hw_error, 1);
@@ -78,14 +100,136 @@ static void csc_hw_error_handler(struct xe_tile *tile,
 	xe_mmio_write32(mmio, HEC_UNCORR_ERR_STATUS(base), err_src);
 }
 
+static void log_hw_error(struct xe_tile *tile, const char *name,
+			 const enum drm_xe_ras_error_severity severity)
+{
+	const char *severity_str = error_severity[severity];
+	struct xe_device *xe = tile_to_xe(tile);
+
+	if (severity == DRM_XE_RAS_ERROR_FATAL)
+		drm_err_ratelimited(&xe->drm, "%s %s error detected\n", name, severity_str);
+	else
+		drm_warn(&xe->drm, "%s %s error detected\n", name, severity_str);
+}
+
+static void
+log_gt_err(struct xe_tile *tile, const char *name, int i, u32 err,
+	   const enum drm_xe_ras_error_severity severity)
+{
+	const char *severity_str = error_severity[severity];
+	struct xe_device *xe = tile_to_xe(tile);
+
+	if (severity == DRM_XE_RAS_ERROR_FATAL)
+		drm_err_ratelimited(&xe->drm, "%s %s error detected, ERROR_STAT_GT_VECTOR%d:0x%08x\n",
+				    name, severity_str, i, err);
+	else
+		drm_warn(&xe->drm, "%s %s error detected, ERROR_STAT_GT_VECTOR%d:0x%08x\n",
+			 name, severity_str, i, err);
+}
+
+static void gt_handle_errors(struct xe_tile *tile,
+			     const enum drm_xe_ras_error_severity severity, u32 error_id)
+{
+	struct xe_device *xe = tile_to_xe(tile);
+	struct xe_drm_ras *ras = &xe->ras;
+	struct xe_drm_ras_counter *info = ras->info[severity];
+	struct xe_mmio *mmio = &tile->mmio;
+	unsigned long err_stat = 0;
+	int i;
+
+	if (xe->info.platform != XE_PVC)
+		return;
+
+	for (i = 0; i < ERR_STAT_GT_VECTOR_MAX; i++) {
+		u32 vector, val;
+
+		if (severity == DRM_XE_RAS_ERROR_CORRECTABLE && i >= ERR_STAT_GT_COR_VECTOR_LEN)
+			break;
+
+		vector = xe_mmio_read32(mmio, ERR_STAT_GT_VECTOR_REG(severity, i));
+		if (!vector)
+			continue;
+
+		switch (i) {
+		case ERR_STAT_GT_VECTOR0:
+		case ERR_STAT_GT_VECTOR1:
+			u32 errbit;
+
+			val = hweight32(vector);
+			atomic64_add(val, &info[error_id].counter);
+			log_gt_err(tile, "Subslice", i, vector, severity);
+
+			if (err_stat)
+				break;
+
+			err_stat = xe_mmio_read32(mmio, ERR_STAT_GT_REG(severity));
+			for_each_set_bit(errbit, &err_stat, GT_HW_ERROR_MAX_ERR_BITS) {
+				if (severity == DRM_XE_RAS_ERROR_CORRECTABLE &&
+				    (BIT(errbit) & PVC_COR_ERR_MASK))
+					atomic64_inc(&info[error_id].counter);
+				if (severity == DRM_XE_RAS_ERROR_FATAL &&
+				    (BIT(errbit) & PVC_FAT_ERR_MASK))
+					atomic64_inc(&info[error_id].counter);
+			}
+			if (err_stat)
+				xe_mmio_write32(mmio, ERR_STAT_GT_REG(severity), err_stat);
+			break;
+		case ERR_STAT_GT_VECTOR2:
+		case ERR_STAT_GT_VECTOR3:
+			val = hweight32(vector);
+			atomic64_add(val, &info[error_id].counter);
+			log_gt_err(tile, "L3 BANK", i, vector, severity);
+			break;
+		case ERR_STAT_GT_VECTOR6:
+			val = hweight32(vector);
+			atomic64_add(val, &info[error_id].counter);
+			log_gt_err(tile, "TLB", i, vector, severity);
+			break;
+		case ERR_STAT_GT_VECTOR7:
+			val = hweight32(vector);
+			atomic64_add(val, &info[error_id].counter);
+			break;
+		default:
+			log_gt_err(tile, "Undefined", i, vector, severity);
+		}
+
+		xe_mmio_write32(mmio, ERR_STAT_GT_VECTOR_REG(severity, i), vector);
+	}
+}
+
+static void gt_hw_error_handler(struct xe_tile *tile,
+				const enum drm_xe_ras_error_severity severity, u32 error_id)
+{
+	struct xe_device *xe = tile_to_xe(tile);
+	struct xe_drm_ras *ras = &xe->ras;
+	struct xe_drm_ras_counter *info = ras->info[severity];
+
+	switch (severity) {
+	case DRM_XE_RAS_ERROR_CORRECTABLE:
+		gt_handle_errors(tile, severity, error_id);
+		break;
+	case DRM_XE_RAS_ERROR_NONFATAL:
+		atomic64_inc(&info[error_id].counter);
+		log_hw_error(tile, "GT", severity);
+		break;
+	case DRM_XE_RAS_ERROR_FATAL:
+		gt_handle_errors(tile, severity, error_id);
+		break;
+	default:
+		drm_warn(&xe->drm, "Undefined error detected\n");
+	}
+}
+
 static void hw_error_source_handler(struct xe_tile *tile, enum drm_xe_ras_error_severity severity)
 {
 	const char *severity_str = error_severity[severity];
 	struct xe_device *xe = tile_to_xe(tile);
-	unsigned long flags;
-	u32 err_src;
+	struct xe_drm_ras *ras = &xe->ras;
+	struct xe_drm_ras_counter *info = ras->info[severity];
+	unsigned long flags, err_src;
+	u32 err_bit;
 
-	if (xe->info.platform != XE_BATTLEMAGE)
+	if (!IS_DGFX(xe))
 		return;
 
 	spin_lock_irqsave(&xe->irq.lock, flags);
@@ -96,11 +240,39 @@ static void hw_error_source_handler(struct xe_tile *tile, enum drm_xe_ras_error_
 		goto unlock;
 	}
 
-	if (err_src & XE_CSC_ERROR)
+	if (err_src & XE_CSC_ERROR) {
 		csc_hw_error_handler(tile, severity);
+		goto clear_reg;
+	}
 
-	xe_mmio_write32(&tile->mmio, DEV_ERR_STAT_REG(severity), err_src);
+	if (!info) {
+		drm_err_ratelimited(&xe->drm, HW_ERR "Errors undefined\n");
+		goto clear_reg;
+	}
+
+	for_each_set_bit(err_bit, &err_src, XE_RAS_REG_SIZE) {
+		u32 error_id = xe_hw_error_map[err_bit];
+		const char *name;
+
+		name = info[error_id].name;
+		if (!name)
+			goto clear_reg;
 
+		if (severity == DRM_XE_RAS_ERROR_FATAL) {
+			drm_err_ratelimited(&xe->drm, HW_ERR
+					    "TILE%d reported %s %s error, bit[%d] is set\n",
+					    tile->id, name, severity_str, err_bit);
+		} else {
+			drm_warn(&xe->drm, HW_ERR
+				 "TILE%d reported %s %s error, bit[%d] is set\n",
+				 tile->id, name, severity_str, err_bit);
+		}
+		if (BIT(err_bit) & XE_GT_ERROR)
+			gt_hw_error_handler(tile, severity, error_id);
+	}
+
+clear_reg:
+	xe_mmio_write32(&tile->mmio, DEV_ERR_STAT_REG(severity), err_src);
 unlock:
 	spin_unlock_irqrestore(&xe->irq.lock, flags);
 }
-- 
2.47.1



* Re: [PATCH v3 3/4] drm/xe/xe_hw_error: Add support for GT hardware errors
  2025-12-05  8:39 ` [PATCH v3 3/4] drm/xe/xe_hw_error: Add support for GT hardware errors Riana Tauro
@ 2025-12-10 18:18   ` Raag Jadav
  2026-01-12  3:41     ` Riana Tauro
  0 siblings, 1 reply; 31+ messages in thread
From: Raag Jadav @ 2025-12-10 18:18 UTC (permalink / raw)
  To: Riana Tauro
  Cc: intel-xe, dri-devel, aravind.iddamsetty, anshuman.gupta,
	rodrigo.vivi, joonas.lahtinen, lukas, simona.vetter, airlied,
	pratik.bari, joshua.santosh.ranjan, ashwin.kumar.kulkarni,
	shubham.kumar, Himal Prasad Ghimiray

On Fri, Dec 05, 2025 at 02:09:35PM +0530, Riana Tauro wrote:
> PVC supports GT error reporting via vector registers along with the
> error status register. Add support to report these errors and update
> the respective counters. In case of a Subslice error reported by a
> vector register, process the error status register for applicable
> bits.
> 
> Keep the counters inside the driver itself and report them through
> the drm_ras generic netlink interface.

...

> +#define ERR_STAT_GT_COR				0x100160
> +#define ERR_STAT_GT_NONFATAL			0x100164
> +#define ERR_STAT_GT_FATAL			0x100168
> +#define ERR_STAT_GT_REG(x)			XE_REG(_PICK_EVEN((x), \
> +								 ERR_STAT_GT_COR, \

Align with braces.

> +								 ERR_STAT_GT_NONFATAL))
> +
> +#define  GT_HW_ERROR_MAX_ERR_BITS		16
> +#define  EU_GRF_ERR				(15)
> +#define  EU_IC_ERR				(14)
> +#define  SLM_ERR				(13)

Nit: I know this consolidates the duplication but separate bits followed
by their mask would be relatively easier to read IMHO.

> +#define  GUC_COR_ERR				(1)
> +
> +#define  GUC_FAT_ERR				(6)
> +#define  FPU_FAT_ERR				(3)
> +
> +#define PVC_COR_ERR_MASK			(BIT(GUC_COR_ERR) | BIT(SLM_ERR) | \
> +						 BIT(EU_IC_ERR) | BIT(EU_GRF_ERR))
> +
> +#define PVC_FAT_ERR_MASK			(BIT(FPU_FAT_ERR) | BIT(GUC_FAT_ERR) | \
> +						 BIT(EU_GRF_ERR) | BIT(SLM_ERR))

I was under the impression that we use REG_BIT(); did that change?

>  #define DEV_ERR_STAT_NONFATAL			0x100178
>  #define DEV_ERR_STAT_CORRECTABLE		0x10017c
>  #define DEV_ERR_STAT_REG(x)			XE_REG(_PICK_EVEN((x), \
>  								  DEV_ERR_STAT_CORRECTABLE, \
>  								  DEV_ERR_STAT_NONFATAL))
> +
>  #define   XE_CSC_ERROR				BIT(17)
> +#define   XE_GT_ERROR				BIT(0)
> +
> +#define  ERR_STAT_GT_FATAL_VECTOR_0		0x100260
> +#define  ERR_STAT_GT_FATAL_VECTOR_1		0x100264
> +
> +#define  ERR_STAT_GT_FATAL_VECTOR_REG(x)	XE_REG(_PICK_EVEN((x), \
> +								  ERR_STAT_GT_FATAL_VECTOR_0, \
> +								  ERR_STAT_GT_FATAL_VECTOR_1))
> +
> +#define  ERR_STAT_GT_COR_VECTOR_LEN		(4)
> +#define  ERR_STAT_GT_COR_VECTOR_0		0x1002a0
> +#define  ERR_STAT_GT_COR_VECTOR_1		0x1002a4
> +
> +#define  ERR_STAT_GT_COR_VECTOR_REG(x)		XE_REG(_PICK_EVEN((x), \
> +								 ERR_STAT_GT_COR_VECTOR_0,\

Ditto for alignment. Also, the convention here seems to be a space before
the backslash, so let's be consistent.

> +								 ERR_STAT_GT_COR_VECTOR_1))
> +
> +#define ERR_STAT_GT_VECTOR_REG(hw_err, x)	(hw_err == DRM_XE_RAS_ERROR_CORRECTABLE ? \
> +						 ERR_STAT_GT_COR_VECTOR_REG(x) : \
> +						 ERR_STAT_GT_FATAL_VECTOR_REG(x))
> +
>  #endif
> diff --git a/drivers/gpu/drm/xe/xe_hw_error.c b/drivers/gpu/drm/xe/xe_hw_error.c
> index d63078d00b56..77c90f1b06fd 100644
> --- a/drivers/gpu/drm/xe/xe_hw_error.c
> +++ b/drivers/gpu/drm/xe/xe_hw_error.c
> @@ -3,6 +3,7 @@
>   * Copyright © 2025 Intel Corporation
>   */
>  
> +#include <linux/bitmap.h>
>  #include <linux/fault-inject.h>
>  
>  #include "regs/xe_gsc_regs.h"
> @@ -16,6 +17,8 @@
>  #include "xe_survivability_mode.h"
>  
>  #define  HEC_UNCORR_FW_ERR_BITS 4
> +#define XE_RAS_REG_SIZE 32

Alignment please! (including the values)

>  extern struct fault_attr inject_csc_hw_error;
>  static const char * const error_severity[] = DRM_XE_RAS_ERROR_SEVERITY_NAMES;
>  
> @@ -26,6 +29,25 @@ static const char * const hec_uncorrected_fw_errors[] = {
>  	"Data Corruption"
>  };
>  
> +#define ERR_INDEX(_bit, index) \
> +	[__ffs(_bit)] = index

Does this guarantee compile-time evaluation on all architectures? We might
risk breaking other builds where the compiler can't figure it out.
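
One way to sidestep the question, sketched here under assumptions: name
the bit position as a plain constant and derive the mask from it, so the
designated initializer never depends on __ffs() being folded at compile
time. XE_GT_ERROR_BIT and xe_hw_error_class() are hypothetical names, not
part of the patch. The sketch also sizes the table to the register width,
on the assumption that the lookup can be indexed by any set bit of a
32-bit status register.

```c
#include <assert.h>
#include <stdint.h>

/* Mirrors the error-class IDs from the patch under review. */
enum drm_xe_ras_error_class {
	DRM_XE_RAS_ERROR_CORE_COMPUTE = 1,
	DRM_XE_RAS_ERROR_SOC_INTERNAL,
	DRM_XE_RAS_ERROR_CLASS_MAX,
};

/* Hypothetical: define the bit position first, then derive the mask. */
#define XE_GT_ERROR_BIT		0
#define XE_GT_ERROR		(UINT32_C(1) << XE_GT_ERROR_BIT)

#define XE_RAS_REG_SIZE		32

/* One entry per register bit; unmapped bits default to 0. */
static const uint32_t xe_hw_error_map[XE_RAS_REG_SIZE] = {
	[XE_GT_ERROR_BIT] = DRM_XE_RAS_ERROR_CORE_COMPUTE,
};

/* Returns the error class for a status bit, or 0 if the bit is unmapped. */
static uint32_t xe_hw_error_class(unsigned int err_bit)
{
	if (err_bit >= XE_RAS_REG_SIZE)
		return 0;
	return xe_hw_error_map[err_bit];
}
```

Because class IDs start at 1, a return value of 0 can serve as the
"unmapped" marker without an extra sentinel.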

> +static const unsigned long xe_hw_error_map[] = {
> +	ERR_INDEX(XE_GT_ERROR, DRM_XE_RAS_ERROR_CORE_COMPUTE),
> +};
> +
> +enum gt_vector_regs {
> +	ERR_STAT_GT_VECTOR0 = 0,
> +	ERR_STAT_GT_VECTOR1,
> +	ERR_STAT_GT_VECTOR2,
> +	ERR_STAT_GT_VECTOR3,
> +	ERR_STAT_GT_VECTOR4,
> +	ERR_STAT_GT_VECTOR5,
> +	ERR_STAT_GT_VECTOR6,
> +	ERR_STAT_GT_VECTOR7,
> +	ERR_STAT_GT_VECTOR_MAX,
> +};
> +
>  static bool fault_inject_csc_hw_error(void)
>  {
>  	return IS_ENABLED(CONFIG_DEBUG_FS) && should_fail(&inject_csc_hw_error, 1);
> @@ -78,14 +100,136 @@ static void csc_hw_error_handler(struct xe_tile *tile,
>  	xe_mmio_write32(mmio, HEC_UNCORR_ERR_STATUS(base), err_src);
>  }
>  
> +static void log_hw_error(struct xe_tile *tile, const char *name,
> +			 const enum drm_xe_ras_error_severity severity)
> +{
> +	const char *severity_str = error_severity[severity];
> +	struct xe_device *xe = tile_to_xe(tile);
> +
> +	if (severity == DRM_XE_RAS_ERROR_FATAL)
> +		drm_err_ratelimited(&xe->drm, "%s %s error detected\n", name, severity_str);

Adding to my earlier bikeshed[*] about 'error': it should be part of either
severity_str or the message, but let's not repeat it in both.

[*] https://lore.kernel.org/intel-xe/aTfcV5nb_vBOOBvP@black.igk.intel.com/

> +	else
> +		drm_warn(&xe->drm, "%s %s error detected\n", name, severity_str);

Ditto.

> +}
> +
> +static void
> +log_gt_err(struct xe_tile *tile, const char *name, int i, u32 err,

Could be a single line. Also, 'i' could be unsigned.

> +	   const enum drm_xe_ras_error_severity severity)
> +{
> +	const char *severity_str = error_severity[severity];
> +	struct xe_device *xe = tile_to_xe(tile);
> +
> +	if (severity == DRM_XE_RAS_ERROR_FATAL)
> +		drm_err_ratelimited(&xe->drm, "%s %s error detected, ERROR_STAT_GT_VECTOR%d:0x%08x\n",

Same as above.

> +				    name, severity_str, i, err);
> +	else
> +		drm_warn(&xe->drm, "%s %s error detected, ERROR_STAT_GT_VECTOR%d:0x%08x\n",

Ditto.

> +			 name, severity_str, i, err);
> +}
> +
> +static void gt_handle_errors(struct xe_tile *tile,
> +			     const enum drm_xe_ras_error_severity severity, u32 error_id)
> +{
> +	struct xe_device *xe = tile_to_xe(tile);
> +	struct xe_drm_ras *ras = &xe->ras;
> +	struct xe_drm_ras_counter *info = ras->info[severity];
> +	struct xe_mmio *mmio = &tile->mmio;
> +	unsigned long err_stat = 0;
> +	int i;
> +
> +	if (xe->info.platform != XE_PVC)
> +		return;
> +
> +	for (i = 0; i < ERR_STAT_GT_VECTOR_MAX; i++) {
> +		u32 vector, val;
> +
> +		if (severity == DRM_XE_RAS_ERROR_CORRECTABLE && i >= ERR_STAT_GT_COR_VECTOR_LEN)

This can be a temp variable which you can use as a for loop condition,
but please use explicit length per severity so we don't come back
questioning things.
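
A minimal sketch of that suggestion, with assumptions called out:
ERR_STAT_GT_FATAL_VECTOR_LEN and gt_vector_len() are hypothetical names
(the patch uses ERR_STAT_GT_VECTOR_MAX, which evaluates to 8, as the
fatal bound), and count_vectors_visited() only stands in for the real
loop body.

```c
#include <assert.h>

/* Mirrors the severity enum from the patch under review. */
enum drm_xe_ras_error_severity {
	DRM_XE_RAS_ERROR_CORRECTABLE = 0,
	DRM_XE_RAS_ERROR_NONFATAL,
	DRM_XE_RAS_ERROR_FATAL,
	DRM_XE_RAS_ERROR_SEVERITY_MAX,
};

#define ERR_STAT_GT_COR_VECTOR_LEN	4
/* Hypothetical name for the fatal vector count (8 in the patch). */
#define ERR_STAT_GT_FATAL_VECTOR_LEN	8

/* Explicit vector-register count per severity. */
static unsigned int gt_vector_len(enum drm_xe_ras_error_severity severity)
{
	switch (severity) {
	case DRM_XE_RAS_ERROR_CORRECTABLE:
		return ERR_STAT_GT_COR_VECTOR_LEN;
	case DRM_XE_RAS_ERROR_FATAL:
		return ERR_STAT_GT_FATAL_VECTOR_LEN;
	default:
		/* The patch does not walk vector registers for NONFATAL. */
		return 0;
	}
}

/* Stand-in for the handler loop: counts how many vectors get visited. */
static unsigned int
count_vectors_visited(enum drm_xe_ras_error_severity severity)
{
	const unsigned int len = gt_vector_len(severity);
	unsigned int i, visited = 0;

	for (i = 0; i < len; i++)
		visited++;

	return visited;
}
```

With the bound computed once, the in-loop severity check and early break
are no longer needed.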

> +			break;
> +
> +		vector = xe_mmio_read32(mmio, ERR_STAT_GT_VECTOR_REG(severity, i));
> +		if (!vector)
> +			continue;
> +
> +		switch (i) {
> +		case ERR_STAT_GT_VECTOR0:
> +		case ERR_STAT_GT_VECTOR1:
> +			u32 errbit;
> +
> +			val = hweight32(vector);
> +			atomic64_add(val, &info[error_id].counter);
> +			log_gt_err(tile, "Subslice", i, vector, severity);
> +
> +			if (err_stat)
> +				break;

So we won't ever be getting past this point, is that right?

> +			err_stat = xe_mmio_read32(mmio, ERR_STAT_GT_REG(severity));
> +			for_each_set_bit(errbit, &err_stat, GT_HW_ERROR_MAX_ERR_BITS) {
> +				if (severity == DRM_XE_RAS_ERROR_CORRECTABLE &&
> +				    (BIT(errbit) & PVC_COR_ERR_MASK))
> +					atomic64_inc(&info[error_id].counter);
> +				if (severity == DRM_XE_RAS_ERROR_FATAL &&
> +				    (BIT(errbit) & PVC_FAT_ERR_MASK))
> +					atomic64_inc(&info[error_id].counter);
> +			}
> +			if (err_stat)
> +				xe_mmio_write32(mmio, ERR_STAT_GT_REG(severity), err_stat);
> +			break;
> +		case ERR_STAT_GT_VECTOR2:
> +		case ERR_STAT_GT_VECTOR3:
> +			val = hweight32(vector);
> +			atomic64_add(val, &info[error_id].counter);
> +			log_gt_err(tile, "L3 BANK", i, vector, severity);
> +			break;
> +		case ERR_STAT_GT_VECTOR6:
> +			val = hweight32(vector);
> +			atomic64_add(val, &info[error_id].counter);
> +			log_gt_err(tile, "TLB", i, vector, severity);
> +			break;
> +		case ERR_STAT_GT_VECTOR7:
> +			val = hweight32(vector);
> +			atomic64_add(val, &info[error_id].counter);
> +			break;
> +		default:
> +			log_gt_err(tile, "Undefined", i, vector, severity);
> +		}
> +
> +		xe_mmio_write32(mmio, ERR_STAT_GT_VECTOR_REG(severity, i), vector);
> +	}
> +}
> +
> +static void gt_hw_error_handler(struct xe_tile *tile,
> +				const enum drm_xe_ras_error_severity severity, u32 error_id)

This entire function can be just an explicit NONFATAL case at the top of
gt_handle_errors().
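
A simplified userspace model of that restructuring, as a sketch only:
MMIO reads are replaced by a caller-supplied array, popcount32() stands
in for the kernel's hweight32(), and the per-vector switch and logging
from the patch are omitted, so every vector is counted here.

```c
#include <assert.h>
#include <stdint.h>

/* Mirrors the severity enum from the patch under review. */
enum drm_xe_ras_error_severity {
	DRM_XE_RAS_ERROR_CORRECTABLE = 0,
	DRM_XE_RAS_ERROR_NONFATAL,
	DRM_XE_RAS_ERROR_FATAL,
	DRM_XE_RAS_ERROR_SEVERITY_MAX,
};

/* Portable population count, standing in for hweight32(). */
static unsigned int popcount32(uint32_t v)
{
	unsigned int n = 0;

	for (; v; v &= v - 1)
		n++;

	return n;
}

/*
 * Model of gt_handle_errors() with the NONFATAL case folded in.
 * @vectors is a snapshot of the vector registers (assumption: the
 * real code reads them via MMIO one at a time).
 */
static void gt_handle_errors(enum drm_xe_ras_error_severity severity,
			     const uint32_t *vectors, unsigned int nvec,
			     uint64_t *counter)
{
	unsigned int i;

	/* NONFATAL: one increment, no vector walk. */
	if (severity == DRM_XE_RAS_ERROR_NONFATAL) {
		(*counter)++;
		return;
	}

	/* CORRECTABLE and FATAL: add the set bits of each vector. */
	for (i = 0; i < nvec; i++)
		*counter += popcount32(vectors[i]);
}
```

The caller would then invoke gt_handle_errors() directly for all three
severities, and the separate wrapper switch goes away.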

> +{
> +	struct xe_device *xe = tile_to_xe(tile);
> +	struct xe_drm_ras *ras = &xe->ras;
> +	struct xe_drm_ras_counter *info = ras->info[severity];
> +
> +	switch (severity) {
> +	case DRM_XE_RAS_ERROR_CORRECTABLE:
> +		gt_handle_errors(tile, severity, error_id);
> +		break;
> +	case DRM_XE_RAS_ERROR_NONFATAL:
> +		atomic64_inc(&info[error_id].counter);
> +		log_hw_error(tile, "GT", severity);
> +		break;
> +	case DRM_XE_RAS_ERROR_FATAL:
> +		gt_handle_errors(tile, severity, error_id);
> +		break;
> +	default:
> +		drm_warn(&xe->drm, "Undefined error detected\n");
> +	}
> +}
> +
>  static void hw_error_source_handler(struct xe_tile *tile, enum drm_xe_ras_error_severity severity)

Perhaps severity can be const now?

>  {
>  	const char *severity_str = error_severity[severity];
>  	struct xe_device *xe = tile_to_xe(tile);
> -	unsigned long flags;
> -	u32 err_src;
> +	struct xe_drm_ras *ras = &xe->ras;
> +	struct xe_drm_ras_counter *info = ras->info[severity];
> +	unsigned long flags, err_src;
> +	u32 err_bit;
>  
> -	if (xe->info.platform != XE_BATTLEMAGE)
> +	if (!IS_DGFX(xe))
>  		return;
>  
>  	spin_lock_irqsave(&xe->irq.lock, flags);

Isn't this IRQ context only? What am I missing?

> @@ -96,11 +240,39 @@ static void hw_error_source_handler(struct xe_tile *tile, enum drm_xe_ras_error_
>  		goto unlock;
>  	}
>  
> -	if (err_src & XE_CSC_ERROR)
> +	if (err_src & XE_CSC_ERROR) {

Shouldn't this be inside the loop below?

>  		csc_hw_error_handler(tile, severity);
> +		goto clear_reg;

If no, should we not check for other bits before bailing?

> +	}
>  
> -	xe_mmio_write32(&tile->mmio, DEV_ERR_STAT_REG(severity), err_src);
> +	if (!info) {
> +		drm_err_ratelimited(&xe->drm, HW_ERR "Errors undefined\n");
> +		goto clear_reg;
> +	}
> +
> +	for_each_set_bit(err_bit, &err_src, XE_RAS_REG_SIZE) {
> +		u32 error_id = xe_hw_error_map[err_bit];
> +		const char *name;
> +
> +		name = info[error_id].name;
> +		if (!name)
> +			goto clear_reg;
>  
> +		if (severity == DRM_XE_RAS_ERROR_FATAL) {
> +			drm_err_ratelimited(&xe->drm, HW_ERR
> +					    "TILE%d reported %s %s error, bit[%d] is set\n",
> +					    tile->id, name, severity_str, err_bit);
> +		} else {
> +			drm_warn(&xe->drm, HW_ERR
> +				 "TILE%d reported %s %s error, bit[%d] is set\n",
> +				 tile->id, name, severity_str, err_bit);
> +		}
> +		if (BIT(err_bit) & XE_GT_ERROR)

Shouldn't this be compare?

Raag

> +			gt_hw_error_handler(tile, severity, error_id);
> +	}
> +
> +clear_reg:
> +	xe_mmio_write32(&tile->mmio, DEV_ERR_STAT_REG(severity), err_src);
>  unlock:
>  	spin_unlock_irqrestore(&xe->irq.lock, flags);
>  }
> -- 
> 2.47.1
> 


* Re: [PATCH v3 3/4] drm/xe/xe_hw_error: Add support for GT hardware errors
  2025-12-10 18:18   ` Raag Jadav
@ 2026-01-12  3:41     ` Riana Tauro
  2026-01-12 10:02       ` Raag Jadav
  0 siblings, 1 reply; 31+ messages in thread
From: Riana Tauro @ 2026-01-12  3:41 UTC (permalink / raw)
  To: Raag Jadav
  Cc: intel-xe, dri-devel, aravind.iddamsetty, anshuman.gupta,
	rodrigo.vivi, joonas.lahtinen, simona.vetter, airlied,
	pratik.bari, joshua.santosh.ranjan, ashwin.kumar.kulkarni,
	shubham.kumar, Himal Prasad Ghimiray

Hi Raag

On 12/10/2025 11:48 PM, Raag Jadav wrote:
> On Fri, Dec 05, 2025 at 02:09:35PM +0530, Riana Tauro wrote:
>> PVC supports GT error reporting via vector registers along with
>> error status register. Add support to report these errors and
>> update respective counters. In case of a Subslice error reported
>> by vector register, process the error status register
>> for applicable bits.
>>
>> Incorporate the counter inside the driver itself and start
>> using the drm_ras generic netlink to report them.
> 
> ...
> 
>> +#define ERR_STAT_GT_COR				0x100160
>> +#define ERR_STAT_GT_NONFATAL			0x100164
>> +#define ERR_STAT_GT_FATAL			0x100168
>> +#define ERR_STAT_GT_REG(x)			XE_REG(_PICK_EVEN((x), \
>> +								 ERR_STAT_GT_COR, \
> 
> Align with braces.

This is aligned with the braces.

> 
>> +								 ERR_STAT_GT_NONFATAL))
>> +
>> +#define  GT_HW_ERROR_MAX_ERR_BITS		16
>> +#define  EU_GRF_ERR				(15)
>> +#define  EU_IC_ERR				(14)
>> +#define  SLM_ERR				(13)
> 
> Nit: I know this consolidates the duplication but separate bits followed
> by their mask would be relatively easier to read IMHO.
> 

Hmm.. okay. Was trying to avoid multiple declarations for the same bit


>> +#define  GUC_COR_ERR				(1)
>> +
>> +#define  GUC_FAT_ERR				(6)
>> +#define  FPU_FAT_ERR				(3)
>> +
>> +#define PVC_COR_ERR_MASK			(BIT(GUC_COR_ERR) | BIT(SLM_ERR) | \
>> +						 BIT(EU_IC_ERR) | BIT(EU_GRF_ERR))
>> +
>> +#define PVC_FAT_ERR_MASK			(BIT(FPU_FAT_ERR) | BIT(GUC_FAT_ERR) | \
>> +						 BIT(EU_GRF_ERR) | BIT(SLM_ERR))
> 
> I had an impression that we used REG_BIT(), or did things change?

I took the previous patch directly. Thanks for catching this. Will fix

> 
>>   #define DEV_ERR_STAT_NONFATAL			0x100178
>>   #define DEV_ERR_STAT_CORRECTABLE		0x10017c
>>   #define DEV_ERR_STAT_REG(x)			XE_REG(_PICK_EVEN((x), \
>>   								  DEV_ERR_STAT_CORRECTABLE, \
>>   								  DEV_ERR_STAT_NONFATAL))
>> +
>>   #define   XE_CSC_ERROR				BIT(17)
>> +#define   XE_GT_ERROR				BIT(0)
>> +
>> +#define  ERR_STAT_GT_FATAL_VECTOR_0		0x100260
>> +#define  ERR_STAT_GT_FATAL_VECTOR_1		0x100264
>> +
>> +#define  ERR_STAT_GT_FATAL_VECTOR_REG(x)	XE_REG(_PICK_EVEN((x), \
>> +								  ERR_STAT_GT_FATAL_VECTOR_0, \
>> +								  ERR_STAT_GT_FATAL_VECTOR_1))
>> +
>> +#define  ERR_STAT_GT_COR_VECTOR_LEN		(4)
>> +#define  ERR_STAT_GT_COR_VECTOR_0		0x1002a0
>> +#define  ERR_STAT_GT_COR_VECTOR_1		0x1002a4
>> +
>> +#define  ERR_STAT_GT_COR_VECTOR_REG(x)		XE_REG(_PICK_EVEN((x), \
>> +								 ERR_STAT_GT_COR_VECTOR_0,\
> 
> Ditto for alignment. Also, the convention here seems like an empty space
> before the backslash so let's be consistent.

Alignment is proper here too.  It's patchwork :(

> 
>> +								 ERR_STAT_GT_COR_VECTOR_1))
>> +
>> +#define ERR_STAT_GT_VECTOR_REG(hw_err, x)	(hw_err == DRM_XE_RAS_ERROR_CORRECTABLE ? \
>> +						 ERR_STAT_GT_COR_VECTOR_REG(x) : \
>> +						 ERR_STAT_GT_FATAL_VECTOR_REG(x))
>> +
>>   #endif
>> diff --git a/drivers/gpu/drm/xe/xe_hw_error.c b/drivers/gpu/drm/xe/xe_hw_error.c
>> index d63078d00b56..77c90f1b06fd 100644
>> --- a/drivers/gpu/drm/xe/xe_hw_error.c
>> +++ b/drivers/gpu/drm/xe/xe_hw_error.c
>> @@ -3,6 +3,7 @@
>>    * Copyright © 2025 Intel Corporation
>>    */
>>   
>> +#include <linux/bitmap.h>
>>   #include <linux/fault-inject.h>
>>   
>>   #include "regs/xe_gsc_regs.h"
>> @@ -16,6 +17,8 @@
>>   #include "xe_survivability_mode.h"
>>   
>>   #define  HEC_UNCORR_FW_ERR_BITS 4
>> +#define XE_RAS_REG_SIZE 32
> 
> Alignment please! (including the values)
> 
>>   extern struct fault_attr inject_csc_hw_error;
>>   static const char * const error_severity[] = DRM_XE_RAS_ERROR_SEVERITY_NAMES;
>>   
>> @@ -26,6 +29,25 @@ static const char * const hec_uncorrected_fw_errors[] = {
>>   	"Data Corruption"
>>   };
>>   
>> +#define ERR_INDEX(_bit, index) \
>> +	[__ffs(_bit)] = index
> 
> Does this guarantee compile time evaluation for all archs? We might risk
> breaking other builds where the compiler can't figure it out.

Thanks for pointing this out. Will fix this
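
One possible way to sidestep the question, sketched as a standalone example:
define the bit positions as plain integer constants and derive the masks from
them, so the array designator is an ordinary constant expression. All ex_*
names are hypothetical and only the bit numbers (0 for GT, 16 for SOC) mirror
the defines in this series:

```c
#include <stdint.h>

/* Hypothetical names for illustration; not taken from the patch. */
#define EX_GT_ERROR_BIT		0
#define EX_SOC_ERROR_BIT	16
#define EX_BIT(n)		(UINT32_C(1) << (n))
#define EX_GT_ERROR		EX_BIT(EX_GT_ERROR_BIT)
#define EX_SOC_ERROR		EX_BIT(EX_SOC_ERROR_BIT)

enum ex_error_id {
	EX_ERROR_NONE = 0,
	EX_ERROR_CORE_COMPUTE,
	EX_ERROR_SOC_INTERNAL,
};

/*
 * The designators are plain integer constants, so no helper has to be
 * evaluated by the compiler to build the table.
 */
static const uint8_t ex_hw_error_map[32] = {
	[EX_GT_ERROR_BIT]	= EX_ERROR_CORE_COMPUTE,
	[EX_SOC_ERROR_BIT]	= EX_ERROR_SOC_INTERNAL,
};

/* Map a source register bit to an error id; unmapped bits give NONE. */
static enum ex_error_id ex_lookup(unsigned int bit)
{
	return bit < 32 ? (enum ex_error_id)ex_hw_error_map[bit] : EX_ERROR_NONE;
}
```

Sizing the table to the register width also keeps lookups for unmapped bits
inside the array.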

> 
>> +static const unsigned long xe_hw_error_map[] = {
>> +	ERR_INDEX(XE_GT_ERROR, DRM_XE_RAS_ERROR_CORE_COMPUTE),
>> +};
>> +
>> +enum gt_vector_regs {
>> +	ERR_STAT_GT_VECTOR0 = 0,
>> +	ERR_STAT_GT_VECTOR1,
>> +	ERR_STAT_GT_VECTOR2,
>> +	ERR_STAT_GT_VECTOR3,
>> +	ERR_STAT_GT_VECTOR4,
>> +	ERR_STAT_GT_VECTOR5,
>> +	ERR_STAT_GT_VECTOR6,
>> +	ERR_STAT_GT_VECTOR7,
>> +	ERR_STAT_GT_VECTOR_MAX,
>> +};
>> +
>>   static bool fault_inject_csc_hw_error(void)
>>   {
>>   	return IS_ENABLED(CONFIG_DEBUG_FS) && should_fail(&inject_csc_hw_error, 1);
>> @@ -78,14 +100,136 @@ static void csc_hw_error_handler(struct xe_tile *tile,
>>   	xe_mmio_write32(mmio, HEC_UNCORR_ERR_STATUS(base), err_src);
>>   }
>>   
>> +static void log_hw_error(struct xe_tile *tile, const char *name,
>> +			 const enum drm_xe_ras_error_severity severity)
>> +{
>> +	const char *severity_str = error_severity[severity];
>> +	struct xe_device *xe = tile_to_xe(tile);
>> +
>> +	if (severity == DRM_XE_RAS_ERROR_FATAL)
>> +		drm_err_ratelimited(&xe->drm, "%s %s error detected\n", name, severity_str);
> 
> Adding to my earlier bikeshed[*] about 'error', it's either as part of
> severity_str or message but let's not spam it.
> 
> [*] https://lore.kernel.org/intel-xe/aTfcV5nb_vBOOBvP@black.igk.intel.com/

fixing this

> 
>> +	else
>> +		drm_warn(&xe->drm, "%s %s error detected\n", name, severity_str);
> 
> Ditto.
> 
>> +}
>> +
>> +static void
>> +log_gt_err(struct xe_tile *tile, const char *name, int i, u32 err,
> 
> Could be a single line. Also, 'i' could be unsigned.

okay

> 
>> +	   const enum drm_xe_ras_error_severity severity)
>> +{
>> +	const char *severity_str = error_severity[severity];
>> +	struct xe_device *xe = tile_to_xe(tile);
>> +
>> +	if (severity == DRM_XE_RAS_ERROR_FATAL)
>> +		drm_err_ratelimited(&xe->drm, "%s %s error detected, ERROR_STAT_GT_VECTOR%d:0x%08x\n",
> 
> Same as above.

The name here is the type of GT error, so it will not print an extra 'error'.
It is used as follows:

log_gt_err(tile, "TLB", i, vector, severity);
log_gt_err(tile, "Subslice", i, vector, severity);


> 
>> +				    name, severity_str, i, err);
>> +	else
>> +		drm_warn(&xe->drm, "%s %s error detected, ERROR_STAT_GT_VECTOR%d:0x%08x\n",
> 
> Ditto.
> 
>> +			 name, severity_str, i, err);
>> +}
>> +
>> +static void gt_handle_errors(struct xe_tile *tile,
>> +			     const enum drm_xe_ras_error_severity severity, u32 error_id)
>> +{
>> +	struct xe_device *xe = tile_to_xe(tile);
>> +	struct xe_drm_ras *ras = &xe->ras;
>> +	struct xe_drm_ras_counter *info = ras->info[severity];
>> +	struct xe_mmio *mmio = &tile->mmio;
>> +	unsigned long err_stat = 0;
>> +	int i;
>> +
>> +	if (xe->info.platform != XE_PVC)
>> +		return;
>> +
>> +	for (i = 0; i < ERR_STAT_GT_VECTOR_MAX; i++) {
>> +		u32 vector, val;
>> +
>> +		if (severity == DRM_XE_RAS_ERROR_CORRECTABLE && i >= ERR_STAT_GT_COR_VECTOR_LEN)
> 
> This can be a temp variable which you can use as a for loop condition,
> but please use explicit length per severity so we don't come back
> questioning things.
> 

okay
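
A rough standalone sketch of the suggested loop bound, assuming hypothetical
ex_* names. The correctable length of 4 matches ERR_STAT_GT_COR_VECTOR_LEN in
the patch and 8 matches ERR_STAT_GT_VECTOR_MAX; the separate fatal define is
an assumption made for the example:

```c
enum ex_severity { EX_CORRECTABLE, EX_NONFATAL, EX_FATAL };

#define EX_COR_VECTOR_LEN	4
#define EX_FATAL_VECTOR_LEN	8

/* Explicit number of vector registers for each severity. */
static unsigned int ex_vector_len(enum ex_severity severity)
{
	switch (severity) {
	case EX_CORRECTABLE:
		return EX_COR_VECTOR_LEN;
	case EX_FATAL:
		return EX_FATAL_VECTOR_LEN;
	default:
		return 0;	/* no vector registers for this severity */
	}
}

/* Count the non-zero vectors, bounded by the per-severity length. */
static unsigned int ex_count_nonzero(enum ex_severity severity,
				     const unsigned int *vectors)
{
	const unsigned int len = ex_vector_len(severity);
	unsigned int seen = 0;

	for (unsigned int i = 0; i < len; i++)
		if (vectors[i])
			seen++;

	return seen;
}
```

The bound is computed once before the loop, so the loop body needs no
severity check.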


>> +			break;
>> +
>> +		vector = xe_mmio_read32(mmio, ERR_STAT_GT_VECTOR_REG(severity, i));
>> +		if (!vector)
>> +			continue;
>> +
>> +		switch (i) {
>> +		case ERR_STAT_GT_VECTOR0:
>> +		case ERR_STAT_GT_VECTOR1:
>> +			u32 errbit;
>> +
>> +			val = hweight32(vector);
>> +			atomic64_add(val, &info[error_id].counter);
>> +			log_gt_err(tile, "Subslice", i, vector, severity);
>> +
>> +			if (err_stat)
>> +				break;
> 
> So we won't ever be getting past this point, is that right?

err_stat is read only once. On the first pass it is still zero, so we do not hit this break.


> 
>> +			err_stat = xe_mmio_read32(mmio, ERR_STAT_GT_REG(severity));
>> +			for_each_set_bit(errbit, &err_stat, GT_HW_ERROR_MAX_ERR_BITS) {
>> +				if (severity == DRM_XE_RAS_ERROR_CORRECTABLE &&
>> +				    (BIT(errbit) & PVC_COR_ERR_MASK))
>> +					atomic64_inc(&info[error_id].counter);
>> +				if (severity == DRM_XE_RAS_ERROR_FATAL &&
>> +				    (BIT(errbit) & PVC_FAT_ERR_MASK))
>> +					atomic64_inc(&info[error_id].counter);
>> +			}
>> +			if (err_stat)
>> +				xe_mmio_write32(mmio, ERR_STAT_GT_REG(severity), err_stat);
>> +			break;
>> +		case ERR_STAT_GT_VECTOR2:
>> +		case ERR_STAT_GT_VECTOR3:
>> +			val = hweight32(vector);
>> +			atomic64_add(val, &info[error_id].counter);
>> +			log_gt_err(tile, "L3 BANK", i, vector, severity);
>> +			break;
>> +		case ERR_STAT_GT_VECTOR6:
>> +			val = hweight32(vector);
>> +			atomic64_add(val, &info[error_id].counter);
>> +			log_gt_err(tile, "TLB", i, vector, severity);
>> +			break;
>> +		case ERR_STAT_GT_VECTOR7:
>> +			val = hweight32(vector);
>> +			atomic64_add(val, &info[error_id].counter);
>> +			break;
>> +		default:
>> +			log_gt_err(tile, "Undefined", i, vector, severity);
>> +		}
>> +
>> +		xe_mmio_write32(mmio, ERR_STAT_GT_VECTOR_REG(severity, i), vector);
>> +	}
>> +}
>> +
>> +static void gt_hw_error_handler(struct xe_tile *tile,
>> +				const enum drm_xe_ras_error_severity severity, u32 error_id)
> 
> This entire function can be just an explicit NONFATAL case at the top of
> gt_handle_errors().

okay

> 
>> +{
>> +	struct xe_device *xe = tile_to_xe(tile);
>> +	struct xe_drm_ras *ras = &xe->ras;
>> +	struct xe_drm_ras_counter *info = ras->info[severity];
>> +
>> +	switch (severity) {
>> +	case DRM_XE_RAS_ERROR_CORRECTABLE:
>> +		gt_handle_errors(tile, severity, error_id);
>> +		break;
>> +	case DRM_XE_RAS_ERROR_NONFATAL:
>> +		atomic64_inc(&info[error_id].counter);
>> +		log_hw_error(tile, "GT", severity);
>> +		break;
>> +	case DRM_XE_RAS_ERROR_FATAL:
>> +		gt_handle_errors(tile, severity, error_id);
>> +		break;
>> +	default:
>> +		drm_warn(&xe->drm, "Undefined error detected\n");
>> +	}
>> +}
>> +
>>   static void hw_error_source_handler(struct xe_tile *tile, enum drm_xe_ras_error_severity severity)
> 
> Perhaps severity can be const now?

yeah it can be const here. Will add

> 
>>   {
>>   	const char *severity_str = error_severity[severity];
>>   	struct xe_device *xe = tile_to_xe(tile);
>> -	unsigned long flags;
>> -	u32 err_src;
>> +	struct xe_drm_ras *ras = &xe->ras;
>> +	struct xe_drm_ras_counter *info = ras->info[severity];
>> +	unsigned long flags, err_src;
>> +	u32 err_bit;
>>   
>> -	if (xe->info.platform != XE_BATTLEMAGE)
>> +	if (!IS_DGFX(xe))
>>   		return;
>>   
>>   	spin_lock_irqsave(&xe->irq.lock, flags);
> 
> Isn't this IRQ context only? What am I missing?

This is right

> 
>> @@ -96,11 +240,39 @@ static void hw_error_source_handler(struct xe_tile *tile, enum drm_xe_ras_error_
>>   		goto unlock;
>>   	}
>>   
>> -	if (err_src & XE_CSC_ERROR)
>> +	if (err_src & XE_CSC_ERROR) {
> 
> Shouldn't this be inside the loop below?

We do not have a separate type for CSC. Once we get a CSC error, the
driver will be wedged and the only way to recover is a firmware flash.

So there is no point in keeping count or checking other bits.


> 
>>   		csc_hw_error_handler(tile, severity);
>> +		goto clear_reg;
> 
> If no, should we not check for other bits before bailing?
> 
>> +	}
>>   
>> -	xe_mmio_write32(&tile->mmio, DEV_ERR_STAT_REG(severity), err_src);
>> +	if (!info) {
>> +		drm_err_ratelimited(&xe->drm, HW_ERR "Errors undefined\n");
>> +		goto clear_reg;
>> +	}
>> +
>> +	for_each_set_bit(err_bit, &err_src, XE_RAS_REG_SIZE) {
>> +		u32 error_id = xe_hw_error_map[err_bit];
>> +		const char *name;
>> +
>> +		name = info[error_id].name;
>> +		if (!name)
>> +			goto clear_reg;
>>   
>> +		if (severity == DRM_XE_RAS_ERROR_FATAL) {
>> +			drm_err_ratelimited(&xe->drm, HW_ERR
>> +					    "TILE%d reported %s %s error, bit[%d] is set\n",
>> +					    tile->id, name, severity_str, err_bit);
>> +		} else {
>> +			drm_warn(&xe->drm, HW_ERR
>> +				 "TILE%d reported %s %s error, bit[%d] is set\n",
>> +				 tile->id, name, severity_str, err_bit);
>> +		}
>> +		if (BIT(err_bit) & XE_GT_ERROR)
> 
> Shouldn't this be compare?

This also won't give an error. But yes, since I am removing __ffs(), I will change this.

Thanks
Riana

> 
> Raag
> 
>> +			gt_hw_error_handler(tile, severity, error_id);
>> +	}
>> +
>> +clear_reg:
>> +	xe_mmio_write32(&tile->mmio, DEV_ERR_STAT_REG(severity), err_src);
>>   unlock:
>>   	spin_unlock_irqrestore(&xe->irq.lock, flags);
>>   }
>> -- 
>> 2.47.1
>>



* Re: [PATCH v3 3/4] drm/xe/xe_hw_error: Add support for GT hardware errors
  2026-01-12  3:41     ` Riana Tauro
@ 2026-01-12 10:02       ` Raag Jadav
  0 siblings, 0 replies; 31+ messages in thread
From: Raag Jadav @ 2026-01-12 10:02 UTC (permalink / raw)
  To: Riana Tauro
  Cc: intel-xe, dri-devel, aravind.iddamsetty, anshuman.gupta,
	rodrigo.vivi, joonas.lahtinen, simona.vetter, airlied,
	pratik.bari, joshua.santosh.ranjan, ashwin.kumar.kulkarni,
	shubham.kumar, Himal Prasad Ghimiray

On Mon, Jan 12, 2026 at 09:11:05AM +0530, Riana Tauro wrote:
> On 12/10/2025 11:48 PM, Raag Jadav wrote:
> > On Fri, Dec 05, 2025 at 02:09:35PM +0530, Riana Tauro wrote:
> > > PVC supports GT error reporting via vector registers along with
> > > error status register. Add support to report these errors and
> > > update respective counters. In case of a Subslice error reported
> > > by vector register, process the error status register
> > > for applicable bits.
> > > 
> > > Incorporate the counter inside the driver itself and start
> > > using the drm_ras generic netlink to report them.

...

> > > +		vector = xe_mmio_read32(mmio, ERR_STAT_GT_VECTOR_REG(severity, i));
> > > +		if (!vector)
> > > +			continue;
> > > +
> > > +		switch (i) {
> > > +		case ERR_STAT_GT_VECTOR0:
> > > +		case ERR_STAT_GT_VECTOR1:
> > > +			u32 errbit;
> > > +
> > > +			val = hweight32(vector);
> > > +			atomic64_add(val, &info[error_id].counter);
> > > +			log_gt_err(tile, "Subslice", i, vector, severity);
> > > +
> > > +			if (err_stat)
> > > +				break;
> > 
> > So we won't ever be getting past this point, is that right?
> 
> > err_stat is read only once. On the first pass it is still zero, so we do not hit this break.

Right, so let's explain it with a small comment.
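
Something along these lines, as a standalone sketch with the comment in
place. The ex_* names and the fake register struct are hypothetical and only
model the read-once behaviour, not the real MMIO accessors:

```c
#include <stdint.h>

#define EX_NUM_VECTORS 2

/* Hypothetical stand-in for the MMIO error status register. */
struct ex_regs {
	uint32_t err_stat;
	unsigned int err_stat_reads;
};

static uint32_t ex_read_err_stat(struct ex_regs *regs)
{
	regs->err_stat_reads++;
	return regs->err_stat;
}

static unsigned int ex_count_bits(uint32_t v)
{
	unsigned int n = 0;

	for (; v; v &= v - 1)
		n++;

	return n;
}

/* Returns the number of status bits counted across all vectors. */
static unsigned int ex_handle_vectors(struct ex_regs *regs, const uint32_t *vectors)
{
	uint32_t err_stat = 0;
	unsigned int counted = 0;

	for (int i = 0; i < EX_NUM_VECTORS; i++) {
		if (!vectors[i])
			continue;

		/*
		 * The status register is shared by the subslice vectors.
		 * err_stat stays zero until the first read, so a non-zero
		 * value means it has already been processed.
		 */
		if (err_stat)
			continue;

		err_stat = ex_read_err_stat(regs);
		counted += ex_count_bits(err_stat);
	}

	return counted;
}
```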

...

> > > @@ -96,11 +240,39 @@ static void hw_error_source_handler(struct xe_tile *tile, enum drm_xe_ras_error_
> > >   		goto unlock;
> > >   	}
> > > -	if (err_src & XE_CSC_ERROR)
> > > +	if (err_src & XE_CSC_ERROR) {
> > 
> > Shouldn't this be inside the loop below?
> 
> > We do not have a separate type for CSC. Once we get a CSC error, the
> > driver will be wedged and the only way to recover is a firmware flash.
> > 
> > So there is no point in keeping count or checking other bits.

Ditto for a small comment.
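
As a standalone sketch of how that early exit could read with the comment
added. The ex_* names and the result struct are hypothetical; the bit
positions (17 for CSC, 0 for GT) follow the defines in this patch:

```c
#include <stdbool.h>
#include <stdint.h>

#define EX_CSC_ERROR	(UINT32_C(1) << 17)
#define EX_GT_ERROR	(UINT32_C(1) << 0)

struct ex_result {
	bool wedged;
	unsigned int gt_errors;
};

static struct ex_result ex_handle_sources(uint32_t err_src)
{
	struct ex_result res = { false, 0 };

	/*
	 * A CSC error wedges the device and recovery needs a firmware
	 * flash, so counting or decoding the remaining source bits is
	 * pointless.
	 */
	if (err_src & EX_CSC_ERROR) {
		res.wedged = true;
		return res;
	}

	if (err_src & EX_GT_ERROR)
		res.gt_errors++;

	return res;
}
```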

Raag


* [PATCH v3 4/4] drm/xe/xe_hw_error: Add support for PVC SOC errors
  2025-12-05  8:39 [PATCH v3 0/4] Introduce DRM_RAS using generic netlink for RAS Riana Tauro
                   ` (2 preceding siblings ...)
  2025-12-05  8:39 ` [PATCH v3 3/4] drm/xe/xe_hw_error: Add support for GT hardware errors Riana Tauro
@ 2025-12-05  8:39 ` Riana Tauro
  2025-12-15 10:52   ` Raag Jadav
  2025-12-05  9:40 ` ✗ CI.checkpatch: warning for Introduce DRM_RAS using generic netlink for RAS (rev3) Patchwork
                   ` (4 subsequent siblings)
  8 siblings, 1 reply; 31+ messages in thread
From: Riana Tauro @ 2025-12-05  8:39 UTC (permalink / raw)
  To: intel-xe, dri-devel
  Cc: aravind.iddamsetty, anshuman.gupta, rodrigo.vivi, joonas.lahtinen,
	lukas, simona.vetter, airlied, pratik.bari, joshua.santosh.ranjan,
	ashwin.kumar.kulkarni, shubham.kumar, Riana Tauro,
	Himal Prasad Ghimiray

Report the SOC nonfatal/fatal hardware error and update the counters.

Co-developed-by: Himal Prasad Ghimiray <himal.prasad.ghimiray@intel.com>
Signed-off-by: Himal Prasad Ghimiray <himal.prasad.ghimiray@intel.com>
Signed-off-by: Riana Tauro <riana.tauro@intel.com>
---
v2: Add IDs and names as uAPI (Rodrigo)
---
 drivers/gpu/drm/xe/regs/xe_hw_error_regs.h |  24 +++
 drivers/gpu/drm/xe/xe_hw_error.c           | 202 +++++++++++++++++++++
 2 files changed, 226 insertions(+)

diff --git a/drivers/gpu/drm/xe/regs/xe_hw_error_regs.h b/drivers/gpu/drm/xe/regs/xe_hw_error_regs.h
index b54712e893d5..3280af41f79f 100644
--- a/drivers/gpu/drm/xe/regs/xe_hw_error_regs.h
+++ b/drivers/gpu/drm/xe/regs/xe_hw_error_regs.h
@@ -40,6 +40,7 @@
 								  DEV_ERR_STAT_NONFATAL))
 
 #define   XE_CSC_ERROR				BIT(17)
+#define   XE_SOC_ERROR				BIT(16)
 #define   XE_GT_ERROR				BIT(0)
 
 #define  ERR_STAT_GT_FATAL_VECTOR_0		0x100260
@@ -61,4 +62,27 @@
 						 ERR_STAT_GT_COR_VECTOR_REG(x) : \
 						 ERR_STAT_GT_FATAL_VECTOR_REG(x))
 
+#define SOC_PVC_BASE				0x282000
+#define SOC_PVC_SLAVE_BASE			0x283000
+
+#define SOC_GCOERRSTS				0x200
+#define SOC_GNFERRSTS				0x210
+#define SOC_GLOBAL_ERR_STAT_REG(base, x)	XE_REG(_PICK_EVEN((x), \
+								  (base) + SOC_GCOERRSTS, \
+								  (base) + SOC_GNFERRSTS))
+#define   SOC_SLAVE_IEH				BIT(1)
+#define   SOC_IEH0_LOCAL_ERR_STATUS		BIT(0)
+#define   SOC_IEH1_LOCAL_ERR_STATUS		BIT(0)
+
+#define SOC_GSYSEVTCTL				0x264
+#define SOC_GSYSEVTCTL_REG(base, slave_base, x)	XE_REG(_PICK_EVEN((x), \
+								  (base) + SOC_GSYSEVTCTL, \
+								  slave_base + SOC_GSYSEVTCTL))
+
+#define SOC_LERRUNCSTS				0x280
+#define SOC_LERRCORSTS				0x294
+#define SOC_LOCAL_ERR_STAT_REG(base, x)		XE_REG(x == DRM_XE_RAS_ERROR_CORRECTABLE ? \
+						      (base) + SOC_LERRCORSTS : \
+						      (base) + SOC_LERRUNCSTS)
+
 #endif
diff --git a/drivers/gpu/drm/xe/xe_hw_error.c b/drivers/gpu/drm/xe/xe_hw_error.c
index 77c90f1b06fd..1b7c782dbd98 100644
--- a/drivers/gpu/drm/xe/xe_hw_error.c
+++ b/drivers/gpu/drm/xe/xe_hw_error.c
@@ -18,6 +18,7 @@
 
 #define  HEC_UNCORR_FW_ERR_BITS 4
 #define XE_RAS_REG_SIZE 32
+#define XE_SOC_NUM_IEH 2
 
 extern struct fault_attr inject_csc_hw_error;
 static const char * const error_severity[] = DRM_XE_RAS_ERROR_SEVERITY_NAMES;
@@ -34,6 +35,7 @@ static const char * const hec_uncorrected_fw_errors[] = {
 
 static const unsigned long xe_hw_error_map[] = {
 	ERR_INDEX(XE_GT_ERROR, DRM_XE_RAS_ERROR_CORE_COMPUTE),
+	ERR_INDEX(XE_SOC_ERROR, DRM_XE_RAS_ERROR_SOC_INTERNAL),
 };
 
 enum gt_vector_regs {
@@ -48,6 +50,92 @@ enum gt_vector_regs {
 	ERR_STAT_GT_VECTOR_MAX,
 };
 
+static const char * const pvc_slave_local_fatal_err_reg[] = {
+	[0]		= "Local IEH internal: Malformed PCIe AER",
+	[1]		= "Local IEH internal: Malformed PCIe ERR",
+	[2]		= "Local IEH internal: UR conditions in IEH",
+	[3]		= "Local IEH internal: From SERR Sources",
+	[4 ... 19]	= "Undefined",
+	[20]		= "Malformed MCA error packet (HBM/Punit)",
+	[21 ... 31]	= "Undefined",
+};
+
+static const char * const pvc_slave_global_err_reg[] = {
+	[0]         = "Undefined",
+	[1]         = "HBM SS2: Channel0",
+	[2]         = "HBM SS2: Channel1",
+	[3]         = "HBM SS2: Channel2",
+	[4]         = "HBM SS2: Channel3",
+	[5]         = "HBM SS2: Channel4",
+	[6]         = "HBM SS2: Channel5",
+	[7]         = "HBM SS2: Channel6",
+	[8]         = "HBM SS2: Channel7",
+	[9]         = "HBM SS3: Channel0",
+	[10]        = "HBM SS3: Channel1",
+	[11]        = "HBM SS3: Channel2",
+	[12]        = "HBM SS3: Channel3",
+	[13]        = "HBM SS3: Channel4",
+	[14]        = "HBM SS3: Channel5",
+	[15]        = "HBM SS3: Channel6",
+	[16]        = "HBM SS3: Channel7",
+	[17]        = "Undefined",
+	[18]        = "ANR MDFI",
+	[19 ... 31] = "Undefined",
+};
+
+static const char * const pvc_master_global_err_reg[] = {
+	[0 ... 1]   = "Undefined",
+	[2]         =  "HBM SS0: Channel0",
+	[3]         =  "HBM SS0: Channel1",
+	[4]         =  "HBM SS0: Channel2",
+	[5]         =  "HBM SS0: Channel3",
+	[6]         =  "HBM SS0: Channel4",
+	[7]         =  "HBM SS0: Channel5",
+	[8]         =  "HBM SS0: Channel6",
+	[9]         =  "HBM SS0: Channel7",
+	[10]        =  "HBM SS1: Channel0",
+	[11]        =  "HBM SS1: Channel1",
+	[12]        =  "HBM SS1: Channel2",
+	[13]        =  "HBM SS1: Channel3",
+	[14]        =  "HBM SS1: Channel4",
+	[15]        =  "HBM SS1: Channel5",
+	[16]        =  "HBM SS1: Channel6",
+	[17]        =  "HBM SS1: Channel7",
+	[18 ... 31] = "Undefined",
+};
+
+static const char * const pvc_master_local_fatal_err_reg[] = {
+	[0]         = "Local IEH internal: Malformed IOSF PCIe AER",
+	[1]         = "Local IEH internal: Malformed IOSF PCIe ERR",
+	[2]         = "Local IEH internal: IEH UR RESPONSE",
+	[3]         = "Local IEH internal: From SERR SPI controller",
+	[4]         = "Base Die MDFI T2T",
+	[5]         = "Undefined",
+	[6]         = "Base Die MDFI T2C",
+	[7]         = "Undefined",
+	[8]         = "Invalid CSC PSF Command Parity",
+	[9]         = "Invalid CSC PSF Unexpected Completion",
+	[10]        = "Invalid CSC PSF Unsupported Request",
+	[11]        = "Invalid PCIe PSF Command Parity",
+	[12]        = "PCIe PSF Unexpected Completion",
+	[13]        = "PCIe PSF Unsupported Request",
+	[14 ... 19] = "Undefined",
+	[20]        = "Malformed MCA error packet (HBM/Punit)",
+	[21 ... 31] = "Undefined",
+};
+
+static const char * const pvc_master_local_nonfatal_err_reg[] = {
+	[0 ... 3]   = "Undefined",
+	[4]         = "Base Die MDFI T2T",
+	[5]         = "Undefined",
+	[6]         = "Base Die MDFI T2C",
+	[7]         = "Undefined",
+	[8]         = "Invalid CSC PSF Command Parity",
+	[9]         = "Invalid CSC PSF Unexpected Completion",
+	[10]        = "Invalid PCIe PSF Command Parity",
+	[11 ... 31] = "Undefined",
+};
+
 static bool fault_inject_csc_hw_error(void)
 {
 	return IS_ENABLED(CONFIG_DEBUG_FS) && should_fail(&inject_csc_hw_error, 1);
@@ -197,6 +285,117 @@ static void gt_handle_errors(struct xe_tile *tile,
 	}
 }
 
+static void log_soc_error(struct xe_tile *tile, const char * const *reg_info,
+			  const enum drm_xe_ras_error_severity severity, u32 err_bit, u32 index)
+{
+	const char *severity_str = error_severity[severity];
+	struct xe_device *xe = tile_to_xe(tile);
+	struct xe_drm_ras *ras = &xe->ras;
+	struct xe_drm_ras_counter *info = ras->info[severity];
+	const char *name;
+
+	name = reg_info[err_bit];
+
+	if (strcmp(name, "Undefined") != 0) {
+		if (severity == DRM_XE_RAS_ERROR_FATAL)
+			drm_err_ratelimited(&xe->drm, "%s SOC %s error detected",
+					    name, severity_str);
+		else
+			drm_warn(&xe->drm, "%s SOC %s error detected", name, severity_str);
+		atomic64_inc(&info[index].counter);
+	}
+}
+
+static void soc_hw_error_handler(struct xe_tile *tile,
+				 const enum drm_xe_ras_error_severity severity, u32 error_id)
+{
+	struct xe_device *xe = tile_to_xe(tile);
+	struct xe_mmio *mmio = &tile->mmio;
+	unsigned long master_global_errstat, slave_global_errstat;
+	unsigned long master_local_errstat, slave_local_errstat;
+	u32 base, slave_base, regbit;
+	int i;
+
+	if (xe->info.platform != XE_PVC)
+		return;
+
+	base = SOC_PVC_BASE;
+	slave_base = SOC_PVC_SLAVE_BASE;
+
+	/*
+	 * Mask error type in GSYSEVTCTL so that no new errors of the type will be reported
+	 */
+	for (i = 0; i < XE_SOC_NUM_IEH; i++)
+		xe_mmio_write32(mmio, SOC_GSYSEVTCTL_REG(base, slave_base, i), ~REG_BIT(severity));
+
+	if (severity == DRM_XE_RAS_ERROR_CORRECTABLE) {
+		xe_mmio_write32(mmio, SOC_GLOBAL_ERR_STAT_REG(base, severity), REG_GENMASK(31, 0));
+		xe_mmio_write32(mmio, SOC_LOCAL_ERR_STAT_REG(base, severity), REG_GENMASK(31, 0));
+		xe_mmio_write32(mmio, SOC_GLOBAL_ERR_STAT_REG(slave_base, severity),
+				REG_GENMASK(31, 0));
+		xe_mmio_write32(mmio, SOC_LOCAL_ERR_STAT_REG(slave_base, severity),
+				REG_GENMASK(31, 0));
+		goto unmask_gsysevtctl;
+	}
+
+	/*
+	 * Read the master global IEH error register. If BIT 1 is set,
+	 * process the slave IEH first. If BIT 0 in the global error
+	 * register is set, process the corresponding local error
+	 * registers.
+	 */
+	master_global_errstat = xe_mmio_read32(mmio, SOC_GLOBAL_ERR_STAT_REG(base, severity));
+	if (master_global_errstat & SOC_SLAVE_IEH) {
+		slave_global_errstat = xe_mmio_read32(mmio,
+						      SOC_GLOBAL_ERR_STAT_REG(slave_base, severity));
+		if (slave_global_errstat & SOC_IEH1_LOCAL_ERR_STATUS) {
+			slave_local_errstat = xe_mmio_read32(mmio,
+							     SOC_LOCAL_ERR_STAT_REG(slave_base,
+										    severity));
+
+			for_each_set_bit(regbit, &slave_local_errstat, XE_RAS_REG_SIZE) {
+				if (severity == DRM_XE_RAS_ERROR_FATAL)
+					log_soc_error(tile, pvc_slave_local_fatal_err_reg, severity,
+						      regbit, error_id);
+			}
+
+			xe_mmio_write32(mmio, SOC_LOCAL_ERR_STAT_REG(slave_base, severity),
+					slave_local_errstat);
+		}
+
+		for_each_set_bit(regbit, &slave_global_errstat, XE_RAS_REG_SIZE)
+			log_soc_error(tile, pvc_slave_global_err_reg, severity, regbit, error_id);
+
+		xe_mmio_write32(mmio, SOC_GLOBAL_ERR_STAT_REG(slave_base, severity),
+				slave_global_errstat);
+	}
+
+	if (master_global_errstat & SOC_IEH0_LOCAL_ERR_STATUS) {
+		master_local_errstat = xe_mmio_read32(mmio, SOC_LOCAL_ERR_STAT_REG(base, severity));
+
+		for_each_set_bit(regbit, &master_local_errstat, XE_RAS_REG_SIZE) {
+			if (severity == DRM_XE_RAS_ERROR_FATAL)
+				log_soc_error(tile, pvc_master_local_fatal_err_reg, severity,
+					      regbit, error_id);
+			if (severity == DRM_XE_RAS_ERROR_NONFATAL)
+				log_soc_error(tile, pvc_master_local_nonfatal_err_reg, severity,
+					      regbit, error_id);
+		}
+
+		xe_mmio_write32(mmio, SOC_LOCAL_ERR_STAT_REG(base, severity), master_local_errstat);
+	}
+
+	for_each_set_bit(regbit, &master_global_errstat, XE_RAS_REG_SIZE)
+		log_soc_error(tile, pvc_master_global_err_reg, severity, regbit, error_id);
+
+	xe_mmio_write32(mmio, SOC_GLOBAL_ERR_STAT_REG(base, severity), master_global_errstat);
+
+unmask_gsysevtctl:
+	for (i = 0; i < XE_SOC_NUM_IEH; i++)
+		xe_mmio_write32(mmio, SOC_GSYSEVTCTL_REG(base, slave_base, i),
+				(DRM_XE_RAS_ERROR_SEVERITY_MAX << 1) + 1);
+}
+
 static void gt_hw_error_handler(struct xe_tile *tile,
 				const enum drm_xe_ras_error_severity severity, u32 error_id)
 {
@@ -269,6 +468,9 @@ static void hw_error_source_handler(struct xe_tile *tile, enum drm_xe_ras_error_
 		}
 		if (BIT(err_bit) & XE_GT_ERROR)
 			gt_hw_error_handler(tile, severity, error_id);
+
+		if (BIT(err_bit) == XE_SOC_ERROR)
+			soc_hw_error_handler(tile, severity, error_id);
 	}
 
 clear_reg:
-- 
2.47.1



* Re: [PATCH v3 4/4] drm/xe/xe_hw_error: Add support for PVC SOC errors
  2025-12-05  8:39 ` [PATCH v3 4/4] drm/xe/xe_hw_error: Add support for PVC SOC errors Riana Tauro
@ 2025-12-15 10:52   ` Raag Jadav
  2026-01-12  4:45     ` Riana Tauro
  0 siblings, 1 reply; 31+ messages in thread
From: Raag Jadav @ 2025-12-15 10:52 UTC (permalink / raw)
  To: Riana Tauro
  Cc: intel-xe, dri-devel, aravind.iddamsetty, anshuman.gupta,
	rodrigo.vivi, joonas.lahtinen, lukas, simona.vetter, airlied,
	pratik.bari, joshua.santosh.ranjan, ashwin.kumar.kulkarni,
	shubham.kumar, Himal Prasad Ghimiray

On Fri, Dec 05, 2025 at 02:09:36PM +0530, Riana Tauro wrote:
> Report the SOC nonfatal/fatal hardware error and update the counters.

...

> +#define SOC_PVC_BASE				0x282000

Curious, should we use 'master' naming for consistency with the code?

> +#define SOC_PVC_SLAVE_BASE			0x283000
> +
> +#define SOC_GCOERRSTS				0x200
> +#define SOC_GNFERRSTS				0x210
> +#define SOC_GLOBAL_ERR_STAT_REG(base, x)	XE_REG(_PICK_EVEN((x), \
> +								  (base) + SOC_GCOERRSTS, \
> +								  (base) + SOC_GNFERRSTS))
> +#define   SOC_SLAVE_IEH				BIT(1)
> +#define   SOC_IEH0_LOCAL_ERR_STATUS		BIT(0)
> +#define   SOC_IEH1_LOCAL_ERR_STATUS		BIT(0)

What's the secret spacing convention in this file? Really, I couldn't
figure out ;)

> +#define SOC_GSYSEVTCTL				0x264
> +#define SOC_GSYSEVTCTL_REG(base, slave_base, x)	XE_REG(_PICK_EVEN((x), \
> +								  (base) + SOC_GSYSEVTCTL, \
> +								  slave_base + SOC_GSYSEVTCTL))

Brace around slave_base for consistency.
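
Beyond consistency, a bare macro argument can change the result when the
caller passes an expression. A small standalone sketch with hypothetical ex_*
macros (the 0x264 offset and the two base addresses are the ones from this
patch):

```c
#define EX_GSYSEVTCTL		0x264u

/* Argument used bare, as with slave_base in the hunk above. */
#define EX_REG_BARE(base)	(base + EX_GSYSEVTCTL)
/* Argument parenthesized. */
#define EX_REG_PAREN(base)	((base) + EX_GSYSEVTCTL)

static unsigned int ex_bare(int use_slave)
{
	/* Expands to: use_slave ? 0x283000u : 0x282000u + 0x264u */
	return EX_REG_BARE(use_slave ? 0x283000u : 0x282000u);
}

static unsigned int ex_paren(int use_slave)
{
	/* Expands to: (use_slave ? 0x283000u : 0x282000u) + 0x264u */
	return EX_REG_PAREN(use_slave ? 0x283000u : 0x282000u);
}
```

With the bare form the offset binds to the last operand of the conditional, so
the slave case loses it; the parenthesized form applies it in both cases.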

> +#define SOC_LERRUNCSTS				0x280
> +#define SOC_LERRCORSTS				0x294
> +#define SOC_LOCAL_ERR_STAT_REG(base, x)		XE_REG(x == DRM_XE_RAS_ERROR_CORRECTABLE ? \

In previous patch this was 'hw_err', so whichever one you use please make
it consistent.

> +						      (base) + SOC_LERRCORSTS : \
> +						      (base) + SOC_LERRUNCSTS)
> +
>  #endif
> diff --git a/drivers/gpu/drm/xe/xe_hw_error.c b/drivers/gpu/drm/xe/xe_hw_error.c
> index 77c90f1b06fd..1b7c782dbd98 100644
> --- a/drivers/gpu/drm/xe/xe_hw_error.c
> +++ b/drivers/gpu/drm/xe/xe_hw_error.c
> @@ -18,6 +18,7 @@
>  
>  #define  HEC_UNCORR_FW_ERR_BITS 4
>  #define XE_RAS_REG_SIZE 32
> +#define XE_SOC_NUM_IEH 2

Alignment please! (including the values)

>  extern struct fault_attr inject_csc_hw_error;
>  static const char * const error_severity[] = DRM_XE_RAS_ERROR_SEVERITY_NAMES;
> @@ -34,6 +35,7 @@ static const char * const hec_uncorrected_fw_errors[] = {
>  
>  static const unsigned long xe_hw_error_map[] = {
>  	ERR_INDEX(XE_GT_ERROR, DRM_XE_RAS_ERROR_CORE_COMPUTE),
> +	ERR_INDEX(XE_SOC_ERROR, DRM_XE_RAS_ERROR_SOC_INTERNAL),
>  };
>  
>  enum gt_vector_regs {
> @@ -48,6 +50,92 @@ enum gt_vector_regs {
>  	ERR_STAT_GT_VECTOR_MAX,
>  };
>  
> +static const char * const pvc_slave_local_fatal_err_reg[] = {
> +	[0]		= "Local IEH internal: Malformed PCIe AER",
> +	[1]		= "Local IEH internal: Malformed PCIe ERR",
> +	[2]		= "Local IEH internal: UR conditions in IEH",
> +	[3]		= "Local IEH internal: From SERR Sources",

Unless there's anything like 'IEH external', let's try to simplify a bit.

> +	[4 ... 19]	= "Undefined",
> +	[20]		= "Malformed MCA error packet (HBM/Punit)",
> +	[21 ... 31]	= "Undefined",

Nit: I'd align '=' in all the arrays here, but of course it's a personal
preference.

> +};
> +
> +static const char * const pvc_slave_global_err_reg[] = {
> +	[0]         = "Undefined",
> +	[1]         = "HBM SS2: Channel0",
> +	[2]         = "HBM SS2: Channel1",
> +	[3]         = "HBM SS2: Channel2",
> +	[4]         = "HBM SS2: Channel3",
> +	[5]         = "HBM SS2: Channel4",
> +	[6]         = "HBM SS2: Channel5",
> +	[7]         = "HBM SS2: Channel6",
> +	[8]         = "HBM SS2: Channel7",
> +	[9]         = "HBM SS3: Channel0",
> +	[10]        = "HBM SS3: Channel1",
> +	[11]        = "HBM SS3: Channel2",
> +	[12]        = "HBM SS3: Channel3",
> +	[13]        = "HBM SS3: Channel4",
> +	[14]        = "HBM SS3: Channel5",
> +	[15]        = "HBM SS3: Channel6",
> +	[16]        = "HBM SS3: Channel7",
> +	[17]        = "Undefined",
> +	[18]        = "ANR MDFI",
> +	[19 ... 31] = "Undefined",
> +};
> +
> +static const char * const pvc_master_global_err_reg[] = {
> +	[0 ... 1]   = "Undefined",
> +	[2]         =  "HBM SS0: Channel0",
> +	[3]         =  "HBM SS0: Channel1",
> +	[4]         =  "HBM SS0: Channel2",
> +	[5]         =  "HBM SS0: Channel3",
> +	[6]         =  "HBM SS0: Channel4",
> +	[7]         =  "HBM SS0: Channel5",
> +	[8]         =  "HBM SS0: Channel6",
> +	[9]         =  "HBM SS0: Channel7",
> +	[10]        =  "HBM SS1: Channel0",
> +	[11]        =  "HBM SS1: Channel1",
> +	[12]        =  "HBM SS1: Channel2",
> +	[13]        =  "HBM SS1: Channel3",
> +	[14]        =  "HBM SS1: Channel4",
> +	[15]        =  "HBM SS1: Channel5",
> +	[16]        =  "HBM SS1: Channel6",
> +	[17]        =  "HBM SS1: Channel7",
> +	[18 ... 31] = "Undefined",
> +};

I'd move this array above as per SS<N> ordering. Also, group together
global and local ones.

> +static const char * const pvc_master_local_fatal_err_reg[] = {
> +	[0]         = "Local IEH internal: Malformed IOSF PCIe AER",
> +	[1]         = "Local IEH internal: Malformed IOSF PCIe ERR",
> +	[2]         = "Local IEH internal: IEH UR RESPONSE",
> +	[3]         = "Local IEH internal: From SERR SPI controller",
> +	[4]         = "Base Die MDFI T2T",
> +	[5]         = "Undefined",
> +	[6]         = "Base Die MDFI T2C",
> +	[7]         = "Undefined",
> +	[8]         = "Invalid CSC PSF Command Parity",
> +	[9]         = "Invalid CSC PSF Unexpected Completion",
> +	[10]        = "Invalid CSC PSF Unsupported Request",
> +	[11]        = "Invalid PCIe PSF Command Parity",
> +	[12]        = "PCIe PSF Unexpected Completion",
> +	[13]        = "PCIe PSF Unsupported Request",
> +	[14 ... 19] = "Undefined",
> +	[20]        = "Malformed MCA error packet (HBM/Punit)",
> +	[21 ... 31] = "Undefined",
> +};
> +
> +static const char * const pvc_master_local_nonfatal_err_reg[] = {
> +	[0 ... 3]   = "Undefined",
> +	[4]         = "Base Die MDFI T2T",
> +	[5]         = "Undefined",
> +	[6]         = "Base Die MDFI T2C",
> +	[7]         = "Undefined",
> +	[8]         = "Invalid CSC PSF Command Parity",
> +	[9]         = "Invalid CSC PSF Unexpected Completion",
> +	[10]        = "Invalid PCIe PSF Command Parity",
> +	[11 ... 31] = "Undefined",
> +};
> +
>  static bool fault_inject_csc_hw_error(void)
>  {
>  	return IS_ENABLED(CONFIG_DEBUG_FS) && should_fail(&inject_csc_hw_error, 1);
> @@ -197,6 +285,117 @@ static void gt_handle_errors(struct xe_tile *tile,
>  	}
>  }
>  
> +static void log_soc_error(struct xe_tile *tile, const char * const *reg_info,
> +			  const enum drm_xe_ras_error_severity severity, u32 err_bit, u32 index)
> +{
> +	const char *severity_str = error_severity[severity];
> +	struct xe_device *xe = tile_to_xe(tile);
> +	struct xe_drm_ras *ras = &xe->ras;
> +	struct xe_drm_ras_counter *info = ras->info[severity];
> +	const char *name;
> +
> +	name = reg_info[err_bit];
> +
> +	if (strcmp(name, "Undefined") != 0) {

Do we need '!= 0'?
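
For reference, a minimal userspace sketch of why the explicit comparison is
redundant: strcmp() returns 0 only when the strings match, so any non-zero
result already reads as "true" in a condition. The helper names below are
hypothetical and not part of the patch.

```c
#include <stdbool.h>
#include <string.h>

/* Hypothetical helper: explicit comparison, as written in the patch. */
static bool hyp_differs_explicit(const char *name)
{
	return strcmp(name, "Undefined") != 0;
}

/* Hypothetical helper: same test, relying on non-zero meaning "differs". */
static bool hyp_differs_implicit(const char *name)
{
	if (strcmp(name, "Undefined"))
		return true;

	return false;
}
```

Both helpers give the same answer for every input, so dropping '!= 0' is a
readability choice rather than a behaviour change.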

> +		if (severity == DRM_XE_RAS_ERROR_FATAL)
> +			drm_err_ratelimited(&xe->drm, "%s SOC %s error detected",

Again, duplicate 'error'.

> +					    name, severity_str);
> +		else
> +			drm_warn(&xe->drm, "%s SOC %s error detected", name, severity_str);

Ditto.

> +		atomic64_inc(&info[index].counter);
> +	}
> +}
> +
> +static void soc_hw_error_handler(struct xe_tile *tile,
> +				 const enum drm_xe_ras_error_severity severity, u32 error_id)
> +{
> +	struct xe_device *xe = tile_to_xe(tile);
> +	struct xe_mmio *mmio = &tile->mmio;
> +	unsigned long master_global_errstat, slave_global_errstat;
> +	unsigned long master_local_errstat, slave_local_errstat;
> +	u32 base, slave_base, regbit;
> +	int i;
> +
> +	if (xe->info.platform != XE_PVC)
> +		return;
> +
> +	base = SOC_PVC_BASE;
> +	slave_base = SOC_PVC_SLAVE_BASE;
> +
> +	/*
> +	 * Mask error type in GSYSEVTCTL so that no new errors of the type will be reported
> +	 */

Can be one line.

> +	for (i = 0; i < XE_SOC_NUM_IEH; i++)
> +		xe_mmio_write32(mmio, SOC_GSYSEVTCTL_REG(base, slave_base, i), ~REG_BIT(severity));
> +
> +	if (severity == DRM_XE_RAS_ERROR_CORRECTABLE) {
> +		xe_mmio_write32(mmio, SOC_GLOBAL_ERR_STAT_REG(base, severity), REG_GENMASK(31, 0));
> +		xe_mmio_write32(mmio, SOC_LOCAL_ERR_STAT_REG(base, severity), REG_GENMASK(31, 0));
> +		xe_mmio_write32(mmio, SOC_GLOBAL_ERR_STAT_REG(slave_base, severity),
> +				REG_GENMASK(31, 0));
> +		xe_mmio_write32(mmio, SOC_LOCAL_ERR_STAT_REG(slave_base, severity),
> +				REG_GENMASK(31, 0));
> +		goto unmask_gsysevtctl;
> +	}
> +
> +	/*
> +	 * Read the master global IEH error register if
> +	 * BIT 1 is set then process the slave IEH first. If BIT 0 in
> +	 * global error register is set then process the corresponding
> +	 * Local error registers
> +	 */

This can definitely be fewer lines.

> +	master_global_errstat = xe_mmio_read32(mmio, SOC_GLOBAL_ERR_STAT_REG(base, severity));
> +	if (master_global_errstat & SOC_SLAVE_IEH) {
> +		slave_global_errstat = xe_mmio_read32(mmio,
> +						      SOC_GLOBAL_ERR_STAT_REG(slave_base, severity));
> +		if (slave_global_errstat & SOC_IEH1_LOCAL_ERR_STATUS) {
> +			slave_local_errstat = xe_mmio_read32(mmio,
> +							     SOC_LOCAL_ERR_STAT_REG(slave_base,
> +										    severity));
> +
> +			for_each_set_bit(regbit, &slave_local_errstat, XE_RAS_REG_SIZE) {
> +				if (severity == DRM_XE_RAS_ERROR_FATAL)

Shouldn't this condition be outside the loop? Also, should we not log it
after we clear the bits?

> +					log_soc_error(tile, pvc_slave_local_fatal_err_reg, severity,
> +						      regbit, error_id);
> +			}
> +
> +			xe_mmio_write32(mmio, SOC_LOCAL_ERR_STAT_REG(slave_base, severity),
> +					slave_local_errstat);
> +		}
> +
> +		for_each_set_bit(regbit, &slave_global_errstat, XE_RAS_REG_SIZE)
> +			log_soc_error(tile, pvc_slave_global_err_reg, severity, regbit, error_id);

Ditto.

> +
> +		xe_mmio_write32(mmio, SOC_GLOBAL_ERR_STAT_REG(slave_base, severity),
> +				slave_global_errstat);
> +	}
> +
> +	if (master_global_errstat & SOC_IEH0_LOCAL_ERR_STATUS) {
> +		master_local_errstat = xe_mmio_read32(mmio, SOC_LOCAL_ERR_STAT_REG(base, severity));
> +
> +		for_each_set_bit(regbit, &master_local_errstat, XE_RAS_REG_SIZE) {
> +			if (severity == DRM_XE_RAS_ERROR_FATAL)
> +				log_soc_error(tile, pvc_master_local_fatal_err_reg, severity,
> +					      regbit, error_id);
> +			if (severity == DRM_XE_RAS_ERROR_NONFATAL)
> +				log_soc_error(tile, pvc_master_local_nonfatal_err_reg, severity,
> +					      regbit, error_id);

These can be consolidated using temp variable. Also, log after clear.
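
One possible shape for the consolidation, as a userspace sketch only: pick the
name table once through a temporary pointer, then walk the set bits without
re-testing the severity on each iteration. All identifiers are hypothetical
stand-ins, the tables are trimmed to two entries, and unset slots are NULL
here instead of the "Undefined" strings the patch uses.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical severities; the real enum values are not shown in this thread. */
enum hyp_severity { HYP_CORRECTABLE, HYP_NONFATAL, HYP_FATAL };

/* Trimmed stand-ins for the per-severity name tables. */
static const char * const hyp_fatal_names[32] = {
	[4]  = "Base Die MDFI T2T",
	[10] = "Invalid CSC PSF Unsupported Request",
};

static const char * const hyp_nonfatal_names[32] = {
	[4]  = "Base Die MDFI T2T",
	[10] = "Invalid PCIe PSF Command Parity",
};

/* Select the table once, outside the bit loop. */
static const char * const *hyp_pick_table(enum hyp_severity severity)
{
	switch (severity) {
	case HYP_FATAL:
		return hyp_fatal_names;
	case HYP_NONFATAL:
		return hyp_nonfatal_names;
	default:
		return NULL;
	}
}

/* Count the set bits that have a name in the selected table. */
static unsigned int hyp_count_named(uint32_t errstat, enum hyp_severity severity)
{
	const char * const *table = hyp_pick_table(severity);
	unsigned int bit, named = 0;

	if (!table)
		return 0;

	for (bit = 0; bit < 32; bit++)
		if ((errstat & (UINT32_C(1) << bit)) && table[bit])
			named++;

	return named;
}
```

In the driver the loop body would call the logging helper instead of counting,
but the table selection stays outside the loop either way.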

> +		}
> +
> +		xe_mmio_write32(mmio, SOC_LOCAL_ERR_STAT_REG(base, severity), master_local_errstat);
> +	}
> +
> +	for_each_set_bit(regbit, &master_global_errstat, XE_RAS_REG_SIZE)
> +		log_soc_error(tile, pvc_master_global_err_reg, severity, regbit, error_id);

Ditto.

> +
> +	xe_mmio_write32(mmio, SOC_GLOBAL_ERR_STAT_REG(base, severity), master_global_errstat);
> +
> +unmask_gsysevtctl:
> +	for (i = 0; i < XE_SOC_NUM_IEH; i++)
> +		xe_mmio_write32(mmio, SOC_GSYSEVTCTL_REG(base, slave_base, i),
> +				(DRM_XE_RAS_ERROR_SEVERITY_MAX << 1) + 1);
> +}
> +
>  static void gt_hw_error_handler(struct xe_tile *tile,
>  				const enum drm_xe_ras_error_severity severity, u32 error_id)
>  {
> @@ -269,6 +468,9 @@ static void hw_error_source_handler(struct xe_tile *tile, enum drm_xe_ras_error_
>  		}
>  		if (BIT(err_bit) & XE_GT_ERROR)
>  			gt_hw_error_handler(tile, severity, error_id);
> +
> +		if (BIT(err_bit) == XE_SOC_ERROR)

Make this consistent with above.
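
To illustrate, a small sketch assuming both source masks are single bits (the
real XE_GT_ERROR and XE_SOC_ERROR values are not shown in this thread, so the
values below are made up). Under that assumption '&' and '==' select the same
bit, which is why using one form throughout costs nothing.

```c
#include <stdbool.h>
#include <stdint.h>

#define HYP_BIT(n)	(UINT32_C(1) << (n))

/* Hypothetical single-bit source masks, for illustration only. */
#define HYP_GT_ERROR	HYP_BIT(0)
#define HYP_SOC_ERROR	HYP_BIT(16)

/* Mask test, matching the style of the existing GT check. */
static bool hyp_is_soc_and(unsigned int err_bit)
{
	return HYP_BIT(err_bit) & HYP_SOC_ERROR;
}

/* Equality test, as written in the new SOC check. */
static bool hyp_is_soc_eq(unsigned int err_bit)
{
	return HYP_BIT(err_bit) == HYP_SOC_ERROR;
}
```

The two forms would only diverge if a mask covered more than one bit, in which
case '&' is the form that keeps working.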

Raag

> +			soc_hw_error_handler(tile, severity, error_id);
>  	}
>  
>  clear_reg:
> -- 
> 2.47.1
> 


* Re: [PATCH v3 4/4] drm/xe/xe_hw_error: Add support for PVC SOC errors
  2025-12-15 10:52   ` Raag Jadav
@ 2026-01-12  4:45     ` Riana Tauro
  2026-01-12 10:06       ` Raag Jadav
  0 siblings, 1 reply; 31+ messages in thread
From: Riana Tauro @ 2026-01-12  4:45 UTC (permalink / raw)
  To: Raag Jadav
  Cc: intel-xe, dri-devel, aravind.iddamsetty, anshuman.gupta,
	rodrigo.vivi, joonas.lahtinen, lukas, simona.vetter, airlied,
	pratik.bari, joshua.santosh.ranjan, ashwin.kumar.kulkarni,
	shubham.kumar, Himal Prasad Ghimiray

Hi Raag

On 12/15/2025 4:22 PM, Raag Jadav wrote:
> On Fri, Dec 05, 2025 at 02:09:36PM +0530, Riana Tauro wrote:
>> Report the SOC nonfatal/fatal hardware error and update the counters.
> 
> ...
> 
>> +#define SOC_PVC_BASE				0x282000
> 
> Curious, should we use 'master' naming for consistency with the code?

Okay will name it master

> 
>> +#define SOC_PVC_SLAVE_BASE			0x283000
>> +
>> +#define SOC_GCOERRSTS				0x200
>> +#define SOC_GNFERRSTS				0x210
>> +#define SOC_GLOBAL_ERR_STAT_REG(base, x)	XE_REG(_PICK_EVEN((x), \
>> +								  (base) + SOC_GCOERRSTS, \
>> +								  (base) + SOC_GNFERRSTS))
>> +#define   SOC_SLAVE_IEH				BIT(1)
>> +#define   SOC_IEH0_LOCAL_ERR_STATUS		BIT(0)
>> +#define   SOC_IEH1_LOCAL_ERR_STATUS		BIT(0)
> 
> What's the secret spacing convention in this file? Really, I couldn't
> figure out ;)
> 

This is a patchwork/diff issue. If you apply the patch, the spacing is fine.
Even checkpatch shows no error.

>> +#define SOC_GSYSEVTCTL				0x264
>> +#define SOC_GSYSEVTCTL_REG(base, slave_base, x)	XE_REG(_PICK_EVEN((x), \
>> +								  (base) + SOC_GSYSEVTCTL, \
>> +								  slave_base + SOC_GSYSEVTCTL))
> 
> Brace around slave_base for consistency.

okay
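
For context, checkpatch flags the same spot as MACRO_ARG_PRECEDENCE. A minimal
sketch of what can go wrong when a macro argument is not parenthesized, using
hypothetical macro names and the base/offset values from the patch:

```c
#include <stdint.h>

#define HYP_GSYSEVTCTL		0x264u

/* Argument used bare: the expansion is purely textual. */
#define HYP_REG_UNSAFE(b)	(b + HYP_GSYSEVTCTL)
/* Argument parenthesized: the caller's expression is evaluated first. */
#define HYP_REG_SAFE(b)		((b) + HYP_GSYSEVTCTL)

static uint32_t hyp_reg_unsafe(int is_slave)
{
	/* Expands to (is_slave ? 0x283000u : 0x282000u + 0x264u). */
	return HYP_REG_UNSAFE(is_slave ? 0x283000u : 0x282000u);
}

static uint32_t hyp_reg_safe(int is_slave)
{
	/* Expands to ((is_slave ? 0x283000u : 0x282000u) + 0x264u). */
	return HYP_REG_SAFE(is_slave ? 0x283000u : 0x282000u);
}
```

With the bare argument the offset is silently dropped on the slave path. The
current callers pass plain variables, so this is about robustness rather than
an existing bug.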

> 
>> +#define SOC_LERRUNCSTS				0x280
>> +#define SOC_LERRCORSTS				0x294
>> +#define SOC_LOCAL_ERR_STAT_REG(base, x)		XE_REG(x == DRM_XE_RAS_ERROR_CORRECTABLE ? \
> 
> In previous patch this was 'hw_err', so whichever one you use please make
> it consistent.
> 
>> +						      (base) + SOC_LERRCORSTS : \
>> +						      (base) + SOC_LERRUNCSTS)
>> +
>>   #endif
>> diff --git a/drivers/gpu/drm/xe/xe_hw_error.c b/drivers/gpu/drm/xe/xe_hw_error.c
>> index 77c90f1b06fd..1b7c782dbd98 100644
>> --- a/drivers/gpu/drm/xe/xe_hw_error.c
>> +++ b/drivers/gpu/drm/xe/xe_hw_error.c
>> @@ -18,6 +18,7 @@
>>   
>>   #define  HEC_UNCORR_FW_ERR_BITS 4
>>   #define XE_RAS_REG_SIZE 32
>> +#define XE_SOC_NUM_IEH 2
> 
> Alignment please! (including the values)

Will align the values. The macro names are already aligned.

> 
>>   extern struct fault_attr inject_csc_hw_error;
>>   static const char * const error_severity[] = DRM_XE_RAS_ERROR_SEVERITY_NAMES;
>> @@ -34,6 +35,7 @@ static const char * const hec_uncorrected_fw_errors[] = {
>>   
>>   static const unsigned long xe_hw_error_map[] = {
>>   	ERR_INDEX(XE_GT_ERROR, DRM_XE_RAS_ERROR_CORE_COMPUTE),
>> +	ERR_INDEX(XE_SOC_ERROR, DRM_XE_RAS_ERROR_SOC_INTERNAL),
>>   };
>>   
>>   enum gt_vector_regs {
>> @@ -48,6 +50,92 @@ enum gt_vector_regs {
>>   	ERR_STAT_GT_VECTOR_MAX,
>>   };
>>   
>> +static const char * const pvc_slave_local_fatal_err_reg[] = {
>> +	[0]		= "Local IEH internal: Malformed PCIe AER",
>> +	[1]		= "Local IEH internal: Malformed PCIe ERR",
>> +	[2]		= "Local IEH internal: UR conditions in IEH",
>> +	[3]		= "Local IEH internal: From SERR Sources",
> 
> Unless there's anything like 'IEH external', let's try to simplify a bit.
> 
>> +	[4 ... 19]	= "Undefined",
>> +	[20]		= "Malformed MCA error packet (HBM/Punit)",
>> +	[21 ... 31]	= "Undefined",
> 
> Nit: I'd align '=' in all the arrays here, but of course it's a personal
> preference.

will align

> 
>> +};
>> +
>> +static const char * const pvc_slave_global_err_reg[] = {
>> +	[0]         = "Undefined",
>> +	[1]         = "HBM SS2: Channel0",
>> +	[2]         = "HBM SS2: Channel1",
>> +	[3]         = "HBM SS2: Channel2",
>> +	[4]         = "HBM SS2: Channel3",
>> +	[5]         = "HBM SS2: Channel4",
>> +	[6]         = "HBM SS2: Channel5",
>> +	[7]         = "HBM SS2: Channel6",
>> +	[8]         = "HBM SS2: Channel7",
>> +	[9]         = "HBM SS3: Channel0",
>> +	[10]        = "HBM SS3: Channel1",
>> +	[11]        = "HBM SS3: Channel2",
>> +	[12]        = "HBM SS3: Channel3",
>> +	[13]        = "HBM SS3: Channel4",
>> +	[14]        = "HBM SS3: Channel5",
>> +	[15]        = "HBM SS3: Channel6",
>> +	[16]        = "HBM SS3: Channel7",
>> +	[17]        = "Undefined",
>> +	[18]        = "ANR MDFI",
>> +	[19 ... 31] = "Undefined",
>> +};
>> +
>> +static const char * const pvc_master_global_err_reg[] = {
>> +	[0 ... 1]   = "Undefined",
>> +	[2]         =  "HBM SS0: Channel0",
>> +	[3]         =  "HBM SS0: Channel1",
>> +	[4]         =  "HBM SS0: Channel2",
>> +	[5]         =  "HBM SS0: Channel3",
>> +	[6]         =  "HBM SS0: Channel4",
>> +	[7]         =  "HBM SS0: Channel5",
>> +	[8]         =  "HBM SS0: Channel6",
>> +	[9]         =  "HBM SS0: Channel7",
>> +	[10]        =  "HBM SS1: Channel0",
>> +	[11]        =  "HBM SS1: Channel1",
>> +	[12]        =  "HBM SS1: Channel2",
>> +	[13]        =  "HBM SS1: Channel3",
>> +	[14]        =  "HBM SS1: Channel4",
>> +	[15]        =  "HBM SS1: Channel5",
>> +	[16]        =  "HBM SS1: Channel6",
>> +	[17]        =  "HBM SS1: Channel7",
>> +	[18 ... 31] = "Undefined",
>> +};
> 
> I'd move this array above as per SS<N> ordering. Also, group together
> global and local ones.

okay
> 
>> +static const char * const pvc_master_local_fatal_err_reg[] = {
>> +	[0]         = "Local IEH internal: Malformed IOSF PCIe AER",
>> +	[1]         = "Local IEH internal: Malformed IOSF PCIe ERR",
>> +	[2]         = "Local IEH internal: IEH UR RESPONSE",
>> +	[3]         = "Local IEH internal: From SERR SPI controller",
>> +	[4]         = "Base Die MDFI T2T",
>> +	[5]         = "Undefined",
>> +	[6]         = "Base Die MDFI T2C",
>> +	[7]         = "Undefined",
>> +	[8]         = "Invalid CSC PSF Command Parity",
>> +	[9]         = "Invalid CSC PSF Unexpected Completion",
>> +	[10]        = "Invalid CSC PSF Unsupported Request",
>> +	[11]        = "Invalid PCIe PSF Command Parity",
>> +	[12]        = "PCIe PSF Unexpected Completion",
>> +	[13]        = "PCIe PSF Unsupported Request",
>> +	[14 ... 19] = "Undefined",
>> +	[20]        = "Malformed MCA error packet (HBM/Punit)",
>> +	[21 ... 31] = "Undefined",
>> +};
>> +
>> +static const char * const pvc_master_local_nonfatal_err_reg[] = {
>> +	[0 ... 3]   = "Undefined",
>> +	[4]         = "Base Die MDFI T2T",
>> +	[5]         = "Undefined",
>> +	[6]         = "Base Die MDFI T2C",
>> +	[7]         = "Undefined",
>> +	[8]         = "Invalid CSC PSF Command Parity",
>> +	[9]         = "Invalid CSC PSF Unexpected Completion",
>> +	[10]        = "Invalid PCIe PSF Command Parity",
>> +	[11 ... 31] = "Undefined",
>> +};
>> +
>>   static bool fault_inject_csc_hw_error(void)
>>   {
>>   	return IS_ENABLED(CONFIG_DEBUG_FS) && should_fail(&inject_csc_hw_error, 1);
>> @@ -197,6 +285,117 @@ static void gt_handle_errors(struct xe_tile *tile,
>>   	}
>>   }
>>   
>> +static void log_soc_error(struct xe_tile *tile, const char * const *reg_info,
>> +			  const enum drm_xe_ras_error_severity severity, u32 err_bit, u32 index)
>> +{
>> +	const char *severity_str = error_severity[severity];
>> +	struct xe_device *xe = tile_to_xe(tile);
>> +	struct xe_drm_ras *ras = &xe->ras;
>> +	struct xe_drm_ras_counter *info = ras->info[severity];
>> +	const char *name;
>> +
>> +	name = reg_info[err_bit];
>> +
>> +	if (strcmp(name, "Undefined") != 0) {
> 
> Do we need '!= 0'?

Can be removed

> 
>> +		if (severity == DRM_XE_RAS_ERROR_FATAL)
>> +			drm_err_ratelimited(&xe->drm, "%s SOC %s error detected",
> 
> Again, duplicate 'error'.
> 
>> +					    name, severity_str);
>> +		else
>> +			drm_warn(&xe->drm, "%s SOC %s error detected", name, severity_str);
> 
> Ditto.
> 
>> +		atomic64_inc(&info[index].counter);
>> +	}
>> +}
>> +
>> +static void soc_hw_error_handler(struct xe_tile *tile,
>> +				 const enum drm_xe_ras_error_severity severity, u32 error_id)
>> +{
>> +	struct xe_device *xe = tile_to_xe(tile);
>> +	struct xe_mmio *mmio = &tile->mmio;
>> +	unsigned long master_global_errstat, slave_global_errstat;
>> +	unsigned long master_local_errstat, slave_local_errstat;
>> +	u32 base, slave_base, regbit;
>> +	int i;
>> +
>> +	if (xe->info.platform != XE_PVC)
>> +		return;
>> +
>> +	base = SOC_PVC_BASE;
>> +	slave_base = SOC_PVC_SLAVE_BASE;
>> +
>> +	/*
>> +	 * Mask error type in GSYSEVTCTL so that no new errors of the type will be reported
>> +	 */
> 
> Can be one line.

will remove multiline

> 
>> +	for (i = 0; i < XE_SOC_NUM_IEH; i++)
>> +		xe_mmio_write32(mmio, SOC_GSYSEVTCTL_REG(base, slave_base, i), ~REG_BIT(severity));
>> +
>> +	if (severity == DRM_XE_RAS_ERROR_CORRECTABLE) {
>> +		xe_mmio_write32(mmio, SOC_GLOBAL_ERR_STAT_REG(base, severity), REG_GENMASK(31, 0));
>> +		xe_mmio_write32(mmio, SOC_LOCAL_ERR_STAT_REG(base, severity), REG_GENMASK(31, 0));
>> +		xe_mmio_write32(mmio, SOC_GLOBAL_ERR_STAT_REG(slave_base, severity),
>> +				REG_GENMASK(31, 0));
>> +		xe_mmio_write32(mmio, SOC_LOCAL_ERR_STAT_REG(slave_base, severity),
>> +				REG_GENMASK(31, 0));
>> +		goto unmask_gsysevtctl;
>> +	}
>> +
>> +	/*
>> +	 * Read the master global IEH error register if
>> +	 * BIT 1 is set then process the slave IEH first. If BIT 0 in
>> +	 * global error register is set then process the corresponding
>> +	 * Local error registers
>> +	 */
> 
> This can definitely be fewer lines.

okay. It'll still be 3 lines

> 
>> +	master_global_errstat = xe_mmio_read32(mmio, SOC_GLOBAL_ERR_STAT_REG(base, severity));
>> +	if (master_global_errstat & SOC_SLAVE_IEH) {
>> +		slave_global_errstat = xe_mmio_read32(mmio,
>> +						      SOC_GLOBAL_ERR_STAT_REG(slave_base, severity));
>> +		if (slave_global_errstat & SOC_IEH1_LOCAL_ERR_STATUS) {
>> +			slave_local_errstat = xe_mmio_read32(mmio,
>> +							     SOC_LOCAL_ERR_STAT_REG(slave_base,
>> +										    severity));
>> +
>> +			for_each_set_bit(regbit, &slave_local_errstat, XE_RAS_REG_SIZE) {
>> +				if (severity == DRM_XE_RAS_ERROR_FATAL)
> 
> Shouldn't this condition be outside the loop? Also, should we not log it
> after we clear the bits?

Yeah, the condition can be moved outside the loop.

But why should we log it after? Anyway, the rest of the registers need to be
cleared too in order to unmask.

> 
>> +					log_soc_error(tile, pvc_slave_local_fatal_err_reg, severity,
>> +						      regbit, error_id);
>> +			}
>> +
>> +			xe_mmio_write32(mmio, SOC_LOCAL_ERR_STAT_REG(slave_base, severity),
>> +					slave_local_errstat);
>> +		}
>> +
>> +		for_each_set_bit(regbit, &slave_global_errstat, XE_RAS_REG_SIZE)
>> +			log_soc_error(tile, pvc_slave_global_err_reg, severity, regbit, error_id);
> 
> Ditto.
> 
>> +
>> +		xe_mmio_write32(mmio, SOC_GLOBAL_ERR_STAT_REG(slave_base, severity),
>> +				slave_global_errstat);
>> +	}
>> +
>> +	if (master_global_errstat & SOC_IEH0_LOCAL_ERR_STATUS) {
>> +		master_local_errstat = xe_mmio_read32(mmio, SOC_LOCAL_ERR_STAT_REG(base, severity));
>> +
>> +		for_each_set_bit(regbit, &master_local_errstat, XE_RAS_REG_SIZE) {
>> +			if (severity == DRM_XE_RAS_ERROR_FATAL)
>> +				log_soc_error(tile, pvc_master_local_fatal_err_reg, severity,
>> +					      regbit, error_id);
>> +			if (severity == DRM_XE_RAS_ERROR_NONFATAL)
>> +				log_soc_error(tile, pvc_master_local_nonfatal_err_reg, severity,
>> +					      regbit, error_id);
> 
> These can be consolidated using temp variable. Also, log after clear.

will consolidate the arrays

Thanks
Riana

> 
>> +		}
>> +
>> +		xe_mmio_write32(mmio, SOC_LOCAL_ERR_STAT_REG(base, severity), master_local_errstat);
>> +	}
>> +
>> +	for_each_set_bit(regbit, &master_global_errstat, XE_RAS_REG_SIZE)
>> +		log_soc_error(tile, pvc_master_global_err_reg, severity, regbit, error_id);
> 
> Ditto.
> 
>> +
>> +	xe_mmio_write32(mmio, SOC_GLOBAL_ERR_STAT_REG(base, severity), master_global_errstat);
>> +
>> +unmask_gsysevtctl:
>> +	for (i = 0; i < XE_SOC_NUM_IEH; i++)
>> +		xe_mmio_write32(mmio, SOC_GSYSEVTCTL_REG(base, slave_base, i),
>> +				(DRM_XE_RAS_ERROR_SEVERITY_MAX << 1) + 1);
>> +}
>> +
>>   static void gt_hw_error_handler(struct xe_tile *tile,
>>   				const enum drm_xe_ras_error_severity severity, u32 error_id)
>>   {
>> @@ -269,6 +468,9 @@ static void hw_error_source_handler(struct xe_tile *tile, enum drm_xe_ras_error_
>>   		}
>>   		if (BIT(err_bit) & XE_GT_ERROR)
>>   			gt_hw_error_handler(tile, severity, error_id);
>> +
>> +		if (BIT(err_bit) == XE_SOC_ERROR)
> 
> Make this consistent with above.
> 
> Raag
> 
>> +			soc_hw_error_handler(tile, severity, error_id);
>>   	}
>>   
>>   clear_reg:
>> -- 
>> 2.47.1
>>



* Re: [PATCH v3 4/4] drm/xe/xe_hw_error: Add support for PVC SOC errors
  2026-01-12  4:45     ` Riana Tauro
@ 2026-01-12 10:06       ` Raag Jadav
  0 siblings, 0 replies; 31+ messages in thread
From: Raag Jadav @ 2026-01-12 10:06 UTC (permalink / raw)
  To: Riana Tauro
  Cc: intel-xe, dri-devel, aravind.iddamsetty, anshuman.gupta,
	rodrigo.vivi, joonas.lahtinen, lukas, simona.vetter, airlied,
	pratik.bari, joshua.santosh.ranjan, ashwin.kumar.kulkarni,
	shubham.kumar, Himal Prasad Ghimiray

On Mon, Jan 12, 2026 at 10:15:58AM +0530, Riana Tauro wrote:
> On 12/15/2025 4:22 PM, Raag Jadav wrote:
> > On Fri, Dec 05, 2025 at 02:09:36PM +0530, Riana Tauro wrote:
> > > Report the SOC nonfatal/fatal hardware error and update the counters.

...

> > > +	master_global_errstat = xe_mmio_read32(mmio, SOC_GLOBAL_ERR_STAT_REG(base, severity));
> > > +	if (master_global_errstat & SOC_SLAVE_IEH) {
> > > +		slave_global_errstat = xe_mmio_read32(mmio,
> > > +						      SOC_GLOBAL_ERR_STAT_REG(slave_base, severity));
> > > +		if (slave_global_errstat & SOC_IEH1_LOCAL_ERR_STATUS) {
> > > +			slave_local_errstat = xe_mmio_read32(mmio,
> > > +							     SOC_LOCAL_ERR_STAT_REG(slave_base,
> > > +										    severity));
> > > +
> > > +			for_each_set_bit(regbit, &slave_local_errstat, XE_RAS_REG_SIZE) {
> > > +				if (severity == DRM_XE_RAS_ERROR_FATAL)
> > 
> > Shouldn't this condition be outside the loop? Also, should we not log it
> > after we clear the bits?
> 
> Yeah, the condition can be moved outside the loop.
> 
> But why should we log it after? Anyway, the rest of the registers need to be
> cleared too in order to unmask.

Yes, doesn't make much functional difference but the rule of thumb is to

1. Execute
2. Log

so it's just an ordering change, but up to you.
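
A rough userspace sketch of that ordering, assuming the status register is
write-one-to-clear (the patch writing back the value it read suggests this,
but the thread does not state it). The register is mocked with a plain
variable and every name here is hypothetical.

```c
#include <stdint.h>
#include <stdio.h>

/* Hypothetical stand-in for a write-one-to-clear status register. */
static uint32_t hyp_errstat_reg;

static uint32_t hyp_read(void)
{
	return hyp_errstat_reg;
}

static void hyp_write1clear(uint32_t val)
{
	/* Writing 1 to a bit clears it; 0 leaves it untouched. */
	hyp_errstat_reg &= ~val;
}

/* Returns the number of error bits that were logged. */
static unsigned int hyp_handle(void)
{
	uint32_t snapshot = hyp_read();
	unsigned int bit, logged = 0;

	/* 1. Execute: acknowledge the hardware state first. */
	hyp_write1clear(snapshot);

	/* 2. Log: report from the snapshot taken before the clear. */
	for (bit = 0; bit < 32; bit++) {
		if (snapshot & (UINT32_C(1) << bit)) {
			printf("error bit %u set\n", bit);
			logged++;
		}
	}

	return logged;
}
```

Since the log works from the saved snapshot, nothing is lost by clearing
first, and the register is acknowledged before any time is spent printing.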

Raag


* ✗ CI.checkpatch: warning for Introduce DRM_RAS using generic netlink for RAS (rev3)
  2025-12-05  8:39 [PATCH v3 0/4] Introduce DRM_RAS using generic netlink for RAS Riana Tauro
                   ` (3 preceding siblings ...)
  2025-12-05  8:39 ` [PATCH v3 4/4] drm/xe/xe_hw_error: Add support for PVC SOC errors Riana Tauro
@ 2025-12-05  9:40 ` Patchwork
  2025-12-05  9:41 ` ✓ CI.KUnit: success " Patchwork
                   ` (3 subsequent siblings)
  8 siblings, 0 replies; 31+ messages in thread
From: Patchwork @ 2025-12-05  9:40 UTC (permalink / raw)
  To: Rodrigo Vivi; +Cc: intel-xe

== Series Details ==

Series: Introduce DRM_RAS using generic netlink for RAS (rev3)
URL   : https://patchwork.freedesktop.org/series/155188/
State : warning

== Summary ==

+ KERNEL=/kernel
+ git clone https://gitlab.freedesktop.org/drm/maintainer-tools mt
Cloning into 'mt'...
warning: redirecting to https://gitlab.freedesktop.org/drm/maintainer-tools.git/
+ git -C mt rev-list -n1 origin/master
2de9a3901bc28757c7906b454717b64e2a214021
+ cd /kernel
+ git config --global --add safe.directory /kernel
+ git log -n1
commit 2c6b954b514c7834bd06fc0e4660d864ccfe23e2
Author: Riana Tauro <riana.tauro@intel.com>
Date:   Fri Dec 5 14:09:36 2025 +0530

    drm/xe/xe_hw_error: Add support for PVC SOC errors
    
    Report the SOC nonfatal/fatal hardware error and update the counters.
    
    Co-developed-by: Himal Prasad Ghimiray <himal.prasad.ghimiray@intel.com>
    Signed-off-by: Himal Prasad Ghimiray <himal.prasad.ghimiray@intel.com>
    Signed-off-by: Riana Tauro <riana.tauro@intel.com>
+ /mt/dim checkpatch 0949b969da10a9fc389d255f398653eb9bc4ffaf drm-intel
2cb0846cd686 drm/ras: Introduce the DRM RAS infrastructure over generic netlink
-:58: WARNING:FILE_PATH_CHANGES: added, moved or deleted file(s), does MAINTAINERS need updating?
#58: 
new file mode 100644

-:806: WARNING:LONG_LINE: line length of 114 exceeds 100 columns
#806: FILE: drivers/gpu/drm/drm_ras_nl.c:13:
+static const struct nla_policy drm_ras_get_error_counters_nl_policy[DRM_RAS_A_ERROR_COUNTER_ATTRS_NODE_ID + 1] = {

-:811: WARNING:LONG_LINE: line length of 116 exceeds 100 columns
#811: FILE: drivers/gpu/drm/drm_ras_nl.c:18:
+static const struct nla_policy drm_ras_query_error_counter_nl_policy[DRM_RAS_A_ERROR_COUNTER_ATTRS_ERROR_ID + 1] = {

total: 0 errors, 3 warnings, 0 checks, 905 lines checked
8c989fcbf061 drm/xe/xe_drm_ras: Add support for drm ras
-:31: WARNING:COMMIT_LOG_LONG_LINE: Prefer a maximum 75 chars per line (possible unwrapped commit description?)
#31: 
$ sudo ynl --family drm_ras  --dump get-error-counters --json '{"node-id":1}'

-:77: WARNING:FILE_PATH_CHANGES: added, moved or deleted file(s), does MAINTAINERS need updating?
#77: 
new file mode 100644

total: 0 errors, 2 warnings, 0 checks, 502 lines checked
91eb06e3f6c8 drm/xe/xe_hw_error: Add support for GT hardware errors
-:73: CHECK:MACRO_ARG_PRECEDENCE: Macro argument 'hw_err' may be better as '(hw_err)' to avoid precedence issues
#73: FILE: drivers/gpu/drm/xe/regs/xe_hw_error_regs.h:60:
+#define ERR_STAT_GT_VECTOR_REG(hw_err, x)	(hw_err == DRM_XE_RAS_ERROR_CORRECTABLE ? \
+						 ERR_STAT_GT_COR_VECTOR_REG(x) : \
+						 ERR_STAT_GT_FATAL_VECTOR_REG(x))

-:73: CHECK:MACRO_ARG_REUSE: Macro argument reuse 'x' - possible side-effects?
#73: FILE: drivers/gpu/drm/xe/regs/xe_hw_error_regs.h:60:
+#define ERR_STAT_GT_VECTOR_REG(hw_err, x)	(hw_err == DRM_XE_RAS_ERROR_CORRECTABLE ? \
+						 ERR_STAT_GT_COR_VECTOR_REG(x) : \
+						 ERR_STAT_GT_FATAL_VECTOR_REG(x))

total: 0 errors, 0 warnings, 2 checks, 274 lines checked
2c6b954b514c drm/xe/xe_hw_error: Add support for PVC SOC errors
-:33: CHECK:MACRO_ARG_REUSE: Macro argument reuse 'base' - possible side-effects?
#33: FILE: drivers/gpu/drm/xe/regs/xe_hw_error_regs.h:70:
+#define SOC_GLOBAL_ERR_STAT_REG(base, x)	XE_REG(_PICK_EVEN((x), \
+								  (base) + SOC_GCOERRSTS, \
+								  (base) + SOC_GNFERRSTS))

-:41: CHECK:MACRO_ARG_PRECEDENCE: Macro argument 'slave_base' may be better as '(slave_base)' to avoid precedence issues
#41: FILE: drivers/gpu/drm/xe/regs/xe_hw_error_regs.h:78:
+#define SOC_GSYSEVTCTL_REG(base, slave_base, x)	XE_REG(_PICK_EVEN((x), \
+								  (base) + SOC_GSYSEVTCTL, \
+								  slave_base + SOC_GSYSEVTCTL))

-:47: CHECK:MACRO_ARG_REUSE: Macro argument reuse 'base' - possible side-effects?
#47: FILE: drivers/gpu/drm/xe/regs/xe_hw_error_regs.h:84:
+#define SOC_LOCAL_ERR_STAT_REG(base, x)		XE_REG(x == DRM_XE_RAS_ERROR_CORRECTABLE ? \
+						      (base) + SOC_LERRCORSTS : \
+						      (base) + SOC_LERRUNCSTS)

-:47: CHECK:MACRO_ARG_PRECEDENCE: Macro argument 'x' may be better as '(x)' to avoid precedence issues
#47: FILE: drivers/gpu/drm/xe/regs/xe_hw_error_regs.h:84:
+#define SOC_LOCAL_ERR_STAT_REG(base, x)		XE_REG(x == DRM_XE_RAS_ERROR_CORRECTABLE ? \
+						      (base) + SOC_LERRCORSTS : \
+						      (base) + SOC_LERRUNCSTS)

-:231: WARNING:LONG_LINE: line length of 101 exceeds 100 columns
#231: FILE: drivers/gpu/drm/xe/xe_hw_error.c:350:
+						      SOC_GLOBAL_ERR_STAT_REG(slave_base, severity));

total: 0 errors, 1 warnings, 4 checks, 266 lines checked




* ✓ CI.KUnit: success for Introduce DRM_RAS using generic netlink for RAS (rev3)
  2025-12-05  8:39 [PATCH v3 0/4] Introduce DRM_RAS using generic netlink for RAS Riana Tauro
                   ` (4 preceding siblings ...)
  2025-12-05  9:40 ` ✗ CI.checkpatch: warning for Introduce DRM_RAS using generic netlink for RAS (rev3) Patchwork
@ 2025-12-05  9:41 ` Patchwork
  2025-12-05  9:56 ` ✗ CI.checksparse: warning " Patchwork
                   ` (2 subsequent siblings)
  8 siblings, 0 replies; 31+ messages in thread
From: Patchwork @ 2025-12-05  9:41 UTC (permalink / raw)
  To: Rodrigo Vivi; +Cc: intel-xe

== Series Details ==

Series: Introduce DRM_RAS using generic netlink for RAS (rev3)
URL   : https://patchwork.freedesktop.org/series/155188/
State : success

== Summary ==

+ trap cleanup EXIT
+ /kernel/tools/testing/kunit/kunit.py run --kunitconfig /kernel/drivers/gpu/drm/xe/.kunitconfig
[09:40:29] Configuring KUnit Kernel ...
Generating .config ...
Populating config with:
$ make ARCH=um O=.kunit olddefconfig
[09:40:33] Building KUnit Kernel ...
Populating config with:
$ make ARCH=um O=.kunit olddefconfig
Building with:
$ make all compile_commands.json scripts_gdb ARCH=um O=.kunit --jobs=48
[09:41:04] Starting KUnit Kernel (1/1)...
[09:41:04] ============================================================
Running tests with:
$ .kunit/linux kunit.enable=1 mem=1G console=tty kunit_shutdown=halt
[09:41:04] ================== guc_buf (11 subtests) ===================
[09:41:04] [PASSED] test_smallest
[09:41:04] [PASSED] test_largest
[09:41:04] [PASSED] test_granular
[09:41:04] [PASSED] test_unique
[09:41:04] [PASSED] test_overlap
[09:41:04] [PASSED] test_reusable
[09:41:04] [PASSED] test_too_big
[09:41:04] [PASSED] test_flush
[09:41:04] [PASSED] test_lookup
[09:41:04] [PASSED] test_data
[09:41:04] [PASSED] test_class
[09:41:04] ===================== [PASSED] guc_buf =====================
[09:41:04] =================== guc_dbm (7 subtests) ===================
[09:41:04] [PASSED] test_empty
[09:41:04] [PASSED] test_default
[09:41:04] ======================== test_size  ========================
[09:41:04] [PASSED] 4
[09:41:04] [PASSED] 8
[09:41:04] [PASSED] 32
[09:41:04] [PASSED] 256
[09:41:04] ==================== [PASSED] test_size ====================
[09:41:04] ======================= test_reuse  ========================
[09:41:04] [PASSED] 4
[09:41:04] [PASSED] 8
[09:41:04] [PASSED] 32
[09:41:04] [PASSED] 256
[09:41:04] =================== [PASSED] test_reuse ====================
[09:41:04] =================== test_range_overlap  ====================
[09:41:04] [PASSED] 4
[09:41:04] [PASSED] 8
[09:41:04] [PASSED] 32
[09:41:04] [PASSED] 256
[09:41:04] =============== [PASSED] test_range_overlap ================
[09:41:04] =================== test_range_compact  ====================
[09:41:04] [PASSED] 4
[09:41:04] [PASSED] 8
[09:41:04] [PASSED] 32
[09:41:04] [PASSED] 256
[09:41:04] =============== [PASSED] test_range_compact ================
[09:41:04] ==================== test_range_spare  =====================
[09:41:04] [PASSED] 4
[09:41:04] [PASSED] 8
[09:41:04] [PASSED] 32
[09:41:04] [PASSED] 256
[09:41:04] ================ [PASSED] test_range_spare =================
[09:41:04] ===================== [PASSED] guc_dbm =====================
[09:41:04] =================== guc_idm (6 subtests) ===================
[09:41:04] [PASSED] bad_init
[09:41:04] [PASSED] no_init
[09:41:04] [PASSED] init_fini
[09:41:04] [PASSED] check_used
[09:41:04] [PASSED] check_quota
[09:41:04] [PASSED] check_all
[09:41:04] ===================== [PASSED] guc_idm =====================
[09:41:04] ================== no_relay (3 subtests) ===================
[09:41:04] [PASSED] xe_drops_guc2pf_if_not_ready
[09:41:04] [PASSED] xe_drops_guc2vf_if_not_ready
[09:41:04] [PASSED] xe_rejects_send_if_not_ready
[09:41:04] ==================== [PASSED] no_relay =====================
[09:41:04] ================== pf_relay (14 subtests) ==================
[09:41:04] [PASSED] pf_rejects_guc2pf_too_short
[09:41:04] [PASSED] pf_rejects_guc2pf_too_long
[09:41:04] [PASSED] pf_rejects_guc2pf_no_payload
[09:41:04] [PASSED] pf_fails_no_payload
[09:41:04] [PASSED] pf_fails_bad_origin
[09:41:04] [PASSED] pf_fails_bad_type
[09:41:04] [PASSED] pf_txn_reports_error
[09:41:04] [PASSED] pf_txn_sends_pf2guc
[09:41:04] [PASSED] pf_sends_pf2guc
[09:41:04] [SKIPPED] pf_loopback_nop
[09:41:04] [SKIPPED] pf_loopback_echo
[09:41:04] [SKIPPED] pf_loopback_fail
[09:41:04] [SKIPPED] pf_loopback_busy
[09:41:04] [SKIPPED] pf_loopback_retry
[09:41:04] ==================== [PASSED] pf_relay =====================
[09:41:04] ================== vf_relay (3 subtests) ===================
[09:41:04] [PASSED] vf_rejects_guc2vf_too_short
[09:41:04] [PASSED] vf_rejects_guc2vf_too_long
[09:41:04] [PASSED] vf_rejects_guc2vf_no_payload
[09:41:04] ==================== [PASSED] vf_relay =====================
[09:41:04] ================ pf_gt_config (6 subtests) =================
[09:41:04] [PASSED] fair_contexts_1vf
[09:41:04] [PASSED] fair_doorbells_1vf
[09:41:04] [PASSED] fair_ggtt_1vf
[09:41:04] ====================== fair_contexts  ======================
[09:41:04] [PASSED] 1 VF
[09:41:04] [PASSED] 2 VFs
[09:41:04] [PASSED] 3 VFs
[09:41:04] [PASSED] 4 VFs
[09:41:04] [PASSED] 5 VFs
[09:41:04] [PASSED] 6 VFs
[09:41:04] [PASSED] 7 VFs
[09:41:04] [PASSED] 8 VFs
[09:41:04] [PASSED] 9 VFs
[09:41:04] [PASSED] 10 VFs
[09:41:04] [PASSED] 11 VFs
[09:41:04] [PASSED] 12 VFs
[09:41:04] [PASSED] 13 VFs
[09:41:04] [PASSED] 14 VFs
[09:41:04] [PASSED] 15 VFs
[09:41:04] [PASSED] 16 VFs
[09:41:04] [PASSED] 17 VFs
[09:41:04] [PASSED] 18 VFs
[09:41:04] [PASSED] 19 VFs
[09:41:04] [PASSED] 20 VFs
[09:41:04] [PASSED] 21 VFs
[09:41:04] [PASSED] 22 VFs
[09:41:04] [PASSED] 23 VFs
[09:41:04] [PASSED] 24 VFs
[09:41:04] [PASSED] 25 VFs
[09:41:04] [PASSED] 26 VFs
[09:41:04] [PASSED] 27 VFs
[09:41:04] [PASSED] 28 VFs
[09:41:04] [PASSED] 29 VFs
[09:41:04] [PASSED] 30 VFs
[09:41:04] [PASSED] 31 VFs
[09:41:04] [PASSED] 32 VFs
[09:41:04] [PASSED] 33 VFs
[09:41:04] [PASSED] 34 VFs
[09:41:04] [PASSED] 35 VFs
[09:41:04] [PASSED] 36 VFs
[09:41:04] [PASSED] 37 VFs
[09:41:04] [PASSED] 38 VFs
[09:41:04] [PASSED] 39 VFs
[09:41:04] [PASSED] 40 VFs
[09:41:04] [PASSED] 41 VFs
[09:41:04] [PASSED] 42 VFs
[09:41:04] [PASSED] 43 VFs
[09:41:04] [PASSED] 44 VFs
[09:41:04] [PASSED] 45 VFs
[09:41:04] [PASSED] 46 VFs
[09:41:04] [PASSED] 47 VFs
[09:41:04] [PASSED] 48 VFs
[09:41:04] [PASSED] 49 VFs
[09:41:04] [PASSED] 50 VFs
[09:41:04] [PASSED] 51 VFs
[09:41:04] [PASSED] 52 VFs
[09:41:04] [PASSED] 53 VFs
[09:41:04] [PASSED] 54 VFs
[09:41:04] [PASSED] 55 VFs
[09:41:04] [PASSED] 56 VFs
[09:41:04] [PASSED] 57 VFs
[09:41:04] [PASSED] 58 VFs
[09:41:04] [PASSED] 59 VFs
[09:41:04] [PASSED] 60 VFs
[09:41:04] [PASSED] 61 VFs
[09:41:04] [PASSED] 62 VFs
[09:41:04] [PASSED] 63 VFs
[09:41:04] ================== [PASSED] fair_contexts ==================
[09:41:04] ===================== fair_doorbells  ======================
[09:41:04] [PASSED] 1 VF
[09:41:04] [PASSED] 2 VFs
[09:41:04] [PASSED] 3 VFs
[09:41:04] [PASSED] 4 VFs
[09:41:04] [PASSED] 5 VFs
[09:41:04] [PASSED] 6 VFs
[09:41:04] [PASSED] 7 VFs
[09:41:04] [PASSED] 8 VFs
[09:41:04] [PASSED] 9 VFs
[09:41:04] [PASSED] 10 VFs
[09:41:04] [PASSED] 11 VFs
[09:41:04] [PASSED] 12 VFs
[09:41:04] [PASSED] 13 VFs
[09:41:04] [PASSED] 14 VFs
[09:41:04] [PASSED] 15 VFs
[09:41:04] [PASSED] 16 VFs
[09:41:04] [PASSED] 17 VFs
[09:41:04] [PASSED] 18 VFs
[09:41:04] [PASSED] 19 VFs
[09:41:04] [PASSED] 20 VFs
[09:41:04] [PASSED] 21 VFs
[09:41:04] [PASSED] 22 VFs
[09:41:04] [PASSED] 23 VFs
[09:41:04] [PASSED] 24 VFs
[09:41:04] [PASSED] 25 VFs
[09:41:04] [PASSED] 26 VFs
[09:41:04] [PASSED] 27 VFs
[09:41:04] [PASSED] 28 VFs
[09:41:04] [PASSED] 29 VFs
[09:41:04] [PASSED] 30 VFs
[09:41:04] [PASSED] 31 VFs
[09:41:04] [PASSED] 32 VFs
[09:41:04] [PASSED] 33 VFs
[09:41:04] [PASSED] 34 VFs
[09:41:04] [PASSED] 35 VFs
[09:41:04] [PASSED] 36 VFs
[09:41:04] [PASSED] 37 VFs
[09:41:04] [PASSED] 38 VFs
[09:41:04] [PASSED] 39 VFs
[09:41:04] [PASSED] 40 VFs
[09:41:04] [PASSED] 41 VFs
[09:41:04] [PASSED] 42 VFs
[09:41:04] [PASSED] 43 VFs
[09:41:04] [PASSED] 44 VFs
[09:41:04] [PASSED] 45 VFs
[09:41:04] [PASSED] 46 VFs
[09:41:04] [PASSED] 47 VFs
[09:41:04] [PASSED] 48 VFs
[09:41:04] [PASSED] 49 VFs
[09:41:04] [PASSED] 50 VFs
[09:41:04] [PASSED] 51 VFs
[09:41:04] [PASSED] 52 VFs
[09:41:04] [PASSED] 53 VFs
[09:41:04] [PASSED] 54 VFs
[09:41:04] [PASSED] 55 VFs
[09:41:04] [PASSED] 56 VFs
[09:41:04] [PASSED] 57 VFs
[09:41:04] [PASSED] 58 VFs
[09:41:04] [PASSED] 59 VFs
[09:41:04] [PASSED] 60 VFs
[09:41:04] [PASSED] 61 VFs
[09:41:04] [PASSED] 62 VFs
[09:41:04] [PASSED] 63 VFs
[09:41:04] ================= [PASSED] fair_doorbells ==================
[09:41:04] ======================== fair_ggtt  ========================
[09:41:04] [PASSED] 1 VF
[09:41:04] [PASSED] 2 VFs
[09:41:04] [PASSED] 3 VFs
[09:41:04] [PASSED] 4 VFs
[09:41:04] [PASSED] 5 VFs
[09:41:04] [PASSED] 6 VFs
[09:41:04] [PASSED] 7 VFs
[09:41:04] [PASSED] 8 VFs
[09:41:04] [PASSED] 9 VFs
[09:41:04] [PASSED] 10 VFs
[09:41:04] [PASSED] 11 VFs
[09:41:04] [PASSED] 12 VFs
[09:41:04] [PASSED] 13 VFs
[09:41:04] [PASSED] 14 VFs
[09:41:04] [PASSED] 15 VFs
[09:41:04] [PASSED] 16 VFs
[09:41:04] [PASSED] 17 VFs
[09:41:04] [PASSED] 18 VFs
[09:41:04] [PASSED] 19 VFs
[09:41:04] [PASSED] 20 VFs
[09:41:04] [PASSED] 21 VFs
[09:41:04] [PASSED] 22 VFs
[09:41:04] [PASSED] 23 VFs
[09:41:04] [PASSED] 24 VFs
[09:41:04] [PASSED] 25 VFs
[09:41:04] [PASSED] 26 VFs
[09:41:04] [PASSED] 27 VFs
[09:41:04] [PASSED] 28 VFs
[09:41:04] [PASSED] 29 VFs
[09:41:04] [PASSED] 30 VFs
[09:41:04] [PASSED] 31 VFs
[09:41:04] [PASSED] 32 VFs
[09:41:04] [PASSED] 33 VFs
[09:41:04] [PASSED] 34 VFs
[09:41:04] [PASSED] 35 VFs
[09:41:04] [PASSED] 36 VFs
[09:41:04] [PASSED] 37 VFs
[09:41:04] [PASSED] 38 VFs
[09:41:04] [PASSED] 39 VFs
[09:41:04] [PASSED] 40 VFs
[09:41:04] [PASSED] 41 VFs
[09:41:04] [PASSED] 42 VFs
[09:41:04] [PASSED] 43 VFs
[09:41:04] [PASSED] 44 VFs
[09:41:04] [PASSED] 45 VFs
[09:41:04] [PASSED] 46 VFs
[09:41:04] [PASSED] 47 VFs
[09:41:04] [PASSED] 48 VFs
[09:41:04] [PASSED] 49 VFs
[09:41:04] [PASSED] 50 VFs
[09:41:04] [PASSED] 51 VFs
[09:41:04] [PASSED] 52 VFs
[09:41:04] [PASSED] 53 VFs
[09:41:04] [PASSED] 54 VFs
[09:41:04] [PASSED] 55 VFs
[09:41:04] [PASSED] 56 VFs
[09:41:04] [PASSED] 57 VFs
[09:41:04] [PASSED] 58 VFs
[09:41:04] [PASSED] 59 VFs
[09:41:04] [PASSED] 60 VFs
[09:41:04] [PASSED] 61 VFs
[09:41:04] [PASSED] 62 VFs
[09:41:04] [PASSED] 63 VFs
[09:41:04] ==================== [PASSED] fair_ggtt ====================
[09:41:04] ================== [PASSED] pf_gt_config ===================
[09:41:04] ===================== lmtt (1 subtest) =====================
[09:41:04] ======================== test_ops  =========================
[09:41:04] [PASSED] 2-level
[09:41:04] [PASSED] multi-level
[09:41:04] ==================== [PASSED] test_ops =====================
[09:41:04] ====================== [PASSED] lmtt =======================
[09:41:04] ================= pf_service (11 subtests) =================
[09:41:04] [PASSED] pf_negotiate_any
[09:41:04] [PASSED] pf_negotiate_base_match
[09:41:04] [PASSED] pf_negotiate_base_newer
[09:41:04] [PASSED] pf_negotiate_base_next
[09:41:04] [SKIPPED] pf_negotiate_base_older
[09:41:04] [PASSED] pf_negotiate_base_prev
[09:41:04] [PASSED] pf_negotiate_latest_match
[09:41:04] [PASSED] pf_negotiate_latest_newer
[09:41:04] [PASSED] pf_negotiate_latest_next
[09:41:04] [SKIPPED] pf_negotiate_latest_older
[09:41:04] [SKIPPED] pf_negotiate_latest_prev
[09:41:04] =================== [PASSED] pf_service ====================
[09:41:04] ================= xe_guc_g2g (2 subtests) ==================
[09:41:04] ============== xe_live_guc_g2g_kunit_default  ==============
[09:41:04] ========= [SKIPPED] xe_live_guc_g2g_kunit_default ==========
[09:41:04] ============== xe_live_guc_g2g_kunit_allmem  ===============
[09:41:04] ========== [SKIPPED] xe_live_guc_g2g_kunit_allmem ==========
[09:41:04] =================== [SKIPPED] xe_guc_g2g ===================
[09:41:04] =================== xe_mocs (2 subtests) ===================
[09:41:04] ================ xe_live_mocs_kernel_kunit  ================
[09:41:04] =========== [SKIPPED] xe_live_mocs_kernel_kunit ============
[09:41:04] ================ xe_live_mocs_reset_kunit  =================
[09:41:04] ============ [SKIPPED] xe_live_mocs_reset_kunit ============
[09:41:04] ==================== [SKIPPED] xe_mocs =====================
[09:41:04] ================= xe_migrate (2 subtests) ==================
[09:41:04] ================= xe_migrate_sanity_kunit  =================
[09:41:04] ============ [SKIPPED] xe_migrate_sanity_kunit =============
[09:41:04] ================== xe_validate_ccs_kunit  ==================
[09:41:04] ============= [SKIPPED] xe_validate_ccs_kunit ==============
[09:41:04] =================== [SKIPPED] xe_migrate ===================
[09:41:04] ================== xe_dma_buf (1 subtest) ==================
[09:41:04] ==================== xe_dma_buf_kunit  =====================
[09:41:04] ================ [SKIPPED] xe_dma_buf_kunit ================
[09:41:04] =================== [SKIPPED] xe_dma_buf ===================
[09:41:04] ================= xe_bo_shrink (1 subtest) =================
[09:41:04] =================== xe_bo_shrink_kunit  ====================
[09:41:04] =============== [SKIPPED] xe_bo_shrink_kunit ===============
[09:41:04] ================== [SKIPPED] xe_bo_shrink ==================
[09:41:04] ==================== xe_bo (2 subtests) ====================
[09:41:04] ================== xe_ccs_migrate_kunit  ===================
[09:41:04] ============== [SKIPPED] xe_ccs_migrate_kunit ==============
[09:41:04] ==================== xe_bo_evict_kunit  ====================
[09:41:04] =============== [SKIPPED] xe_bo_evict_kunit ================
[09:41:04] ===================== [SKIPPED] xe_bo ======================
[09:41:04] ==================== args (11 subtests) ====================
[09:41:04] [PASSED] count_args_test
[09:41:04] [PASSED] call_args_example
[09:41:04] [PASSED] call_args_test
[09:41:04] [PASSED] drop_first_arg_example
[09:41:04] [PASSED] drop_first_arg_test
[09:41:04] [PASSED] first_arg_example
[09:41:04] [PASSED] first_arg_test
[09:41:04] [PASSED] last_arg_example
[09:41:04] [PASSED] last_arg_test
[09:41:04] [PASSED] pick_arg_example
[09:41:04] [PASSED] sep_comma_example
[09:41:04] ====================== [PASSED] args =======================
[09:41:04] =================== xe_pci (3 subtests) ====================
[09:41:04] ==================== check_graphics_ip  ====================
[09:41:04] [PASSED] 12.00 Xe_LP
[09:41:04] [PASSED] 12.10 Xe_LP+
[09:41:04] [PASSED] 12.55 Xe_HPG
[09:41:04] [PASSED] 12.60 Xe_HPC
[09:41:04] [PASSED] 12.70 Xe_LPG
[09:41:04] [PASSED] 12.71 Xe_LPG
[09:41:04] [PASSED] 12.74 Xe_LPG+
[09:41:04] [PASSED] 20.01 Xe2_HPG
[09:41:04] [PASSED] 20.02 Xe2_HPG
[09:41:04] [PASSED] 20.04 Xe2_LPG
[09:41:04] [PASSED] 30.00 Xe3_LPG
[09:41:04] [PASSED] 30.01 Xe3_LPG
[09:41:04] [PASSED] 30.03 Xe3_LPG
[09:41:04] [PASSED] 30.04 Xe3_LPG
[09:41:04] [PASSED] 30.05 Xe3_LPG
[09:41:04] [PASSED] 35.11 Xe3p_XPC
[09:41:04] ================ [PASSED] check_graphics_ip ================
[09:41:04] ===================== check_media_ip  ======================
[09:41:04] [PASSED] 12.00 Xe_M
[09:41:04] [PASSED] 12.55 Xe_HPM
[09:41:04] [PASSED] 13.00 Xe_LPM+
[09:41:04] [PASSED] 13.01 Xe2_HPM
[09:41:04] [PASSED] 20.00 Xe2_LPM
[09:41:04] [PASSED] 30.00 Xe3_LPM
[09:41:04] [PASSED] 30.02 Xe3_LPM
[09:41:04] [PASSED] 35.00 Xe3p_LPM
[09:41:04] [PASSED] 35.03 Xe3p_HPM
[09:41:04] ================= [PASSED] check_media_ip ==================
[09:41:04] =================== check_platform_desc  ===================
[09:41:04] [PASSED] 0x9A60 (TIGERLAKE)
[09:41:04] [PASSED] 0x9A68 (TIGERLAKE)
[09:41:04] [PASSED] 0x9A70 (TIGERLAKE)
[09:41:04] [PASSED] 0x9A40 (TIGERLAKE)
[09:41:04] [PASSED] 0x9A49 (TIGERLAKE)
[09:41:04] [PASSED] 0x9A59 (TIGERLAKE)
[09:41:04] [PASSED] 0x9A78 (TIGERLAKE)
[09:41:04] [PASSED] 0x9AC0 (TIGERLAKE)
[09:41:04] [PASSED] 0x9AC9 (TIGERLAKE)
[09:41:04] [PASSED] 0x9AD9 (TIGERLAKE)
[09:41:04] [PASSED] 0x9AF8 (TIGERLAKE)
[09:41:04] [PASSED] 0x4C80 (ROCKETLAKE)
[09:41:04] [PASSED] 0x4C8A (ROCKETLAKE)
[09:41:04] [PASSED] 0x4C8B (ROCKETLAKE)
[09:41:04] [PASSED] 0x4C8C (ROCKETLAKE)
[09:41:04] [PASSED] 0x4C90 (ROCKETLAKE)
[09:41:04] [PASSED] 0x4C9A (ROCKETLAKE)
[09:41:04] [PASSED] 0x4680 (ALDERLAKE_S)
[09:41:04] [PASSED] 0x4682 (ALDERLAKE_S)
[09:41:04] [PASSED] 0x4688 (ALDERLAKE_S)
[09:41:04] [PASSED] 0x468A (ALDERLAKE_S)
[09:41:04] [PASSED] 0x468B (ALDERLAKE_S)
[09:41:04] [PASSED] 0x4690 (ALDERLAKE_S)
[09:41:04] [PASSED] 0x4692 (ALDERLAKE_S)
[09:41:04] [PASSED] 0x4693 (ALDERLAKE_S)
[09:41:04] [PASSED] 0x46A0 (ALDERLAKE_P)
[09:41:04] [PASSED] 0x46A1 (ALDERLAKE_P)
[09:41:04] [PASSED] 0x46A2 (ALDERLAKE_P)
[09:41:04] [PASSED] 0x46A3 (ALDERLAKE_P)
[09:41:04] [PASSED] 0x46A6 (ALDERLAKE_P)
[09:41:04] [PASSED] 0x46A8 (ALDERLAKE_P)
[09:41:04] [PASSED] 0x46AA (ALDERLAKE_P)
[09:41:04] [PASSED] 0x462A (ALDERLAKE_P)
[09:41:04] [PASSED] 0x4626 (ALDERLAKE_P)
[09:41:04] [PASSED] 0x4628 (ALDERLAKE_P)
[09:41:04] [PASSED] 0x46B0 (ALDERLAKE_P)
[09:41:04] [PASSED] 0x46B1 (ALDERLAKE_P)
[09:41:04] [PASSED] 0x46B2 (ALDERLAKE_P)
[09:41:04] [PASSED] 0x46B3 (ALDERLAKE_P)
[09:41:04] [PASSED] 0x46C0 (ALDERLAKE_P)
[09:41:04] [PASSED] 0x46C1 (ALDERLAKE_P)
[09:41:04] [PASSED] 0x46C2 (ALDERLAKE_P)
[09:41:04] [PASSED] 0x46C3 (ALDERLAKE_P)
[09:41:04] [PASSED] 0x46D0 (ALDERLAKE_N)
[09:41:04] [PASSED] 0x46D1 (ALDERLAKE_N)
[09:41:04] [PASSED] 0x46D2 (ALDERLAKE_N)
[09:41:04] [PASSED] 0x46D3 (ALDERLAKE_N)
[09:41:04] [PASSED] 0x46D4 (ALDERLAKE_N)
[09:41:04] [PASSED] 0xA721 (ALDERLAKE_P)
[09:41:04] [PASSED] 0xA7A1 (ALDERLAKE_P)
[09:41:04] [PASSED] 0xA7A9 (ALDERLAKE_P)
[09:41:04] [PASSED] 0xA7AC (ALDERLAKE_P)
[09:41:04] [PASSED] 0xA7AD (ALDERLAKE_P)
[09:41:04] [PASSED] 0xA720 (ALDERLAKE_P)
[09:41:04] [PASSED] 0xA7A0 (ALDERLAKE_P)
[09:41:04] [PASSED] 0xA7A8 (ALDERLAKE_P)
[09:41:04] [PASSED] 0xA7AA (ALDERLAKE_P)
[09:41:04] [PASSED] 0xA7AB (ALDERLAKE_P)
[09:41:04] [PASSED] 0xA780 (ALDERLAKE_S)
[09:41:04] [PASSED] 0xA781 (ALDERLAKE_S)
[09:41:04] [PASSED] 0xA782 (ALDERLAKE_S)
[09:41:04] [PASSED] 0xA783 (ALDERLAKE_S)
[09:41:04] [PASSED] 0xA788 (ALDERLAKE_S)
[09:41:04] [PASSED] 0xA789 (ALDERLAKE_S)
[09:41:04] [PASSED] 0xA78A (ALDERLAKE_S)
[09:41:04] [PASSED] 0xA78B (ALDERLAKE_S)
[09:41:04] [PASSED] 0x4905 (DG1)
[09:41:04] [PASSED] 0x4906 (DG1)
[09:41:04] [PASSED] 0x4907 (DG1)
[09:41:04] [PASSED] 0x4908 (DG1)
[09:41:04] [PASSED] 0x4909 (DG1)
[09:41:04] [PASSED] 0x56C0 (DG2)
[09:41:04] [PASSED] 0x56C2 (DG2)
[09:41:04] [PASSED] 0x56C1 (DG2)
[09:41:04] [PASSED] 0x7D51 (METEORLAKE)
[09:41:04] [PASSED] 0x7DD1 (METEORLAKE)
[09:41:04] [PASSED] 0x7D41 (METEORLAKE)
[09:41:04] [PASSED] 0x7D67 (METEORLAKE)
[09:41:04] [PASSED] 0xB640 (METEORLAKE)
[09:41:04] [PASSED] 0x56A0 (DG2)
[09:41:04] [PASSED] 0x56A1 (DG2)
[09:41:04] [PASSED] 0x56A2 (DG2)
[09:41:04] [PASSED] 0x56BE (DG2)
[09:41:04] [PASSED] 0x56BF (DG2)
[09:41:04] [PASSED] 0x5690 (DG2)
[09:41:04] [PASSED] 0x5691 (DG2)
[09:41:04] [PASSED] 0x5692 (DG2)
[09:41:04] [PASSED] 0x56A5 (DG2)
[09:41:04] [PASSED] 0x56A6 (DG2)
[09:41:04] [PASSED] 0x56B0 (DG2)
[09:41:04] [PASSED] 0x56B1 (DG2)
[09:41:04] [PASSED] 0x56BA (DG2)
[09:41:04] [PASSED] 0x56BB (DG2)
[09:41:04] [PASSED] 0x56BC (DG2)
[09:41:04] [PASSED] 0x56BD (DG2)
[09:41:04] [PASSED] 0x5693 (DG2)
[09:41:04] [PASSED] 0x5694 (DG2)
[09:41:04] [PASSED] 0x5695 (DG2)
[09:41:04] [PASSED] 0x56A3 (DG2)
[09:41:04] [PASSED] 0x56A4 (DG2)
[09:41:04] [PASSED] 0x56B2 (DG2)
[09:41:04] [PASSED] 0x56B3 (DG2)
[09:41:04] [PASSED] 0x5696 (DG2)
[09:41:04] [PASSED] 0x5697 (DG2)
[09:41:04] [PASSED] 0xB69 (PVC)
[09:41:04] [PASSED] 0xB6E (PVC)
[09:41:04] [PASSED] 0xBD4 (PVC)
[09:41:04] [PASSED] 0xBD5 (PVC)
[09:41:04] [PASSED] 0xBD6 (PVC)
[09:41:04] [PASSED] 0xBD7 (PVC)
[09:41:04] [PASSED] 0xBD8 (PVC)
[09:41:04] [PASSED] 0xBD9 (PVC)
[09:41:04] [PASSED] 0xBDA (PVC)
[09:41:04] [PASSED] 0xBDB (PVC)
[09:41:04] [PASSED] 0xBE0 (PVC)
[09:41:04] [PASSED] 0xBE1 (PVC)
[09:41:04] [PASSED] 0xBE5 (PVC)
[09:41:04] [PASSED] 0x7D40 (METEORLAKE)
[09:41:04] [PASSED] 0x7D45 (METEORLAKE)
[09:41:04] [PASSED] 0x7D55 (METEORLAKE)
[09:41:04] [PASSED] 0x7D60 (METEORLAKE)
[09:41:04] [PASSED] 0x7DD5 (METEORLAKE)
[09:41:04] [PASSED] 0x6420 (LUNARLAKE)
[09:41:04] [PASSED] 0x64A0 (LUNARLAKE)
[09:41:04] [PASSED] 0x64B0 (LUNARLAKE)
[09:41:04] [PASSED] 0xE202 (BATTLEMAGE)
[09:41:04] [PASSED] 0xE209 (BATTLEMAGE)
[09:41:04] [PASSED] 0xE20B (BATTLEMAGE)
[09:41:04] [PASSED] 0xE20C (BATTLEMAGE)
[09:41:04] [PASSED] 0xE20D (BATTLEMAGE)
[09:41:04] [PASSED] 0xE210 (BATTLEMAGE)
[09:41:04] [PASSED] 0xE211 (BATTLEMAGE)
[09:41:04] [PASSED] 0xE212 (BATTLEMAGE)
[09:41:04] [PASSED] 0xE216 (BATTLEMAGE)
[09:41:04] [PASSED] 0xE220 (BATTLEMAGE)
[09:41:04] [PASSED] 0xE221 (BATTLEMAGE)
[09:41:04] [PASSED] 0xE222 (BATTLEMAGE)
[09:41:04] [PASSED] 0xE223 (BATTLEMAGE)
[09:41:04] [PASSED] 0xB080 (PANTHERLAKE)
[09:41:04] [PASSED] 0xB081 (PANTHERLAKE)
[09:41:04] [PASSED] 0xB082 (PANTHERLAKE)
[09:41:04] [PASSED] 0xB083 (PANTHERLAKE)
[09:41:04] [PASSED] 0xB084 (PANTHERLAKE)
[09:41:04] [PASSED] 0xB085 (PANTHERLAKE)
[09:41:04] [PASSED] 0xB086 (PANTHERLAKE)
[09:41:04] [PASSED] 0xB087 (PANTHERLAKE)
[09:41:04] [PASSED] 0xB08F (PANTHERLAKE)
[09:41:04] [PASSED] 0xB090 (PANTHERLAKE)
[09:41:04] [PASSED] 0xB0A0 (PANTHERLAKE)
[09:41:04] [PASSED] 0xB0B0 (PANTHERLAKE)
[09:41:04] [PASSED] 0xD740 (NOVALAKE_S)
[09:41:04] [PASSED] 0xD741 (NOVALAKE_S)
[09:41:04] [PASSED] 0xD742 (NOVALAKE_S)
[09:41:04] [PASSED] 0xD743 (NOVALAKE_S)
[09:41:04] [PASSED] 0xD744 (NOVALAKE_S)
[09:41:04] [PASSED] 0xD745 (NOVALAKE_S)
[09:41:04] [PASSED] 0x674C (CRESCENTISLAND)
[09:41:04] [PASSED] 0xFD80 (PANTHERLAKE)
[09:41:04] [PASSED] 0xFD81 (PANTHERLAKE)
[09:41:04] =============== [PASSED] check_platform_desc ===============
[09:41:04] ===================== [PASSED] xe_pci ======================
[09:41:04] =================== xe_rtp (2 subtests) ====================
[09:41:04] =============== xe_rtp_process_to_sr_tests  ================
[09:41:04] [PASSED] coalesce-same-reg
[09:41:04] [PASSED] no-match-no-add
[09:41:04] [PASSED] match-or
[09:41:04] [PASSED] match-or-xfail
[09:41:04] [PASSED] no-match-no-add-multiple-rules
[09:41:04] [PASSED] two-regs-two-entries
[09:41:04] [PASSED] clr-one-set-other
[09:41:04] [PASSED] set-field
[09:41:04] [PASSED] conflict-duplicate
[09:41:04] [PASSED] conflict-not-disjoint
[09:41:04] [PASSED] conflict-reg-type
[09:41:04] =========== [PASSED] xe_rtp_process_to_sr_tests ============
[09:41:04] ================== xe_rtp_process_tests  ===================
[09:41:04] [PASSED] active1
[09:41:04] [PASSED] active2
[09:41:04] [PASSED] active-inactive
[09:41:04] [PASSED] inactive-active
[09:41:04] [PASSED] inactive-1st_or_active-inactive
[09:41:04] [PASSED] inactive-2nd_or_active-inactive
[09:41:04] [PASSED] inactive-last_or_active-inactive
[09:41:04] [PASSED] inactive-no_or_active-inactive
[09:41:04] ============== [PASSED] xe_rtp_process_tests ===============
[09:41:04] ===================== [PASSED] xe_rtp ======================
[09:41:04] ==================== xe_wa (1 subtest) =====================
[09:41:04] ======================== xe_wa_gt  =========================
[09:41:04] [PASSED] TIGERLAKE B0
[09:41:04] [PASSED] DG1 A0
[09:41:04] [PASSED] DG1 B0
[09:41:04] [PASSED] ALDERLAKE_S A0
[09:41:04] [PASSED] ALDERLAKE_S B0
[09:41:04] [PASSED] ALDERLAKE_S C0
[09:41:04] [PASSED] ALDERLAKE_S D0
[09:41:04] [PASSED] ALDERLAKE_P A0
[09:41:04] [PASSED] ALDERLAKE_P B0
[09:41:04] [PASSED] ALDERLAKE_P C0
[09:41:04] [PASSED] ALDERLAKE_S RPLS D0
[09:41:04] [PASSED] ALDERLAKE_P RPLU E0
[09:41:04] [PASSED] DG2 G10 C0
[09:41:04] [PASSED] DG2 G11 B1
[09:41:04] [PASSED] DG2 G12 A1
[09:41:04] [PASSED] METEORLAKE 12.70(Xe_LPG) A0 13.00(Xe_LPM+) A0
[09:41:04] [PASSED] METEORLAKE 12.71(Xe_LPG) A0 13.00(Xe_LPM+) A0
[09:41:04] [PASSED] METEORLAKE 12.74(Xe_LPG+) A0 13.00(Xe_LPM+) A0
[09:41:04] [PASSED] LUNARLAKE 20.04(Xe2_LPG) A0 20.00(Xe2_LPM) A0
[09:41:04] [PASSED] LUNARLAKE 20.04(Xe2_LPG) B0 20.00(Xe2_LPM) A0
[09:41:04] [PASSED] BATTLEMAGE 20.01(Xe2_HPG) A0 13.01(Xe2_HPM) A1
[09:41:04] [PASSED] PANTHERLAKE 30.00(Xe3_LPG) A0 30.00(Xe3_LPM) A0
[09:41:04] ==================== [PASSED] xe_wa_gt =====================
[09:41:04] ====================== [PASSED] xe_wa ======================
[09:41:04] ============================================================
[09:41:04] Testing complete. Ran 510 tests: passed: 492, skipped: 18
[09:41:04] Elapsed time: 35.326s total, 4.146s configuring, 30.664s building, 0.465s running

+ /kernel/tools/testing/kunit/kunit.py run --kunitconfig /kernel/drivers/gpu/drm/tests/.kunitconfig
[09:41:04] Configuring KUnit Kernel ...
Regenerating .config ...
Populating config with:
$ make ARCH=um O=.kunit olddefconfig
[09:41:06] Building KUnit Kernel ...
Populating config with:
$ make ARCH=um O=.kunit olddefconfig
Building with:
$ make all compile_commands.json scripts_gdb ARCH=um O=.kunit --jobs=48
[09:41:31] Starting KUnit Kernel (1/1)...
[09:41:31] ============================================================
Running tests with:
$ .kunit/linux kunit.enable=1 mem=1G console=tty kunit_shutdown=halt
[09:41:31] ============ drm_test_pick_cmdline (2 subtests) ============
[09:41:31] [PASSED] drm_test_pick_cmdline_res_1920_1080_60
[09:41:31] =============== drm_test_pick_cmdline_named  ===============
[09:41:31] [PASSED] NTSC
[09:41:31] [PASSED] NTSC-J
[09:41:31] [PASSED] PAL
[09:41:31] [PASSED] PAL-M
[09:41:31] =========== [PASSED] drm_test_pick_cmdline_named ===========
[09:41:31] ============== [PASSED] drm_test_pick_cmdline ==============
[09:41:31] == drm_test_atomic_get_connector_for_encoder (1 subtest) ===
[09:41:31] [PASSED] drm_test_drm_atomic_get_connector_for_encoder
[09:41:31] ==== [PASSED] drm_test_atomic_get_connector_for_encoder ====
[09:41:31] =========== drm_validate_clone_mode (2 subtests) ===========
[09:41:31] ============== drm_test_check_in_clone_mode  ===============
[09:41:31] [PASSED] in_clone_mode
[09:41:31] [PASSED] not_in_clone_mode
[09:41:31] ========== [PASSED] drm_test_check_in_clone_mode ===========
[09:41:31] =============== drm_test_check_valid_clones  ===============
[09:41:31] [PASSED] not_in_clone_mode
[09:41:31] [PASSED] valid_clone
[09:41:31] [PASSED] invalid_clone
[09:41:31] =========== [PASSED] drm_test_check_valid_clones ===========
[09:41:31] ============= [PASSED] drm_validate_clone_mode =============
[09:41:31] ============= drm_validate_modeset (1 subtest) =============
[09:41:31] [PASSED] drm_test_check_connector_changed_modeset
[09:41:31] ============== [PASSED] drm_validate_modeset ===============
[09:41:31] ====== drm_test_bridge_get_current_state (2 subtests) ======
[09:41:31] [PASSED] drm_test_drm_bridge_get_current_state_atomic
[09:41:31] [PASSED] drm_test_drm_bridge_get_current_state_legacy
[09:41:31] ======== [PASSED] drm_test_bridge_get_current_state ========
[09:41:31] ====== drm_test_bridge_helper_reset_crtc (3 subtests) ======
[09:41:31] [PASSED] drm_test_drm_bridge_helper_reset_crtc_atomic
[09:41:31] [PASSED] drm_test_drm_bridge_helper_reset_crtc_atomic_disabled
[09:41:31] [PASSED] drm_test_drm_bridge_helper_reset_crtc_legacy
[09:41:31] ======== [PASSED] drm_test_bridge_helper_reset_crtc ========
[09:41:31] ============== drm_bridge_alloc (2 subtests) ===============
[09:41:31] [PASSED] drm_test_drm_bridge_alloc_basic
[09:41:31] [PASSED] drm_test_drm_bridge_alloc_get_put
[09:41:31] ================ [PASSED] drm_bridge_alloc =================
[09:41:31] ================== drm_buddy (8 subtests) ==================
[09:41:31] [PASSED] drm_test_buddy_alloc_limit
[09:41:31] [PASSED] drm_test_buddy_alloc_optimistic
[09:41:31] [PASSED] drm_test_buddy_alloc_pessimistic
[09:41:31] [PASSED] drm_test_buddy_alloc_pathological
[09:41:31] [PASSED] drm_test_buddy_alloc_contiguous
[09:41:31] [PASSED] drm_test_buddy_alloc_clear
[09:41:31] [PASSED] drm_test_buddy_alloc_range_bias
[09:41:31] [PASSED] drm_test_buddy_fragmentation_performance
[09:41:31] ==================== [PASSED] drm_buddy ====================
[09:41:31] ============= drm_cmdline_parser (40 subtests) =============
[09:41:31] [PASSED] drm_test_cmdline_force_d_only
[09:41:31] [PASSED] drm_test_cmdline_force_D_only_dvi
[09:41:31] [PASSED] drm_test_cmdline_force_D_only_hdmi
[09:41:31] [PASSED] drm_test_cmdline_force_D_only_not_digital
[09:41:31] [PASSED] drm_test_cmdline_force_e_only
[09:41:31] [PASSED] drm_test_cmdline_res
[09:41:31] [PASSED] drm_test_cmdline_res_vesa
[09:41:31] [PASSED] drm_test_cmdline_res_vesa_rblank
[09:41:31] [PASSED] drm_test_cmdline_res_rblank
[09:41:31] [PASSED] drm_test_cmdline_res_bpp
[09:41:31] [PASSED] drm_test_cmdline_res_refresh
[09:41:31] [PASSED] drm_test_cmdline_res_bpp_refresh
[09:41:31] [PASSED] drm_test_cmdline_res_bpp_refresh_interlaced
[09:41:31] [PASSED] drm_test_cmdline_res_bpp_refresh_margins
[09:41:31] [PASSED] drm_test_cmdline_res_bpp_refresh_force_off
[09:41:31] [PASSED] drm_test_cmdline_res_bpp_refresh_force_on
[09:41:31] [PASSED] drm_test_cmdline_res_bpp_refresh_force_on_analog
[09:41:31] [PASSED] drm_test_cmdline_res_bpp_refresh_force_on_digital
[09:41:31] [PASSED] drm_test_cmdline_res_bpp_refresh_interlaced_margins_force_on
[09:41:31] [PASSED] drm_test_cmdline_res_margins_force_on
[09:41:31] [PASSED] drm_test_cmdline_res_vesa_margins
[09:41:31] [PASSED] drm_test_cmdline_name
[09:41:31] [PASSED] drm_test_cmdline_name_bpp
[09:41:31] [PASSED] drm_test_cmdline_name_option
[09:41:31] [PASSED] drm_test_cmdline_name_bpp_option
[09:41:31] [PASSED] drm_test_cmdline_rotate_0
[09:41:31] [PASSED] drm_test_cmdline_rotate_90
[09:41:31] [PASSED] drm_test_cmdline_rotate_180
[09:41:31] [PASSED] drm_test_cmdline_rotate_270
[09:41:31] [PASSED] drm_test_cmdline_hmirror
[09:41:31] [PASSED] drm_test_cmdline_vmirror
[09:41:31] [PASSED] drm_test_cmdline_margin_options
[09:41:31] [PASSED] drm_test_cmdline_multiple_options
[09:41:31] [PASSED] drm_test_cmdline_bpp_extra_and_option
[09:41:31] [PASSED] drm_test_cmdline_extra_and_option
[09:41:31] [PASSED] drm_test_cmdline_freestanding_options
[09:41:31] [PASSED] drm_test_cmdline_freestanding_force_e_and_options
[09:41:31] [PASSED] drm_test_cmdline_panel_orientation
[09:41:31] ================ drm_test_cmdline_invalid  =================
[09:41:31] [PASSED] margin_only
[09:41:31] [PASSED] interlace_only
[09:41:31] [PASSED] res_missing_x
[09:41:31] [PASSED] res_missing_y
[09:41:31] [PASSED] res_bad_y
[09:41:31] [PASSED] res_missing_y_bpp
[09:41:31] [PASSED] res_bad_bpp
[09:41:31] [PASSED] res_bad_refresh
[09:41:31] [PASSED] res_bpp_refresh_force_on_off
[09:41:31] [PASSED] res_invalid_mode
[09:41:31] [PASSED] res_bpp_wrong_place_mode
[09:41:31] [PASSED] name_bpp_refresh
[09:41:31] [PASSED] name_refresh
[09:41:31] [PASSED] name_refresh_wrong_mode
[09:41:31] [PASSED] name_refresh_invalid_mode
[09:41:31] [PASSED] rotate_multiple
[09:41:31] [PASSED] rotate_invalid_val
[09:41:31] [PASSED] rotate_truncated
[09:41:31] [PASSED] invalid_option
[09:41:31] [PASSED] invalid_tv_option
[09:41:31] [PASSED] truncated_tv_option
[09:41:31] ============ [PASSED] drm_test_cmdline_invalid =============
[09:41:31] =============== drm_test_cmdline_tv_options  ===============
[09:41:31] [PASSED] NTSC
[09:41:31] [PASSED] NTSC_443
[09:41:31] [PASSED] NTSC_J
[09:41:31] [PASSED] PAL
[09:41:31] [PASSED] PAL_M
[09:41:31] [PASSED] PAL_N
[09:41:31] [PASSED] SECAM
[09:41:31] [PASSED] MONO_525
[09:41:31] [PASSED] MONO_625
[09:41:31] =========== [PASSED] drm_test_cmdline_tv_options ===========
[09:41:31] =============== [PASSED] drm_cmdline_parser ================
[09:41:31] ========== drmm_connector_hdmi_init (20 subtests) ==========
[09:41:31] [PASSED] drm_test_connector_hdmi_init_valid
[09:41:31] [PASSED] drm_test_connector_hdmi_init_bpc_8
[09:41:31] [PASSED] drm_test_connector_hdmi_init_bpc_10
[09:41:31] [PASSED] drm_test_connector_hdmi_init_bpc_12
[09:41:31] [PASSED] drm_test_connector_hdmi_init_bpc_invalid
[09:41:31] [PASSED] drm_test_connector_hdmi_init_bpc_null
[09:41:31] [PASSED] drm_test_connector_hdmi_init_formats_empty
[09:41:31] [PASSED] drm_test_connector_hdmi_init_formats_no_rgb
[09:41:31] === drm_test_connector_hdmi_init_formats_yuv420_allowed  ===
[09:41:31] [PASSED] supported_formats=0x9 yuv420_allowed=1
[09:41:31] [PASSED] supported_formats=0x9 yuv420_allowed=0
[09:41:31] [PASSED] supported_formats=0x3 yuv420_allowed=1
[09:41:31] [PASSED] supported_formats=0x3 yuv420_allowed=0
[09:41:31] === [PASSED] drm_test_connector_hdmi_init_formats_yuv420_allowed ===
[09:41:31] [PASSED] drm_test_connector_hdmi_init_null_ddc
[09:41:31] [PASSED] drm_test_connector_hdmi_init_null_product
[09:41:31] [PASSED] drm_test_connector_hdmi_init_null_vendor
[09:41:31] [PASSED] drm_test_connector_hdmi_init_product_length_exact
[09:41:31] [PASSED] drm_test_connector_hdmi_init_product_length_too_long
[09:41:31] [PASSED] drm_test_connector_hdmi_init_product_valid
[09:41:31] [PASSED] drm_test_connector_hdmi_init_vendor_length_exact
[09:41:31] [PASSED] drm_test_connector_hdmi_init_vendor_length_too_long
[09:41:31] [PASSED] drm_test_connector_hdmi_init_vendor_valid
[09:41:31] ========= drm_test_connector_hdmi_init_type_valid  =========
[09:41:31] [PASSED] HDMI-A
[09:41:31] [PASSED] HDMI-B
[09:41:31] ===== [PASSED] drm_test_connector_hdmi_init_type_valid =====
[09:41:31] ======== drm_test_connector_hdmi_init_type_invalid  ========
[09:41:31] [PASSED] Unknown
[09:41:31] [PASSED] VGA
[09:41:31] [PASSED] DVI-I
[09:41:31] [PASSED] DVI-D
[09:41:31] [PASSED] DVI-A
[09:41:31] [PASSED] Composite
[09:41:31] [PASSED] SVIDEO
[09:41:31] [PASSED] LVDS
[09:41:31] [PASSED] Component
[09:41:31] [PASSED] DIN
[09:41:31] [PASSED] DP
[09:41:31] [PASSED] TV
[09:41:31] [PASSED] eDP
[09:41:31] [PASSED] Virtual
[09:41:31] [PASSED] DSI
[09:41:31] [PASSED] DPI
[09:41:31] [PASSED] Writeback
[09:41:31] [PASSED] SPI
[09:41:31] [PASSED] USB
[09:41:31] ==== [PASSED] drm_test_connector_hdmi_init_type_invalid ====
[09:41:31] ============ [PASSED] drmm_connector_hdmi_init =============
[09:41:31] ============= drmm_connector_init (3 subtests) =============
[09:41:31] [PASSED] drm_test_drmm_connector_init
[09:41:31] [PASSED] drm_test_drmm_connector_init_null_ddc
[09:41:31] ========= drm_test_drmm_connector_init_type_valid  =========
[09:41:31] [PASSED] Unknown
[09:41:31] [PASSED] VGA
[09:41:31] [PASSED] DVI-I
[09:41:31] [PASSED] DVI-D
[09:41:31] [PASSED] DVI-A
[09:41:31] [PASSED] Composite
[09:41:31] [PASSED] SVIDEO
[09:41:31] [PASSED] LVDS
[09:41:31] [PASSED] Component
[09:41:31] [PASSED] DIN
[09:41:31] [PASSED] DP
[09:41:31] [PASSED] HDMI-A
[09:41:31] [PASSED] HDMI-B
[09:41:31] [PASSED] TV
[09:41:31] [PASSED] eDP
[09:41:31] [PASSED] Virtual
[09:41:31] [PASSED] DSI
[09:41:31] [PASSED] DPI
[09:41:31] [PASSED] Writeback
[09:41:31] [PASSED] SPI
[09:41:31] [PASSED] USB
[09:41:31] ===== [PASSED] drm_test_drmm_connector_init_type_valid =====
[09:41:31] =============== [PASSED] drmm_connector_init ===============
[09:41:31] ========= drm_connector_dynamic_init (6 subtests) ==========
[09:41:31] [PASSED] drm_test_drm_connector_dynamic_init
[09:41:31] [PASSED] drm_test_drm_connector_dynamic_init_null_ddc
[09:41:31] [PASSED] drm_test_drm_connector_dynamic_init_not_added
[09:41:31] [PASSED] drm_test_drm_connector_dynamic_init_properties
[09:41:31] ===== drm_test_drm_connector_dynamic_init_type_valid  ======
[09:41:31] [PASSED] Unknown
[09:41:31] [PASSED] VGA
[09:41:31] [PASSED] DVI-I
[09:41:31] [PASSED] DVI-D
[09:41:31] [PASSED] DVI-A
[09:41:31] [PASSED] Composite
[09:41:31] [PASSED] SVIDEO
[09:41:31] [PASSED] LVDS
[09:41:31] [PASSED] Component
[09:41:31] [PASSED] DIN
[09:41:31] [PASSED] DP
[09:41:31] [PASSED] HDMI-A
[09:41:31] [PASSED] HDMI-B
[09:41:31] [PASSED] TV
[09:41:31] [PASSED] eDP
[09:41:31] [PASSED] Virtual
[09:41:31] [PASSED] DSI
[09:41:31] [PASSED] DPI
[09:41:31] [PASSED] Writeback
[09:41:31] [PASSED] SPI
[09:41:31] [PASSED] USB
[09:41:31] = [PASSED] drm_test_drm_connector_dynamic_init_type_valid ==
[09:41:31] ======== drm_test_drm_connector_dynamic_init_name  =========
[09:41:31] [PASSED] Unknown
[09:41:31] [PASSED] VGA
[09:41:31] [PASSED] DVI-I
[09:41:31] [PASSED] DVI-D
[09:41:31] [PASSED] DVI-A
[09:41:31] [PASSED] Composite
[09:41:31] [PASSED] SVIDEO
[09:41:31] [PASSED] LVDS
[09:41:31] [PASSED] Component
[09:41:31] [PASSED] DIN
[09:41:31] [PASSED] DP
[09:41:31] [PASSED] HDMI-A
[09:41:31] [PASSED] HDMI-B
[09:41:31] [PASSED] TV
[09:41:31] [PASSED] eDP
[09:41:31] [PASSED] Virtual
[09:41:31] [PASSED] DSI
[09:41:31] [PASSED] DPI
[09:41:31] [PASSED] Writeback
[09:41:31] [PASSED] SPI
[09:41:31] [PASSED] USB
[09:41:31] ==== [PASSED] drm_test_drm_connector_dynamic_init_name =====
[09:41:31] =========== [PASSED] drm_connector_dynamic_init ============
[09:41:31] ==== drm_connector_dynamic_register_early (4 subtests) =====
[09:41:31] [PASSED] drm_test_drm_connector_dynamic_register_early_on_list
[09:41:31] [PASSED] drm_test_drm_connector_dynamic_register_early_defer
[09:41:31] [PASSED] drm_test_drm_connector_dynamic_register_early_no_init
[09:41:31] [PASSED] drm_test_drm_connector_dynamic_register_early_no_mode_object
[09:41:31] ====== [PASSED] drm_connector_dynamic_register_early =======
[09:41:31] ======= drm_connector_dynamic_register (7 subtests) ========
[09:41:31] [PASSED] drm_test_drm_connector_dynamic_register_on_list
[09:41:31] [PASSED] drm_test_drm_connector_dynamic_register_no_defer
[09:41:31] [PASSED] drm_test_drm_connector_dynamic_register_no_init
[09:41:31] [PASSED] drm_test_drm_connector_dynamic_register_mode_object
[09:41:31] [PASSED] drm_test_drm_connector_dynamic_register_sysfs
[09:41:31] [PASSED] drm_test_drm_connector_dynamic_register_sysfs_name
[09:41:31] [PASSED] drm_test_drm_connector_dynamic_register_debugfs
[09:41:31] ========= [PASSED] drm_connector_dynamic_register ==========
[09:41:31] = drm_connector_attach_broadcast_rgb_property (2 subtests) =
[09:41:31] [PASSED] drm_test_drm_connector_attach_broadcast_rgb_property
[09:41:31] [PASSED] drm_test_drm_connector_attach_broadcast_rgb_property_hdmi_connector
[09:41:31] === [PASSED] drm_connector_attach_broadcast_rgb_property ===
[09:41:31] ========== drm_get_tv_mode_from_name (2 subtests) ==========
[09:41:31] ========== drm_test_get_tv_mode_from_name_valid  ===========
[09:41:31] [PASSED] NTSC
[09:41:31] [PASSED] NTSC-443
[09:41:31] [PASSED] NTSC-J
[09:41:31] [PASSED] PAL
[09:41:31] [PASSED] PAL-M
[09:41:31] [PASSED] PAL-N
[09:41:31] [PASSED] SECAM
[09:41:31] [PASSED] Mono
[09:41:31] ====== [PASSED] drm_test_get_tv_mode_from_name_valid =======
[09:41:31] [PASSED] drm_test_get_tv_mode_from_name_truncated
[09:41:31] ============ [PASSED] drm_get_tv_mode_from_name ============
[09:41:31] = drm_test_connector_hdmi_compute_mode_clock (12 subtests) =
[09:41:31] [PASSED] drm_test_drm_hdmi_compute_mode_clock_rgb
[09:41:31] [PASSED] drm_test_drm_hdmi_compute_mode_clock_rgb_10bpc
[09:41:31] [PASSED] drm_test_drm_hdmi_compute_mode_clock_rgb_10bpc_vic_1
[09:41:31] [PASSED] drm_test_drm_hdmi_compute_mode_clock_rgb_12bpc
[09:41:31] [PASSED] drm_test_drm_hdmi_compute_mode_clock_rgb_12bpc_vic_1
[09:41:31] [PASSED] drm_test_drm_hdmi_compute_mode_clock_rgb_double
[09:41:31] = drm_test_connector_hdmi_compute_mode_clock_yuv420_valid  =
[09:41:31] [PASSED] VIC 96
[09:41:31] [PASSED] VIC 97
[09:41:31] [PASSED] VIC 101
[09:41:31] [PASSED] VIC 102
[09:41:31] [PASSED] VIC 106
[09:41:31] [PASSED] VIC 107
[09:41:31] === [PASSED] drm_test_connector_hdmi_compute_mode_clock_yuv420_valid ===
[09:41:31] [PASSED] drm_test_connector_hdmi_compute_mode_clock_yuv420_10_bpc
[09:41:31] [PASSED] drm_test_connector_hdmi_compute_mode_clock_yuv420_12_bpc
[09:41:31] [PASSED] drm_test_connector_hdmi_compute_mode_clock_yuv422_8_bpc
[09:41:31] [PASSED] drm_test_connector_hdmi_compute_mode_clock_yuv422_10_bpc
[09:41:31] [PASSED] drm_test_connector_hdmi_compute_mode_clock_yuv422_12_bpc
[09:41:31] === [PASSED] drm_test_connector_hdmi_compute_mode_clock ====
[09:41:31] == drm_hdmi_connector_get_broadcast_rgb_name (2 subtests) ==
[09:41:31] === drm_test_drm_hdmi_connector_get_broadcast_rgb_name  ====
[09:41:31] [PASSED] Automatic
[09:41:31] [PASSED] Full
[09:41:31] [PASSED] Limited 16:235
[09:41:31] === [PASSED] drm_test_drm_hdmi_connector_get_broadcast_rgb_name ===
[09:41:31] [PASSED] drm_test_drm_hdmi_connector_get_broadcast_rgb_name_invalid
[09:41:31] ==== [PASSED] drm_hdmi_connector_get_broadcast_rgb_name ====
[09:41:31] == drm_hdmi_connector_get_output_format_name (2 subtests) ==
[09:41:31] === drm_test_drm_hdmi_connector_get_output_format_name  ====
[09:41:31] [PASSED] RGB
[09:41:31] [PASSED] YUV 4:2:0
[09:41:31] [PASSED] YUV 4:2:2
[09:41:31] [PASSED] YUV 4:4:4
[09:41:31] === [PASSED] drm_test_drm_hdmi_connector_get_output_format_name ===
[09:41:31] [PASSED] drm_test_drm_hdmi_connector_get_output_format_name_invalid
[09:41:31] ==== [PASSED] drm_hdmi_connector_get_output_format_name ====
[09:41:31] ============= drm_damage_helper (21 subtests) ==============
[09:41:31] [PASSED] drm_test_damage_iter_no_damage
[09:41:31] [PASSED] drm_test_damage_iter_no_damage_fractional_src
[09:41:31] [PASSED] drm_test_damage_iter_no_damage_src_moved
[09:41:31] [PASSED] drm_test_damage_iter_no_damage_fractional_src_moved
[09:41:31] [PASSED] drm_test_damage_iter_no_damage_not_visible
[09:41:31] [PASSED] drm_test_damage_iter_no_damage_no_crtc
[09:41:31] [PASSED] drm_test_damage_iter_no_damage_no_fb
[09:41:31] [PASSED] drm_test_damage_iter_simple_damage
[09:41:31] [PASSED] drm_test_damage_iter_single_damage
[09:41:31] [PASSED] drm_test_damage_iter_single_damage_intersect_src
[09:41:31] [PASSED] drm_test_damage_iter_single_damage_outside_src
[09:41:31] [PASSED] drm_test_damage_iter_single_damage_fractional_src
[09:41:31] [PASSED] drm_test_damage_iter_single_damage_intersect_fractional_src
[09:41:31] [PASSED] drm_test_damage_iter_single_damage_outside_fractional_src
[09:41:31] [PASSED] drm_test_damage_iter_single_damage_src_moved
[09:41:31] [PASSED] drm_test_damage_iter_single_damage_fractional_src_moved
[09:41:31] [PASSED] drm_test_damage_iter_damage
[09:41:31] [PASSED] drm_test_damage_iter_damage_one_intersect
[09:41:31] [PASSED] drm_test_damage_iter_damage_one_outside
[09:41:31] [PASSED] drm_test_damage_iter_damage_src_moved
[09:41:31] [PASSED] drm_test_damage_iter_damage_not_visible
[09:41:31] ================ [PASSED] drm_damage_helper ================
[09:41:31] ============== drm_dp_mst_helper (3 subtests) ==============
[09:41:31] ============== drm_test_dp_mst_calc_pbn_mode  ==============
[09:41:31] [PASSED] Clock 154000 BPP 30 DSC disabled
[09:41:31] [PASSED] Clock 234000 BPP 30 DSC disabled
[09:41:31] [PASSED] Clock 297000 BPP 24 DSC disabled
[09:41:31] [PASSED] Clock 332880 BPP 24 DSC enabled
[09:41:31] [PASSED] Clock 324540 BPP 24 DSC enabled
[09:41:31] ========== [PASSED] drm_test_dp_mst_calc_pbn_mode ==========
[09:41:31] ============== drm_test_dp_mst_calc_pbn_div  ===============
[09:41:31] [PASSED] Link rate 2000000 lane count 4
[09:41:31] [PASSED] Link rate 2000000 lane count 2
[09:41:31] [PASSED] Link rate 2000000 lane count 1
[09:41:31] [PASSED] Link rate 1350000 lane count 4
[09:41:31] [PASSED] Link rate 1350000 lane count 2
[09:41:31] [PASSED] Link rate 1350000 lane count 1
[09:41:31] [PASSED] Link rate 1000000 lane count 4
[09:41:31] [PASSED] Link rate 1000000 lane count 2
[09:41:31] [PASSED] Link rate 1000000 lane count 1
[09:41:31] [PASSED] Link rate 810000 lane count 4
[09:41:31] [PASSED] Link rate 810000 lane count 2
[09:41:31] [PASSED] Link rate 810000 lane count 1
[09:41:31] [PASSED] Link rate 540000 lane count 4
[09:41:31] [PASSED] Link rate 540000 lane count 2
[09:41:31] [PASSED] Link rate 540000 lane count 1
[09:41:31] [PASSED] Link rate 270000 lane count 4
[09:41:31] [PASSED] Link rate 270000 lane count 2
[09:41:31] [PASSED] Link rate 270000 lane count 1
[09:41:31] [PASSED] Link rate 162000 lane count 4
[09:41:31] [PASSED] Link rate 162000 lane count 2
[09:41:31] [PASSED] Link rate 162000 lane count 1
[09:41:31] ========== [PASSED] drm_test_dp_mst_calc_pbn_div ===========
[09:41:31] ========= drm_test_dp_mst_sideband_msg_req_decode  =========
[09:41:31] [PASSED] DP_ENUM_PATH_RESOURCES with port number
[09:41:31] [PASSED] DP_POWER_UP_PHY with port number
[09:41:31] [PASSED] DP_POWER_DOWN_PHY with port number
[09:41:31] [PASSED] DP_ALLOCATE_PAYLOAD with SDP stream sinks
[09:41:31] [PASSED] DP_ALLOCATE_PAYLOAD with port number
[09:41:31] [PASSED] DP_ALLOCATE_PAYLOAD with VCPI
[09:41:31] [PASSED] DP_ALLOCATE_PAYLOAD with PBN
[09:41:31] [PASSED] DP_QUERY_PAYLOAD with port number
[09:41:31] [PASSED] DP_QUERY_PAYLOAD with VCPI
[09:41:31] [PASSED] DP_REMOTE_DPCD_READ with port number
[09:41:31] [PASSED] DP_REMOTE_DPCD_READ with DPCD address
[09:41:31] [PASSED] DP_REMOTE_DPCD_READ with max number of bytes
[09:41:31] [PASSED] DP_REMOTE_DPCD_WRITE with port number
[09:41:31] [PASSED] DP_REMOTE_DPCD_WRITE with DPCD address
[09:41:31] [PASSED] DP_REMOTE_DPCD_WRITE with data array
[09:41:31] [PASSED] DP_REMOTE_I2C_READ with port number
[09:41:31] [PASSED] DP_REMOTE_I2C_READ with I2C device ID
[09:41:31] [PASSED] DP_REMOTE_I2C_READ with transactions array
[09:41:31] [PASSED] DP_REMOTE_I2C_WRITE with port number
[09:41:31] [PASSED] DP_REMOTE_I2C_WRITE with I2C device ID
[09:41:31] [PASSED] DP_REMOTE_I2C_WRITE with data array
[09:41:31] [PASSED] DP_QUERY_STREAM_ENC_STATUS with stream ID
[09:41:31] [PASSED] DP_QUERY_STREAM_ENC_STATUS with client ID
[09:41:31] [PASSED] DP_QUERY_STREAM_ENC_STATUS with stream event
[09:41:31] [PASSED] DP_QUERY_STREAM_ENC_STATUS with valid stream event
[09:41:31] [PASSED] DP_QUERY_STREAM_ENC_STATUS with stream behavior
[09:41:31] [PASSED] DP_QUERY_STREAM_ENC_STATUS with a valid stream behavior
[09:41:31] ===== [PASSED] drm_test_dp_mst_sideband_msg_req_decode =====
[09:41:31] ================ [PASSED] drm_dp_mst_helper ================
[09:41:31] ================== drm_exec (7 subtests) ===================
[09:41:31] [PASSED] sanitycheck
[09:41:31] [PASSED] test_lock
[09:41:31] [PASSED] test_lock_unlock
[09:41:31] [PASSED] test_duplicates
[09:41:31] [PASSED] test_prepare
[09:41:31] [PASSED] test_prepare_array
[09:41:31] [PASSED] test_multiple_loops
[09:41:31] ==================== [PASSED] drm_exec =====================
[09:41:31] =========== drm_format_helper_test (17 subtests) ===========
[09:41:31] ============== drm_test_fb_xrgb8888_to_gray8  ==============
[09:41:31] [PASSED] single_pixel_source_buffer
[09:41:31] [PASSED] single_pixel_clip_rectangle
[09:41:31] [PASSED] well_known_colors
[09:41:31] [PASSED] destination_pitch
[09:41:31] ========== [PASSED] drm_test_fb_xrgb8888_to_gray8 ==========
[09:41:31] ============= drm_test_fb_xrgb8888_to_rgb332  ==============
[09:41:31] [PASSED] single_pixel_source_buffer
[09:41:31] [PASSED] single_pixel_clip_rectangle
[09:41:31] [PASSED] well_known_colors
[09:41:31] [PASSED] destination_pitch
[09:41:31] ========= [PASSED] drm_test_fb_xrgb8888_to_rgb332 ==========
[09:41:31] ============= drm_test_fb_xrgb8888_to_rgb565  ==============
[09:41:31] [PASSED] single_pixel_source_buffer
[09:41:31] [PASSED] single_pixel_clip_rectangle
[09:41:31] [PASSED] well_known_colors
[09:41:31] [PASSED] destination_pitch
[09:41:31] ========= [PASSED] drm_test_fb_xrgb8888_to_rgb565 ==========
[09:41:31] ============ drm_test_fb_xrgb8888_to_xrgb1555  =============
[09:41:31] [PASSED] single_pixel_source_buffer
[09:41:31] [PASSED] single_pixel_clip_rectangle
[09:41:31] [PASSED] well_known_colors
[09:41:31] [PASSED] destination_pitch
[09:41:31] ======== [PASSED] drm_test_fb_xrgb8888_to_xrgb1555 =========
[09:41:31] ============ drm_test_fb_xrgb8888_to_argb1555  =============
[09:41:31] [PASSED] single_pixel_source_buffer
[09:41:31] [PASSED] single_pixel_clip_rectangle
[09:41:31] [PASSED] well_known_colors
[09:41:31] [PASSED] destination_pitch
[09:41:31] ======== [PASSED] drm_test_fb_xrgb8888_to_argb1555 =========
[09:41:31] ============ drm_test_fb_xrgb8888_to_rgba5551  =============
[09:41:31] [PASSED] single_pixel_source_buffer
[09:41:31] [PASSED] single_pixel_clip_rectangle
[09:41:31] [PASSED] well_known_colors
[09:41:31] [PASSED] destination_pitch
[09:41:31] ======== [PASSED] drm_test_fb_xrgb8888_to_rgba5551 =========
[09:41:31] ============= drm_test_fb_xrgb8888_to_rgb888  ==============
[09:41:31] [PASSED] single_pixel_source_buffer
[09:41:31] [PASSED] single_pixel_clip_rectangle
[09:41:31] [PASSED] well_known_colors
[09:41:31] [PASSED] destination_pitch
[09:41:31] ========= [PASSED] drm_test_fb_xrgb8888_to_rgb888 ==========
[09:41:31] ============= drm_test_fb_xrgb8888_to_bgr888  ==============
[09:41:31] [PASSED] single_pixel_source_buffer
[09:41:31] [PASSED] single_pixel_clip_rectangle
[09:41:31] [PASSED] well_known_colors
[09:41:31] [PASSED] destination_pitch
[09:41:31] ========= [PASSED] drm_test_fb_xrgb8888_to_bgr888 ==========
[09:41:31] ============ drm_test_fb_xrgb8888_to_argb8888  =============
[09:41:31] [PASSED] single_pixel_source_buffer
[09:41:31] [PASSED] single_pixel_clip_rectangle
[09:41:31] [PASSED] well_known_colors
[09:41:31] [PASSED] destination_pitch
[09:41:31] ======== [PASSED] drm_test_fb_xrgb8888_to_argb8888 =========
[09:41:31] =========== drm_test_fb_xrgb8888_to_xrgb2101010  ===========
[09:41:31] [PASSED] single_pixel_source_buffer
[09:41:31] [PASSED] single_pixel_clip_rectangle
[09:41:31] [PASSED] well_known_colors
[09:41:31] [PASSED] destination_pitch
[09:41:31] ======= [PASSED] drm_test_fb_xrgb8888_to_xrgb2101010 =======
[09:41:31] =========== drm_test_fb_xrgb8888_to_argb2101010  ===========
[09:41:31] [PASSED] single_pixel_source_buffer
[09:41:31] [PASSED] single_pixel_clip_rectangle
[09:41:31] [PASSED] well_known_colors
[09:41:31] [PASSED] destination_pitch
[09:41:31] ======= [PASSED] drm_test_fb_xrgb8888_to_argb2101010 =======
[09:41:31] ============== drm_test_fb_xrgb8888_to_mono  ===============
[09:41:31] [PASSED] single_pixel_source_buffer
[09:41:31] [PASSED] single_pixel_clip_rectangle
[09:41:31] [PASSED] well_known_colors
[09:41:31] [PASSED] destination_pitch
[09:41:31] ========== [PASSED] drm_test_fb_xrgb8888_to_mono ===========
[09:41:31] ==================== drm_test_fb_swab  =====================
[09:41:31] [PASSED] single_pixel_source_buffer
[09:41:31] [PASSED] single_pixel_clip_rectangle
[09:41:31] [PASSED] well_known_colors
[09:41:31] [PASSED] destination_pitch
[09:41:31] ================ [PASSED] drm_test_fb_swab =================
[09:41:31] ============ drm_test_fb_xrgb8888_to_xbgr8888  =============
[09:41:31] [PASSED] single_pixel_source_buffer
[09:41:31] [PASSED] single_pixel_clip_rectangle
[09:41:31] [PASSED] well_known_colors
[09:41:31] [PASSED] destination_pitch
[09:41:31] ======== [PASSED] drm_test_fb_xrgb8888_to_xbgr8888 =========
[09:41:31] ============ drm_test_fb_xrgb8888_to_abgr8888  =============
[09:41:31] [PASSED] single_pixel_source_buffer
[09:41:31] [PASSED] single_pixel_clip_rectangle
[09:41:31] [PASSED] well_known_colors
[09:41:31] [PASSED] destination_pitch
[09:41:31] ======== [PASSED] drm_test_fb_xrgb8888_to_abgr8888 =========
[09:41:31] ================= drm_test_fb_clip_offset  =================
[09:41:31] [PASSED] pass through
[09:41:31] [PASSED] horizontal offset
[09:41:31] [PASSED] vertical offset
[09:41:31] [PASSED] horizontal and vertical offset
[09:41:31] [PASSED] horizontal offset (custom pitch)
[09:41:31] [PASSED] vertical offset (custom pitch)
[09:41:31] [PASSED] horizontal and vertical offset (custom pitch)
[09:41:31] ============= [PASSED] drm_test_fb_clip_offset =============
[09:41:31] =================== drm_test_fb_memcpy  ====================
[09:41:31] [PASSED] single_pixel_source_buffer: XR24 little-endian (0x34325258)
[09:41:31] [PASSED] single_pixel_source_buffer: XRA8 little-endian (0x38415258)
[09:41:31] [PASSED] single_pixel_source_buffer: YU24 little-endian (0x34325559)
[09:41:31] [PASSED] single_pixel_clip_rectangle: XB24 little-endian (0x34324258)
[09:41:31] [PASSED] single_pixel_clip_rectangle: XRA8 little-endian (0x38415258)
[09:41:31] [PASSED] single_pixel_clip_rectangle: YU24 little-endian (0x34325559)
[09:41:31] [PASSED] well_known_colors: XB24 little-endian (0x34324258)
[09:41:31] [PASSED] well_known_colors: XRA8 little-endian (0x38415258)
[09:41:31] [PASSED] well_known_colors: YU24 little-endian (0x34325559)
[09:41:31] [PASSED] destination_pitch: XB24 little-endian (0x34324258)
[09:41:31] [PASSED] destination_pitch: XRA8 little-endian (0x38415258)
[09:41:31] [PASSED] destination_pitch: YU24 little-endian (0x34325559)
[09:41:31] =============== [PASSED] drm_test_fb_memcpy ================
[09:41:31] ============= [PASSED] drm_format_helper_test ==============
[09:41:31] ================= drm_format (18 subtests) =================
[09:41:31] [PASSED] drm_test_format_block_width_invalid
[09:41:31] [PASSED] drm_test_format_block_width_one_plane
[09:41:31] [PASSED] drm_test_format_block_width_two_plane
[09:41:31] [PASSED] drm_test_format_block_width_three_plane
[09:41:31] [PASSED] drm_test_format_block_width_tiled
[09:41:31] [PASSED] drm_test_format_block_height_invalid
[09:41:31] [PASSED] drm_test_format_block_height_one_plane
[09:41:31] [PASSED] drm_test_format_block_height_two_plane
[09:41:31] [PASSED] drm_test_format_block_height_three_plane
[09:41:31] [PASSED] drm_test_format_block_height_tiled
[09:41:31] [PASSED] drm_test_format_min_pitch_invalid
[09:41:31] [PASSED] drm_test_format_min_pitch_one_plane_8bpp
[09:41:31] [PASSED] drm_test_format_min_pitch_one_plane_16bpp
[09:41:31] [PASSED] drm_test_format_min_pitch_one_plane_24bpp
[09:41:31] [PASSED] drm_test_format_min_pitch_one_plane_32bpp
[09:41:31] [PASSED] drm_test_format_min_pitch_two_plane
[09:41:31] [PASSED] drm_test_format_min_pitch_three_plane_8bpp
[09:41:31] [PASSED] drm_test_format_min_pitch_tiled
[09:41:31] =================== [PASSED] drm_format ====================
[09:41:31] ============== drm_framebuffer (10 subtests) ===============
[09:41:31] ========== drm_test_framebuffer_check_src_coords  ==========
[09:41:31] [PASSED] Success: source fits into fb
[09:41:31] [PASSED] Fail: overflowing fb with x-axis coordinate
[09:41:31] [PASSED] Fail: overflowing fb with y-axis coordinate
[09:41:31] [PASSED] Fail: overflowing fb with source width
[09:41:31] [PASSED] Fail: overflowing fb with source height
[09:41:31] ====== [PASSED] drm_test_framebuffer_check_src_coords ======
[09:41:31] [PASSED] drm_test_framebuffer_cleanup
[09:41:31] =============== drm_test_framebuffer_create  ===============
[09:41:31] [PASSED] ABGR8888 normal sizes
[09:41:31] [PASSED] ABGR8888 max sizes
[09:41:31] [PASSED] ABGR8888 pitch greater than min required
[09:41:31] [PASSED] ABGR8888 pitch less than min required
[09:41:31] [PASSED] ABGR8888 Invalid width
[09:41:31] [PASSED] ABGR8888 Invalid buffer handle
[09:41:31] [PASSED] No pixel format
[09:41:31] [PASSED] ABGR8888 Width 0
[09:41:31] [PASSED] ABGR8888 Height 0
[09:41:31] [PASSED] ABGR8888 Out of bound height * pitch combination
[09:41:31] [PASSED] ABGR8888 Large buffer offset
[09:41:31] [PASSED] ABGR8888 Buffer offset for inexistent plane
[09:41:31] [PASSED] ABGR8888 Invalid flag
[09:41:31] [PASSED] ABGR8888 Set DRM_MODE_FB_MODIFIERS without modifiers
[09:41:31] [PASSED] ABGR8888 Valid buffer modifier
[09:41:31] [PASSED] ABGR8888 Invalid buffer modifier(DRM_FORMAT_MOD_SAMSUNG_64_32_TILE)
[09:41:31] [PASSED] ABGR8888 Extra pitches without DRM_MODE_FB_MODIFIERS
[09:41:31] [PASSED] ABGR8888 Extra pitches with DRM_MODE_FB_MODIFIERS
[09:41:31] [PASSED] NV12 Normal sizes
[09:41:31] [PASSED] NV12 Max sizes
[09:41:31] [PASSED] NV12 Invalid pitch
[09:41:31] [PASSED] NV12 Invalid modifier/missing DRM_MODE_FB_MODIFIERS flag
[09:41:31] [PASSED] NV12 different  modifier per-plane
[09:41:31] [PASSED] NV12 with DRM_FORMAT_MOD_SAMSUNG_64_32_TILE
[09:41:31] [PASSED] NV12 Valid modifiers without DRM_MODE_FB_MODIFIERS
[09:41:31] [PASSED] NV12 Modifier for inexistent plane
[09:41:31] [PASSED] NV12 Handle for inexistent plane
[09:41:31] [PASSED] NV12 Handle for inexistent plane without DRM_MODE_FB_MODIFIERS
[09:41:31] [PASSED] YVU420 DRM_MODE_FB_MODIFIERS set without modifier
[09:41:31] [PASSED] YVU420 Normal sizes
[09:41:31] [PASSED] YVU420 Max sizes
[09:41:31] [PASSED] YVU420 Invalid pitch
[09:41:31] [PASSED] YVU420 Different pitches
[09:41:31] [PASSED] YVU420 Different buffer offsets/pitches
[09:41:31] [PASSED] YVU420 Modifier set just for plane 0, without DRM_MODE_FB_MODIFIERS
[09:41:31] [PASSED] YVU420 Modifier set just for planes 0, 1, without DRM_MODE_FB_MODIFIERS
[09:41:31] [PASSED] YVU420 Modifier set just for plane 0, 1, with DRM_MODE_FB_MODIFIERS
[09:41:31] [PASSED] YVU420 Valid modifier
[09:41:31] [PASSED] YVU420 Different modifiers per plane
[09:41:31] [PASSED] YVU420 Modifier for inexistent plane
[09:41:31] [PASSED] YUV420_10BIT Invalid modifier(DRM_FORMAT_MOD_LINEAR)
[09:41:31] [PASSED] X0L2 Normal sizes
[09:41:31] [PASSED] X0L2 Max sizes
[09:41:31] [PASSED] X0L2 Invalid pitch
[09:41:31] [PASSED] X0L2 Pitch greater than minimum required
[09:41:31] [PASSED] X0L2 Handle for inexistent plane
[09:41:31] [PASSED] X0L2 Offset for inexistent plane, without DRM_MODE_FB_MODIFIERS set
[09:41:31] [PASSED] X0L2 Modifier without DRM_MODE_FB_MODIFIERS set
[09:41:31] [PASSED] X0L2 Valid modifier
[09:41:31] [PASSED] X0L2 Modifier for inexistent plane
[09:41:31] =========== [PASSED] drm_test_framebuffer_create ===========
[09:41:31] [PASSED] drm_test_framebuffer_free
[09:41:31] [PASSED] drm_test_framebuffer_init
[09:41:31] [PASSED] drm_test_framebuffer_init_bad_format
[09:41:31] [PASSED] drm_test_framebuffer_init_dev_mismatch
[09:41:31] [PASSED] drm_test_framebuffer_lookup
[09:41:31] [PASSED] drm_test_framebuffer_lookup_inexistent
[09:41:31] [PASSED] drm_test_framebuffer_modifiers_not_supported
[09:41:31] ================= [PASSED] drm_framebuffer =================
[09:41:31] ================ drm_gem_shmem (8 subtests) ================
[09:41:31] [PASSED] drm_gem_shmem_test_obj_create
[09:41:31] [PASSED] drm_gem_shmem_test_obj_create_private
[09:41:31] [PASSED] drm_gem_shmem_test_pin_pages
[09:41:31] [PASSED] drm_gem_shmem_test_vmap
[09:41:31] [PASSED] drm_gem_shmem_test_get_pages_sgt
[09:41:31] [PASSED] drm_gem_shmem_test_get_sg_table
[09:41:31] [PASSED] drm_gem_shmem_test_madvise
[09:41:31] [PASSED] drm_gem_shmem_test_purge
[09:41:31] ================== [PASSED] drm_gem_shmem ==================
[09:41:31] === drm_atomic_helper_connector_hdmi_check (27 subtests) ===
[09:41:31] [PASSED] drm_test_check_broadcast_rgb_auto_cea_mode
[09:41:31] [PASSED] drm_test_check_broadcast_rgb_auto_cea_mode_vic_1
[09:41:31] [PASSED] drm_test_check_broadcast_rgb_full_cea_mode
[09:41:31] [PASSED] drm_test_check_broadcast_rgb_full_cea_mode_vic_1
[09:41:31] [PASSED] drm_test_check_broadcast_rgb_limited_cea_mode
[09:41:31] [PASSED] drm_test_check_broadcast_rgb_limited_cea_mode_vic_1
[09:41:31] ====== drm_test_check_broadcast_rgb_cea_mode_yuv420  =======
[09:41:31] [PASSED] Automatic
[09:41:31] [PASSED] Full
[09:41:31] [PASSED] Limited 16:235
[09:41:31] == [PASSED] drm_test_check_broadcast_rgb_cea_mode_yuv420 ===
[09:41:31] [PASSED] drm_test_check_broadcast_rgb_crtc_mode_changed
[09:41:31] [PASSED] drm_test_check_broadcast_rgb_crtc_mode_not_changed
[09:41:31] [PASSED] drm_test_check_disable_connector
[09:41:31] [PASSED] drm_test_check_hdmi_funcs_reject_rate
[09:41:31] [PASSED] drm_test_check_max_tmds_rate_bpc_fallback_rgb
[09:41:31] [PASSED] drm_test_check_max_tmds_rate_bpc_fallback_yuv420
[09:41:31] [PASSED] drm_test_check_max_tmds_rate_bpc_fallback_ignore_yuv422
[09:41:31] [PASSED] drm_test_check_max_tmds_rate_bpc_fallback_ignore_yuv420
[09:41:31] [PASSED] drm_test_check_driver_unsupported_fallback_yuv420
[09:41:31] [PASSED] drm_test_check_output_bpc_crtc_mode_changed
[09:41:31] [PASSED] drm_test_check_output_bpc_crtc_mode_not_changed
[09:41:31] [PASSED] drm_test_check_output_bpc_dvi
[09:41:31] [PASSED] drm_test_check_output_bpc_format_vic_1
[09:41:31] [PASSED] drm_test_check_output_bpc_format_display_8bpc_only
[09:41:31] [PASSED] drm_test_check_output_bpc_format_display_rgb_only
[09:41:31] [PASSED] drm_test_check_output_bpc_format_driver_8bpc_only
[09:41:31] [PASSED] drm_test_check_output_bpc_format_driver_rgb_only
[09:41:31] [PASSED] drm_test_check_tmds_char_rate_rgb_8bpc
[09:41:31] [PASSED] drm_test_check_tmds_char_rate_rgb_10bpc
[09:41:31] [PASSED] drm_test_check_tmds_char_rate_rgb_12bpc
[09:41:31] ===== [PASSED] drm_atomic_helper_connector_hdmi_check ======
[09:41:31] === drm_atomic_helper_connector_hdmi_reset (6 subtests) ====
[09:41:31] [PASSED] drm_test_check_broadcast_rgb_value
[09:41:31] [PASSED] drm_test_check_bpc_8_value
[09:41:31] [PASSED] drm_test_check_bpc_10_value
[09:41:31] [PASSED] drm_test_check_bpc_12_value
[09:41:31] [PASSED] drm_test_check_format_value
[09:41:31] [PASSED] drm_test_check_tmds_char_value
[09:41:31] ===== [PASSED] drm_atomic_helper_connector_hdmi_reset ======
[09:41:31] = drm_atomic_helper_connector_hdmi_mode_valid (4 subtests) =
[09:41:31] [PASSED] drm_test_check_mode_valid
[09:41:31] [PASSED] drm_test_check_mode_valid_reject
[09:41:31] [PASSED] drm_test_check_mode_valid_reject_rate
[09:41:31] [PASSED] drm_test_check_mode_valid_reject_max_clock
[09:41:31] === [PASSED] drm_atomic_helper_connector_hdmi_mode_valid ===
[09:41:31] ================= drm_managed (2 subtests) =================
[09:41:31] [PASSED] drm_test_managed_release_action
[09:41:31] [PASSED] drm_test_managed_run_action
[09:41:31] =================== [PASSED] drm_managed ===================
[09:41:31] =================== drm_mm (6 subtests) ====================
[09:41:31] [PASSED] drm_test_mm_init
[09:41:31] [PASSED] drm_test_mm_debug
[09:41:31] [PASSED] drm_test_mm_align32
[09:41:31] [PASSED] drm_test_mm_align64
[09:41:31] [PASSED] drm_test_mm_lowest
[09:41:31] [PASSED] drm_test_mm_highest
[09:41:31] ===================== [PASSED] drm_mm ======================
[09:41:31] ============= drm_modes_analog_tv (5 subtests) =============
[09:41:31] [PASSED] drm_test_modes_analog_tv_mono_576i
[09:41:31] [PASSED] drm_test_modes_analog_tv_ntsc_480i
[09:41:31] [PASSED] drm_test_modes_analog_tv_ntsc_480i_inlined
[09:41:31] [PASSED] drm_test_modes_analog_tv_pal_576i
[09:41:31] [PASSED] drm_test_modes_analog_tv_pal_576i_inlined
[09:41:31] =============== [PASSED] drm_modes_analog_tv ===============
[09:41:31] ============== drm_plane_helper (2 subtests) ===============
[09:41:31] =============== drm_test_check_plane_state  ================
[09:41:31] [PASSED] clipping_simple
[09:41:31] [PASSED] clipping_rotate_reflect
[09:41:31] [PASSED] positioning_simple
[09:41:31] [PASSED] upscaling
[09:41:31] [PASSED] downscaling
[09:41:31] [PASSED] rounding1
[09:41:31] [PASSED] rounding2
[09:41:31] [PASSED] rounding3
[09:41:31] [PASSED] rounding4
[09:41:31] =========== [PASSED] drm_test_check_plane_state ============
[09:41:31] =========== drm_test_check_invalid_plane_state  ============
[09:41:31] [PASSED] positioning_invalid
[09:41:31] [PASSED] upscaling_invalid
[09:41:31] [PASSED] downscaling_invalid
[09:41:31] ======= [PASSED] drm_test_check_invalid_plane_state ========
[09:41:31] ================ [PASSED] drm_plane_helper =================
[09:41:31] ====== drm_connector_helper_tv_get_modes (1 subtest) =======
[09:41:31] ====== drm_test_connector_helper_tv_get_modes_check  =======
[09:41:31] [PASSED] None
[09:41:31] [PASSED] PAL
[09:41:31] [PASSED] NTSC
[09:41:31] [PASSED] Both, NTSC Default
[09:41:31] [PASSED] Both, PAL Default
[09:41:31] [PASSED] Both, NTSC Default, with PAL on command-line
[09:41:31] [PASSED] Both, PAL Default, with NTSC on command-line
[09:41:31] == [PASSED] drm_test_connector_helper_tv_get_modes_check ===
[09:41:31] ======== [PASSED] drm_connector_helper_tv_get_modes ========
[09:41:31] ================== drm_rect (9 subtests) ===================
[09:41:31] [PASSED] drm_test_rect_clip_scaled_div_by_zero
[09:41:31] [PASSED] drm_test_rect_clip_scaled_not_clipped
[09:41:31] [PASSED] drm_test_rect_clip_scaled_clipped
[09:41:31] [PASSED] drm_test_rect_clip_scaled_signed_vs_unsigned
[09:41:31] ================= drm_test_rect_intersect  =================
[09:41:31] [PASSED] top-left x bottom-right: 2x2+1+1 x 2x2+0+0
[09:41:31] [PASSED] top-right x bottom-left: 2x2+0+0 x 2x2+1-1
[09:41:31] [PASSED] bottom-left x top-right: 2x2+1-1 x 2x2+0+0
[09:41:31] [PASSED] bottom-right x top-left: 2x2+0+0 x 2x2+1+1
[09:41:31] [PASSED] right x left: 2x1+0+0 x 3x1+1+0
[09:41:31] [PASSED] left x right: 3x1+1+0 x 2x1+0+0
[09:41:31] [PASSED] up x bottom: 1x2+0+0 x 1x3+0-1
[09:41:31] [PASSED] bottom x up: 1x3+0-1 x 1x2+0+0
[09:41:31] [PASSED] touching corner: 1x1+0+0 x 2x2+1+1
[09:41:31] [PASSED] touching side: 1x1+0+0 x 1x1+1+0
[09:41:31] [PASSED] equal rects: 2x2+0+0 x 2x2+0+0
[09:41:31] [PASSED] inside another: 2x2+0+0 x 1x1+1+1
[09:41:31] [PASSED] far away: 1x1+0+0 x 1x1+3+6
[09:41:31] [PASSED] points intersecting: 0x0+5+10 x 0x0+5+10
[09:41:31] [PASSED] points not intersecting: 0x0+0+0 x 0x0+5+10
[09:41:31] ============= [PASSED] drm_test_rect_intersect =============
[09:41:31] ================ drm_test_rect_calc_hscale  ================
[09:41:31] [PASSED] normal use
[09:41:31] [PASSED] out of max range
[09:41:31] [PASSED] out of min range
[09:41:31] [PASSED] zero dst
[09:41:31] [PASSED] negative src
[09:41:31] [PASSED] negative dst
[09:41:31] ============ [PASSED] drm_test_rect_calc_hscale ============
[09:41:31] ================ drm_test_rect_calc_vscale  ================
[09:41:31] [PASSED] normal use
[09:41:31] [PASSED] out of max range
[09:41:31] [PASSED] out of min range
[09:41:31] [PASSED] zero dst
[09:41:31] [PASSED] negative src
[09:41:31] [PASSED] negative dst
[09:41:31] ============ [PASSED] drm_test_rect_calc_vscale ============
[09:41:31] ================== drm_test_rect_rotate  ===================
[09:41:31] [PASSED] reflect-x
[09:41:31] [PASSED] reflect-y
[09:41:31] [PASSED] rotate-0
[09:41:31] [PASSED] rotate-90
[09:41:31] [PASSED] rotate-180
[09:41:31] [PASSED] rotate-270
[09:41:31] ============== [PASSED] drm_test_rect_rotate ===============
[09:41:31] ================ drm_test_rect_rotate_inv  =================
[09:41:31] [PASSED] reflect-x
[09:41:31] [PASSED] reflect-y
[09:41:31] [PASSED] rotate-0
[09:41:31] [PASSED] rotate-90
[09:41:31] [PASSED] rotate-180
[09:41:31] [PASSED] rotate-270
[09:41:31] ============ [PASSED] drm_test_rect_rotate_inv =============
[09:41:31] ==================== [PASSED] drm_rect =====================
[09:41:31] ============ drm_sysfb_modeset_test (1 subtest) ============
[09:41:31] ============ drm_test_sysfb_build_fourcc_list  =============
[09:41:31] [PASSED] no native formats
[09:41:31] [PASSED] XRGB8888 as native format
[09:41:31] [PASSED] remove duplicates
[09:41:31] [PASSED] convert alpha formats
[09:41:31] [PASSED] random formats
[09:41:31] ======== [PASSED] drm_test_sysfb_build_fourcc_list =========
[09:41:31] ============= [PASSED] drm_sysfb_modeset_test ==============
[09:41:31] ================== drm_fixp (2 subtests) ===================
[09:41:31] [PASSED] drm_test_int2fixp
[09:41:31] [PASSED] drm_test_sm2fixp
[09:41:31] ==================== [PASSED] drm_fixp =====================
[09:41:31] ============================================================
[09:41:31] Testing complete. Ran 624 tests: passed: 624
[09:41:31] Elapsed time: 26.768s total, 1.693s configuring, 24.608s building, 0.432s running

+ /kernel/tools/testing/kunit/kunit.py run --kunitconfig /kernel/drivers/gpu/drm/ttm/tests/.kunitconfig
[09:41:31] Configuring KUnit Kernel ...
Regenerating .config ...
Populating config with:
$ make ARCH=um O=.kunit olddefconfig
[09:41:33] Building KUnit Kernel ...
Populating config with:
$ make ARCH=um O=.kunit olddefconfig
Building with:
$ make all compile_commands.json scripts_gdb ARCH=um O=.kunit --jobs=48
[09:41:42] Starting KUnit Kernel (1/1)...
[09:41:42] ============================================================
Running tests with:
$ .kunit/linux kunit.enable=1 mem=1G console=tty kunit_shutdown=halt
[09:41:42] ================= ttm_device (5 subtests) ==================
[09:41:42] [PASSED] ttm_device_init_basic
[09:41:42] [PASSED] ttm_device_init_multiple
[09:41:42] [PASSED] ttm_device_fini_basic
[09:41:42] [PASSED] ttm_device_init_no_vma_man
[09:41:42] ================== ttm_device_init_pools  ==================
[09:41:42] [PASSED] No DMA allocations, no DMA32 required
[09:41:42] [PASSED] DMA allocations, DMA32 required
[09:41:42] [PASSED] No DMA allocations, DMA32 required
[09:41:42] [PASSED] DMA allocations, no DMA32 required
[09:41:42] ============== [PASSED] ttm_device_init_pools ==============
[09:41:42] =================== [PASSED] ttm_device ====================
[09:41:42] ================== ttm_pool (8 subtests) ===================
[09:41:42] ================== ttm_pool_alloc_basic  ===================
[09:41:42] [PASSED] One page
[09:41:42] [PASSED] More than one page
[09:41:42] [PASSED] Above the allocation limit
[09:41:42] [PASSED] One page, with coherent DMA mappings enabled
[09:41:42] [PASSED] Above the allocation limit, with coherent DMA mappings enabled
[09:41:42] ============== [PASSED] ttm_pool_alloc_basic ===============
[09:41:42] ============== ttm_pool_alloc_basic_dma_addr  ==============
[09:41:42] [PASSED] One page
[09:41:42] [PASSED] More than one page
[09:41:42] [PASSED] Above the allocation limit
[09:41:42] [PASSED] One page, with coherent DMA mappings enabled
[09:41:42] [PASSED] Above the allocation limit, with coherent DMA mappings enabled
[09:41:42] ========== [PASSED] ttm_pool_alloc_basic_dma_addr ==========
[09:41:42] [PASSED] ttm_pool_alloc_order_caching_match
[09:41:42] [PASSED] ttm_pool_alloc_caching_mismatch
[09:41:42] [PASSED] ttm_pool_alloc_order_mismatch
[09:41:42] [PASSED] ttm_pool_free_dma_alloc
[09:41:42] [PASSED] ttm_pool_free_no_dma_alloc
[09:41:42] [PASSED] ttm_pool_fini_basic
[09:41:42] ==================== [PASSED] ttm_pool =====================
[09:41:42] ================ ttm_resource (8 subtests) =================
[09:41:42] ================= ttm_resource_init_basic  =================
[09:41:42] [PASSED] Init resource in TTM_PL_SYSTEM
[09:41:42] [PASSED] Init resource in TTM_PL_VRAM
[09:41:42] [PASSED] Init resource in a private placement
[09:41:42] [PASSED] Init resource in TTM_PL_SYSTEM, set placement flags
[09:41:42] ============= [PASSED] ttm_resource_init_basic =============
[09:41:42] [PASSED] ttm_resource_init_pinned
[09:41:42] [PASSED] ttm_resource_fini_basic
[09:41:42] [PASSED] ttm_resource_manager_init_basic
[09:41:42] [PASSED] ttm_resource_manager_usage_basic
[09:41:42] [PASSED] ttm_resource_manager_set_used_basic
[09:41:42] [PASSED] ttm_sys_man_alloc_basic
[09:41:42] [PASSED] ttm_sys_man_free_basic
[09:41:42] ================== [PASSED] ttm_resource ===================
[09:41:42] =================== ttm_tt (15 subtests) ===================
[09:41:42] ==================== ttm_tt_init_basic  ====================
[09:41:42] [PASSED] Page-aligned size
[09:41:42] [PASSED] Extra pages requested
[09:41:42] ================ [PASSED] ttm_tt_init_basic ================
[09:41:42] [PASSED] ttm_tt_init_misaligned
[09:41:42] [PASSED] ttm_tt_fini_basic
[09:41:42] [PASSED] ttm_tt_fini_sg
[09:41:42] [PASSED] ttm_tt_fini_shmem
[09:41:42] [PASSED] ttm_tt_create_basic
[09:41:42] [PASSED] ttm_tt_create_invalid_bo_type
[09:41:42] [PASSED] ttm_tt_create_ttm_exists
[09:41:42] [PASSED] ttm_tt_create_failed
[09:41:42] [PASSED] ttm_tt_destroy_basic
[09:41:42] [PASSED] ttm_tt_populate_null_ttm
[09:41:42] [PASSED] ttm_tt_populate_populated_ttm
[09:41:42] [PASSED] ttm_tt_unpopulate_basic
[09:41:42] [PASSED] ttm_tt_unpopulate_empty_ttm
[09:41:42] [PASSED] ttm_tt_swapin_basic
[09:41:42] ===================== [PASSED] ttm_tt ======================
[09:41:42] =================== ttm_bo (14 subtests) ===================
[09:41:42] =========== ttm_bo_reserve_optimistic_no_ticket  ===========
[09:41:42] [PASSED] Cannot be interrupted and sleeps
[09:41:42] [PASSED] Cannot be interrupted, locks straight away
[09:41:42] [PASSED] Can be interrupted, sleeps
[09:41:42] ======= [PASSED] ttm_bo_reserve_optimistic_no_ticket =======
[09:41:42] [PASSED] ttm_bo_reserve_locked_no_sleep
[09:41:42] [PASSED] ttm_bo_reserve_no_wait_ticket
[09:41:42] [PASSED] ttm_bo_reserve_double_resv
[09:41:42] [PASSED] ttm_bo_reserve_interrupted
[09:41:42] [PASSED] ttm_bo_reserve_deadlock
[09:41:42] [PASSED] ttm_bo_unreserve_basic
[09:41:42] [PASSED] ttm_bo_unreserve_pinned
[09:41:42] [PASSED] ttm_bo_unreserve_bulk
[09:41:42] [PASSED] ttm_bo_fini_basic
[09:41:42] [PASSED] ttm_bo_fini_shared_resv
[09:41:42] [PASSED] ttm_bo_pin_basic
[09:41:42] [PASSED] ttm_bo_pin_unpin_resource
[09:41:42] [PASSED] ttm_bo_multiple_pin_one_unpin
[09:41:42] ===================== [PASSED] ttm_bo ======================
[09:41:42] ============== ttm_bo_validate (21 subtests) ===============
[09:41:42] ============== ttm_bo_init_reserved_sys_man  ===============
[09:41:42] [PASSED] Buffer object for userspace
[09:41:42] [PASSED] Kernel buffer object
[09:41:42] [PASSED] Shared buffer object
[09:41:42] ========== [PASSED] ttm_bo_init_reserved_sys_man ===========
[09:41:42] ============== ttm_bo_init_reserved_mock_man  ==============
[09:41:42] [PASSED] Buffer object for userspace
[09:41:42] [PASSED] Kernel buffer object
[09:41:42] [PASSED] Shared buffer object
[09:41:42] ========== [PASSED] ttm_bo_init_reserved_mock_man ==========
[09:41:42] [PASSED] ttm_bo_init_reserved_resv
[09:41:42] ================== ttm_bo_validate_basic  ==================
[09:41:42] [PASSED] Buffer object for userspace
[09:41:42] [PASSED] Kernel buffer object
[09:41:42] [PASSED] Shared buffer object
[09:41:42] ============== [PASSED] ttm_bo_validate_basic ==============
[09:41:42] [PASSED] ttm_bo_validate_invalid_placement
[09:41:42] ============= ttm_bo_validate_same_placement  ==============
[09:41:42] [PASSED] System manager
[09:41:42] [PASSED] VRAM manager
[09:41:42] ========= [PASSED] ttm_bo_validate_same_placement ==========
[09:41:42] [PASSED] ttm_bo_validate_failed_alloc
[09:41:42] [PASSED] ttm_bo_validate_pinned
[09:41:42] [PASSED] ttm_bo_validate_busy_placement
[09:41:42] ================ ttm_bo_validate_multihop  =================
[09:41:42] [PASSED] Buffer object for userspace
[09:41:42] [PASSED] Kernel buffer object
[09:41:42] [PASSED] Shared buffer object
[09:41:42] ============ [PASSED] ttm_bo_validate_multihop =============
[09:41:42] ========== ttm_bo_validate_no_placement_signaled  ==========
[09:41:42] [PASSED] Buffer object in system domain, no page vector
[09:41:42] [PASSED] Buffer object in system domain with an existing page vector
[09:41:42] ====== [PASSED] ttm_bo_validate_no_placement_signaled ======
[09:41:42] ======== ttm_bo_validate_no_placement_not_signaled  ========
[09:41:42] [PASSED] Buffer object for userspace
[09:41:42] [PASSED] Kernel buffer object
[09:41:42] [PASSED] Shared buffer object
[09:41:42] ==== [PASSED] ttm_bo_validate_no_placement_not_signaled ====
[09:41:42] [PASSED] ttm_bo_validate_move_fence_signaled
[09:41:42] ========= ttm_bo_validate_move_fence_not_signaled  =========
[09:41:42] [PASSED] Waits for GPU
[09:41:42] [PASSED] Tries to lock straight away
[09:41:42] ===== [PASSED] ttm_bo_validate_move_fence_not_signaled =====
[09:41:42] [PASSED] ttm_bo_validate_happy_evict
[09:41:42] [PASSED] ttm_bo_validate_all_pinned_evict
[09:41:42] [PASSED] ttm_bo_validate_allowed_only_evict
[09:41:42] [PASSED] ttm_bo_validate_deleted_evict
[09:41:42] [PASSED] ttm_bo_validate_busy_domain_evict
[09:41:42] [PASSED] ttm_bo_validate_evict_gutting
[09:41:42] [PASSED] ttm_bo_validate_recrusive_evict
stty: 'standard input': Inappropriate ioctl for device
[09:41:42] ================= [PASSED] ttm_bo_validate =================
[09:41:42] ============================================================
[09:41:42] Testing complete. Ran 101 tests: passed: 101
[09:41:42] Elapsed time: 11.252s total, 1.698s configuring, 9.339s building, 0.186s running

+ cleanup
++ stat -c %u:%g /kernel
+ chown -R 1003:1003 /kernel




* ✗ CI.checksparse: warning for Introduce DRM_RAS using generic netlink for RAS (rev3)
  2025-12-05  8:39 [PATCH v3 0/4] Introduce DRM_RAS using generic netlink for RAS Riana Tauro
                   ` (5 preceding siblings ...)
  2025-12-05  9:41 ` ✓ CI.KUnit: success " Patchwork
@ 2025-12-05  9:56 ` Patchwork
  2025-12-05 11:27 ` ✗ Xe.CI.Full: failure " Patchwork
  2025-12-09 21:56 ` [PATCH v3 0/4] Introduce DRM_RAS using generic netlink for RAS Alex Deucher
  8 siblings, 0 replies; 31+ messages in thread
From: Patchwork @ 2025-12-05  9:56 UTC (permalink / raw)
  To: Rodrigo Vivi; +Cc: intel-xe

== Series Details ==

Series: Introduce DRM_RAS using generic netlink for RAS (rev3)
URL   : https://patchwork.freedesktop.org/series/155188/
State : warning

== Summary ==

+ trap cleanup EXIT
+ KERNEL=/kernel
+ MT=/root/linux/maintainer-tools
+ git clone https://gitlab.freedesktop.org/drm/maintainer-tools /root/linux/maintainer-tools
Cloning into '/root/linux/maintainer-tools'...
warning: redirecting to https://gitlab.freedesktop.org/drm/maintainer-tools.git/
+ make -C /root/linux/maintainer-tools
make: Entering directory '/root/linux/maintainer-tools'
cc -O2 -g -Wextra -o remap-log remap-log.c
make: Leaving directory '/root/linux/maintainer-tools'
+ cd /kernel
+ git config --global --add safe.directory /kernel
+ /root/linux/maintainer-tools/dim sparse --fast 0949b969da10a9fc389d255f398653eb9bc4ffaf
Sparse version: 0.6.4 (Ubuntu: 0.6.4-4ubuntu3)
Fast mode used, each commit won't be checked separately.
-
+drivers/gpu/drm/drm_drv.c:450:6: warning: context imbalance in 'drm_dev_enter' - different lock contexts for basic block
+drivers/gpu/drm/drm_drv.c: note: in included file (through include/linux/notifier.h, arch/x86/include/asm/uprobes.h, include/linux/uprobes.h, include/linux/mm_types.h, include/linux/mmzone.h, include/linux/gfp.h, ...):
+./include/linux/srcu.h:389:9: warning: context imbalance in 'drm_dev_exit' - unexpected unlock

+ cleanup
++ stat -c %u:%g /kernel
+ chown -R 1003:1003 /kernel




* ✗ Xe.CI.Full: failure for Introduce DRM_RAS using generic netlink for RAS (rev3)
  2025-12-05  8:39 [PATCH v3 0/4] Introduce DRM_RAS using generic netlink for RAS Riana Tauro
                   ` (6 preceding siblings ...)
  2025-12-05  9:56 ` ✗ CI.checksparse: warning " Patchwork
@ 2025-12-05 11:27 ` Patchwork
  2025-12-09 21:56 ` [PATCH v3 0/4] Introduce DRM_RAS using generic netlink for RAS Alex Deucher
  8 siblings, 0 replies; 31+ messages in thread
From: Patchwork @ 2025-12-05 11:27 UTC (permalink / raw)
  To: Rodrigo Vivi; +Cc: intel-xe

== Series Details ==

Series: Introduce DRM_RAS using generic netlink for RAS (rev3)
URL   : https://patchwork.freedesktop.org/series/155188/
State : failure

== Summary ==

ERROR: The runconfig 'xe-4197-7954eb633a41ad5f2012477cdc245ec0cf07e7a2_FULL' does not exist in the database

== Logs ==

For more details see: https://intel-gfx-ci.01.org/tree/intel-xe/xe-pw-155188v3/index.html


* Re: [PATCH v3 0/4] Introduce DRM_RAS using generic netlink for RAS
  2025-12-05  8:39 [PATCH v3 0/4] Introduce DRM_RAS using generic netlink for RAS Riana Tauro
                   ` (7 preceding siblings ...)
  2025-12-05 11:27 ` ✗ Xe.CI.Full: failure " Patchwork
@ 2025-12-09 21:56 ` Alex Deucher
  8 siblings, 0 replies; 31+ messages in thread
From: Alex Deucher @ 2025-12-09 21:56 UTC (permalink / raw)
  To: Riana Tauro, Hawking Zhang, Lazar, Lijo
  Cc: intel-xe, dri-devel, aravind.iddamsetty, anshuman.gupta,
	rodrigo.vivi, joonas.lahtinen, lukas, simona.vetter, airlied,
	pratik.bari, joshua.santosh.ranjan, ashwin.kumar.kulkarni,
	shubham.kumar

+ Hawking, Lijo to help review.

On Fri, Dec 5, 2025 at 3:19 AM Riana Tauro <riana.tauro@intel.com> wrote:
>
> This work is a continuation of the great work started by Aravind ([1] and [2])
> in order to fulfill the RAS requirements and proposal as previously discussed
> and agreed in the Linux Plumbers accelerators BoF of 2022 [3].
>
> [1]: https://lore.kernel.org/dri-devel/20250730064956.1385855-1-aravind.iddamsetty@linux.intel.com/
> [2]: https://lore.kernel.org/all/4cbdfcc5-5020-a942-740e-a602d4c00cc2@linux.intel.com/
> [3]: https://airlied.blogspot.com/2022/09/accelerators-bof-outcomes-summary.html
>
> During the past review round, Lukas pointed out that netlink had evolved
> in parallel during these years and that now, any new usage of netlink families
> would require the usage of the YAML description and scripts.
>
> With this new requirement in place, the family name is hardcoded in the YAML file,
> so we are forced to have a single family name for the entire drm, and in turn
> we are forced to have a registration.
>
> So, while doing the registration, we now created the concept of drm-ras-node.
> For now the only node type supported is the agreed error-counter. But that could
> be expanded for other cases like telemetry, requested by Zack for the qualcomm accel
> driver.
>
> In this first version, only querying counters is supported. But this is also expandable
> to the future introduction of multicast notification and clearing of the counters.
>
> This design with multiple nodes per device is already flexible enough for a driver
> to decide whether it wants to handle errors per device, per IP block, or per error
> category. I believe this fully addresses the AMD feedback requested in the earlier
> reviews.
>
> So, my proposal is to start simple with this case as is, and then iterate over
> with the drm-ras in tree so we evolve together according to various driver's RAS
> needs.
>
> I have provided a documentation and the first Xe implementation of the counter
> as reference.
>
> Also, it is worth mentioning that we have an in-tree pyynl/cli.py tool that entirely
> exercises this new API, hence I hope this can be the reference code for the uAPI
> usage, while we continue with the plan of introducing IGT tests and tools for this
> and adjusting the internal vendor tools to open with open source developments and
> changing them to support these flows.
>
> Example:
>
> $ sudo ynl --family drm_ras  --dump list-nodes
> [{'device-name': '0000:03:00.0',
>   'node-id': 0,
>   'node-name': 'correctable-errors',
>   'node-type': 'error-counter'},
>  {'device-name': '0000:03:00.0',
>   'node-id': 1,
>   'node-name': 'nonfatal-errors',
>   'node-type': 'error-counter'},
>  {'device-name': '0000:03:00.0',
>   'node-id': 2,
>   'node-name': 'fatal-errors',
>   'node-type': 'error-counter'}]
>
> $ sudo ynl --family drm_ras  --dump get-error-counters --json '{"node-id":1}'
> [{'error-id': 1, 'error-name': 'Core Compute Error', 'error-value': 0},
>  {'error-id': 2, 'error-name': 'SOC Internal Error', 'error-value': 0}]
>
> $ sudo ynl --family drm_ras --do query-error-counter  --json '{"node-id":1, "error-id":1}'
> {'error-id': 1, 'error-name': 'Core Compute Error', 'error-value': 0}
>
> IGT : https://patchwork.freedesktop.org/patch/689729/?series=157409&rev=3
>
> Rev2: Fix review comments
>       Add support for GT and SOC errors
>
> Rev3: Add uAPI for errors and nodes
>       Update documentation
>
>
> Riana Tauro (3):
>   drm/xe/xe_drm_ras: Add support for drm ras
>   drm/xe/xe_hw_error: Add support for GT hardware errors
>   drm/xe/xe_hw_error: Add support for PVC SOC errors
>
> Rodrigo Vivi (1):
>   drm/ras: Introduce the DRM RAS infrastructure over generic netlink
>
>  Documentation/gpu/drm-ras.rst              | 109 +++++
>  Documentation/gpu/index.rst                |   1 +
>  Documentation/netlink/specs/drm_ras.yaml   | 130 ++++++
>  drivers/gpu/drm/Kconfig                    |   9 +
>  drivers/gpu/drm/Makefile                   |   1 +
>  drivers/gpu/drm/drm_drv.c                  |   6 +
>  drivers/gpu/drm/drm_ras.c                  | 351 ++++++++++++++++
>  drivers/gpu/drm/drm_ras_genl_family.c      |  42 ++
>  drivers/gpu/drm/drm_ras_nl.c               |  54 +++
>  drivers/gpu/drm/xe/Makefile                |   1 +
>  drivers/gpu/drm/xe/regs/xe_hw_error_regs.h |  68 ++++
>  drivers/gpu/drm/xe/xe_device_types.h       |   4 +
>  drivers/gpu/drm/xe/xe_drm_ras.c            | 199 +++++++++
>  drivers/gpu/drm/xe/xe_drm_ras.h            |  12 +
>  drivers/gpu/drm/xe/xe_drm_ras_types.h      |  40 ++
>  drivers/gpu/drm/xe/xe_hw_error.c           | 444 +++++++++++++++++++--
>  include/drm/drm_ras.h                      |  76 ++++
>  include/drm/drm_ras_genl_family.h          |  17 +
>  include/drm/drm_ras_nl.h                   |  24 ++
>  include/uapi/drm/drm_ras.h                 |  49 +++
>  include/uapi/drm/xe_drm.h                  |  82 ++++
>  21 files changed, 1682 insertions(+), 37 deletions(-)
>  create mode 100644 Documentation/gpu/drm-ras.rst
>  create mode 100644 Documentation/netlink/specs/drm_ras.yaml
>  create mode 100644 drivers/gpu/drm/drm_ras.c
>  create mode 100644 drivers/gpu/drm/drm_ras_genl_family.c
>  create mode 100644 drivers/gpu/drm/drm_ras_nl.c
>  create mode 100644 drivers/gpu/drm/xe/xe_drm_ras.c
>  create mode 100644 drivers/gpu/drm/xe/xe_drm_ras.h
>  create mode 100644 drivers/gpu/drm/xe/xe_drm_ras_types.h
>  create mode 100644 include/drm/drm_ras.h
>  create mode 100644 include/drm/drm_ras_genl_family.h
>  create mode 100644 include/drm/drm_ras_nl.h
>  create mode 100644 include/uapi/drm/drm_ras.h
>
> --
> 2.47.1
>
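
For readers following the quoted example, here is a minimal sketch of how a script might consume that output. It assumes the dump is printed as a Python literal, as shown in the cover letter, and it only prepares the `--json` argument for `get-error-counters`; it does not talk to netlink itself. The helper names (`parse_nodes`, `node_id_for`, `counters_request`) are hypothetical and are not part of ynl or of this series.

```python
import ast
import json

# Output copied from the `--dump list-nodes` example in the cover letter.
LIST_NODES_OUTPUT = """
[{'device-name': '0000:03:00.0',
  'node-id': 0,
  'node-name': 'correctable-errors',
  'node-type': 'error-counter'},
 {'device-name': '0000:03:00.0',
  'node-id': 1,
  'node-name': 'nonfatal-errors',
  'node-type': 'error-counter'},
 {'device-name': '0000:03:00.0',
  'node-id': 2,
  'node-name': 'fatal-errors',
  'node-type': 'error-counter'}]
"""


def parse_nodes(text):
    """Parse the Python-literal list printed by the dump (hypothetical helper)."""
    return ast.literal_eval(text.strip())


def node_id_for(nodes, device_name, node_name):
    """Return the node-id registered under the given device and node name."""
    for node in nodes:
        if node["device-name"] == device_name and node["node-name"] == node_name:
            return node["node-id"]
    raise KeyError(f"no node {node_name!r} on device {device_name!r}")


def counters_request(node_id):
    """Build the JSON string passed via --json to get-error-counters."""
    return json.dumps({"node-id": node_id})


nodes = parse_nodes(LIST_NODES_OUTPUT)
node_id = node_id_for(nodes, "0000:03:00.0", "nonfatal-errors")
request = counters_request(node_id)
print(request)
```

The resulting string could then be passed as the `--json` argument in the second command of the quoted example. Looking the node up by name rather than hardcoding the id is a design choice made here on the assumption that ids are assigned at registration time and may differ between systems.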


Thread overview: 31+ messages
2025-12-05  8:39 [PATCH v3 0/4] Introduce DRM_RAS using generic netlink for RAS Riana Tauro
2025-12-05  8:39 ` [PATCH v3 1/4] drm/ras: Introduce the DRM RAS infrastructure over generic netlink Riana Tauro
2025-12-09 21:35   ` Rodrigo Vivi
2026-01-08 22:36     ` Zack McKevitt
2026-01-09 20:57       ` Rodrigo Vivi
2026-01-13  8:20         ` Riana Tauro
2026-01-15 23:39           ` Zack McKevitt
2026-01-16  5:56             ` Riana Tauro
2026-01-16 20:26               ` Rodrigo Vivi
2025-12-05  8:39 ` [PATCH v3 2/4] drm/xe/xe_drm_ras: Add support for drm ras Riana Tauro
2025-12-09  8:22   ` Raag Jadav
2026-01-09  8:08     ` Riana Tauro
2026-01-09 14:13       ` Rodrigo Vivi
2026-01-09 15:58         ` Raag Jadav
2026-01-12  6:13           ` Riana Tauro
2026-01-12 10:27             ` Raag Jadav
2025-12-09 21:57   ` Rodrigo Vivi
2026-01-07  9:48     ` Aravind Iddamsetty
2025-12-05  8:39 ` [PATCH v3 3/4] drm/xe/xe_hw_error: Add support for GT hardware errors Riana Tauro
2025-12-10 18:18   ` Raag Jadav
2026-01-12  3:41     ` Riana Tauro
2026-01-12 10:02       ` Raag Jadav
2025-12-05  8:39 ` [PATCH v3 4/4] drm/xe/xe_hw_error: Add support for PVC SOC errors Riana Tauro
2025-12-15 10:52   ` Raag Jadav
2026-01-12  4:45     ` Riana Tauro
2026-01-12 10:06       ` Raag Jadav
2025-12-05  9:40 ` ✗ CI.checkpatch: warning for Introduce DRM_RAS using generic netlink for RAS (rev3) Patchwork
2025-12-05  9:41 ` ✓ CI.KUnit: success " Patchwork
2025-12-05  9:56 ` ✗ CI.checksparse: warning " Patchwork
2025-12-05 11:27 ` ✗ Xe.CI.Full: failure " Patchwork
2025-12-09 21:56 ` [PATCH v3 0/4] Introduce DRM_RAS using generic netlink for RAS Alex Deucher
