public inbox for intel-xe@lists.freedesktop.org
* [PATCH v5 0/5] Introduce DRM_RAS using generic netlink for RAS
@ 2026-02-02  6:43 Riana Tauro
  2026-02-02  6:43 ` [PATCH v5 1/5] drm/ras: Introduce the DRM RAS infrastructure over generic netlink Riana Tauro
                   ` (8 more replies)
  0 siblings, 9 replies; 24+ messages in thread
From: Riana Tauro @ 2026-02-02  6:43 UTC (permalink / raw)
  To: intel-xe, dri-devel
  Cc: aravind.iddamsetty, anshuman.gupta, rodrigo.vivi, joonas.lahtinen,
	simona.vetter, airlied, pratik.bari, joshua.santosh.ranjan,
	ashwin.kumar.kulkarni, shubham.kumar, ravi.kishore.koppuravuri,
	raag.jadav, Riana Tauro

This work is a continuation of the great work started by Aravind ([1] and [2])
to fulfill the RAS requirements and proposal previously discussed and agreed
on at the Linux Plumbers accelerators BoF in 2022 [3].

[1]: https://lore.kernel.org/dri-devel/20250730064956.1385855-1-aravind.iddamsetty@linux.intel.com/
[2]: https://lore.kernel.org/all/4cbdfcc5-5020-a942-740e-a602d4c00cc2@linux.intel.com/
[3]: https://airlied.blogspot.com/2022/09/accelerators-bof-outcomes-summary.html

During the last review round, Lukas pointed out that netlink had evolved
in parallel over these years and that any new netlink family is now
required to use the YAML description and the scripts that go with it.

With this requirement in place, the family name is hardcoded in the YAML file,
so we are forced to have a single family name for the entire drm, which in
turn forces us to have a registration step.

So, while doing the registration, we created the concept of a drm-ras node.
For now the only node type supported is the agreed error-counter. But that could
be expanded to other cases such as telemetry, requested by Zack for the Qualcomm
accel driver.

In this first version, only querying counters is supported. It can be extended
later with multicast notifications and with clearing of the counters.

This design with multiple nodes per device is flexible enough for a driver
to decide whether it wants to handle errors per device, per IP block, or per
error category. I believe this fully addresses the AMD feedback from the earlier
reviews.
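
As a rough illustration of that flexibility, the toy model below (Python, not
kernel code) shows one device registering two nodes and a per-node counter
lookup. The class and method names are hypothetical and only mirror the
concepts in this series: dynamically assigned node IDs, a first/last error ID
range, and skipping of IDs that the driver does not implement.

```python
import errno
from dataclasses import dataclass
from itertools import count


@dataclass
class RasNode:
    """Toy stand-in for a drm-ras error-counter node (hypothetical model)."""
    device_name: str
    node_name: str
    first: int          # first valid error ID
    last: int           # last valid error ID
    counters: dict      # error ID -> (name, value); gaps are allowed
    node_id: int = -1   # assigned at registration


class RasRegistry:
    """Toy stand-in for the global node table; IDs are handed out in order."""

    def __init__(self):
        self._nodes = {}
        self._next_id = count()

    def register(self, node):
        node.node_id = next(self._next_id)
        self._nodes[node.node_id] = node
        return node.node_id

    def list_nodes(self):
        return [(n.node_id, n.device_name, n.node_name)
                for n in self._nodes.values()]

    def query(self, node_id, error_id):
        """Return (name, value), or a negative errno value on failure."""
        node = self._nodes.get(node_id)
        if node is None:
            return -errno.ENOENT
        if not node.first <= error_id <= node.last:
            return -errno.EINVAL
        # An ID inside the range that the driver does not implement.
        return node.counters.get(error_id, -errno.ENOENT)

    def get_error_counters(self, node_id):
        """List every implemented counter, skipping holes in the range."""
        node = self._nodes[node_id]
        out = []
        for error_id in range(node.first, node.last + 1):
            res = self.query(node_id, error_id)
            if res == -errno.ENOENT:
                continue
            out.append((error_id, *res))
        return out
```

A driver that prefers per-IP-block reporting would simply register more nodes
in this model, one per block, without any change to the query path.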

So, my proposal is to start simple with this case as is, and then iterate
with drm-ras in tree so that it evolves according to the RAS needs of the
various drivers.

I have provided documentation and the first Xe implementation of the counters
as a reference.

Also, it is worth mentioning that the in-tree pyynl/cli.py tool fully
exercises this new API, so I hope it can serve as the reference code for the uAPI
usage while we continue with the plan of introducing IGT tests and tools for this,
and of adapting the internal vendor tools to these flows in step with the open
source developments.

Example:

List Nodes:

$ sudo ynl --family drm_ras --dump list-nodes
[{'device-name': '0000:03:00.0',
  'node-id': 0,
  'node-name': 'correctable-errors',
  'node-type': 'error-counter'},
 {'device-name': '0000:03:00.0',
  'node-id': 1,
  'node-name': 'uncorrectable-errors',
  'node-type': 'error-counter'}]

Get Error counters:

$ sudo ynl --family drm_ras  --dump get-error-counters --json '{"node-id":1}'
[{'error-id': 1, 'error-name': 'core-compute', 'error-value': 0},
 {'error-id': 2, 'error-name': 'soc-internal', 'error-value': 0}]

Query Error counter:

$ sudo ynl --family drm_ras --do query-error-counter  --json '{"node-id":1, "error-id":2}'
{'error-id': 2, 'error-name': 'soc-internal', 'error-value': 0}
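
For monitoring scripts, replies shaped like the ones above can be
post-processed with plain Python. The sketch below is only illustrative: it
assumes the replies have already been obtained (for example through the ynl
tool) as lists of dictionaries with the keys shown above, and the helper names
are hypothetical.

```python
def nonzero_counters(counters):
    """Return {error-name: error-value} for counters that have fired."""
    return {c["error-name"]: c["error-value"]
            for c in counters if c["error-value"] > 0}


def summarize(nodes, counters_by_node):
    """Map (device-name, node-name) to the total error count of that node.

    nodes: list of dicts shaped like a list-nodes reply.
    counters_by_node: {node-id: list of dicts shaped like a
                       get-error-counters reply for that node}.
    """
    summary = {}
    for node in nodes:
        counters = counters_by_node.get(node["node-id"], [])
        key = (node["device-name"], node["node-name"])
        summary[key] = sum(c["error-value"] for c in counters)
    return summary
```

Keying the summary on the device and node names rather than on the node ID
keeps the result meaningful across reloads, since node IDs are assigned
dynamically at registration.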


IGT : https://patchwork.freedesktop.org/patch/689729/?series=157409&rev=3

Rev2: Fix review comments
      Add support for GT and SOC errors

Rev3: Add uAPI for errors and nodes
      Update documentation

Rev4: Use only correctable and uncorrectable error nodes
      use REG_BIT
      remove redundant error strings

Rev5: Split patch 2
      use atomic_t
      fix memory leaks
      fix logs
      fix hook failure
      change component and severity UAPI


Riana Tauro (4):
  drm/xe/xe_drm_ras: Add support for drm ras
  drm/xe/xe_hw_error: Integrate DRM RAS with hardware error handling
  drm/xe/xe_hw_error: Add support for Core-Compute errors
  drm/xe/xe_hw_error: Add support for PVC SOC errors

Rodrigo Vivi (1):
  drm/ras: Introduce the DRM RAS infrastructure over generic netlink

 Documentation/gpu/drm-ras.rst              | 109 +++++
 Documentation/gpu/index.rst                |   1 +
 Documentation/netlink/specs/drm_ras.yaml   | 130 ++++++
 drivers/gpu/drm/Kconfig                    |   9 +
 drivers/gpu/drm/Makefile                   |   1 +
 drivers/gpu/drm/drm_drv.c                  |   6 +
 drivers/gpu/drm/drm_ras.c                  | 351 +++++++++++++++
 drivers/gpu/drm/drm_ras_genl_family.c      |  42 ++
 drivers/gpu/drm/drm_ras_nl.c               |  54 +++
 drivers/gpu/drm/xe/Makefile                |   1 +
 drivers/gpu/drm/xe/regs/xe_hw_error_regs.h |  86 +++-
 drivers/gpu/drm/xe/xe_device_types.h       |   4 +
 drivers/gpu/drm/xe/xe_drm_ras.c            | 184 ++++++++
 drivers/gpu/drm/xe/xe_drm_ras.h            |  15 +
 drivers/gpu/drm/xe/xe_drm_ras_types.h      |  48 +++
 drivers/gpu/drm/xe/xe_hw_error.c           | 474 +++++++++++++++++++--
 include/drm/drm_ras.h                      |  76 ++++
 include/drm/drm_ras_genl_family.h          |  17 +
 include/drm/drm_ras_nl.h                   |  24 ++
 include/uapi/drm/drm_ras.h                 |  49 +++
 include/uapi/drm/xe_drm.h                  |  79 ++++
 21 files changed, 1718 insertions(+), 42 deletions(-)
 create mode 100644 Documentation/gpu/drm-ras.rst
 create mode 100644 Documentation/netlink/specs/drm_ras.yaml
 create mode 100644 drivers/gpu/drm/drm_ras.c
 create mode 100644 drivers/gpu/drm/drm_ras_genl_family.c
 create mode 100644 drivers/gpu/drm/drm_ras_nl.c
 create mode 100644 drivers/gpu/drm/xe/xe_drm_ras.c
 create mode 100644 drivers/gpu/drm/xe/xe_drm_ras.h
 create mode 100644 drivers/gpu/drm/xe/xe_drm_ras_types.h
 create mode 100644 include/drm/drm_ras.h
 create mode 100644 include/drm/drm_ras_genl_family.h
 create mode 100644 include/drm/drm_ras_nl.h
 create mode 100644 include/uapi/drm/drm_ras.h

-- 
2.47.1



* [PATCH v5 1/5] drm/ras: Introduce the DRM RAS infrastructure over generic netlink
  2026-02-02  6:43 [PATCH v5 0/5] Introduce DRM_RAS using generic netlink for RAS Riana Tauro
@ 2026-02-02  6:43 ` Riana Tauro
  2026-02-02 10:08   ` kernel test robot
  2026-02-02 22:52   ` kernel test robot
  2026-02-02  6:43 ` [PATCH v5 2/5] drm/xe/xe_drm_ras: Add support for XE DRM RAS Riana Tauro
                   ` (7 subsequent siblings)
  8 siblings, 2 replies; 24+ messages in thread
From: Riana Tauro @ 2026-02-02  6:43 UTC (permalink / raw)
  To: intel-xe, dri-devel
  Cc: aravind.iddamsetty, anshuman.gupta, rodrigo.vivi, joonas.lahtinen,
	simona.vetter, airlied, pratik.bari, joshua.santosh.ranjan,
	ashwin.kumar.kulkarni, shubham.kumar, ravi.kishore.koppuravuri,
	raag.jadav, Zack McKevitt, Lijo Lazar, Hawking Zhang,
	Jakub Kicinski, David S. Miller, Paolo Abeni, Eric Dumazet,
	netdev, Riana Tauro

From: Rodrigo Vivi <rodrigo.vivi@intel.com>

Introduces the DRM RAS infrastructure over generic netlink.

The new interface allows drivers to expose RAS nodes and their
associated error counters to userspace in a structured and extensible
way. Each drm_ras node can register its own set of error counters, which
are then discoverable and queryable through netlink operations. This
lays the groundwork for reporting and managing hardware error states
in a unified manner across different DRM drivers.

Currently it only supports error-counter nodes, but it can be
extended later.

The registration is also not tied to any drm node, so it can be
used by accel devices as well.

It uses the new and mandatory YAML description format stored in
Documentation/netlink/specs/. This forces a single generic netlink
family namespace for the entire drm: "drm-ras".
However, multiple endpoints are supported within that single family.

Any modification to this API needs to be applied to
Documentation/netlink/specs/drm_ras.yaml before regenerating the
code:

$ tools/net/ynl/pyynl/ynl_gen_c.py --spec \
 Documentation/netlink/specs/drm_ras.yaml --mode uapi --header \
 > include/uapi/drm/drm_ras.h

$ tools/net/ynl/pyynl/ynl_gen_c.py --spec \
 Documentation/netlink/specs/drm_ras.yaml --mode kernel --header \
 > include/drm/drm_ras_nl.h

$ tools/net/ynl/pyynl/ynl_gen_c.py --spec \
 Documentation/netlink/specs/drm_ras.yaml --mode kernel --source \
 > drivers/gpu/drm/drm_ras_nl.c

Cc: Zack McKevitt <zachary.mckevitt@oss.qualcomm.com>
Cc: Lijo Lazar <lijo.lazar@amd.com>
Cc: Hawking Zhang <Hawking.Zhang@amd.com>
Cc: Jakub Kicinski <kuba@kernel.org>
Cc: David S. Miller <davem@davemloft.net>
Cc: Paolo Abeni <pabeni@redhat.com>
Cc: Eric Dumazet <edumazet@google.com>
Cc: netdev@vger.kernel.org
Co-developed-by: Aravind Iddamsetty <aravind.iddamsetty@linux.intel.com>
Signed-off-by: Aravind Iddamsetty <aravind.iddamsetty@linux.intel.com>
Signed-off-by: Rodrigo Vivi <rodrigo.vivi@intel.com>
Signed-off-by: Riana Tauro <riana.tauro@intel.com>
Reviewed-by: Zack McKevitt <zachary.mckevitt@oss.qualcomm.com>
---
 Documentation/gpu/drm-ras.rst            | 109 +++++++
 Documentation/gpu/index.rst              |   1 +
 Documentation/netlink/specs/drm_ras.yaml | 130 +++++++++
 drivers/gpu/drm/Kconfig                  |   9 +
 drivers/gpu/drm/Makefile                 |   1 +
 drivers/gpu/drm/drm_drv.c                |   6 +
 drivers/gpu/drm/drm_ras.c                | 351 +++++++++++++++++++++++
 drivers/gpu/drm/drm_ras_genl_family.c    |  42 +++
 drivers/gpu/drm/drm_ras_nl.c             |  54 ++++
 include/drm/drm_ras.h                    |  76 +++++
 include/drm/drm_ras_genl_family.h        |  17 ++
 include/drm/drm_ras_nl.h                 |  24 ++
 include/uapi/drm/drm_ras.h               |  49 ++++
 13 files changed, 869 insertions(+)
 create mode 100644 Documentation/gpu/drm-ras.rst
 create mode 100644 Documentation/netlink/specs/drm_ras.yaml
 create mode 100644 drivers/gpu/drm/drm_ras.c
 create mode 100644 drivers/gpu/drm/drm_ras_genl_family.c
 create mode 100644 drivers/gpu/drm/drm_ras_nl.c
 create mode 100644 include/drm/drm_ras.h
 create mode 100644 include/drm/drm_ras_genl_family.h
 create mode 100644 include/drm/drm_ras_nl.h
 create mode 100644 include/uapi/drm/drm_ras.h

diff --git a/Documentation/gpu/drm-ras.rst b/Documentation/gpu/drm-ras.rst
new file mode 100644
index 000000000000..cec60cf5d17d
--- /dev/null
+++ b/Documentation/gpu/drm-ras.rst
@@ -0,0 +1,109 @@
+.. SPDX-License-Identifier: GPL-2.0+
+
+============================
+DRM RAS over Generic Netlink
+============================
+
+The DRM RAS (Reliability, Availability, Serviceability) interface provides a
+standardized way for GPU/accelerator drivers to expose error counters and
+other reliability nodes to user space via Generic Netlink. This allows
+diagnostic tools, monitoring daemons, or test infrastructure to query hardware
+health in a uniform way across different DRM drivers.
+
+Key Goals:
+
+* Provide a standardized RAS solution for GPU and accelerator drivers, enabling
+  data center monitoring and reliability operations.
+* Implement a single drm-ras Generic Netlink family to meet modern Netlink YAML
+  specifications and centralize all RAS-related communication in one namespace.
+* Support a basic error counter interface, addressing the immediate, essential
+  monitoring needs.
+* Offer a flexible, future-proof interface that can be extended to support
+  additional types of RAS data in the future.
+* Allow multiple nodes per driver, enabling drivers to register separate
+  nodes for different IP blocks, sub-blocks, or other logical subdivisions
+  as applicable.
+
+Nodes
+=====
+
+Nodes are logical abstractions representing an error source or block within
+the device. Currently, only error counter nodes are supported.
+
+Drivers are responsible for registering and unregistering nodes via the
+`drm_ras_node_register()` and `drm_ras_node_unregister()` APIs.
+
+Node Management
+-------------------
+
+.. kernel-doc:: drivers/gpu/drm/drm_ras.c
+   :doc: DRM RAS Node Management
+.. kernel-doc:: drivers/gpu/drm/drm_ras.c
+   :internal:
+
+Generic Netlink Usage
+=====================
+
+The interface is implemented as a Generic Netlink family named ``drm-ras``.
+User space tools can:
+
+* List registered nodes with the ``list-nodes`` command.
+* List all error counters in a node with the ``get-error-counters`` command.
+* Query error counters using the ``query-error-counter`` command.
+
+YAML-based Interface
+--------------------
+
+The interface is described in a YAML specification:
+
+:ref:`Documentation/netlink/specs/drm_ras.yaml`
+
+This YAML is used to auto-generate user space bindings via
+``tools/net/ynl/pyynl/ynl_gen_c.py``, and drives the structure of netlink
+attributes and operations.
+
+Usage Notes
+-----------
+
+* User space must first enumerate nodes to obtain their IDs.
+* Node IDs are used for all further queries, such as error counters.
+* Error counters are queried by node ID and error ID.
+* Query Parameters should be defined as part of the uAPI to ensure user interface stability.
+* The interface supports future extension by adding new node types and
+  additional attributes.
+
+Example: List nodes using ynl
+
+.. code-block:: bash
+
+    sudo ynl --family drm_ras  --dump list-nodes
+    [{'device-name': '0000:03:00.0',
+      'node-id': 0,
+      'node-name': 'correctable-errors',
+      'node-type': 'error-counter'},
+     {'device-name': '0000:03:00.0',
+      'node-id': 1,
+      'node-name': 'nonfatal-errors',
+      'node-type': 'error-counter'},
+     {'device-name': '0000:03:00.0',
+      'node-id': 2,
+      'node-name': 'fatal-errors',
+      'node-type': 'error-counter'}]
+
+Example: List all error counters using ynl
+
+.. code-block:: bash
+
+
+   sudo ynl --family drm_ras  --dump get-error-counters --json '{"node-id":1}'
+   [{'error-id': 1, 'error-name': 'error_name_1', 'error-value': 0},
+    {'error-id': 2, 'error-name': 'error_name_2', 'error-value': 0}]
+
+
+Example: Query an error counter for a given node
+
+.. code-block:: bash
+
+   sudo ynl --family drm_ras --do query-error-counter  --json '{"node-id":2, "error-id":1}'
+   {'error-id': 1, 'error-name': 'error_name_1', 'error-value': 0}
+
diff --git a/Documentation/gpu/index.rst b/Documentation/gpu/index.rst
index 7dcb15850afd..60c73fdcfeed 100644
--- a/Documentation/gpu/index.rst
+++ b/Documentation/gpu/index.rst
@@ -9,6 +9,7 @@ GPU Driver Developer's Guide
    drm-mm
    drm-kms
    drm-kms-helpers
+   drm-ras
    drm-uapi
    drm-usage-stats
    driver-uapi
diff --git a/Documentation/netlink/specs/drm_ras.yaml b/Documentation/netlink/specs/drm_ras.yaml
new file mode 100644
index 000000000000..be0e379c5bc9
--- /dev/null
+++ b/Documentation/netlink/specs/drm_ras.yaml
@@ -0,0 +1,130 @@
+# SPDX-License-Identifier: ((GPL-2.0 WITH Linux-syscall-note) OR BSD-3-Clause)
+---
+name: drm-ras
+protocol: genetlink
+uapi-header: drm/drm_ras.h
+
+doc: >-
+  DRM RAS (Reliability, Availability, Serviceability) over Generic Netlink.
+  Provides a standardized mechanism for DRM drivers to register "nodes"
+  representing hardware/software components capable of reporting error counters.
+  Userspace tools can query the list of nodes or individual error counters
+  via the Generic Netlink interface.
+
+definitions:
+  -
+    type: enum
+    name: node-type
+    value-start: 1
+    entries: [error-counter]
+    doc: >-
+         Type of the node. Currently, only error-counter nodes are
+         supported, which expose reliability counters for a hardware/software
+         component.
+
+attribute-sets:
+  -
+    name: node-attrs
+    attributes:
+      -
+        name: node-id
+        type: u32
+        doc: >-
+             Unique identifier for the node.
+             Assigned dynamically by the DRM RAS core upon registration.
+      -
+        name: device-name
+        type: string
+        doc: >-
+             Device name chosen by the driver at registration.
+             Can be a PCI BDF, UUID, or module name if unique.
+      -
+        name: node-name
+        type: string
+        doc: >-
+             Node name chosen by the driver at registration.
+             Can be an IP block name, or any name that identifies the
+             RAS node inside the device.
+      -
+        name: node-type
+        type: u32
+        doc: Type of this node, identifying its function.
+        enum: node-type
+  -
+    name: error-counter-attrs
+    attributes:
+      -
+        name: node-id
+        type: u32
+        doc:  Node ID targeted by this error counter operation.
+      -
+        name: error-id
+        type: u32
+        doc: Unique identifier for a specific error counter within a node.
+      -
+        name: error-name
+        type: string
+        doc: Name of the error.
+      -
+        name: error-value
+        type: u32
+        doc: Current value of the requested error counter.
+
+operations:
+  list:
+    -
+      name: list-nodes
+      doc: >-
+           Retrieve the full list of currently registered DRM RAS nodes.
+           Each node includes its dynamically assigned ID, name, and type.
+           **Important:** User space must call this operation first to obtain
+           the node IDs. These IDs are required for all subsequent
+           operations on nodes, such as querying error counters.
+      attribute-set: node-attrs
+      flags: [admin-perm]
+      dump:
+        reply:
+          attributes:
+            - node-id
+            - device-name
+            - node-name
+            - node-type
+    -
+      name: get-error-counters
+      doc: >-
+           Retrieve the full list of error counters for a given node.
+           The response includes the id, the name, and the current
+           value of each counter.
+      attribute-set: error-counter-attrs
+      flags: [admin-perm]
+      dump:
+        request:
+          attributes:
+            - node-id
+        reply:
+          attributes:
+            - error-id
+            - error-name
+            - error-value
+    -
+      name: query-error-counter
+      doc: >-
+           Query the information of a specific error counter for a given node.
+           Users must provide the node ID and the error counter ID.
+           The response contains the id, the name, and the current value
+           of the counter.
+      attribute-set: error-counter-attrs
+      flags: [admin-perm]
+      do:
+        request:
+          attributes:
+            - node-id
+            - error-id
+        reply:
+          attributes:
+            - error-id
+            - error-name
+            - error-value
+
+kernel-family:
+  headers: ["drm/drm_ras_nl.h"]
diff --git a/drivers/gpu/drm/Kconfig b/drivers/gpu/drm/Kconfig
index d3d52310c9cc..d29ac485b6ac 100644
--- a/drivers/gpu/drm/Kconfig
+++ b/drivers/gpu/drm/Kconfig
@@ -130,6 +130,15 @@ config DRM_PANIC_SCREEN_QR_VERSION
 	  Smaller QR code are easier to read, but will contain less debugging
 	  data. Default is 40.
 
+config DRM_RAS
+	bool "DRM RAS support"
+	depends on DRM
+	help
+	  Enables the DRM RAS (Reliability, Availability and Serviceability)
+	  support for DRM drivers. This provides a Generic Netlink interface
+	  for error reporting and queries.
+	  If in doubt, say "N".
+
 config DRM_DEBUG_DP_MST_TOPOLOGY_REFS
         bool "Enable refcount backtrace history in the DP MST helpers"
 	depends on STACKTRACE_SUPPORT
diff --git a/drivers/gpu/drm/Makefile b/drivers/gpu/drm/Makefile
index 0c21029c446f..d1ad4ce873a3 100644
--- a/drivers/gpu/drm/Makefile
+++ b/drivers/gpu/drm/Makefile
@@ -95,6 +95,7 @@ drm-$(CONFIG_DRM_ACCEL) += ../../accel/drm_accel.o
 drm-$(CONFIG_DRM_PANIC) += drm_panic.o
 drm-$(CONFIG_DRM_DRAW) += drm_draw.o
 drm-$(CONFIG_DRM_PANIC_SCREEN_QR_CODE) += drm_panic_qr.o
+drm-$(CONFIG_DRM_RAS) += drm_ras.o drm_ras_nl.o drm_ras_genl_family.o
 obj-$(CONFIG_DRM)	+= drm.o
 
 obj-$(CONFIG_DRM_PANEL_ORIENTATION_QUIRKS) += drm_panel_orientation_quirks.o
diff --git a/drivers/gpu/drm/drm_drv.c b/drivers/gpu/drm/drm_drv.c
index 2915118436ce..6b965c3d3307 100644
--- a/drivers/gpu/drm/drm_drv.c
+++ b/drivers/gpu/drm/drm_drv.c
@@ -53,6 +53,7 @@
 #include <drm/drm_panic.h>
 #include <drm/drm_print.h>
 #include <drm/drm_privacy_screen_machine.h>
+#include <drm/drm_ras_genl_family.h>
 
 #include "drm_crtc_internal.h"
 #include "drm_internal.h"
@@ -1223,6 +1224,7 @@ static const struct file_operations drm_stub_fops = {
 
 static void drm_core_exit(void)
 {
+	drm_ras_genl_family_unregister();
 	drm_privacy_screen_lookup_exit();
 	drm_panic_exit();
 	accel_core_exit();
@@ -1261,6 +1263,10 @@ static int __init drm_core_init(void)
 
 	drm_privacy_screen_lookup_init();
 
+	ret = drm_ras_genl_family_register();
+	if (ret < 0)
+		goto error;
+
 	drm_core_init_complete = true;
 
 	DRM_DEBUG("Initialized\n");
diff --git a/drivers/gpu/drm/drm_ras.c b/drivers/gpu/drm/drm_ras.c
new file mode 100644
index 000000000000..7bc77ea24fe2
--- /dev/null
+++ b/drivers/gpu/drm/drm_ras.c
@@ -0,0 +1,351 @@
+// SPDX-License-Identifier: MIT
+/*
+ * Copyright © 2025 Intel Corporation
+ */
+
+#include <linux/module.h>
+#include <linux/kernel.h>
+#include <linux/netdevice.h>
+#include <linux/xarray.h>
+#include <net/genetlink.h>
+
+#include <drm/drm_ras.h>
+
+/**
+ * DOC: DRM RAS Node Management
+ *
+ * This module provides the infrastructure to manage RAS (Reliability,
+ * Availability, and Serviceability) nodes for DRM drivers. Each
+ * DRM driver may register one or more RAS nodes, which represent
+ * logical components capable of reporting error counters and other
+ * reliability metrics.
+ *
+ * The nodes are stored in a global xarray `drm_ras_xa` to allow
+ * efficient lookup by ID. Nodes can be registered or unregistered
+ * dynamically at runtime.
+ *
+ * A Generic Netlink family `drm-ras` exposes three main operations to
+ * userspace:
+ *
+ * 1. LIST_NODES: Dump all currently registered RAS nodes.
+ *    The user receives an array of node IDs, names, and types.
+ *
+ * 2. GET_ERROR_COUNTERS: Dump all error counters of a given node.
+ *    The user receives an array of error IDs, names, and current value.
+ *
+ * 3. QUERY_ERROR_COUNTER: Query a specific error counter for a given node.
+ *    Userspace must provide the node ID and the counter ID, and
+ *    receives the ID, the error name, and its current value.
+ *
+ * Node registration:
+ * - drm_ras_node_register(): Registers a new node and assigns
+ *   it a unique ID in the xarray.
+ * - drm_ras_node_unregister(): Removes a previously registered
+ *   node from the xarray.
+ *
+ * Node type:
+ * - ERROR_COUNTER:
+ *     + Currently, only error counters are supported.
+ *     + The driver must implement the query_error_counter() callback to provide
+ *       the name and the value of the error counter.
+ *     + The driver must provide an error_counter_range.last value informing the
+ *       last valid error ID.
+ *     + The driver can provide an error_counter_range.first value informing the
+ *       first valid error ID.
+ *     + The error counters in the driver don't need to be contiguous, but the
+ *       driver must return -ENOENT to the query_error_counter as an indication
+ *       that the ID should be skipped and not listed in the netlink API.
+ *
+ * Netlink handlers:
+ * - drm_ras_nl_list_nodes_dumpit(): Implements the LIST_NODES
+ *   operation, iterating over the xarray.
+ * - drm_ras_nl_get_error_counters_dumpit(): Implements the GET_ERROR_COUNTERS
+ *   operation, iterating over the known valid error_counter_range.
+ * - drm_ras_nl_query_error_counter_doit(): Implements the QUERY_ERROR_COUNTER
+ *   operation, fetching a counter value from a specific node.
+ */
+
+static DEFINE_XARRAY_ALLOC(drm_ras_xa);
+
+/*
+ * The netlink callback context carries dump state across multiple dumpit calls
+ */
+struct drm_ras_ctx {
+	/* Which xarray id to restart the dump from */
+	unsigned long restart;
+};
+
+/**
+ * drm_ras_nl_list_nodes_dumpit() - Dump all registered RAS nodes
+ * @skb: Netlink message buffer
+ * @cb: Callback context for multi-part dumps
+ *
+ * Iterates over all registered RAS nodes in the global xarray and appends
+ * their attributes (ID, name, type) to the given netlink message buffer.
+ * Uses @cb->ctx to track progress in case the message buffer fills up, allowing
+ * multi-part dump support. On buffer overflow, updates the context to resume
+ * from the last node on the next invocation.
+ *
+ * Return: 0 if all nodes fit in @skb, number of bytes added to @skb if
+ *          the buffer filled up (requires multi-part continuation), or
+ *          a negative error code on failure.
+ */
+int drm_ras_nl_list_nodes_dumpit(struct sk_buff *skb,
+				 struct netlink_callback *cb)
+{
+	const struct genl_info *info = genl_info_dump(cb);
+	struct drm_ras_ctx *ctx = (void *)cb->ctx;
+	struct drm_ras_node *node;
+	struct nlattr *hdr;
+	unsigned long id;
+	int ret = 0;
+
+	xa_for_each_start(&drm_ras_xa, id, node, ctx->restart) {
+		hdr = genlmsg_iput(skb, info);
+		if (!hdr) {
+			ret = -EMSGSIZE;
+			break;
+		}
+
+		ret = nla_put_u32(skb, DRM_RAS_A_NODE_ATTRS_NODE_ID, node->id);
+		if (ret) {
+			genlmsg_cancel(skb, hdr);
+			break;
+		}
+
+		ret = nla_put_string(skb, DRM_RAS_A_NODE_ATTRS_DEVICE_NAME,
+				     node->device_name);
+		if (ret) {
+			genlmsg_cancel(skb, hdr);
+			break;
+		}
+
+		ret = nla_put_string(skb, DRM_RAS_A_NODE_ATTRS_NODE_NAME,
+				     node->node_name);
+		if (ret) {
+			genlmsg_cancel(skb, hdr);
+			break;
+		}
+
+		ret = nla_put_u32(skb, DRM_RAS_A_NODE_ATTRS_NODE_TYPE,
+				  node->type);
+		if (ret) {
+			genlmsg_cancel(skb, hdr);
+			break;
+		}
+
+		genlmsg_end(skb, hdr);
+	}
+
+	if (ret == -EMSGSIZE)
+		ctx->restart = id;
+
+	return ret;
+}
+
+static int get_node_error_counter(u32 node_id, u32 error_id,
+				  const char **name, u32 *value)
+{
+	struct drm_ras_node *node;
+
+	node = xa_load(&drm_ras_xa, node_id);
+	if (!node || !node->query_error_counter)
+		return -ENOENT;
+
+	if (error_id < node->error_counter_range.first ||
+	    error_id > node->error_counter_range.last)
+		return -EINVAL;
+
+	return node->query_error_counter(node, error_id, name, value);
+}
+
+static int msg_reply_value(struct sk_buff *msg, u32 error_id,
+			   const char *error_name, u32 value)
+{
+	int ret;
+
+	ret = nla_put_u32(msg, DRM_RAS_A_ERROR_COUNTER_ATTRS_ERROR_ID, error_id);
+	if (ret)
+		return ret;
+
+	ret = nla_put_string(msg, DRM_RAS_A_ERROR_COUNTER_ATTRS_ERROR_NAME,
+			     error_name);
+	if (ret)
+		return ret;
+
+	return nla_put_u32(msg, DRM_RAS_A_ERROR_COUNTER_ATTRS_ERROR_VALUE,
+			   value);
+}
+
+static int doit_reply_value(struct genl_info *info, u32 node_id,
+			    u32 error_id)
+{
+	struct sk_buff *msg;
+	struct nlattr *hdr;
+	const char *error_name;
+	u32 value;
+	int ret;
+
+	msg = genlmsg_new(NLMSG_GOODSIZE, GFP_KERNEL);
+	if (!msg)
+		return -ENOMEM;
+
+	hdr = genlmsg_iput(msg, info);
+	if (!hdr) {
+		nlmsg_free(msg);
+		return -EMSGSIZE;
+	}
+
+	ret = get_node_error_counter(node_id, error_id,
+				     &error_name, &value);
+	if (!ret)
+		ret = msg_reply_value(msg, error_id, error_name, value);
+
+	/* Free the reply on any failure, including a failed lookup */
+	if (ret) {
+		genlmsg_cancel(msg, hdr);
+		nlmsg_free(msg);
+		return ret;
+	}
+
+	genlmsg_end(msg, hdr);
+
+	return genlmsg_reply(msg, info);
+}
+
+/**
+ * drm_ras_nl_get_error_counters_dumpit() - Dump all Error Counters
+ * @skb: Netlink message buffer
+ * @cb: Callback context for multi-part dumps
+ *
+ * Iterates over all error counters in a given Node and appends
+ * their attributes (ID, name, value) to the given netlink message buffer.
+ * Uses @cb->ctx to track progress in case the message buffer fills up, allowing
+ * multi-part dump support. On buffer overflow, updates the context to resume
+ * from the last error ID on the next invocation.
+ *
+ * Return: 0 if all errors fit in @skb, number of bytes added to @skb if
+ *          the buffer filled up (requires multi-part continuation), or
+ *          a negative error code on failure.
+ */
+int drm_ras_nl_get_error_counters_dumpit(struct sk_buff *skb,
+					 struct netlink_callback *cb)
+{
+	const struct genl_info *info = genl_info_dump(cb);
+	struct drm_ras_ctx *ctx = (void *)cb->ctx;
+	struct drm_ras_node *node;
+	struct nlattr *hdr;
+	const char *error_name;
+	u32 node_id, error_id, value;
+	int ret = 0;
+
+	if (!info->attrs || !info->attrs[DRM_RAS_A_ERROR_COUNTER_ATTRS_NODE_ID])
+		return -EINVAL;
+
+	node_id = nla_get_u32(info->attrs[DRM_RAS_A_ERROR_COUNTER_ATTRS_NODE_ID]);
+
+	node = xa_load(&drm_ras_xa, node_id);
+	if (!node)
+		return -ENOENT;
+
+	for (error_id = max(node->error_counter_range.first, ctx->restart);
+	     error_id <= node->error_counter_range.last;
+	     error_id++) {
+		ret = get_node_error_counter(node_id, error_id,
+					     &error_name, &value);
+		/*
+		 * For a non-contiguous range, the driver returns -ENOENT to
+		 * indicate that this ID should be skipped when listing all errors.
+		 */
+		if (ret == -ENOENT)
+			continue;
+		if (ret)
+			return ret;
+
+		hdr = genlmsg_iput(skb, info);
+
+		if (!hdr) {
+			ret = -EMSGSIZE;
+			break;
+		}
+
+		ret = msg_reply_value(skb, error_id, error_name, value);
+		if (ret) {
+			genlmsg_cancel(skb, hdr);
+			break;
+		}
+
+		genlmsg_end(skb, hdr);
+	}
+
+	if (ret == -EMSGSIZE)
+		ctx->restart = error_id;
+
+	return ret;
+}
+
+/**
+ * drm_ras_nl_query_error_counter_doit() - Query an error counter of a node
+ * @skb: Netlink message buffer
+ * @info: Generic Netlink info containing attributes of the request
+ *
+ * Extracts the node ID and error ID from the netlink attributes and
+ * retrieves the current value of the corresponding error counter. Sends the
+ * result back to the requesting user via the standard Genl reply.
+ *
+ * Return: 0 on success, or negative errno on failure.
+ */
+int drm_ras_nl_query_error_counter_doit(struct sk_buff *skb,
+					struct genl_info *info)
+{
+	u32 node_id, error_id;
+
+	if (!info->attrs ||
+	    !info->attrs[DRM_RAS_A_ERROR_COUNTER_ATTRS_NODE_ID] ||
+	    !info->attrs[DRM_RAS_A_ERROR_COUNTER_ATTRS_ERROR_ID])
+		return -EINVAL;
+
+	node_id = nla_get_u32(info->attrs[DRM_RAS_A_ERROR_COUNTER_ATTRS_NODE_ID]);
+	error_id = nla_get_u32(info->attrs[DRM_RAS_A_ERROR_COUNTER_ATTRS_ERROR_ID]);
+
+	return doit_reply_value(info, node_id, error_id);
+}
+
+/**
+ * drm_ras_node_register() - Register a new RAS node
+ * @node: Node structure to register
+ *
+ * Adds the given RAS node to the global node xarray and assigns it a unique
+ * ID. @node->device_name, @node->node_name and @node->type must be valid.
+ *
+ * Return: 0 on success, or negative errno on failure.
+ */
+int drm_ras_node_register(struct drm_ras_node *node)
+{
+	if (!node->device_name || !node->node_name)
+		return -EINVAL;
+
+	/* Currently, only Error Counter Endpoints are supported */
+	if (node->type != DRM_RAS_NODE_TYPE_ERROR_COUNTER)
+		return -EINVAL;
+
+	/* Mandatory entries for Error Counter Node */
+	if (node->type == DRM_RAS_NODE_TYPE_ERROR_COUNTER &&
+	    (!node->error_counter_range.last || !node->query_error_counter))
+		return -EINVAL;
+
+	return xa_alloc(&drm_ras_xa, &node->id, node, xa_limit_32b, GFP_KERNEL);
+}
+EXPORT_SYMBOL(drm_ras_node_register);
+
+/**
+ * drm_ras_node_unregister() - Unregister a previously registered node
+ * @node: Node structure to unregister
+ *
+ * Removes the given node from the global node xarray using its ID.
+ */
+void drm_ras_node_unregister(struct drm_ras_node *node)
+{
+	xa_erase(&drm_ras_xa, node->id);
+}
+EXPORT_SYMBOL(drm_ras_node_unregister);
diff --git a/drivers/gpu/drm/drm_ras_genl_family.c b/drivers/gpu/drm/drm_ras_genl_family.c
new file mode 100644
index 000000000000..2d818b8c3808
--- /dev/null
+++ b/drivers/gpu/drm/drm_ras_genl_family.c
@@ -0,0 +1,42 @@
+// SPDX-License-Identifier: MIT
+/*
+ * Copyright © 2025 Intel Corporation
+ */
+
+#include <drm/drm_ras_genl_family.h>
+#include <drm/drm_ras_nl.h>
+
+/* Track family registration so the drm_exit can be called at any time */
+static bool registered;
+
+/**
+ * drm_ras_genl_family_register() - Register drm-ras genl family
+ *
+ * Only to be called once at drm_drv_init()
+ */
+int drm_ras_genl_family_register(void)
+{
+	int ret;
+
+	registered = false;
+
+	ret = genl_register_family(&drm_ras_nl_family);
+	if (ret)
+		return ret;
+
+	registered = true;
+	return 0;
+}
+
+/**
+ * drm_ras_genl_family_unregister() - Unregister drm-ras genl family
+ *
+ * To be called at drm_drv_exit(), at any moment, but only once.
+ */
+void drm_ras_genl_family_unregister(void)
+{
+	if (registered) {
+		genl_unregister_family(&drm_ras_nl_family);
+		registered = false;
+	}
+}
diff --git a/drivers/gpu/drm/drm_ras_nl.c b/drivers/gpu/drm/drm_ras_nl.c
new file mode 100644
index 000000000000..fcd1392410e4
--- /dev/null
+++ b/drivers/gpu/drm/drm_ras_nl.c
@@ -0,0 +1,54 @@
+// SPDX-License-Identifier: ((GPL-2.0 WITH Linux-syscall-note) OR BSD-3-Clause)
+/* Do not edit directly, auto-generated from: */
+/*	Documentation/netlink/specs/drm_ras.yaml */
+/* YNL-GEN kernel source */
+
+#include <net/netlink.h>
+#include <net/genetlink.h>
+
+#include <uapi/drm/drm_ras.h>
+#include <drm/drm_ras_nl.h>
+
+/* DRM_RAS_CMD_GET_ERROR_COUNTERS - dump */
+static const struct nla_policy drm_ras_get_error_counters_nl_policy[DRM_RAS_A_ERROR_COUNTER_ATTRS_NODE_ID + 1] = {
+	[DRM_RAS_A_ERROR_COUNTER_ATTRS_NODE_ID] = { .type = NLA_U32, },
+};
+
+/* DRM_RAS_CMD_QUERY_ERROR_COUNTER - do */
+static const struct nla_policy drm_ras_query_error_counter_nl_policy[DRM_RAS_A_ERROR_COUNTER_ATTRS_ERROR_ID + 1] = {
+	[DRM_RAS_A_ERROR_COUNTER_ATTRS_NODE_ID] = { .type = NLA_U32, },
+	[DRM_RAS_A_ERROR_COUNTER_ATTRS_ERROR_ID] = { .type = NLA_U32, },
+};
+
+/* Ops table for drm_ras */
+static const struct genl_split_ops drm_ras_nl_ops[] = {
+	{
+		.cmd	= DRM_RAS_CMD_LIST_NODES,
+		.dumpit	= drm_ras_nl_list_nodes_dumpit,
+		.flags	= GENL_ADMIN_PERM | GENL_CMD_CAP_DUMP,
+	},
+	{
+		.cmd		= DRM_RAS_CMD_GET_ERROR_COUNTERS,
+		.dumpit		= drm_ras_nl_get_error_counters_dumpit,
+		.policy		= drm_ras_get_error_counters_nl_policy,
+		.maxattr	= DRM_RAS_A_ERROR_COUNTER_ATTRS_NODE_ID,
+		.flags		= GENL_ADMIN_PERM | GENL_CMD_CAP_DUMP,
+	},
+	{
+		.cmd		= DRM_RAS_CMD_QUERY_ERROR_COUNTER,
+		.doit		= drm_ras_nl_query_error_counter_doit,
+		.policy		= drm_ras_query_error_counter_nl_policy,
+		.maxattr	= DRM_RAS_A_ERROR_COUNTER_ATTRS_ERROR_ID,
+		.flags		= GENL_ADMIN_PERM | GENL_CMD_CAP_DO,
+	},
+};
+
+struct genl_family drm_ras_nl_family __ro_after_init = {
+	.name		= DRM_RAS_FAMILY_NAME,
+	.version	= DRM_RAS_FAMILY_VERSION,
+	.netnsok	= true,
+	.parallel_ops	= true,
+	.module		= THIS_MODULE,
+	.split_ops	= drm_ras_nl_ops,
+	.n_split_ops	= ARRAY_SIZE(drm_ras_nl_ops),
+};
diff --git a/include/drm/drm_ras.h b/include/drm/drm_ras.h
new file mode 100644
index 000000000000..bba47a282ef8
--- /dev/null
+++ b/include/drm/drm_ras.h
@@ -0,0 +1,76 @@
+/* SPDX-License-Identifier: MIT */
+/*
+ * Copyright © 2025 Intel Corporation
+ */
+
+#ifndef __DRM_RAS_H__
+#define __DRM_RAS_H__
+
+#include "drm_ras_nl.h"
+
+/**
+ * struct drm_ras_node - A DRM RAS Node
+ */
+struct drm_ras_node {
+	/** @id: Unique identifier for the node. Dynamically assigned. */
+	u32 id;
+	/**
+	 * @device_name: Human-readable name of the device. Given by the driver.
+	 */
+	const char *device_name;
+	/** @node_name: Human-readable name of the node. Given by the driver. */
+	const char *node_name;
+	/** @type: Type of the node (enum drm_ras_node_type). */
+	enum drm_ras_node_type type;
+
+	/* Error-Counter Related Callback and Variables */
+
+	/** @error_counter_range: Range of valid Error IDs for this node. */
+	struct {
+		/** @first: First valid Error ID. */
+		u32 first;
+		/** @last: Last valid Error ID. Mandatory entry. */
+		u32 last;
+	} error_counter_range;
+
+	/**
+	 * @query_error_counter:
+	 *
+	 * This callback is used by drm-ras to query a specific error counter.
+	 * It serves both single-counter queries and iteration over all the
+	 * counters supported by this node.
+	 *
+	 * Driver should expect query_error_counter() to be called with
+	 * error_id from `error_counter_range.first` to
+	 * `error_counter_range.last`.
+	 *
+	 * The @query_error_counter is a mandatory callback for
+	 * error_counter_node.
+	 *
+	 * Returns: 0 on success,
+	 *          -ENOENT when error_id is not supported as an indication that
+	 *                  drm_ras should silently skip this entry. Used for
+	 *                  supporting non-contiguous error ranges.
+	 *                  Driver is responsible for maintaining the list of
+	 *                  supported error IDs in the range of first to last.
+	 *          Other negative values on errors that should terminate the
+	 *          netlink query.
+	 */
+	int (*query_error_counter)(struct drm_ras_node *ep, u32 error_id,
+				   const char **name, u32 *val);
+
+	/** @priv: Driver private data */
+	void *priv;
+};
+
+struct drm_device;
+
+#if IS_ENABLED(CONFIG_DRM_RAS)
+int drm_ras_node_register(struct drm_ras_node *ep);
+void drm_ras_node_unregister(struct drm_ras_node *ep);
+#else
+static inline int drm_ras_node_register(struct drm_ras_node *ep) { return 0; }
+static inline void drm_ras_node_unregister(struct drm_ras_node *ep) { }
+#endif
+
+#endif
diff --git a/include/drm/drm_ras_genl_family.h b/include/drm/drm_ras_genl_family.h
new file mode 100644
index 000000000000..5931b53429f1
--- /dev/null
+++ b/include/drm/drm_ras_genl_family.h
@@ -0,0 +1,17 @@
+/* SPDX-License-Identifier: MIT */
+/*
+ * Copyright © 2025 Intel Corporation
+ */
+
+#ifndef __DRM_RAS_GENL_FAMILY_H__
+#define __DRM_RAS_GENL_FAMILY_H__
+
+#if IS_ENABLED(CONFIG_DRM_RAS)
+int drm_ras_genl_family_register(void);
+void drm_ras_genl_family_unregister(void);
+#else
+static inline int drm_ras_genl_family_register(void) { return 0; }
+static inline void drm_ras_genl_family_unregister(void) { }
+#endif
+
+#endif
diff --git a/include/drm/drm_ras_nl.h b/include/drm/drm_ras_nl.h
new file mode 100644
index 000000000000..9613b7d9ffdb
--- /dev/null
+++ b/include/drm/drm_ras_nl.h
@@ -0,0 +1,24 @@
+/* SPDX-License-Identifier: ((GPL-2.0 WITH Linux-syscall-note) OR BSD-3-Clause) */
+/* Do not edit directly, auto-generated from: */
+/*	Documentation/netlink/specs/drm_ras.yaml */
+/* YNL-GEN kernel header */
+
+#ifndef _LINUX_DRM_RAS_GEN_H
+#define _LINUX_DRM_RAS_GEN_H
+
+#include <net/netlink.h>
+#include <net/genetlink.h>
+
+#include <uapi/drm/drm_ras.h>
+#include <drm/drm_ras_nl.h>
+
+int drm_ras_nl_list_nodes_dumpit(struct sk_buff *skb,
+				 struct netlink_callback *cb);
+int drm_ras_nl_get_error_counters_dumpit(struct sk_buff *skb,
+					 struct netlink_callback *cb);
+int drm_ras_nl_query_error_counter_doit(struct sk_buff *skb,
+					struct genl_info *info);
+
+extern struct genl_family drm_ras_nl_family;
+
+#endif /* _LINUX_DRM_RAS_GEN_H */
diff --git a/include/uapi/drm/drm_ras.h b/include/uapi/drm/drm_ras.h
new file mode 100644
index 000000000000..3415ba345ac8
--- /dev/null
+++ b/include/uapi/drm/drm_ras.h
@@ -0,0 +1,49 @@
+/* SPDX-License-Identifier: ((GPL-2.0 WITH Linux-syscall-note) OR BSD-3-Clause) */
+/* Do not edit directly, auto-generated from: */
+/*	Documentation/netlink/specs/drm_ras.yaml */
+/* YNL-GEN uapi header */
+
+#ifndef _UAPI_LINUX_DRM_RAS_H
+#define _UAPI_LINUX_DRM_RAS_H
+
+#define DRM_RAS_FAMILY_NAME	"drm-ras"
+#define DRM_RAS_FAMILY_VERSION	1
+
+/*
+ * Type of the node. Currently, only error-counter nodes are supported, which
+ * expose reliability counters for a hardware/software component.
+ */
+enum drm_ras_node_type {
+	DRM_RAS_NODE_TYPE_ERROR_COUNTER = 1,
+};
+
+enum {
+	DRM_RAS_A_NODE_ATTRS_NODE_ID = 1,
+	DRM_RAS_A_NODE_ATTRS_DEVICE_NAME,
+	DRM_RAS_A_NODE_ATTRS_NODE_NAME,
+	DRM_RAS_A_NODE_ATTRS_NODE_TYPE,
+
+	__DRM_RAS_A_NODE_ATTRS_MAX,
+	DRM_RAS_A_NODE_ATTRS_MAX = (__DRM_RAS_A_NODE_ATTRS_MAX - 1)
+};
+
+enum {
+	DRM_RAS_A_ERROR_COUNTER_ATTRS_NODE_ID = 1,
+	DRM_RAS_A_ERROR_COUNTER_ATTRS_ERROR_ID,
+	DRM_RAS_A_ERROR_COUNTER_ATTRS_ERROR_NAME,
+	DRM_RAS_A_ERROR_COUNTER_ATTRS_ERROR_VALUE,
+
+	__DRM_RAS_A_ERROR_COUNTER_ATTRS_MAX,
+	DRM_RAS_A_ERROR_COUNTER_ATTRS_MAX = (__DRM_RAS_A_ERROR_COUNTER_ATTRS_MAX - 1)
+};
+
+enum {
+	DRM_RAS_CMD_LIST_NODES = 1,
+	DRM_RAS_CMD_GET_ERROR_COUNTERS,
+	DRM_RAS_CMD_QUERY_ERROR_COUNTER,
+
+	__DRM_RAS_CMD_MAX,
+	DRM_RAS_CMD_MAX = (__DRM_RAS_CMD_MAX - 1)
+};
+
+#endif /* _UAPI_LINUX_DRM_RAS_H */
-- 
2.47.1



* [PATCH v5 2/5] drm/xe/xe_drm_ras: Add support for XE DRM RAS
  2026-02-02  6:43 [PATCH v5 0/5] Introduce DRM_RAS using generic netlink for RAS Riana Tauro
  2026-02-02  6:43 ` [PATCH v5 1/5] drm/ras: Introduce the DRM RAS infrastructure over generic netlink Riana Tauro
@ 2026-02-02  6:43 ` Riana Tauro
  2026-02-03 17:58   ` Raag Jadav
  2026-02-02  6:43 ` [PATCH v5 3/5] drm/xe/xe_hw_error: Integrate DRM RAS with hardware error handling Riana Tauro
                   ` (6 subsequent siblings)
  8 siblings, 1 reply; 24+ messages in thread
From: Riana Tauro @ 2026-02-02  6:43 UTC (permalink / raw)
  To: intel-xe, dri-devel
  Cc: aravind.iddamsetty, anshuman.gupta, rodrigo.vivi, joonas.lahtinen,
	simona.vetter, airlied, pratik.bari, joshua.santosh.ranjan,
	ashwin.kumar.kulkarni, shubham.kumar, ravi.kishore.koppuravuri,
	raag.jadav, Riana Tauro

Allocate correctable and uncorrectable nodes for every xe device.
Each node contains error components, counters and the respective
counter query functions.

Add basic functionality to create and register drm nodes.
The operations below can be performed using the Generic netlink
DRM RAS interface:

List Nodes:

$ sudo ynl --family drm_ras --dump list-nodes
[{'device-name': '0000:03:00.0',
  'node-id': 0,
  'node-name': 'correctable-errors',
  'node-type': 'error-counter'},
 {'device-name': '0000:03:00.0',
  'node-id': 1,
  'node-name': 'uncorrectable-errors',
  'node-type': 'error-counter'}]

Get Error counters:

$ sudo ynl --family drm_ras  --dump get-error-counters --json '{"node-id":1}'
[{'error-id': 1, 'error-name': 'core-compute', 'error-value': 0},
 {'error-id': 2, 'error-name': 'soc-internal', 'error-value': 0}]

Query Error counter:

$ sudo ynl --family drm_ras --do query-error-counter  --json '{"node-id":1, "error-id":2}'
{'error-id': 2, 'error-name': 'soc-internal', 'error-value': 0}
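
As a rough illustration of the counter listing above, the userspace
sketch below mimics how a dump over a node's error-id range could skip
unsupported IDs. It assumes the -ENOENT convention documented for the
query_error_counter callback. struct counter, query_counter() and
dump_counters() are hypothetical names used only for this example and
are not part of this series.

```c
#include <errno.h>
#include <stddef.h>
#include <stdio.h>

/* Hypothetical userspace stand-in for struct xe_drm_ras_counter. */
struct counter {
	const char *name;	/* NULL marks an unsupported error-id */
	unsigned int value;
};

/* IDs start at 1, so slot 0 stays empty, as in the uAPI enum. */
enum { COMP_CORE_COMPUTE = 1, COMP_SOC_INTERNAL, COMP_MAX };

static struct counter uncorrectable[COMP_MAX] = {
	[COMP_CORE_COMPUTE] = { "core-compute", 0 },
	[COMP_SOC_INTERNAL] = { "soc-internal", 0 },
};

/* Same contract as the callback: 0 on success, -ENOENT to skip the entry. */
static int query_counter(const struct counter *info, unsigned int error_id,
			 const char **name, unsigned int *val)
{
	if (!info || error_id >= COMP_MAX || !info[error_id].name)
		return -ENOENT;

	*name = info[error_id].name;
	*val = info[error_id].value;
	return 0;
}

/* Walk [first, last]; returns the number of entries printed or an error. */
static int dump_counters(const struct counter *info, unsigned int first,
			 unsigned int last)
{
	int printed = 0;

	for (unsigned int id = first; id <= last; id++) {
		const char *name;
		unsigned int val;
		int ret = query_counter(info, id, &name, &val);

		if (ret == -ENOENT)
			continue;	/* hole in the range, skip silently */
		if (ret)
			return ret;	/* any other error ends the dump */

		printf("error-id %u, error-name %s, error-value %u\n",
		       id, name, val);
		printed++;
	}

	return printed;
}
```

Calling dump_counters(uncorrectable, 0, COMP_MAX - 1) walks the whole
range and reports only the two named entries, since slot 0 has no name.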

Signed-off-by: Riana Tauro <riana.tauro@intel.com>
---
v2: Add ID's and names as uAPI (Rodrigo)
    Add documentation
    Modify commit message

v3: remove 'error' from counters
    use drmm_kcalloc
    add a for_each for severity
    differentiate error classes and severity in uapi
    Use GT instead of Core Compute(Raag)
    Use correctable and uncorrectable in uapi (Pratik / Aravind)

v4: change UAPI enums
    split patches
    make drm_kcalloc robust
    fix function names and memory leak (Raag)
    add null pointer check for device_name
    start loop counter from 1
---
 drivers/gpu/drm/xe/Makefile           |   1 +
 drivers/gpu/drm/xe/xe_device_types.h  |   4 +
 drivers/gpu/drm/xe/xe_drm_ras.c       | 184 ++++++++++++++++++++++++++
 drivers/gpu/drm/xe/xe_drm_ras.h       |  15 +++
 drivers/gpu/drm/xe/xe_drm_ras_types.h |  40 ++++++
 include/uapi/drm/xe_drm.h             |  79 +++++++++++
 6 files changed, 323 insertions(+)
 create mode 100644 drivers/gpu/drm/xe/xe_drm_ras.c
 create mode 100644 drivers/gpu/drm/xe/xe_drm_ras.h
 create mode 100644 drivers/gpu/drm/xe/xe_drm_ras_types.h

diff --git a/drivers/gpu/drm/xe/Makefile b/drivers/gpu/drm/xe/Makefile
index b39cbb756232..b25564649492 100644
--- a/drivers/gpu/drm/xe/Makefile
+++ b/drivers/gpu/drm/xe/Makefile
@@ -41,6 +41,7 @@ xe-y += xe_bb.o \
 	xe_device_sysfs.o \
 	xe_dma_buf.o \
 	xe_drm_client.o \
+	xe_drm_ras.o \
 	xe_eu_stall.o \
 	xe_exec.o \
 	xe_exec_queue.o \
diff --git a/drivers/gpu/drm/xe/xe_device_types.h b/drivers/gpu/drm/xe/xe_device_types.h
index 34feef79fa4e..2e863fcb2f08 100644
--- a/drivers/gpu/drm/xe/xe_device_types.h
+++ b/drivers/gpu/drm/xe/xe_device_types.h
@@ -13,6 +13,7 @@
 #include <drm/ttm/ttm_device.h>
 
 #include "xe_devcoredump_types.h"
+#include "xe_drm_ras_types.h"
 #include "xe_heci_gsc.h"
 #include "xe_late_bind_fw_types.h"
 #include "xe_lmtt_types.h"
@@ -674,6 +675,9 @@ struct xe_device {
 	/** @pmu: performance monitoring unit */
 	struct xe_pmu pmu;
 
+	/** @ras: RAS structure for device */
+	struct xe_drm_ras ras;
+
 	/** @i2c: I2C host controller */
 	struct xe_i2c *i2c;
 
diff --git a/drivers/gpu/drm/xe/xe_drm_ras.c b/drivers/gpu/drm/xe/xe_drm_ras.c
new file mode 100644
index 000000000000..ec057bc82fbb
--- /dev/null
+++ b/drivers/gpu/drm/xe/xe_drm_ras.c
@@ -0,0 +1,184 @@
+// SPDX-License-Identifier: MIT
+/*
+ * Copyright © 2026 Intel Corporation
+ */
+
+#include <drm/drm_managed.h>
+#include <drm/drm_print.h>
+#include <drm/drm_ras.h>
+#include <linux/bitmap.h>
+
+#include "xe_device_types.h"
+#include "xe_drm_ras.h"
+
+static const char * const errors[] = DRM_XE_RAS_ERROR_COMPONENT_NAMES;
+static const char * const error_severity[] = DRM_XE_RAS_ERROR_SEVERITY_NAMES;
+
+static int hw_query_error_counter(struct xe_drm_ras_counter *info,
+				  u32 error_id, const char **name, u32 *val)
+{
+	if (!info || !info[error_id].name)
+		return -ENOENT;
+
+	*name = info[error_id].name;
+	*val = atomic_read(&info[error_id].counter);
+
+	return 0;
+}
+
+static int query_uncorrectable_error_counter(struct drm_ras_node *ep, u32 error_id,
+					     const char **name, u32 *val)
+{
+	struct xe_device *xe = ep->priv;
+	struct xe_drm_ras *ras = &xe->ras;
+	struct xe_drm_ras_counter *info = ras->info[DRM_XE_RAS_ERR_SEV_UNCORRECTABLE];
+
+	return hw_query_error_counter(info, error_id, name, val);
+}
+
+static int query_correctable_error_counter(struct drm_ras_node *ep, u32 error_id,
+					   const char **name, u32 *val)
+{
+	struct xe_device *xe = ep->priv;
+	struct xe_drm_ras *ras = &xe->ras;
+	struct xe_drm_ras_counter *info = ras->info[DRM_XE_RAS_ERR_SEV_CORRECTABLE];
+
+	return hw_query_error_counter(info, error_id, name, val);
+}
+
+static struct xe_drm_ras_counter *allocate_and_copy_counters(struct xe_device *xe)
+{
+	struct xe_drm_ras_counter *counter;
+	int i;
+
+	counter = drmm_kcalloc(&xe->drm, DRM_XE_RAS_ERR_COMP_MAX,
+			       sizeof(*counter), GFP_KERNEL);
+	if (!counter)
+		return ERR_PTR(-ENOMEM);
+
+	for (i = DRM_XE_RAS_ERR_COMP_CORE_COMPUTE; i < DRM_XE_RAS_ERR_COMP_MAX; i++) {
+		if (!errors[i])
+			continue;
+
+		counter[i].name = errors[i];
+		atomic_set(&counter[i].counter, 0);
+	}
+
+	return counter;
+}
+
+static int assign_node_params(struct xe_device *xe, struct drm_ras_node *node,
+			      const enum drm_xe_ras_error_severity severity)
+{
+	struct pci_dev *pdev = to_pci_dev(xe->drm.dev);
+	struct xe_drm_ras *ras = &xe->ras;
+	const char *device_name;
+
+	device_name = kasprintf(GFP_KERNEL, "%04x:%02x:%02x.%d",
+				pci_domain_nr(pdev->bus), pdev->bus->number,
+				PCI_SLOT(pdev->devfn), PCI_FUNC(pdev->devfn));
+
+	if (!device_name)
+		return -ENOMEM;
+
+	node->device_name = device_name;
+	node->node_name = error_severity[severity];
+	node->type = DRM_RAS_NODE_TYPE_ERROR_COUNTER;
+	node->error_counter_range.first = DRM_XE_RAS_ERR_COMP_CORE_COMPUTE;
+	node->error_counter_range.last = DRM_XE_RAS_ERR_COMP_MAX - 1;
+	node->priv = xe;
+
+	ras->info[severity] = allocate_and_copy_counters(xe);
+	if (IS_ERR(ras->info[severity]))
+		return PTR_ERR(ras->info[severity]);
+
+	if (severity == DRM_XE_RAS_ERR_SEV_CORRECTABLE)
+		node->query_error_counter = query_correctable_error_counter;
+	else
+		node->query_error_counter = query_uncorrectable_error_counter;
+
+	return 0;
+}
+
+static void cleanup_node_param(struct drm_ras_node *node)
+{
+	if (node && node->device_name) {
+		kfree(node->device_name);
+		node->device_name = NULL;
+	}
+}
+
+static int register_nodes(struct xe_device *xe)
+{
+	struct xe_drm_ras *ras = &xe->ras;
+	int i;
+
+	for_each_error_severity(i) {
+		struct drm_ras_node *node = &ras->node[i];
+		int ret;
+
+		ret = assign_node_params(xe, node, i);
+		if (ret) {
+			cleanup_node_param(node);
+			return ret;
+		}
+
+		ret = drm_ras_node_register(node);
+		if (ret) {
+			cleanup_node_param(node);
+			return ret;
+		}
+	}
+
+	return 0;
+}
+
+static void xe_drm_ras_unregister_nodes(void *arg)
+{
+	struct xe_device *xe = arg;
+	struct xe_drm_ras *ras = &xe->ras;
+	int i;
+
+	for_each_error_severity(i) {
+		struct drm_ras_node *node = &ras->node[i];
+
+		drm_ras_node_unregister(node);
+		cleanup_node_param(node);
+	}
+}
+
+/**
+ * xe_drm_ras_allocate_nodes() - Allocate DRM RAS nodes
+ * @xe: xe device instance
+ *
+ * Allocate and register DRM RAS nodes per device
+ *
+ * Return: 0 on success, negative error code otherwise.
+ */
+int xe_drm_ras_allocate_nodes(struct xe_device *xe)
+{
+	struct xe_drm_ras *ras = &xe->ras;
+	struct drm_ras_node *node;
+	int err;
+
+	node = drmm_kcalloc(&xe->drm, DRM_XE_RAS_ERR_SEV_MAX, sizeof(*node),
+			    GFP_KERNEL);
+	if (!node)
+		return -ENOMEM;
+
+	ras->node = node;
+
+	err = register_nodes(xe);
+	if (err) {
+		drm_err(&xe->drm, "Failed to register DRM RAS node\n");
+		return err;
+	}
+
+	err = devm_add_action_or_reset(xe->drm.dev, xe_drm_ras_unregister_nodes, xe);
+	if (err) {
+		drm_err(&xe->drm, "Failed to add action for Xe DRM RAS\n");
+		return err;
+	}
+
+	return 0;
+}
diff --git a/drivers/gpu/drm/xe/xe_drm_ras.h b/drivers/gpu/drm/xe/xe_drm_ras.h
new file mode 100644
index 000000000000..d5206c04c274
--- /dev/null
+++ b/drivers/gpu/drm/xe/xe_drm_ras.h
@@ -0,0 +1,15 @@
+/* SPDX-License-Identifier: MIT */
+/*
+ * Copyright © 2026 Intel Corporation
+ */
+#ifndef XE_DRM_RAS_H_
+#define XE_DRM_RAS_H_
+
+struct xe_device;
+
+#define for_each_error_severity(i)	\
+	for (i = 0; i < DRM_XE_RAS_ERR_SEV_MAX; i++)
+
+int xe_drm_ras_allocate_nodes(struct xe_device *xe);
+
+#endif
diff --git a/drivers/gpu/drm/xe/xe_drm_ras_types.h b/drivers/gpu/drm/xe/xe_drm_ras_types.h
new file mode 100644
index 000000000000..0ac4ae324f37
--- /dev/null
+++ b/drivers/gpu/drm/xe/xe_drm_ras_types.h
@@ -0,0 +1,40 @@
+/* SPDX-License-Identifier: MIT */
+/*
+ * Copyright © 2026 Intel Corporation
+ */
+
+#ifndef _XE_DRM_RAS_TYPES_H_
+#define _XE_DRM_RAS_TYPES_H_
+
+#include <drm/xe_drm.h>
+#include <linux/atomic.h>
+
+struct drm_ras_node;
+
+/**
+ * struct xe_drm_ras_counter - XE RAS counter
+ *
+ * This structure contains error component and counter information
+ */
+struct xe_drm_ras_counter {
+	/** @name: error component name */
+	const char *name;
+
+	/** @counter: count of error */
+	atomic_t counter;
+};
+
+/**
+ * struct xe_drm_ras - XE DRM RAS structure
+ *
+ * This structure has details of error counters
+ */
+struct xe_drm_ras {
+	/** @node: DRM RAS node */
+	struct drm_ras_node *node;
+
+	/** @info: info array for all types of errors */
+	struct xe_drm_ras_counter *info[DRM_XE_RAS_ERR_SEV_MAX];
+};
+
+#endif
diff --git a/include/uapi/drm/xe_drm.h b/include/uapi/drm/xe_drm.h
index 077e66a682e2..765bafd63ca1 100644
--- a/include/uapi/drm/xe_drm.h
+++ b/include/uapi/drm/xe_drm.h
@@ -2357,6 +2357,85 @@ struct drm_xe_exec_queue_set_property {
 	__u64 reserved[2];
 };
 
+/**
+ * DOC: Xe DRM RAS
+ *
+ * The enums and strings defined below map to the attributes of the DRM RAS Netlink Interface.
+ * Refer to Documentation/netlink/specs/drm_ras.yaml for complete interface specification.
+ *
+ * Node Registration
+ * =================
+ *
+ * The driver registers DRM RAS nodes for each error severity level.
+ * enum drm_xe_ras_error_severity defines the node-id, while DRM_XE_RAS_ERROR_SEVERITY_NAMES maps
+ * node-id to node-name.
+ *
+ * Error Classification
+ * ====================
+ *
+ * Each node contains a list of error counters. Each error is identified by an error-id and
+ * an error-name. enum drm_xe_ras_error_component defines the error-id, while
+ * DRM_XE_RAS_ERROR_COMPONENT_NAMES maps error-id to error-name.
+ *
+ * User Interface
+ * ==============
+ *
+ * To retrieve error values of an error counter, userspace applications should
+ * follow the steps below:
+ *
+ * 1. Use command LIST_NODES to enumerate all available nodes
+ * 2. Select node by node-id or node-name
+ * 3. Use command GET_ERROR_COUNTERS to list errors of specific node
+ * 4. Query specific error values using either error-id or error-name
+ *
+ * .. code-block:: C
+ *
+ *	// Lookup tables for ID-to-name resolution
+ *	static const char *nodes[] = DRM_XE_RAS_ERROR_SEVERITY_NAMES;
+ *	static const char *errors[] = DRM_XE_RAS_ERROR_COMPONENT_NAMES;
+ *
+ */
+
+/**
+ * enum drm_xe_ras_error_severity - DRM RAS error severity.
+ */
+enum drm_xe_ras_error_severity {
+	/** @DRM_XE_RAS_ERR_SEV_CORRECTABLE: Correctable Error */
+	DRM_XE_RAS_ERR_SEV_CORRECTABLE = 0,
+	/** @DRM_XE_RAS_ERR_SEV_UNCORRECTABLE: Uncorrectable Error */
+	DRM_XE_RAS_ERR_SEV_UNCORRECTABLE,
+	/** @DRM_XE_RAS_ERR_SEV_MAX: Max severity */
+	DRM_XE_RAS_ERR_SEV_MAX /* non-ABI */
+};
+
+/**
+ * enum drm_xe_ras_error_component - DRM RAS error component.
+ */
+enum drm_xe_ras_error_component {
+	/** @DRM_XE_RAS_ERR_COMP_CORE_COMPUTE: Core Compute Error */
+	DRM_XE_RAS_ERR_COMP_CORE_COMPUTE = 1,
+	/** @DRM_XE_RAS_ERR_COMP_SOC_INTERNAL: SoC Internal Error */
+	DRM_XE_RAS_ERR_COMP_SOC_INTERNAL,
+	/** @DRM_XE_RAS_ERR_COMP_MAX: Max Error */
+	DRM_XE_RAS_ERR_COMP_MAX	/* non-ABI */
+};
+
+/*
+ * Error severity to name mapping.
+ */
+#define DRM_XE_RAS_ERROR_SEVERITY_NAMES {				\
+	[DRM_XE_RAS_ERR_SEV_CORRECTABLE] = "correctable-errors",	\
+	[DRM_XE_RAS_ERR_SEV_UNCORRECTABLE] = "uncorrectable-errors",	\
+}
+
+/*
+ * Error component to name mapping.
+ */
+#define DRM_XE_RAS_ERROR_COMPONENT_NAMES {				\
+	[DRM_XE_RAS_ERR_COMP_CORE_COMPUTE] = "core-compute",		\
+	[DRM_XE_RAS_ERR_COMP_SOC_INTERNAL] = "soc-internal"		\
+}
+
 #if defined(__cplusplus)
 }
 #endif
-- 
2.47.1



* [PATCH v5 3/5] drm/xe/xe_hw_error: Integrate DRM RAS with hardware error handling
  2026-02-02  6:43 [PATCH v5 0/5] Introduce DRM_RAS using generic netlink for RAS Riana Tauro
  2026-02-02  6:43 ` [PATCH v5 1/5] drm/ras: Introduce the DRM RAS infrastructure over generic netlink Riana Tauro
  2026-02-02  6:43 ` [PATCH v5 2/5] drm/xe/xe_drm_ras: Add support for XE DRM RAS Riana Tauro
@ 2026-02-02  6:43 ` Riana Tauro
  2026-02-05  8:30   ` Raag Jadav
  2026-02-02  6:44 ` [PATCH v5 4/5] drm/xe/xe_hw_error: Add support for Core-Compute errors Riana Tauro
                   ` (5 subsequent siblings)
  8 siblings, 1 reply; 24+ messages in thread
From: Riana Tauro @ 2026-02-02  6:43 UTC (permalink / raw)
  To: intel-xe, dri-devel
  Cc: aravind.iddamsetty, anshuman.gupta, rodrigo.vivi, joonas.lahtinen,
	simona.vetter, airlied, pratik.bari, joshua.santosh.ranjan,
	ashwin.kumar.kulkarni, shubham.kumar, ravi.kishore.koppuravuri,
	raag.jadav, Riana Tauro

Initialize DRM RAS in hw error init. Map the UAPI error severities
to the hardware error severities and refactor the file.
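
The severity mapping can be pictured with the standalone sketch below.
It mirrors the enum values used in this series, but enum ras_severity,
severity_names and hw_err_to_node_name() are hypothetical stand-ins
written for illustration, not the driver code itself.

```c
#include <stddef.h>

/* Same values as enum hardware_error in this series. */
enum hardware_error {
	HARDWARE_ERROR_CORRECTABLE = 0,
	HARDWARE_ERROR_NONFATAL = 1,
	HARDWARE_ERROR_FATAL = 2,
	HARDWARE_ERROR_MAX,
};

/* Hypothetical stand-in for enum drm_xe_ras_error_severity. */
enum ras_severity {
	SEV_CORRECTABLE = 0,
	SEV_UNCORRECTABLE,
	SEV_MAX,
};

/* Node names as they appear in the list-nodes output. */
static const char * const severity_names[SEV_MAX] = {
	[SEV_CORRECTABLE] = "correctable-errors",
	[SEV_UNCORRECTABLE] = "uncorrectable-errors",
};

/* Three hardware categories collapse into two uAPI severities. */
static enum ras_severity hw_err_to_severity(enum hardware_error hw_err)
{
	if (hw_err == HARDWARE_ERROR_CORRECTABLE)
		return SEV_CORRECTABLE;

	/* Both fatal and non-fatal errors are reported as uncorrectable. */
	return SEV_UNCORRECTABLE;
}

/* Resolve the node name a hardware error category would be logged under. */
static const char *hw_err_to_node_name(enum hardware_error hw_err)
{
	return severity_names[hw_err_to_severity(hw_err)];
}
```

With this shape, the log messages and the netlink node names share one
string table, so the two cannot drift apart.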

Signed-off-by: Riana Tauro <riana.tauro@intel.com>
---
 drivers/gpu/drm/xe/xe_drm_ras_types.h |  8 ++++
 drivers/gpu/drm/xe/xe_hw_error.c      | 68 ++++++++++++++++-----------
 2 files changed, 48 insertions(+), 28 deletions(-)

diff --git a/drivers/gpu/drm/xe/xe_drm_ras_types.h b/drivers/gpu/drm/xe/xe_drm_ras_types.h
index 0ac4ae324f37..beed48811d6a 100644
--- a/drivers/gpu/drm/xe/xe_drm_ras_types.h
+++ b/drivers/gpu/drm/xe/xe_drm_ras_types.h
@@ -11,6 +11,14 @@
 
 struct drm_ras_node;
 
+/* Error categories reported by hardware */
+enum hardware_error {
+	HARDWARE_ERROR_CORRECTABLE = 0,
+	HARDWARE_ERROR_NONFATAL = 1,
+	HARDWARE_ERROR_FATAL = 2,
+	HARDWARE_ERROR_MAX,
+};
+
 /**
  * struct xe_drm_ras_counter - XE RAS counter
  *
diff --git a/drivers/gpu/drm/xe/xe_hw_error.c b/drivers/gpu/drm/xe/xe_hw_error.c
index 8c65291f36fc..2019aaaa1ebe 100644
--- a/drivers/gpu/drm/xe/xe_hw_error.c
+++ b/drivers/gpu/drm/xe/xe_hw_error.c
@@ -10,20 +10,16 @@
 #include "regs/xe_irq_regs.h"
 
 #include "xe_device.h"
+#include "xe_drm_ras.h"
 #include "xe_hw_error.h"
 #include "xe_mmio.h"
 #include "xe_survivability_mode.h"
 
 #define  HEC_UNCORR_FW_ERR_BITS 4
+
 extern struct fault_attr inject_csc_hw_error;
 
-/* Error categories reported by hardware */
-enum hardware_error {
-	HARDWARE_ERROR_CORRECTABLE = 0,
-	HARDWARE_ERROR_NONFATAL = 1,
-	HARDWARE_ERROR_FATAL = 2,
-	HARDWARE_ERROR_MAX,
-};
+static const char * const error_severity[] = DRM_XE_RAS_ERROR_SEVERITY_NAMES;
 
 static const char * const hec_uncorrected_fw_errors[] = {
 	"Fatal",
@@ -32,23 +28,18 @@ static const char * const hec_uncorrected_fw_errors[] = {
 	"Data Corruption"
 };
 
-static const char *hw_error_to_str(const enum hardware_error hw_err)
+static bool fault_inject_csc_hw_error(void)
 {
-	switch (hw_err) {
-	case HARDWARE_ERROR_CORRECTABLE:
-		return "CORRECTABLE";
-	case HARDWARE_ERROR_NONFATAL:
-		return "NONFATAL";
-	case HARDWARE_ERROR_FATAL:
-		return "FATAL";
-	default:
-		return "UNKNOWN";
-	}
+	return IS_ENABLED(CONFIG_DEBUG_FS) && should_fail(&inject_csc_hw_error, 1);
 }
 
-static bool fault_inject_csc_hw_error(void)
+static enum drm_xe_ras_error_severity hw_err_to_severity(enum hardware_error hw_err)
 {
-	return IS_ENABLED(CONFIG_DEBUG_FS) && should_fail(&inject_csc_hw_error, 1);
+	if (hw_err == HARDWARE_ERROR_CORRECTABLE)
+		return DRM_XE_RAS_ERR_SEV_CORRECTABLE;
+
+	/* Uncorrectable errors comprise both fatal and non-fatal errors */
+	return DRM_XE_RAS_ERR_SEV_UNCORRECTABLE;
 }
 
 static void csc_hw_error_work(struct work_struct *work)
@@ -64,7 +55,8 @@ static void csc_hw_error_work(struct work_struct *work)
 
 static void csc_hw_error_handler(struct xe_tile *tile, const enum hardware_error hw_err)
 {
-	const char *hw_err_str = hw_error_to_str(hw_err);
+	const enum drm_xe_ras_error_severity severity = hw_err_to_severity(hw_err);
+	const char *severity_str = error_severity[severity];
 	struct xe_device *xe = tile_to_xe(tile);
 	struct xe_mmio *mmio = &tile->mmio;
 	u32 base, err_bit, err_src;
@@ -77,8 +69,8 @@ static void csc_hw_error_handler(struct xe_tile *tile, const enum hardware_error
 	lockdep_assert_held(&xe->irq.lock);
 	err_src = xe_mmio_read32(mmio, HEC_UNCORR_ERR_STATUS(base));
 	if (!err_src) {
-		drm_err_ratelimited(&xe->drm, HW_ERR "Tile%d reported HEC_ERR_STATUS_%s blank\n",
-				    tile->id, hw_err_str);
+		drm_err_ratelimited(&xe->drm, HW_ERR "Tile%d reported %s HEC_ERR_STATUS register blank\n",
+				    tile->id, severity_str);
 		return;
 	}
 
@@ -86,8 +78,8 @@ static void csc_hw_error_handler(struct xe_tile *tile, const enum hardware_error
 		fw_err = xe_mmio_read32(mmio, HEC_UNCORR_FW_ERR_DW0(base));
 		for_each_set_bit(err_bit, &fw_err, HEC_UNCORR_FW_ERR_BITS) {
 			drm_err_ratelimited(&xe->drm, HW_ERR
-					    "%s: HEC Uncorrected FW %s error reported, bit[%d] is set\n",
-					     hw_err_str, hec_uncorrected_fw_errors[err_bit],
+					    "HEC FW %s error reported, bit[%d] is set\n",
+					     hec_uncorrected_fw_errors[err_bit],
 					     err_bit);
 
 			schedule_work(&tile->csc_hw_error_work);
@@ -99,7 +91,8 @@ static void csc_hw_error_handler(struct xe_tile *tile, const enum hardware_error
 
 static void hw_error_source_handler(struct xe_tile *tile, const enum hardware_error hw_err)
 {
-	const char *hw_err_str = hw_error_to_str(hw_err);
+	const enum drm_xe_ras_error_severity severity = hw_err_to_severity(hw_err);
+	const char *severity_str = error_severity[severity];
 	struct xe_device *xe = tile_to_xe(tile);
 	unsigned long flags;
 	u32 err_src;
@@ -110,8 +103,8 @@ static void hw_error_source_handler(struct xe_tile *tile, const enum hardware_er
 	spin_lock_irqsave(&xe->irq.lock, flags);
 	err_src = xe_mmio_read32(&tile->mmio, DEV_ERR_STAT_REG(hw_err));
 	if (!err_src) {
-		drm_err_ratelimited(&xe->drm, HW_ERR "Tile%d reported DEV_ERR_STAT_%s blank!\n",
-				    tile->id, hw_err_str);
+		drm_err_ratelimited(&xe->drm, HW_ERR "Tile%d reported %s DEV_ERR_STAT register blank!\n",
+				    tile->id, severity_str);
 		goto unlock;
 	}
 
@@ -146,6 +139,20 @@ void xe_hw_error_irq_handler(struct xe_tile *tile, const u32 master_ctl)
 			hw_error_source_handler(tile, hw_err);
 }
 
+static int hw_error_info_init(struct xe_device *xe)
+{
+	int ret;
+
+	if (xe->info.platform != XE_PVC)
+		return 0;
+
+	ret = xe_drm_ras_allocate_nodes(xe);
+	if (ret)
+		return ret;
+
+	return 0;
+}
+
 /*
  * Process hardware errors during boot
  */
@@ -172,11 +179,16 @@ static void process_hw_errors(struct xe_device *xe)
 void xe_hw_error_init(struct xe_device *xe)
 {
 	struct xe_tile *tile = xe_device_get_root_tile(xe);
+	int ret;
 
 	if (!IS_DGFX(xe) || IS_SRIOV_VF(xe))
 		return;
 
 	INIT_WORK(&tile->csc_hw_error_work, csc_hw_error_work);
 
+	ret = hw_error_info_init(xe);
+	if (ret)
+		drm_warn(&xe->drm, "Failed to allocate DRM RAS nodes\n");
+
 	process_hw_errors(xe);
 }
-- 
2.47.1



* [PATCH v5 4/5] drm/xe/xe_hw_error: Add support for Core-Compute errors
  2026-02-02  6:43 [PATCH v5 0/5] Introduce DRM_RAS using generic netlink for RAS Riana Tauro
                   ` (2 preceding siblings ...)
  2026-02-02  6:43 ` [PATCH v5 3/5] drm/xe/xe_hw_error: Integrate DRM RAS with hardware error handling Riana Tauro
@ 2026-02-02  6:44 ` Riana Tauro
  2026-02-05 15:30   ` Raag Jadav
  2026-02-02  6:44 ` [PATCH v5 5/5] drm/xe/xe_hw_error: Add support for PVC SoC errors Riana Tauro
                   ` (4 subsequent siblings)
  8 siblings, 1 reply; 24+ messages in thread
From: Riana Tauro @ 2026-02-02  6:44 UTC (permalink / raw)
  To: intel-xe, dri-devel
  Cc: aravind.iddamsetty, anshuman.gupta, rodrigo.vivi, joonas.lahtinen,
	simona.vetter, airlied, pratik.bari, joshua.santosh.ranjan,
	ashwin.kumar.kulkarni, shubham.kumar, ravi.kishore.koppuravuri,
	raag.jadav, Riana Tauro, Himal Prasad Ghimiray

PVC supports GT error reporting via vector registers along with
the error status register. Add support to report these errors and
update the respective counters. In case of a Subslice error reported
by a vector register, process the error status register
for applicable bits.

The counter is embedded in the xe drm ras structure and is
exposed to userspace using the drm_ras generic netlink
interface.

$ sudo ynl --family drm_ras --do query-error-counter --json \
  '{"node-id":0, "error-id":1}'
{'error-id': 1, 'error-name': 'core-compute', 'error-value': 0}
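
For readers unfamiliar with the masking step, here is a minimal
userspace sketch of how applicable status bits might be accounted for.
The bit positions and PVC_COR_ERR_MASK are taken from this patch. The
policy of counting each set bit once, the core_compute_correctable
counter and account_gt_correctable() are assumptions made for the
example and may differ from what the handler actually does.

```c
#include <stdatomic.h>
#include <stdint.h>

#define REG_BIT(n)		(UINT32_C(1) << (n))

/* Correctable GT error bits, as defined in xe_hw_error_regs.h. */
#define EU_GRF_COR_ERR		REG_BIT(15)
#define EU_IC_COR_ERR		REG_BIT(14)
#define SLM_COR_ERR		REG_BIT(13)
#define GUC_COR_ERR		REG_BIT(1)

#define PVC_COR_ERR_MASK	(GUC_COR_ERR | SLM_COR_ERR | \
				 EU_IC_COR_ERR | EU_GRF_COR_ERR)

/* Hypothetical counter standing in for the core-compute atomic_t. */
static atomic_uint core_compute_correctable;

/*
 * Assumed policy: count every applicable bit once and ignore bits
 * outside the mask. Returns the number of bits accounted for.
 */
static unsigned int account_gt_correctable(uint32_t err_stat)
{
	uint32_t applicable = err_stat & PVC_COR_ERR_MASK;
	unsigned int hits = 0;

	/* Clear the lowest set bit on each pass until none remain. */
	for (; applicable; applicable &= applicable - 1)
		hits++;

	atomic_fetch_add(&core_compute_correctable, hits);
	return hits;
}
```

Bits that are set in the status register but absent from the mask do
not touch the counter in this sketch.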

Co-developed-by: Himal Prasad Ghimiray <himal.prasad.ghimiray@intel.com>
Signed-off-by: Himal Prasad Ghimiray <himal.prasad.ghimiray@intel.com>
Signed-off-by: Riana Tauro <riana.tauro@intel.com>
---
v2: Add ID's and names as uAPI (Rodrigo)
    Add documentation
    Modify commit message

v3: remove 'error' from counters
    use drmm_kcalloc
    add a for_each for severity
    differentiate error classes and severity in UAPI (Raag)
    Use correctable and uncorrectable in uapi (Pratik / Aravind)

v4: modify enums in UAPI
    improve comments
    add bounds check in handler
    add error mask macro (Raag)
    use atomic_t
    add null pointer checks
---
 drivers/gpu/drm/xe/regs/xe_hw_error_regs.h |  62 ++++++-
 drivers/gpu/drm/xe/xe_hw_error.c           | 199 +++++++++++++++++++--
 2 files changed, 241 insertions(+), 20 deletions(-)

diff --git a/drivers/gpu/drm/xe/regs/xe_hw_error_regs.h b/drivers/gpu/drm/xe/regs/xe_hw_error_regs.h
index c146b9ef44eb..17982a335941 100644
--- a/drivers/gpu/drm/xe/regs/xe_hw_error_regs.h
+++ b/drivers/gpu/drm/xe/regs/xe_hw_error_regs.h
@@ -6,15 +6,59 @@
 #ifndef _XE_HW_ERROR_REGS_H_
 #define _XE_HW_ERROR_REGS_H_
 
-#define HEC_UNCORR_ERR_STATUS(base)                    XE_REG((base) + 0x118)
-#define    UNCORR_FW_REPORTED_ERR                      BIT(6)
+#define HEC_UNCORR_ERR_STATUS(base)			XE_REG((base) + 0x118)
+#define   UNCORR_FW_REPORTED_ERR			REG_BIT(6)
 
-#define HEC_UNCORR_FW_ERR_DW0(base)                    XE_REG((base) + 0x124)
+#define HEC_UNCORR_FW_ERR_DW0(base)			XE_REG((base) + 0x124)
+
+#define ERR_STAT_GT_COR					0x100160
+#define   EU_GRF_COR_ERR				REG_BIT(15)
+#define   EU_IC_COR_ERR					REG_BIT(14)
+#define   SLM_COR_ERR					REG_BIT(13)
+#define   GUC_COR_ERR					REG_BIT(1)
+
+#define ERR_STAT_GT_NONFATAL				0x100164
+#define ERR_STAT_GT_FATAL				0x100168
+#define   EU_GRF_FAT_ERR				REG_BIT(15)
+#define   SLM_FAT_ERR					REG_BIT(13)
+#define   GUC_FAT_ERR					REG_BIT(6)
+#define   FPU_FAT_ERR					REG_BIT(3)
+
+#define ERR_STAT_GT_REG(x)				XE_REG(_PICK_EVEN((x), \
+									  ERR_STAT_GT_COR, \
+									  ERR_STAT_GT_NONFATAL))
+
+#define PVC_COR_ERR_MASK				(GUC_COR_ERR | SLM_COR_ERR | \
+							 EU_IC_COR_ERR | EU_GRF_COR_ERR)
+
+#define PVC_FAT_ERR_MASK				(FPU_FAT_ERR | GUC_FAT_ERR | \
+							EU_GRF_FAT_ERR | SLM_FAT_ERR)
+
+#define DEV_ERR_STAT_NONFATAL				0x100178
+#define DEV_ERR_STAT_CORRECTABLE			0x10017c
+#define DEV_ERR_STAT_REG(x)				XE_REG(_PICK_EVEN((x), \
+									  DEV_ERR_STAT_CORRECTABLE, \
+									  DEV_ERR_STAT_NONFATAL))
+
+#define   XE_CSC_ERROR					17
+#define   XE_GT_ERROR					0
+
+#define ERR_STAT_GT_FATAL_VECTOR_0			0x100260
+#define ERR_STAT_GT_FATAL_VECTOR_1			0x100264
+
+#define ERR_STAT_GT_FATAL_VECTOR_REG(x)			XE_REG(_PICK_EVEN((x), \
+								  ERR_STAT_GT_FATAL_VECTOR_0, \
+								  ERR_STAT_GT_FATAL_VECTOR_1))
+
+#define ERR_STAT_GT_COR_VECTOR_0			0x1002a0
+#define ERR_STAT_GT_COR_VECTOR_1			0x1002a4
+
+#define ERR_STAT_GT_COR_VECTOR_REG(x)			XE_REG(_PICK_EVEN((x), \
+									  ERR_STAT_GT_COR_VECTOR_0, \
+									  ERR_STAT_GT_COR_VECTOR_1))
+
+#define ERR_STAT_GT_VECTOR_REG(hw_err, x)		(hw_err == HARDWARE_ERROR_CORRECTABLE ? \
+							ERR_STAT_GT_COR_VECTOR_REG(x) : \
+							ERR_STAT_GT_FATAL_VECTOR_REG(x))
 
-#define DEV_ERR_STAT_NONFATAL			0x100178
-#define DEV_ERR_STAT_CORRECTABLE		0x10017c
-#define DEV_ERR_STAT_REG(x)			XE_REG(_PICK_EVEN((x), \
-								  DEV_ERR_STAT_CORRECTABLE, \
-								  DEV_ERR_STAT_NONFATAL))
-#define   XE_CSC_ERROR				BIT(17)
 #endif
diff --git a/drivers/gpu/drm/xe/xe_hw_error.c b/drivers/gpu/drm/xe/xe_hw_error.c
index 2019aaaa1ebe..ff31fb322c8a 100644
--- a/drivers/gpu/drm/xe/xe_hw_error.c
+++ b/drivers/gpu/drm/xe/xe_hw_error.c
@@ -3,6 +3,7 @@
  * Copyright © 2025 Intel Corporation
  */
 
+#include <linux/bitmap.h>
 #include <linux/fault-inject.h>
 
 #include "regs/xe_gsc_regs.h"
@@ -15,7 +16,13 @@
 #include "xe_mmio.h"
 #include "xe_survivability_mode.h"
 
-#define  HEC_UNCORR_FW_ERR_BITS 4
+#define  GT_HW_ERROR_MAX_ERR_BITS	16
+#define  HEC_UNCORR_FW_ERR_BITS		4
+#define  XE_RAS_REG_SIZE		32
+
+#define  PVC_ERROR_MASK_SET(hw_err, err_bit) \
+	((hw_err == HARDWARE_ERROR_CORRECTABLE) ? (BIT(err_bit) & PVC_COR_ERR_MASK) : \
+	(BIT(err_bit) & PVC_FAT_ERR_MASK))
 
 extern struct fault_attr inject_csc_hw_error;
 
@@ -28,10 +35,21 @@ static const char * const hec_uncorrected_fw_errors[] = {
 	"Data Corruption"
 };
 
-static bool fault_inject_csc_hw_error(void)
-{
-	return IS_ENABLED(CONFIG_DEBUG_FS) && should_fail(&inject_csc_hw_error, 1);
-}
+static const unsigned long xe_hw_error_map[] = {
+	[XE_GT_ERROR] = DRM_XE_RAS_ERR_COMP_CORE_COMPUTE,
+};
+
+enum gt_vector_regs {
+	ERR_STAT_GT_VECTOR0 = 0,
+	ERR_STAT_GT_VECTOR1,
+	ERR_STAT_GT_VECTOR2,
+	ERR_STAT_GT_VECTOR3,
+	ERR_STAT_GT_VECTOR4,
+	ERR_STAT_GT_VECTOR5,
+	ERR_STAT_GT_VECTOR6,
+	ERR_STAT_GT_VECTOR7,
+	ERR_STAT_GT_VECTOR_MAX
+};
 
 static enum drm_xe_ras_error_severity hw_err_to_severity(enum hardware_error hw_err)
 {
@@ -42,6 +60,11 @@ static enum drm_xe_ras_error_severity hw_err_to_severity(enum hardware_error hw_
 	return DRM_XE_RAS_ERR_SEV_UNCORRECTABLE;
 }
 
+static bool fault_inject_csc_hw_error(void)
+{
+	return IS_ENABLED(CONFIG_DEBUG_FS) && should_fail(&inject_csc_hw_error, 1);
+}
+
 static void csc_hw_error_work(struct work_struct *work)
 {
 	struct xe_tile *tile = container_of(work, typeof(*tile), csc_hw_error_work);
@@ -89,15 +112,126 @@ static void csc_hw_error_handler(struct xe_tile *tile, const enum hardware_error
 	xe_mmio_write32(mmio, HEC_UNCORR_ERR_STATUS(base), err_src);
 }
 
+static void log_hw_error(struct xe_tile *tile, const char *name,
+			 const enum drm_xe_ras_error_severity severity)
+{
+	const char *severity_str = error_severity[severity];
+	struct xe_device *xe = tile_to_xe(tile);
+
+	if (severity == DRM_XE_RAS_ERR_SEV_CORRECTABLE)
+		drm_warn(&xe->drm, "%s %s detected\n", name, severity_str);
+	else
+		drm_err_ratelimited(&xe->drm, "%s %s detected\n", name, severity_str);
+}
+
+static void log_gt_err(struct xe_tile *tile, const char *name, int i, u32 err,
+		       const enum drm_xe_ras_error_severity severity)
+{
+	const char *severity_str = error_severity[severity];
+	struct xe_device *xe = tile_to_xe(tile);
+
+	if (severity == DRM_XE_RAS_ERR_SEV_CORRECTABLE)
+		drm_warn(&xe->drm, "%s %s detected, ERROR_STAT_GT_VECTOR%d:0x%08x\n",
+			 name, severity_str, i, err);
+	else
+		drm_err_ratelimited(&xe->drm, "%s %s detected, ERROR_STAT_GT_VECTOR%d:0x%08x\n",
+				    name, severity_str, i, err);
+}
+
+static void gt_hw_error_handler(struct xe_tile *tile, const enum hardware_error hw_err,
+				u32 error_id)
+{
+	const enum drm_xe_ras_error_severity severity = hw_err_to_severity(hw_err);
+	struct xe_device *xe = tile_to_xe(tile);
+	struct xe_drm_ras *ras = &xe->ras;
+	struct xe_drm_ras_counter *info = ras->info[severity];
+	struct xe_mmio *mmio = &tile->mmio;
+	unsigned long err_stat = 0;
+	int i, len;
+
+	if (xe->info.platform != XE_PVC)
+		return;
+
+	if (!info)
+		return;
+
+	if (hw_err == HARDWARE_ERROR_NONFATAL) {
+		atomic_inc(&info[error_id].counter);
+		log_hw_error(tile, info[error_id].name, severity);
+		return;
+	}
+
+	/* Only ERR_STAT_GT_VECTOR0..3 are applicable for correctable errors */
+	len = (hw_err == HARDWARE_ERROR_CORRECTABLE) ? ERR_STAT_GT_VECTOR4
+						     : ERR_STAT_GT_VECTOR_MAX;
+
+	for (i = 0; i < len; i++) {
+		u32 vector, val;
+
+		vector = xe_mmio_read32(mmio, ERR_STAT_GT_VECTOR_REG(hw_err, i));
+		if (!vector)
+			continue;
+
+		switch (i) {
+		case ERR_STAT_GT_VECTOR0:
+		case ERR_STAT_GT_VECTOR1: {
+			u32 errbit;
+
+			val = hweight32(vector);
+			atomic_add(val, &info[error_id].counter);
+			log_gt_err(tile, "Subslice", i, vector, severity);
+
+			/*
+			 * Error status register is only populated once per error.
+			 * Read the register and clear once.
+			 */
+			if (err_stat)
+				break;
+
+			err_stat = xe_mmio_read32(mmio, ERR_STAT_GT_REG(hw_err));
+			for_each_set_bit(errbit, &err_stat, GT_HW_ERROR_MAX_ERR_BITS) {
+				if (PVC_ERROR_MASK_SET(hw_err, errbit))
+					atomic_inc(&info[error_id].counter);
+			}
+			if (err_stat)
+				xe_mmio_write32(mmio, ERR_STAT_GT_REG(hw_err), err_stat);
+			break;
+		}
+		case ERR_STAT_GT_VECTOR2:
+		case ERR_STAT_GT_VECTOR3:
+			val = hweight32(vector);
+			atomic_add(val, &info[error_id].counter);
+			log_gt_err(tile, "L3 BANK", i, vector, severity);
+			break;
+		case ERR_STAT_GT_VECTOR6:
+			val = hweight32(vector);
+			atomic_add(val, &info[error_id].counter);
+			log_gt_err(tile, "TLB", i, vector, severity);
+			break;
+		case ERR_STAT_GT_VECTOR7:
+			val = hweight32(vector);
+			atomic_add(val, &info[error_id].counter);
+			log_gt_err(tile, "L3 Fabric", i, vector, severity);
+			break;
+		default:
+			log_gt_err(tile, "Undefined", i, vector, severity);
+		}
+
+		xe_mmio_write32(mmio, ERR_STAT_GT_VECTOR_REG(hw_err, i), vector);
+	}
+}
+
 static void hw_error_source_handler(struct xe_tile *tile, const enum hardware_error hw_err)
 {
 	const enum drm_xe_ras_error_severity severity = hw_err_to_severity(hw_err);
 	const char *severity_str = error_severity[severity];
 	struct xe_device *xe = tile_to_xe(tile);
-	unsigned long flags;
-	u32 err_src;
+	struct xe_drm_ras *ras = &xe->ras;
+	struct xe_drm_ras_counter *info = ras->info[severity];
+	unsigned long flags, err_src;
+	u32 err_bit;
 
-	if (xe->info.platform != XE_BATTLEMAGE)
+	if (!IS_DGFX(xe))
 		return;
 
 	spin_lock_irqsave(&xe->irq.lock, flags);
@@ -108,11 +242,53 @@ static void hw_error_source_handler(struct xe_tile *tile, const enum hardware_er
 		goto unlock;
 	}
 
-	if (err_src & XE_CSC_ERROR)
+	/*
+	 * On encountering CSC firmware errors, the graphics device becomes unrecoverable
+	 * so return immediately on error. The only way to recover from these errors is
+	 * firmware flash. The device will enter Runtime Survivability mode when such
+	 * errors are detected.
+	 */
+	if (err_src & XE_CSC_ERROR) {
 		csc_hw_error_handler(tile, hw_err);
+		goto clear_reg;
+	}
 
-	xe_mmio_write32(&tile->mmio, DEV_ERR_STAT_REG(hw_err), err_src);
+	if (!info)
+		goto clear_reg;
+
+	for_each_set_bit(err_bit, &err_src, XE_RAS_REG_SIZE) {
+		const char *name;
+		u32 error_id;
+
+		/* Check error bit is within bounds */
+		if (err_bit >= ARRAY_SIZE(xe_hw_error_map))
+			break;
+
+		error_id = xe_hw_error_map[err_bit];
+
+		/* Check error component is within max */
+		if (!error_id || error_id >= DRM_XE_RAS_ERR_COMP_MAX)
+			continue;
 
+		name = info[error_id].name;
+		if (!name)
+			continue;
+
+		if (severity == DRM_XE_RAS_ERR_SEV_CORRECTABLE) {
+			drm_warn(&xe->drm, HW_ERR
+				 "TILE%d reported %s %s, bit[%d] is set\n",
+				 tile->id, name, severity_str, err_bit);
+		} else {
+			drm_err_ratelimited(&xe->drm, HW_ERR
+					    "TILE%d reported %s %s, bit[%d] is set\n",
+					    tile->id, name, severity_str, err_bit);
+		}
+		if (err_bit == XE_GT_ERROR)
+			gt_hw_error_handler(tile, hw_err, error_id);
+	}
+
+clear_reg:
+	xe_mmio_write32(&tile->mmio, DEV_ERR_STAT_REG(hw_err), err_src);
 unlock:
 	spin_unlock_irqrestore(&xe->irq.lock, flags);
 }
@@ -134,9 +310,10 @@ void xe_hw_error_irq_handler(struct xe_tile *tile, const u32 master_ctl)
 	if (fault_inject_csc_hw_error())
 		schedule_work(&tile->csc_hw_error_work);
 
-	for (hw_err = 0; hw_err < HARDWARE_ERROR_MAX; hw_err++)
+	for (hw_err = 0; hw_err < HARDWARE_ERROR_MAX; hw_err++) {
 		if (master_ctl & ERROR_IRQ(hw_err))
 			hw_error_source_handler(tile, hw_err);
+	}
 }
 
 static int hw_error_info_init(struct xe_device *xe)
-- 
2.47.1


* [PATCH v5 5/5] drm/xe/xe_hw_error: Add support for PVC SoC errors
  2026-02-02  6:43 [PATCH v5 0/5] Introduce DRM_RAS using generic netlink for RAS Riana Tauro
                   ` (3 preceding siblings ...)
  2026-02-02  6:44 ` [PATCH v5 4/5] drm/xe/xe_hw_error: Add support for Core-Compute errors Riana Tauro
@ 2026-02-02  6:44 ` Riana Tauro
  2026-02-05 18:10   ` Raag Jadav
  2026-02-02 16:15 ` ✗ CI.checkpatch: warning for Introduce DRM_RAS using generic netlink for RAS (rev5) Patchwork
                   ` (3 subsequent siblings)
  8 siblings, 1 reply; 24+ messages in thread
From: Riana Tauro @ 2026-02-02  6:44 UTC (permalink / raw)
  To: intel-xe, dri-devel
  Cc: aravind.iddamsetty, anshuman.gupta, rodrigo.vivi, joonas.lahtinen,
	simona.vetter, airlied, pratik.bari, joshua.santosh.ranjan,
	ashwin.kumar.kulkarni, shubham.kumar, ravi.kishore.koppuravuri,
	raag.jadav, Riana Tauro, Himal Prasad Ghimiray

Report the SoC nonfatal/fatal hardware error and update the counters.

$ sudo ynl --family drm_ras --do query-error-counter  --json '{"node-id":0, "error-id":2}'
{'error-id': 2, 'error-name': 'soc-internal', 'error-value': 0}

Co-developed-by: Himal Prasad Ghimiray <himal.prasad.ghimiray@intel.com>
Signed-off-by: Himal Prasad Ghimiray <himal.prasad.ghimiray@intel.com>
Signed-off-by: Riana Tauro <riana.tauro@intel.com>
---
v2: Add IDs and names as uAPI (Rodrigo)

v3: reorder and align arrays
    remove redundant string err
    use REG_BIT
    fix aesthetic review comments (Raag)
    use only correctable/uncorrectable error severity (Aravind)

v4: fix comments
    use master as variable name
    add static_assert (Raag)
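
A rough userspace sketch of the decode step in log_soc_error(), for readers
who want to try the idea outside the driver. The table below is a shortened,
hypothetical stand-in for the 32-entry name tables in this patch, and
count_named_bits() is a made-up helper name. Logging and the atomic counter
update are replaced by a plain return value.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

#define XE_RAS_REG_SIZE	32

/*
 * Shortened demo table (hypothetical). Entries left out here are NULL;
 * the tables in the patch fill them with "Undefined" instead.
 */
static const char *const demo_err_reg[XE_RAS_REG_SIZE] = {
	[2]	= "HBM SS0: Channel0",
	[3]	= "HBM SS0: Channel1",
	[18]	= "ANR MDFI",
};

/*
 * Count the set bits in errstat that map to a named error. Bits whose
 * entry is missing or "Undefined" are skipped, mirroring the strcmp()
 * check in log_soc_error().
 */
static unsigned int count_named_bits(uint32_t errstat,
				     const char *const *names)
{
	unsigned int bit, n = 0;

	assert(names != NULL);

	for (bit = 0; bit < XE_RAS_REG_SIZE; bit++) {
		const char *name = names[bit];

		if (!(errstat & (1u << bit)))
			continue;
		if (!name || !strcmp(name, "Undefined"))
			continue;
		n++;
	}
	return n;
}
```

With bits 2, 5 and 18 set in errstat, the helper returns 2, because bit 5
has no name in the demo table.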
---
 drivers/gpu/drm/xe/regs/xe_hw_error_regs.h |  24 +++
 drivers/gpu/drm/xe/xe_hw_error.c           | 221 ++++++++++++++++++++-
 2 files changed, 244 insertions(+), 1 deletion(-)

diff --git a/drivers/gpu/drm/xe/regs/xe_hw_error_regs.h b/drivers/gpu/drm/xe/regs/xe_hw_error_regs.h
index 17982a335941..a89a07d067fc 100644
--- a/drivers/gpu/drm/xe/regs/xe_hw_error_regs.h
+++ b/drivers/gpu/drm/xe/regs/xe_hw_error_regs.h
@@ -41,6 +41,7 @@
 									  DEV_ERR_STAT_NONFATAL))
 
 #define   XE_CSC_ERROR					17
+#define   XE_SOC_ERROR					16
 #define   XE_GT_ERROR					0
 
 #define ERR_STAT_GT_FATAL_VECTOR_0			0x100260
@@ -61,4 +62,27 @@
 							ERR_STAT_GT_COR_VECTOR_REG(x) : \
 							ERR_STAT_GT_FATAL_VECTOR_REG(x))
 
+#define SOC_PVC_MASTER_BASE				0x282000
+#define SOC_PVC_SLAVE_BASE				0x283000
+
+#define SOC_GCOERRSTS					0x200
+#define SOC_GNFERRSTS					0x210
+#define SOC_GLOBAL_ERR_STAT_REG(base, x)		XE_REG(_PICK_EVEN((x), \
+									  (base) + SOC_GCOERRSTS, \
+									  (base) + SOC_GNFERRSTS))
+#define   SOC_SLAVE_IEH					REG_BIT(1)
+#define   SOC_IEH0_LOCAL_ERR_STATUS			REG_BIT(0)
+#define   SOC_IEH1_LOCAL_ERR_STATUS			REG_BIT(0)
+
+#define SOC_GSYSEVTCTL					0x264
+#define SOC_GSYSEVTCTL_REG(master, slave, x)		XE_REG(_PICK_EVEN((x), \
+									  (master) + SOC_GSYSEVTCTL, \
+									  (slave) + SOC_GSYSEVTCTL))
+
+#define SOC_LERRUNCSTS					0x280
+#define SOC_LERRCORSTS					0x294
+#define SOC_LOCAL_ERR_STAT_REG(base, hw_err)		XE_REG(hw_err == HARDWARE_ERROR_CORRECTABLE ? \
+							       (base) + SOC_LERRCORSTS : \
+							       (base) + SOC_LERRUNCSTS)
+
 #endif
diff --git a/drivers/gpu/drm/xe/xe_hw_error.c b/drivers/gpu/drm/xe/xe_hw_error.c
index ff31fb322c8a..159ec796386a 100644
--- a/drivers/gpu/drm/xe/xe_hw_error.c
+++ b/drivers/gpu/drm/xe/xe_hw_error.c
@@ -19,6 +19,7 @@
 #define  GT_HW_ERROR_MAX_ERR_BITS	16
 #define  HEC_UNCORR_FW_ERR_BITS		4
 #define  XE_RAS_REG_SIZE		32
+#define  XE_SOC_NUM_IEH			2
 
 #define  PVC_ERROR_MASK_SET(hw_err, err_bit) \
 	((hw_err == HARDWARE_ERROR_CORRECTABLE) ? (BIT(err_bit) & PVC_COR_ERR_MASK) : \
@@ -36,7 +37,8 @@ static const char * const hec_uncorrected_fw_errors[] = {
 };
 
 static const unsigned long xe_hw_error_map[] = {
-	[XE_GT_ERROR] = DRM_XE_RAS_ERR_COMP_CORE_COMPUTE,
+	[XE_GT_ERROR]	= DRM_XE_RAS_ERR_COMP_CORE_COMPUTE,
+	[XE_SOC_ERROR]	= DRM_XE_RAS_ERR_COMP_SOC_INTERNAL,
 };
 
 enum gt_vector_regs {
@@ -60,6 +62,102 @@ static enum drm_xe_ras_error_severity hw_err_to_severity(enum hardware_error hw_
 	return DRM_XE_RAS_ERR_SEV_UNCORRECTABLE;
 }
 
+static const char * const pvc_master_global_err_reg[] = {
+	[0 ... 1]	= "Undefined",
+	[2]		= "HBM SS0: Channel0",
+	[3]		= "HBM SS0: Channel1",
+	[4]		= "HBM SS0: Channel2",
+	[5]		= "HBM SS0: Channel3",
+	[6]		= "HBM SS0: Channel4",
+	[7]		= "HBM SS0: Channel5",
+	[8]		= "HBM SS0: Channel6",
+	[9]		= "HBM SS0: Channel7",
+	[10]		= "HBM SS1: Channel0",
+	[11]		= "HBM SS1: Channel1",
+	[12]		= "HBM SS1: Channel2",
+	[13]		= "HBM SS1: Channel3",
+	[14]		= "HBM SS1: Channel4",
+	[15]		= "HBM SS1: Channel5",
+	[16]		= "HBM SS1: Channel6",
+	[17]		= "HBM SS1: Channel7",
+	[18 ... 31]	= "Undefined",
+};
+
+static_assert(ARRAY_SIZE(pvc_master_global_err_reg) == XE_RAS_REG_SIZE);
+
+static const char * const pvc_slave_global_err_reg[] = {
+	[0]		= "Undefined",
+	[1]		= "HBM SS2: Channel0",
+	[2]		= "HBM SS2: Channel1",
+	[3]		= "HBM SS2: Channel2",
+	[4]		= "HBM SS2: Channel3",
+	[5]		= "HBM SS2: Channel4",
+	[6]		= "HBM SS2: Channel5",
+	[7]		= "HBM SS2: Channel6",
+	[8]		= "HBM SS2: Channel7",
+	[9]		= "HBM SS3: Channel0",
+	[10]		= "HBM SS3: Channel1",
+	[11]		= "HBM SS3: Channel2",
+	[12]		= "HBM SS3: Channel3",
+	[13]		= "HBM SS3: Channel4",
+	[14]		= "HBM SS3: Channel5",
+	[15]		= "HBM SS3: Channel6",
+	[16]		= "HBM SS3: Channel7",
+	[17]		= "Undefined",
+	[18]		= "ANR MDFI",
+	[19 ... 31]	= "Undefined",
+};
+
+static_assert(ARRAY_SIZE(pvc_slave_global_err_reg) == XE_RAS_REG_SIZE);
+
+static const char * const pvc_slave_local_fatal_err_reg[] = {
+	[0]		= "Local IEH: Malformed PCIe AER",
+	[1]		= "Local IEH: Malformed PCIe ERR",
+	[2]		= "Local IEH: UR conditions in IEH",
+	[3]		= "Local IEH: From SERR Sources",
+	[4 ... 19]	= "Undefined",
+	[20]		= "Malformed MCA error packet (HBM/Punit)",
+	[21 ... 31]	= "Undefined",
+};
+
+static_assert(ARRAY_SIZE(pvc_slave_local_fatal_err_reg) == XE_RAS_REG_SIZE);
+
+static const char * const pvc_master_local_fatal_err_reg[] = {
+	[0]		= "Local IEH: Malformed IOSF PCIe AER",
+	[1]		= "Local IEH: Malformed IOSF PCIe ERR",
+	[2]		= "Local IEH: UR RESPONSE",
+	[3]		= "Local IEH: From SERR SPI controller",
+	[4]		= "Base Die MDFI T2T",
+	[5]		= "Undefined",
+	[6]		= "Base Die MDFI T2C",
+	[7]		= "Undefined",
+	[8]		= "Invalid CSC PSF Command Parity",
+	[9]		= "Invalid CSC PSF Unexpected Completion",
+	[10]		= "Invalid CSC PSF Unsupported Request",
+	[11]		= "Invalid PCIe PSF Command Parity",
+	[12]		= "PCIe PSF Unexpected Completion",
+	[13]		= "PCIe PSF Unsupported Request",
+	[14 ... 19]	= "Undefined",
+	[20]		= "Malformed MCA error packet (HBM/Punit)",
+	[21 ... 31]	= "Undefined",
+};
+
+static_assert(ARRAY_SIZE(pvc_master_local_fatal_err_reg) == XE_RAS_REG_SIZE);
+
+static const char * const pvc_master_local_nonfatal_err_reg[] = {
+	[0 ... 3]	= "Undefined",
+	[4]		= "Base Die MDFI T2T",
+	[5]		= "Undefined",
+	[6]		= "Base Die MDFI T2C",
+	[7]		= "Undefined",
+	[8]		= "Invalid CSC PSF Command Parity",
+	[9]		= "Invalid CSC PSF Unexpected Completion",
+	[10]		= "Invalid PCIe PSF Command Parity",
+	[11 ... 31]	= "Undefined",
+};
+
+static_assert(ARRAY_SIZE(pvc_master_local_nonfatal_err_reg) == XE_RAS_REG_SIZE);
+
 static bool fault_inject_csc_hw_error(void)
 {
 	return IS_ENABLED(CONFIG_DEBUG_FS) && should_fail(&inject_csc_hw_error, 1);
@@ -138,6 +236,26 @@ static void log_gt_err(struct xe_tile *tile, const char *name, int i, u32 err,
 				    name, severity_str, i, err);
 }
 
+static void log_soc_error(struct xe_tile *tile, const char * const *reg_info,
+			  const enum drm_xe_ras_error_severity severity, u32 err_bit, u32 index)
+{
+	const char *severity_str = error_severity[severity];
+	struct xe_device *xe = tile_to_xe(tile);
+	struct xe_drm_ras *ras = &xe->ras;
+	struct xe_drm_ras_counter *info = ras->info[severity];
+	const char *name;
+
+	name = reg_info[err_bit];
+
+	if (strcmp(name, "Undefined")) {
+		if (severity == DRM_XE_RAS_ERR_SEV_CORRECTABLE)
+			drm_warn(&xe->drm, "%s SOC %s detected\n", name, severity_str);
+		else
+			drm_err_ratelimited(&xe->drm, "%s SOC %s detected\n", name, severity_str);
+		atomic_inc(&info[index].counter);
+	}
+}
+
 static void gt_hw_error_handler(struct xe_tile *tile, const enum hardware_error hw_err,
 				u32 error_id)
 {
@@ -221,6 +339,104 @@ static void gt_hw_error_handler(struct xe_tile *tile, const enum hardware_error
 	}
 }
 
+static void soc_slave_ieh_handler(struct xe_tile *tile, const enum hardware_error hw_err, u32 error_id)
+{
+	const enum drm_xe_ras_error_severity severity = hw_err_to_severity(hw_err);
+	unsigned long slave_global_errstat, slave_local_errstat;
+	struct xe_mmio *mmio = &tile->mmio;
+	u32 regbit, slave_base;
+
+	slave_base = SOC_PVC_SLAVE_BASE;
+	slave_global_errstat = xe_mmio_read32(mmio, SOC_GLOBAL_ERR_STAT_REG(slave_base, hw_err));
+
+	if (slave_global_errstat & SOC_IEH1_LOCAL_ERR_STATUS) {
+		slave_local_errstat = xe_mmio_read32(mmio, SOC_LOCAL_ERR_STAT_REG(slave_base, hw_err));
+
+		if (hw_err == HARDWARE_ERROR_FATAL) {
+			for_each_set_bit(regbit, &slave_local_errstat, XE_RAS_REG_SIZE)
+				log_soc_error(tile, pvc_slave_local_fatal_err_reg, severity,
+					      regbit, error_id);
+		}
+
+		xe_mmio_write32(mmio, SOC_LOCAL_ERR_STAT_REG(slave_base, hw_err),
+				slave_local_errstat);
+	}
+
+	for_each_set_bit(regbit, &slave_global_errstat, XE_RAS_REG_SIZE)
+		log_soc_error(tile, pvc_slave_global_err_reg, severity, regbit, error_id);
+
+	xe_mmio_write32(mmio, SOC_GLOBAL_ERR_STAT_REG(slave_base, hw_err), slave_global_errstat);
+}
+
+static void soc_hw_error_handler(struct xe_tile *tile, const enum hardware_error hw_err,
+				 u32 error_id)
+{
+	const enum drm_xe_ras_error_severity severity = hw_err_to_severity(hw_err);
+	struct xe_device *xe = tile_to_xe(tile);
+	struct xe_mmio *mmio = &tile->mmio;
+	unsigned long master_global_errstat, master_local_errstat;
+	u32 master_base, slave_base, regbit;
+	int i;
+
+	if (xe->info.platform != XE_PVC)
+		return;
+
+	master_base = SOC_PVC_MASTER_BASE;
+	slave_base = SOC_PVC_SLAVE_BASE;
+
+	/* Mask error type in GSYSEVTCTL so that no new errors of the type will be reported */
+	for (i = 0; i < XE_SOC_NUM_IEH; i++)
+		xe_mmio_write32(mmio, SOC_GSYSEVTCTL_REG(master_base, slave_base, i),
+				~REG_BIT(hw_err));
+
+	if (hw_err == HARDWARE_ERROR_CORRECTABLE) {
+		xe_mmio_write32(mmio, SOC_GLOBAL_ERR_STAT_REG(master_base, hw_err),
+				REG_GENMASK(31, 0));
+		xe_mmio_write32(mmio, SOC_LOCAL_ERR_STAT_REG(master_base, hw_err),
+				REG_GENMASK(31, 0));
+		xe_mmio_write32(mmio, SOC_GLOBAL_ERR_STAT_REG(slave_base, hw_err),
+				REG_GENMASK(31, 0));
+		xe_mmio_write32(mmio, SOC_LOCAL_ERR_STAT_REG(slave_base, hw_err),
+				REG_GENMASK(31, 0));
+		goto unmask_gsysevtctl;
+	}
+
+	/*
+	 * Read the master global IEH error register. If BIT(1) is set, process
+	 * the slave IEH first. If BIT(0) in the global error register is set,
+	 * process the corresponding local error registers.
+	 */
+	master_global_errstat = xe_mmio_read32(mmio, SOC_GLOBAL_ERR_STAT_REG(master_base, hw_err));
+	if (master_global_errstat & SOC_SLAVE_IEH)
+		soc_slave_ieh_handler(tile, hw_err, error_id);
+
+	if (master_global_errstat & SOC_IEH0_LOCAL_ERR_STATUS) {
+		master_local_errstat = xe_mmio_read32(mmio, SOC_LOCAL_ERR_STAT_REG(master_base, hw_err));
+
+		for_each_set_bit(regbit, &master_local_errstat, XE_RAS_REG_SIZE) {
+			const char * const *reg_info = (hw_err == HARDWARE_ERROR_FATAL) ?
+						       pvc_master_local_fatal_err_reg :
+						       pvc_master_local_nonfatal_err_reg;
+
+			log_soc_error(tile, reg_info, severity, regbit, error_id);
+		}
+
+		xe_mmio_write32(mmio, SOC_LOCAL_ERR_STAT_REG(master_base, hw_err),
+				master_local_errstat);
+	}
+
+	for_each_set_bit(regbit, &master_global_errstat, XE_RAS_REG_SIZE)
+		log_soc_error(tile, pvc_master_global_err_reg, severity, regbit, error_id);
+
+	xe_mmio_write32(mmio, SOC_GLOBAL_ERR_STAT_REG(master_base, hw_err),
+			master_global_errstat);
+
+unmask_gsysevtctl:
+	for (i = 0; i < XE_SOC_NUM_IEH; i++)
+		xe_mmio_write32(mmio, SOC_GSYSEVTCTL_REG(master_base, slave_base, i),
+				(HARDWARE_ERROR_MAX << 1) + 1);
+}
+
 static void hw_error_source_handler(struct xe_tile *tile, const enum hardware_error hw_err)
 {
 	const enum drm_xe_ras_error_severity severity = hw_err_to_severity(hw_err);
@@ -283,8 +499,11 @@ static void hw_error_source_handler(struct xe_tile *tile, const enum hardware_er
 					    "TILE%d reported %s %s, bit[%d] is set\n",
 					    tile->id, name, severity_str, err_bit);
 		}
+
 		if (err_bit == XE_GT_ERROR)
 			gt_hw_error_handler(tile, hw_err, error_id);
+		if (err_bit == XE_SOC_ERROR)
+			soc_hw_error_handler(tile, hw_err, error_id);
 	}
 
 clear_reg:
-- 
2.47.1


* Re: [PATCH v5 1/5] drm/ras: Introduce the DRM RAS infrastructure over generic netlink
  2026-02-02  6:43 ` [PATCH v5 1/5] drm/ras: Introduce the DRM RAS infrastructure over generic netlink Riana Tauro
@ 2026-02-02 10:08   ` kernel test robot
  2026-02-02 22:52   ` kernel test robot
  1 sibling, 0 replies; 24+ messages in thread
From: kernel test robot @ 2026-02-02 10:08 UTC (permalink / raw)
  To: Riana Tauro, intel-xe, dri-devel
  Cc: oe-kbuild-all, aravind.iddamsetty, anshuman.gupta, rodrigo.vivi,
	joonas.lahtinen, simona.vetter, airlied, pratik.bari,
	joshua.santosh.ranjan, ashwin.kumar.kulkarni, shubham.kumar,
	ravi.kishore.koppuravuri, raag.jadav, Zack McKevitt, Lijo Lazar,
	Hawking Zhang, Jakub Kicinski, Paolo Abeni, Eric Dumazet, netdev,
	Riana Tauro

Hi Riana,

kernel test robot noticed the following build warnings:

[auto build test WARNING on drm-xe/drm-xe-next]
[also build test WARNING on drm-misc/drm-misc-next drm/drm-next linus/master v6.16-rc1 next-20260130]
[If your patch is applied to the wrong git tree, kindly drop us a note.
And when submitting patch, we suggest to use '--base' as documented in
https://git-scm.com/docs/git-format-patch#_base_tree_information]

url:    https://github.com/intel-lab-lkp/linux/commits/Riana-Tauro/drm-ras-Introduce-the-DRM-RAS-infrastructure-over-generic-netlink/20260202-141553
base:   https://gitlab.freedesktop.org/drm/xe/kernel.git drm-xe-next
patch link:    https://lore.kernel.org/r/20260202064356.286243-8-riana.tauro%40intel.com
patch subject: [PATCH v5 1/5] drm/ras: Introduce the DRM RAS infrastructure over generic netlink
reproduce: (https://download.01.org/0day-ci/archive/20260202/202602021142.G5vT8UkJ-lkp@intel.com/reproduce)

If you fix the issue in a separate patch/commit (i.e. not just a new version of
the same patch/commit), kindly add following tags
| Reported-by: kernel test robot <lkp@intel.com>
| Closes: https://lore.kernel.org/oe-kbuild-all/202602021142.G5vT8UkJ-lkp@intel.com/

All warnings (new ones prefixed by >>):

   ERROR: Cannot find file ./include/linux/hdmi.h
   ERROR: Cannot find file ./include/linux/hdmi.h
   WARNING: No kernel-doc for file ./include/linux/hdmi.h
   WARNING: ./drivers/gpu/drm/scheduler/sched_main.c:367 function parameter 'result' not described in 'drm_sched_job_done'
   Documentation/gpu/drm-ras:39: ./drivers/gpu/drm/drm_ras.c:40: ERROR: Unexpected indentation. [docutils]
>> Documentation/gpu/drm-ras:39: ./drivers/gpu/drm/drm_ras.c:41: WARNING: Block quote ends without a blank line; unexpected unindent. [docutils]
   Documentation/gpu/drm-ras:39: ./drivers/gpu/drm/drm_ras.c:46: ERROR: Unexpected indentation. [docutils]
   Documentation/gpu/drm-ras:39: ./drivers/gpu/drm/drm_ras.c:59: ERROR: Unexpected indentation. [docutils]
   Documentation/gpu/drm-ras:39: ./drivers/gpu/drm/drm_ras.c:60: WARNING: Block quote ends without a blank line; unexpected unindent. [docutils]
   Documentation/gpu/drm-uapi:607: ./drivers/gpu/drm/drm_ioctl.c:923: WARNING: Duplicate C declaration, also defined at gpu/drm-uapi:69.
   Declaration is '.. c:function:: bool drm_ioctl_flags (unsigned int nr, unsigned int *flags)'. [duplicate_declaration.c]
--
   ERROR: Cannot find file ./include/linux/mutex.h
   ERROR: Cannot find file ./include/linux/mutex.h
   WARNING: No kernel-doc for file ./include/linux/mutex.h
   ERROR: Cannot find file ./include/linux/fwctl.h
   WARNING: No kernel-doc for file ./include/linux/fwctl.h
>> Documentation/gpu/drm-ras.rst:59: WARNING: undefined label: 'documentation/netlink/specs/drm_ras.yaml' [ref.ref]

-- 
0-DAY CI Kernel Test Service
https://github.com/intel/lkp-tests/wiki

* ✗ CI.checkpatch: warning for Introduce DRM_RAS using generic netlink for RAS (rev5)
  2026-02-02  6:43 [PATCH v5 0/5] Introduce DRM_RAS using generic netlink for RAS Riana Tauro
                   ` (4 preceding siblings ...)
  2026-02-02  6:44 ` [PATCH v5 5/5] drm/xe/xe_hw_error: Add support for PVC SoC errors Riana Tauro
@ 2026-02-02 16:15 ` Patchwork
  2026-02-02 16:16 ` ✓ CI.KUnit: success " Patchwork
                   ` (2 subsequent siblings)
  8 siblings, 0 replies; 24+ messages in thread
From: Patchwork @ 2026-02-02 16:15 UTC (permalink / raw)
  To: Rodrigo Vivi; +Cc: intel-xe

== Series Details ==

Series: Introduce DRM_RAS using generic netlink for RAS (rev5)
URL   : https://patchwork.freedesktop.org/series/155188/
State : warning

== Summary ==

+ KERNEL=/kernel
+ git clone https://gitlab.freedesktop.org/drm/maintainer-tools mt
Cloning into 'mt'...
warning: redirecting to https://gitlab.freedesktop.org/drm/maintainer-tools.git/
+ git -C mt rev-list -n1 origin/master
1f57ba1afceae32108bd24770069f764d940a0e4
+ cd /kernel
+ git config --global --add safe.directory /kernel
+ git log -n1
commit fcca046b31c2fcbb66c93b92a4dccfcfbb7ce6c2
Author: Riana Tauro <riana.tauro@intel.com>
Date:   Mon Feb 2 12:14:01 2026 +0530

    drm/xe/xe_hw_error: Add support for PVC SoC errors
    
    Report the SoC nonfatal/fatal hardware error and update the counters.
    
    $ sudo ynl --family drm_ras --do query-error-counter  --json '{"node-id":0, "error-id":2}'
    {'error-id': 2, 'error-name': 'soc-internal', 'error-value': 0}
    
    Co-developed-by: Himal Prasad Ghimiray <himal.prasad.ghimiray@intel.com>
    Signed-off-by: Himal Prasad Ghimiray <himal.prasad.ghimiray@intel.com>
    Signed-off-by: Riana Tauro <riana.tauro@intel.com>
+ /mt/dim checkpatch 4af5aac5c903c12b53b9cab756c49496f595a27d drm-intel
47e5280eac8e drm/ras: Introduce the DRM RAS infrastructure over generic netlink
-:58: WARNING:FILE_PATH_CHANGES: added, moved or deleted file(s), does MAINTAINERS need updating?
#58: 
new file mode 100644

-:806: WARNING:LONG_LINE: line length of 114 exceeds 100 columns
#806: FILE: drivers/gpu/drm/drm_ras_nl.c:13:
+static const struct nla_policy drm_ras_get_error_counters_nl_policy[DRM_RAS_A_ERROR_COUNTER_ATTRS_NODE_ID + 1] = {

-:811: WARNING:LONG_LINE: line length of 116 exceeds 100 columns
#811: FILE: drivers/gpu/drm/drm_ras_nl.c:18:
+static const struct nla_policy drm_ras_query_error_counter_nl_policy[DRM_RAS_A_ERROR_COUNTER_ATTRS_ERROR_ID + 1] = {

total: 0 errors, 3 warnings, 0 checks, 905 lines checked
3f3ad751454b drm/xe/xe_drm_ras: Add support for XE DRM RAS
-:27: WARNING:COMMIT_LOG_LONG_LINE: Prefer a maximum 75 chars per line (possible unwrapped commit description?)
#27: 
$ sudo ynl --family drm_ras  --dump get-error-counters --json '{"node-id":1}'

-:73: WARNING:FILE_PATH_CHANGES: added, moved or deleted file(s), does MAINTAINERS need updating?
#73: 
new file mode 100644

-:277: CHECK:MACRO_ARG_REUSE: Macro argument reuse 'i' - possible side-effects?
#277: FILE: drivers/gpu/drm/xe/xe_drm_ras.h:10:
+#define for_each_error_severity(i)	\
+	for (i = 0; i < DRM_XE_RAS_ERR_SEV_MAX; i++)

total: 0 errors, 2 warnings, 1 checks, 347 lines checked
eb45f7caee60 drm/xe/xe_hw_error: Integrate DRM RAS with hardware error handling
ad91cbc89f23 drm/xe/xe_hw_error: Add support for Core-Compute errors
-:66: WARNING:LONG_LINE: line length of 101 exceeds 100 columns
#66: FILE: drivers/gpu/drm/xe/regs/xe_hw_error_regs.h:40:
+									  DEV_ERR_STAT_CORRECTABLE, \

-:83: WARNING:LONG_LINE: line length of 101 exceeds 100 columns
#83: FILE: drivers/gpu/drm/xe/regs/xe_hw_error_regs.h:57:
+									  ERR_STAT_GT_COR_VECTOR_0, \

-:86: CHECK:MACRO_ARG_PRECEDENCE: Macro argument 'hw_err' may be better as '(hw_err)' to avoid precedence issues
#86: FILE: drivers/gpu/drm/xe/regs/xe_hw_error_regs.h:60:
+#define ERR_STAT_GT_VECTOR_REG(hw_err, x)		(hw_err == HARDWARE_ERROR_CORRECTABLE ? \
+							ERR_STAT_GT_COR_VECTOR_REG(x) : \
+							ERR_STAT_GT_FATAL_VECTOR_REG(x))

-:86: CHECK:MACRO_ARG_REUSE: Macro argument reuse 'x' - possible side-effects?
#86: FILE: drivers/gpu/drm/xe/regs/xe_hw_error_regs.h:60:
+#define ERR_STAT_GT_VECTOR_REG(hw_err, x)		(hw_err == HARDWARE_ERROR_CORRECTABLE ? \
+							ERR_STAT_GT_COR_VECTOR_REG(x) : \
+							ERR_STAT_GT_FATAL_VECTOR_REG(x))

-:118: CHECK:MACRO_ARG_PRECEDENCE: Macro argument 'hw_err' may be better as '(hw_err)' to avoid precedence issues
#118: FILE: drivers/gpu/drm/xe/xe_hw_error.c:23:
+#define  PVC_ERROR_MASK_SET(hw_err, err_bit) \
+	((hw_err == HARDWARE_ERROR_CORRECTABLE) ? (BIT(err_bit) & PVC_COR_ERR_MASK) : \
+	(BIT(err_bit) & PVC_FAT_ERR_MASK))

-:118: CHECK:MACRO_ARG_REUSE: Macro argument reuse 'err_bit' - possible side-effects?
#118: FILE: drivers/gpu/drm/xe/xe_hw_error.c:23:
+#define  PVC_ERROR_MASK_SET(hw_err, err_bit) \
+	((hw_err == HARDWARE_ERROR_CORRECTABLE) ? (BIT(err_bit) & PVC_COR_ERR_MASK) : \
+	(BIT(err_bit) & PVC_FAT_ERR_MASK))

total: 0 errors, 2 warnings, 4 checks, 320 lines checked
fcca046b31c2 drm/xe/xe_hw_error: Add support for PVC SoC errors
-:8: WARNING:COMMIT_LOG_LONG_LINE: Prefer a maximum 75 chars per line (possible unwrapped commit description?)
#8: 
$ sudo ynl --family drm_ras --do query-error-counter  --json '{"node-id":0, "error-id":2}'

-:36: CHECK:MACRO_ARG_REUSE: Macro argument reuse 'base' - possible side-effects?
#36: FILE: drivers/gpu/drm/xe/regs/xe_hw_error_regs.h:70:
+#define SOC_GLOBAL_ERR_STAT_REG(base, x)		XE_REG(_PICK_EVEN((x), \
+									  (base) + SOC_GCOERRSTS, \
+									  (base) + SOC_GNFERRSTS))

-:45: WARNING:LONG_LINE: line length of 102 exceeds 100 columns
#45: FILE: drivers/gpu/drm/xe/regs/xe_hw_error_regs.h:79:
+									  (master) + SOC_GSYSEVTCTL, \

-:50: WARNING:LONG_LINE: line length of 103 exceeds 100 columns
#50: FILE: drivers/gpu/drm/xe/regs/xe_hw_error_regs.h:84:
+#define SOC_LOCAL_ERR_STAT_REG(base, hw_err)		XE_REG(hw_err == HARDWARE_ERROR_CORRECTABLE ? \

-:50: CHECK:MACRO_ARG_REUSE: Macro argument reuse 'base' - possible side-effects?
#50: FILE: drivers/gpu/drm/xe/regs/xe_hw_error_regs.h:84:
+#define SOC_LOCAL_ERR_STAT_REG(base, hw_err)		XE_REG(hw_err == HARDWARE_ERROR_CORRECTABLE ? \
+							       (base) + SOC_LERRCORSTS : \
+							       (base) + SOC_LERRUNCSTS)

-:50: CHECK:MACRO_ARG_PRECEDENCE: Macro argument 'hw_err' may be better as '(hw_err)' to avoid precedence issues
#50: FILE: drivers/gpu/drm/xe/regs/xe_hw_error_regs.h:84:
+#define SOC_LOCAL_ERR_STAT_REG(base, hw_err)		XE_REG(hw_err == HARDWARE_ERROR_CORRECTABLE ? \
+							       (base) + SOC_LERRCORSTS : \
+							       (base) + SOC_LERRUNCSTS)

-:211: WARNING:LONG_LINE: line length of 103 exceeds 100 columns
#211: FILE: drivers/gpu/drm/xe/xe_hw_error.c:342:
+static void soc_slave_ieh_handler(struct xe_tile *tile, const enum hardware_error hw_err, u32 error_id)

-:222: WARNING:LONG_LINE: line length of 103 exceeds 100 columns
#222: FILE: drivers/gpu/drm/xe/xe_hw_error.c:353:
+		slave_local_errstat = xe_mmio_read32(mmio, SOC_LOCAL_ERR_STAT_REG(slave_base, hw_err));

-:283: WARNING:LONG_LINE: line length of 105 exceeds 100 columns
#283: FILE: drivers/gpu/drm/xe/xe_hw_error.c:414:
+		master_local_errstat = xe_mmio_read32(mmio, SOC_LOCAL_ERR_STAT_REG(master_base, hw_err));

total: 0 errors, 6 warnings, 3 checks, 293 lines checked
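
For readers unfamiliar with the MACRO_ARG_PRECEDENCE checks reported above, the following is a minimal sketch of the failure mode the check guards against. It is not the code from the patch: the macro names, the register offsets, and the helper functions are hypothetical and exist only for illustration. The enum mirrors the HARDWARE_ERROR_CORRECTABLE naming seen in the flagged macros. The callers in the series appear to pass a plain variable, for which the unparenthesized form behaves identically, so this only matters if a caller ever passes a compound expression.

```c
#include <assert.h>

/* Illustrative enum; only the CORRECTABLE name is taken from the log. */
enum hardware_error {
	HARDWARE_ERROR_CORRECTABLE = 0,
	HARDWARE_ERROR_NONFATAL = 1,
	HARDWARE_ERROR_FATAL = 2,
};

/* Hypothetical register offsets, not real hardware values. */
#define DEMO_COR_REG	0x100
#define DEMO_FATAL_REG	0x200

/* Argument used bare, the pattern checkpatch flags. */
#define DEMO_REG_UNSAFE(hw_err) \
	(hw_err == HARDWARE_ERROR_CORRECTABLE ? DEMO_COR_REG : DEMO_FATAL_REG)

/* Argument parenthesized, as checkpatch suggests. */
#define DEMO_REG_SAFE(hw_err) \
	((hw_err) == HARDWARE_ERROR_CORRECTABLE ? DEMO_COR_REG : DEMO_FATAL_REG)

/*
 * Passing a conditional expression as the argument: in the unsafe macro
 * the '==' binds tighter than '?:', so the comparison attaches to the
 * last operand of the caller's expression instead of the whole argument.
 */
unsigned int demo_pick_unsafe(int fatal)
{
	return DEMO_REG_UNSAFE(fatal ? HARDWARE_ERROR_FATAL :
				       HARDWARE_ERROR_CORRECTABLE);
}

unsigned int demo_pick_safe(int fatal)
{
	return DEMO_REG_SAFE(fatal ? HARDWARE_ERROR_FATAL :
				     HARDWARE_ERROR_CORRECTABLE);
}
```

With fatal set, the safe variant selects DEMO_FATAL_REG, while the unsafe variant evaluates to the enum value itself rather than a register offset.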


* ✓ CI.KUnit: success for Introduce DRM_RAS using generic netlink for RAS (rev5)
  2026-02-02  6:43 [PATCH v5 0/5] Introduce DRM_RAS using generic netlink for RAS Riana Tauro
                   ` (5 preceding siblings ...)
  2026-02-02 16:15 ` ✗ CI.checkpatch: warning for Introduce DRM_RAS using generic netlink for RAS (rev5) Patchwork
@ 2026-02-02 16:16 ` Patchwork
  2026-02-02 16:31 ` ✗ CI.checksparse: warning " Patchwork
  2026-02-02 16:51 ` ✓ Xe.CI.BAT: success " Patchwork
  8 siblings, 0 replies; 24+ messages in thread
From: Patchwork @ 2026-02-02 16:16 UTC (permalink / raw)
  To: Rodrigo Vivi; +Cc: intel-xe

== Series Details ==

Series: Introduce DRM_RAS using generic netlink for RAS (rev5)
URL   : https://patchwork.freedesktop.org/series/155188/
State : success

== Summary ==

+ trap cleanup EXIT
+ /kernel/tools/testing/kunit/kunit.py run --kunitconfig /kernel/drivers/gpu/drm/xe/.kunitconfig
[16:15:17] Configuring KUnit Kernel ...
Generating .config ...
Populating config with:
$ make ARCH=um O=.kunit olddefconfig
[16:15:22] Building KUnit Kernel ...
Populating config with:
$ make ARCH=um O=.kunit olddefconfig
Building with:
$ make all compile_commands.json scripts_gdb ARCH=um O=.kunit --jobs=48
[16:15:54] Starting KUnit Kernel (1/1)...
[16:15:54] ============================================================
Running tests with:
$ .kunit/linux kunit.enable=1 mem=1G console=tty kunit_shutdown=halt
[16:15:54] ================== guc_buf (11 subtests) ===================
[16:15:54] [PASSED] test_smallest
[16:15:54] [PASSED] test_largest
[16:15:54] [PASSED] test_granular
[16:15:54] [PASSED] test_unique
[16:15:54] [PASSED] test_overlap
[16:15:54] [PASSED] test_reusable
[16:15:54] [PASSED] test_too_big
[16:15:54] [PASSED] test_flush
[16:15:54] [PASSED] test_lookup
[16:15:54] [PASSED] test_data
[16:15:54] [PASSED] test_class
[16:15:54] ===================== [PASSED] guc_buf =====================
[16:15:54] =================== guc_dbm (7 subtests) ===================
[16:15:54] [PASSED] test_empty
[16:15:54] [PASSED] test_default
[16:15:54] ======================== test_size  ========================
[16:15:54] [PASSED] 4
[16:15:54] [PASSED] 8
[16:15:54] [PASSED] 32
[16:15:54] [PASSED] 256
[16:15:54] ==================== [PASSED] test_size ====================
[16:15:54] ======================= test_reuse  ========================
[16:15:54] [PASSED] 4
[16:15:54] [PASSED] 8
[16:15:54] [PASSED] 32
[16:15:54] [PASSED] 256
[16:15:54] =================== [PASSED] test_reuse ====================
[16:15:54] =================== test_range_overlap  ====================
[16:15:54] [PASSED] 4
[16:15:54] [PASSED] 8
[16:15:54] [PASSED] 32
[16:15:54] [PASSED] 256
[16:15:54] =============== [PASSED] test_range_overlap ================
[16:15:54] =================== test_range_compact  ====================
[16:15:54] [PASSED] 4
[16:15:54] [PASSED] 8
[16:15:54] [PASSED] 32
[16:15:54] [PASSED] 256
[16:15:54] =============== [PASSED] test_range_compact ================
[16:15:54] ==================== test_range_spare  =====================
[16:15:54] [PASSED] 4
[16:15:54] [PASSED] 8
[16:15:54] [PASSED] 32
[16:15:54] [PASSED] 256
[16:15:54] ================ [PASSED] test_range_spare =================
[16:15:54] ===================== [PASSED] guc_dbm =====================
[16:15:54] =================== guc_idm (6 subtests) ===================
[16:15:54] [PASSED] bad_init
[16:15:54] [PASSED] no_init
[16:15:54] [PASSED] init_fini
[16:15:54] [PASSED] check_used
[16:15:54] [PASSED] check_quota
[16:15:54] [PASSED] check_all
[16:15:54] ===================== [PASSED] guc_idm =====================
[16:15:54] ================== no_relay (3 subtests) ===================
[16:15:54] [PASSED] xe_drops_guc2pf_if_not_ready
[16:15:54] [PASSED] xe_drops_guc2vf_if_not_ready
[16:15:54] [PASSED] xe_rejects_send_if_not_ready
[16:15:54] ==================== [PASSED] no_relay =====================
[16:15:54] ================== pf_relay (14 subtests) ==================
[16:15:54] [PASSED] pf_rejects_guc2pf_too_short
[16:15:54] [PASSED] pf_rejects_guc2pf_too_long
[16:15:54] [PASSED] pf_rejects_guc2pf_no_payload
[16:15:54] [PASSED] pf_fails_no_payload
[16:15:54] [PASSED] pf_fails_bad_origin
[16:15:54] [PASSED] pf_fails_bad_type
[16:15:54] [PASSED] pf_txn_reports_error
[16:15:54] [PASSED] pf_txn_sends_pf2guc
[16:15:54] [PASSED] pf_sends_pf2guc
[16:15:54] [SKIPPED] pf_loopback_nop
[16:15:54] [SKIPPED] pf_loopback_echo
[16:15:54] [SKIPPED] pf_loopback_fail
[16:15:54] [SKIPPED] pf_loopback_busy
[16:15:54] [SKIPPED] pf_loopback_retry
[16:15:54] ==================== [PASSED] pf_relay =====================
[16:15:54] ================== vf_relay (3 subtests) ===================
[16:15:54] [PASSED] vf_rejects_guc2vf_too_short
[16:15:54] [PASSED] vf_rejects_guc2vf_too_long
[16:15:54] [PASSED] vf_rejects_guc2vf_no_payload
[16:15:54] ==================== [PASSED] vf_relay =====================
[16:15:54] ================ pf_gt_config (6 subtests) =================
[16:15:54] [PASSED] fair_contexts_1vf
[16:15:54] [PASSED] fair_doorbells_1vf
[16:15:54] [PASSED] fair_ggtt_1vf
[16:15:54] ====================== fair_contexts  ======================
[16:15:54] [PASSED] 1 VF
[16:15:54] [PASSED] 2 VFs
[16:15:54] [PASSED] 3 VFs
[16:15:54] [PASSED] 4 VFs
[16:15:54] [PASSED] 5 VFs
[16:15:54] [PASSED] 6 VFs
[16:15:54] [PASSED] 7 VFs
[16:15:54] [PASSED] 8 VFs
[16:15:54] [PASSED] 9 VFs
[16:15:54] [PASSED] 10 VFs
[16:15:54] [PASSED] 11 VFs
[16:15:54] [PASSED] 12 VFs
[16:15:54] [PASSED] 13 VFs
[16:15:54] [PASSED] 14 VFs
[16:15:54] [PASSED] 15 VFs
[16:15:54] [PASSED] 16 VFs
[16:15:54] [PASSED] 17 VFs
[16:15:54] [PASSED] 18 VFs
[16:15:54] [PASSED] 19 VFs
[16:15:54] [PASSED] 20 VFs
[16:15:54] [PASSED] 21 VFs
[16:15:54] [PASSED] 22 VFs
[16:15:54] [PASSED] 23 VFs
[16:15:54] [PASSED] 24 VFs
[16:15:54] [PASSED] 25 VFs
[16:15:54] [PASSED] 26 VFs
[16:15:54] [PASSED] 27 VFs
[16:15:54] [PASSED] 28 VFs
[16:15:54] [PASSED] 29 VFs
[16:15:54] [PASSED] 30 VFs
[16:15:54] [PASSED] 31 VFs
[16:15:54] [PASSED] 32 VFs
[16:15:54] [PASSED] 33 VFs
[16:15:54] [PASSED] 34 VFs
[16:15:54] [PASSED] 35 VFs
[16:15:54] [PASSED] 36 VFs
[16:15:54] [PASSED] 37 VFs
[16:15:54] [PASSED] 38 VFs
[16:15:54] [PASSED] 39 VFs
[16:15:54] [PASSED] 40 VFs
[16:15:54] [PASSED] 41 VFs
[16:15:54] [PASSED] 42 VFs
[16:15:54] [PASSED] 43 VFs
[16:15:54] [PASSED] 44 VFs
[16:15:54] [PASSED] 45 VFs
[16:15:54] [PASSED] 46 VFs
[16:15:54] [PASSED] 47 VFs
[16:15:54] [PASSED] 48 VFs
[16:15:54] [PASSED] 49 VFs
[16:15:54] [PASSED] 50 VFs
[16:15:54] [PASSED] 51 VFs
[16:15:54] [PASSED] 52 VFs
[16:15:54] [PASSED] 53 VFs
[16:15:54] [PASSED] 54 VFs
[16:15:54] [PASSED] 55 VFs
[16:15:54] [PASSED] 56 VFs
[16:15:54] [PASSED] 57 VFs
[16:15:54] [PASSED] 58 VFs
[16:15:54] [PASSED] 59 VFs
[16:15:54] [PASSED] 60 VFs
[16:15:54] [PASSED] 61 VFs
[16:15:54] [PASSED] 62 VFs
[16:15:54] [PASSED] 63 VFs
[16:15:54] ================== [PASSED] fair_contexts ==================
[16:15:54] ===================== fair_doorbells  ======================
[16:15:54] [PASSED] 1 VF
[16:15:54] [PASSED] 2 VFs
[16:15:54] [PASSED] 3 VFs
[16:15:54] [PASSED] 4 VFs
[16:15:54] [PASSED] 5 VFs
[16:15:54] [PASSED] 6 VFs
[16:15:54] [PASSED] 7 VFs
[16:15:54] [PASSED] 8 VFs
[16:15:54] [PASSED] 9 VFs
[16:15:54] [PASSED] 10 VFs
[16:15:54] [PASSED] 11 VFs
[16:15:54] [PASSED] 12 VFs
[16:15:54] [PASSED] 13 VFs
[16:15:54] [PASSED] 14 VFs
[16:15:54] [PASSED] 15 VFs
[16:15:54] [PASSED] 16 VFs
[16:15:54] [PASSED] 17 VFs
[16:15:54] [PASSED] 18 VFs
[16:15:54] [PASSED] 19 VFs
[16:15:54] [PASSED] 20 VFs
[16:15:54] [PASSED] 21 VFs
[16:15:54] [PASSED] 22 VFs
[16:15:54] [PASSED] 23 VFs
[16:15:54] [PASSED] 24 VFs
[16:15:54] [PASSED] 25 VFs
[16:15:54] [PASSED] 26 VFs
[16:15:54] [PASSED] 27 VFs
[16:15:54] [PASSED] 28 VFs
[16:15:54] [PASSED] 29 VFs
[16:15:54] [PASSED] 30 VFs
[16:15:54] [PASSED] 31 VFs
[16:15:54] [PASSED] 32 VFs
[16:15:54] [PASSED] 33 VFs
[16:15:54] [PASSED] 34 VFs
[16:15:54] [PASSED] 35 VFs
[16:15:54] [PASSED] 36 VFs
[16:15:54] [PASSED] 37 VFs
[16:15:54] [PASSED] 38 VFs
[16:15:54] [PASSED] 39 VFs
[16:15:54] [PASSED] 40 VFs
[16:15:54] [PASSED] 41 VFs
[16:15:54] [PASSED] 42 VFs
[16:15:54] [PASSED] 43 VFs
[16:15:54] [PASSED] 44 VFs
[16:15:54] [PASSED] 45 VFs
[16:15:54] [PASSED] 46 VFs
[16:15:54] [PASSED] 47 VFs
[16:15:54] [PASSED] 48 VFs
[16:15:54] [PASSED] 49 VFs
[16:15:54] [PASSED] 50 VFs
[16:15:54] [PASSED] 51 VFs
[16:15:54] [PASSED] 52 VFs
[16:15:54] [PASSED] 53 VFs
[16:15:54] [PASSED] 54 VFs
[16:15:54] [PASSED] 55 VFs
[16:15:54] [PASSED] 56 VFs
[16:15:54] [PASSED] 57 VFs
[16:15:54] [PASSED] 58 VFs
[16:15:54] [PASSED] 59 VFs
[16:15:54] [PASSED] 60 VFs
[16:15:54] [PASSED] 61 VFs
[16:15:54] [PASSED] 62 VFs
[16:15:54] [PASSED] 63 VFs
[16:15:54] ================= [PASSED] fair_doorbells ==================
[16:15:54] ======================== fair_ggtt  ========================
[16:15:54] [PASSED] 1 VF
[16:15:54] [PASSED] 2 VFs
[16:15:54] [PASSED] 3 VFs
[16:15:54] [PASSED] 4 VFs
[16:15:54] [PASSED] 5 VFs
[16:15:54] [PASSED] 6 VFs
[16:15:54] [PASSED] 7 VFs
[16:15:54] [PASSED] 8 VFs
[16:15:54] [PASSED] 9 VFs
[16:15:54] [PASSED] 10 VFs
[16:15:54] [PASSED] 11 VFs
[16:15:54] [PASSED] 12 VFs
[16:15:54] [PASSED] 13 VFs
[16:15:54] [PASSED] 14 VFs
[16:15:54] [PASSED] 15 VFs
[16:15:54] [PASSED] 16 VFs
[16:15:54] [PASSED] 17 VFs
[16:15:54] [PASSED] 18 VFs
[16:15:54] [PASSED] 19 VFs
[16:15:54] [PASSED] 20 VFs
[16:15:54] [PASSED] 21 VFs
[16:15:54] [PASSED] 22 VFs
[16:15:54] [PASSED] 23 VFs
[16:15:54] [PASSED] 24 VFs
[16:15:54] [PASSED] 25 VFs
[16:15:54] [PASSED] 26 VFs
[16:15:54] [PASSED] 27 VFs
[16:15:54] [PASSED] 28 VFs
[16:15:54] [PASSED] 29 VFs
[16:15:54] [PASSED] 30 VFs
[16:15:54] [PASSED] 31 VFs
[16:15:54] [PASSED] 32 VFs
[16:15:54] [PASSED] 33 VFs
[16:15:54] [PASSED] 34 VFs
[16:15:54] [PASSED] 35 VFs
[16:15:54] [PASSED] 36 VFs
[16:15:54] [PASSED] 37 VFs
[16:15:54] [PASSED] 38 VFs
[16:15:54] [PASSED] 39 VFs
[16:15:54] [PASSED] 40 VFs
[16:15:54] [PASSED] 41 VFs
[16:15:54] [PASSED] 42 VFs
[16:15:54] [PASSED] 43 VFs
[16:15:54] [PASSED] 44 VFs
[16:15:54] [PASSED] 45 VFs
[16:15:54] [PASSED] 46 VFs
[16:15:54] [PASSED] 47 VFs
[16:15:54] [PASSED] 48 VFs
[16:15:54] [PASSED] 49 VFs
[16:15:54] [PASSED] 50 VFs
[16:15:54] [PASSED] 51 VFs
[16:15:54] [PASSED] 52 VFs
[16:15:54] [PASSED] 53 VFs
[16:15:54] [PASSED] 54 VFs
[16:15:54] [PASSED] 55 VFs
[16:15:54] [PASSED] 56 VFs
[16:15:54] [PASSED] 57 VFs
[16:15:54] [PASSED] 58 VFs
[16:15:54] [PASSED] 59 VFs
[16:15:54] [PASSED] 60 VFs
[16:15:54] [PASSED] 61 VFs
[16:15:54] [PASSED] 62 VFs
[16:15:54] [PASSED] 63 VFs
[16:15:54] ==================== [PASSED] fair_ggtt ====================
[16:15:54] ================== [PASSED] pf_gt_config ===================
[16:15:54] ===================== lmtt (1 subtest) =====================
[16:15:54] ======================== test_ops  =========================
[16:15:54] [PASSED] 2-level
[16:15:54] [PASSED] multi-level
[16:15:54] ==================== [PASSED] test_ops =====================
[16:15:54] ====================== [PASSED] lmtt =======================
[16:15:54] ================= pf_service (11 subtests) =================
[16:15:54] [PASSED] pf_negotiate_any
[16:15:54] [PASSED] pf_negotiate_base_match
[16:15:54] [PASSED] pf_negotiate_base_newer
[16:15:54] [PASSED] pf_negotiate_base_next
[16:15:54] [SKIPPED] pf_negotiate_base_older
[16:15:54] [PASSED] pf_negotiate_base_prev
[16:15:54] [PASSED] pf_negotiate_latest_match
[16:15:54] [PASSED] pf_negotiate_latest_newer
[16:15:54] [PASSED] pf_negotiate_latest_next
[16:15:54] [SKIPPED] pf_negotiate_latest_older
[16:15:54] [SKIPPED] pf_negotiate_latest_prev
[16:15:54] =================== [PASSED] pf_service ====================
[16:15:54] ================= xe_guc_g2g (2 subtests) ==================
[16:15:54] ============== xe_live_guc_g2g_kunit_default  ==============
[16:15:54] ========= [SKIPPED] xe_live_guc_g2g_kunit_default ==========
[16:15:54] ============== xe_live_guc_g2g_kunit_allmem  ===============
[16:15:54] ========== [SKIPPED] xe_live_guc_g2g_kunit_allmem ==========
[16:15:54] =================== [SKIPPED] xe_guc_g2g ===================
[16:15:54] =================== xe_mocs (2 subtests) ===================
[16:15:54] ================ xe_live_mocs_kernel_kunit  ================
[16:15:54] =========== [SKIPPED] xe_live_mocs_kernel_kunit ============
[16:15:54] ================ xe_live_mocs_reset_kunit  =================
[16:15:54] ============ [SKIPPED] xe_live_mocs_reset_kunit ============
[16:15:54] ==================== [SKIPPED] xe_mocs =====================
[16:15:54] ================= xe_migrate (2 subtests) ==================
[16:15:54] ================= xe_migrate_sanity_kunit  =================
[16:15:54] ============ [SKIPPED] xe_migrate_sanity_kunit =============
[16:15:54] ================== xe_validate_ccs_kunit  ==================
[16:15:54] ============= [SKIPPED] xe_validate_ccs_kunit ==============
[16:15:54] =================== [SKIPPED] xe_migrate ===================
[16:15:54] ================== xe_dma_buf (1 subtest) ==================
[16:15:54] ==================== xe_dma_buf_kunit  =====================
[16:15:54] ================ [SKIPPED] xe_dma_buf_kunit ================
[16:15:54] =================== [SKIPPED] xe_dma_buf ===================
[16:15:54] ================= xe_bo_shrink (1 subtest) =================
[16:15:54] =================== xe_bo_shrink_kunit  ====================
[16:15:54] =============== [SKIPPED] xe_bo_shrink_kunit ===============
[16:15:54] ================== [SKIPPED] xe_bo_shrink ==================
[16:15:54] ==================== xe_bo (2 subtests) ====================
[16:15:54] ================== xe_ccs_migrate_kunit  ===================
[16:15:54] ============== [SKIPPED] xe_ccs_migrate_kunit ==============
[16:15:54] ==================== xe_bo_evict_kunit  ====================
[16:15:54] =============== [SKIPPED] xe_bo_evict_kunit ================
[16:15:54] ===================== [SKIPPED] xe_bo ======================
[16:15:54] ==================== args (13 subtests) ====================
[16:15:54] [PASSED] count_args_test
[16:15:54] [PASSED] call_args_example
[16:15:54] [PASSED] call_args_test
[16:15:54] [PASSED] drop_first_arg_example
[16:15:54] [PASSED] drop_first_arg_test
[16:15:54] [PASSED] first_arg_example
[16:15:54] [PASSED] first_arg_test
[16:15:54] [PASSED] last_arg_example
[16:15:54] [PASSED] last_arg_test
[16:15:54] [PASSED] pick_arg_example
[16:15:54] [PASSED] if_args_example
[16:15:54] [PASSED] if_args_test
[16:15:54] [PASSED] sep_comma_example
[16:15:54] ====================== [PASSED] args =======================
[16:15:54] =================== xe_pci (3 subtests) ====================
[16:15:54] ==================== check_graphics_ip  ====================
[16:15:54] [PASSED] 12.00 Xe_LP
[16:15:54] [PASSED] 12.10 Xe_LP+
[16:15:54] [PASSED] 12.55 Xe_HPG
[16:15:54] [PASSED] 12.60 Xe_HPC
[16:15:54] [PASSED] 12.70 Xe_LPG
[16:15:54] [PASSED] 12.71 Xe_LPG
[16:15:54] [PASSED] 12.74 Xe_LPG+
[16:15:54] [PASSED] 20.01 Xe2_HPG
[16:15:54] [PASSED] 20.02 Xe2_HPG
[16:15:54] [PASSED] 20.04 Xe2_LPG
[16:15:54] [PASSED] 30.00 Xe3_LPG
[16:15:54] [PASSED] 30.01 Xe3_LPG
[16:15:54] [PASSED] 30.03 Xe3_LPG
[16:15:54] [PASSED] 30.04 Xe3_LPG
[16:15:54] [PASSED] 30.05 Xe3_LPG
[16:15:54] [PASSED] 35.11 Xe3p_XPC
[16:15:54] ================ [PASSED] check_graphics_ip ================
[16:15:54] ===================== check_media_ip  ======================
[16:15:54] [PASSED] 12.00 Xe_M
[16:15:54] [PASSED] 12.55 Xe_HPM
[16:15:54] [PASSED] 13.00 Xe_LPM+
[16:15:54] [PASSED] 13.01 Xe2_HPM
[16:15:54] [PASSED] 20.00 Xe2_LPM
[16:15:54] [PASSED] 30.00 Xe3_LPM
[16:15:54] [PASSED] 30.02 Xe3_LPM
[16:15:54] [PASSED] 35.00 Xe3p_LPM
[16:15:54] [PASSED] 35.03 Xe3p_HPM
[16:15:54] ================= [PASSED] check_media_ip ==================
[16:15:54] =================== check_platform_desc  ===================
[16:15:54] [PASSED] 0x9A60 (TIGERLAKE)
[16:15:54] [PASSED] 0x9A68 (TIGERLAKE)
[16:15:54] [PASSED] 0x9A70 (TIGERLAKE)
[16:15:54] [PASSED] 0x9A40 (TIGERLAKE)
[16:15:54] [PASSED] 0x9A49 (TIGERLAKE)
[16:15:54] [PASSED] 0x9A59 (TIGERLAKE)
[16:15:54] [PASSED] 0x9A78 (TIGERLAKE)
[16:15:54] [PASSED] 0x9AC0 (TIGERLAKE)
[16:15:54] [PASSED] 0x9AC9 (TIGERLAKE)
[16:15:54] [PASSED] 0x9AD9 (TIGERLAKE)
[16:15:54] [PASSED] 0x9AF8 (TIGERLAKE)
[16:15:54] [PASSED] 0x4C80 (ROCKETLAKE)
[16:15:54] [PASSED] 0x4C8A (ROCKETLAKE)
[16:15:54] [PASSED] 0x4C8B (ROCKETLAKE)
[16:15:54] [PASSED] 0x4C8C (ROCKETLAKE)
[16:15:54] [PASSED] 0x4C90 (ROCKETLAKE)
[16:15:54] [PASSED] 0x4C9A (ROCKETLAKE)
[16:15:54] [PASSED] 0x4680 (ALDERLAKE_S)
[16:15:54] [PASSED] 0x4682 (ALDERLAKE_S)
[16:15:54] [PASSED] 0x4688 (ALDERLAKE_S)
[16:15:54] [PASSED] 0x468A (ALDERLAKE_S)
[16:15:54] [PASSED] 0x468B (ALDERLAKE_S)
[16:15:54] [PASSED] 0x4690 (ALDERLAKE_S)
[16:15:54] [PASSED] 0x4692 (ALDERLAKE_S)
[16:15:54] [PASSED] 0x4693 (ALDERLAKE_S)
[16:15:54] [PASSED] 0x46A0 (ALDERLAKE_P)
[16:15:54] [PASSED] 0x46A1 (ALDERLAKE_P)
[16:15:54] [PASSED] 0x46A2 (ALDERLAKE_P)
[16:15:54] [PASSED] 0x46A3 (ALDERLAKE_P)
[16:15:54] [PASSED] 0x46A6 (ALDERLAKE_P)
[16:15:54] [PASSED] 0x46A8 (ALDERLAKE_P)
[16:15:54] [PASSED] 0x46AA (ALDERLAKE_P)
[16:15:54] [PASSED] 0x462A (ALDERLAKE_P)
[16:15:54] [PASSED] 0x4626 (ALDERLAKE_P)
[16:15:54] [PASSED] 0x4628 (ALDERLAKE_P)
stty: 'standard input': Inappropriate ioctl for device
[16:15:54] [PASSED] 0x46B0 (ALDERLAKE_P)
[16:15:54] [PASSED] 0x46B1 (ALDERLAKE_P)
[16:15:54] [PASSED] 0x46B2 (ALDERLAKE_P)
[16:15:54] [PASSED] 0x46B3 (ALDERLAKE_P)
[16:15:54] [PASSED] 0x46C0 (ALDERLAKE_P)
[16:15:54] [PASSED] 0x46C1 (ALDERLAKE_P)
[16:15:54] [PASSED] 0x46C2 (ALDERLAKE_P)
[16:15:54] [PASSED] 0x46C3 (ALDERLAKE_P)
[16:15:54] [PASSED] 0x46D0 (ALDERLAKE_N)
[16:15:54] [PASSED] 0x46D1 (ALDERLAKE_N)
[16:15:54] [PASSED] 0x46D2 (ALDERLAKE_N)
[16:15:54] [PASSED] 0x46D3 (ALDERLAKE_N)
[16:15:54] [PASSED] 0x46D4 (ALDERLAKE_N)
[16:15:54] [PASSED] 0xA721 (ALDERLAKE_P)
[16:15:54] [PASSED] 0xA7A1 (ALDERLAKE_P)
[16:15:54] [PASSED] 0xA7A9 (ALDERLAKE_P)
[16:15:54] [PASSED] 0xA7AC (ALDERLAKE_P)
[16:15:54] [PASSED] 0xA7AD (ALDERLAKE_P)
[16:15:54] [PASSED] 0xA720 (ALDERLAKE_P)
[16:15:54] [PASSED] 0xA7A0 (ALDERLAKE_P)
[16:15:54] [PASSED] 0xA7A8 (ALDERLAKE_P)
[16:15:54] [PASSED] 0xA7AA (ALDERLAKE_P)
[16:15:54] [PASSED] 0xA7AB (ALDERLAKE_P)
[16:15:54] [PASSED] 0xA780 (ALDERLAKE_S)
[16:15:54] [PASSED] 0xA781 (ALDERLAKE_S)
[16:15:54] [PASSED] 0xA782 (ALDERLAKE_S)
[16:15:54] [PASSED] 0xA783 (ALDERLAKE_S)
[16:15:54] [PASSED] 0xA788 (ALDERLAKE_S)
[16:15:54] [PASSED] 0xA789 (ALDERLAKE_S)
[16:15:54] [PASSED] 0xA78A (ALDERLAKE_S)
[16:15:54] [PASSED] 0xA78B (ALDERLAKE_S)
[16:15:54] [PASSED] 0x4905 (DG1)
[16:15:54] [PASSED] 0x4906 (DG1)
[16:15:54] [PASSED] 0x4907 (DG1)
[16:15:54] [PASSED] 0x4908 (DG1)
[16:15:54] [PASSED] 0x4909 (DG1)
[16:15:54] [PASSED] 0x56C0 (DG2)
[16:15:54] [PASSED] 0x56C2 (DG2)
[16:15:54] [PASSED] 0x56C1 (DG2)
[16:15:54] [PASSED] 0x7D51 (METEORLAKE)
[16:15:54] [PASSED] 0x7DD1 (METEORLAKE)
[16:15:54] [PASSED] 0x7D41 (METEORLAKE)
[16:15:54] [PASSED] 0x7D67 (METEORLAKE)
[16:15:54] [PASSED] 0xB640 (METEORLAKE)
[16:15:54] [PASSED] 0x56A0 (DG2)
[16:15:54] [PASSED] 0x56A1 (DG2)
[16:15:54] [PASSED] 0x56A2 (DG2)
[16:15:54] [PASSED] 0x56BE (DG2)
[16:15:54] [PASSED] 0x56BF (DG2)
[16:15:54] [PASSED] 0x5690 (DG2)
[16:15:54] [PASSED] 0x5691 (DG2)
[16:15:54] [PASSED] 0x5692 (DG2)
[16:15:54] [PASSED] 0x56A5 (DG2)
[16:15:54] [PASSED] 0x56A6 (DG2)
[16:15:54] [PASSED] 0x56B0 (DG2)
[16:15:54] [PASSED] 0x56B1 (DG2)
[16:15:54] [PASSED] 0x56BA (DG2)
[16:15:54] [PASSED] 0x56BB (DG2)
[16:15:54] [PASSED] 0x56BC (DG2)
[16:15:54] [PASSED] 0x56BD (DG2)
[16:15:54] [PASSED] 0x5693 (DG2)
[16:15:54] [PASSED] 0x5694 (DG2)
[16:15:54] [PASSED] 0x5695 (DG2)
[16:15:54] [PASSED] 0x56A3 (DG2)
[16:15:54] [PASSED] 0x56A4 (DG2)
[16:15:54] [PASSED] 0x56B2 (DG2)
[16:15:54] [PASSED] 0x56B3 (DG2)
[16:15:54] [PASSED] 0x5696 (DG2)
[16:15:54] [PASSED] 0x5697 (DG2)
[16:15:54] [PASSED] 0xB69 (PVC)
[16:15:54] [PASSED] 0xB6E (PVC)
[16:15:54] [PASSED] 0xBD4 (PVC)
[16:15:54] [PASSED] 0xBD5 (PVC)
[16:15:54] [PASSED] 0xBD6 (PVC)
[16:15:54] [PASSED] 0xBD7 (PVC)
[16:15:54] [PASSED] 0xBD8 (PVC)
[16:15:54] [PASSED] 0xBD9 (PVC)
[16:15:54] [PASSED] 0xBDA (PVC)
[16:15:54] [PASSED] 0xBDB (PVC)
[16:15:54] [PASSED] 0xBE0 (PVC)
[16:15:54] [PASSED] 0xBE1 (PVC)
[16:15:54] [PASSED] 0xBE5 (PVC)
[16:15:54] [PASSED] 0x7D40 (METEORLAKE)
[16:15:54] [PASSED] 0x7D45 (METEORLAKE)
[16:15:54] [PASSED] 0x7D55 (METEORLAKE)
[16:15:54] [PASSED] 0x7D60 (METEORLAKE)
[16:15:54] [PASSED] 0x7DD5 (METEORLAKE)
[16:15:54] [PASSED] 0x6420 (LUNARLAKE)
[16:15:54] [PASSED] 0x64A0 (LUNARLAKE)
[16:15:54] [PASSED] 0x64B0 (LUNARLAKE)
[16:15:54] [PASSED] 0xE202 (BATTLEMAGE)
[16:15:54] [PASSED] 0xE209 (BATTLEMAGE)
[16:15:54] [PASSED] 0xE20B (BATTLEMAGE)
[16:15:54] [PASSED] 0xE20C (BATTLEMAGE)
[16:15:54] [PASSED] 0xE20D (BATTLEMAGE)
[16:15:54] [PASSED] 0xE210 (BATTLEMAGE)
[16:15:54] [PASSED] 0xE211 (BATTLEMAGE)
[16:15:54] [PASSED] 0xE212 (BATTLEMAGE)
[16:15:54] [PASSED] 0xE216 (BATTLEMAGE)
[16:15:54] [PASSED] 0xE220 (BATTLEMAGE)
[16:15:54] [PASSED] 0xE221 (BATTLEMAGE)
[16:15:54] [PASSED] 0xE222 (BATTLEMAGE)
[16:15:54] [PASSED] 0xE223 (BATTLEMAGE)
[16:15:54] [PASSED] 0xB080 (PANTHERLAKE)
[16:15:54] [PASSED] 0xB081 (PANTHERLAKE)
[16:15:54] [PASSED] 0xB082 (PANTHERLAKE)
[16:15:54] [PASSED] 0xB083 (PANTHERLAKE)
[16:15:54] [PASSED] 0xB084 (PANTHERLAKE)
[16:15:54] [PASSED] 0xB085 (PANTHERLAKE)
[16:15:54] [PASSED] 0xB086 (PANTHERLAKE)
[16:15:54] [PASSED] 0xB087 (PANTHERLAKE)
[16:15:54] [PASSED] 0xB08F (PANTHERLAKE)
[16:15:54] [PASSED] 0xB090 (PANTHERLAKE)
[16:15:54] [PASSED] 0xB0A0 (PANTHERLAKE)
[16:15:54] [PASSED] 0xB0B0 (PANTHERLAKE)
[16:15:54] [PASSED] 0xFD80 (PANTHERLAKE)
[16:15:54] [PASSED] 0xFD81 (PANTHERLAKE)
[16:15:54] [PASSED] 0xD740 (NOVALAKE_S)
[16:15:54] [PASSED] 0xD741 (NOVALAKE_S)
[16:15:54] [PASSED] 0xD742 (NOVALAKE_S)
[16:15:54] [PASSED] 0xD743 (NOVALAKE_S)
[16:15:54] [PASSED] 0xD744 (NOVALAKE_S)
[16:15:54] [PASSED] 0xD745 (NOVALAKE_S)
[16:15:54] [PASSED] 0x674C (CRESCENTISLAND)
[16:15:54] =============== [PASSED] check_platform_desc ===============
[16:15:54] ===================== [PASSED] xe_pci ======================
[16:15:54] =================== xe_rtp (2 subtests) ====================
[16:15:54] =============== xe_rtp_process_to_sr_tests  ================
[16:15:54] [PASSED] coalesce-same-reg
[16:15:54] [PASSED] no-match-no-add
[16:15:54] [PASSED] match-or
[16:15:54] [PASSED] match-or-xfail
[16:15:54] [PASSED] no-match-no-add-multiple-rules
[16:15:54] [PASSED] two-regs-two-entries
[16:15:54] [PASSED] clr-one-set-other
[16:15:54] [PASSED] set-field
[16:15:54] [PASSED] conflict-duplicate
[16:15:54] [PASSED] conflict-not-disjoint
[16:15:54] [PASSED] conflict-reg-type
[16:15:54] =========== [PASSED] xe_rtp_process_to_sr_tests ============
[16:15:54] ================== xe_rtp_process_tests  ===================
[16:15:54] [PASSED] active1
[16:15:54] [PASSED] active2
[16:15:54] [PASSED] active-inactive
[16:15:54] [PASSED] inactive-active
[16:15:54] [PASSED] inactive-1st_or_active-inactive
[16:15:54] [PASSED] inactive-2nd_or_active-inactive
[16:15:54] [PASSED] inactive-last_or_active-inactive
[16:15:54] [PASSED] inactive-no_or_active-inactive
[16:15:54] ============== [PASSED] xe_rtp_process_tests ===============
[16:15:54] ===================== [PASSED] xe_rtp ======================
[16:15:54] ==================== xe_wa (1 subtest) =====================
[16:15:54] ======================== xe_wa_gt  =========================
[16:15:54] [PASSED] TIGERLAKE B0
[16:15:54] [PASSED] DG1 A0
[16:15:54] [PASSED] DG1 B0
[16:15:54] [PASSED] ALDERLAKE_S A0
[16:15:54] [PASSED] ALDERLAKE_S B0
[16:15:54] [PASSED] ALDERLAKE_S C0
[16:15:54] [PASSED] ALDERLAKE_S D0
[16:15:54] [PASSED] ALDERLAKE_P A0
[16:15:54] [PASSED] ALDERLAKE_P B0
[16:15:54] [PASSED] ALDERLAKE_P C0
[16:15:54] [PASSED] ALDERLAKE_S RPLS D0
[16:15:54] [PASSED] ALDERLAKE_P RPLU E0
[16:15:54] [PASSED] DG2 G10 C0
[16:15:54] [PASSED] DG2 G11 B1
[16:15:54] [PASSED] DG2 G12 A1
[16:15:54] [PASSED] METEORLAKE 12.70(Xe_LPG) A0 13.00(Xe_LPM+) A0
[16:15:54] [PASSED] METEORLAKE 12.71(Xe_LPG) A0 13.00(Xe_LPM+) A0
[16:15:54] [PASSED] METEORLAKE 12.74(Xe_LPG+) A0 13.00(Xe_LPM+) A0
[16:15:54] [PASSED] LUNARLAKE 20.04(Xe2_LPG) A0 20.00(Xe2_LPM) A0
[16:15:54] [PASSED] LUNARLAKE 20.04(Xe2_LPG) B0 20.00(Xe2_LPM) A0
[16:15:54] [PASSED] BATTLEMAGE 20.01(Xe2_HPG) A0 13.01(Xe2_HPM) A1
[16:15:54] [PASSED] PANTHERLAKE 30.00(Xe3_LPG) A0 30.00(Xe3_LPM) A0
[16:15:54] ==================== [PASSED] xe_wa_gt =====================
[16:15:54] ====================== [PASSED] xe_wa ======================
[16:15:54] ============================================================
[16:15:54] Testing complete. Ran 512 tests: passed: 494, skipped: 18
[16:15:54] Elapsed time: 36.899s total, 4.715s configuring, 31.666s building, 0.471s running

+ /kernel/tools/testing/kunit/kunit.py run --kunitconfig /kernel/drivers/gpu/drm/tests/.kunitconfig
[16:15:54] Configuring KUnit Kernel ...
Regenerating .config ...
Populating config with:
$ make ARCH=um O=.kunit olddefconfig
[16:15:56] Building KUnit Kernel ...
Populating config with:
$ make ARCH=um O=.kunit olddefconfig
Building with:
$ make all compile_commands.json scripts_gdb ARCH=um O=.kunit --jobs=48
[16:16:21] Starting KUnit Kernel (1/1)...
[16:16:21] ============================================================
Running tests with:
$ .kunit/linux kunit.enable=1 mem=1G console=tty kunit_shutdown=halt
[16:16:21] ============ drm_test_pick_cmdline (2 subtests) ============
[16:16:21] [PASSED] drm_test_pick_cmdline_res_1920_1080_60
[16:16:21] =============== drm_test_pick_cmdline_named  ===============
[16:16:21] [PASSED] NTSC
[16:16:21] [PASSED] NTSC-J
[16:16:21] [PASSED] PAL
[16:16:21] [PASSED] PAL-M
[16:16:21] =========== [PASSED] drm_test_pick_cmdline_named ===========
[16:16:21] ============== [PASSED] drm_test_pick_cmdline ==============
[16:16:21] == drm_test_atomic_get_connector_for_encoder (1 subtest) ===
[16:16:21] [PASSED] drm_test_drm_atomic_get_connector_for_encoder
[16:16:21] ==== [PASSED] drm_test_atomic_get_connector_for_encoder ====
[16:16:21] =========== drm_validate_clone_mode (2 subtests) ===========
[16:16:21] ============== drm_test_check_in_clone_mode  ===============
[16:16:21] [PASSED] in_clone_mode
[16:16:21] [PASSED] not_in_clone_mode
[16:16:21] ========== [PASSED] drm_test_check_in_clone_mode ===========
[16:16:21] =============== drm_test_check_valid_clones  ===============
[16:16:21] [PASSED] not_in_clone_mode
[16:16:21] [PASSED] valid_clone
[16:16:21] [PASSED] invalid_clone
[16:16:21] =========== [PASSED] drm_test_check_valid_clones ===========
[16:16:21] ============= [PASSED] drm_validate_clone_mode =============
[16:16:21] ============= drm_validate_modeset (1 subtest) =============
[16:16:21] [PASSED] drm_test_check_connector_changed_modeset
[16:16:21] ============== [PASSED] drm_validate_modeset ===============
[16:16:21] ====== drm_test_bridge_get_current_state (2 subtests) ======
[16:16:21] [PASSED] drm_test_drm_bridge_get_current_state_atomic
[16:16:21] [PASSED] drm_test_drm_bridge_get_current_state_legacy
[16:16:21] ======== [PASSED] drm_test_bridge_get_current_state ========
[16:16:21] ====== drm_test_bridge_helper_reset_crtc (3 subtests) ======
[16:16:21] [PASSED] drm_test_drm_bridge_helper_reset_crtc_atomic
[16:16:21] [PASSED] drm_test_drm_bridge_helper_reset_crtc_atomic_disabled
[16:16:21] [PASSED] drm_test_drm_bridge_helper_reset_crtc_legacy
[16:16:21] ======== [PASSED] drm_test_bridge_helper_reset_crtc ========
[16:16:21] ============== drm_bridge_alloc (2 subtests) ===============
[16:16:21] [PASSED] drm_test_drm_bridge_alloc_basic
[16:16:21] [PASSED] drm_test_drm_bridge_alloc_get_put
[16:16:21] ================ [PASSED] drm_bridge_alloc =================
[16:16:21] ================== drm_buddy (9 subtests) ==================
[16:16:21] [PASSED] drm_test_buddy_alloc_limit
[16:16:21] [PASSED] drm_test_buddy_alloc_optimistic
[16:16:21] [PASSED] drm_test_buddy_alloc_pessimistic
[16:16:21] [PASSED] drm_test_buddy_alloc_pathological
[16:16:21] [PASSED] drm_test_buddy_alloc_contiguous
[16:16:21] [PASSED] drm_test_buddy_alloc_clear
[16:16:22] [PASSED] drm_test_buddy_alloc_range_bias
[16:16:22] [PASSED] drm_test_buddy_fragmentation_performance
[16:16:22] [PASSED] drm_test_buddy_alloc_exceeds_max_order
[16:16:22] ==================== [PASSED] drm_buddy ====================
[16:16:22] ============= drm_cmdline_parser (40 subtests) =============
[16:16:22] [PASSED] drm_test_cmdline_force_d_only
[16:16:22] [PASSED] drm_test_cmdline_force_D_only_dvi
[16:16:22] [PASSED] drm_test_cmdline_force_D_only_hdmi
[16:16:22] [PASSED] drm_test_cmdline_force_D_only_not_digital
[16:16:22] [PASSED] drm_test_cmdline_force_e_only
[16:16:22] [PASSED] drm_test_cmdline_res
[16:16:22] [PASSED] drm_test_cmdline_res_vesa
[16:16:22] [PASSED] drm_test_cmdline_res_vesa_rblank
[16:16:22] [PASSED] drm_test_cmdline_res_rblank
[16:16:22] [PASSED] drm_test_cmdline_res_bpp
[16:16:22] [PASSED] drm_test_cmdline_res_refresh
[16:16:22] [PASSED] drm_test_cmdline_res_bpp_refresh
[16:16:22] [PASSED] drm_test_cmdline_res_bpp_refresh_interlaced
[16:16:22] [PASSED] drm_test_cmdline_res_bpp_refresh_margins
[16:16:22] [PASSED] drm_test_cmdline_res_bpp_refresh_force_off
[16:16:22] [PASSED] drm_test_cmdline_res_bpp_refresh_force_on
[16:16:22] [PASSED] drm_test_cmdline_res_bpp_refresh_force_on_analog
[16:16:22] [PASSED] drm_test_cmdline_res_bpp_refresh_force_on_digital
[16:16:22] [PASSED] drm_test_cmdline_res_bpp_refresh_interlaced_margins_force_on
[16:16:22] [PASSED] drm_test_cmdline_res_margins_force_on
[16:16:22] [PASSED] drm_test_cmdline_res_vesa_margins
[16:16:22] [PASSED] drm_test_cmdline_name
[16:16:22] [PASSED] drm_test_cmdline_name_bpp
[16:16:22] [PASSED] drm_test_cmdline_name_option
[16:16:22] [PASSED] drm_test_cmdline_name_bpp_option
[16:16:22] [PASSED] drm_test_cmdline_rotate_0
[16:16:22] [PASSED] drm_test_cmdline_rotate_90
[16:16:22] [PASSED] drm_test_cmdline_rotate_180
[16:16:22] [PASSED] drm_test_cmdline_rotate_270
[16:16:22] [PASSED] drm_test_cmdline_hmirror
[16:16:22] [PASSED] drm_test_cmdline_vmirror
[16:16:22] [PASSED] drm_test_cmdline_margin_options
[16:16:22] [PASSED] drm_test_cmdline_multiple_options
[16:16:22] [PASSED] drm_test_cmdline_bpp_extra_and_option
[16:16:22] [PASSED] drm_test_cmdline_extra_and_option
[16:16:22] [PASSED] drm_test_cmdline_freestanding_options
[16:16:22] [PASSED] drm_test_cmdline_freestanding_force_e_and_options
[16:16:22] [PASSED] drm_test_cmdline_panel_orientation
[16:16:22] ================ drm_test_cmdline_invalid  =================
[16:16:22] [PASSED] margin_only
[16:16:22] [PASSED] interlace_only
[16:16:22] [PASSED] res_missing_x
[16:16:22] [PASSED] res_missing_y
[16:16:22] [PASSED] res_bad_y
[16:16:22] [PASSED] res_missing_y_bpp
[16:16:22] [PASSED] res_bad_bpp
[16:16:22] [PASSED] res_bad_refresh
[16:16:22] [PASSED] res_bpp_refresh_force_on_off
[16:16:22] [PASSED] res_invalid_mode
[16:16:22] [PASSED] res_bpp_wrong_place_mode
[16:16:22] [PASSED] name_bpp_refresh
[16:16:22] [PASSED] name_refresh
[16:16:22] [PASSED] name_refresh_wrong_mode
[16:16:22] [PASSED] name_refresh_invalid_mode
[16:16:22] [PASSED] rotate_multiple
[16:16:22] [PASSED] rotate_invalid_val
[16:16:22] [PASSED] rotate_truncated
[16:16:22] [PASSED] invalid_option
[16:16:22] [PASSED] invalid_tv_option
[16:16:22] [PASSED] truncated_tv_option
[16:16:22] ============ [PASSED] drm_test_cmdline_invalid =============
[16:16:22] =============== drm_test_cmdline_tv_options  ===============
[16:16:22] [PASSED] NTSC
[16:16:22] [PASSED] NTSC_443
[16:16:22] [PASSED] NTSC_J
[16:16:22] [PASSED] PAL
[16:16:22] [PASSED] PAL_M
[16:16:22] [PASSED] PAL_N
[16:16:22] [PASSED] SECAM
[16:16:22] [PASSED] MONO_525
[16:16:22] [PASSED] MONO_625
[16:16:22] =========== [PASSED] drm_test_cmdline_tv_options ===========
[16:16:22] =============== [PASSED] drm_cmdline_parser ================
[16:16:22] ========== drmm_connector_hdmi_init (20 subtests) ==========
[16:16:22] [PASSED] drm_test_connector_hdmi_init_valid
[16:16:22] [PASSED] drm_test_connector_hdmi_init_bpc_8
[16:16:22] [PASSED] drm_test_connector_hdmi_init_bpc_10
[16:16:22] [PASSED] drm_test_connector_hdmi_init_bpc_12
[16:16:22] [PASSED] drm_test_connector_hdmi_init_bpc_invalid
[16:16:22] [PASSED] drm_test_connector_hdmi_init_bpc_null
[16:16:22] [PASSED] drm_test_connector_hdmi_init_formats_empty
[16:16:22] [PASSED] drm_test_connector_hdmi_init_formats_no_rgb
[16:16:22] === drm_test_connector_hdmi_init_formats_yuv420_allowed  ===
[16:16:22] [PASSED] supported_formats=0x9 yuv420_allowed=1
[16:16:22] [PASSED] supported_formats=0x9 yuv420_allowed=0
[16:16:22] [PASSED] supported_formats=0x3 yuv420_allowed=1
[16:16:22] [PASSED] supported_formats=0x3 yuv420_allowed=0
[16:16:22] === [PASSED] drm_test_connector_hdmi_init_formats_yuv420_allowed ===
[16:16:22] [PASSED] drm_test_connector_hdmi_init_null_ddc
[16:16:22] [PASSED] drm_test_connector_hdmi_init_null_product
[16:16:22] [PASSED] drm_test_connector_hdmi_init_null_vendor
[16:16:22] [PASSED] drm_test_connector_hdmi_init_product_length_exact
[16:16:22] [PASSED] drm_test_connector_hdmi_init_product_length_too_long
[16:16:22] [PASSED] drm_test_connector_hdmi_init_product_valid
[16:16:22] [PASSED] drm_test_connector_hdmi_init_vendor_length_exact
[16:16:22] [PASSED] drm_test_connector_hdmi_init_vendor_length_too_long
[16:16:22] [PASSED] drm_test_connector_hdmi_init_vendor_valid
[16:16:22] ========= drm_test_connector_hdmi_init_type_valid  =========
[16:16:22] [PASSED] HDMI-A
[16:16:22] [PASSED] HDMI-B
[16:16:22] ===== [PASSED] drm_test_connector_hdmi_init_type_valid =====
[16:16:22] ======== drm_test_connector_hdmi_init_type_invalid  ========
[16:16:22] [PASSED] Unknown
[16:16:22] [PASSED] VGA
[16:16:22] [PASSED] DVI-I
[16:16:22] [PASSED] DVI-D
[16:16:22] [PASSED] DVI-A
[16:16:22] [PASSED] Composite
[16:16:22] [PASSED] SVIDEO
[16:16:22] [PASSED] LVDS
[16:16:22] [PASSED] Component
[16:16:22] [PASSED] DIN
[16:16:22] [PASSED] DP
[16:16:22] [PASSED] TV
[16:16:22] [PASSED] eDP
[16:16:22] [PASSED] Virtual
[16:16:22] [PASSED] DSI
[16:16:22] [PASSED] DPI
[16:16:22] [PASSED] Writeback
[16:16:22] [PASSED] SPI
[16:16:22] [PASSED] USB
[16:16:22] ==== [PASSED] drm_test_connector_hdmi_init_type_invalid ====
[16:16:22] ============ [PASSED] drmm_connector_hdmi_init =============
[16:16:22] ============= drmm_connector_init (3 subtests) =============
[16:16:22] [PASSED] drm_test_drmm_connector_init
[16:16:22] [PASSED] drm_test_drmm_connector_init_null_ddc
[16:16:22] ========= drm_test_drmm_connector_init_type_valid  =========
[16:16:22] [PASSED] Unknown
[16:16:22] [PASSED] VGA
[16:16:22] [PASSED] DVI-I
[16:16:22] [PASSED] DVI-D
[16:16:22] [PASSED] DVI-A
[16:16:22] [PASSED] Composite
[16:16:22] [PASSED] SVIDEO
[16:16:22] [PASSED] LVDS
[16:16:22] [PASSED] Component
[16:16:22] [PASSED] DIN
[16:16:22] [PASSED] DP
[16:16:22] [PASSED] HDMI-A
[16:16:22] [PASSED] HDMI-B
[16:16:22] [PASSED] TV
[16:16:22] [PASSED] eDP
[16:16:22] [PASSED] Virtual
[16:16:22] [PASSED] DSI
[16:16:22] [PASSED] DPI
[16:16:22] [PASSED] Writeback
[16:16:22] [PASSED] SPI
[16:16:22] [PASSED] USB
[16:16:22] ===== [PASSED] drm_test_drmm_connector_init_type_valid =====
[16:16:22] =============== [PASSED] drmm_connector_init ===============
[16:16:22] ========= drm_connector_dynamic_init (6 subtests) ==========
[16:16:22] [PASSED] drm_test_drm_connector_dynamic_init
[16:16:22] [PASSED] drm_test_drm_connector_dynamic_init_null_ddc
[16:16:22] [PASSED] drm_test_drm_connector_dynamic_init_not_added
[16:16:22] [PASSED] drm_test_drm_connector_dynamic_init_properties
[16:16:22] ===== drm_test_drm_connector_dynamic_init_type_valid  ======
[16:16:22] [PASSED] Unknown
[16:16:22] [PASSED] VGA
[16:16:22] [PASSED] DVI-I
[16:16:22] [PASSED] DVI-D
[16:16:22] [PASSED] DVI-A
[16:16:22] [PASSED] Composite
[16:16:22] [PASSED] SVIDEO
[16:16:22] [PASSED] LVDS
[16:16:22] [PASSED] Component
[16:16:22] [PASSED] DIN
[16:16:22] [PASSED] DP
[16:16:22] [PASSED] HDMI-A
[16:16:22] [PASSED] HDMI-B
[16:16:22] [PASSED] TV
[16:16:22] [PASSED] eDP
[16:16:22] [PASSED] Virtual
[16:16:22] [PASSED] DSI
[16:16:22] [PASSED] DPI
[16:16:22] [PASSED] Writeback
[16:16:22] [PASSED] SPI
[16:16:22] [PASSED] USB
[16:16:22] = [PASSED] drm_test_drm_connector_dynamic_init_type_valid ==
[16:16:22] ======== drm_test_drm_connector_dynamic_init_name  =========
[16:16:22] [PASSED] Unknown
[16:16:22] [PASSED] VGA
[16:16:22] [PASSED] DVI-I
[16:16:22] [PASSED] DVI-D
[16:16:22] [PASSED] DVI-A
[16:16:22] [PASSED] Composite
[16:16:22] [PASSED] SVIDEO
[16:16:22] [PASSED] LVDS
[16:16:22] [PASSED] Component
[16:16:22] [PASSED] DIN
[16:16:22] [PASSED] DP
[16:16:22] [PASSED] HDMI-A
[16:16:22] [PASSED] HDMI-B
[16:16:22] [PASSED] TV
[16:16:22] [PASSED] eDP
[16:16:22] [PASSED] Virtual
[16:16:22] [PASSED] DSI
[16:16:22] [PASSED] DPI
[16:16:22] [PASSED] Writeback
[16:16:22] [PASSED] SPI
[16:16:22] [PASSED] USB
[16:16:22] ==== [PASSED] drm_test_drm_connector_dynamic_init_name =====
[16:16:22] =========== [PASSED] drm_connector_dynamic_init ============
[16:16:22] ==== drm_connector_dynamic_register_early (4 subtests) =====
[16:16:22] [PASSED] drm_test_drm_connector_dynamic_register_early_on_list
[16:16:22] [PASSED] drm_test_drm_connector_dynamic_register_early_defer
[16:16:22] [PASSED] drm_test_drm_connector_dynamic_register_early_no_init
[16:16:22] [PASSED] drm_test_drm_connector_dynamic_register_early_no_mode_object
[16:16:22] ====== [PASSED] drm_connector_dynamic_register_early =======
[16:16:22] ======= drm_connector_dynamic_register (7 subtests) ========
[16:16:22] [PASSED] drm_test_drm_connector_dynamic_register_on_list
[16:16:22] [PASSED] drm_test_drm_connector_dynamic_register_no_defer
[16:16:22] [PASSED] drm_test_drm_connector_dynamic_register_no_init
[16:16:22] [PASSED] drm_test_drm_connector_dynamic_register_mode_object
[16:16:22] [PASSED] drm_test_drm_connector_dynamic_register_sysfs
[16:16:22] [PASSED] drm_test_drm_connector_dynamic_register_sysfs_name
[16:16:22] [PASSED] drm_test_drm_connector_dynamic_register_debugfs
[16:16:22] ========= [PASSED] drm_connector_dynamic_register ==========
[16:16:22] = drm_connector_attach_broadcast_rgb_property (2 subtests) =
[16:16:22] [PASSED] drm_test_drm_connector_attach_broadcast_rgb_property
[16:16:22] [PASSED] drm_test_drm_connector_attach_broadcast_rgb_property_hdmi_connector
[16:16:22] === [PASSED] drm_connector_attach_broadcast_rgb_property ===
[16:16:22] ========== drm_get_tv_mode_from_name (2 subtests) ==========
[16:16:22] ========== drm_test_get_tv_mode_from_name_valid  ===========
[16:16:22] [PASSED] NTSC
[16:16:22] [PASSED] NTSC-443
[16:16:22] [PASSED] NTSC-J
[16:16:22] [PASSED] PAL
[16:16:22] [PASSED] PAL-M
[16:16:22] [PASSED] PAL-N
[16:16:22] [PASSED] SECAM
[16:16:22] [PASSED] Mono
[16:16:22] ====== [PASSED] drm_test_get_tv_mode_from_name_valid =======
[16:16:22] [PASSED] drm_test_get_tv_mode_from_name_truncated
[16:16:22] ============ [PASSED] drm_get_tv_mode_from_name ============
[16:16:22] = drm_test_connector_hdmi_compute_mode_clock (12 subtests) =
[16:16:22] [PASSED] drm_test_drm_hdmi_compute_mode_clock_rgb
[16:16:22] [PASSED] drm_test_drm_hdmi_compute_mode_clock_rgb_10bpc
[16:16:22] [PASSED] drm_test_drm_hdmi_compute_mode_clock_rgb_10bpc_vic_1
[16:16:22] [PASSED] drm_test_drm_hdmi_compute_mode_clock_rgb_12bpc
[16:16:22] [PASSED] drm_test_drm_hdmi_compute_mode_clock_rgb_12bpc_vic_1
[16:16:22] [PASSED] drm_test_drm_hdmi_compute_mode_clock_rgb_double
[16:16:22] = drm_test_connector_hdmi_compute_mode_clock_yuv420_valid  =
[16:16:22] [PASSED] VIC 96
[16:16:22] [PASSED] VIC 97
[16:16:22] [PASSED] VIC 101
[16:16:22] [PASSED] VIC 102
[16:16:22] [PASSED] VIC 106
[16:16:22] [PASSED] VIC 107
[16:16:22] === [PASSED] drm_test_connector_hdmi_compute_mode_clock_yuv420_valid ===
[16:16:22] [PASSED] drm_test_connector_hdmi_compute_mode_clock_yuv420_10_bpc
[16:16:22] [PASSED] drm_test_connector_hdmi_compute_mode_clock_yuv420_12_bpc
[16:16:22] [PASSED] drm_test_connector_hdmi_compute_mode_clock_yuv422_8_bpc
[16:16:22] [PASSED] drm_test_connector_hdmi_compute_mode_clock_yuv422_10_bpc
[16:16:22] [PASSED] drm_test_connector_hdmi_compute_mode_clock_yuv422_12_bpc
[16:16:22] === [PASSED] drm_test_connector_hdmi_compute_mode_clock ====
[16:16:22] == drm_hdmi_connector_get_broadcast_rgb_name (2 subtests) ==
[16:16:22] === drm_test_drm_hdmi_connector_get_broadcast_rgb_name  ====
[16:16:22] [PASSED] Automatic
[16:16:22] [PASSED] Full
[16:16:22] [PASSED] Limited 16:235
[16:16:22] === [PASSED] drm_test_drm_hdmi_connector_get_broadcast_rgb_name ===
[16:16:22] [PASSED] drm_test_drm_hdmi_connector_get_broadcast_rgb_name_invalid
[16:16:22] ==== [PASSED] drm_hdmi_connector_get_broadcast_rgb_name ====
[16:16:22] == drm_hdmi_connector_get_output_format_name (2 subtests) ==
[16:16:22] === drm_test_drm_hdmi_connector_get_output_format_name  ====
[16:16:22] [PASSED] RGB
[16:16:22] [PASSED] YUV 4:2:0
[16:16:22] [PASSED] YUV 4:2:2
[16:16:22] [PASSED] YUV 4:4:4
[16:16:22] === [PASSED] drm_test_drm_hdmi_connector_get_output_format_name ===
[16:16:22] [PASSED] drm_test_drm_hdmi_connector_get_output_format_name_invalid
[16:16:22] ==== [PASSED] drm_hdmi_connector_get_output_format_name ====
[16:16:22] ============= drm_damage_helper (21 subtests) ==============
[16:16:22] [PASSED] drm_test_damage_iter_no_damage
[16:16:22] [PASSED] drm_test_damage_iter_no_damage_fractional_src
[16:16:22] [PASSED] drm_test_damage_iter_no_damage_src_moved
[16:16:22] [PASSED] drm_test_damage_iter_no_damage_fractional_src_moved
[16:16:22] [PASSED] drm_test_damage_iter_no_damage_not_visible
[16:16:22] [PASSED] drm_test_damage_iter_no_damage_no_crtc
[16:16:22] [PASSED] drm_test_damage_iter_no_damage_no_fb
[16:16:22] [PASSED] drm_test_damage_iter_simple_damage
[16:16:22] [PASSED] drm_test_damage_iter_single_damage
[16:16:22] [PASSED] drm_test_damage_iter_single_damage_intersect_src
[16:16:22] [PASSED] drm_test_damage_iter_single_damage_outside_src
[16:16:22] [PASSED] drm_test_damage_iter_single_damage_fractional_src
[16:16:22] [PASSED] drm_test_damage_iter_single_damage_intersect_fractional_src
[16:16:22] [PASSED] drm_test_damage_iter_single_damage_outside_fractional_src
[16:16:22] [PASSED] drm_test_damage_iter_single_damage_src_moved
[16:16:22] [PASSED] drm_test_damage_iter_single_damage_fractional_src_moved
[16:16:22] [PASSED] drm_test_damage_iter_damage
[16:16:22] [PASSED] drm_test_damage_iter_damage_one_intersect
[16:16:22] [PASSED] drm_test_damage_iter_damage_one_outside
[16:16:22] [PASSED] drm_test_damage_iter_damage_src_moved
[16:16:22] [PASSED] drm_test_damage_iter_damage_not_visible
[16:16:22] ================ [PASSED] drm_damage_helper ================
[16:16:22] ============== drm_dp_mst_helper (3 subtests) ==============
[16:16:22] ============== drm_test_dp_mst_calc_pbn_mode  ==============
[16:16:22] [PASSED] Clock 154000 BPP 30 DSC disabled
[16:16:22] [PASSED] Clock 234000 BPP 30 DSC disabled
[16:16:22] [PASSED] Clock 297000 BPP 24 DSC disabled
[16:16:22] [PASSED] Clock 332880 BPP 24 DSC enabled
[16:16:22] [PASSED] Clock 324540 BPP 24 DSC enabled
[16:16:22] ========== [PASSED] drm_test_dp_mst_calc_pbn_mode ==========
[16:16:22] ============== drm_test_dp_mst_calc_pbn_div  ===============
[16:16:22] [PASSED] Link rate 2000000 lane count 4
[16:16:22] [PASSED] Link rate 2000000 lane count 2
[16:16:22] [PASSED] Link rate 2000000 lane count 1
[16:16:22] [PASSED] Link rate 1350000 lane count 4
[16:16:22] [PASSED] Link rate 1350000 lane count 2
[16:16:22] [PASSED] Link rate 1350000 lane count 1
[16:16:22] [PASSED] Link rate 1000000 lane count 4
[16:16:22] [PASSED] Link rate 1000000 lane count 2
[16:16:22] [PASSED] Link rate 1000000 lane count 1
[16:16:22] [PASSED] Link rate 810000 lane count 4
[16:16:22] [PASSED] Link rate 810000 lane count 2
[16:16:22] [PASSED] Link rate 810000 lane count 1
[16:16:22] [PASSED] Link rate 540000 lane count 4
[16:16:22] [PASSED] Link rate 540000 lane count 2
[16:16:22] [PASSED] Link rate 540000 lane count 1
[16:16:22] [PASSED] Link rate 270000 lane count 4
[16:16:22] [PASSED] Link rate 270000 lane count 2
[16:16:22] [PASSED] Link rate 270000 lane count 1
[16:16:22] [PASSED] Link rate 162000 lane count 4
[16:16:22] [PASSED] Link rate 162000 lane count 2
[16:16:22] [PASSED] Link rate 162000 lane count 1
[16:16:22] ========== [PASSED] drm_test_dp_mst_calc_pbn_div ===========
[16:16:22] ========= drm_test_dp_mst_sideband_msg_req_decode  =========
[16:16:22] [PASSED] DP_ENUM_PATH_RESOURCES with port number
[16:16:22] [PASSED] DP_POWER_UP_PHY with port number
[16:16:22] [PASSED] DP_POWER_DOWN_PHY with port number
[16:16:22] [PASSED] DP_ALLOCATE_PAYLOAD with SDP stream sinks
[16:16:22] [PASSED] DP_ALLOCATE_PAYLOAD with port number
[16:16:22] [PASSED] DP_ALLOCATE_PAYLOAD with VCPI
[16:16:22] [PASSED] DP_ALLOCATE_PAYLOAD with PBN
[16:16:22] [PASSED] DP_QUERY_PAYLOAD with port number
[16:16:22] [PASSED] DP_QUERY_PAYLOAD with VCPI
[16:16:22] [PASSED] DP_REMOTE_DPCD_READ with port number
[16:16:22] [PASSED] DP_REMOTE_DPCD_READ with DPCD address
[16:16:22] [PASSED] DP_REMOTE_DPCD_READ with max number of bytes
[16:16:22] [PASSED] DP_REMOTE_DPCD_WRITE with port number
[16:16:22] [PASSED] DP_REMOTE_DPCD_WRITE with DPCD address
[16:16:22] [PASSED] DP_REMOTE_DPCD_WRITE with data array
[16:16:22] [PASSED] DP_REMOTE_I2C_READ with port number
[16:16:22] [PASSED] DP_REMOTE_I2C_READ with I2C device ID
[16:16:22] [PASSED] DP_REMOTE_I2C_READ with transactions array
[16:16:22] [PASSED] DP_REMOTE_I2C_WRITE with port number
[16:16:22] [PASSED] DP_REMOTE_I2C_WRITE with I2C device ID
[16:16:22] [PASSED] DP_REMOTE_I2C_WRITE with data array
[16:16:22] [PASSED] DP_QUERY_STREAM_ENC_STATUS with stream ID
[16:16:22] [PASSED] DP_QUERY_STREAM_ENC_STATUS with client ID
[16:16:22] [PASSED] DP_QUERY_STREAM_ENC_STATUS with stream event
[16:16:22] [PASSED] DP_QUERY_STREAM_ENC_STATUS with valid stream event
[16:16:22] [PASSED] DP_QUERY_STREAM_ENC_STATUS with stream behavior
[16:16:22] [PASSED] DP_QUERY_STREAM_ENC_STATUS with a valid stream behavior
[16:16:22] ===== [PASSED] drm_test_dp_mst_sideband_msg_req_decode =====
[16:16:22] ================ [PASSED] drm_dp_mst_helper ================
[16:16:22] ================== drm_exec (7 subtests) ===================
[16:16:22] [PASSED] sanitycheck
[16:16:22] [PASSED] test_lock
[16:16:22] [PASSED] test_lock_unlock
[16:16:22] [PASSED] test_duplicates
[16:16:22] [PASSED] test_prepare
[16:16:22] [PASSED] test_prepare_array
[16:16:22] [PASSED] test_multiple_loops
[16:16:22] ==================== [PASSED] drm_exec =====================
[16:16:22] =========== drm_format_helper_test (17 subtests) ===========
[16:16:22] ============== drm_test_fb_xrgb8888_to_gray8  ==============
[16:16:22] [PASSED] single_pixel_source_buffer
[16:16:22] [PASSED] single_pixel_clip_rectangle
[16:16:22] [PASSED] well_known_colors
[16:16:22] [PASSED] destination_pitch
[16:16:22] ========== [PASSED] drm_test_fb_xrgb8888_to_gray8 ==========
[16:16:22] ============= drm_test_fb_xrgb8888_to_rgb332  ==============
[16:16:22] [PASSED] single_pixel_source_buffer
[16:16:22] [PASSED] single_pixel_clip_rectangle
[16:16:22] [PASSED] well_known_colors
[16:16:22] [PASSED] destination_pitch
[16:16:22] ========= [PASSED] drm_test_fb_xrgb8888_to_rgb332 ==========
[16:16:22] ============= drm_test_fb_xrgb8888_to_rgb565  ==============
[16:16:22] [PASSED] single_pixel_source_buffer
[16:16:22] [PASSED] single_pixel_clip_rectangle
[16:16:22] [PASSED] well_known_colors
[16:16:22] [PASSED] destination_pitch
[16:16:22] ========= [PASSED] drm_test_fb_xrgb8888_to_rgb565 ==========
[16:16:22] ============ drm_test_fb_xrgb8888_to_xrgb1555  =============
[16:16:22] [PASSED] single_pixel_source_buffer
[16:16:22] [PASSED] single_pixel_clip_rectangle
[16:16:22] [PASSED] well_known_colors
[16:16:22] [PASSED] destination_pitch
[16:16:22] ======== [PASSED] drm_test_fb_xrgb8888_to_xrgb1555 =========
[16:16:22] ============ drm_test_fb_xrgb8888_to_argb1555  =============
[16:16:22] [PASSED] single_pixel_source_buffer
[16:16:22] [PASSED] single_pixel_clip_rectangle
[16:16:22] [PASSED] well_known_colors
[16:16:22] [PASSED] destination_pitch
[16:16:22] ======== [PASSED] drm_test_fb_xrgb8888_to_argb1555 =========
[16:16:22] ============ drm_test_fb_xrgb8888_to_rgba5551  =============
[16:16:22] [PASSED] single_pixel_source_buffer
[16:16:22] [PASSED] single_pixel_clip_rectangle
[16:16:22] [PASSED] well_known_colors
[16:16:22] [PASSED] destination_pitch
[16:16:22] ======== [PASSED] drm_test_fb_xrgb8888_to_rgba5551 =========
[16:16:22] ============= drm_test_fb_xrgb8888_to_rgb888  ==============
[16:16:22] [PASSED] single_pixel_source_buffer
[16:16:22] [PASSED] single_pixel_clip_rectangle
[16:16:22] [PASSED] well_known_colors
[16:16:22] [PASSED] destination_pitch
[16:16:22] ========= [PASSED] drm_test_fb_xrgb8888_to_rgb888 ==========
[16:16:22] ============= drm_test_fb_xrgb8888_to_bgr888  ==============
[16:16:22] [PASSED] single_pixel_source_buffer
[16:16:22] [PASSED] single_pixel_clip_rectangle
[16:16:22] [PASSED] well_known_colors
[16:16:22] [PASSED] destination_pitch
[16:16:22] ========= [PASSED] drm_test_fb_xrgb8888_to_bgr888 ==========
[16:16:22] ============ drm_test_fb_xrgb8888_to_argb8888  =============
[16:16:22] [PASSED] single_pixel_source_buffer
[16:16:22] [PASSED] single_pixel_clip_rectangle
[16:16:22] [PASSED] well_known_colors
[16:16:22] [PASSED] destination_pitch
[16:16:22] ======== [PASSED] drm_test_fb_xrgb8888_to_argb8888 =========
[16:16:22] =========== drm_test_fb_xrgb8888_to_xrgb2101010  ===========
[16:16:22] [PASSED] single_pixel_source_buffer
[16:16:22] [PASSED] single_pixel_clip_rectangle
[16:16:22] [PASSED] well_known_colors
[16:16:22] [PASSED] destination_pitch
[16:16:22] ======= [PASSED] drm_test_fb_xrgb8888_to_xrgb2101010 =======
[16:16:22] =========== drm_test_fb_xrgb8888_to_argb2101010  ===========
[16:16:22] [PASSED] single_pixel_source_buffer
[16:16:22] [PASSED] single_pixel_clip_rectangle
[16:16:22] [PASSED] well_known_colors
[16:16:22] [PASSED] destination_pitch
[16:16:22] ======= [PASSED] drm_test_fb_xrgb8888_to_argb2101010 =======
[16:16:22] ============== drm_test_fb_xrgb8888_to_mono  ===============
[16:16:22] [PASSED] single_pixel_source_buffer
[16:16:22] [PASSED] single_pixel_clip_rectangle
[16:16:22] [PASSED] well_known_colors
[16:16:22] [PASSED] destination_pitch
[16:16:22] ========== [PASSED] drm_test_fb_xrgb8888_to_mono ===========
[16:16:22] ==================== drm_test_fb_swab  =====================
[16:16:22] [PASSED] single_pixel_source_buffer
[16:16:22] [PASSED] single_pixel_clip_rectangle
[16:16:22] [PASSED] well_known_colors
[16:16:22] [PASSED] destination_pitch
[16:16:22] ================ [PASSED] drm_test_fb_swab =================
[16:16:22] ============ drm_test_fb_xrgb8888_to_xbgr8888  =============
[16:16:22] [PASSED] single_pixel_source_buffer
[16:16:22] [PASSED] single_pixel_clip_rectangle
[16:16:22] [PASSED] well_known_colors
[16:16:22] [PASSED] destination_pitch
[16:16:22] ======== [PASSED] drm_test_fb_xrgb8888_to_xbgr8888 =========
[16:16:22] ============ drm_test_fb_xrgb8888_to_abgr8888  =============
[16:16:22] [PASSED] single_pixel_source_buffer
[16:16:22] [PASSED] single_pixel_clip_rectangle
[16:16:22] [PASSED] well_known_colors
[16:16:22] [PASSED] destination_pitch
[16:16:22] ======== [PASSED] drm_test_fb_xrgb8888_to_abgr8888 =========
[16:16:22] ================= drm_test_fb_clip_offset  =================
[16:16:22] [PASSED] pass through
[16:16:22] [PASSED] horizontal offset
[16:16:22] [PASSED] vertical offset
[16:16:22] [PASSED] horizontal and vertical offset
[16:16:22] [PASSED] horizontal offset (custom pitch)
[16:16:22] [PASSED] vertical offset (custom pitch)
[16:16:22] [PASSED] horizontal and vertical offset (custom pitch)
[16:16:22] ============= [PASSED] drm_test_fb_clip_offset =============
[16:16:22] =================== drm_test_fb_memcpy  ====================
[16:16:22] [PASSED] single_pixel_source_buffer: XR24 little-endian (0x34325258)
[16:16:22] [PASSED] single_pixel_source_buffer: XRA8 little-endian (0x38415258)
[16:16:22] [PASSED] single_pixel_source_buffer: YU24 little-endian (0x34325559)
[16:16:22] [PASSED] single_pixel_clip_rectangle: XB24 little-endian (0x34324258)
[16:16:22] [PASSED] single_pixel_clip_rectangle: XRA8 little-endian (0x38415258)
[16:16:22] [PASSED] single_pixel_clip_rectangle: YU24 little-endian (0x34325559)
[16:16:22] [PASSED] well_known_colors: XB24 little-endian (0x34324258)
[16:16:22] [PASSED] well_known_colors: XRA8 little-endian (0x38415258)
[16:16:22] [PASSED] well_known_colors: YU24 little-endian (0x34325559)
[16:16:22] [PASSED] destination_pitch: XB24 little-endian (0x34324258)
[16:16:22] [PASSED] destination_pitch: XRA8 little-endian (0x38415258)
[16:16:22] [PASSED] destination_pitch: YU24 little-endian (0x34325559)
[16:16:22] =============== [PASSED] drm_test_fb_memcpy ================
[16:16:22] ============= [PASSED] drm_format_helper_test ==============
[16:16:22] ================= drm_format (18 subtests) =================
[16:16:22] [PASSED] drm_test_format_block_width_invalid
[16:16:22] [PASSED] drm_test_format_block_width_one_plane
[16:16:22] [PASSED] drm_test_format_block_width_two_plane
[16:16:22] [PASSED] drm_test_format_block_width_three_plane
[16:16:22] [PASSED] drm_test_format_block_width_tiled
[16:16:22] [PASSED] drm_test_format_block_height_invalid
[16:16:22] [PASSED] drm_test_format_block_height_one_plane
[16:16:22] [PASSED] drm_test_format_block_height_two_plane
[16:16:22] [PASSED] drm_test_format_block_height_three_plane
[16:16:22] [PASSED] drm_test_format_block_height_tiled
[16:16:22] [PASSED] drm_test_format_min_pitch_invalid
[16:16:22] [PASSED] drm_test_format_min_pitch_one_plane_8bpp
[16:16:22] [PASSED] drm_test_format_min_pitch_one_plane_16bpp
[16:16:22] [PASSED] drm_test_format_min_pitch_one_plane_24bpp
[16:16:22] [PASSED] drm_test_format_min_pitch_one_plane_32bpp
[16:16:22] [PASSED] drm_test_format_min_pitch_two_plane
[16:16:22] [PASSED] drm_test_format_min_pitch_three_plane_8bpp
[16:16:22] [PASSED] drm_test_format_min_pitch_tiled
[16:16:22] =================== [PASSED] drm_format ====================
[16:16:22] ============== drm_framebuffer (10 subtests) ===============
[16:16:22] ========== drm_test_framebuffer_check_src_coords  ==========
[16:16:22] [PASSED] Success: source fits into fb
[16:16:22] [PASSED] Fail: overflowing fb with x-axis coordinate
[16:16:22] [PASSED] Fail: overflowing fb with y-axis coordinate
[16:16:22] [PASSED] Fail: overflowing fb with source width
[16:16:22] [PASSED] Fail: overflowing fb with source height
[16:16:22] ====== [PASSED] drm_test_framebuffer_check_src_coords ======
[16:16:22] [PASSED] drm_test_framebuffer_cleanup
[16:16:22] =============== drm_test_framebuffer_create  ===============
[16:16:22] [PASSED] ABGR8888 normal sizes
[16:16:22] [PASSED] ABGR8888 max sizes
[16:16:22] [PASSED] ABGR8888 pitch greater than min required
[16:16:22] [PASSED] ABGR8888 pitch less than min required
[16:16:22] [PASSED] ABGR8888 Invalid width
[16:16:22] [PASSED] ABGR8888 Invalid buffer handle
[16:16:22] [PASSED] No pixel format
[16:16:22] [PASSED] ABGR8888 Width 0
[16:16:22] [PASSED] ABGR8888 Height 0
[16:16:22] [PASSED] ABGR8888 Out of bound height * pitch combination
[16:16:22] [PASSED] ABGR8888 Large buffer offset
[16:16:22] [PASSED] ABGR8888 Buffer offset for inexistent plane
[16:16:22] [PASSED] ABGR8888 Invalid flag
[16:16:22] [PASSED] ABGR8888 Set DRM_MODE_FB_MODIFIERS without modifiers
[16:16:22] [PASSED] ABGR8888 Valid buffer modifier
[16:16:22] [PASSED] ABGR8888 Invalid buffer modifier(DRM_FORMAT_MOD_SAMSUNG_64_32_TILE)
[16:16:22] [PASSED] ABGR8888 Extra pitches without DRM_MODE_FB_MODIFIERS
[16:16:22] [PASSED] ABGR8888 Extra pitches with DRM_MODE_FB_MODIFIERS
[16:16:22] [PASSED] NV12 Normal sizes
[16:16:22] [PASSED] NV12 Max sizes
[16:16:22] [PASSED] NV12 Invalid pitch
[16:16:22] [PASSED] NV12 Invalid modifier/missing DRM_MODE_FB_MODIFIERS flag
[16:16:22] [PASSED] NV12 different  modifier per-plane
[16:16:22] [PASSED] NV12 with DRM_FORMAT_MOD_SAMSUNG_64_32_TILE
[16:16:22] [PASSED] NV12 Valid modifiers without DRM_MODE_FB_MODIFIERS
[16:16:22] [PASSED] NV12 Modifier for inexistent plane
[16:16:22] [PASSED] NV12 Handle for inexistent plane
[16:16:22] [PASSED] NV12 Handle for inexistent plane without DRM_MODE_FB_MODIFIERS
[16:16:22] [PASSED] YVU420 DRM_MODE_FB_MODIFIERS set without modifier
[16:16:22] [PASSED] YVU420 Normal sizes
[16:16:22] [PASSED] YVU420 Max sizes
[16:16:22] [PASSED] YVU420 Invalid pitch
[16:16:22] [PASSED] YVU420 Different pitches
[16:16:22] [PASSED] YVU420 Different buffer offsets/pitches
[16:16:22] [PASSED] YVU420 Modifier set just for plane 0, without DRM_MODE_FB_MODIFIERS
[16:16:22] [PASSED] YVU420 Modifier set just for planes 0, 1, without DRM_MODE_FB_MODIFIERS
[16:16:22] [PASSED] YVU420 Modifier set just for plane 0, 1, with DRM_MODE_FB_MODIFIERS
[16:16:22] [PASSED] YVU420 Valid modifier
[16:16:22] [PASSED] YVU420 Different modifiers per plane
[16:16:22] [PASSED] YVU420 Modifier for inexistent plane
[16:16:22] [PASSED] YUV420_10BIT Invalid modifier(DRM_FORMAT_MOD_LINEAR)
[16:16:22] [PASSED] X0L2 Normal sizes
[16:16:22] [PASSED] X0L2 Max sizes
[16:16:22] [PASSED] X0L2 Invalid pitch
[16:16:22] [PASSED] X0L2 Pitch greater than minimum required
[16:16:22] [PASSED] X0L2 Handle for inexistent plane
[16:16:22] [PASSED] X0L2 Offset for inexistent plane, without DRM_MODE_FB_MODIFIERS set
[16:16:22] [PASSED] X0L2 Modifier without DRM_MODE_FB_MODIFIERS set
[16:16:22] [PASSED] X0L2 Valid modifier
[16:16:22] [PASSED] X0L2 Modifier for inexistent plane
[16:16:22] =========== [PASSED] drm_test_framebuffer_create ===========
[16:16:22] [PASSED] drm_test_framebuffer_free
[16:16:22] [PASSED] drm_test_framebuffer_init
[16:16:22] [PASSED] drm_test_framebuffer_init_bad_format
[16:16:22] [PASSED] drm_test_framebuffer_init_dev_mismatch
[16:16:22] [PASSED] drm_test_framebuffer_lookup
[16:16:22] [PASSED] drm_test_framebuffer_lookup_inexistent
[16:16:22] [PASSED] drm_test_framebuffer_modifiers_not_supported
[16:16:22] ================= [PASSED] drm_framebuffer =================
[16:16:22] ================ drm_gem_shmem (8 subtests) ================
[16:16:22] [PASSED] drm_gem_shmem_test_obj_create
[16:16:22] [PASSED] drm_gem_shmem_test_obj_create_private
[16:16:22] [PASSED] drm_gem_shmem_test_pin_pages
[16:16:22] [PASSED] drm_gem_shmem_test_vmap
[16:16:22] [PASSED] drm_gem_shmem_test_get_sg_table
[16:16:22] [PASSED] drm_gem_shmem_test_get_pages_sgt
[16:16:22] [PASSED] drm_gem_shmem_test_madvise
[16:16:22] [PASSED] drm_gem_shmem_test_purge
[16:16:22] ================== [PASSED] drm_gem_shmem ==================
[16:16:22] === drm_atomic_helper_connector_hdmi_check (27 subtests) ===
[16:16:22] [PASSED] drm_test_check_broadcast_rgb_auto_cea_mode
[16:16:22] [PASSED] drm_test_check_broadcast_rgb_auto_cea_mode_vic_1
[16:16:22] [PASSED] drm_test_check_broadcast_rgb_full_cea_mode
[16:16:22] [PASSED] drm_test_check_broadcast_rgb_full_cea_mode_vic_1
[16:16:22] [PASSED] drm_test_check_broadcast_rgb_limited_cea_mode
[16:16:22] [PASSED] drm_test_check_broadcast_rgb_limited_cea_mode_vic_1
[16:16:22] ====== drm_test_check_broadcast_rgb_cea_mode_yuv420  =======
[16:16:22] [PASSED] Automatic
[16:16:22] [PASSED] Full
[16:16:22] [PASSED] Limited 16:235
[16:16:22] == [PASSED] drm_test_check_broadcast_rgb_cea_mode_yuv420 ===
[16:16:22] [PASSED] drm_test_check_broadcast_rgb_crtc_mode_changed
[16:16:22] [PASSED] drm_test_check_broadcast_rgb_crtc_mode_not_changed
[16:16:22] [PASSED] drm_test_check_disable_connector
[16:16:22] [PASSED] drm_test_check_hdmi_funcs_reject_rate
[16:16:22] [PASSED] drm_test_check_max_tmds_rate_bpc_fallback_rgb
[16:16:22] [PASSED] drm_test_check_max_tmds_rate_bpc_fallback_yuv420
[16:16:22] [PASSED] drm_test_check_max_tmds_rate_bpc_fallback_ignore_yuv422
[16:16:22] [PASSED] drm_test_check_max_tmds_rate_bpc_fallback_ignore_yuv420
[16:16:22] [PASSED] drm_test_check_driver_unsupported_fallback_yuv420
[16:16:22] [PASSED] drm_test_check_output_bpc_crtc_mode_changed
[16:16:22] [PASSED] drm_test_check_output_bpc_crtc_mode_not_changed
[16:16:22] [PASSED] drm_test_check_output_bpc_dvi
[16:16:22] [PASSED] drm_test_check_output_bpc_format_vic_1
[16:16:22] [PASSED] drm_test_check_output_bpc_format_display_8bpc_only
[16:16:22] [PASSED] drm_test_check_output_bpc_format_display_rgb_only
[16:16:22] [PASSED] drm_test_check_output_bpc_format_driver_8bpc_only
[16:16:22] [PASSED] drm_test_check_output_bpc_format_driver_rgb_only
[16:16:22] [PASSED] drm_test_check_tmds_char_rate_rgb_8bpc
[16:16:22] [PASSED] drm_test_check_tmds_char_rate_rgb_10bpc
[16:16:22] [PASSED] drm_test_check_tmds_char_rate_rgb_12bpc
[16:16:22] ===== [PASSED] drm_atomic_helper_connector_hdmi_check ======
[16:16:22] === drm_atomic_helper_connector_hdmi_reset (6 subtests) ====
[16:16:22] [PASSED] drm_test_check_broadcast_rgb_value
[16:16:22] [PASSED] drm_test_check_bpc_8_value
[16:16:22] [PASSED] drm_test_check_bpc_10_value
[16:16:22] [PASSED] drm_test_check_bpc_12_value
[16:16:22] [PASSED] drm_test_check_format_value
[16:16:22] [PASSED] drm_test_check_tmds_char_value
[16:16:22] ===== [PASSED] drm_atomic_helper_connector_hdmi_reset ======
[16:16:22] = drm_atomic_helper_connector_hdmi_mode_valid (4 subtests) =
[16:16:22] [PASSED] drm_test_check_mode_valid
[16:16:22] [PASSED] drm_test_check_mode_valid_reject
[16:16:22] [PASSED] drm_test_check_mode_valid_reject_rate
[16:16:22] [PASSED] drm_test_check_mode_valid_reject_max_clock
[16:16:22] === [PASSED] drm_atomic_helper_connector_hdmi_mode_valid ===
[16:16:22] = drm_atomic_helper_connector_hdmi_infoframes (5 subtests) =
[16:16:22] [PASSED] drm_test_check_infoframes
[16:16:22] [PASSED] drm_test_check_reject_avi_infoframe
[16:16:22] [PASSED] drm_test_check_reject_hdr_infoframe_bpc_8
[16:16:22] [PASSED] drm_test_check_reject_hdr_infoframe_bpc_10
[16:16:22] [PASSED] drm_test_check_reject_audio_infoframe
[16:16:22] === [PASSED] drm_atomic_helper_connector_hdmi_infoframes ===
[16:16:22] ================= drm_managed (2 subtests) =================
[16:16:22] [PASSED] drm_test_managed_release_action
[16:16:22] [PASSED] drm_test_managed_run_action
[16:16:22] =================== [PASSED] drm_managed ===================
[16:16:22] =================== drm_mm (6 subtests) ====================
[16:16:22] [PASSED] drm_test_mm_init
[16:16:22] [PASSED] drm_test_mm_debug
[16:16:22] [PASSED] drm_test_mm_align32
[16:16:22] [PASSED] drm_test_mm_align64
[16:16:22] [PASSED] drm_test_mm_lowest
[16:16:22] [PASSED] drm_test_mm_highest
[16:16:22] ===================== [PASSED] drm_mm ======================
[16:16:22] ============= drm_modes_analog_tv (5 subtests) =============
[16:16:22] [PASSED] drm_test_modes_analog_tv_mono_576i
[16:16:22] [PASSED] drm_test_modes_analog_tv_ntsc_480i
[16:16:22] [PASSED] drm_test_modes_analog_tv_ntsc_480i_inlined
[16:16:22] [PASSED] drm_test_modes_analog_tv_pal_576i
[16:16:22] [PASSED] drm_test_modes_analog_tv_pal_576i_inlined
[16:16:22] =============== [PASSED] drm_modes_analog_tv ===============
[16:16:22] ============== drm_plane_helper (2 subtests) ===============
[16:16:22] =============== drm_test_check_plane_state  ================
[16:16:22] [PASSED] clipping_simple
[16:16:22] [PASSED] clipping_rotate_reflect
[16:16:22] [PASSED] positioning_simple
[16:16:22] [PASSED] upscaling
[16:16:22] [PASSED] downscaling
[16:16:22] [PASSED] rounding1
[16:16:22] [PASSED] rounding2
[16:16:22] [PASSED] rounding3
[16:16:22] [PASSED] rounding4
[16:16:22] =========== [PASSED] drm_test_check_plane_state ============
[16:16:22] =========== drm_test_check_invalid_plane_state  ============
[16:16:22] [PASSED] positioning_invalid
[16:16:22] [PASSED] upscaling_invalid
[16:16:22] [PASSED] downscaling_invalid
[16:16:22] ======= [PASSED] drm_test_check_invalid_plane_state ========
[16:16:22] ================ [PASSED] drm_plane_helper =================
[16:16:22] ====== drm_connector_helper_tv_get_modes (1 subtest) =======
[16:16:22] ====== drm_test_connector_helper_tv_get_modes_check  =======
[16:16:22] [PASSED] None
[16:16:22] [PASSED] PAL
[16:16:22] [PASSED] NTSC
[16:16:22] [PASSED] Both, NTSC Default
[16:16:22] [PASSED] Both, PAL Default
[16:16:22] [PASSED] Both, NTSC Default, with PAL on command-line
[16:16:22] [PASSED] Both, PAL Default, with NTSC on command-line
[16:16:22] == [PASSED] drm_test_connector_helper_tv_get_modes_check ===
[16:16:22] ======== [PASSED] drm_connector_helper_tv_get_modes ========
[16:16:22] ================== drm_rect (9 subtests) ===================
[16:16:22] [PASSED] drm_test_rect_clip_scaled_div_by_zero
[16:16:22] [PASSED] drm_test_rect_clip_scaled_not_clipped
[16:16:22] [PASSED] drm_test_rect_clip_scaled_clipped
[16:16:22] [PASSED] drm_test_rect_clip_scaled_signed_vs_unsigned
[16:16:22] ================= drm_test_rect_intersect  =================
[16:16:22] [PASSED] top-left x bottom-right: 2x2+1+1 x 2x2+0+0
[16:16:22] [PASSED] top-right x bottom-left: 2x2+0+0 x 2x2+1-1
[16:16:22] [PASSED] bottom-left x top-right: 2x2+1-1 x 2x2+0+0
[16:16:22] [PASSED] bottom-right x top-left: 2x2+0+0 x 2x2+1+1
[16:16:22] [PASSED] right x left: 2x1+0+0 x 3x1+1+0
[16:16:22] [PASSED] left x right: 3x1+1+0 x 2x1+0+0
[16:16:22] [PASSED] up x bottom: 1x2+0+0 x 1x3+0-1
[16:16:22] [PASSED] bottom x up: 1x3+0-1 x 1x2+0+0
[16:16:22] [PASSED] touching corner: 1x1+0+0 x 2x2+1+1
[16:16:22] [PASSED] touching side: 1x1+0+0 x 1x1+1+0
[16:16:22] [PASSED] equal rects: 2x2+0+0 x 2x2+0+0
[16:16:22] [PASSED] inside another: 2x2+0+0 x 1x1+1+1
[16:16:22] [PASSED] far away: 1x1+0+0 x 1x1+3+6
[16:16:22] [PASSED] points intersecting: 0x0+5+10 x 0x0+5+10
[16:16:22] [PASSED] points not intersecting: 0x0+0+0 x 0x0+5+10
[16:16:22] ============= [PASSED] drm_test_rect_intersect =============
[16:16:22] ================ drm_test_rect_calc_hscale  ================
[16:16:22] [PASSED] normal use
[16:16:22] [PASSED] out of max range
[16:16:22] [PASSED] out of min range
[16:16:22] [PASSED] zero dst
[16:16:22] [PASSED] negative src
[16:16:22] [PASSED] negative dst
[16:16:22] ============ [PASSED] drm_test_rect_calc_hscale ============
[16:16:22] ================ drm_test_rect_calc_vscale  ================
[16:16:22] [PASSED] normal use
[16:16:22] [PASSED] out of max range
[16:16:22] [PASSED] out of min range
[16:16:22] [PASSED] zero dst
[16:16:22] [PASSED] negative src
[16:16:22] [PASSED] negative dst
[16:16:22] ============ [PASSED] drm_test_rect_calc_vscale ============
[16:16:22] ================== drm_test_rect_rotate  ===================
[16:16:22] [PASSED] reflect-x
[16:16:22] [PASSED] reflect-y
[16:16:22] [PASSED] rotate-0
[16:16:22] [PASSED] rotate-90
[16:16:22] [PASSED] rotate-180
[16:16:22] [PASSED] rotate-270
[16:16:22] ============== [PASSED] drm_test_rect_rotate ===============
[16:16:22] ================ drm_test_rect_rotate_inv  =================
[16:16:22] [PASSED] reflect-x
[16:16:22] [PASSED] reflect-y
[16:16:22] [PASSED] rotate-0
[16:16:22] [PASSED] rotate-90
[16:16:22] [PASSED] rotate-180
[16:16:22] [PASSED] rotate-270
[16:16:22] ============ [PASSED] drm_test_rect_rotate_inv =============
[16:16:22] ==================== [PASSED] drm_rect =====================
[16:16:22] ============ drm_sysfb_modeset_test (1 subtest) ============
[16:16:22] ============ drm_test_sysfb_build_fourcc_list  =============
[16:16:22] [PASSED] no native formats
[16:16:22] [PASSED] XRGB8888 as native format
[16:16:22] [PASSED] remove duplicates
[16:16:22] [PASSED] convert alpha formats
[16:16:22] [PASSED] random formats
[16:16:22] ======== [PASSED] drm_test_sysfb_build_fourcc_list =========
[16:16:22] ============= [PASSED] drm_sysfb_modeset_test ==============
[16:16:22] ================== drm_fixp (2 subtests) ===================
[16:16:22] [PASSED] drm_test_int2fixp
[16:16:22] [PASSED] drm_test_sm2fixp
[16:16:22] ==================== [PASSED] drm_fixp =====================
[16:16:22] ============================================================
[16:16:22] Testing complete. Ran 630 tests: passed: 630
[16:16:22] Elapsed time: 27.579s total, 1.638s configuring, 25.472s building, 0.428s running

+ /kernel/tools/testing/kunit/kunit.py run --kunitconfig /kernel/drivers/gpu/drm/ttm/tests/.kunitconfig
[16:16:22] Configuring KUnit Kernel ...
Regenerating .config ...
Populating config with:
$ make ARCH=um O=.kunit olddefconfig
[16:16:24] Building KUnit Kernel ...
Populating config with:
$ make ARCH=um O=.kunit olddefconfig
Building with:
$ make all compile_commands.json scripts_gdb ARCH=um O=.kunit --jobs=48
[16:16:33] Starting KUnit Kernel (1/1)...
[16:16:33] ============================================================
Running tests with:
$ .kunit/linux kunit.enable=1 mem=1G console=tty kunit_shutdown=halt
[16:16:33] ================= ttm_device (5 subtests) ==================
[16:16:33] [PASSED] ttm_device_init_basic
[16:16:33] [PASSED] ttm_device_init_multiple
[16:16:33] [PASSED] ttm_device_fini_basic
[16:16:33] [PASSED] ttm_device_init_no_vma_man
[16:16:33] ================== ttm_device_init_pools  ==================
[16:16:33] [PASSED] No DMA allocations, no DMA32 required
[16:16:33] [PASSED] DMA allocations, DMA32 required
[16:16:33] [PASSED] No DMA allocations, DMA32 required
[16:16:33] [PASSED] DMA allocations, no DMA32 required
[16:16:33] ============== [PASSED] ttm_device_init_pools ==============
[16:16:33] =================== [PASSED] ttm_device ====================
[16:16:33] ================== ttm_pool (8 subtests) ===================
[16:16:33] ================== ttm_pool_alloc_basic  ===================
[16:16:33] [PASSED] One page
[16:16:33] [PASSED] More than one page
[16:16:33] [PASSED] Above the allocation limit
[16:16:33] [PASSED] One page, with coherent DMA mappings enabled
[16:16:33] [PASSED] Above the allocation limit, with coherent DMA mappings enabled
[16:16:33] ============== [PASSED] ttm_pool_alloc_basic ===============
[16:16:33] ============== ttm_pool_alloc_basic_dma_addr  ==============
[16:16:33] [PASSED] One page
[16:16:33] [PASSED] More than one page
[16:16:33] [PASSED] Above the allocation limit
[16:16:33] [PASSED] One page, with coherent DMA mappings enabled
[16:16:33] [PASSED] Above the allocation limit, with coherent DMA mappings enabled
[16:16:33] ========== [PASSED] ttm_pool_alloc_basic_dma_addr ==========
[16:16:33] [PASSED] ttm_pool_alloc_order_caching_match
[16:16:33] [PASSED] ttm_pool_alloc_caching_mismatch
[16:16:33] [PASSED] ttm_pool_alloc_order_mismatch
[16:16:33] [PASSED] ttm_pool_free_dma_alloc
[16:16:33] [PASSED] ttm_pool_free_no_dma_alloc
[16:16:33] [PASSED] ttm_pool_fini_basic
[16:16:33] ==================== [PASSED] ttm_pool =====================
[16:16:33] ================ ttm_resource (8 subtests) =================
[16:16:33] ================= ttm_resource_init_basic  =================
[16:16:33] [PASSED] Init resource in TTM_PL_SYSTEM
[16:16:33] [PASSED] Init resource in TTM_PL_VRAM
[16:16:33] [PASSED] Init resource in a private placement
[16:16:33] [PASSED] Init resource in TTM_PL_SYSTEM, set placement flags
[16:16:33] ============= [PASSED] ttm_resource_init_basic =============
[16:16:33] [PASSED] ttm_resource_init_pinned
[16:16:33] [PASSED] ttm_resource_fini_basic
[16:16:33] [PASSED] ttm_resource_manager_init_basic
[16:16:33] [PASSED] ttm_resource_manager_usage_basic
[16:16:33] [PASSED] ttm_resource_manager_set_used_basic
[16:16:33] [PASSED] ttm_sys_man_alloc_basic
[16:16:33] [PASSED] ttm_sys_man_free_basic
[16:16:33] ================== [PASSED] ttm_resource ===================
[16:16:33] =================== ttm_tt (15 subtests) ===================
[16:16:33] ==================== ttm_tt_init_basic  ====================
[16:16:33] [PASSED] Page-aligned size
[16:16:33] [PASSED] Extra pages requested
[16:16:33] ================ [PASSED] ttm_tt_init_basic ================
[16:16:33] [PASSED] ttm_tt_init_misaligned
[16:16:33] [PASSED] ttm_tt_fini_basic
[16:16:33] [PASSED] ttm_tt_fini_sg
[16:16:33] [PASSED] ttm_tt_fini_shmem
[16:16:33] [PASSED] ttm_tt_create_basic
[16:16:33] [PASSED] ttm_tt_create_invalid_bo_type
[16:16:33] [PASSED] ttm_tt_create_ttm_exists
[16:16:33] [PASSED] ttm_tt_create_failed
[16:16:33] [PASSED] ttm_tt_destroy_basic
[16:16:33] [PASSED] ttm_tt_populate_null_ttm
[16:16:33] [PASSED] ttm_tt_populate_populated_ttm
[16:16:33] [PASSED] ttm_tt_unpopulate_basic
[16:16:33] [PASSED] ttm_tt_unpopulate_empty_ttm
[16:16:33] [PASSED] ttm_tt_swapin_basic
[16:16:33] ===================== [PASSED] ttm_tt ======================
[16:16:33] =================== ttm_bo (14 subtests) ===================
[16:16:33] =========== ttm_bo_reserve_optimistic_no_ticket  ===========
[16:16:33] [PASSED] Cannot be interrupted and sleeps
[16:16:33] [PASSED] Cannot be interrupted, locks straight away
[16:16:33] [PASSED] Can be interrupted, sleeps
[16:16:33] ======= [PASSED] ttm_bo_reserve_optimistic_no_ticket =======
[16:16:33] [PASSED] ttm_bo_reserve_locked_no_sleep
[16:16:33] [PASSED] ttm_bo_reserve_no_wait_ticket
[16:16:33] [PASSED] ttm_bo_reserve_double_resv
[16:16:33] [PASSED] ttm_bo_reserve_interrupted
[16:16:33] [PASSED] ttm_bo_reserve_deadlock
[16:16:33] [PASSED] ttm_bo_unreserve_basic
[16:16:33] [PASSED] ttm_bo_unreserve_pinned
[16:16:33] [PASSED] ttm_bo_unreserve_bulk
[16:16:33] [PASSED] ttm_bo_fini_basic
[16:16:33] [PASSED] ttm_bo_fini_shared_resv
[16:16:33] [PASSED] ttm_bo_pin_basic
[16:16:33] [PASSED] ttm_bo_pin_unpin_resource
[16:16:33] [PASSED] ttm_bo_multiple_pin_one_unpin
[16:16:33] ===================== [PASSED] ttm_bo ======================
[16:16:33] ============== ttm_bo_validate (21 subtests) ===============
[16:16:33] ============== ttm_bo_init_reserved_sys_man  ===============
[16:16:33] [PASSED] Buffer object for userspace
[16:16:33] [PASSED] Kernel buffer object
[16:16:33] [PASSED] Shared buffer object
[16:16:33] ========== [PASSED] ttm_bo_init_reserved_sys_man ===========
[16:16:33] ============== ttm_bo_init_reserved_mock_man  ==============
[16:16:33] [PASSED] Buffer object for userspace
[16:16:33] [PASSED] Kernel buffer object
[16:16:33] [PASSED] Shared buffer object
[16:16:33] ========== [PASSED] ttm_bo_init_reserved_mock_man ==========
[16:16:33] [PASSED] ttm_bo_init_reserved_resv
[16:16:33] ================== ttm_bo_validate_basic  ==================
[16:16:33] [PASSED] Buffer object for userspace
[16:16:33] [PASSED] Kernel buffer object
[16:16:33] [PASSED] Shared buffer object
[16:16:33] ============== [PASSED] ttm_bo_validate_basic ==============
[16:16:33] [PASSED] ttm_bo_validate_invalid_placement
[16:16:33] ============= ttm_bo_validate_same_placement  ==============
[16:16:33] [PASSED] System manager
[16:16:33] [PASSED] VRAM manager
[16:16:33] ========= [PASSED] ttm_bo_validate_same_placement ==========
[16:16:33] [PASSED] ttm_bo_validate_failed_alloc
[16:16:33] [PASSED] ttm_bo_validate_pinned
[16:16:33] [PASSED] ttm_bo_validate_busy_placement
[16:16:33] ================ ttm_bo_validate_multihop  =================
[16:16:33] [PASSED] Buffer object for userspace
[16:16:33] [PASSED] Kernel buffer object
[16:16:33] [PASSED] Shared buffer object
[16:16:33] ============ [PASSED] ttm_bo_validate_multihop =============
[16:16:33] ========== ttm_bo_validate_no_placement_signaled  ==========
[16:16:33] [PASSED] Buffer object in system domain, no page vector
[16:16:33] [PASSED] Buffer object in system domain with an existing page vector
[16:16:33] ====== [PASSED] ttm_bo_validate_no_placement_signaled ======
[16:16:33] ======== ttm_bo_validate_no_placement_not_signaled  ========
[16:16:33] [PASSED] Buffer object for userspace
[16:16:33] [PASSED] Kernel buffer object
[16:16:33] [PASSED] Shared buffer object
[16:16:33] ==== [PASSED] ttm_bo_validate_no_placement_not_signaled ====
[16:16:33] [PASSED] ttm_bo_validate_move_fence_signaled
[16:16:33] ========= ttm_bo_validate_move_fence_not_signaled  =========
[16:16:33] [PASSED] Waits for GPU
[16:16:33] [PASSED] Tries to lock straight away
[16:16:33] ===== [PASSED] ttm_bo_validate_move_fence_not_signaled =====
[16:16:33] [PASSED] ttm_bo_validate_happy_evict
[16:16:33] [PASSED] ttm_bo_validate_all_pinned_evict
[16:16:33] [PASSED] ttm_bo_validate_allowed_only_evict
[16:16:33] [PASSED] ttm_bo_validate_deleted_evict
[16:16:33] [PASSED] ttm_bo_validate_busy_domain_evict
[16:16:33] [PASSED] ttm_bo_validate_evict_gutting
[16:16:33] [PASSED] ttm_bo_validate_recrusive_evict
[16:16:33] ================= [PASSED] ttm_bo_validate =================
[16:16:33] ============================================================
[16:16:33] Testing complete. Ran 101 tests: passed: 101
[16:16:33] Elapsed time: 11.273s total, 1.619s configuring, 9.438s building, 0.177s running

+ cleanup
++ stat -c %u:%g /kernel
+ chown -R 1003:1003 /kernel



^ permalink raw reply	[flat|nested] 24+ messages in thread

* ✗ CI.checksparse: warning for Introduce DRM_RAS using generic netlink for RAS (rev5)
  2026-02-02  6:43 [PATCH v5 0/5] Introduce DRM_RAS using generic netlink for RAS Riana Tauro
                   ` (6 preceding siblings ...)
  2026-02-02 16:16 ` ✓ CI.KUnit: success " Patchwork
@ 2026-02-02 16:31 ` Patchwork
  2026-02-02 16:51 ` ✓ Xe.CI.BAT: success " Patchwork
  8 siblings, 0 replies; 24+ messages in thread
From: Patchwork @ 2026-02-02 16:31 UTC (permalink / raw)
  To: Rodrigo Vivi; +Cc: intel-xe

== Series Details ==

Series: Introduce DRM_RAS using generic netlink for RAS (rev5)
URL   : https://patchwork.freedesktop.org/series/155188/
State : warning

== Summary ==

+ trap cleanup EXIT
+ KERNEL=/kernel
+ MT=/root/linux/maintainer-tools
+ git clone https://gitlab.freedesktop.org/drm/maintainer-tools /root/linux/maintainer-tools
Cloning into '/root/linux/maintainer-tools'...
warning: redirecting to https://gitlab.freedesktop.org/drm/maintainer-tools.git/
+ make -C /root/linux/maintainer-tools
make: Entering directory '/root/linux/maintainer-tools'
cc -O2 -g -Wextra -o remap-log remap-log.c
make: Leaving directory '/root/linux/maintainer-tools'
+ cd /kernel
+ git config --global --add safe.directory /kernel
+ /root/linux/maintainer-tools/dim sparse --fast 4af5aac5c903c12b53b9cab756c49496f595a27d
Sparse version: 0.6.4 (Ubuntu: 0.6.4-4ubuntu3)
Fast mode used, each commit won't be checked separately.
-
+drivers/gpu/drm/drm_drv.c:61:1: error: bad constant expression
+drivers/gpu/drm/drm_drv.c:62:1: error: bad constant expression
+drivers/gpu/drm/drm_drv.c:63:1: error: bad constant expression
+drivers/gpu/drm/drm_drv.c:63:1: error: bad constant expression

+ cleanup
++ stat -c %u:%g /kernel
+ chown -R 1003:1003 /kernel



^ permalink raw reply	[flat|nested] 24+ messages in thread

* ✓ Xe.CI.BAT: success for Introduce DRM_RAS using generic netlink for RAS (rev5)
  2026-02-02  6:43 [PATCH v5 0/5] Introduce DRM_RAS using generic netlink for RAS Riana Tauro
                   ` (7 preceding siblings ...)
  2026-02-02 16:31 ` ✗ CI.checksparse: warning " Patchwork
@ 2026-02-02 16:51 ` Patchwork
  8 siblings, 0 replies; 24+ messages in thread
From: Patchwork @ 2026-02-02 16:51 UTC (permalink / raw)
  To: Rodrigo Vivi; +Cc: intel-xe

== Series Details ==

Series: Introduce DRM_RAS using generic netlink for RAS (rev5)
URL   : https://patchwork.freedesktop.org/series/155188/
State : success

== Summary ==

CI Bug Log - changes from xe-4481-cd1fd615b2ba56ea3fb033262d4fbd0503055d3c_BAT -> xe-pw-155188v5_BAT
====================================================

Summary
-------

  **SUCCESS**

  No regressions found.

  

Participating hosts (12 -> 12)
------------------------------

  No changes in participating hosts

Known issues
------------

  Here are the changes found in xe-pw-155188v5_BAT that come from known issues:

### IGT changes ###

#### Issues hit ####

  * igt@xe_waitfence@abstime:
    - bat-dg2-oem2:       [PASS][1] -> [TIMEOUT][2] ([Intel XE#6506])
   [1]: https://intel-gfx-ci.01.org/tree/intel-xe/xe-4481-cd1fd615b2ba56ea3fb033262d4fbd0503055d3c/bat-dg2-oem2/igt@xe_waitfence@abstime.html
   [2]: https://intel-gfx-ci.01.org/tree/intel-xe/xe-pw-155188v5/bat-dg2-oem2/igt@xe_waitfence@abstime.html

  
  [Intel XE#6506]: https://gitlab.freedesktop.org/drm/xe/kernel/issues/6506


Build changes
-------------

  * Linux: xe-4481-cd1fd615b2ba56ea3fb033262d4fbd0503055d3c -> xe-pw-155188v5

  IGT_8729: 8729
  xe-4481-cd1fd615b2ba56ea3fb033262d4fbd0503055d3c: cd1fd615b2ba56ea3fb033262d4fbd0503055d3c
  xe-pw-155188v5: 155188v5

== Logs ==

For more details see: https://intel-gfx-ci.01.org/tree/intel-xe/xe-pw-155188v5/index.html


^ permalink raw reply	[flat|nested] 24+ messages in thread

* Re: [PATCH v5 1/5] drm/ras: Introduce the DRM RAS infrastructure over generic netlink
  2026-02-02  6:43 ` [PATCH v5 1/5] drm/ras: Introduce the DRM RAS infrastructure over generic netlink Riana Tauro
  2026-02-02 10:08   ` kernel test robot
@ 2026-02-02 22:52   ` kernel test robot
  1 sibling, 0 replies; 24+ messages in thread
From: kernel test robot @ 2026-02-02 22:52 UTC (permalink / raw)
  To: Riana Tauro, intel-xe, dri-devel
  Cc: oe-kbuild-all, aravind.iddamsetty, anshuman.gupta, rodrigo.vivi,
	joonas.lahtinen, simona.vetter, airlied, pratik.bari,
	joshua.santosh.ranjan, ashwin.kumar.kulkarni, shubham.kumar,
	ravi.kishore.koppuravuri, raag.jadav, Zack McKevitt, Lijo Lazar,
	Hawking Zhang, Jakub Kicinski, Paolo Abeni, Eric Dumazet, netdev,
	Riana Tauro

Hi Riana,

kernel test robot noticed the following build errors:

[auto build test ERROR on drm-xe/drm-xe-next]
[also build test ERROR on drm-misc/drm-misc-next drm/drm-next linus/master v6.19-rc8]
[If your patch is applied to the wrong git tree, kindly drop us a note.
And when submitting a patch, we suggest using '--base' as documented in
https://git-scm.com/docs/git-format-patch#_base_tree_information]

url:    https://github.com/intel-lab-lkp/linux/commits/Riana-Tauro/drm-ras-Introduce-the-DRM-RAS-infrastructure-over-generic-netlink/20260202-141553
base:   https://gitlab.freedesktop.org/drm/xe/kernel.git drm-xe-next
patch link:    https://lore.kernel.org/r/20260202064356.286243-8-riana.tauro%40intel.com
patch subject: [PATCH v5 1/5] drm/ras: Introduce the DRM RAS infrastructure over generic netlink
config: csky-randconfig-002-20260203 (https://download.01.org/0day-ci/archive/20260203/202602030622.mmakbYmv-lkp@intel.com/config)
compiler: csky-linux-gcc (GCC) 15.2.0
reproduce (this is a W=1 build): (https://download.01.org/0day-ci/archive/20260203/202602030622.mmakbYmv-lkp@intel.com/reproduce)

If you fix the issue in a separate patch/commit (i.e. not just a new version of
the same patch/commit), kindly add the following tags
| Reported-by: kernel test robot <lkp@intel.com>
| Closes: https://lore.kernel.org/oe-kbuild-all/202602030622.mmakbYmv-lkp@intel.com/

All errors (new ones prefixed by >>):

   csky-linux-ld: drivers/gpu/drm/drm_ras.o: in function `nla_put_u32':
>> include/net/netlink.h:1459:(.text+0x12): undefined reference to `nla_put'
   csky-linux-ld: drivers/gpu/drm/drm_ras.o: in function `nla_put_string':
   include/net/netlink.h:1657:(.text+0x28): undefined reference to `nla_put'
   csky-linux-ld: drivers/gpu/drm/drm_ras.o: in function `nla_put_u32':
   include/net/netlink.h:1459:(.text+0x3a): undefined reference to `nla_put'
   csky-linux-ld: drivers/gpu/drm/drm_ras.o: in function `drm_ras_node_unregister':
>> drivers/gpu/drm/drm_ras.c:350:(.text+0x10c): undefined reference to `nla_put'
   csky-linux-ld: drivers/gpu/drm/drm_ras.o: in function `__genlmsg_iput':
>> include/net/genetlink.h:342:(.text+0x246): undefined reference to `genlmsg_put'
   csky-linux-ld: drivers/gpu/drm/drm_ras.o: in function `nla_put_u32':
   include/net/netlink.h:1459:(.text+0x262): undefined reference to `nla_put'
   csky-linux-ld: drivers/gpu/drm/drm_ras.o: in function `nla_put_string':
   include/net/netlink.h:1657:(.text+0x27e): undefined reference to `nla_put'
>> csky-linux-ld: include/net/netlink.h:1657:(.text+0x29a): undefined reference to `nla_put'
   csky-linux-ld: drivers/gpu/drm/drm_ras.o: in function `nla_put_u32':
   include/net/netlink.h:1459:(.text+0x2b6): undefined reference to `nla_put'
   csky-linux-ld: drivers/gpu/drm/drm_ras.o: in function `nlmsg_trim':
>> include/net/netlink.h:1108:(.text+0x2e4): undefined reference to `skb_trim'
   csky-linux-ld: drivers/gpu/drm/drm_ras.o: in function `drm_ras_nl_list_nodes_dumpit':
>> drivers/gpu/drm/drm_ras.c:144:(.text+0x330): undefined reference to `genlmsg_put'
>> csky-linux-ld: drivers/gpu/drm/drm_ras.c:144:(.text+0x334): undefined reference to `nla_put'
>> csky-linux-ld: drivers/gpu/drm/drm_ras.c:144:(.text+0x344): undefined reference to `skb_trim'
   csky-linux-ld: drivers/gpu/drm/drm_ras.o: in function `__genlmsg_iput':
   include/net/genetlink.h:342:(.text+0x408): undefined reference to `genlmsg_put'
   csky-linux-ld: drivers/gpu/drm/drm_ras.o: in function `nlmsg_trim':
   include/net/netlink.h:1108:(.text+0x45a): undefined reference to `skb_trim'
   csky-linux-ld: drivers/gpu/drm/drm_ras.o: in function `alloc_skb':
>> include/linux/skbuff.h:1383:(.text+0x4f8): undefined reference to `__alloc_skb'
   csky-linux-ld: drivers/gpu/drm/drm_ras.o: in function `__genlmsg_iput':
   include/net/genetlink.h:342:(.text+0x510): undefined reference to `genlmsg_put'
   csky-linux-ld: drivers/gpu/drm/drm_ras.o: in function `kfree_skb_reason':
>> include/linux/skbuff.h:1322:(.text+0x51e): undefined reference to `sk_skb_reason_drop'
   csky-linux-ld: drivers/gpu/drm/drm_ras.o: in function `doit_reply_value':
   drivers/gpu/drm/drm_ras.c:196:(.text+0x534): undefined reference to `genlmsg_put'
   csky-linux-ld: drivers/gpu/drm/drm_ras.c:196:(.text+0x544): undefined reference to `skb_trim'
>> csky-linux-ld: drivers/gpu/drm/drm_ras.c:196:(.text+0x548): undefined reference to `__alloc_skb'
>> csky-linux-ld: drivers/gpu/drm/drm_ras.c:196:(.text+0x54c): undefined reference to `sk_skb_reason_drop'
   csky-linux-ld: drivers/gpu/drm/drm_ras.o: in function `nlmsg_trim':
   include/net/netlink.h:1108:(.text+0x5a4): undefined reference to `skb_trim'
   csky-linux-ld: drivers/gpu/drm/drm_ras.o: in function `kfree_skb_reason':
   include/linux/skbuff.h:1322:(.text+0x5b6): undefined reference to `sk_skb_reason_drop'
   csky-linux-ld: drivers/gpu/drm/drm_ras.o: in function `nlmsg_unicast':
>> include/net/netlink.h:1198:(.text+0x5d6): undefined reference to `netlink_unicast'
   csky-linux-ld: drivers/gpu/drm/drm_ras.o: in function `drm_ras_nl_query_error_counter_doit':
>> drivers/gpu/drm/drm_ras.c:312:(.text+0x614): undefined reference to `skb_trim'
   csky-linux-ld: drivers/gpu/drm/drm_ras.c:312:(.text+0x618): undefined reference to `sk_skb_reason_drop'
>> csky-linux-ld: drivers/gpu/drm/drm_ras.c:312:(.text+0x61c): undefined reference to `init_net'
>> csky-linux-ld: drivers/gpu/drm/drm_ras.c:312:(.text+0x620): undefined reference to `netlink_unicast'
   csky-linux-ld: drivers/gpu/drm/drm_ras_genl_family.o: in function `drm_ras_genl_family_register':
>> drivers/gpu/drm/drm_ras_genl_family.c:23:(.text+0xa): undefined reference to `genl_register_family'
   csky-linux-ld: drivers/gpu/drm/drm_ras_genl_family.o: in function `drm_ras_genl_family_unregister':
>> drivers/gpu/drm/drm_ras_genl_family.c:39:(.text+0x4c): undefined reference to `genl_unregister_family'
>> csky-linux-ld: drivers/gpu/drm/drm_ras_genl_family.c:42:(.text+0x64): undefined reference to `genl_register_family'
>> csky-linux-ld: drivers/gpu/drm/drm_ras_genl_family.c:42:(.text+0x68): undefined reference to `genl_unregister_family'
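
Every unresolved symbol above (nla_put, genlmsg_put, skb_trim, __alloc_skb,
netlink_unicast, init_net, genl_register_family, ...) belongs to the core
networking and generic netlink code, which is normally built only when
CONFIG_NET is enabled. A likely reading, not confirmed against the linked
randconfig, is that this csky configuration has networking disabled while the
new DRM RAS objects are still built. A minimal sketch of the kind of Kconfig
dependency that usually resolves this class of link error follows. The symbol
name DRM_RAS, its type, and the prompt text are assumptions for illustration
and are not taken from the series.

```kconfig
# Hypothetical sketch: symbol name, type and prompt are assumed, not quoted
# from the patch. The relevant part is the explicit dependency on NET, so
# that drm_ras.o and drm_ras_genl_family.o are never built without the
# netlink core they link against.
config DRM_RAS
	bool "DRM RAS support"
	depends on DRM
	depends on NET
```

With a dependency like this, a randconfig that turns NET off would also turn
the RAS code off instead of failing at link time. Whether "depends on" or
"select" is the better fit is a design choice left to the series author.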


vim +1459 include/net/netlink.h

24c410dce335db David S. Miller 2012-04-01  1448  
bfa83a9e03cf8d Thomas Graf     2005-11-10  1449  /**
bfa83a9e03cf8d Thomas Graf     2005-11-10  1450   * nla_put_u32 - Add a u32 netlink attribute to a socket buffer
bfa83a9e03cf8d Thomas Graf     2005-11-10  1451   * @skb: socket buffer to add attribute to
bfa83a9e03cf8d Thomas Graf     2005-11-10  1452   * @attrtype: attribute type
bfa83a9e03cf8d Thomas Graf     2005-11-10  1453   * @value: numeric value
bfa83a9e03cf8d Thomas Graf     2005-11-10  1454   */
bfa83a9e03cf8d Thomas Graf     2005-11-10  1455  static inline int nla_put_u32(struct sk_buff *skb, int attrtype, u32 value)
bfa83a9e03cf8d Thomas Graf     2005-11-10  1456  {
b4391db42308c9 Arnd Bergmann   2017-09-22  1457  	u32 tmp = value;
b4391db42308c9 Arnd Bergmann   2017-09-22  1458  
b4391db42308c9 Arnd Bergmann   2017-09-22 @1459  	return nla_put(skb, attrtype, sizeof(u32), &tmp);
bfa83a9e03cf8d Thomas Graf     2005-11-10  1460  }
bfa83a9e03cf8d Thomas Graf     2005-11-10  1461  
374d345d9b5e13 Jakub Kicinski  2023-10-18  1462  /**
374d345d9b5e13 Jakub Kicinski  2023-10-18  1463   * nla_put_uint - Add a variable-size unsigned int to a socket buffer
374d345d9b5e13 Jakub Kicinski  2023-10-18  1464   * @skb: socket buffer to add attribute to
374d345d9b5e13 Jakub Kicinski  2023-10-18  1465   * @attrtype: attribute type
374d345d9b5e13 Jakub Kicinski  2023-10-18  1466   * @value: numeric value
374d345d9b5e13 Jakub Kicinski  2023-10-18  1467   */
374d345d9b5e13 Jakub Kicinski  2023-10-18  1468  static inline int nla_put_uint(struct sk_buff *skb, int attrtype, u64 value)
374d345d9b5e13 Jakub Kicinski  2023-10-18  1469  {
374d345d9b5e13 Jakub Kicinski  2023-10-18  1470  	u64 tmp64 = value;
374d345d9b5e13 Jakub Kicinski  2023-10-18  1471  	u32 tmp32 = value;
374d345d9b5e13 Jakub Kicinski  2023-10-18  1472  
374d345d9b5e13 Jakub Kicinski  2023-10-18  1473  	if (tmp64 == tmp32)
374d345d9b5e13 Jakub Kicinski  2023-10-18  1474  		return nla_put_u32(skb, attrtype, tmp32);
374d345d9b5e13 Jakub Kicinski  2023-10-18  1475  	return nla_put(skb, attrtype, sizeof(u64), &tmp64);
374d345d9b5e13 Jakub Kicinski  2023-10-18  1476  }
374d345d9b5e13 Jakub Kicinski  2023-10-18  1477  
bfa83a9e03cf8d Thomas Graf     2005-11-10  1478  /**
569a8fc38367df David S. Miller 2012-03-29  1479   * nla_put_be32 - Add a __be32 netlink attribute to a socket buffer
569a8fc38367df David S. Miller 2012-03-29  1480   * @skb: socket buffer to add attribute to
569a8fc38367df David S. Miller 2012-03-29  1481   * @attrtype: attribute type
569a8fc38367df David S. Miller 2012-03-29  1482   * @value: numeric value
569a8fc38367df David S. Miller 2012-03-29  1483   */
569a8fc38367df David S. Miller 2012-03-29  1484  static inline int nla_put_be32(struct sk_buff *skb, int attrtype, __be32 value)
569a8fc38367df David S. Miller 2012-03-29  1485  {
b4391db42308c9 Arnd Bergmann   2017-09-22  1486  	__be32 tmp = value;
b4391db42308c9 Arnd Bergmann   2017-09-22  1487  
b4391db42308c9 Arnd Bergmann   2017-09-22  1488  	return nla_put(skb, attrtype, sizeof(__be32), &tmp);
569a8fc38367df David S. Miller 2012-03-29  1489  }
569a8fc38367df David S. Miller 2012-03-29  1490  
6c1dd3b6a35178 David S. Miller 2012-04-01  1491  /**
6c1dd3b6a35178 David S. Miller 2012-04-01  1492   * nla_put_net32 - Add 32-bit network byte order netlink attribute to a socket buffer
6c1dd3b6a35178 David S. Miller 2012-04-01  1493   * @skb: socket buffer to add attribute to
6c1dd3b6a35178 David S. Miller 2012-04-01  1494   * @attrtype: attribute type
6c1dd3b6a35178 David S. Miller 2012-04-01  1495   * @value: numeric value
6c1dd3b6a35178 David S. Miller 2012-04-01  1496   */
6c1dd3b6a35178 David S. Miller 2012-04-01  1497  static inline int nla_put_net32(struct sk_buff *skb, int attrtype, __be32 value)
6c1dd3b6a35178 David S. Miller 2012-04-01  1498  {
b4391db42308c9 Arnd Bergmann   2017-09-22  1499  	__be32 tmp = value;
b4391db42308c9 Arnd Bergmann   2017-09-22  1500  
b4391db42308c9 Arnd Bergmann   2017-09-22  1501  	return nla_put_be32(skb, attrtype | NLA_F_NET_BYTEORDER, tmp);
6c1dd3b6a35178 David S. Miller 2012-04-01  1502  }
6c1dd3b6a35178 David S. Miller 2012-04-01  1503  
24c410dce335db David S. Miller 2012-04-01  1504  /**
24c410dce335db David S. Miller 2012-04-01  1505   * nla_put_le32 - Add a __le32 netlink attribute to a socket buffer
24c410dce335db David S. Miller 2012-04-01  1506   * @skb: socket buffer to add attribute to
24c410dce335db David S. Miller 2012-04-01  1507   * @attrtype: attribute type
24c410dce335db David S. Miller 2012-04-01  1508   * @value: numeric value
24c410dce335db David S. Miller 2012-04-01  1509   */
24c410dce335db David S. Miller 2012-04-01  1510  static inline int nla_put_le32(struct sk_buff *skb, int attrtype, __le32 value)
24c410dce335db David S. Miller 2012-04-01  1511  {
b4391db42308c9 Arnd Bergmann   2017-09-22  1512  	__le32 tmp = value;
b4391db42308c9 Arnd Bergmann   2017-09-22  1513  
b4391db42308c9 Arnd Bergmann   2017-09-22  1514  	return nla_put(skb, attrtype, sizeof(__le32), &tmp);
24c410dce335db David S. Miller 2012-04-01  1515  }
24c410dce335db David S. Miller 2012-04-01  1516  
73520786b0793c Nicolas Dichtel 2016-04-22  1517  /**
73520786b0793c Nicolas Dichtel 2016-04-22  1518   * nla_put_u64_64bit - Add a u64 netlink attribute to a skb and align it
73520786b0793c Nicolas Dichtel 2016-04-22  1519   * @skb: socket buffer to add attribute to
73520786b0793c Nicolas Dichtel 2016-04-22  1520   * @attrtype: attribute type
73520786b0793c Nicolas Dichtel 2016-04-22  1521   * @value: numeric value
73520786b0793c Nicolas Dichtel 2016-04-22  1522   * @padattr: attribute type for the padding
73520786b0793c Nicolas Dichtel 2016-04-22  1523   */
73520786b0793c Nicolas Dichtel 2016-04-22  1524  static inline int nla_put_u64_64bit(struct sk_buff *skb, int attrtype,
73520786b0793c Nicolas Dichtel 2016-04-22  1525  				    u64 value, int padattr)
73520786b0793c Nicolas Dichtel 2016-04-22  1526  {
b4391db42308c9 Arnd Bergmann   2017-09-22  1527  	u64 tmp = value;
b4391db42308c9 Arnd Bergmann   2017-09-22  1528  
b4391db42308c9 Arnd Bergmann   2017-09-22  1529  	return nla_put_64bit(skb, attrtype, sizeof(u64), &tmp, padattr);
73520786b0793c Nicolas Dichtel 2016-04-22  1530  }
73520786b0793c Nicolas Dichtel 2016-04-22  1531  
569a8fc38367df David S. Miller 2012-03-29  1532  /**
b46f6ded906ef0 Nicolas Dichtel 2016-04-22  1533   * nla_put_be64 - Add a __be64 netlink attribute to a socket buffer and align it
569a8fc38367df David S. Miller 2012-03-29  1534   * @skb: socket buffer to add attribute to
569a8fc38367df David S. Miller 2012-03-29  1535   * @attrtype: attribute type
569a8fc38367df David S. Miller 2012-03-29  1536   * @value: numeric value
b46f6ded906ef0 Nicolas Dichtel 2016-04-22  1537   * @padattr: attribute type for the padding
569a8fc38367df David S. Miller 2012-03-29  1538   */
b46f6ded906ef0 Nicolas Dichtel 2016-04-22  1539  static inline int nla_put_be64(struct sk_buff *skb, int attrtype, __be64 value,
b46f6ded906ef0 Nicolas Dichtel 2016-04-22  1540  			       int padattr)
569a8fc38367df David S. Miller 2012-03-29  1541  {
b4391db42308c9 Arnd Bergmann   2017-09-22  1542  	__be64 tmp = value;
b4391db42308c9 Arnd Bergmann   2017-09-22  1543  
b4391db42308c9 Arnd Bergmann   2017-09-22  1544  	return nla_put_64bit(skb, attrtype, sizeof(__be64), &tmp, padattr);
569a8fc38367df David S. Miller 2012-03-29  1545  }
569a8fc38367df David S. Miller 2012-03-29  1546  
6c1dd3b6a35178 David S. Miller 2012-04-01  1547  /**
e9bbe898cbe89b Nicolas Dichtel 2016-04-22  1548   * nla_put_net64 - Add 64-bit network byte order nlattr to a skb and align it
6c1dd3b6a35178 David S. Miller 2012-04-01  1549   * @skb: socket buffer to add attribute to
6c1dd3b6a35178 David S. Miller 2012-04-01  1550   * @attrtype: attribute type
6c1dd3b6a35178 David S. Miller 2012-04-01  1551   * @value: numeric value
e9bbe898cbe89b Nicolas Dichtel 2016-04-22  1552   * @padattr: attribute type for the padding
6c1dd3b6a35178 David S. Miller 2012-04-01  1553   */
e9bbe898cbe89b Nicolas Dichtel 2016-04-22  1554  static inline int nla_put_net64(struct sk_buff *skb, int attrtype, __be64 value,
e9bbe898cbe89b Nicolas Dichtel 2016-04-22  1555  				int padattr)
6c1dd3b6a35178 David S. Miller 2012-04-01  1556  {
b4391db42308c9 Arnd Bergmann   2017-09-22  1557  	__be64 tmp = value;
b4391db42308c9 Arnd Bergmann   2017-09-22  1558  
b4391db42308c9 Arnd Bergmann   2017-09-22  1559  	return nla_put_be64(skb, attrtype | NLA_F_NET_BYTEORDER, tmp,
e9bbe898cbe89b Nicolas Dichtel 2016-04-22  1560  			    padattr);
6c1dd3b6a35178 David S. Miller 2012-04-01  1561  }
6c1dd3b6a35178 David S. Miller 2012-04-01  1562  
24c410dce335db David S. Miller 2012-04-01  1563  /**
e7479122befd70 Nicolas Dichtel 2016-04-22  1564   * nla_put_le64 - Add a __le64 netlink attribute to a socket buffer and align it
24c410dce335db David S. Miller 2012-04-01  1565   * @skb: socket buffer to add attribute to
24c410dce335db David S. Miller 2012-04-01  1566   * @attrtype: attribute type
24c410dce335db David S. Miller 2012-04-01  1567   * @value: numeric value
e7479122befd70 Nicolas Dichtel 2016-04-22  1568   * @padattr: attribute type for the padding
24c410dce335db David S. Miller 2012-04-01  1569   */
e7479122befd70 Nicolas Dichtel 2016-04-22  1570  static inline int nla_put_le64(struct sk_buff *skb, int attrtype, __le64 value,
e7479122befd70 Nicolas Dichtel 2016-04-22  1571  			       int padattr)
24c410dce335db David S. Miller 2012-04-01  1572  {
b4391db42308c9 Arnd Bergmann   2017-09-22  1573  	__le64 tmp = value;
b4391db42308c9 Arnd Bergmann   2017-09-22  1574  
b4391db42308c9 Arnd Bergmann   2017-09-22  1575  	return nla_put_64bit(skb, attrtype, sizeof(__le64), &tmp, padattr);
24c410dce335db David S. Miller 2012-04-01  1576  }
24c410dce335db David S. Miller 2012-04-01  1577  
4778e0be16c291 Jiri Pirko      2012-07-27  1578  /**
4778e0be16c291 Jiri Pirko      2012-07-27  1579   * nla_put_s8 - Add a s8 netlink attribute to a socket buffer
4778e0be16c291 Jiri Pirko      2012-07-27  1580   * @skb: socket buffer to add attribute to
4778e0be16c291 Jiri Pirko      2012-07-27  1581   * @attrtype: attribute type
4778e0be16c291 Jiri Pirko      2012-07-27  1582   * @value: numeric value
4778e0be16c291 Jiri Pirko      2012-07-27  1583   */
4778e0be16c291 Jiri Pirko      2012-07-27  1584  static inline int nla_put_s8(struct sk_buff *skb, int attrtype, s8 value)
4778e0be16c291 Jiri Pirko      2012-07-27  1585  {
b4391db42308c9 Arnd Bergmann   2017-09-22  1586  	s8 tmp = value;
b4391db42308c9 Arnd Bergmann   2017-09-22  1587  
b4391db42308c9 Arnd Bergmann   2017-09-22  1588  	return nla_put(skb, attrtype, sizeof(s8), &tmp);
4778e0be16c291 Jiri Pirko      2012-07-27  1589  }
4778e0be16c291 Jiri Pirko      2012-07-27  1590  
4778e0be16c291 Jiri Pirko      2012-07-27  1591  /**
4778e0be16c291 Jiri Pirko      2012-07-27  1592   * nla_put_s16 - Add a s16 netlink attribute to a socket buffer
4778e0be16c291 Jiri Pirko      2012-07-27  1593   * @skb: socket buffer to add attribute to
4778e0be16c291 Jiri Pirko      2012-07-27  1594   * @attrtype: attribute type
4778e0be16c291 Jiri Pirko      2012-07-27  1595   * @value: numeric value
4778e0be16c291 Jiri Pirko      2012-07-27  1596   */
4778e0be16c291 Jiri Pirko      2012-07-27  1597  static inline int nla_put_s16(struct sk_buff *skb, int attrtype, s16 value)
4778e0be16c291 Jiri Pirko      2012-07-27  1598  {
b4391db42308c9 Arnd Bergmann   2017-09-22  1599  	s16 tmp = value;
b4391db42308c9 Arnd Bergmann   2017-09-22  1600  
b4391db42308c9 Arnd Bergmann   2017-09-22  1601  	return nla_put(skb, attrtype, sizeof(s16), &tmp);
4778e0be16c291 Jiri Pirko      2012-07-27  1602  }
4778e0be16c291 Jiri Pirko      2012-07-27  1603  
4778e0be16c291 Jiri Pirko      2012-07-27  1604  /**
4778e0be16c291 Jiri Pirko      2012-07-27  1605   * nla_put_s32 - Add a s32 netlink attribute to a socket buffer
4778e0be16c291 Jiri Pirko      2012-07-27  1606   * @skb: socket buffer to add attribute to
4778e0be16c291 Jiri Pirko      2012-07-27  1607   * @attrtype: attribute type
4778e0be16c291 Jiri Pirko      2012-07-27  1608   * @value: numeric value
4778e0be16c291 Jiri Pirko      2012-07-27  1609   */
4778e0be16c291 Jiri Pirko      2012-07-27  1610  static inline int nla_put_s32(struct sk_buff *skb, int attrtype, s32 value)
4778e0be16c291 Jiri Pirko      2012-07-27  1611  {
b4391db42308c9 Arnd Bergmann   2017-09-22  1612  	s32 tmp = value;
b4391db42308c9 Arnd Bergmann   2017-09-22  1613  
b4391db42308c9 Arnd Bergmann   2017-09-22  1614  	return nla_put(skb, attrtype, sizeof(s32), &tmp);
4778e0be16c291 Jiri Pirko      2012-07-27  1615  }
4778e0be16c291 Jiri Pirko      2012-07-27  1616  
4778e0be16c291 Jiri Pirko      2012-07-27  1617  /**
756a2f59b73cd6 Nicolas Dichtel 2016-04-22  1618   * nla_put_s64 - Add a s64 netlink attribute to a socket buffer and align it
4778e0be16c291 Jiri Pirko      2012-07-27  1619   * @skb: socket buffer to add attribute to
4778e0be16c291 Jiri Pirko      2012-07-27  1620   * @attrtype: attribute type
4778e0be16c291 Jiri Pirko      2012-07-27  1621   * @value: numeric value
756a2f59b73cd6 Nicolas Dichtel 2016-04-22  1622   * @padattr: attribute type for the padding
4778e0be16c291 Jiri Pirko      2012-07-27  1623   */
756a2f59b73cd6 Nicolas Dichtel 2016-04-22  1624  static inline int nla_put_s64(struct sk_buff *skb, int attrtype, s64 value,
756a2f59b73cd6 Nicolas Dichtel 2016-04-22  1625  			      int padattr)
4778e0be16c291 Jiri Pirko      2012-07-27  1626  {
b4391db42308c9 Arnd Bergmann   2017-09-22  1627  	s64 tmp = value;
b4391db42308c9 Arnd Bergmann   2017-09-22  1628  
b4391db42308c9 Arnd Bergmann   2017-09-22  1629  	return nla_put_64bit(skb, attrtype, sizeof(s64), &tmp, padattr);
4778e0be16c291 Jiri Pirko      2012-07-27  1630  }
4778e0be16c291 Jiri Pirko      2012-07-27  1631  
374d345d9b5e13 Jakub Kicinski  2023-10-18  1632  /**
374d345d9b5e13 Jakub Kicinski  2023-10-18  1633   * nla_put_sint - Add a variable-size signed int to a socket buffer
374d345d9b5e13 Jakub Kicinski  2023-10-18  1634   * @skb: socket buffer to add attribute to
374d345d9b5e13 Jakub Kicinski  2023-10-18  1635   * @attrtype: attribute type
374d345d9b5e13 Jakub Kicinski  2023-10-18  1636   * @value: numeric value
374d345d9b5e13 Jakub Kicinski  2023-10-18  1637   */
374d345d9b5e13 Jakub Kicinski  2023-10-18  1638  static inline int nla_put_sint(struct sk_buff *skb, int attrtype, s64 value)
374d345d9b5e13 Jakub Kicinski  2023-10-18  1639  {
374d345d9b5e13 Jakub Kicinski  2023-10-18  1640  	s64 tmp64 = value;
374d345d9b5e13 Jakub Kicinski  2023-10-18  1641  	s32 tmp32 = value;
374d345d9b5e13 Jakub Kicinski  2023-10-18  1642  
374d345d9b5e13 Jakub Kicinski  2023-10-18  1643  	if (tmp64 == tmp32)
374d345d9b5e13 Jakub Kicinski  2023-10-18  1644  		return nla_put_s32(skb, attrtype, tmp32);
374d345d9b5e13 Jakub Kicinski  2023-10-18  1645  	return nla_put(skb, attrtype, sizeof(s64), &tmp64);
374d345d9b5e13 Jakub Kicinski  2023-10-18  1646  }
374d345d9b5e13 Jakub Kicinski  2023-10-18  1647  
bfa83a9e03cf8d Thomas Graf     2005-11-10  1648  /**
bfa83a9e03cf8d Thomas Graf     2005-11-10  1649   * nla_put_string - Add a string netlink attribute to a socket buffer
bfa83a9e03cf8d Thomas Graf     2005-11-10  1650   * @skb: socket buffer to add attribute to
bfa83a9e03cf8d Thomas Graf     2005-11-10  1651   * @attrtype: attribute type
bfa83a9e03cf8d Thomas Graf     2005-11-10  1652   * @str: NUL terminated string
bfa83a9e03cf8d Thomas Graf     2005-11-10  1653   */
bfa83a9e03cf8d Thomas Graf     2005-11-10  1654  static inline int nla_put_string(struct sk_buff *skb, int attrtype,
bfa83a9e03cf8d Thomas Graf     2005-11-10  1655  				 const char *str)
bfa83a9e03cf8d Thomas Graf     2005-11-10  1656  {
bfa83a9e03cf8d Thomas Graf     2005-11-10 @1657  	return nla_put(skb, attrtype, strlen(str) + 1, str);
bfa83a9e03cf8d Thomas Graf     2005-11-10  1658  }
bfa83a9e03cf8d Thomas Graf     2005-11-10  1659  

-- 
0-DAY CI Kernel Test Service
https://github.com/intel/lkp-tests/wiki

* Re: [PATCH v5 2/5] drm/xe/xe_drm_ras: Add support for XE DRM RAS
  2026-02-02  6:43 ` [PATCH v5 2/5] drm/xe/xe_drm_ras: Add support for XE DRM RAS Riana Tauro
@ 2026-02-03 17:58   ` Raag Jadav
  2026-02-10  4:20     ` Riana Tauro
  0 siblings, 1 reply; 24+ messages in thread
From: Raag Jadav @ 2026-02-03 17:58 UTC (permalink / raw)
  To: Riana Tauro
  Cc: intel-xe, dri-devel, aravind.iddamsetty, anshuman.gupta,
	rodrigo.vivi, joonas.lahtinen, simona.vetter, airlied,
	pratik.bari, joshua.santosh.ranjan, ashwin.kumar.kulkarni,
	shubham.kumar, ravi.kishore.koppuravuri

On Mon, Feb 02, 2026 at 12:13:58PM +0530, Riana Tauro wrote:
> Allocate correctable, uncorrectable nodes for every xe device

Missing punctuation at the end of the sentence.

> Each node contains error component, counters and respective
> query counter functions.

Try to use the full 75-character line width where possible.

> Add basic functionality to create and register drm nodes.
> Below operations can be performed using Generic netlink DRM RAS interface

Missing punctuation at the end of the sentence.

...

> +++ b/drivers/gpu/drm/xe/xe_drm_ras.c
> @@ -0,0 +1,184 @@
> +// SPDX-License-Identifier: MIT
> +/*
> + * Copyright © 2026 Intel Corporation
> + */
> +
> +#include <drm/drm_managed.h>
> +#include <drm/drm_print.h>
> +#include <drm/drm_ras.h>
> +#include <linux/bitmap.h>

Linux includes usually go first.

> +#include "xe_device_types.h"
> +#include "xe_drm_ras.h"
> +
> +static const char * const errors[] = DRM_XE_RAS_ERROR_COMPONENT_NAMES;

'error_component'?

> +static const char * const error_severity[] = DRM_XE_RAS_ERROR_SEVERITY_NAMES;

...

> +static struct xe_drm_ras_counter *allocate_and_copy_counters(struct xe_device *xe)
> +{
> +	struct xe_drm_ras_counter *counter;
> +	int i;
> +
> +	counter = drmm_kcalloc(&xe->drm, DRM_XE_RAS_ERR_COMP_MAX,
> +			       sizeof(*counter), GFP_KERNEL);

Can be one line.

> +	if (!counter)
> +		return ERR_PTR(-ENOMEM);
> +
> +	for (i = DRM_XE_RAS_ERR_COMP_CORE_COMPUTE; i < DRM_XE_RAS_ERR_COMP_MAX; i++) {
> +		if (!errors[i])
> +			continue;
> +
> +		counter[i].name = errors[i];
> +		atomic_set(&counter[i].counter, 0);

Do you need this?

> +	}
> +
> +	return counter;
> +}

...

> +int xe_drm_ras_allocate_nodes(struct xe_device *xe)
> +{
> +	struct xe_drm_ras *ras = &xe->ras;
> +	struct drm_ras_node *node;
> +	int err;
> +
> +	node = drmm_kcalloc(&xe->drm, DRM_XE_RAS_ERR_SEV_MAX, sizeof(*node),
> +			    GFP_KERNEL);

Can be one line.

> +	if (!node)
> +		return -ENOMEM;
> +
> +	ras->node = node;
> +
> +	err = register_nodes(xe);
> +	if (err) {
> +		drm_err(&xe->drm, "Failed to register DRM RAS node\n");
> +		return err;
> +	}
> +
> +	err = devm_add_action_or_reset(xe->drm.dev, xe_drm_ras_unregister_nodes, xe);
> +	if (err) {
> +		drm_err(&xe->drm, "Failed to add action for Xe DRM RAS\n");
> +		return err;
> +	}
> +
> +	return 0;
> +}

...

> +++ b/drivers/gpu/drm/xe/xe_drm_ras_types.h
> @@ -0,0 +1,40 @@
> +/* SPDX-License-Identifier: MIT */
> +/*
> + * Copyright © 2026 Intel Corporation
> + */
> +
> +#ifndef _XE_DRM_RAS_TYPES_H_
> +#define _XE_DRM_RAS_TYPES_H_
> +
> +#include <drm/xe_drm.h>
> +#include <linux/atomic.h>

Ditto for linux includes.

> +struct drm_ras_node;

Reviewed-by: Raag Jadav <raag.jadav@intel.com>

* Re: [PATCH v5 3/5] drm/xe/xe_hw_error: Integrate DRM RAS with hardware error handling
  2026-02-02  6:43 ` [PATCH v5 3/5] drm/xe/xe_hw_error: Integrate DRM RAS with hardware error handling Riana Tauro
@ 2026-02-05  8:30   ` Raag Jadav
  2026-02-10  4:58     ` Riana Tauro
  0 siblings, 1 reply; 24+ messages in thread
From: Raag Jadav @ 2026-02-05  8:30 UTC (permalink / raw)
  To: Riana Tauro
  Cc: intel-xe, dri-devel, aravind.iddamsetty, anshuman.gupta,
	rodrigo.vivi, joonas.lahtinen, simona.vetter, airlied,
	pratik.bari, joshua.santosh.ranjan, ashwin.kumar.kulkarni,
	shubham.kumar, ravi.kishore.koppuravuri

On Mon, Feb 02, 2026 at 12:13:59PM +0530, Riana Tauro wrote:
> Initialize DRM RAS in hw error init. Map the UAPI error severities
> with the hardware error severities and refactor file.
> 
> Signed-off-by: Riana Tauro <riana.tauro@intel.com>
> ---
>  drivers/gpu/drm/xe/xe_drm_ras_types.h |  8 ++++
>  drivers/gpu/drm/xe/xe_hw_error.c      | 68 ++++++++++++++++-----------
>  2 files changed, 48 insertions(+), 28 deletions(-)
> 
> diff --git a/drivers/gpu/drm/xe/xe_drm_ras_types.h b/drivers/gpu/drm/xe/xe_drm_ras_types.h
> index 0ac4ae324f37..beed48811d6a 100644
> --- a/drivers/gpu/drm/xe/xe_drm_ras_types.h
> +++ b/drivers/gpu/drm/xe/xe_drm_ras_types.h
> @@ -11,6 +11,14 @@
>  
>  struct drm_ras_node;
>  
> +/* Error categories reported by hardware */
> +enum hardware_error {
> +	HARDWARE_ERROR_CORRECTABLE = 0,
> +	HARDWARE_ERROR_NONFATAL = 1,
> +	HARDWARE_ERROR_FATAL = 2,

I'd align "= x" using tabs for readability.

> +	HARDWARE_ERROR_MAX,

Guaranteed last member, so redundant comma.

> +};

Also, just curious. Are these expected to be reused anywhere?
If not, they're probably better off in the .c file.
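
For illustration, a minimal standalone sketch of how the enum could look
with both comments applied (tab-aligned values, no comma after the final
sentinel). The static_assert is an extra added for this sketch only and is
not part of the patch.

```c
#include <assert.h>

/* Error categories reported by hardware */
enum hardware_error {
	HARDWARE_ERROR_CORRECTABLE	= 0,
	HARDWARE_ERROR_NONFATAL		= 1,
	HARDWARE_ERROR_FATAL		= 2,
	HARDWARE_ERROR_MAX		/* sentinel, nothing may follow it */
};

/* Sketch-only check: the sentinel equals the number of real categories. */
static_assert(HARDWARE_ERROR_MAX == 3, "HARDWARE_ERROR_MAX must stay last");
```

Leaving the comma off the sentinel makes it slightly harder to append a
member after it by accident.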

...

> @@ -86,8 +78,8 @@ static void csc_hw_error_handler(struct xe_tile *tile, const enum hardware_error
>  		fw_err = xe_mmio_read32(mmio, HEC_UNCORR_FW_ERR_DW0(base));
>  		for_each_set_bit(err_bit, &fw_err, HEC_UNCORR_FW_ERR_BITS) {
>  			drm_err_ratelimited(&xe->drm, HW_ERR
> -					    "%s: HEC Uncorrected FW %s error reported, bit[%d] is set\n",
> -					     hw_err_str, hec_uncorrected_fw_errors[err_bit],
> +					    "HEC FW %s error reported, bit[%d] is set\n",
> +					     hec_uncorrected_fw_errors[err_bit],

So we're dropping severity_str here? Did I miss something?

>  					     err_bit);

...

> +static int hw_error_info_init(struct xe_device *xe)
> +{
> +	int ret;
> +
> +	if (xe->info.platform != XE_PVC)
> +		return 0;
> +
> +	ret = xe_drm_ras_allocate_nodes(xe);

Why not just

	return xe_drm_ras_allocate_nodes(xe);

Tidy? ;)

> +	if (ret)
> +		return ret;
> +
> +	return 0;
> +}
> +
>  /*
>   * Process hardware errors during boot
>   */
> @@ -172,11 +179,16 @@ static void process_hw_errors(struct xe_device *xe)
>  void xe_hw_error_init(struct xe_device *xe)
>  {
>  	struct xe_tile *tile = xe_device_get_root_tile(xe);
> +	int ret;
>  
>  	if (!IS_DGFX(xe) || IS_SRIOV_VF(xe))
>  		return;
>  
>  	INIT_WORK(&tile->csc_hw_error_work, csc_hw_error_work);
>  
> +	ret = hw_error_info_init(xe);
> +	if (ret)
> +		drm_warn(&xe->drm, "Failed to allocate DRM RAS nodes\n");

This is unlikely to be caused by a hardware limitation, so I think
drm_err() would be more appropriate here.

Raag

> +
>  	process_hw_errors(xe);
>  }
> -- 
> 2.47.1
> 

* Re: [PATCH v5 4/5] drm/xe/xe_hw_error: Add support for Core-Compute errors
  2026-02-02  6:44 ` [PATCH v5 4/5] drm/xe/xe_hw_error: Add support for Core-Compute errors Riana Tauro
@ 2026-02-05 15:30   ` Raag Jadav
  2026-02-10  5:58     ` Riana Tauro
  0 siblings, 1 reply; 24+ messages in thread
From: Raag Jadav @ 2026-02-05 15:30 UTC (permalink / raw)
  To: Riana Tauro
  Cc: intel-xe, dri-devel, aravind.iddamsetty, anshuman.gupta,
	rodrigo.vivi, joonas.lahtinen, simona.vetter, airlied,
	pratik.bari, joshua.santosh.ranjan, ashwin.kumar.kulkarni,
	shubham.kumar, ravi.kishore.koppuravuri, Himal Prasad Ghimiray

On Mon, Feb 02, 2026 at 12:14:00PM +0530, Riana Tauro wrote:
> PVC supports GT error reporting via vector registers along with
> error status register. Add support to report these errors and
> update respective counters. In case of Subslice error reported
> by vector register, process the error status register
> for applicable bits.
> 
> The counter is embedded in the xe drm ras structure and is
> exposed to the userspace using the drm_ras generic netlink
> interface.
> 
> $ sudo ynl --family drm_ras --do query-error-counter  --json

We usually add '\' at the end for wrapping commands so that they're easy
to apply directly (and same for all other patches where applicable).

>   '{"node-id":0, "error-id":1}'

Ditto.

> {'error-id': 1, 'error-name': 'core-compute', 'error-value': 0}
> 
> Co-developed-by: Himal Prasad Ghimiray <himal.prasad.ghimiray@intel.com>
> Signed-off-by: Himal Prasad Ghimiray <himal.prasad.ghimiray@intel.com>
> Signed-off-by: Riana Tauro <riana.tauro@intel.com>
> ---
> v2: Add ID's and names as uAPI (Rodrigo)
>     Add documentation
>     Modify commit message
> 
> v3: remove 'error' from counters
>     use drmm_kcalloc
>     add a for_each for severity
>     differentiate error classes and severity in UAPI (Raag)
>     Use correctable and uncorrectable in uapi (Pratik / Aravind)
> 
> v4: modify enums in UAPI
>     improve comments
>     add bounds check in handler
>     add error mask macro (Raag)
>     use atomic_t
>     add null pointer checks
> ---
>  drivers/gpu/drm/xe/regs/xe_hw_error_regs.h |  62 ++++++-
>  drivers/gpu/drm/xe/xe_hw_error.c           | 199 +++++++++++++++++++--
>  2 files changed, 241 insertions(+), 20 deletions(-)
> 
> diff --git a/drivers/gpu/drm/xe/regs/xe_hw_error_regs.h b/drivers/gpu/drm/xe/regs/xe_hw_error_regs.h
> index c146b9ef44eb..17982a335941 100644
> --- a/drivers/gpu/drm/xe/regs/xe_hw_error_regs.h
> +++ b/drivers/gpu/drm/xe/regs/xe_hw_error_regs.h
> @@ -6,15 +6,59 @@
>  #ifndef _XE_HW_ERROR_REGS_H_
>  #define _XE_HW_ERROR_REGS_H_
>  
> -#define HEC_UNCORR_ERR_STATUS(base)                    XE_REG((base) + 0x118)
> -#define    UNCORR_FW_REPORTED_ERR                      BIT(6)
> +#define HEC_UNCORR_ERR_STATUS(base)			XE_REG((base) + 0x118)
> +#define   UNCORR_FW_REPORTED_ERR			REG_BIT(6)
>  
> -#define HEC_UNCORR_FW_ERR_DW0(base)                    XE_REG((base) + 0x124)
> +#define HEC_UNCORR_FW_ERR_DW0(base)			XE_REG((base) + 0x124)
> +
> +#define ERR_STAT_GT_COR					0x100160
> +#define   EU_GRF_COR_ERR				REG_BIT(15)
> +#define   EU_IC_COR_ERR					REG_BIT(14)
> +#define   SLM_COR_ERR					REG_BIT(13)
> +#define   GUC_COR_ERR					REG_BIT(1)
> +
> +#define ERR_STAT_GT_NONFATAL				0x100164
> +#define ERR_STAT_GT_FATAL				0x100168
> +#define   EU_GRF_FAT_ERR				REG_BIT(15)
> +#define   SLM_FAT_ERR					REG_BIT(13)
> +#define   GUC_FAT_ERR					REG_BIT(6)
> +#define   FPU_FAT_ERR					REG_BIT(3)
> +
> +#define ERR_STAT_GT_REG(x)				XE_REG(_PICK_EVEN((x), \
> +									  ERR_STAT_GT_COR, \
> +									  ERR_STAT_GT_NONFATAL))
> +
> +#define PVC_COR_ERR_MASK				(GUC_COR_ERR | SLM_COR_ERR | \
> +							 EU_IC_COR_ERR | EU_GRF_COR_ERR)
> +
> +#define PVC_FAT_ERR_MASK				(FPU_FAT_ERR | GUC_FAT_ERR | \
> +							EU_GRF_FAT_ERR | SLM_FAT_ERR)

Nit: Whitespace please!

> +#define DEV_ERR_STAT_NONFATAL				0x100178
> +#define DEV_ERR_STAT_CORRECTABLE			0x10017c
> +#define DEV_ERR_STAT_REG(x)				XE_REG(_PICK_EVEN((x), \
> +									  DEV_ERR_STAT_CORRECTABLE, \
> +									  DEV_ERR_STAT_NONFATAL))

I know it was already like this but how does this evaluate for FATAL?
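
To make the question concrete, here is a standalone sketch of the
arithmetic. It assumes _PICK_EVEN(i, a, b) expands to a + i * (b - a), as
in the shared i915/xe register helpers, and takes HARDWARE_ERROR_FATAL = 2
from the enum in patch 3. The helpers pick_even() and
dev_err_stat_offset() are hypothetical names used only here.

```c
#include <assert.h>
#include <stdint.h>

/* Offsets copied from the quoted hunk. */
#define DEV_ERR_STAT_NONFATAL		0x100178
#define DEV_ERR_STAT_CORRECTABLE	0x10017c

/* Values as defined by enum hardware_error in patch 3. */
enum hardware_error {
	HARDWARE_ERROR_CORRECTABLE	= 0,
	HARDWARE_ERROR_NONFATAL		= 1,
	HARDWARE_ERROR_FATAL		= 2,
	HARDWARE_ERROR_MAX
};

/*
 * Assumed expansion of _PICK_EVEN(): a + index * (b - a).
 * With b < a the stride is negative; unsigned 32-bit arithmetic wraps,
 * so the result is still the expected lower offset.
 */
static inline uint32_t pick_even(uint32_t index, uint32_t a, uint32_t b)
{
	return a + index * (b - a);
}

/* Hypothetical helper mirroring DEV_ERR_STAT_REG(x) without XE_REG(). */
static inline uint32_t dev_err_stat_offset(enum hardware_error hw_err)
{
	return pick_even(hw_err, DEV_ERR_STAT_CORRECTABLE,
			 DEV_ERR_STAT_NONFATAL);
}
```

Under those assumptions the stride is -4, so CORRECTABLE maps to 0x10017c,
NONFATAL to 0x100178 and FATAL to 0x100174. Whether 0x100174 really is the
fatal status register needs to be confirmed against the hardware spec.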

> +#define   XE_CSC_ERROR					17
> +#define   XE_GT_ERROR					0
> +
> +#define ERR_STAT_GT_FATAL_VECTOR_0			0x100260
> +#define ERR_STAT_GT_FATAL_VECTOR_1			0x100264
> +
> +#define ERR_STAT_GT_FATAL_VECTOR_REG(x)			XE_REG(_PICK_EVEN((x), \
> +								  ERR_STAT_GT_FATAL_VECTOR_0, \
> +								  ERR_STAT_GT_FATAL_VECTOR_1))
> +
> +#define ERR_STAT_GT_COR_VECTOR_0			0x1002a0
> +#define ERR_STAT_GT_COR_VECTOR_1			0x1002a4
> +
> +#define ERR_STAT_GT_COR_VECTOR_REG(x)			XE_REG(_PICK_EVEN((x), \
> +									  ERR_STAT_GT_COR_VECTOR_0, \
> +									  ERR_STAT_GT_COR_VECTOR_1))
> +
> +#define ERR_STAT_GT_VECTOR_REG(hw_err, x)		(hw_err == HARDWARE_ERROR_CORRECTABLE ? \
> +							ERR_STAT_GT_COR_VECTOR_REG(x) : \
> +							ERR_STAT_GT_FATAL_VECTOR_REG(x))

Ditto for whitespace.

> -#define DEV_ERR_STAT_NONFATAL			0x100178
> -#define DEV_ERR_STAT_CORRECTABLE		0x10017c
> -#define DEV_ERR_STAT_REG(x)			XE_REG(_PICK_EVEN((x), \
> -								  DEV_ERR_STAT_CORRECTABLE, \
> -								  DEV_ERR_STAT_NONFATAL))
> -#define   XE_CSC_ERROR				BIT(17)
>  #endif
> diff --git a/drivers/gpu/drm/xe/xe_hw_error.c b/drivers/gpu/drm/xe/xe_hw_error.c
> index 2019aaaa1ebe..ff31fb322c8a 100644
> --- a/drivers/gpu/drm/xe/xe_hw_error.c
> +++ b/drivers/gpu/drm/xe/xe_hw_error.c
> @@ -3,6 +3,7 @@
>   * Copyright © 2025 Intel Corporation
>   */
>  
> +#include <linux/bitmap.h>
>  #include <linux/fault-inject.h>
>  
>  #include "regs/xe_gsc_regs.h"
> @@ -15,7 +16,13 @@
>  #include "xe_mmio.h"
>  #include "xe_survivability_mode.h"
>  
> -#define  HEC_UNCORR_FW_ERR_BITS 4
> +#define  GT_HW_ERROR_MAX_ERR_BITS	16
> +#define  HEC_UNCORR_FW_ERR_BITS		4
> +#define  XE_RAS_REG_SIZE		32
> +
> +#define  PVC_ERROR_MASK_SET(hw_err, err_bit) \
> +	((hw_err == HARDWARE_ERROR_CORRECTABLE) ? (BIT(err_bit) & PVC_COR_ERR_MASK) : \
> +	(BIT(err_bit) & PVC_FAT_ERR_MASK))

I'd write this as below and move it to xe_hw_error_regs.h

#define PVC_COR_ERR_MASK_SET(err_bit)			(PVC_COR_ERR_MASK & REG_BIT(err_bit))
#define PVC_FAT_ERR_MASK_SET(err_bit)			(PVC_FAT_ERR_MASK & REG_BIT(err_bit))

#define PVC_ERR_MASK_SET(hw_err, err_bit)		((hw_err == HARDWARE_ERROR_CORRECTABLE) ? \
								PVC_COR_ERR_MASK_SET(err_bit) : \
								PVC_FAT_ERR_MASK_SET(err_bit))
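
A self-contained sketch of the split macros, so the behaviour can be
checked outside the kernel. REG_BIT() here is a plain stand-in for the
kernel macro and the enum is copied from patch 3; both are assumptions of
this sketch. The bit definitions come from the quoted register header.

```c
#include <assert.h>
#include <stdint.h>

/* Stand-in for the kernel's REG_BIT(); an assumption for this sketch. */
#define REG_BIT(n)			(UINT32_C(1) << (n))

/* Bit definitions copied from the quoted xe_hw_error_regs.h hunk. */
#define EU_GRF_COR_ERR			REG_BIT(15)
#define EU_IC_COR_ERR			REG_BIT(14)
#define SLM_COR_ERR			REG_BIT(13)
#define GUC_COR_ERR			REG_BIT(1)

#define EU_GRF_FAT_ERR			REG_BIT(15)
#define SLM_FAT_ERR			REG_BIT(13)
#define GUC_FAT_ERR			REG_BIT(6)
#define FPU_FAT_ERR			REG_BIT(3)

#define PVC_COR_ERR_MASK		(GUC_COR_ERR | SLM_COR_ERR | \
					 EU_IC_COR_ERR | EU_GRF_COR_ERR)
#define PVC_FAT_ERR_MASK		(FPU_FAT_ERR | GUC_FAT_ERR | \
					 EU_GRF_FAT_ERR | SLM_FAT_ERR)

enum hardware_error {
	HARDWARE_ERROR_CORRECTABLE	= 0,
	HARDWARE_ERROR_NONFATAL		= 1,
	HARDWARE_ERROR_FATAL		= 2,
	HARDWARE_ERROR_MAX
};

/* Non-zero when err_bit is one of the bits tracked for that severity. */
#define PVC_COR_ERR_MASK_SET(err_bit)	(PVC_COR_ERR_MASK & REG_BIT(err_bit))
#define PVC_FAT_ERR_MASK_SET(err_bit)	(PVC_FAT_ERR_MASK & REG_BIT(err_bit))

/* Anything that is not CORRECTABLE is checked against the fatal mask. */
#define PVC_ERR_MASK_SET(hw_err, err_bit) \
	(((hw_err) == HARDWARE_ERROR_CORRECTABLE) ? \
	 PVC_COR_ERR_MASK_SET(err_bit) : \
	 PVC_FAT_ERR_MASK_SET(err_bit))

/* Sanity checks on the combined masks: bits 1,13,14,15 and 3,6,13,15. */
static_assert(PVC_COR_ERR_MASK == 0xe002, "unexpected correctable mask");
static_assert(PVC_FAT_ERR_MASK == 0xa048, "unexpected fatal mask");
```

For example, bit 14 (EU_IC) is reported only for correctable errors and
bit 3 (FPU) only for fatal ones.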

...

> +static void gt_hw_error_handler(struct xe_tile *tile, const enum hardware_error hw_err,
> +				u32 error_id)
> +{
> +	const enum drm_xe_ras_error_severity severity = hw_err_to_severity(hw_err);
> +	struct xe_device *xe = tile_to_xe(tile);
> +	struct xe_drm_ras *ras = &xe->ras;
> +	struct xe_drm_ras_counter *info = ras->info[severity];
> +	struct xe_mmio *mmio = &tile->mmio;
> +	unsigned long err_stat = 0;
> +	int i, len;
> +
> +	if (xe->info.platform != XE_PVC)
> +		return;
> +
> +	if (!info)
> +		return;

Since info allocation is not related to hardware, we shouldn't even be
at this point without it. So let's not hide bugs and fail probe instead.

> +	if (hw_err == HARDWARE_ERROR_NONFATAL) {
> +		atomic_inc(&info[error_id].counter);
> +		log_hw_error(tile, info[error_id].name, severity);
> +		return;
> +	}

...

>  static void hw_error_source_handler(struct xe_tile *tile, const enum hardware_error hw_err)
>  {
>  	const enum drm_xe_ras_error_severity severity = hw_err_to_severity(hw_err);
>  	const char *severity_str = error_severity[severity];
>  	struct xe_device *xe = tile_to_xe(tile);
> -	unsigned long flags;
> -	u32 err_src;
> +	struct xe_drm_ras *ras = &xe->ras;
> +	struct xe_drm_ras_counter *info = ras->info[severity];
> +	unsigned long flags, err_src;
> +	u32 err_bit;
>  
> -	if (xe->info.platform != XE_BATTLEMAGE)
> +	if (!IS_DGFX(xe))
>  		return;
>  
>  	spin_lock_irqsave(&xe->irq.lock, flags);
> @@ -108,11 +242,53 @@ static void hw_error_source_handler(struct xe_tile *tile, const enum hardware_er
>  		goto unlock;
>  	}
>  
> -	if (err_src & XE_CSC_ERROR)
> +	/*
> +	 * On encountering CSC firmware errors, the graphics device becomes unrecoverable
> +	 * so return immediately on error. The only way to recover from these errors is
> +	 * firmware flash. The device will enter Runtime Survivability mode when such
> +	 * errors are detected.
> +	 */
> +	if (err_src & XE_CSC_ERROR) {
>  		csc_hw_error_handler(tile, hw_err);
> +		goto clear_reg;
> +	}
>  
> -	xe_mmio_write32(&tile->mmio, DEV_ERR_STAT_REG(hw_err), err_src);
> +	if (!info)
> +		goto clear_reg;

Same as above.

Raag

* Re: [PATCH v5 5/5] drm/xe/xe_hw_error: Add support for PVC SoC errors
  2026-02-02  6:44 ` [PATCH v5 5/5] drm/xe/xe_hw_error: Add support for PVC SoC errors Riana Tauro
@ 2026-02-05 18:10   ` Raag Jadav
  2026-02-10  6:32     ` Riana Tauro
  0 siblings, 1 reply; 24+ messages in thread
From: Raag Jadav @ 2026-02-05 18:10 UTC (permalink / raw)
  To: Riana Tauro
  Cc: intel-xe, dri-devel, aravind.iddamsetty, anshuman.gupta,
	rodrigo.vivi, joonas.lahtinen, simona.vetter, airlied,
	pratik.bari, joshua.santosh.ranjan, ashwin.kumar.kulkarni,
	shubham.kumar, ravi.kishore.koppuravuri, Himal Prasad Ghimiray

On Mon, Feb 02, 2026 at 12:14:01PM +0530, Riana Tauro wrote:
> Report the SoC nonfatal/fatal hardware error and update the counters.
> 
> $ sudo ynl --family drm_ras --do query-error-counter  --json '{"node-id":0, "error-id":2}'

Same comment as last patch.

> {'error-id': 2, 'error-name': 'soc-internal', 'error-value': 0}
> 
> Co-developed-by: Himal Prasad Ghimiray <himal.prasad.ghimiray@intel.com>
> Signed-off-by: Himal Prasad Ghimiray <himal.prasad.ghimiray@intel.com>
> Signed-off-by: Riana Tauro <riana.tauro@intel.com>
> ---
> v2: Add ID's and names as uAPI (Rodrigo)
> 
> v3: reorder and align arrays
>     remove redundant string err
>     use REG_BIT
>     fix aesthetic review comments (Raag)
>     use only correctable/uncorrectable error severity (Aravind)
> 
> v4: fix comments
>     use master as variable name
>     add static_assert (Raag)
> ---
>  drivers/gpu/drm/xe/regs/xe_hw_error_regs.h |  24 +++
>  drivers/gpu/drm/xe/xe_hw_error.c           | 221 ++++++++++++++++++++-
>  2 files changed, 244 insertions(+), 1 deletion(-)
> 
> diff --git a/drivers/gpu/drm/xe/regs/xe_hw_error_regs.h b/drivers/gpu/drm/xe/regs/xe_hw_error_regs.h
> index 17982a335941..a89a07d067fc 100644
> --- a/drivers/gpu/drm/xe/regs/xe_hw_error_regs.h
> +++ b/drivers/gpu/drm/xe/regs/xe_hw_error_regs.h
> @@ -41,6 +41,7 @@
>  									  DEV_ERR_STAT_NONFATAL))
>  
>  #define   XE_CSC_ERROR					17

I overlooked this in the last patch but I think this should be used as

	if (err_src & REG_BIT(XE_CSC_ERROR))
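
A small standalone sketch of why this matters once XE_CSC_ERROR is a bit
index rather than a mask. REG_BIT() is again a plain stand-in for the
kernel macro, and the two helper functions are hypothetical names used
only to contrast the two checks.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Stand-in for the kernel's REG_BIT(); an assumption for this sketch. */
#define REG_BIT(n)	(UINT32_C(1) << (n))

/* Bit indices as defined after patch 4. */
#define XE_CSC_ERROR	17
#define XE_GT_ERROR	0

/*
 * Uses the index directly as a mask. 17 is 0b10001, so this actually
 * tests bits 0 and 4 of err_src, not bit 17.
 */
static inline bool csc_error_index_as_mask(uint32_t err_src)
{
	return err_src & XE_CSC_ERROR;
}

/* Converts the index to a mask first, so only bit 17 is tested. */
static inline bool csc_error_bit(uint32_t err_src)
{
	return err_src & REG_BIT(XE_CSC_ERROR);
}
```

With the index used as a mask, a real CSC error (only bit 17 set) is
missed, while a GT error (bit 0 set) is misreported as a CSC error.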

> +#define   XE_SOC_ERROR					16
>  #define   XE_GT_ERROR					0
>  
>  #define ERR_STAT_GT_FATAL_VECTOR_0			0x100260
> @@ -61,4 +62,27 @@
>  							ERR_STAT_GT_COR_VECTOR_REG(x) : \
>  							ERR_STAT_GT_FATAL_VECTOR_REG(x))
>  
> +#define SOC_PVC_MASTER_BASE				0x282000
> +#define SOC_PVC_SLAVE_BASE				0x283000
> +
> +#define SOC_GCOERRSTS					0x200
> +#define SOC_GNFERRSTS					0x210
> +#define SOC_GLOBAL_ERR_STAT_REG(base, x)		XE_REG(_PICK_EVEN((x), \
> +									  (base) + SOC_GCOERRSTS, \
> +									  (base) + SOC_GNFERRSTS))
> +#define   SOC_SLAVE_IEH					REG_BIT(1)
> +#define   SOC_IEH0_LOCAL_ERR_STATUS			REG_BIT(0)
> +#define   SOC_IEH1_LOCAL_ERR_STATUS			REG_BIT(0)
> +
> +#define SOC_GSYSEVTCTL					0x264
> +#define SOC_GSYSEVTCTL_REG(master, slave, x)		XE_REG(_PICK_EVEN((x), \
> +									  (master) + SOC_GSYSEVTCTL, \
> +									  (slave) + SOC_GSYSEVTCTL))
> +
> +#define SOC_LERRUNCSTS					0x280
> +#define SOC_LERRCORSTS					0x294
> +#define SOC_LOCAL_ERR_STAT_REG(base, hw_err)		XE_REG(hw_err == HARDWARE_ERROR_CORRECTABLE ? \
> +							       (base) + SOC_LERRCORSTS : \
> +							       (base) + SOC_LERRUNCSTS)
> +
>  #endif
> diff --git a/drivers/gpu/drm/xe/xe_hw_error.c b/drivers/gpu/drm/xe/xe_hw_error.c
> index ff31fb322c8a..159ec796386a 100644
> --- a/drivers/gpu/drm/xe/xe_hw_error.c
> +++ b/drivers/gpu/drm/xe/xe_hw_error.c
> @@ -19,6 +19,7 @@
>  #define  GT_HW_ERROR_MAX_ERR_BITS	16
>  #define  HEC_UNCORR_FW_ERR_BITS		4
>  #define  XE_RAS_REG_SIZE		32
> +#define  XE_SOC_NUM_IEH			2
>  
>  #define  PVC_ERROR_MASK_SET(hw_err, err_bit) \
>  	((hw_err == HARDWARE_ERROR_CORRECTABLE) ? (BIT(err_bit) & PVC_COR_ERR_MASK) : \
> @@ -36,7 +37,8 @@ static const char * const hec_uncorrected_fw_errors[] = {
>  };
>  
>  static const unsigned long xe_hw_error_map[] = {
> -	[XE_GT_ERROR] = DRM_XE_RAS_ERR_COMP_CORE_COMPUTE,
> +	[XE_GT_ERROR]	= DRM_XE_RAS_ERR_COMP_CORE_COMPUTE,

Unneeded churn, please align in the original patch.

> +	[XE_SOC_ERROR]	= DRM_XE_RAS_ERR_COMP_SOC_INTERNAL,
>  };
>  
>  enum gt_vector_regs {
> @@ -60,6 +62,102 @@ static enum drm_xe_ras_error_severity hw_err_to_severity(enum hardware_error hw_
>  	return DRM_XE_RAS_ERR_SEV_UNCORRECTABLE;
>  }
>  
> +static const char * const pvc_master_global_err_reg[] = {
> +	[0 ... 1]	= "Undefined",
> +	[2]		= "HBM SS0: Channel0",
> +	[3]		= "HBM SS0: Channel1",
> +	[4]		= "HBM SS0: Channel2",
> +	[5]		= "HBM SS0: Channel3",
> +	[6]		= "HBM SS0: Channel4",
> +	[7]		= "HBM SS0: Channel5",
> +	[8]		= "HBM SS0: Channel6",
> +	[9]		= "HBM SS0: Channel7",
> +	[10]		= "HBM SS1: Channel0",
> +	[11]		= "HBM SS1: Channel1",
> +	[12]		= "HBM SS1: Channel2",
> +	[13]		= "HBM SS1: Channel3",
> +	[14]		= "HBM SS1: Channel4",
> +	[15]		= "HBM SS1: Channel5",
> +	[16]		= "HBM SS1: Channel6",
> +	[17]		= "HBM SS1: Channel7",
> +	[18 ... 31]	= "Undefined",
> +};
> +

Redundant blank line.

> +static_assert(ARRAY_SIZE(pvc_master_global_err_reg) == XE_RAS_REG_SIZE);
> +
> +static const char * const pvc_slave_global_err_reg[] = {
> +	[0]		= "Undefined",
> +	[1]		= "HBM SS2: Channel0",
> +	[2]		= "HBM SS2: Channel1",
> +	[3]		= "HBM SS2: Channel2",
> +	[4]		= "HBM SS2: Channel3",
> +	[5]		= "HBM SS2: Channel4",
> +	[6]		= "HBM SS2: Channel5",
> +	[7]		= "HBM SS2: Channel6",
> +	[8]		= "HBM SS2: Channel7",
> +	[9]		= "HBM SS3: Channel0",
> +	[10]		= "HBM SS3: Channel1",
> +	[11]		= "HBM SS3: Channel2",
> +	[12]		= "HBM SS3: Channel3",
> +	[13]		= "HBM SS3: Channel4",
> +	[14]		= "HBM SS3: Channel5",
> +	[15]		= "HBM SS3: Channel6",
> +	[16]		= "HBM SS3: Channel7",
> +	[17]		= "Undefined",
> +	[18]		= "ANR MDFI",
> +	[19 ... 31]	= "Undefined",
> +};
> +

Ditto.

> +static_assert(ARRAY_SIZE(pvc_slave_global_err_reg) == XE_RAS_REG_SIZE);
> +
> +static const char * const pvc_slave_local_fatal_err_reg[] = {
> +	[0]		= "Local IEH: Malformed PCIe AER",
> +	[1]		= "Local IEH: Malformed PCIe ERR",
> +	[2]		= "Local IEH: UR conditions in IEH",
> +	[3]		= "Local IEH: From SERR Sources",
> +	[4 ... 19]	= "Undefined",
> +	[20]		= "Malformed MCA error packet (HBM/Punit)",
> +	[21 ... 31]	= "Undefined",
> +};
> +

Ditto.

> +static_assert(ARRAY_SIZE(pvc_slave_local_fatal_err_reg) == XE_RAS_REG_SIZE);
> +
> +static const char * const pvc_master_local_fatal_err_reg[] = {
> +	[0]		= "Local IEH: Malformed IOSF PCIe AER",
> +	[1]		= "Local IEH: Malformed IOSF PCIe ERR",
> +	[2]		= "Local IEH: UR RESPONSE",
> +	[3]		= "Local IEH: From SERR SPI controller",
> +	[4]		= "Base Die MDFI T2T",
> +	[5]		= "Undefined",
> +	[6]		= "Base Die MDFI T2C",
> +	[7]		= "Undefined",
> +	[8]		= "Invalid CSC PSF Command Parity",
> +	[9]		= "Invalid CSC PSF Unexpected Completion",
> +	[10]		= "Invalid CSC PSF Unsupported Request",
> +	[11]		= "Invalid PCIe PSF Command Parity",
> +	[12]		= "PCIe PSF Unexpected Completion",
> +	[13]		= "PCIe PSF Unsupported Request",
> +	[14 ... 19]	= "Undefined",
> +	[20]		= "Malformed MCA error packet (HBM/Punit)",
> +	[21 ... 31]	= "Undefined",
> +};
> +

Ditto.

> +static_assert(ARRAY_SIZE(pvc_master_local_fatal_err_reg) == XE_RAS_REG_SIZE);
> +
> +static const char * const pvc_master_local_nonfatal_err_reg[] = {
> +	[0 ... 3]	= "Undefined",
> +	[4]		= "Base Die MDFI T2T",
> +	[5]		= "Undefined",
> +	[6]		= "Base Die MDFI T2C",
> +	[7]		= "Undefined",
> +	[8]		= "Invalid CSC PSF Command Parity",
> +	[9]		= "Invalid CSC PSF Unexpected Completion",
> +	[10]		= "Invalid PCIe PSF Command Parity",
> +	[11 ... 31]	= "Undefined",
> +};
> +

Ditto.

> +static_assert(ARRAY_SIZE(pvc_master_local_nonfatal_err_reg) == XE_RAS_REG_SIZE);
> +
>  static bool fault_inject_csc_hw_error(void)
>  {
>  	return IS_ENABLED(CONFIG_DEBUG_FS) && should_fail(&inject_csc_hw_error, 1);
> @@ -138,6 +236,26 @@ static void log_gt_err(struct xe_tile *tile, const char *name, int i, u32 err,
>  				    name, severity_str, i, err);
>  }
>  
> +static void log_soc_error(struct xe_tile *tile, const char * const *reg_info,
> +			  const enum drm_xe_ras_error_severity severity, u32 err_bit, u32 index)
> +{
> +	const char *severity_str = error_severity[severity];
> +	struct xe_device *xe = tile_to_xe(tile);
> +	struct xe_drm_ras *ras = &xe->ras;
> +	struct xe_drm_ras_counter *info = ras->info[severity];
> +	const char *name;
> +
> +	name = reg_info[err_bit];
> +
> +	if (strcmp(name, "Undefined")) {
> +		if (severity == DRM_XE_RAS_ERR_SEV_CORRECTABLE)
> +			drm_warn(&xe->drm, "%s SOC %s detected", name, severity_str);
> +		else
> +			drm_err_ratelimited(&xe->drm, "%s SOC %s detected", name, severity_str);
> +		atomic_inc(&info[index].counter);
> +	}
> +}
> +
>  static void gt_hw_error_handler(struct xe_tile *tile, const enum hardware_error hw_err,
>  				u32 error_id)
>  {
> @@ -221,6 +339,104 @@ static void gt_hw_error_handler(struct xe_tile *tile, const enum hardware_error
>  	}
>  }
>  
> +static void soc_slave_ieh_handler(struct xe_tile *tile, const enum hardware_error hw_err, u32 error_id)
> +{
> +	const enum drm_xe_ras_error_severity severity = hw_err_to_severity(hw_err);
> +	unsigned long slave_global_errstat, slave_local_errstat;
> +	struct xe_mmio *mmio = &tile->mmio;
> +	u32 regbit, slave_base;
> +
> +	slave_base = SOC_PVC_SLAVE_BASE;

Just name it 'slave' and it'll probably help remove the line wrapping below.

> +	slave_global_errstat = xe_mmio_read32(mmio, SOC_GLOBAL_ERR_STAT_REG(slave_base, hw_err));
> +
> +	if (slave_global_errstat & SOC_IEH1_LOCAL_ERR_STATUS) {
> +		slave_local_errstat = xe_mmio_read32(mmio, SOC_LOCAL_ERR_STAT_REG(slave_base, hw_err));
> +
> +		if (hw_err == HARDWARE_ERROR_FATAL) {
> +			for_each_set_bit(regbit, &slave_local_errstat, XE_RAS_REG_SIZE)
> +				log_soc_error(tile, pvc_slave_local_fatal_err_reg, severity,
> +					      regbit, error_id);
> +		}
> +
> +		xe_mmio_write32(mmio, SOC_LOCAL_ERR_STAT_REG(slave_base, hw_err),
> +				slave_local_errstat);
> +	}
> +
> +	for_each_set_bit(regbit, &slave_global_errstat, XE_RAS_REG_SIZE)
> +		log_soc_error(tile, pvc_slave_global_err_reg, severity, regbit, error_id);
> +
> +	xe_mmio_write32(mmio, SOC_GLOBAL_ERR_STAT_REG(slave_base, hw_err), slave_global_errstat);
> +}
> +
> +static void soc_hw_error_handler(struct xe_tile *tile, const enum hardware_error hw_err,
> +				 u32 error_id)
> +{
> +	const enum drm_xe_ras_error_severity severity = hw_err_to_severity(hw_err);
> +	struct xe_device *xe = tile_to_xe(tile);
> +	struct xe_mmio *mmio = &tile->mmio;
> +	unsigned long master_global_errstat, master_local_errstat;
> +	u32 master_base, slave_base, regbit;
> +	int i;
> +
> +	if (xe->info.platform != XE_PVC)
> +		return;
> +
> +	master_base = SOC_PVC_MASTER_BASE;
> +	slave_base = SOC_PVC_SLAVE_BASE;

Ditto. Just 'master' and 'slave' will help remove the line wrapping below.

> +	/* Mask error type in GSYSEVTCTL so that no new errors of the type will be reported */
> +	for (i = 0; i < XE_SOC_NUM_IEH; i++)
> +		xe_mmio_write32(mmio, SOC_GSYSEVTCTL_REG(master_base, slave_base, i),
> +				~REG_BIT(hw_err));
> +
> +	if (hw_err == HARDWARE_ERROR_CORRECTABLE) {
> +		xe_mmio_write32(mmio, SOC_GLOBAL_ERR_STAT_REG(master_base, hw_err),
> +				REG_GENMASK(31, 0));
> +		xe_mmio_write32(mmio, SOC_LOCAL_ERR_STAT_REG(master_base, hw_err),
> +				REG_GENMASK(31, 0));
> +		xe_mmio_write32(mmio, SOC_GLOBAL_ERR_STAT_REG(slave_base, hw_err),
> +				REG_GENMASK(31, 0));
> +		xe_mmio_write32(mmio, SOC_LOCAL_ERR_STAT_REG(slave_base, hw_err),
> +				REG_GENMASK(31, 0));
> +		goto unmask_gsysevtctl;
> +	}
> +
> +	/*
> +	 * Read the master global IEH error register if BIT(1) is set then process

Missing comma after 'register'.

> +	 * the slave IEH first. If BIT(0) in global error register is set then process
> +	 * the corresponding local error registers.
> +	 */
> +	master_global_errstat = xe_mmio_read32(mmio, SOC_GLOBAL_ERR_STAT_REG(master_base, hw_err));
> +	if (master_global_errstat & SOC_SLAVE_IEH)
> +		soc_slave_ieh_handler(tile, hw_err, error_id);
> +
> +	if (master_global_errstat & SOC_IEH0_LOCAL_ERR_STATUS) {
> +		master_local_errstat = xe_mmio_read32(mmio, SOC_LOCAL_ERR_STAT_REG(master_base, hw_err));
> +
> +		for_each_set_bit(regbit, &master_local_errstat, XE_RAS_REG_SIZE) {
> +			const char * const *reg_info = (hw_err == HARDWARE_ERROR_FATAL) ?

This looks like it can be outside the loop.

Raag

> +						       pvc_master_local_fatal_err_reg :
> +						       pvc_master_local_nonfatal_err_reg;
> +
> +			log_soc_error(tile, reg_info, severity, regbit, error_id);
> +		}
> +
> +		xe_mmio_write32(mmio, SOC_LOCAL_ERR_STAT_REG(master_base, hw_err),
> +				master_local_errstat);
> +	}
> +
> +	for_each_set_bit(regbit, &master_global_errstat, XE_RAS_REG_SIZE)
> +		log_soc_error(tile, pvc_master_global_err_reg, severity, regbit, error_id);
> +
> +	xe_mmio_write32(mmio, SOC_GLOBAL_ERR_STAT_REG(master_base, hw_err),
> +			master_global_errstat);
> +
> +unmask_gsysevtctl:
> +	for (i = 0; i < XE_SOC_NUM_IEH; i++)
> +		xe_mmio_write32(mmio, SOC_GSYSEVTCTL_REG(master_base, slave_base, i),
> +				(HARDWARE_ERROR_MAX << 1) + 1);
> +}
> +
>  static void hw_error_source_handler(struct xe_tile *tile, const enum hardware_error hw_err)
>  {
>  	const enum drm_xe_ras_error_severity severity = hw_err_to_severity(hw_err);
> @@ -283,8 +499,11 @@ static void hw_error_source_handler(struct xe_tile *tile, const enum hardware_er
>  					    "TILE%d reported %s %s, bit[%d] is set\n",
>  					    tile->id, name, severity_str, err_bit);
>  		}
> +
>  		if (err_bit == XE_GT_ERROR)
>  			gt_hw_error_handler(tile, hw_err, error_id);
> +		if (err_bit == XE_SOC_ERROR)
> +			soc_hw_error_handler(tile, hw_err, error_id);
>  	}
>  
>  clear_reg:
> -- 
> 2.47.1
> 
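
To illustrate the "outside the loop" comment on soc_hw_error_handler() above, here is a minimal userspace sketch, not the driver code. The tables are trimmed stand-ins, unnamed bits are NULL instead of "Undefined", and log_local_errors() is a hypothetical helper name. The point is only that the table choice depends on hw_err alone, so it can be made once before walking the set bits.

```c
#include <assert.h>
#include <stdint.h>
#include <stdio.h>

#define XE_RAS_REG_SIZE 32

/* Values as in enum hardware_error from patch 3/5. */
enum hardware_error {
	HARDWARE_ERROR_CORRECTABLE = 0,
	HARDWARE_ERROR_NONFATAL = 1,
	HARDWARE_ERROR_FATAL = 2,
};

/* Trimmed stand-ins for the PVC tables; unnamed bits stay NULL (assumption). */
static const char * const fatal_names[XE_RAS_REG_SIZE] = {
	[4]	= "Base Die MDFI T2T",
	[20]	= "Malformed MCA error packet (HBM/Punit)",
};

static const char * const nonfatal_names[XE_RAS_REG_SIZE] = {
	[4]	= "Base Die MDFI T2T",
};

/* Hypothetical helper: logs every named set bit and returns how many it logged. */
static unsigned int log_local_errors(enum hardware_error hw_err, uint32_t errstat)
{
	/* Loop-invariant: depends only on hw_err, so select it once. */
	const char * const *reg_info = (hw_err == HARDWARE_ERROR_FATAL) ?
				       fatal_names : nonfatal_names;
	unsigned int bit, logged = 0;

	for (bit = 0; bit < XE_RAS_REG_SIZE; bit++) {
		if (!(errstat & (UINT32_C(1) << bit)) || !reg_info[bit])
			continue;

		printf("%s detected\n", reg_info[bit]);
		logged++;
	}

	return logged;
}
```

With bits 4 and 20 set, the fatal table names both bits while the nonfatal table names only bit 4.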


* Re: [PATCH v5 2/5] drm/xe/xe_drm_ras: Add support for XE DRM RAS
  2026-02-03 17:58   ` Raag Jadav
@ 2026-02-10  4:20     ` Riana Tauro
  0 siblings, 0 replies; 24+ messages in thread
From: Riana Tauro @ 2026-02-10  4:20 UTC (permalink / raw)
  To: Raag Jadav
  Cc: intel-xe, dri-devel, aravind.iddamsetty, anshuman.gupta,
	rodrigo.vivi, joonas.lahtinen, simona.vetter, airlied,
	pratik.bari, joshua.santosh.ranjan, ashwin.kumar.kulkarni,
	shubham.kumar, ravi.kishore.koppuravuri



On 2/3/2026 11:28 PM, Raag Jadav wrote:
> On Mon, Feb 02, 2026 at 12:13:58PM +0530, Riana Tauro wrote:
>> Allocate correctable, uncorrectable nodes for every xe device
> 
> Punctuations.
> 
>> Each node contains error component, counters and respective
>> query counter functions.
> 
> Try to utilize the full 75 characters space where possible.
> 
>> Add basic functionality to create and register drm nodes.
>> Below operations can be performed using Generic netlink DRM RAS interface
> 
> Punctuations.

Will fix the above.

> 
> ...
> 
>> +++ b/drivers/gpu/drm/xe/xe_drm_ras.c
>> @@ -0,0 +1,184 @@
>> +// SPDX-License-Identifier: MIT
>> +/*
>> + * Copyright © 2026 Intel Corporation
>> + */
>> +
>> +#include <drm/drm_managed.h>
>> +#include <drm/drm_print.h>
>> +#include <drm/drm_ras.h>
>> +#include <linux/bitmap.h>
> 
> Linux includes usually go first.
> 
>> +#include "xe_device_types.h"
>> +#include "xe_drm_ras.h"
>> +
>> +static const char * const errors[] = DRM_XE_RAS_ERROR_COMPONENT_NAMES;
> 
> 'error_component'?

Will rename.

> 
>> +static const char * const error_severity[] = DRM_XE_RAS_ERROR_SEVERITY_NAMES;
> 
> ...
> 
>> +static struct xe_drm_ras_counter *allocate_and_copy_counters(struct xe_device *xe)
>> +{
>> +	struct xe_drm_ras_counter *counter;
>> +	int i;
>> +
>> +	counter = drmm_kcalloc(&xe->drm, DRM_XE_RAS_ERR_COMP_MAX,
>> +			       sizeof(*counter), GFP_KERNEL);
> 
> Can be one line.

Yeah, will make it a single line.

> 
>> +	if (!counter)
>> +		return ERR_PTR(-ENOMEM);
>> +
>> +	for (i = DRM_XE_RAS_ERR_COMP_CORE_COMPUTE; i < DRM_XE_RAS_ERR_COMP_MAX; i++) {
>> +		if (!errors[i])
>> +			continue;
>> +
>> +		counter[i].name = errors[i];
>> +		atomic_set(&counter[i].counter, 0);
> 
> Do you need this?

It makes it clear to anyone reading the code that the counter is
initialized to 0.

> 
>> +	}
>> +
>> +	return counter;
>> +}
> 
> ...
> 
>> +int xe_drm_ras_allocate_nodes(struct xe_device *xe)
>> +{
>> +	struct xe_drm_ras *ras = &xe->ras;
>> +	struct drm_ras_node *node;
>> +	int err;
>> +
>> +	node = drmm_kcalloc(&xe->drm, DRM_XE_RAS_ERR_SEV_MAX, sizeof(*node),
>> +			    GFP_KERNEL);
> 
> Can be one line.
> 
>> +	if (!node)
>> +		return -ENOMEM;
>> +
>> +	ras->node = node;
>> +
>> +	err = register_nodes(xe);
>> +	if (err) {
>> +		drm_err(&xe->drm, "Failed to register DRM RAS node\n");
>> +		return err;
>> +	}
>> +
>> +	err = devm_add_action_or_reset(xe->drm.dev, xe_drm_ras_unregister_nodes, xe);
>> +	if (err) {
>> +		drm_err(&xe->drm, "Failed to add action for Xe DRM RAS\n");
>> +		return err;
>> +	}
>> +
>> +	return 0;
>> +}
> 
> ...
> 
>> +++ b/drivers/gpu/drm/xe/xe_drm_ras_types.h
>> @@ -0,0 +1,40 @@
>> +/* SPDX-License-Identifier: MIT */
>> +/*
>> + * Copyright © 2026 Intel Corporation
>> + */
>> +
>> +#ifndef _XE_DRM_RAS_TYPES_H_
>> +#define _XE_DRM_RAS_TYPES_H_
>> +
>> +#include <drm/xe_drm.h>
>> +#include <linux/atomic.h>
> 
> Ditto for linux includes.

I had thought these needed to be alphabetical. I got a similar
comment on a different patch.
Will fix throughout the series.


> 
>> +struct drm_ras_node;
> 

Thank you for the review

Thanks
Riana

> Reviewed-by: Raag Jadav <raag.jadav@intel.com>



* Re: [PATCH v5 3/5] drm/xe/xe_hw_error: Integrate DRM RAS with hardware error handling
  2026-02-05  8:30   ` Raag Jadav
@ 2026-02-10  4:58     ` Riana Tauro
  2026-02-10  4:59       ` Riana Tauro
  0 siblings, 1 reply; 24+ messages in thread
From: Riana Tauro @ 2026-02-10  4:58 UTC (permalink / raw)
  To: Raag Jadav
  Cc: intel-xe, dri-devel, aravind.iddamsetty, anshuman.gupta,
	rodrigo.vivi, joonas.lahtinen, simona.vetter, airlied,
	pratik.bari, joshua.santosh.ranjan, ashwin.kumar.kulkarni,
	shubham.kumar, ravi.kishore.koppuravuri



On 2/5/2026 2:00 PM, Raag Jadav wrote:
> On Mon, Feb 02, 2026 at 12:13:59PM +0530, Riana Tauro wrote:
>> Initialize DRM RAS in hw error init. Map the UAPI error severities
>> with the hardware error severities and refactor file.
>>
>> Signed-off-by: Riana Tauro <riana.tauro@intel.com>
>> ---
>>   drivers/gpu/drm/xe/xe_drm_ras_types.h |  8 ++++
>>   drivers/gpu/drm/xe/xe_hw_error.c      | 68 ++++++++++++++++-----------
>>   2 files changed, 48 insertions(+), 28 deletions(-)
>>
>> diff --git a/drivers/gpu/drm/xe/xe_drm_ras_types.h b/drivers/gpu/drm/xe/xe_drm_ras_types.h
>> index 0ac4ae324f37..beed48811d6a 100644
>> --- a/drivers/gpu/drm/xe/xe_drm_ras_types.h
>> +++ b/drivers/gpu/drm/xe/xe_drm_ras_types.h
>> @@ -11,6 +11,14 @@
>>   
>>   struct drm_ras_node;
>>   
>> +/* Error categories reported by hardware */
>> +enum hardware_error {
>> +	HARDWARE_ERROR_CORRECTABLE = 0,
>> +	HARDWARE_ERROR_NONFATAL = 1,
>> +	HARDWARE_ERROR_FATAL = 2,
> 
> I'd align "= x" using tabs for readability.

Will remove the explicit values except for the first one.

> 
>> +	HARDWARE_ERROR_MAX,
> 
> Guaranteed last member, so redundant comma.

Will fix it

> 
>> +};
> 
> Also, just curious. Are these expected to be reused anywhere?
> If not, they're probably better off in the .c file.
> 
> ...
> 
>> @@ -86,8 +78,8 @@ static void csc_hw_error_handler(struct xe_tile *tile, const enum hardware_error
>>   		fw_err = xe_mmio_read32(mmio, HEC_UNCORR_FW_ERR_DW0(base));
>>   		for_each_set_bit(err_bit, &fw_err, HEC_UNCORR_FW_ERR_BITS) {
>>   			drm_err_ratelimited(&xe->drm, HW_ERR
>> -					    "%s: HEC Uncorrected FW %s error reported, bit[%d] is set\n",
>> -					     hw_err_str, hec_uncorrected_fw_errors[err_bit],
>> +					    "HEC FW %s error reported, bit[%d] is set\n",
>> +					     hec_uncorrected_fw_errors[err_bit],
> 
> So we're dropping severity_str here? Did I miss something?

I removed it because "Uncorrected" was already mentioned in the log
message, but then removed that as well by mistake.

Will fix this. Thanks for catching it.

> 
>>   					     err_bit);
> 
> ...
> 
>> +static int hw_error_info_init(struct xe_device *xe)
>> +{
>> +	int ret;
>> +
>> +	if (xe->info.platform != XE_PVC)
>> +		return 0;
>> +
>> +	ret = xe_drm_ras_allocate_nodes(xe);
> 
> Why not just
> 
> 	return xe_drm_ras_allocate_nodes();
> 
> Tidy? ;)

okay

> 
>> +	if (ret)
>> +		return ret;
>> +
>> +	return 0;
>> +}
>> +
>>   /*
>>    * Process hardware errors during boot
>>    */
>> @@ -172,11 +179,16 @@ static void process_hw_errors(struct xe_device *xe)
>>   void xe_hw_error_init(struct xe_device *xe)
>>   {
>>   	struct xe_tile *tile = xe_device_get_root_tile(xe);
>> +	int ret;
>>   
>>   	if (!IS_DGFX(xe) || IS_SRIOV_VF(xe))
>>   		return;
>>   
>>   	INIT_WORK(&tile->csc_hw_error_work, csc_hw_error_work);
>>   
>> +	ret = hw_error_info_init(xe);
>> +	if (ret)
>> +		drm_warn(&xe->drm, "Failed to allocate DRM RAS nodes\n");
> 
> This is less likely due to any hardware limitation, so I think
> drm_err() would be more appropriate here.

Okay, will fix it.

Thanks
Riana

> 
> Raag
> 
>> +
>>   	process_hw_errors(xe);
>>   }
>> -- 
>> 2.47.1
>>



* Re: [PATCH v5 3/5] drm/xe/xe_hw_error: Integrate DRM RAS with hardware error handling
  2026-02-10  4:58     ` Riana Tauro
@ 2026-02-10  4:59       ` Riana Tauro
  0 siblings, 0 replies; 24+ messages in thread
From: Riana Tauro @ 2026-02-10  4:59 UTC (permalink / raw)
  To: Raag Jadav
  Cc: intel-xe, dri-devel, aravind.iddamsetty, anshuman.gupta,
	rodrigo.vivi, joonas.lahtinen, simona.vetter, airlied,
	pratik.bari, joshua.santosh.ranjan, ashwin.kumar.kulkarni,
	shubham.kumar, ravi.kishore.koppuravuri



On 2/10/2026 10:28 AM, Riana Tauro wrote:
> 
> 
> On 2/5/2026 2:00 PM, Raag Jadav wrote:
>> On Mon, Feb 02, 2026 at 12:13:59PM +0530, Riana Tauro wrote:
>>> Initialize DRM RAS in hw error init. Map the UAPI error severities
>>> with the hardware error severities and refactor file.
>>>
>>> Signed-off-by: Riana Tauro <riana.tauro@intel.com>
>>> ---
>>>   drivers/gpu/drm/xe/xe_drm_ras_types.h |  8 ++++
>>>   drivers/gpu/drm/xe/xe_hw_error.c      | 68 ++++++++++++++++-----------
>>>   2 files changed, 48 insertions(+), 28 deletions(-)
>>>
>>> diff --git a/drivers/gpu/drm/xe/xe_drm_ras_types.h b/drivers/gpu/drm/ 
>>> xe/xe_drm_ras_types.h
>>> index 0ac4ae324f37..beed48811d6a 100644
>>> --- a/drivers/gpu/drm/xe/xe_drm_ras_types.h
>>> +++ b/drivers/gpu/drm/xe/xe_drm_ras_types.h
>>> @@ -11,6 +11,14 @@
>>>   struct drm_ras_node;
>>> +/* Error categories reported by hardware */
>>> +enum hardware_error {
>>> +    HARDWARE_ERROR_CORRECTABLE = 0,
>>> +    HARDWARE_ERROR_NONFATAL = 1,
>>> +    HARDWARE_ERROR_FATAL = 2,
>>
>> I'd align "= x" using tabs for readability.
> 
> Will remove the values except the start
> 
>>
>>> +    HARDWARE_ERROR_MAX,
>>
>> Guaranteed last member, so redundant comma.
> 
> Will fix it
> 
>>
>>> +};
>>
>> Also, just curious. Are these expected to be reused anywhere?
>> If not, they're probably better off in the .c file.

These are used in the register header files as well as the .c file, so I added them here.

>>
>> ...
>>
>>> @@ -86,8 +78,8 @@ static void csc_hw_error_handler(struct xe_tile 
>>> *tile, const enum hardware_error
>>>           fw_err = xe_mmio_read32(mmio, HEC_UNCORR_FW_ERR_DW0(base));
>>>           for_each_set_bit(err_bit, &fw_err, HEC_UNCORR_FW_ERR_BITS) {
>>>               drm_err_ratelimited(&xe->drm, HW_ERR
>>> -                        "%s: HEC Uncorrected FW %s error reported, 
>>> bit[%d] is set\n",
>>> -                         hw_err_str, 
>>> hec_uncorrected_fw_errors[err_bit],
>>> +                        "HEC FW %s error reported, bit[%d] is set\n",
>>> +                         hec_uncorrected_fw_errors[err_bit],
>>
>> So we're dropping severity_str here? Did I miss something?
> 
> I removed it because uncorrected was mentioned in log. But removed that 
> also by mistake
> 
> Will fix this. Thanks for catching this
> 
>>
>>>                            err_bit);
>>
>> ...
>>
>>> +static int hw_error_info_init(struct xe_device *xe)
>>> +{
>>> +    int ret;
>>> +
>>> +    if (xe->info.platform != XE_PVC)
>>> +        return 0;
>>> +
>>> +    ret = xe_drm_ras_allocate_nodes(xe);
>>
>> Why not just
>>
>>     return xe_drm_ras_allocate_nodes();
>>
>> Tidy? ;)
> 
> okay
> 
>>
>>> +    if (ret)
>>> +        return ret;
>>> +
>>> +    return 0;
>>> +}
>>> +
>>>   /*
>>>    * Process hardware errors during boot
>>>    */
>>> @@ -172,11 +179,16 @@ static void process_hw_errors(struct xe_device 
>>> *xe)
>>>   void xe_hw_error_init(struct xe_device *xe)
>>>   {
>>>       struct xe_tile *tile = xe_device_get_root_tile(xe);
>>> +    int ret;
>>>       if (!IS_DGFX(xe) || IS_SRIOV_VF(xe))
>>>           return;
>>>       INIT_WORK(&tile->csc_hw_error_work, csc_hw_error_work);
>>> +    ret = hw_error_info_init(xe);
>>> +    if (ret)
>>> +        drm_warn(&xe->drm, "Failed to allocate DRM RAS nodes\n");
>>
>> This is less likely due to any hardware limitation, so I think
>> drm_err() would be more appropriate here.
> 
> okay will fix it
> 
> Thanks
> Riana
> 
>>
>> Raag
>>
>>> +
>>>       process_hw_errors(xe);
>>>   }
>>> -- 
>>> 2.47.1
>>>
> 



* Re: [PATCH v5 4/5] drm/xe/xe_hw_error: Add support for Core-Compute errors
  2026-02-05 15:30   ` Raag Jadav
@ 2026-02-10  5:58     ` Riana Tauro
  2026-02-10 11:45       ` Raag Jadav
  0 siblings, 1 reply; 24+ messages in thread
From: Riana Tauro @ 2026-02-10  5:58 UTC (permalink / raw)
  To: Raag Jadav
  Cc: intel-xe, dri-devel, aravind.iddamsetty, anshuman.gupta,
	rodrigo.vivi, joonas.lahtinen, simona.vetter, airlied,
	pratik.bari, joshua.santosh.ranjan, ashwin.kumar.kulkarni,
	shubham.kumar, ravi.kishore.koppuravuri, Himal Prasad Ghimiray



On 2/5/2026 9:00 PM, Raag Jadav wrote:
> On Mon, Feb 02, 2026 at 12:14:00PM +0530, Riana Tauro wrote:
>> PVC supports GT error reporting via vector registers along with
>> error status register. Add support to report these errors and
>> update respective counters. In case of a Subslice error reported
>> by vector register, process the error status register
>> for applicable bits.
>>
>> The counter is embedded in the xe drm ras structure and is
>> exposed to the userspace using the drm_ras generic netlink
>> interface.
>>
>> $ sudo ynl --family drm_ras --do query-error-counter  --json
> 
> We usually add '\' at the end for wrapping commands so that they're easy
> to apply directly (and same for all other patches where applicable).
> 
>>    '{"node-id":0, "error-id":1}'
> 
> Ditto.

Will fix this

> 
>> {'error-id': 1, 'error-name': 'core-compute', 'error-value': 0}
>>
>> Co-developed-by: Himal Prasad Ghimiray <himal.prasad.ghimiray@intel.com>
>> Signed-off-by: Himal Prasad Ghimiray <himal.prasad.ghimiray@intel.com>
>> Signed-off-by: Riana Tauro <riana.tauro@intel.com>
>> ---
>> v2: Add ID's and names as uAPI (Rodrigo)
>>      Add documentation
>>      Modify commit message
>>
>> v3: remove 'error' from counters
>>      use drmm_kcalloc
>>      add a for_each for severity
>>      differentiate error classes and severity in UAPI (Raag)
>>      Use correctable and uncorrectable in uapi (Pratik / Aravind)
>>
>> v4: modify enums in UAPI
>>      improve comments
>>      add bounds check in handler
>>      add error mask macro (Raag)
>>      use atomic_t
>>      add null pointer checks
>> ---
>>   drivers/gpu/drm/xe/regs/xe_hw_error_regs.h |  62 ++++++-
>>   drivers/gpu/drm/xe/xe_hw_error.c           | 199 +++++++++++++++++++--
>>   2 files changed, 241 insertions(+), 20 deletions(-)
>>
>> diff --git a/drivers/gpu/drm/xe/regs/xe_hw_error_regs.h b/drivers/gpu/drm/xe/regs/xe_hw_error_regs.h
>> index c146b9ef44eb..17982a335941 100644
>> --- a/drivers/gpu/drm/xe/regs/xe_hw_error_regs.h
>> +++ b/drivers/gpu/drm/xe/regs/xe_hw_error_regs.h
>> @@ -6,15 +6,59 @@
>>   #ifndef _XE_HW_ERROR_REGS_H_
>>   #define _XE_HW_ERROR_REGS_H_
>>   
>> -#define HEC_UNCORR_ERR_STATUS(base)                    XE_REG((base) + 0x118)
>> -#define    UNCORR_FW_REPORTED_ERR                      BIT(6)
>> +#define HEC_UNCORR_ERR_STATUS(base)			XE_REG((base) + 0x118)
>> +#define   UNCORR_FW_REPORTED_ERR			REG_BIT(6)
>>   
>> -#define HEC_UNCORR_FW_ERR_DW0(base)                    XE_REG((base) + 0x124)
>> +#define HEC_UNCORR_FW_ERR_DW0(base)			XE_REG((base) + 0x124)
>> +
>> +#define ERR_STAT_GT_COR					0x100160
>> +#define   EU_GRF_COR_ERR				REG_BIT(15)
>> +#define   EU_IC_COR_ERR					REG_BIT(14)
>> +#define   SLM_COR_ERR					REG_BIT(13)
>> +#define   GUC_COR_ERR					REG_BIT(1)
>> +
>> +#define ERR_STAT_GT_NONFATAL				0x100164
>> +#define ERR_STAT_GT_FATAL				0x100168
>> +#define   EU_GRF_FAT_ERR				REG_BIT(15)
>> +#define   SLM_FAT_ERR					REG_BIT(13)
>> +#define   GUC_FAT_ERR					REG_BIT(6)
>> +#define   FPU_FAT_ERR					REG_BIT(3)
>> +
>> +#define ERR_STAT_GT_REG(x)				XE_REG(_PICK_EVEN((x), \
>> +									  ERR_STAT_GT_COR, \
>> +									  ERR_STAT_GT_NONFATAL))
>> +
>> +#define PVC_COR_ERR_MASK				(GUC_COR_ERR | SLM_COR_ERR | \
>> +							 EU_IC_COR_ERR | EU_GRF_COR_ERR)
>> +
>> +#define PVC_FAT_ERR_MASK				(FPU_FAT_ERR | GUC_FAT_ERR | \
>> +							EU_GRF_FAT_ERR | SLM_FAT_ERR)
> 
> Nit: Whitespace please!

alignment?

> 
>> +#define DEV_ERR_STAT_NONFATAL				0x100178
>> +#define DEV_ERR_STAT_CORRECTABLE			0x10017c
>> +#define DEV_ERR_STAT_REG(x)				XE_REG(_PICK_EVEN((x), \
>> +									  DEV_ERR_STAT_CORRECTABLE, \
>> +									  DEV_ERR_STAT_NONFATAL))
> 
> I know it was already like this but how does this evaluate for FATAL?

#define _PICK_EVEN(__index, __a, __b) ((__a) + (__index) * ((__b) - (__a)))
_PICK_EVEN(index, 0x10017c, 0x100178) = 0x10017c + index * (0x100178 - 0x10017c)
index 0 (correctable) = 0x10017c
index 1 (nonfatal)    = 0x100178
index 2 (fatal)       = 0x100174
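
The same arithmetic can be checked in userspace. This is only a sketch: PICK_EVEN() below copies the expression quoted above, and dev_err_stat_offset() is a hypothetical helper that returns the raw offset instead of an XE_REG().

```c
#include <assert.h>
#include <stdint.h>
#include <stdio.h>

/* Same expression as the kernel's _PICK_EVEN(), copied for illustration. */
#define PICK_EVEN(index, a, b)	((a) + (index) * ((b) - (a)))

#define DEV_ERR_STAT_NONFATAL		0x100178
#define DEV_ERR_STAT_CORRECTABLE	0x10017c

/* Values as in enum hardware_error from patch 3/5. */
enum hardware_error {
	HARDWARE_ERROR_CORRECTABLE = 0,
	HARDWARE_ERROR_NONFATAL = 1,
	HARDWARE_ERROR_FATAL = 2,
};

/* Hypothetical helper: raw register offset for a given error type. */
static uint32_t dev_err_stat_offset(enum hardware_error hw_err)
{
	/* The step is (0x100178 - 0x10017c) = -4, so each index moves down 4 bytes. */
	return (uint32_t)PICK_EVEN((int)hw_err, DEV_ERR_STAT_CORRECTABLE,
				   DEV_ERR_STAT_NONFATAL);
}

/* Prints the offset selected for each error type. */
static void dump_dev_err_stat_offsets(void)
{
	int i;

	for (i = HARDWARE_ERROR_CORRECTABLE; i <= HARDWARE_ERROR_FATAL; i++)
		printf("hw_err %d -> 0x%x\n", i,
		       (unsigned int)dev_err_stat_offset((enum hardware_error)i));
}
```

So FATAL works because the registers are evenly spaced and the macro simply extrapolates one more step past NONFATAL.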

> 
>> +#define   XE_CSC_ERROR					17
>> +#define   XE_GT_ERROR					0
>> +
>> +#define ERR_STAT_GT_FATAL_VECTOR_0			0x100260
>> +#define ERR_STAT_GT_FATAL_VECTOR_1			0x100264
>> +
>> +#define ERR_STAT_GT_FATAL_VECTOR_REG(x)			XE_REG(_PICK_EVEN((x), \
>> +								  ERR_STAT_GT_FATAL_VECTOR_0, \
>> +								  ERR_STAT_GT_FATAL_VECTOR_1))
>> +
>> +#define ERR_STAT_GT_COR_VECTOR_0			0x1002a0
>> +#define ERR_STAT_GT_COR_VECTOR_1			0x1002a4
>> +
>> +#define ERR_STAT_GT_COR_VECTOR_REG(x)			XE_REG(_PICK_EVEN((x), \
>> +									  ERR_STAT_GT_COR_VECTOR_0, \
>> +									  ERR_STAT_GT_COR_VECTOR_1))
>> +
>> +#define ERR_STAT_GT_VECTOR_REG(hw_err, x)		(hw_err == HARDWARE_ERROR_CORRECTABLE ? \
>> +							ERR_STAT_GT_COR_VECTOR_REG(x) : \
>> +							ERR_STAT_GT_FATAL_VECTOR_REG(x))
> 
> Ditto for whitespace.
> 
>> -#define DEV_ERR_STAT_NONFATAL			0x100178
>> -#define DEV_ERR_STAT_CORRECTABLE		0x10017c
>> -#define DEV_ERR_STAT_REG(x)			XE_REG(_PICK_EVEN((x), \
>> -								  DEV_ERR_STAT_CORRECTABLE, \
>> -								  DEV_ERR_STAT_NONFATAL))
>> -#define   XE_CSC_ERROR				BIT(17)
>>   #endif
>> diff --git a/drivers/gpu/drm/xe/xe_hw_error.c b/drivers/gpu/drm/xe/xe_hw_error.c
>> index 2019aaaa1ebe..ff31fb322c8a 100644
>> --- a/drivers/gpu/drm/xe/xe_hw_error.c
>> +++ b/drivers/gpu/drm/xe/xe_hw_error.c
>> @@ -3,6 +3,7 @@
>>    * Copyright © 2025 Intel Corporation
>>    */
>>   
>> +#include <linux/bitmap.h>
>>   #include <linux/fault-inject.h>
>>   
>>   #include "regs/xe_gsc_regs.h"
>> @@ -15,7 +16,13 @@
>>   #include "xe_mmio.h"
>>   #include "xe_survivability_mode.h"
>>   
>> -#define  HEC_UNCORR_FW_ERR_BITS 4
>> +#define  GT_HW_ERROR_MAX_ERR_BITS	16
>> +#define  HEC_UNCORR_FW_ERR_BITS		4
>> +#define  XE_RAS_REG_SIZE		32
>> +
>> +#define  PVC_ERROR_MASK_SET(hw_err, err_bit) \
>> +	((hw_err == HARDWARE_ERROR_CORRECTABLE) ? (BIT(err_bit) & PVC_COR_ERR_MASK) : \
>> +	(BIT(err_bit) & PVC_FAT_ERR_MASK))
> 
> I'd write this as below and move it to xe_hw_error_regs.h

This is not specific to register selection or bit definitions. It
checks a bit against a mask, so the .c file should be the right place.

> 
> #define PVC_COR_ERR_MASK_SET(err_bit)			(PVC_COR_ERR_MASK & REG_BIT(err_bit))
> #define PVC_FAT_ERR_MASK_SET(err_bit)			(PVC_FAT_ERR_MASK & REG_BIT(err_bit))
> 
> #define PVC_ERR_MASK_SET(hw_err, err_bit)		((hw_err == HARDWARE_ERROR_CORRECTABLE) ? \
> 								PVC_COR_ERR_MASK_SET(err_bit) : \
> 								PVC_FAT_ERR_MASK_SET(err_bit))
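
For reference, a userspace sketch of how the split form suggested above would evaluate, wherever it ends up living. REG_BIT() here is a simplified stand-in for the kernel macro, and the mask values follow the bit definitions quoted earlier in this patch.

```c
#include <assert.h>
#include <stdint.h>

/* Simplified stand-in for the kernel's REG_BIT(). */
#define REG_BIT(n)		(UINT32_C(1) << (n))

#define EU_GRF_COR_ERR		REG_BIT(15)
#define EU_IC_COR_ERR		REG_BIT(14)
#define SLM_COR_ERR		REG_BIT(13)
#define GUC_COR_ERR		REG_BIT(1)

#define EU_GRF_FAT_ERR		REG_BIT(15)
#define SLM_FAT_ERR		REG_BIT(13)
#define GUC_FAT_ERR		REG_BIT(6)
#define FPU_FAT_ERR		REG_BIT(3)

#define PVC_COR_ERR_MASK	(GUC_COR_ERR | SLM_COR_ERR | \
				 EU_IC_COR_ERR | EU_GRF_COR_ERR)
#define PVC_FAT_ERR_MASK	(FPU_FAT_ERR | GUC_FAT_ERR | \
				 EU_GRF_FAT_ERR | SLM_FAT_ERR)

/* Values as in enum hardware_error from patch 3/5. */
enum hardware_error {
	HARDWARE_ERROR_CORRECTABLE = 0,
	HARDWARE_ERROR_NONFATAL = 1,
	HARDWARE_ERROR_FATAL = 2,
};

#define PVC_COR_ERR_MASK_SET(err_bit)	(PVC_COR_ERR_MASK & REG_BIT(err_bit))
#define PVC_FAT_ERR_MASK_SET(err_bit)	(PVC_FAT_ERR_MASK & REG_BIT(err_bit))

/* Non-zero when err_bit is one of the bits handled for this error type. */
#define PVC_ERR_MASK_SET(hw_err, err_bit) \
	(((hw_err) == HARDWARE_ERROR_CORRECTABLE) ? \
	 PVC_COR_ERR_MASK_SET(err_bit) : PVC_FAT_ERR_MASK_SET(err_bit))

static_assert(PVC_COR_ERR_MASK == 0xe002, "bits 1, 13, 14 and 15");
static_assert(PVC_FAT_ERR_MASK == 0xa048, "bits 3, 6, 13 and 15");
```

For example, bit 14 (EU IC) is only in the correctable mask, while bit 6 is only in the fatal mask.
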
> 
> ...
> 
>> +static void gt_hw_error_handler(struct xe_tile *tile, const enum hardware_error hw_err,
>> +				u32 error_id)
>> +{
>> +	const enum drm_xe_ras_error_severity severity = hw_err_to_severity(hw_err);
>> +	struct xe_device *xe = tile_to_xe(tile);
>> +	struct xe_drm_ras *ras = &xe->ras;
>> +	struct xe_drm_ras_counter *info = ras->info[severity];
>> +	struct xe_mmio *mmio = &tile->mmio;
>> +	unsigned long err_stat = 0;
>> +	int i, len;
>> +
>> +	if (xe->info.platform != XE_PVC)
>> +		return;
>> +
>> +	if (!info)
>> +		return;
> 
> Since info allocation is not related to hardware, we shouldn't even be
> at this point without it. So let's not hide bugs and fail probe instead.

Yes, currently it is supported only on PVC. I can remove this check here
since there is a PVC check, but I cannot remove the one suggested below.

Thanks
Riana

> 
>> +	if (hw_err == HARDWARE_ERROR_NONFATAL) {
>> +		atomic_inc(&info[error_id].counter);
>> +		log_hw_error(tile, info[error_id].name, severity);
>> +		return;
>> +	}
> 
> ...
> 
>>   static void hw_error_source_handler(struct xe_tile *tile, const enum hardware_error hw_err)
>>   {
>>   	const enum drm_xe_ras_error_severity severity = hw_err_to_severity(hw_err);
>>   	const char *severity_str = error_severity[severity];
>>   	struct xe_device *xe = tile_to_xe(tile);
>> -	unsigned long flags;
>> -	u32 err_src;
>> +	struct xe_drm_ras *ras = &xe->ras;
>> +	struct xe_drm_ras_counter *info = ras->info[severity];
>> +	unsigned long flags, err_src;
>> +	u32 err_bit;
>>   
>> -	if (xe->info.platform != XE_BATTLEMAGE)
>> +	if (!IS_DGFX(xe))
>>   		return;
>>   
>>   	spin_lock_irqsave(&xe->irq.lock, flags);
>> @@ -108,11 +242,53 @@ static void hw_error_source_handler(struct xe_tile *tile, const enum hardware_er
>>   		goto unlock;
>>   	}
>>   
>> -	if (err_src & XE_CSC_ERROR)
>> +	/*
>> +	 * On encountering CSC firmware errors, the graphics device becomes unrecoverable
>> +	 * so return immediately on error. The only way to recover from these errors is
>> +	 * firmware flash. The device will enter Runtime Survivability mode when such
>> +	 * errors are detected.
>> +	 */
>> +	if (err_src & XE_CSC_ERROR) {
>>   		csc_hw_error_handler(tile, hw_err);
>> +		goto clear_reg;
>> +	}
>>   
>> -	xe_mmio_write32(&tile->mmio, DEV_ERR_STAT_REG(hw_err), err_src);
>> +	if (!info)
>> +		goto clear_reg;
> 
> Same as above.
> 
> Raag



* Re: [PATCH v5 5/5] drm/xe/xe_hw_error: Add support for PVC SoC errors
  2026-02-05 18:10   ` Raag Jadav
@ 2026-02-10  6:32     ` Riana Tauro
  2026-02-10 11:52       ` Raag Jadav
  0 siblings, 1 reply; 24+ messages in thread
From: Riana Tauro @ 2026-02-10  6:32 UTC (permalink / raw)
  To: Raag Jadav
  Cc: intel-xe, dri-devel, aravind.iddamsetty, anshuman.gupta,
	rodrigo.vivi, joonas.lahtinen, simona.vetter, airlied,
	pratik.bari, joshua.santosh.ranjan, ashwin.kumar.kulkarni,
	shubham.kumar, ravi.kishore.koppuravuri, Himal Prasad Ghimiray



On 2/5/2026 11:40 PM, Raag Jadav wrote:
> On Mon, Feb 02, 2026 at 12:14:01PM +0530, Riana Tauro wrote:
>> Report the SoC nonfatal/fatal hardware error and update the counters.
>>
>> $ sudo ynl --family drm_ras --do query-error-counter  --json '{"node-id":0, "error-id":2}'
> 
> Same comment as last patch.

The second line is the output, so it's not necessary.

> 
>> {'error-id': 2, 'error-name': 'soc-internal', 'error-value': 0}
>>
>> Co-developed-by: Himal Prasad Ghimiray <himal.prasad.ghimiray@intel.com>
>> Signed-off-by: Himal Prasad Ghimiray <himal.prasad.ghimiray@intel.com>
>> Signed-off-by: Riana Tauro <riana.tauro@intel.com>
>> ---
>> v2: Add ID's and names as uAPI (Rodrigo)
>>
>> v3: reorder and align arrays
>>      remove redundant string err
>>      use REG_BIT
>>      fix aesthetic review comments (Raag)
>>      use only correctable/uncorrectable error severity (Aravind)
>>
>> v4: fix comments
>>      use master as variable name
>>      add static_assert (Raag)
>> ---
>>   drivers/gpu/drm/xe/regs/xe_hw_error_regs.h |  24 +++
>>   drivers/gpu/drm/xe/xe_hw_error.c           | 221 ++++++++++++++++++++-
>>   2 files changed, 244 insertions(+), 1 deletion(-)
>>
>> diff --git a/drivers/gpu/drm/xe/regs/xe_hw_error_regs.h b/drivers/gpu/drm/xe/regs/xe_hw_error_regs.h
>> index 17982a335941..a89a07d067fc 100644
>> --- a/drivers/gpu/drm/xe/regs/xe_hw_error_regs.h
>> +++ b/drivers/gpu/drm/xe/regs/xe_hw_error_regs.h
>> @@ -41,6 +41,7 @@
>>   									  DEV_ERR_STAT_NONFATAL))
>>   
>>   #define   XE_CSC_ERROR					17
> 
> I overlooked this in the last patch but I think this should be used as
> 
> 	if (err_src & REG_BIT(XE_CSC_ERROR))

Thanks for catching this. When I changed the bit definition, I retained the
previous code. Will fix this and test.
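
A small userspace sketch of why the old check misbehaves once XE_CSC_ERROR is a bit index rather than a mask. REG_BIT() is a simplified stand-in for the kernel macro, and the two helper names are hypothetical.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Simplified stand-in for the kernel's REG_BIT(). */
#define REG_BIT(n)	(UINT32_C(1) << (n))

/* Bit indices, as defined in xe_hw_error_regs.h by this series. */
#define XE_CSC_ERROR	17
#define XE_SOC_ERROR	16
#define XE_GT_ERROR	0

/* Old check: 17 is 0b10001, so this really tests bits 0 and 4. */
static bool csc_error_index_as_mask(uint32_t err_src)
{
	return (err_src & XE_CSC_ERROR) != 0;
}

/* Suggested check: tests bit 17 only. */
static bool csc_error_reg_bit(uint32_t err_src)
{
	return (err_src & REG_BIT(XE_CSC_ERROR)) != 0;
}
```

With the old form, a GT error alone (bit 0) is taken as a CSC error, and a real CSC error (bit 17) is missed.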

> 
>> +#define   XE_SOC_ERROR					16
>>   #define   XE_GT_ERROR					0
>>   
>>   #define ERR_STAT_GT_FATAL_VECTOR_0			0x100260
>> @@ -61,4 +62,27 @@
>>   							ERR_STAT_GT_COR_VECTOR_REG(x) : \
>>   							ERR_STAT_GT_FATAL_VECTOR_REG(x))
>>   
>> +#define SOC_PVC_MASTER_BASE				0x282000
>> +#define SOC_PVC_SLAVE_BASE				0x283000
>> +
>> +#define SOC_GCOERRSTS					0x200
>> +#define SOC_GNFERRSTS					0x210
>> +#define SOC_GLOBAL_ERR_STAT_REG(base, x)		XE_REG(_PICK_EVEN((x), \
>> +									  (base) + SOC_GCOERRSTS, \
>> +									  (base) + SOC_GNFERRSTS))
>> +#define   SOC_SLAVE_IEH					REG_BIT(1)
>> +#define   SOC_IEH0_LOCAL_ERR_STATUS			REG_BIT(0)
>> +#define   SOC_IEH1_LOCAL_ERR_STATUS			REG_BIT(0)
>> +
>> +#define SOC_GSYSEVTCTL					0x264
>> +#define SOC_GSYSEVTCTL_REG(master, slave, x)		XE_REG(_PICK_EVEN((x), \
>> +									  (master) + SOC_GSYSEVTCTL, \
>> +									  (slave) + SOC_GSYSEVTCTL))
>> +
>> +#define SOC_LERRUNCSTS					0x280
>> +#define SOC_LERRCORSTS					0x294
>> +#define SOC_LOCAL_ERR_STAT_REG(base, hw_err)		XE_REG(hw_err == HARDWARE_ERROR_CORRECTABLE ? \
>> +							       (base) + SOC_LERRCORSTS : \
>> +							       (base) + SOC_LERRUNCSTS)
>> +
>>   #endif
>> diff --git a/drivers/gpu/drm/xe/xe_hw_error.c b/drivers/gpu/drm/xe/xe_hw_error.c
>> index ff31fb322c8a..159ec796386a 100644
>> --- a/drivers/gpu/drm/xe/xe_hw_error.c
>> +++ b/drivers/gpu/drm/xe/xe_hw_error.c
>> @@ -19,6 +19,7 @@
>>   #define  GT_HW_ERROR_MAX_ERR_BITS	16
>>   #define  HEC_UNCORR_FW_ERR_BITS		4
>>   #define  XE_RAS_REG_SIZE		32
>> +#define  XE_SOC_NUM_IEH			2
>>   
>>   #define  PVC_ERROR_MASK_SET(hw_err, err_bit) \
>>   	((hw_err == HARDWARE_ERROR_CORRECTABLE) ? (BIT(err_bit) & PVC_COR_ERR_MASK) : \
>> @@ -36,7 +37,8 @@ static const char * const hec_uncorrected_fw_errors[] = {
>>   };
>>   
>>   static const unsigned long xe_hw_error_map[] = {
>> -	[XE_GT_ERROR] = DRM_XE_RAS_ERR_COMP_CORE_COMPUTE,
>> +	[XE_GT_ERROR]	= DRM_XE_RAS_ERR_COMP_CORE_COMPUTE,
> 
> Unneeded churn, please align in the original patch.
> 
>> +	[XE_SOC_ERROR]	= DRM_XE_RAS_ERR_COMP_SOC_INTERNAL,
>>   };
>>   
>>   enum gt_vector_regs {
>> @@ -60,6 +62,102 @@ static enum drm_xe_ras_error_severity hw_err_to_severity(enum hardware_error hw_
>>   	return DRM_XE_RAS_ERR_SEV_UNCORRECTABLE;
>>   }
>>   
>> +static const char * const pvc_master_global_err_reg[] = {
>> +	[0 ... 1]	= "Undefined",
>> +	[2]		= "HBM SS0: Channel0",
>> +	[3]		= "HBM SS0: Channel1",
>> +	[4]		= "HBM SS0: Channel2",
>> +	[5]		= "HBM SS0: Channel3",
>> +	[6]		= "HBM SS0: Channel4",
>> +	[7]		= "HBM SS0: Channel5",
>> +	[8]		= "HBM SS0: Channel6",
>> +	[9]		= "HBM SS0: Channel7",
>> +	[10]		= "HBM SS1: Channel0",
>> +	[11]		= "HBM SS1: Channel1",
>> +	[12]		= "HBM SS1: Channel2",
>> +	[13]		= "HBM SS1: Channel3",
>> +	[14]		= "HBM SS1: Channel4",
>> +	[15]		= "HBM SS1: Channel5",
>> +	[16]		= "HBM SS1: Channel6",
>> +	[17]		= "HBM SS1: Channel7",
>> +	[18 ... 31]	= "Undefined",
>> +};
>> +
> 
> Redundant blank line.
> 

Then checkpatch complains :(

>> +static_assert(ARRAY_SIZE(pvc_master_global_err_reg) == XE_RAS_REG_SIZE);
>> +
>> +static const char * const pvc_slave_global_err_reg[] = {
>> +	[0]		= "Undefined",
>> +	[1]		= "HBM SS2: Channel0",
>> +	[2]		= "HBM SS2: Channel1",
>> +	[3]		= "HBM SS2: Channel2",
>> +	[4]		= "HBM SS2: Channel3",
>> +	[5]		= "HBM SS2: Channel4",
>> +	[6]		= "HBM SS2: Channel5",
>> +	[7]		= "HBM SS2: Channel6",
>> +	[8]		= "HBM SS2: Channel7",
>> +	[9]		= "HBM SS3: Channel0",
>> +	[10]		= "HBM SS3: Channel1",
>> +	[11]		= "HBM SS3: Channel2",
>> +	[12]		= "HBM SS3: Channel3",
>> +	[13]		= "HBM SS3: Channel4",
>> +	[14]		= "HBM SS3: Channel5",
>> +	[15]		= "HBM SS3: Channel6",
>> +	[16]		= "HBM SS3: Channel7",
>> +	[17]		= "Undefined",
>> +	[18]		= "ANR MDFI",
>> +	[19 ... 31]	= "Undefined",
>> +};
>> +
> 
> Ditto.
> 
>> +static_assert(ARRAY_SIZE(pvc_slave_global_err_reg) == XE_RAS_REG_SIZE);
>> +
>> +static const char * const pvc_slave_local_fatal_err_reg[] = {
>> +	[0]		= "Local IEH: Malformed PCIe AER",
>> +	[1]		= "Local IEH: Malformed PCIe ERR",
>> +	[2]		= "Local IEH: UR conditions in IEH",
>> +	[3]		= "Local IEH: From SERR Sources",
>> +	[4 ... 19]	= "Undefined",
>> +	[20]		= "Malformed MCA error packet (HBM/Punit)",
>> +	[21 ... 31]	= "Undefined",
>> +};
>> +
> 
> Ditto.
> 
>> +static_assert(ARRAY_SIZE(pvc_slave_local_fatal_err_reg) == XE_RAS_REG_SIZE);
>> +
>> +static const char * const pvc_master_local_fatal_err_reg[] = {
>> +	[0]		= "Local IEH: Malformed IOSF PCIe AER",
>> +	[1]		= "Local IEH: Malformed IOSF PCIe ERR",
>> +	[2]		= "Local IEH: UR RESPONSE",
>> +	[3]		= "Local IEH: From SERR SPI controller",
>> +	[4]		= "Base Die MDFI T2T",
>> +	[5]		= "Undefined",
>> +	[6]		= "Base Die MDFI T2C",
>> +	[7]		= "Undefined",
>> +	[8]		= "Invalid CSC PSF Command Parity",
>> +	[9]		= "Invalid CSC PSF Unexpected Completion",
>> +	[10]		= "Invalid CSC PSF Unsupported Request",
>> +	[11]		= "Invalid PCIe PSF Command Parity",
>> +	[12]		= "PCIe PSF Unexpected Completion",
>> +	[13]		= "PCIe PSF Unsupported Request",
>> +	[14 ... 19]	= "Undefined",
>> +	[20]		= "Malformed MCA error packet (HBM/Punit)",
>> +	[21 ... 31]	= "Undefined",
>> +};
>> +
> 
> Ditto.
> 
>> +static_assert(ARRAY_SIZE(pvc_master_local_fatal_err_reg) == XE_RAS_REG_SIZE);
>> +
>> +static const char * const pvc_master_local_nonfatal_err_reg[] = {
>> +	[0 ... 3]	= "Undefined",
>> +	[4]		= "Base Die MDFI T2T",
>> +	[5]		= "Undefined",
>> +	[6]		= "Base Die MDFI T2C",
>> +	[7]		= "Undefined",
>> +	[8]		= "Invalid CSC PSF Command Parity",
>> +	[9]		= "Invalid CSC PSF Unexpected Completion",
>> +	[10]		= "Invalid PCIe PSF Command Parity",
>> +	[11 ... 31]	= "Undefined",
>> +};
>> +
> 
> Ditto.
> 
>> +static_assert(ARRAY_SIZE(pvc_master_local_nonfatal_err_reg) == XE_RAS_REG_SIZE);
>> +
>>   static bool fault_inject_csc_hw_error(void)
>>   {
>>   	return IS_ENABLED(CONFIG_DEBUG_FS) && should_fail(&inject_csc_hw_error, 1);
>> @@ -138,6 +236,26 @@ static void log_gt_err(struct xe_tile *tile, const char *name, int i, u32 err,
>>   				    name, severity_str, i, err);
>>   }
>>   
>> +static void log_soc_error(struct xe_tile *tile, const char * const *reg_info,
>> +			  const enum drm_xe_ras_error_severity severity, u32 err_bit, u32 index)
>> +{
>> +	const char *severity_str = error_severity[severity];
>> +	struct xe_device *xe = tile_to_xe(tile);
>> +	struct xe_drm_ras *ras = &xe->ras;
>> +	struct xe_drm_ras_counter *info = ras->info[severity];
>> +	const char *name;
>> +
>> +	name = reg_info[err_bit];
>> +
>> +	if (strcmp(name, "Undefined")) {
>> +		if (severity == DRM_XE_RAS_ERR_SEV_CORRECTABLE)
>> +			drm_warn(&xe->drm, "%s SOC %s detected", name, severity_str);
>> +		else
>> +			drm_err_ratelimited(&xe->drm, "%s SOC %s detected", name, severity_str);
>> +		atomic_inc(&info[index].counter);
>> +	}
>> +}
>> +
>>   static void gt_hw_error_handler(struct xe_tile *tile, const enum hardware_error hw_err,
>>   				u32 error_id)
>>   {
>> @@ -221,6 +339,104 @@ static void gt_hw_error_handler(struct xe_tile *tile, const enum hardware_error
>>   	}
>>   }
>>   
>> +static void soc_slave_ieh_handler(struct xe_tile *tile, const enum hardware_error hw_err, u32 error_id)
>> +{
>> +	const enum drm_xe_ras_error_severity severity = hw_err_to_severity(hw_err);
>> +	unsigned long slave_global_errstat, slave_local_errstat;
>> +	struct xe_mmio *mmio = &tile->mmio;
>> +	u32 regbit, slave_base;
>> +
>> +	slave_base = SOC_PVC_SLAVE_BASE;
> 
> Just name it 'slave' and it'll probably help remove the line wrapping below.

There is no wrapping except the log_soc_error. This change won't fix it

> 
>> +	slave_global_errstat = xe_mmio_read32(mmio, SOC_GLOBAL_ERR_STAT_REG(slave_base, hw_err));
>> +
>> +	if (slave_global_errstat & SOC_IEH1_LOCAL_ERR_STATUS) {
>> +		slave_local_errstat = xe_mmio_read32(mmio, SOC_LOCAL_ERR_STAT_REG(slave_base, hw_err));
>> +
>> +		if (hw_err == HARDWARE_ERROR_FATAL) {
>> +			for_each_set_bit(regbit, &slave_local_errstat, XE_RAS_REG_SIZE)
>> +				log_soc_error(tile, pvc_slave_local_fatal_err_reg, severity,
>> +					      regbit, error_id);
>> +		}
>> +
>> +		xe_mmio_write32(mmio, SOC_LOCAL_ERR_STAT_REG(slave_base, hw_err),
>> +				slave_local_errstat);
>> +	}
>> +
>> +	for_each_set_bit(regbit, &slave_global_errstat, XE_RAS_REG_SIZE)
>> +		log_soc_error(tile, pvc_slave_global_err_reg, severity, regbit, error_id);
>> +
>> +	xe_mmio_write32(mmio, SOC_GLOBAL_ERR_STAT_REG(slave_base, hw_err), slave_global_errstat);
>> +}
>> +
>> +static void soc_hw_error_handler(struct xe_tile *tile, const enum hardware_error hw_err,
>> +				 u32 error_id)
>> +{
>> +	const enum drm_xe_ras_error_severity severity = hw_err_to_severity(hw_err);
>> +	struct xe_device *xe = tile_to_xe(tile);
>> +	struct xe_mmio *mmio = &tile->mmio;
>> +	unsigned long master_global_errstat, master_local_errstat;
>> +	u32 master_base, slave_base, regbit;
>> +	int i;
>> +
>> +	if (xe->info.platform != XE_PVC)
>> +		return;
>> +
>> +	master_base = SOC_PVC_MASTER_BASE;
>> +	slave_base = SOC_PVC_SLAVE_BASE;
> 
> Ditto. Just 'master' and 'slave' will help remove the line wrapping below.

Yeah this will help. Will add this change

> 
>> +	/* Mask error type in GSYSEVTCTL so that no new errors of the type will be reported */
>> +	for (i = 0; i < XE_SOC_NUM_IEH; i++)
>> +		xe_mmio_write32(mmio, SOC_GSYSEVTCTL_REG(master_base, slave_base, i),
>> +				~REG_BIT(hw_err));
>> +
>> +	if (hw_err == HARDWARE_ERROR_CORRECTABLE) {
>> +		xe_mmio_write32(mmio, SOC_GLOBAL_ERR_STAT_REG(master_base, hw_err),
>> +				REG_GENMASK(31, 0));
>> +		xe_mmio_write32(mmio, SOC_LOCAL_ERR_STAT_REG(master_base, hw_err),
>> +				REG_GENMASK(31, 0));
>> +		xe_mmio_write32(mmio, SOC_GLOBAL_ERR_STAT_REG(slave_base, hw_err),
>> +				REG_GENMASK(31, 0));
>> +		xe_mmio_write32(mmio, SOC_LOCAL_ERR_STAT_REG(slave_base, hw_err),
>> +				REG_GENMASK(31, 0));
>> +		goto unmask_gsysevtctl;
>> +	}
>> +
>> +	/*
>> +	 * Read the master global IEH error register if BIT(1) is set then process
> 
> Missing comma after 'register'.
> 
>> +	 * the slave IEH first. If BIT(0) in global error register is set then process
>> +	 * the corresponding local error registers.
>> +	 */
>> +	master_global_errstat = xe_mmio_read32(mmio, SOC_GLOBAL_ERR_STAT_REG(master_base, hw_err));
>> +	if (master_global_errstat & SOC_SLAVE_IEH)
>> +		soc_slave_ieh_handler(tile, hw_err, error_id);
>> +
>> +	if (master_global_errstat & SOC_IEH0_LOCAL_ERR_STATUS) {
>> +		master_local_errstat = xe_mmio_read32(mmio, SOC_LOCAL_ERR_STAT_REG(master_base, hw_err));
>> +
>> +		for_each_set_bit(regbit, &master_local_errstat, XE_RAS_REG_SIZE) {
>> +			const char * const *reg_info = (hw_err == HARDWARE_ERROR_FATAL) ?
> 
> This looks like it can be outside the loop.

will fix it.
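
For reference, the hoisting suggested above can be sketched in userspace as follows. This is only an illustration under stated assumptions: the enum ordering, the two-entry stand-in tables, and the helper names pick_reg_info() and last_error_name() are hypothetical and not part of the patch.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Assumed ordering; the real enum lives in the xe driver headers. */
enum hardware_error {
	HARDWARE_ERROR_CORRECTABLE,
	HARDWARE_ERROR_NONFATAL,
	HARDWARE_ERROR_FATAL,
};

#define TBL_SIZE 2

/* Stand-in tables; the tables in the patch have 32 entries each. */
static const char * const fatal_tbl[TBL_SIZE] = { "fatal bit0", "fatal bit1" };
static const char * const nonfatal_tbl[TBL_SIZE] = { "nonfatal bit0", "nonfatal bit1" };

static_assert(sizeof(fatal_tbl) == sizeof(nonfatal_tbl), "tables must match in size");

/* Hypothetical helper: the table depends only on hw_err, not on the bit. */
static const char * const *pick_reg_info(enum hardware_error hw_err)
{
	return hw_err == HARDWARE_ERROR_FATAL ? fatal_tbl : nonfatal_tbl;
}

/* Returns the name of the highest set bit in errstat, or NULL if none is set. */
static const char *last_error_name(enum hardware_error hw_err, uint32_t errstat)
{
	/* Selected once, before the loop, instead of on every iteration. */
	const char * const *reg_info = pick_reg_info(hw_err);
	const char *name = NULL;
	unsigned int bit;

	for (bit = 0; bit < TBL_SIZE; bit++) {
		if (errstat & ((uint32_t)1 << bit))
			name = reg_info[bit];
	}

	return name;
}
```

The table choice is loop-invariant, so moving it out changes nothing functionally; it only shortens the loop body.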


Thanks
Riana


> 
> Raag
> 
>> +						       pvc_master_local_fatal_err_reg :
>> +						       pvc_master_local_nonfatal_err_reg;
>> +
>> +			log_soc_error(tile, reg_info, severity, regbit, error_id);
>> +		}
>> +
>> +		xe_mmio_write32(mmio, SOC_LOCAL_ERR_STAT_REG(master_base, hw_err),
>> +				master_local_errstat);
>> +	}
>> +
>> +	for_each_set_bit(regbit, &master_global_errstat, XE_RAS_REG_SIZE)
>> +		log_soc_error(tile, pvc_master_global_err_reg, severity, regbit, error_id);
>> +
>> +	xe_mmio_write32(mmio, SOC_GLOBAL_ERR_STAT_REG(master_base, hw_err),
>> +			master_global_errstat);
>> +
>> +unmask_gsysevtctl:
>> +	for (i = 0; i < XE_SOC_NUM_IEH; i++)
>> +		xe_mmio_write32(mmio, SOC_GSYSEVTCTL_REG(master_base, slave_base, i),
>> +				(HARDWARE_ERROR_MAX << 1) + 1);
>> +}
>> +
>>   static void hw_error_source_handler(struct xe_tile *tile, const enum hardware_error hw_err)
>>   {
>>   	const enum drm_xe_ras_error_severity severity = hw_err_to_severity(hw_err);
>> @@ -283,8 +499,11 @@ static void hw_error_source_handler(struct xe_tile *tile, const enum hardware_er
>>   					    "TILE%d reported %s %s, bit[%d] is set\n",
>>   					    tile->id, name, severity_str, err_bit);
>>   		}
>> +
>>   		if (err_bit == XE_GT_ERROR)
>>   			gt_hw_error_handler(tile, hw_err, error_id);
>> +		if (err_bit == XE_SOC_ERROR)
>> +			soc_hw_error_handler(tile, hw_err, error_id);
>>   	}
>>   
>>   clear_reg:
>> -- 
>> 2.47.1
>>



* Re: [PATCH v5 4/5] drm/xe/xe_hw_error: Add support for Core-Compute errors
  2026-02-10  5:58     ` Riana Tauro
@ 2026-02-10 11:45       ` Raag Jadav
  2026-02-12  3:25         ` Riana Tauro
  0 siblings, 1 reply; 24+ messages in thread
From: Raag Jadav @ 2026-02-10 11:45 UTC (permalink / raw)
  To: Riana Tauro
  Cc: intel-xe, dri-devel, aravind.iddamsetty, anshuman.gupta,
	rodrigo.vivi, joonas.lahtinen, simona.vetter, airlied,
	pratik.bari, joshua.santosh.ranjan, ashwin.kumar.kulkarni,
	shubham.kumar, ravi.kishore.koppuravuri, Himal Prasad Ghimiray

On Tue, Feb 10, 2026 at 11:28:39AM +0530, Riana Tauro wrote:
> On 2/5/2026 9:00 PM, Raag Jadav wrote:
> > On Mon, Feb 02, 2026 at 12:14:00PM +0530, Riana Tauro wrote:
> > > PVC supports GT error reporting via vector registers along with
> > > error status register. Add support to report these errors and
> > > update respective counters. In case of a Subslice error reported
> > > by vector register, process the error status register
> > > for applicable bits.
> > > 
> > > The counter is embedded in the xe drm ras structure and is
> > > exposed to the userspace using the drm_ras generic netlink
> > > interface.
> > > 
> > > $ sudo ynl --family drm_ras --do query-error-counter  --json
> > 
> > We usually add '\' at the end for wrapping commands so that they're easy
> > to apply directly (and same for all other patches where applicable).
> > 
> > >    '{"node-id":0, "error-id":1}'
> > 
> > Ditto.
> 
> Will fix this

Thank you.

> > > {'error-id': 1, 'error-name': 'core-compute', 'error-value': 0}
> > > 
> > > Co-developed-by: Himal Prasad Ghimiray <himal.prasad.ghimiray@intel.com>
> > > Signed-off-by: Himal Prasad Ghimiray <himal.prasad.ghimiray@intel.com>
> > > Signed-off-by: Riana Tauro <riana.tauro@intel.com>
> > > ---
> > > v2: Add ID's and names as uAPI (Rodrigo)
> > >      Add documentation
> > >      Modify commit message
> > > 
> > > v3: remove 'error' from counters
> > >      use drmm_kcalloc
> > >      add a for_each for severity
> > >      differentiate error classes and severity in UAPI(Raag)
> > >      Use correctable and uncorrectable in uapi (Pratik / Aravind)
> > > 
> > > v4: modify enums in UAPI
> > >      improve comments
> > >      add bounds check in handler
> > >      add error mask macro (Raag)
> > >      use atomic_t
> > >      add null pointer checks
> > > ---
> > >   drivers/gpu/drm/xe/regs/xe_hw_error_regs.h |  62 ++++++-
> > >   drivers/gpu/drm/xe/xe_hw_error.c           | 199 +++++++++++++++++++--
> > >   2 files changed, 241 insertions(+), 20 deletions(-)
> > > 
> > > diff --git a/drivers/gpu/drm/xe/regs/xe_hw_error_regs.h b/drivers/gpu/drm/xe/regs/xe_hw_error_regs.h
> > > index c146b9ef44eb..17982a335941 100644
> > > --- a/drivers/gpu/drm/xe/regs/xe_hw_error_regs.h
> > > +++ b/drivers/gpu/drm/xe/regs/xe_hw_error_regs.h
> > > @@ -6,15 +6,59 @@
> > >   #ifndef _XE_HW_ERROR_REGS_H_
> > >   #define _XE_HW_ERROR_REGS_H_
> > > -#define HEC_UNCORR_ERR_STATUS(base)                    XE_REG((base) + 0x118)
> > > -#define    UNCORR_FW_REPORTED_ERR                      BIT(6)
> > > +#define HEC_UNCORR_ERR_STATUS(base)			XE_REG((base) + 0x118)
> > > +#define   UNCORR_FW_REPORTED_ERR			REG_BIT(6)
> > > -#define HEC_UNCORR_FW_ERR_DW0(base)                    XE_REG((base) + 0x124)
> > > +#define HEC_UNCORR_FW_ERR_DW0(base)			XE_REG((base) + 0x124)
> > > +
> > > +#define ERR_STAT_GT_COR					0x100160
> > > +#define   EU_GRF_COR_ERR				REG_BIT(15)
> > > +#define   EU_IC_COR_ERR					REG_BIT(14)
> > > +#define   SLM_COR_ERR					REG_BIT(13)
> > > +#define   GUC_COR_ERR					REG_BIT(1)
> > > +
> > > +#define ERR_STAT_GT_NONFATAL				0x100164
> > > +#define ERR_STAT_GT_FATAL				0x100168
> > > +#define   EU_GRF_FAT_ERR				REG_BIT(15)
> > > +#define   SLM_FAT_ERR					REG_BIT(13)
> > > +#define   GUC_FAT_ERR					REG_BIT(6)
> > > +#define   FPU_FAT_ERR					REG_BIT(3)
> > > +
> > > +#define ERR_STAT_GT_REG(x)				XE_REG(_PICK_EVEN((x), \
> > > +									  ERR_STAT_GT_COR, \
> > > +									  ERR_STAT_GT_NONFATAL))
> > > +
> > > +#define PVC_COR_ERR_MASK				(GUC_COR_ERR | SLM_COR_ERR | \
> > > +							 EU_IC_COR_ERR | EU_GRF_COR_ERR)
> > > +
> > > +#define PVC_FAT_ERR_MASK				(FPU_FAT_ERR | GUC_FAT_ERR | \
> > > +							EU_GRF_FAT_ERR | SLM_FAT_ERR)
> > 
> > Nit: Whitespace please!
> 
> alignment?

Yes please!

> > > +#define DEV_ERR_STAT_NONFATAL				0x100178
> > > +#define DEV_ERR_STAT_CORRECTABLE			0x10017c
> > > +#define DEV_ERR_STAT_REG(x)				XE_REG(_PICK_EVEN((x), \
> > > +									  DEV_ERR_STAT_CORRECTABLE, \
> > > +									  DEV_ERR_STAT_NONFATAL))
> > 
> > I know it was already like this but how does this evaluate for FATAL?
> 
> #define _PICK_EVEN(__index, __a, __b) ((__a) + (__index) * ((__b) - (__a)))
> (index, 0x10017c, 0x100178)  = (0x10017c + index * (0x100178 - 0x10017c));
> 0 =  0x10017c
> 1 =  0x100178
> 2 =  0x100174

The addresses are usually unsigned, so I got lost there a bit.
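
For reference, the same arithmetic holds with unsigned operands, because the "negative" stride simply wraps modulo 2^32. A minimal userspace sketch, assuming the _PICK_EVEN definition quoted above and 32-bit register offsets; the helper name dev_err_stat_addr() is hypothetical, XE_REG() is left out, and the index range 0..2 is an assumption taken from the table above:

```c
#include <assert.h>
#include <stdint.h>

/* Same expression as the _PICK_EVEN definition quoted above. */
#define _PICK_EVEN(__index, __a, __b) ((__a) + (__index) * ((__b) - (__a)))

#define DEV_ERR_STAT_NONFATAL		0x100178
#define DEV_ERR_STAT_CORRECTABLE	0x10017c

/*
 * Hypothetical helper: evaluates the offset entirely in uint32_t so the
 * wraparound is explicit.
 */
static inline uint32_t dev_err_stat_addr(uint32_t index)
{
	const uint32_t a = DEV_ERR_STAT_CORRECTABLE;
	const uint32_t b = DEV_ERR_STAT_NONFATAL;

	/* Assumed: only three error types (indices 0, 1, 2) are ever passed. */
	assert(index <= 2);

	/* (b - a) wraps to 0xfffffffc, which acts as -4 modulo 2^32. */
	return _PICK_EVEN(index, a, b);
}
```

With index 2 this gives 0x10017c + 2 * 0xfffffffc, which wraps to 0x100174 and matches the value listed above.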

> > > +#define   XE_CSC_ERROR					17
> > > +#define   XE_GT_ERROR					0
> > > +
> > > +#define ERR_STAT_GT_FATAL_VECTOR_0			0x100260
> > > +#define ERR_STAT_GT_FATAL_VECTOR_1			0x100264
> > > +
> > > +#define ERR_STAT_GT_FATAL_VECTOR_REG(x)			XE_REG(_PICK_EVEN((x), \
> > > +								  ERR_STAT_GT_FATAL_VECTOR_0, \
> > > +								  ERR_STAT_GT_FATAL_VECTOR_1))
> > > +
> > > +#define ERR_STAT_GT_COR_VECTOR_0			0x1002a0
> > > +#define ERR_STAT_GT_COR_VECTOR_1			0x1002a4
> > > +
> > > +#define ERR_STAT_GT_COR_VECTOR_REG(x)			XE_REG(_PICK_EVEN((x), \
> > > +									  ERR_STAT_GT_COR_VECTOR_0, \
> > > +									  ERR_STAT_GT_COR_VECTOR_1))
> > > +
> > > +#define ERR_STAT_GT_VECTOR_REG(hw_err, x)		(hw_err == HARDWARE_ERROR_CORRECTABLE ? \
> > > +							ERR_STAT_GT_COR_VECTOR_REG(x) : \
> > > +							ERR_STAT_GT_FATAL_VECTOR_REG(x))
> > 
> > Ditto for whitespace.
> > 
> > > -#define DEV_ERR_STAT_NONFATAL			0x100178
> > > -#define DEV_ERR_STAT_CORRECTABLE		0x10017c
> > > -#define DEV_ERR_STAT_REG(x)			XE_REG(_PICK_EVEN((x), \
> > > -								  DEV_ERR_STAT_CORRECTABLE, \
> > > -								  DEV_ERR_STAT_NONFATAL))
> > > -#define   XE_CSC_ERROR				BIT(17)
> > >   #endif
> > > diff --git a/drivers/gpu/drm/xe/xe_hw_error.c b/drivers/gpu/drm/xe/xe_hw_error.c
> > > index 2019aaaa1ebe..ff31fb322c8a 100644
> > > --- a/drivers/gpu/drm/xe/xe_hw_error.c
> > > +++ b/drivers/gpu/drm/xe/xe_hw_error.c
> > > @@ -3,6 +3,7 @@
> > >    * Copyright © 2025 Intel Corporation
> > >    */
> > > +#include <linux/bitmap.h>
> > >   #include <linux/fault-inject.h>
> > >   #include "regs/xe_gsc_regs.h"
> > > @@ -15,7 +16,13 @@
> > >   #include "xe_mmio.h"
> > >   #include "xe_survivability_mode.h"
> > > -#define  HEC_UNCORR_FW_ERR_BITS 4
> > > +#define  GT_HW_ERROR_MAX_ERR_BITS	16
> > > +#define  HEC_UNCORR_FW_ERR_BITS		4
> > > +#define  XE_RAS_REG_SIZE		32
> > > +
> > > +#define  PVC_ERROR_MASK_SET(hw_err, err_bit) \
> > > +	((hw_err == HARDWARE_ERROR_CORRECTABLE) ? (BIT(err_bit) & PVC_COR_ERR_MASK) : \
> > > +	(BIT(err_bit) & PVC_FAT_ERR_MASK))
> > 
> > I'd write this as below and move it to xe_hw_error_regs.h
> 
> This is not specific to register selection or defining bits. It's related to
> the mask, so the .c file should be the right place.

Don't the mask bits come from register def?
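
For reference, a minimal userspace sketch of the split form quoted just below, built only from the bit definitions in this patch. REG_BIT() here is a simplified stand-in for the kernel macro and the enum ordering is an assumption; the closing parenthesis on the combined macro is added so that it compiles.

```c
#include <assert.h>
#include <stdint.h>

#define REG_BIT(n)	((uint32_t)1 << (n))	/* simplified stand-in */

/* Bit positions as defined in xe_hw_error_regs.h by this patch. */
#define EU_GRF_COR_ERR	REG_BIT(15)
#define EU_IC_COR_ERR	REG_BIT(14)
#define SLM_COR_ERR	REG_BIT(13)
#define GUC_COR_ERR	REG_BIT(1)

#define EU_GRF_FAT_ERR	REG_BIT(15)
#define SLM_FAT_ERR	REG_BIT(13)
#define GUC_FAT_ERR	REG_BIT(6)
#define FPU_FAT_ERR	REG_BIT(3)

#define PVC_COR_ERR_MASK	(GUC_COR_ERR | SLM_COR_ERR | \
				 EU_IC_COR_ERR | EU_GRF_COR_ERR)
#define PVC_FAT_ERR_MASK	(FPU_FAT_ERR | GUC_FAT_ERR | \
				 EU_GRF_FAT_ERR | SLM_FAT_ERR)

/* Assumed ordering; the real enum lives in the xe driver headers. */
enum hardware_error {
	HARDWARE_ERROR_CORRECTABLE,
	HARDWARE_ERROR_NONFATAL,
	HARDWARE_ERROR_FATAL,
};

#define PVC_COR_ERR_MASK_SET(err_bit)	(PVC_COR_ERR_MASK & REG_BIT(err_bit))
#define PVC_FAT_ERR_MASK_SET(err_bit)	(PVC_FAT_ERR_MASK & REG_BIT(err_bit))

#define PVC_ERR_MASK_SET(hw_err, err_bit) \
	(((hw_err) == HARDWARE_ERROR_CORRECTABLE) ? \
	 PVC_COR_ERR_MASK_SET(err_bit) : \
	 PVC_FAT_ERR_MASK_SET(err_bit))

/* Sanity check on the masks: bits 1, 13, 14, 15 and bits 3, 6, 13, 15. */
static_assert(PVC_COR_ERR_MASK == 0xe002, "unexpected correctable mask");
static_assert(PVC_FAT_ERR_MASK == 0xa048, "unexpected fatal mask");
```

Either placement compiles the same way; the sketch only shows that the macro depends on nothing beyond the register bit definitions.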

> > #define PVC_COR_ERR_MASK_SET(err_bit)			(PVC_COR_ERR_MASK & REG_BIT(err_bit))
> > #define PVC_FAT_ERR_MASK_SET(err_bit)			(PVC_FAT_ERR_MASK & REG_BIT(err_bit))
> > 
> > #define PVC_ERR_MASK_SET(hw_err, err_bit)		((hw_err == HARDWARE_ERROR_CORRECTABLE) ? \
> > 								PVC_COR_ERR_MASK_SET(err_bit) : \
> > 								PVC_FAT_ERR_MASK_SET(err_bit))
> > 
> > ...
> > 
> > > +static void gt_hw_error_handler(struct xe_tile *tile, const enum hardware_error hw_err,
> > > +				u32 error_id)
> > > +{
> > > +	const enum drm_xe_ras_error_severity severity = hw_err_to_severity(hw_err);
> > > +	struct xe_device *xe = tile_to_xe(tile);
> > > +	struct xe_drm_ras *ras = &xe->ras;
> > > +	struct xe_drm_ras_counter *info = ras->info[severity];
> > > +	struct xe_mmio *mmio = &tile->mmio;
> > > +	unsigned long err_stat = 0;
> > > +	int i, len;
> > > +
> > > +	if (xe->info.platform != XE_PVC)
> > > +		return;
> > > +
> > > +	if (!info)
> > > +		return;
> > 
> > Since info allocation is not related to hardware, we shouldn't even be
> > at this point without it. So let's not hide bugs and fail probe instead.
> 
> Yes, currently it is supported only on PVC. I can remove this one since there
> is a PVC check, but I cannot remove the one suggested below.

Fair, but please also return the allocation failure. With that perhaps
xe_hw_error_init() will be int now.

Raag

> > > +	if (hw_err == HARDWARE_ERROR_NONFATAL) {
> > > +		atomic_inc(&info[error_id].counter);
> > > +		log_hw_error(tile, info[error_id].name, severity);
> > > +		return;
> > > +	}
> > 
> > ...
> > 
> > >   static void hw_error_source_handler(struct xe_tile *tile, const enum hardware_error hw_err)
> > >   {
> > >   	const enum drm_xe_ras_error_severity severity = hw_err_to_severity(hw_err);
> > >   	const char *severity_str = error_severity[severity];
> > >   	struct xe_device *xe = tile_to_xe(tile);
> > > -	unsigned long flags;
> > > -	u32 err_src;
> > > +	struct xe_drm_ras *ras = &xe->ras;
> > > +	struct xe_drm_ras_counter *info = ras->info[severity];
> > > +	unsigned long flags, err_src;
> > > +	u32 err_bit;
> > > -	if (xe->info.platform != XE_BATTLEMAGE)
> > > +	if (!IS_DGFX(xe))
> > >   		return;
> > >   	spin_lock_irqsave(&xe->irq.lock, flags);
> > > @@ -108,11 +242,53 @@ static void hw_error_source_handler(struct xe_tile *tile, const enum hardware_er
> > >   		goto unlock;
> > >   	}
> > > -	if (err_src & XE_CSC_ERROR)
> > > +	/*
> > > +	 * On encountering CSC firmware errors, the graphics device becomes unrecoverable
> > > +	 * so return immediately on error. The only way to recover from these errors is
> > > +	 * firmware flash. The device will enter Runtime Survivability mode when such
> > > +	 * errors are detected.
> > > +	 */
> > > +	if (err_src & XE_CSC_ERROR) {
> > >   		csc_hw_error_handler(tile, hw_err);
> > > +		goto clear_reg;
> > > +	}
> > > -	xe_mmio_write32(&tile->mmio, DEV_ERR_STAT_REG(hw_err), err_src);
> > > +	if (!info)
> > > +		goto clear_reg;
> > 
> > Same as above.
> > 
> > Raag


* Re: [PATCH v5 5/5] drm/xe/xe_hw_error: Add support for PVC SoC errors
  2026-02-10  6:32     ` Riana Tauro
@ 2026-02-10 11:52       ` Raag Jadav
  0 siblings, 0 replies; 24+ messages in thread
From: Raag Jadav @ 2026-02-10 11:52 UTC (permalink / raw)
  To: Riana Tauro
  Cc: intel-xe, dri-devel, aravind.iddamsetty, anshuman.gupta,
	rodrigo.vivi, joonas.lahtinen, simona.vetter, airlied,
	pratik.bari, joshua.santosh.ranjan, ashwin.kumar.kulkarni,
	shubham.kumar, ravi.kishore.koppuravuri, Himal Prasad Ghimiray

On Tue, Feb 10, 2026 at 12:02:11PM +0530, Riana Tauro wrote:
> On 2/5/2026 11:40 PM, Raag Jadav wrote:
> > On Mon, Feb 02, 2026 at 12:14:01PM +0530, Riana Tauro wrote:
> > > Report the SoC nonfatal/fatal hardware error and update the counters.
> > > 
> > > $ sudo ynl --family drm_ras --do query-error-counter  --json '{"node-id":0, "error-id":2}'
> > 
> > Same comment as last patch.
> 
> The second line is the output, so it is not necessary.

Read it wrong, my bad.

> > > {'error-id': 2, 'error-name': 'soc-internal', 'error-value': 0}
> > > 
> > > Co-developed-by: Himal Prasad Ghimiray <himal.prasad.ghimiray@intel.com>
> > > Signed-off-by: Himal Prasad Ghimiray <himal.prasad.ghimiray@intel.com>
> > > Signed-off-by: Riana Tauro <riana.tauro@intel.com>
> > > ---
> > > v2: Add ID's and names as uAPI (Rodrigo)
> > > 
> > > v3: reorder and align arrays
> > >      remove redundant string err
> > >      use REG_BIT
> > >      fix aesthetic review comments (Raag)
> > >      use only correctable/uncorrectable error severity (Aravind)
> > > 
> > > v4: fix comments
> > >      use master as variable name
> > >      add static_assert (Raag)
> > > ---
> > >   drivers/gpu/drm/xe/regs/xe_hw_error_regs.h |  24 +++
> > >   drivers/gpu/drm/xe/xe_hw_error.c           | 221 ++++++++++++++++++++-
> > >   2 files changed, 244 insertions(+), 1 deletion(-)
> > > 
> > > diff --git a/drivers/gpu/drm/xe/regs/xe_hw_error_regs.h b/drivers/gpu/drm/xe/regs/xe_hw_error_regs.h
> > > index 17982a335941..a89a07d067fc 100644
> > > --- a/drivers/gpu/drm/xe/regs/xe_hw_error_regs.h
> > > +++ b/drivers/gpu/drm/xe/regs/xe_hw_error_regs.h
> > > @@ -41,6 +41,7 @@
> > >   									  DEV_ERR_STAT_NONFATAL))
> > >   #define   XE_CSC_ERROR					17
> > 
> > I overlooked this in the last patch but I think this should be used as
> > 
> > 	if (err_src & REG_BIT(XE_CSC_ERROR))
> 
> Thanks for catching this. When I changed the bit, I retained the previous code.
> Will fix this and test.

Yep.
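
For reference, once XE_CSC_ERROR is a bit index rather than a mask, `err_src & XE_CSC_ERROR` tests bits 0 and 4 (17 == 0x11) instead of bit 17. A minimal userspace sketch of the difference; REG_BIT() is a simplified stand-in for the kernel macro and both helper names are hypothetical:

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

#define REG_BIT(n)	((uint32_t)1 << (n))	/* simplified stand-in */

#define XE_CSC_ERROR	17
#define XE_GT_ERROR	0

static_assert(XE_CSC_ERROR < 32, "bit index must fit a 32-bit register");

/* Check retained from before the change: the index is used as a mask. */
static inline bool csc_error_index_as_mask(uint32_t err_src)
{
	return err_src & XE_CSC_ERROR;
}

/* Check as suggested in the review: convert the index to a mask first. */
static inline bool csc_error_bit_test(uint32_t err_src)
{
	return err_src & REG_BIT(XE_CSC_ERROR);
}
```

With the index used as a mask, a lone GT error (bit 0) is treated as a CSC error, while a real CSC error (bit 17) goes unnoticed.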

> > > +#define   XE_SOC_ERROR					16
> > >   #define   XE_GT_ERROR					0
> > >   #define ERR_STAT_GT_FATAL_VECTOR_0			0x100260
> > > @@ -61,4 +62,27 @@
> > >   							ERR_STAT_GT_COR_VECTOR_REG(x) : \
> > >   							ERR_STAT_GT_FATAL_VECTOR_REG(x))
> > > +#define SOC_PVC_MASTER_BASE				0x282000
> > > +#define SOC_PVC_SLAVE_BASE				0x283000
> > > +
> > > +#define SOC_GCOERRSTS					0x200
> > > +#define SOC_GNFERRSTS					0x210
> > > +#define SOC_GLOBAL_ERR_STAT_REG(base, x)		XE_REG(_PICK_EVEN((x), \
> > > +									  (base) + SOC_GCOERRSTS, \
> > > +									  (base) + SOC_GNFERRSTS))
> > > +#define   SOC_SLAVE_IEH					REG_BIT(1)
> > > +#define   SOC_IEH0_LOCAL_ERR_STATUS			REG_BIT(0)
> > > +#define   SOC_IEH1_LOCAL_ERR_STATUS			REG_BIT(0)
> > > +
> > > +#define SOC_GSYSEVTCTL					0x264
> > > +#define SOC_GSYSEVTCTL_REG(master, slave, x)		XE_REG(_PICK_EVEN((x), \
> > > +									  (master) + SOC_GSYSEVTCTL, \
> > > +									  (slave) + SOC_GSYSEVTCTL))
> > > +
> > > +#define SOC_LERRUNCSTS					0x280
> > > +#define SOC_LERRCORSTS					0x294
> > > +#define SOC_LOCAL_ERR_STAT_REG(base, hw_err)		XE_REG(hw_err == HARDWARE_ERROR_CORRECTABLE ? \
> > > +							       (base) + SOC_LERRCORSTS : \
> > > +							       (base) + SOC_LERRUNCSTS)
> > > +
> > >   #endif
> > > diff --git a/drivers/gpu/drm/xe/xe_hw_error.c b/drivers/gpu/drm/xe/xe_hw_error.c
> > > index ff31fb322c8a..159ec796386a 100644
> > > --- a/drivers/gpu/drm/xe/xe_hw_error.c
> > > +++ b/drivers/gpu/drm/xe/xe_hw_error.c
> > > @@ -19,6 +19,7 @@
> > >   #define  GT_HW_ERROR_MAX_ERR_BITS	16
> > >   #define  HEC_UNCORR_FW_ERR_BITS		4
> > >   #define  XE_RAS_REG_SIZE		32
> > > +#define  XE_SOC_NUM_IEH			2
> > >   #define  PVC_ERROR_MASK_SET(hw_err, err_bit) \
> > >   	((hw_err == HARDWARE_ERROR_CORRECTABLE) ? (BIT(err_bit) & PVC_COR_ERR_MASK) : \
> > > @@ -36,7 +37,8 @@ static const char * const hec_uncorrected_fw_errors[] = {
> > >   };
> > >   static const unsigned long xe_hw_error_map[] = {
> > > -	[XE_GT_ERROR] = DRM_XE_RAS_ERR_COMP_CORE_COMPUTE,
> > > +	[XE_GT_ERROR]	= DRM_XE_RAS_ERR_COMP_CORE_COMPUTE,
> > 
> > Unneeded churn, please align in the original patch.
> > 
> > > +	[XE_SOC_ERROR]	= DRM_XE_RAS_ERR_COMP_SOC_INTERNAL,
> > >   };
> > >   enum gt_vector_regs {
> > > @@ -60,6 +62,102 @@ static enum drm_xe_ras_error_severity hw_err_to_severity(enum hardware_error hw_
> > >   	return DRM_XE_RAS_ERR_SEV_UNCORRECTABLE;
> > >   }
> > > +static const char * const pvc_master_global_err_reg[] = {
> > > +	[0 ... 1]	= "Undefined",
> > > +	[2]		= "HBM SS0: Channel0",
> > > +	[3]		= "HBM SS0: Channel1",
> > > +	[4]		= "HBM SS0: Channel2",
> > > +	[5]		= "HBM SS0: Channel3",
> > > +	[6]		= "HBM SS0: Channel4",
> > > +	[7]		= "HBM SS0: Channel5",
> > > +	[8]		= "HBM SS0: Channel6",
> > > +	[9]		= "HBM SS0: Channel7",
> > > +	[10]		= "HBM SS1: Channel0",
> > > +	[11]		= "HBM SS1: Channel1",
> > > +	[12]		= "HBM SS1: Channel2",
> > > +	[13]		= "HBM SS1: Channel3",
> > > +	[14]		= "HBM SS1: Channel4",
> > > +	[15]		= "HBM SS1: Channel5",
> > > +	[16]		= "HBM SS1: Channel6",
> > > +	[17]		= "HBM SS1: Channel7",
> > > +	[18 ... 31]	= "Undefined",
> > > +};
> > > +
> > 
> > Redundant blank line.
> > 
> 
> Then checkpatch complains :(

I think we can ignore it here, but I'll leave it up to you.

> > > +static_assert(ARRAY_SIZE(pvc_master_global_err_reg) == XE_RAS_REG_SIZE);
> > > +
> > > +static const char * const pvc_slave_global_err_reg[] = {
> > > +	[0]		= "Undefined",
> > > +	[1]		= "HBM SS2: Channel0",
> > > +	[2]		= "HBM SS2: Channel1",
> > > +	[3]		= "HBM SS2: Channel2",
> > > +	[4]		= "HBM SS2: Channel3",
> > > +	[5]		= "HBM SS2: Channel4",
> > > +	[6]		= "HBM SS2: Channel5",
> > > +	[7]		= "HBM SS2: Channel6",
> > > +	[8]		= "HBM SS2: Channel7",
> > > +	[9]		= "HBM SS3: Channel0",
> > > +	[10]		= "HBM SS3: Channel1",
> > > +	[11]		= "HBM SS3: Channel2",
> > > +	[12]		= "HBM SS3: Channel3",
> > > +	[13]		= "HBM SS3: Channel4",
> > > +	[14]		= "HBM SS3: Channel5",
> > > +	[15]		= "HBM SS3: Channel6",
> > > +	[16]		= "HBM SS3: Channel7",
> > > +	[17]		= "Undefined",
> > > +	[18]		= "ANR MDFI",
> > > +	[19 ... 31]	= "Undefined",
> > > +};
> > > +
> > 
> > Ditto.
> > 
> > > +static_assert(ARRAY_SIZE(pvc_slave_global_err_reg) == XE_RAS_REG_SIZE);
> > > +
> > > +static const char * const pvc_slave_local_fatal_err_reg[] = {
> > > +	[0]		= "Local IEH: Malformed PCIe AER",
> > > +	[1]		= "Local IEH: Malformed PCIe ERR",
> > > +	[2]		= "Local IEH: UR conditions in IEH",
> > > +	[3]		= "Local IEH: From SERR Sources",
> > > +	[4 ... 19]	= "Undefined",
> > > +	[20]		= "Malformed MCA error packet (HBM/Punit)",
> > > +	[21 ... 31]	= "Undefined",
> > > +};
> > > +
> > 
> > Ditto.
> > 
> > > +static_assert(ARRAY_SIZE(pvc_slave_local_fatal_err_reg) == XE_RAS_REG_SIZE);
> > > +
> > > +static const char * const pvc_master_local_fatal_err_reg[] = {
> > > +	[0]		= "Local IEH: Malformed IOSF PCIe AER",
> > > +	[1]		= "Local IEH: Malformed IOSF PCIe ERR",
> > > +	[2]		= "Local IEH: UR RESPONSE",
> > > +	[3]		= "Local IEH: From SERR SPI controller",
> > > +	[4]		= "Base Die MDFI T2T",
> > > +	[5]		= "Undefined",
> > > +	[6]		= "Base Die MDFI T2C",
> > > +	[7]		= "Undefined",
> > > +	[8]		= "Invalid CSC PSF Command Parity",
> > > +	[9]		= "Invalid CSC PSF Unexpected Completion",
> > > +	[10]		= "Invalid CSC PSF Unsupported Request",
> > > +	[11]		= "Invalid PCIe PSF Command Parity",
> > > +	[12]		= "PCIe PSF Unexpected Completion",
> > > +	[13]		= "PCIe PSF Unsupported Request",
> > > +	[14 ... 19]	= "Undefined",
> > > +	[20]		= "Malformed MCA error packet (HBM/Punit)",
> > > +	[21 ... 31]	= "Undefined",
> > > +};
> > > +
> > 
> > Ditto.
> > 
> > > +static_assert(ARRAY_SIZE(pvc_master_local_fatal_err_reg) == XE_RAS_REG_SIZE);
> > > +
> > > +static const char * const pvc_master_local_nonfatal_err_reg[] = {
> > > +	[0 ... 3]	= "Undefined",
> > > +	[4]		= "Base Die MDFI T2T",
> > > +	[5]		= "Undefined",
> > > +	[6]		= "Base Die MDFI T2C",
> > > +	[7]		= "Undefined",
> > > +	[8]		= "Invalid CSC PSF Command Parity",
> > > +	[9]		= "Invalid CSC PSF Unexpected Completion",
> > > +	[10]		= "Invalid PCIe PSF Command Parity",
> > > +	[11 ... 31]	= "Undefined",
> > > +};
> > > +
> > 
> > Ditto.
> > 
> > > +static_assert(ARRAY_SIZE(pvc_master_local_nonfatal_err_reg) == XE_RAS_REG_SIZE);
> > > +
> > >   static bool fault_inject_csc_hw_error(void)
> > >   {
> > >   	return IS_ENABLED(CONFIG_DEBUG_FS) && should_fail(&inject_csc_hw_error, 1);
> > > @@ -138,6 +236,26 @@ static void log_gt_err(struct xe_tile *tile, const char *name, int i, u32 err,
> > >   				    name, severity_str, i, err);
> > >   }
> > > +static void log_soc_error(struct xe_tile *tile, const char * const *reg_info,
> > > +			  const enum drm_xe_ras_error_severity severity, u32 err_bit, u32 index)
> > > +{
> > > +	const char *severity_str = error_severity[severity];
> > > +	struct xe_device *xe = tile_to_xe(tile);
> > > +	struct xe_drm_ras *ras = &xe->ras;
> > > +	struct xe_drm_ras_counter *info = ras->info[severity];
> > > +	const char *name;
> > > +
> > > +	name = reg_info[err_bit];
> > > +
> > > +	if (strcmp(name, "Undefined")) {
> > > +		if (severity == DRM_XE_RAS_ERR_SEV_CORRECTABLE)
> > > +			drm_warn(&xe->drm, "%s SOC %s detected", name, severity_str);
> > > +		else
> > > +			drm_err_ratelimited(&xe->drm, "%s SOC %s detected", name, severity_str);
> > > +		atomic_inc(&info[index].counter);
> > > +	}
> > > +}
> > > +
> > >   static void gt_hw_error_handler(struct xe_tile *tile, const enum hardware_error hw_err,
> > >   				u32 error_id)
> > >   {
> > > @@ -221,6 +339,104 @@ static void gt_hw_error_handler(struct xe_tile *tile, const enum hardware_error
> > >   	}
> > >   }
> > > +static void soc_slave_ieh_handler(struct xe_tile *tile, const enum hardware_error hw_err, u32 error_id)
> > > +{
> > > +	const enum drm_xe_ras_error_severity severity = hw_err_to_severity(hw_err);
> > > +	unsigned long slave_global_errstat, slave_local_errstat;
> > > +	struct xe_mmio *mmio = &tile->mmio;
> > > +	u32 regbit, slave_base;
> > > +
> > > +	slave_base = SOC_PVC_SLAVE_BASE;
> > 
> > Just name it 'slave' and it'll probably help remove the line wrapping below.
> 
> There is no wrapping except the log_soc_error. This change won't fix it

Yes, just to make it consistent with below.

> > > +	slave_global_errstat = xe_mmio_read32(mmio, SOC_GLOBAL_ERR_STAT_REG(slave_base, hw_err));
> > > +
> > > +	if (slave_global_errstat & SOC_IEH1_LOCAL_ERR_STATUS) {
> > > +		slave_local_errstat = xe_mmio_read32(mmio, SOC_LOCAL_ERR_STAT_REG(slave_base, hw_err));
> > > +
> > > +		if (hw_err == HARDWARE_ERROR_FATAL) {
> > > +			for_each_set_bit(regbit, &slave_local_errstat, XE_RAS_REG_SIZE)
> > > +				log_soc_error(tile, pvc_slave_local_fatal_err_reg, severity,
> > > +					      regbit, error_id);
> > > +		}
> > > +
> > > +		xe_mmio_write32(mmio, SOC_LOCAL_ERR_STAT_REG(slave_base, hw_err),
> > > +				slave_local_errstat);
> > > +	}
> > > +
> > > +	for_each_set_bit(regbit, &slave_global_errstat, XE_RAS_REG_SIZE)
> > > +		log_soc_error(tile, pvc_slave_global_err_reg, severity, regbit, error_id);
> > > +
> > > +	xe_mmio_write32(mmio, SOC_GLOBAL_ERR_STAT_REG(slave_base, hw_err), slave_global_errstat);
> > > +}
> > > +
> > > +static void soc_hw_error_handler(struct xe_tile *tile, const enum hardware_error hw_err,
> > > +				 u32 error_id)
> > > +{
> > > +	const enum drm_xe_ras_error_severity severity = hw_err_to_severity(hw_err);
> > > +	struct xe_device *xe = tile_to_xe(tile);
> > > +	struct xe_mmio *mmio = &tile->mmio;
> > > +	unsigned long master_global_errstat, master_local_errstat;
> > > +	u32 master_base, slave_base, regbit;
> > > +	int i;
> > > +
> > > +	if (xe->info.platform != XE_PVC)
> > > +		return;
> > > +
> > > +	master_base = SOC_PVC_MASTER_BASE;
> > > +	slave_base = SOC_PVC_SLAVE_BASE;
> > 
> > Ditto. Just 'master' and 'slave' will help remove the line wrapping below.
> 
> Yeah this will help. Will add this change

Thank you.

Raag

> > > +	/* Mask error type in GSYSEVTCTL so that no new errors of the type will be reported */
> > > +	for (i = 0; i < XE_SOC_NUM_IEH; i++)
> > > +		xe_mmio_write32(mmio, SOC_GSYSEVTCTL_REG(master_base, slave_base, i),
> > > +				~REG_BIT(hw_err));
> > > +
> > > +	if (hw_err == HARDWARE_ERROR_CORRECTABLE) {
> > > +		xe_mmio_write32(mmio, SOC_GLOBAL_ERR_STAT_REG(master_base, hw_err),
> > > +				REG_GENMASK(31, 0));
> > > +		xe_mmio_write32(mmio, SOC_LOCAL_ERR_STAT_REG(master_base, hw_err),
> > > +				REG_GENMASK(31, 0));
> > > +		xe_mmio_write32(mmio, SOC_GLOBAL_ERR_STAT_REG(slave_base, hw_err),
> > > +				REG_GENMASK(31, 0));
> > > +		xe_mmio_write32(mmio, SOC_LOCAL_ERR_STAT_REG(slave_base, hw_err),
> > > +				REG_GENMASK(31, 0));
> > > +		goto unmask_gsysevtctl;
> > > +	}
> > > +
> > > +	/*
> > > +	 * Read the master global IEH error register if BIT(1) is set then process
> > 
> > Missing comma after 'register'.
> > 
> > > +	 * the slave IEH first. If BIT(0) in global error register is set then process
> > > +	 * the corresponding local error registers.
> > > +	 */
> > > +	master_global_errstat = xe_mmio_read32(mmio, SOC_GLOBAL_ERR_STAT_REG(master_base, hw_err));
> > > +	if (master_global_errstat & SOC_SLAVE_IEH)
> > > +		soc_slave_ieh_handler(tile, hw_err, error_id);
> > > +
> > > +	if (master_global_errstat & SOC_IEH0_LOCAL_ERR_STATUS) {
> > > +		master_local_errstat = xe_mmio_read32(mmio, SOC_LOCAL_ERR_STAT_REG(master_base, hw_err));
> > > +
> > > +		for_each_set_bit(regbit, &master_local_errstat, XE_RAS_REG_SIZE) {
> > > +			const char * const *reg_info = (hw_err == HARDWARE_ERROR_FATAL) ?
> > 
> > This looks like it can be outside the loop.
> 
> will fix it.
> 
> 
> Thanks
> Riana
> 
> 
> > 
> > Raag
> > 
> > > +						       pvc_master_local_fatal_err_reg :
> > > +						       pvc_master_local_nonfatal_err_reg;
> > > +
> > > +			log_soc_error(tile, reg_info, severity, regbit, error_id);
> > > +		}
> > > +
> > > +		xe_mmio_write32(mmio, SOC_LOCAL_ERR_STAT_REG(master_base, hw_err),
> > > +				master_local_errstat);
> > > +	}
> > > +
> > > +	for_each_set_bit(regbit, &master_global_errstat, XE_RAS_REG_SIZE)
> > > +		log_soc_error(tile, pvc_master_global_err_reg, severity, regbit, error_id);
> > > +
> > > +	xe_mmio_write32(mmio, SOC_GLOBAL_ERR_STAT_REG(master_base, hw_err),
> > > +			master_global_errstat);
> > > +
> > > +unmask_gsysevtctl:
> > > +	for (i = 0; i < XE_SOC_NUM_IEH; i++)
> > > +		xe_mmio_write32(mmio, SOC_GSYSEVTCTL_REG(master_base, slave_base, i),
> > > +				(HARDWARE_ERROR_MAX << 1) + 1);
> > > +}
> > > +
> > >   static void hw_error_source_handler(struct xe_tile *tile, const enum hardware_error hw_err)
> > >   {
> > >   	const enum drm_xe_ras_error_severity severity = hw_err_to_severity(hw_err);
> > > @@ -283,8 +499,11 @@ static void hw_error_source_handler(struct xe_tile *tile, const enum hardware_er
> > >   					    "TILE%d reported %s %s, bit[%d] is set\n",
> > >   					    tile->id, name, severity_str, err_bit);
> > >   		}
> > > +
> > >   		if (err_bit == XE_GT_ERROR)
> > >   			gt_hw_error_handler(tile, hw_err, error_id);
> > > +		if (err_bit == XE_SOC_ERROR)
> > > +			soc_hw_error_handler(tile, hw_err, error_id);
> > >   	}
> > >   clear_reg:
> > > -- 
> > > 2.47.1
> > > 


* Re: [PATCH v5 4/5] drm/xe/xe_hw_error: Add support for Core-Compute errors
  2026-02-10 11:45       ` Raag Jadav
@ 2026-02-12  3:25         ` Riana Tauro
  0 siblings, 0 replies; 24+ messages in thread
From: Riana Tauro @ 2026-02-12  3:25 UTC (permalink / raw)
  To: Raag Jadav
  Cc: intel-xe, dri-devel, aravind.iddamsetty, anshuman.gupta,
	rodrigo.vivi, joonas.lahtinen, simona.vetter, airlied,
	pratik.bari, joshua.santosh.ranjan, ashwin.kumar.kulkarni,
	shubham.kumar, ravi.kishore.koppuravuri, Himal Prasad Ghimiray



On 2/10/2026 5:15 PM, Raag Jadav wrote:
> On Tue, Feb 10, 2026 at 11:28:39AM +0530, Riana Tauro wrote:
>> On 2/5/2026 9:00 PM, Raag Jadav wrote:
>>> On Mon, Feb 02, 2026 at 12:14:00PM +0530, Riana Tauro wrote:
>>>> PVC supports GT error reporting via vector registers along with
>>>> error status register. Add support to report these errors and
>>>> update respective counters. In case of a Subslice error reported
>>>> by vector register, process the error status register
>>>> for applicable bits.
>>>>
>>>> The counter is embedded in the xe drm ras structure and is
>>>> exposed to the userspace using the drm_ras generic netlink
>>>> interface.
>>>>
>>>> $ sudo ynl --family drm_ras --do query-error-counter  --json
>>>
>>> We usually add '\' at the end for wrapping commands so that they're easy
>>> to apply directly (and same for all other patches where applicable).
>>>
>>>>     '{"node-id":0, "error-id":1}'
>>>
>>> Ditto.
>>
>> Will fix this
> 
> Thank you.
> 
>>>> {'error-id': 1, 'error-name': 'core-compute', 'error-value': 0}
>>>>
>>>> Co-developed-by: Himal Prasad Ghimiray <himal.prasad.ghimiray@intel.com>
>>>> Signed-off-by: Himal Prasad Ghimiray <himal.prasad.ghimiray@intel.com>
>>>> Signed-off-by: Riana Tauro <riana.tauro@intel.com>
>>>> ---
>>>> v2: Add ID's and names as uAPI (Rodrigo)
>>>>       Add documentation
>>>>       Modify commit message
>>>>
>>>> v3: remove 'error' from counters
>>>>       use drmm_kcalloc
>>>>       add a for_each for severity
>>>>       differentiate error classes and severity in UAPI (Raag)
>>>>       Use correctable and uncorrectable in uapi (Pratik / Aravind)
>>>>
>>>> v4: modify enums in UAPI
>>>>       improve comments
>>>>       add bounds check in handler
>>>>       add error mask macro (Raag)
>>>>       use atomic_t
>>>>       add null pointer checks
>>>> ---
>>>>    drivers/gpu/drm/xe/regs/xe_hw_error_regs.h |  62 ++++++-
>>>>    drivers/gpu/drm/xe/xe_hw_error.c           | 199 +++++++++++++++++++--
>>>>    2 files changed, 241 insertions(+), 20 deletions(-)
>>>>
>>>> diff --git a/drivers/gpu/drm/xe/regs/xe_hw_error_regs.h b/drivers/gpu/drm/xe/regs/xe_hw_error_regs.h
>>>> index c146b9ef44eb..17982a335941 100644
>>>> --- a/drivers/gpu/drm/xe/regs/xe_hw_error_regs.h
>>>> +++ b/drivers/gpu/drm/xe/regs/xe_hw_error_regs.h
>>>> @@ -6,15 +6,59 @@
>>>>    #ifndef _XE_HW_ERROR_REGS_H_
>>>>    #define _XE_HW_ERROR_REGS_H_
>>>> -#define HEC_UNCORR_ERR_STATUS(base)                    XE_REG((base) + 0x118)
>>>> -#define    UNCORR_FW_REPORTED_ERR                      BIT(6)
>>>> +#define HEC_UNCORR_ERR_STATUS(base)			XE_REG((base) + 0x118)
>>>> +#define   UNCORR_FW_REPORTED_ERR			REG_BIT(6)
>>>> -#define HEC_UNCORR_FW_ERR_DW0(base)                    XE_REG((base) + 0x124)
>>>> +#define HEC_UNCORR_FW_ERR_DW0(base)			XE_REG((base) + 0x124)
>>>> +
>>>> +#define ERR_STAT_GT_COR					0x100160
>>>> +#define   EU_GRF_COR_ERR				REG_BIT(15)
>>>> +#define   EU_IC_COR_ERR					REG_BIT(14)
>>>> +#define   SLM_COR_ERR					REG_BIT(13)
>>>> +#define   GUC_COR_ERR					REG_BIT(1)
>>>> +
>>>> +#define ERR_STAT_GT_NONFATAL				0x100164
>>>> +#define ERR_STAT_GT_FATAL				0x100168
>>>> +#define   EU_GRF_FAT_ERR				REG_BIT(15)
>>>> +#define   SLM_FAT_ERR					REG_BIT(13)
>>>> +#define   GUC_FAT_ERR					REG_BIT(6)
>>>> +#define   FPU_FAT_ERR					REG_BIT(3)
>>>> +
>>>> +#define ERR_STAT_GT_REG(x)				XE_REG(_PICK_EVEN((x), \
>>>> +									  ERR_STAT_GT_COR, \
>>>> +									  ERR_STAT_GT_NONFATAL))
>>>> +
>>>> +#define PVC_COR_ERR_MASK				(GUC_COR_ERR | SLM_COR_ERR | \
>>>> +							 EU_IC_COR_ERR | EU_GRF_COR_ERR)
>>>> +
>>>> +#define PVC_FAT_ERR_MASK				(FPU_FAT_ERR | GUC_FAT_ERR | \
>>>> +							EU_GRF_FAT_ERR | SLM_FAT_ERR)
>>>
>>> Nit: Whitespace please!
>>
>> alignment?
> 
> Yes please!
> 
>>>> +#define DEV_ERR_STAT_NONFATAL				0x100178
>>>> +#define DEV_ERR_STAT_CORRECTABLE			0x10017c
>>>> +#define DEV_ERR_STAT_REG(x)				XE_REG(_PICK_EVEN((x), \
>>>> +									  DEV_ERR_STAT_CORRECTABLE, \
>>>> +									  DEV_ERR_STAT_NONFATAL))
>>>
>>> I know it was already like this but how does this evaluate for FATAL?
>>
>> #define _PICK_EVEN(__index, __a, __b) ((__a) + (__index) * ((__b) - (__a)))
>> (index, 0x10017c, 0x100178)  = (0x10017c + index * (0x100178 - 0x10017c));
>> 0 =  0x10017c
>> 1 =  0x100178
>> 2 =  0x100174
> 
> The addresses are usually unsigned, so I got lost there a bit.
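
A small userspace sketch of the arithmetic quoted above, in case it helps. It is not part of the patch: PICK_EVEN is a local copy of the kernel's _PICK_EVEN (renamed only to avoid a reserved identifier outside the kernel), and the two helper functions are hypothetical names used for illustration. The offsets are plain int constants, so the stride is simply -4. If the operands were unsigned 32-bit values, the stride would wrap to 0xfffffffc, but the sum modulo 2^32 lands on the same offsets.

```c
#include <stdint.h>

/* Local copy of the kernel's _PICK_EVEN(), renamed for userspace use. */
#define PICK_EVEN(__index, __a, __b) ((__a) + (__index) * ((__b) - (__a)))

/* Offsets as defined in xe_hw_error_regs.h in this patch. */
#define DEV_ERR_STAT_NONFATAL		0x100178
#define DEV_ERR_STAT_CORRECTABLE	0x10017c

/* Hypothetical helper: signed evaluation, stride is (0x100178 - 0x10017c) = -4. */
static uint32_t dev_err_stat_offset(int index)
{
	return (uint32_t)PICK_EVEN(index, DEV_ERR_STAT_CORRECTABLE,
				   DEV_ERR_STAT_NONFATAL);
}

/*
 * Hypothetical helper: unsigned evaluation. The stride wraps to 0xfffffffc,
 * yet the result is identical because the arithmetic is modulo 2^32.
 */
static uint32_t dev_err_stat_offset_unsigned(uint32_t index)
{
	return PICK_EVEN(index, (uint32_t)DEV_ERR_STAT_CORRECTABLE,
			 (uint32_t)DEV_ERR_STAT_NONFATAL);
}
```

Both helpers give 0x10017c, 0x100178 and 0x100174 for index 0, 1 and 2, matching the table above.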
> 
>>>> +#define   XE_CSC_ERROR					17
>>>> +#define   XE_GT_ERROR					0
>>>> +
>>>> +#define ERR_STAT_GT_FATAL_VECTOR_0			0x100260
>>>> +#define ERR_STAT_GT_FATAL_VECTOR_1			0x100264
>>>> +
>>>> +#define ERR_STAT_GT_FATAL_VECTOR_REG(x)			XE_REG(_PICK_EVEN((x), \
>>>> +								  ERR_STAT_GT_FATAL_VECTOR_0, \
>>>> +								  ERR_STAT_GT_FATAL_VECTOR_1))
>>>> +
>>>> +#define ERR_STAT_GT_COR_VECTOR_0			0x1002a0
>>>> +#define ERR_STAT_GT_COR_VECTOR_1			0x1002a4
>>>> +
>>>> +#define ERR_STAT_GT_COR_VECTOR_REG(x)			XE_REG(_PICK_EVEN((x), \
>>>> +									  ERR_STAT_GT_COR_VECTOR_0, \
>>>> +									  ERR_STAT_GT_COR_VECTOR_1))
>>>> +
>>>> +#define ERR_STAT_GT_VECTOR_REG(hw_err, x)		(hw_err == HARDWARE_ERROR_CORRECTABLE ? \
>>>> +							ERR_STAT_GT_COR_VECTOR_REG(x) : \
>>>> +							ERR_STAT_GT_FATAL_VECTOR_REG(x))
>>>
>>> Ditto for whitespace.
>>>
>>>> -#define DEV_ERR_STAT_NONFATAL			0x100178
>>>> -#define DEV_ERR_STAT_CORRECTABLE		0x10017c
>>>> -#define DEV_ERR_STAT_REG(x)			XE_REG(_PICK_EVEN((x), \
>>>> -								  DEV_ERR_STAT_CORRECTABLE, \
>>>> -								  DEV_ERR_STAT_NONFATAL))
>>>> -#define   XE_CSC_ERROR				BIT(17)
>>>>    #endif
>>>> diff --git a/drivers/gpu/drm/xe/xe_hw_error.c b/drivers/gpu/drm/xe/xe_hw_error.c
>>>> index 2019aaaa1ebe..ff31fb322c8a 100644
>>>> --- a/drivers/gpu/drm/xe/xe_hw_error.c
>>>> +++ b/drivers/gpu/drm/xe/xe_hw_error.c
>>>> @@ -3,6 +3,7 @@
>>>>     * Copyright © 2025 Intel Corporation
>>>>     */
>>>> +#include <linux/bitmap.h>
>>>>    #include <linux/fault-inject.h>
>>>>    #include "regs/xe_gsc_regs.h"
>>>> @@ -15,7 +16,13 @@
>>>>    #include "xe_mmio.h"
>>>>    #include "xe_survivability_mode.h"
>>>> -#define  HEC_UNCORR_FW_ERR_BITS 4
>>>> +#define  GT_HW_ERROR_MAX_ERR_BITS	16
>>>> +#define  HEC_UNCORR_FW_ERR_BITS		4
>>>> +#define  XE_RAS_REG_SIZE		32
>>>> +
>>>> +#define  PVC_ERROR_MASK_SET(hw_err, err_bit) \
>>>> +	((hw_err == HARDWARE_ERROR_CORRECTABLE) ? (BIT(err_bit) & PVC_COR_ERR_MASK) : \
>>>> +	(BIT(err_bit) & PVC_FAT_ERR_MASK))
>>>
>>> I'd write this as below and move it to xe_hw_error_regs.h
>>
>> This is not specific to register selection or defining bits. It's related to
>> mask. So .c should be the right place
> 
> Don't the mask bits come from register def?

Yeah, the masks are already defined in the register header, but this is a 
comparison. I don't see comparisons like this in the register headers.
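
To make the comparison concrete, here is a minimal userspace sketch of what the mask check does. It is only an illustration under stated assumptions: REG_BIT is a stand-in for the kernel macro, the enum values are assumed from the _PICK_EVEN index order discussed earlier (0 = correctable, 1 = nonfatal, 2 = fatal), and pvc_err_bit_tracked() is a hypothetical function name, not something in the patch. The bit positions are the ones defined in xe_hw_error_regs.h in this patch.

```c
#include <stdbool.h>
#include <stdint.h>

/* Userspace stand-in for the kernel's REG_BIT(). */
#define REG_BIT(n)		(UINT32_C(1) << (n))

/* Stand-in for the driver's enum; the values are an assumption. */
enum hardware_error {
	HARDWARE_ERROR_CORRECTABLE = 0,
	HARDWARE_ERROR_NONFATAL = 1,
	HARDWARE_ERROR_FATAL = 2,
};

/* Bit definitions as in the patch. */
#define EU_GRF_COR_ERR		REG_BIT(15)
#define EU_IC_COR_ERR		REG_BIT(14)
#define SLM_COR_ERR		REG_BIT(13)
#define GUC_COR_ERR		REG_BIT(1)

#define EU_GRF_FAT_ERR		REG_BIT(15)
#define SLM_FAT_ERR		REG_BIT(13)
#define GUC_FAT_ERR		REG_BIT(6)
#define FPU_FAT_ERR		REG_BIT(3)

#define PVC_COR_ERR_MASK	(GUC_COR_ERR | SLM_COR_ERR | \
				 EU_IC_COR_ERR | EU_GRF_COR_ERR)
#define PVC_FAT_ERR_MASK	(FPU_FAT_ERR | GUC_FAT_ERR | \
				 EU_GRF_FAT_ERR | SLM_FAT_ERR)

/* Hypothetical helper: true if err_bit is a tracked bit for this severity. */
static bool pvc_err_bit_tracked(enum hardware_error hw_err, unsigned int err_bit)
{
	uint32_t mask;

	/* Guard the shift; the status registers here are 32 bits wide. */
	if (err_bit >= 32)
		return false;

	mask = (hw_err == HARDWARE_ERROR_CORRECTABLE) ? PVC_COR_ERR_MASK :
							 PVC_FAT_ERR_MASK;

	return (mask & REG_BIT(err_bit)) != 0;
}
```

Written as a function, the same check also avoids evaluating the macro arguments more than once, though that is a style choice and either form behaves the same for the bits used here.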

> 
>>> #define PVC_COR_ERR_MASK_SET(err_bit)			(PVC_COR_ERR_MASK & REG_BIT(err_bit))
>>> #define PVC_FAT_ERR_MASK_SET(err_bit)			(PVC_FAT_ERR_MASK & REG_BIT(err_bit))
>>>
>>> #define PVC_ERR_MASK_SET(hw_err, err_bit)		((hw_err == HARDWARE_ERROR_CORRECTABLE) ? \
>>> 								PVC_COR_ERR_MASK_SET(err_bit) : \
>>> 								PVC_FAT_ERR_MASK_SET(err_bit))
>>>
>>> ...
>>>
>>>> +static void gt_hw_error_handler(struct xe_tile *tile, const enum hardware_error hw_err,
>>>> +				u32 error_id)
>>>> +{
>>>> +	const enum drm_xe_ras_error_severity severity = hw_err_to_severity(hw_err);
>>>> +	struct xe_device *xe = tile_to_xe(tile);
>>>> +	struct xe_drm_ras *ras = &xe->ras;
>>>> +	struct xe_drm_ras_counter *info = ras->info[severity];
>>>> +	struct xe_mmio *mmio = &tile->mmio;
>>>> +	unsigned long err_stat = 0;
>>>> +	int i, len;
>>>> +
>>>> +	if (xe->info.platform != XE_PVC)
>>>> +		return;
>>>> +
>>>> +	if (!info)
>>>> +		return;
>>>
>>> Since info allocation is not related to hardware, we shouldn't even be
>>> at this point without it. So let's not hide bugs and fail probe instead.
>>
>> yes currently it is supported only on PVC. I can remove this here as there
>> is a PVC check but cannot remove the one suggested below.
> 
> Fair, but please also return the allocation failure. With that perhaps
> xe_hw_error_init() will be int now.

That's already present. We just log it as an error and don't fail probe.

Thanks
Riana

> 
> Raag
> 
>>>> +	if (hw_err == HARDWARE_ERROR_NONFATAL) {
>>>> +		atomic_inc(&info[error_id].counter);
>>>> +		log_hw_error(tile, info[error_id].name, severity);
>>>> +		return;
>>>> +	}
>>>
>>> ...
>>>
>>>>    static void hw_error_source_handler(struct xe_tile *tile, const enum hardware_error hw_err)
>>>>    {
>>>>    	const enum drm_xe_ras_error_severity severity = hw_err_to_severity(hw_err);
>>>>    	const char *severity_str = error_severity[severity];
>>>>    	struct xe_device *xe = tile_to_xe(tile);
>>>> -	unsigned long flags;
>>>> -	u32 err_src;
>>>> +	struct xe_drm_ras *ras = &xe->ras;
>>>> +	struct xe_drm_ras_counter *info = ras->info[severity];
>>>> +	unsigned long flags, err_src;
>>>> +	u32 err_bit;
>>>> -	if (xe->info.platform != XE_BATTLEMAGE)
>>>> +	if (!IS_DGFX(xe))
>>>>    		return;
>>>>    	spin_lock_irqsave(&xe->irq.lock, flags);
>>>> @@ -108,11 +242,53 @@ static void hw_error_source_handler(struct xe_tile *tile, const enum hardware_er
>>>>    		goto unlock;
>>>>    	}
>>>> -	if (err_src & XE_CSC_ERROR)
>>>> +	/*
>>>> +	 * On encountering CSC firmware errors, the graphics device becomes unrecoverable
>>>> +	 * so return immediately on error. The only way to recover from these errors is
>>>> +	 * firmware flash. The device will enter Runtime Survivability mode when such
>>>> +	 * errors are detected.
>>>> +	 */
>>>> +	if (err_src & XE_CSC_ERROR) {
>>>>    		csc_hw_error_handler(tile, hw_err);
>>>> +		goto clear_reg;
>>>> +	}
>>>> -	xe_mmio_write32(&tile->mmio, DEV_ERR_STAT_REG(hw_err), err_src);
>>>> +	if (!info)
>>>> +		goto clear_reg;
>>>
>>> Same as above.
>>>
>>> Raag



end of thread, other threads:[~2026-02-12  3:25 UTC | newest]

Thread overview: 24+ messages
2026-02-02  6:43 [PATCH v5 0/5] Introduce DRM_RAS using generic netlink for RAS Riana Tauro
2026-02-02  6:43 ` [PATCH v5 1/5] drm/ras: Introduce the DRM RAS infrastructure over generic netlink Riana Tauro
2026-02-02 10:08   ` kernel test robot
2026-02-02 22:52   ` kernel test robot
2026-02-02  6:43 ` [PATCH v5 2/5] drm/xe/xe_drm_ras: Add support for XE DRM RAS Riana Tauro
2026-02-03 17:58   ` Raag Jadav
2026-02-10  4:20     ` Riana Tauro
2026-02-02  6:43 ` [PATCH v5 3/5] drm/xe/xe_hw_error: Integrate DRM RAS with hardware error handling Riana Tauro
2026-02-05  8:30   ` Raag Jadav
2026-02-10  4:58     ` Riana Tauro
2026-02-10  4:59       ` Riana Tauro
2026-02-02  6:44 ` [PATCH v5 4/5] drm/xe/xe_hw_error: Add support for Core-Compute errors Riana Tauro
2026-02-05 15:30   ` Raag Jadav
2026-02-10  5:58     ` Riana Tauro
2026-02-10 11:45       ` Raag Jadav
2026-02-12  3:25         ` Riana Tauro
2026-02-02  6:44 ` [PATCH v5 5/5] drm/xe/xe_hw_error: Add support for PVC SoC errors Riana Tauro
2026-02-05 18:10   ` Raag Jadav
2026-02-10  6:32     ` Riana Tauro
2026-02-10 11:52       ` Raag Jadav
2026-02-02 16:15 ` ✗ CI.checkpatch: warning for Introduce DRM_RAS using generic netlink for RAS (rev5) Patchwork
2026-02-02 16:16 ` ✓ CI.KUnit: success " Patchwork
2026-02-02 16:31 ` ✗ CI.checksparse: warning " Patchwork
2026-02-02 16:51 ` ✓ Xe.CI.BAT: success " Patchwork
