public inbox for netdev@vger.kernel.org
 help / color / mirror / Atom feed
* [PATCH 0/4] Add support for clear counter and error event in DRM RAS
@ 2026-03-11 10:29 Riana Tauro
  2026-03-11 10:29 ` [PATCH 1/4] drm/drm_ras: Add clear-error-counter netlink command to drm_ras Riana Tauro
                   ` (3 more replies)
  0 siblings, 4 replies; 9+ messages in thread
From: Riana Tauro @ 2026-03-11 10:29 UTC (permalink / raw)
  To: intel-xe, dri-devel, netdev
  Cc: aravind.iddamsetty, anshuman.gupta, rodrigo.vivi, joonas.lahtinen,
	simona.vetter, airlied, pratik.bari, joshua.santosh.ranjan,
	ashwin.kumar.kulkarni, shubham.kumar, ravi.kishore.koppuravuri,
	raag.jadav, anvesh.bakwad, maarten.lankhorst, Riana Tauro

Clear Error Counter : Add clear-error-counter command to DRM RAS to clear
a specific error counter of a node. Implement the callback in XE driver
to demonstrate usage.

Usage with both get-error-counter and clear-error-counter:

$ sudo ynl --family drm_ras  --dump get-error-counter --json '{"node-id":1}'
[{'error-id': 1, 'error-name': 'core-compute', 'error-value': 0},
 {'error-id': 2, 'error-name': 'soc-internal', 'error-value': 3}]

$ sudo ynl --family drm_ras  --do clear-error-counter --json \
'{"node-id":1, "error-id":2}'
None

$ sudo ynl --family drm_ras  --dump get-error-counter --json '{"node-id":1}'
[{'error-id': 1, 'error-name': 'core-compute', 'error-value': 0},
 {'error-id': 2, 'error-name': 'soc-internal', 'error-value': 0}]

Error Event Support:  Introduce `error-event` support in DRM RAS to notify
userspace whenever an error occurs.

Each notification includes the node-id and error-id to identify
the source and type of the error. To receive notifications,
userspace must subscribe to the 'error-notify' multicast group.

Userspace can receive the event by subscribing to multicast group.

$ sudo ynl --family drm_ras --subscribe error-notify
{'msg': {'error-id': 2, 'node-id': 1}, 'name': 'error-event'}

Riana Tauro (4):
  drm/drm_ras: Add clear-error-counter netlink command to drm_ras
  drm/xe/xe_drm_ras: Add support for clear-error-counter in XE DRM RAS
  drm/drm_ras: Add DRM RAS netlink error event notification
  drm/xe/xe_drm_ras: Add error-event support in XE DRM RAS

 Documentation/gpu/drm-ras.rst            | 17 +++++
 Documentation/netlink/specs/drm_ras.yaml | 27 ++++++-
 drivers/gpu/drm/drm_ras.c                | 91 +++++++++++++++++++++++-
 drivers/gpu/drm/drm_ras_nl.c             | 19 +++++
 drivers/gpu/drm/drm_ras_nl.h             |  6 ++
 drivers/gpu/drm/xe/xe_drm_ras.c          | 52 +++++++++++++-
 drivers/gpu/drm/xe/xe_drm_ras.h          |  7 ++
 drivers/gpu/drm/xe/xe_hw_error.c         |  5 ++
 include/drm/drm_ras.h                    | 13 ++++
 include/uapi/drm/drm_ras.h               |  4 ++
 10 files changed, 237 insertions(+), 4 deletions(-)

-- 
2.47.1


^ permalink raw reply	[flat|nested] 9+ messages in thread

* [PATCH 1/4] drm/drm_ras: Add clear-error-counter netlink command to drm_ras
  2026-03-11 10:29 [PATCH 0/4] Add support for clear counter and error event in DRM RAS Riana Tauro
@ 2026-03-11 10:29 ` Riana Tauro
  2026-03-12  0:29   ` Jakub Kicinski
  2026-03-25 12:40   ` Raag Jadav
  2026-03-11 10:29 ` [PATCH 2/4] drm/xe/xe_drm_ras: Add support for clear-error-counter in XE DRM RAS Riana Tauro
                   ` (2 subsequent siblings)
  3 siblings, 2 replies; 9+ messages in thread
From: Riana Tauro @ 2026-03-11 10:29 UTC (permalink / raw)
  To: intel-xe, dri-devel, netdev
  Cc: aravind.iddamsetty, anshuman.gupta, rodrigo.vivi, joonas.lahtinen,
	simona.vetter, airlied, pratik.bari, joshua.santosh.ranjan,
	ashwin.kumar.kulkarni, shubham.kumar, ravi.kishore.koppuravuri,
	raag.jadav, anvesh.bakwad, maarten.lankhorst, Riana Tauro,
	Jakub Kicinski, Zack McKevitt, Lijo Lazar, Hawking Zhang,
	David S. Miller, Paolo Abeni, Eric Dumazet

Introduce a new 'clear-error-counter' DRM RAS command to reset the counter
value for a specific error counter of a given node.

The command is a 'do' netlink request with 'node-id' and 'error-id'
as parameters with no additional response payload.

Usage

$ sudo ynl --family drm_ras  --do clear-error-counter --json \
'{"node-id":1, "error-id":1}'
None

Cc: Jakub Kicinski <kuba@kernel.org>
Cc: Zack McKevitt <zachary.mckevitt@oss.qualcomm.com>
Cc: Lijo Lazar <lijo.lazar@amd.com>
Cc: Hawking Zhang <Hawking.Zhang@amd.com>
Cc: David S. Miller <davem@davemloft.net>
Cc: Paolo Abeni <pabeni@redhat.com>
Cc: Eric Dumazet <edumazet@google.com>
Signed-off-by: Riana Tauro <riana.tauro@intel.com>
---
 Documentation/gpu/drm-ras.rst            |  8 +++++
 Documentation/netlink/specs/drm_ras.yaml | 13 ++++++-
 drivers/gpu/drm/drm_ras.c                | 43 +++++++++++++++++++++++-
 drivers/gpu/drm/drm_ras_nl.c             | 13 +++++++
 drivers/gpu/drm/drm_ras_nl.h             |  2 ++
 include/drm/drm_ras.h                    | 11 ++++++
 include/uapi/drm/drm_ras.h               |  1 +
 7 files changed, 89 insertions(+), 2 deletions(-)

diff --git a/Documentation/gpu/drm-ras.rst b/Documentation/gpu/drm-ras.rst
index 70b246a78fc8..4636e68f5678 100644
--- a/Documentation/gpu/drm-ras.rst
+++ b/Documentation/gpu/drm-ras.rst
@@ -52,6 +52,8 @@ User space tools can:
   as a parameter.
 * Query specific error counter values with the ``get-error-counter`` command, using both
   ``node-id`` and ``error-id`` as parameters.
+* Clear specific error counters with the ``clear-error-counter`` command, using both
+  ``node-id`` and ``error-id`` as parameters.
 
 YAML-based Interface
 --------------------
@@ -101,3 +103,9 @@ Example: Query an error counter for a given node
     sudo ynl --family drm_ras --do get-error-counter --json '{"node-id":0, "error-id":1}'
     {'error-id': 1, 'error-name': 'error_name1', 'error-value': 0}
 
+Example: Clear an error counter for a given node
+
+.. code-block:: bash
+
+    sudo ynl --family drm_ras --do clear-error-counter --json '{"node-id":0, "error-id":1}'
+    None
diff --git a/Documentation/netlink/specs/drm_ras.yaml b/Documentation/netlink/specs/drm_ras.yaml
index 79af25dac3c5..e113056f8c01 100644
--- a/Documentation/netlink/specs/drm_ras.yaml
+++ b/Documentation/netlink/specs/drm_ras.yaml
@@ -99,7 +99,7 @@ operations:
       flags: [admin-perm]
       do:
         request:
-          attributes:
+          attributes: &id-attrs
             - node-id
             - error-id
         reply:
@@ -113,3 +113,14 @@ operations:
             - node-id
         reply:
           attributes: *errorinfo
+    -
+      name: clear-error-counter
+      doc: >-
+           Clear error counter for a given node.
+           The request includes the error-id and node-id of the
+           counter to be cleared.
+      attribute-set: error-counter-attrs
+      flags: [admin-perm]
+      do:
+        request:
+          attributes: *id-attrs
diff --git a/drivers/gpu/drm/drm_ras.c b/drivers/gpu/drm/drm_ras.c
index b2fa5ab86d87..d6eab29a1394 100644
--- a/drivers/gpu/drm/drm_ras.c
+++ b/drivers/gpu/drm/drm_ras.c
@@ -26,7 +26,7 @@
  * efficient lookup by ID. Nodes can be registered or unregistered
  * dynamically at runtime.
  *
- * A Generic Netlink family `drm_ras` exposes two main operations to
+ * A Generic Netlink family `drm_ras` exposes the below operations to
  * userspace:
  *
  * 1. LIST_NODES: Dump all currently registered RAS nodes.
@@ -37,6 +37,10 @@
  *    Returns all counters of a node if only Node ID is provided or specific
  *    error counters.
  *
+ * 3. CLEAR_ERROR_COUNTER: Clear error counter of a given node.
+ *    Userspace must provide Node ID, Error ID.
+ *    Clears specific error counter of a node if supported.
+ *
  * Node registration:
  *
  * - drm_ras_node_register(): Registers a new node and assigns
@@ -66,6 +70,8 @@
  *   operation, fetching all counters from a specific node.
  * - drm_ras_nl_get_error_counter_doit(): Implements the GET_ERROR_COUNTER doit
  *   operation, fetching a counter value from a specific node.
+ * - drm_ras_nl_clear_error_counter_doit(): Implements the CLEAR_ERROR_COUNTER doit
+ *   operation, clearing a counter value from a specific node.
  */
 
 static DEFINE_XARRAY_ALLOC(drm_ras_xa);
@@ -314,6 +320,41 @@ int drm_ras_nl_get_error_counter_doit(struct sk_buff *skb,
 	return doit_reply_value(info, node_id, error_id);
 }
 
+/**
+ * drm_ras_nl_clear_error_counter_doit() - Clear an error counter of a node
+ * @skb: Netlink message buffer
+ * @info: Generic Netlink info containing attributes of the request
+ *
+ * Extracts the node ID and error ID from the netlink attributes and
+ * clears the current value.
+ *
+ * Return: 0 on success, or negative errno on failure.
+ */
+int drm_ras_nl_clear_error_counter_doit(struct sk_buff *skb,
+					struct genl_info *info)
+{
+	struct drm_ras_node *node;
+	u32 node_id, error_id;
+
+	if (!info->attrs ||
+	    GENL_REQ_ATTR_CHECK(info, DRM_RAS_A_ERROR_COUNTER_ATTRS_NODE_ID) ||
+	    GENL_REQ_ATTR_CHECK(info, DRM_RAS_A_ERROR_COUNTER_ATTRS_ERROR_ID))
+		return -EINVAL;
+
+	node_id = nla_get_u32(info->attrs[DRM_RAS_A_ERROR_COUNTER_ATTRS_NODE_ID]);
+	error_id = nla_get_u32(info->attrs[DRM_RAS_A_ERROR_COUNTER_ATTRS_ERROR_ID]);
+
+	node = xa_load(&drm_ras_xa, node_id);
+	if (!node || !node->clear_error_counter)
+		return -ENOENT;
+
+	if (error_id < node->error_counter_range.first ||
+	    error_id > node->error_counter_range.last)
+		return -EINVAL;
+
+	return node->clear_error_counter(node, error_id);
+}
+
 /**
  * drm_ras_node_register() - Register a new RAS node
  * @node: Node structure to register
diff --git a/drivers/gpu/drm/drm_ras_nl.c b/drivers/gpu/drm/drm_ras_nl.c
index 16803d0c4a44..dea1c1b2494e 100644
--- a/drivers/gpu/drm/drm_ras_nl.c
+++ b/drivers/gpu/drm/drm_ras_nl.c
@@ -22,6 +22,12 @@ static const struct nla_policy drm_ras_get_error_counter_dump_nl_policy[DRM_RAS_
 	[DRM_RAS_A_ERROR_COUNTER_ATTRS_NODE_ID] = { .type = NLA_U32, },
 };
 
+/* DRM_RAS_CMD_CLEAR_ERROR_COUNTER - do */
+static const struct nla_policy drm_ras_clear_error_counter_nl_policy[DRM_RAS_A_ERROR_COUNTER_ATTRS_ERROR_ID + 1] = {
+	[DRM_RAS_A_ERROR_COUNTER_ATTRS_NODE_ID] = { .type = NLA_U32, },
+	[DRM_RAS_A_ERROR_COUNTER_ATTRS_ERROR_ID] = { .type = NLA_U32, },
+};
+
 /* Ops table for drm_ras */
 static const struct genl_split_ops drm_ras_nl_ops[] = {
 	{
@@ -43,6 +49,13 @@ static const struct genl_split_ops drm_ras_nl_ops[] = {
 		.maxattr	= DRM_RAS_A_ERROR_COUNTER_ATTRS_NODE_ID,
 		.flags		= GENL_ADMIN_PERM | GENL_CMD_CAP_DUMP,
 	},
+	{
+		.cmd		= DRM_RAS_CMD_CLEAR_ERROR_COUNTER,
+		.doit		= drm_ras_nl_clear_error_counter_doit,
+		.policy		= drm_ras_clear_error_counter_nl_policy,
+		.maxattr	= DRM_RAS_A_ERROR_COUNTER_ATTRS_ERROR_ID,
+		.flags		= GENL_ADMIN_PERM | GENL_CMD_CAP_DO,
+	},
 };
 
 struct genl_family drm_ras_nl_family __ro_after_init = {
diff --git a/drivers/gpu/drm/drm_ras_nl.h b/drivers/gpu/drm/drm_ras_nl.h
index 06ccd9342773..a398643572a5 100644
--- a/drivers/gpu/drm/drm_ras_nl.h
+++ b/drivers/gpu/drm/drm_ras_nl.h
@@ -18,6 +18,8 @@ int drm_ras_nl_get_error_counter_doit(struct sk_buff *skb,
 				      struct genl_info *info);
 int drm_ras_nl_get_error_counter_dumpit(struct sk_buff *skb,
 					struct netlink_callback *cb);
+int drm_ras_nl_clear_error_counter_doit(struct sk_buff *skb,
+					struct genl_info *info);
 
 extern struct genl_family drm_ras_nl_family;
 
diff --git a/include/drm/drm_ras.h b/include/drm/drm_ras.h
index 5d50209e51db..f2a787bc4f64 100644
--- a/include/drm/drm_ras.h
+++ b/include/drm/drm_ras.h
@@ -58,6 +58,17 @@ struct drm_ras_node {
 	int (*query_error_counter)(struct drm_ras_node *node, u32 error_id,
 				   const char **name, u32 *val);
 
+	/**
+	 * @clear_error_counter:
+	 *
+	 * This callback is used by drm_ras to clear a specific error counter.
+	 * Driver should implement this callback to support clearing error counters
+	 * of a node.
+	 *
+	 * Returns: 0 on success, negative error code on failure.
+	 */
+	int (*clear_error_counter)(struct drm_ras_node *node, u32 error_id);
+
 	/** @priv: Driver private data */
 	void *priv;
 };
diff --git a/include/uapi/drm/drm_ras.h b/include/uapi/drm/drm_ras.h
index 5f40fa5b869d..218a3ee86805 100644
--- a/include/uapi/drm/drm_ras.h
+++ b/include/uapi/drm/drm_ras.h
@@ -41,6 +41,7 @@ enum {
 enum {
 	DRM_RAS_CMD_LIST_NODES = 1,
 	DRM_RAS_CMD_GET_ERROR_COUNTER,
+	DRM_RAS_CMD_CLEAR_ERROR_COUNTER,
 
 	__DRM_RAS_CMD_MAX,
 	DRM_RAS_CMD_MAX = (__DRM_RAS_CMD_MAX - 1)
-- 
2.47.1


^ permalink raw reply related	[flat|nested] 9+ messages in thread

* [PATCH 2/4] drm/xe/xe_drm_ras: Add support for clear-error-counter in XE DRM RAS
  2026-03-11 10:29 [PATCH 0/4] Add support for clear counter and error event in DRM RAS Riana Tauro
  2026-03-11 10:29 ` [PATCH 1/4] drm/drm_ras: Add clear-error-counter netlink command to drm_ras Riana Tauro
@ 2026-03-11 10:29 ` Riana Tauro
  2026-03-12 10:17   ` Raag Jadav
  2026-03-11 10:29 ` [PATCH 3/4] drm/drm_ras: Add DRM RAS netlink error event notification Riana Tauro
  2026-03-11 10:29 ` [PATCH 4/4] drm/xe/xe_drm_ras: Add error-event support in XE DRM RAS Riana Tauro
  3 siblings, 1 reply; 9+ messages in thread
From: Riana Tauro @ 2026-03-11 10:29 UTC (permalink / raw)
  To: intel-xe, dri-devel, netdev
  Cc: aravind.iddamsetty, anshuman.gupta, rodrigo.vivi, joonas.lahtinen,
	simona.vetter, airlied, pratik.bari, joshua.santosh.ranjan,
	ashwin.kumar.kulkarni, shubham.kumar, ravi.kishore.koppuravuri,
	raag.jadav, anvesh.bakwad, maarten.lankhorst, Riana Tauro

Add support for clear-error-counter command in XE DRM RAS.
This resets the counter value.

Usage:

$ sudo ynl --family drm_ras  --do clear-error-counter --json \
'{"node-id":1, "error-id":1}'
None

Signed-off-by: Riana Tauro <riana.tauro@intel.com>
---
 drivers/gpu/drm/xe/xe_drm_ras.c | 35 +++++++++++++++++++++++++++++++--
 1 file changed, 33 insertions(+), 2 deletions(-)

diff --git a/drivers/gpu/drm/xe/xe_drm_ras.c b/drivers/gpu/drm/xe/xe_drm_ras.c
index e07dc23a155e..c21c8b428de6 100644
--- a/drivers/gpu/drm/xe/xe_drm_ras.c
+++ b/drivers/gpu/drm/xe/xe_drm_ras.c
@@ -27,6 +27,16 @@ static int hw_query_error_counter(struct xe_drm_ras_counter *info,
 	return 0;
 }
 
+static int hw_clear_error_counter(struct xe_drm_ras_counter *info, u32 error_id)
+{
+	if (!info || !info[error_id].name)
+		return -ENOENT;
+
+	atomic_set(&info[error_id].counter, 0);
+
+	return 0;
+}
+
 static int query_uncorrectable_error_counter(struct drm_ras_node *ep, u32 error_id,
 					     const char **name, u32 *val)
 {
@@ -37,6 +47,15 @@ static int query_uncorrectable_error_counter(struct drm_ras_node *ep, u32 error_
 	return hw_query_error_counter(info, error_id, name, val);
 }
 
+static int clear_uncorrectable_error_counter(struct drm_ras_node *node, u32 error_id)
+{
+	struct xe_device *xe = node->priv;
+	struct xe_drm_ras *ras = &xe->ras;
+	struct xe_drm_ras_counter *info = ras->info[DRM_XE_RAS_ERR_SEV_UNCORRECTABLE];
+
+	return hw_clear_error_counter(info, error_id);
+}
+
 static int query_correctable_error_counter(struct drm_ras_node *ep, u32 error_id,
 					   const char **name, u32 *val)
 {
@@ -47,6 +66,15 @@ static int query_correctable_error_counter(struct drm_ras_node *ep, u32 error_id
 	return hw_query_error_counter(info, error_id, name, val);
 }
 
+static int clear_correctable_error_counter(struct drm_ras_node *node, u32 error_id)
+{
+	struct xe_device *xe = node->priv;
+	struct xe_drm_ras *ras = &xe->ras;
+	struct xe_drm_ras_counter *info = ras->info[DRM_XE_RAS_ERR_SEV_CORRECTABLE];
+
+	return hw_clear_error_counter(info, error_id);
+}
+
 static struct xe_drm_ras_counter *allocate_and_copy_counters(struct xe_device *xe)
 {
 	struct xe_drm_ras_counter *counter;
@@ -92,10 +120,13 @@ static int assign_node_params(struct xe_device *xe, struct drm_ras_node *node,
 	if (IS_ERR(ras->info[severity]))
 		return PTR_ERR(ras->info[severity]);
 
-	if (severity == DRM_XE_RAS_ERR_SEV_CORRECTABLE)
+	if (severity == DRM_XE_RAS_ERR_SEV_CORRECTABLE) {
 		node->query_error_counter = query_correctable_error_counter;
-	else
+		node->clear_error_counter = clear_correctable_error_counter;
+	} else {
 		node->query_error_counter = query_uncorrectable_error_counter;
+		node->clear_error_counter = clear_uncorrectable_error_counter;
+	}
 
 	return 0;
 }
-- 
2.47.1


^ permalink raw reply related	[flat|nested] 9+ messages in thread

* [PATCH 3/4] drm/drm_ras: Add DRM RAS netlink error event notification
  2026-03-11 10:29 [PATCH 0/4] Add support for clear counter and error event in DRM RAS Riana Tauro
  2026-03-11 10:29 ` [PATCH 1/4] drm/drm_ras: Add clear-error-counter netlink command to drm_ras Riana Tauro
  2026-03-11 10:29 ` [PATCH 2/4] drm/xe/xe_drm_ras: Add support for clear-error-counter in XE DRM RAS Riana Tauro
@ 2026-03-11 10:29 ` Riana Tauro
  2026-03-25 13:31   ` Raag Jadav
  2026-03-11 10:29 ` [PATCH 4/4] drm/xe/xe_drm_ras: Add error-event support in XE DRM RAS Riana Tauro
  3 siblings, 1 reply; 9+ messages in thread
From: Riana Tauro @ 2026-03-11 10:29 UTC (permalink / raw)
  To: intel-xe, dri-devel, netdev
  Cc: aravind.iddamsetty, anshuman.gupta, rodrigo.vivi, joonas.lahtinen,
	simona.vetter, airlied, pratik.bari, joshua.santosh.ranjan,
	ashwin.kumar.kulkarni, shubham.kumar, ravi.kishore.koppuravuri,
	raag.jadav, anvesh.bakwad, maarten.lankhorst, Riana Tauro,
	Jakub Kicinski, Zack McKevitt, Lijo Lazar, Hawking Zhang,
	David S. Miller, Paolo Abeni, Eric Dumazet

Add support for asynchronous error notifications in drm_ras.

Define a new `error-event` netlink event and a new multicast
group `error-notify` in drm_ras spec. Each event contains
a node-id and error-id to identify the type and source
of error.

Add drm_ras_error_notify() to trigger this event from drivers.
Userspace can receive this event by subscribing to the
multicast group error-notify.

Example: Using ynl tool

$ sudo ynl --family drm_ras --subscribe error-notify

Cc: Jakub Kicinski <kuba@kernel.org>
Cc: Zack McKevitt <zachary.mckevitt@oss.qualcomm.com>
Cc: Lijo Lazar <lijo.lazar@amd.com>
Cc: Hawking Zhang <Hawking.Zhang@amd.com>
Cc: David S. Miller <davem@davemloft.net>
Cc: Paolo Abeni <pabeni@redhat.com>
Cc: Eric Dumazet <edumazet@google.com>
Signed-off-by: Riana Tauro <riana.tauro@intel.com>
---
 Documentation/gpu/drm-ras.rst            |  9 +++++
 Documentation/netlink/specs/drm_ras.yaml | 14 +++++++
 drivers/gpu/drm/drm_ras.c                | 48 ++++++++++++++++++++++++
 drivers/gpu/drm/drm_ras_nl.c             |  6 +++
 drivers/gpu/drm/drm_ras_nl.h             |  4 ++
 include/drm/drm_ras.h                    |  2 +
 include/uapi/drm/drm_ras.h               |  3 ++
 7 files changed, 86 insertions(+)

diff --git a/Documentation/gpu/drm-ras.rst b/Documentation/gpu/drm-ras.rst
index 4636e68f5678..09b2918f67bd 100644
--- a/Documentation/gpu/drm-ras.rst
+++ b/Documentation/gpu/drm-ras.rst
@@ -54,6 +54,8 @@ User space tools can:
   ``node-id`` and ``error-id`` as parameters.
 * Clear specific error counters with the ``clear-error-counter`` command, using both
   ``node-id`` and ``error-id`` as parameters.
+* Listen to ``error-event`` notifications for error events by subscribing to the
+  ``error-notify`` multicast group.
 
 YAML-based Interface
 --------------------
@@ -109,3 +111,10 @@ Example: Clear an error counter for a given node
 
     sudo ynl --family drm_ras --do clear-error-counter --json '{"node-id":0, "error-id":1}'
     None
+
+Example: Listen to error events
+
+.. code-block:: bash
+
+    sudo ynl --family drm_ras --subscribe error-notify
+    {'msg': {'error-id': 1, 'node-id': 1}, 'name': 'error-event'}
diff --git a/Documentation/netlink/specs/drm_ras.yaml b/Documentation/netlink/specs/drm_ras.yaml
index e113056f8c01..4dc047be59e9 100644
--- a/Documentation/netlink/specs/drm_ras.yaml
+++ b/Documentation/netlink/specs/drm_ras.yaml
@@ -124,3 +124,17 @@ operations:
       do:
         request:
           attributes: *id-attrs
+    -
+      name: error-event
+      doc: >-
+           Notify userspace of an error event.
+           The event includes the error-id and node-id of the error
+           that triggered the event.
+      attribute-set: error-counter-attrs
+      event:
+        attributes: *id-attrs
+
+mcast-groups:
+  list:
+    -
+      name: error-notify
diff --git a/drivers/gpu/drm/drm_ras.c b/drivers/gpu/drm/drm_ras.c
index d6eab29a1394..36a3a79cbbea 100644
--- a/drivers/gpu/drm/drm_ras.c
+++ b/drivers/gpu/drm/drm_ras.c
@@ -41,6 +41,10 @@
  *    Userspace must provide Node ID, Error ID.
  *    Clears specific error counter of a node if supported.
  *
+ * 4. ERROR_EVENT: Notify userspace of an error event.
+ *    The event includes the error-id and node-id of the error
+ *    that triggered the event.
+ *
  * Node registration:
  *
  * - drm_ras_node_register(): Registers a new node and assigns
@@ -355,6 +359,50 @@ int drm_ras_nl_clear_error_counter_doit(struct sk_buff *skb,
 	return node->clear_error_counter(node, error_id);
 }
 
+/**
+ * drm_ras_error_notify() - Notify userspace of an error event
+ * @node: Node structure
+ * @error_id: ID of the error counter that triggered the event
+ * @flags: GFP flags for memory allocation
+ *
+ * Notifies userspace of an error event related to a specific RAS node and error counter.
+ */
+void drm_ras_error_notify(struct drm_ras_node *node, u32 error_id, gfp_t flags)
+{
+	struct genl_info info;
+	struct sk_buff *msg;
+	struct nlattr *hdr;
+	int ret;
+
+	genl_info_init_ntf(&info, &drm_ras_nl_family, DRM_RAS_CMD_ERROR_EVENT);
+
+	msg = genlmsg_new(NLMSG_GOODSIZE, flags);
+	if (!msg)
+		return;
+
+	hdr = genlmsg_iput(msg, &info);
+	if (!hdr)
+		goto err_free;
+
+	ret = nla_put_u32(msg, DRM_RAS_A_ERROR_COUNTER_ATTRS_NODE_ID, node->id);
+	if (ret)
+		goto err_cancel;
+
+	ret = nla_put_u32(msg, DRM_RAS_A_ERROR_COUNTER_ATTRS_ERROR_ID, error_id);
+	if (ret)
+		goto err_cancel;
+
+	genlmsg_end(msg, hdr);
+	genlmsg_multicast(&drm_ras_nl_family, msg, 0, DRM_RAS_NLGRP_ERROR_NOTIFY, flags);
+	return;
+
+err_cancel:
+	genlmsg_cancel(msg, hdr);
+err_free:
+	nlmsg_free(msg);
+}
+EXPORT_SYMBOL(drm_ras_error_notify);
+
 /**
  * drm_ras_node_register() - Register a new RAS node
  * @node: Node structure to register
diff --git a/drivers/gpu/drm/drm_ras_nl.c b/drivers/gpu/drm/drm_ras_nl.c
index dea1c1b2494e..ac724bb87a3b 100644
--- a/drivers/gpu/drm/drm_ras_nl.c
+++ b/drivers/gpu/drm/drm_ras_nl.c
@@ -58,6 +58,10 @@ static const struct genl_split_ops drm_ras_nl_ops[] = {
 	},
 };
 
+static const struct genl_multicast_group drm_ras_nl_mcgrps[] = {
+	[DRM_RAS_NLGRP_ERROR_NOTIFY] = { "error-notify", },
+};
+
 struct genl_family drm_ras_nl_family __ro_after_init = {
 	.name		= DRM_RAS_FAMILY_NAME,
 	.version	= DRM_RAS_FAMILY_VERSION,
@@ -66,4 +70,6 @@ struct genl_family drm_ras_nl_family __ro_after_init = {
 	.module		= THIS_MODULE,
 	.split_ops	= drm_ras_nl_ops,
 	.n_split_ops	= ARRAY_SIZE(drm_ras_nl_ops),
+	.mcgrps		= drm_ras_nl_mcgrps,
+	.n_mcgrps	= ARRAY_SIZE(drm_ras_nl_mcgrps),
 };
diff --git a/drivers/gpu/drm/drm_ras_nl.h b/drivers/gpu/drm/drm_ras_nl.h
index a398643572a5..17e1af8cc3b3 100644
--- a/drivers/gpu/drm/drm_ras_nl.h
+++ b/drivers/gpu/drm/drm_ras_nl.h
@@ -21,6 +21,10 @@ int drm_ras_nl_get_error_counter_dumpit(struct sk_buff *skb,
 int drm_ras_nl_clear_error_counter_doit(struct sk_buff *skb,
 					struct genl_info *info);
 
+enum {
+	DRM_RAS_NLGRP_ERROR_NOTIFY,
+};
+
 extern struct genl_family drm_ras_nl_family;
 
 #endif /* _LINUX_DRM_RAS_GEN_H */
diff --git a/include/drm/drm_ras.h b/include/drm/drm_ras.h
index f2a787bc4f64..a2d4f257c9c2 100644
--- a/include/drm/drm_ras.h
+++ b/include/drm/drm_ras.h
@@ -78,9 +78,11 @@ struct drm_device;
 #if IS_ENABLED(CONFIG_DRM_RAS)
 int drm_ras_node_register(struct drm_ras_node *node);
 void drm_ras_node_unregister(struct drm_ras_node *node);
+void drm_ras_error_notify(struct drm_ras_node *node, u32 error_id, gfp_t flags);
 #else
 static inline int drm_ras_node_register(struct drm_ras_node *node) { return 0; }
 static inline void drm_ras_node_unregister(struct drm_ras_node *node) { }
+static inline void drm_ras_error_notify(struct drm_ras_node *node, u32 error_id, gfp_t flags) { }
 #endif
 
 #endif
diff --git a/include/uapi/drm/drm_ras.h b/include/uapi/drm/drm_ras.h
index 218a3ee86805..47fafeff93e7 100644
--- a/include/uapi/drm/drm_ras.h
+++ b/include/uapi/drm/drm_ras.h
@@ -42,9 +42,12 @@ enum {
 	DRM_RAS_CMD_LIST_NODES = 1,
 	DRM_RAS_CMD_GET_ERROR_COUNTER,
 	DRM_RAS_CMD_CLEAR_ERROR_COUNTER,
+	DRM_RAS_CMD_ERROR_EVENT,
 
 	__DRM_RAS_CMD_MAX,
 	DRM_RAS_CMD_MAX = (__DRM_RAS_CMD_MAX - 1)
 };
 
+#define DRM_RAS_MCGRP_ERROR_NOTIFY	"error-notify"
+
 #endif /* _UAPI_LINUX_DRM_RAS_H */
-- 
2.47.1


^ permalink raw reply related	[flat|nested] 9+ messages in thread

* [PATCH 4/4] drm/xe/xe_drm_ras: Add error-event support in XE DRM RAS
  2026-03-11 10:29 [PATCH 0/4] Add support for clear counter and error event in DRM RAS Riana Tauro
                   ` (2 preceding siblings ...)
  2026-03-11 10:29 ` [PATCH 3/4] drm/drm_ras: Add DRM RAS netlink error event notification Riana Tauro
@ 2026-03-11 10:29 ` Riana Tauro
  3 siblings, 0 replies; 9+ messages in thread
From: Riana Tauro @ 2026-03-11 10:29 UTC (permalink / raw)
  To: intel-xe, dri-devel, netdev
  Cc: aravind.iddamsetty, anshuman.gupta, rodrigo.vivi, joonas.lahtinen,
	simona.vetter, airlied, pratik.bari, joshua.santosh.ranjan,
	ashwin.kumar.kulkarni, shubham.kumar, ravi.kishore.koppuravuri,
	raag.jadav, anvesh.bakwad, maarten.lankhorst, Riana Tauro

Add error-event support in XE DRM RAS to notify userspace
whenever a GT or SoC error occurs.

$ sudo ynl --family drm_ras --subscribe error-notify
{'msg': {'error-id': 1, 'node-id': 1}, 'name': 'error-event'}

Signed-off-by: Riana Tauro <riana.tauro@intel.com>
---
 drivers/gpu/drm/xe/xe_drm_ras.c  | 17 +++++++++++++++++
 drivers/gpu/drm/xe/xe_drm_ras.h  |  7 +++++++
 drivers/gpu/drm/xe/xe_hw_error.c |  5 +++++
 3 files changed, 29 insertions(+)

diff --git a/drivers/gpu/drm/xe/xe_drm_ras.c b/drivers/gpu/drm/xe/xe_drm_ras.c
index c21c8b428de6..47c040c80175 100644
--- a/drivers/gpu/drm/xe/xe_drm_ras.c
+++ b/drivers/gpu/drm/xe/xe_drm_ras.c
@@ -181,6 +181,23 @@ static void xe_drm_ras_unregister_nodes(struct drm_device *device, void *arg)
 	}
 }
 
+/**
+ * xe_drm_ras_notify() - Notify userspace of an error event
+ * @ras:  ras structure
+ * @error_id: error id
+ * @severity: error severity
+ * @flags: flags for allocation
+ *
+ * Notifies userspace of an error.
+ */
+void xe_drm_ras_notify(struct xe_drm_ras *ras, u32 error_id,
+		       const enum drm_xe_ras_error_severity severity, gfp_t flags)
+{
+	struct drm_ras_node *node = &ras->node[severity];
+
+	drm_ras_error_notify(node, error_id, flags);
+}
+
 /**
  * xe_drm_ras_init() - Initialize DRM RAS
  * @xe: xe device instance
diff --git a/drivers/gpu/drm/xe/xe_drm_ras.h b/drivers/gpu/drm/xe/xe_drm_ras.h
index 5cc8f0124411..ac347d0d63eb 100644
--- a/drivers/gpu/drm/xe/xe_drm_ras.h
+++ b/drivers/gpu/drm/xe/xe_drm_ras.h
@@ -5,11 +5,18 @@
 #ifndef XE_DRM_RAS_H_
 #define XE_DRM_RAS_H_
 
+#include <linux/types.h>
+
+#include <drm/xe_drm.h>
+
 struct xe_device;
+struct xe_drm_ras;
 
 #define for_each_error_severity(i)	\
 	for (i = 0; i < DRM_XE_RAS_ERR_SEV_MAX; i++)
 
 int xe_drm_ras_init(struct xe_device *xe);
+void xe_drm_ras_notify(struct xe_drm_ras *ras, u32 error_id,
+		       const enum drm_xe_ras_error_severity severity, gfp_t flags);
 
 #endif
diff --git a/drivers/gpu/drm/xe/xe_hw_error.c b/drivers/gpu/drm/xe/xe_hw_error.c
index 2a31b430570e..17424e07e72c 100644
--- a/drivers/gpu/drm/xe/xe_hw_error.c
+++ b/drivers/gpu/drm/xe/xe_hw_error.c
@@ -332,6 +332,8 @@ static void gt_hw_error_handler(struct xe_tile *tile, const enum hardware_error
 
 		xe_mmio_write32(mmio, ERR_STAT_GT_VECTOR_REG(hw_err, i), vector);
 	}
+
+	xe_drm_ras_notify(ras, error_id, severity, GFP_ATOMIC);
 }
 
 static void soc_slave_ieh_handler(struct xe_tile *tile, const enum hardware_error hw_err, u32 error_id)
@@ -368,6 +370,7 @@ static void soc_hw_error_handler(struct xe_tile *tile, const enum hardware_error
 {
 	const enum drm_xe_ras_error_severity severity = hw_err_to_severity(hw_err);
 	struct xe_device *xe = tile_to_xe(tile);
+	struct xe_drm_ras *ras = &xe->ras;
 	struct xe_mmio *mmio = &tile->mmio;
 	unsigned long master_global_errstat, master_local_errstat;
 	u32 master, slave, regbit;
@@ -418,6 +421,8 @@ static void soc_hw_error_handler(struct xe_tile *tile, const enum hardware_error
 	for (i = 0; i < XE_SOC_NUM_IEH; i++)
 		xe_mmio_write32(mmio, SOC_GSYSEVTCTL_REG(master, slave, i),
 				(HARDWARE_ERROR_MAX << 1) + 1);
+
+	xe_drm_ras_notify(ras, error_id, severity, GFP_ATOMIC);
 }
 
 static void hw_error_source_handler(struct xe_tile *tile, const enum hardware_error hw_err)
-- 
2.47.1


^ permalink raw reply related	[flat|nested] 9+ messages in thread

* Re: [PATCH 1/4] drm/drm_ras: Add clear-error-counter netlink command to drm_ras
  2026-03-11 10:29 ` [PATCH 1/4] drm/drm_ras: Add clear-error-counter netlink command to drm_ras Riana Tauro
@ 2026-03-12  0:29   ` Jakub Kicinski
  2026-03-25 12:40   ` Raag Jadav
  1 sibling, 0 replies; 9+ messages in thread
From: Jakub Kicinski @ 2026-03-12  0:29 UTC (permalink / raw)
  To: Riana Tauro
  Cc: intel-xe, dri-devel, netdev, aravind.iddamsetty, anshuman.gupta,
	rodrigo.vivi, joonas.lahtinen, simona.vetter, airlied,
	pratik.bari, joshua.santosh.ranjan, ashwin.kumar.kulkarni,
	shubham.kumar, ravi.kishore.koppuravuri, raag.jadav,
	anvesh.bakwad, maarten.lankhorst, Zack McKevitt, Lijo Lazar,
	Hawking Zhang, David S. Miller, Paolo Abeni, Eric Dumazet

On Wed, 11 Mar 2026 15:59:15 +0530 Riana Tauro wrote:
>  Documentation/gpu/drm-ras.rst            |  8 +++++
>  Documentation/netlink/specs/drm_ras.yaml | 13 ++++++-

Reviewed-by: Jakub Kicinski <kuba@kernel.org>

^ permalink raw reply	[flat|nested] 9+ messages in thread

* Re: [PATCH 2/4] drm/xe/xe_drm_ras: Add support for clear-error-counter in XE DRM RAS
  2026-03-11 10:29 ` [PATCH 2/4] drm/xe/xe_drm_ras: Add support for clear-error-counter in XE DRM RAS Riana Tauro
@ 2026-03-12 10:17   ` Raag Jadav
  0 siblings, 0 replies; 9+ messages in thread
From: Raag Jadav @ 2026-03-12 10:17 UTC (permalink / raw)
  To: Riana Tauro
  Cc: intel-xe, dri-devel, netdev, aravind.iddamsetty, anshuman.gupta,
	rodrigo.vivi, joonas.lahtinen, simona.vetter, airlied,
	pratik.bari, joshua.santosh.ranjan, ashwin.kumar.kulkarni,
	shubham.kumar, ravi.kishore.koppuravuri, anvesh.bakwad,
	maarten.lankhorst

On Wed, Mar 11, 2026 at 03:59:16PM +0530, Riana Tauro wrote:
> Add support for clear-error-counter command in XE DRM RAS.
> This resets the counter value.
> 
> Usage:
> 
> $ sudo ynl --family drm_ras  --do clear-error-counter --json \
> '{"node-id":1, "error-id":1}'
> None
> 
> Signed-off-by: Riana Tauro <riana.tauro@intel.com>
> ---
>  drivers/gpu/drm/xe/xe_drm_ras.c | 35 +++++++++++++++++++++++++++++++--
>  1 file changed, 33 insertions(+), 2 deletions(-)
> 
> diff --git a/drivers/gpu/drm/xe/xe_drm_ras.c b/drivers/gpu/drm/xe/xe_drm_ras.c
> index e07dc23a155e..c21c8b428de6 100644
> --- a/drivers/gpu/drm/xe/xe_drm_ras.c
> +++ b/drivers/gpu/drm/xe/xe_drm_ras.c
> @@ -27,6 +27,16 @@ static int hw_query_error_counter(struct xe_drm_ras_counter *info,
>  	return 0;
>  }
>  
> +static int hw_clear_error_counter(struct xe_drm_ras_counter *info, u32 error_id)
> +{
> +	if (!info || !info[error_id].name)
> +		return -ENOENT;
> +
> +	atomic_set(&info[error_id].counter, 0);
> +
> +	return 0;
> +}
> +
>  static int query_uncorrectable_error_counter(struct drm_ras_node *ep, u32 error_id,
>  					     const char **name, u32 *val)
>  {
> @@ -37,6 +47,15 @@ static int query_uncorrectable_error_counter(struct drm_ras_node *ep, u32 error_
>  	return hw_query_error_counter(info, error_id, name, val);
>  }
>  
> +static int clear_uncorrectable_error_counter(struct drm_ras_node *node, u32 error_id)
> +{
> +	struct xe_device *xe = node->priv;
> +	struct xe_drm_ras *ras = &xe->ras;
> +	struct xe_drm_ras_counter *info = ras->info[DRM_XE_RAS_ERR_SEV_UNCORRECTABLE];
> +
> +	return hw_clear_error_counter(info, error_id);
> +}
> +
>  static int query_correctable_error_counter(struct drm_ras_node *ep, u32 error_id,
>  					   const char **name, u32 *val)
>  {
> @@ -47,6 +66,15 @@ static int query_correctable_error_counter(struct drm_ras_node *ep, u32 error_id
>  	return hw_query_error_counter(info, error_id, name, val);
>  }
>  
> +static int clear_correctable_error_counter(struct drm_ras_node *node, u32 error_id)
> +{
> +	struct xe_device *xe = node->priv;
> +	struct xe_drm_ras *ras = &xe->ras;
> +	struct xe_drm_ras_counter *info = ras->info[DRM_XE_RAS_ERR_SEV_CORRECTABLE];
> +
> +	return hw_clear_error_counter(info, error_id);
> +}

This would've been much simpler if we had per node info, but for now

Reviewed-by: Raag Jadav <raag.jadav@intel.com>

>  static struct xe_drm_ras_counter *allocate_and_copy_counters(struct xe_device *xe)
>  {
>  	struct xe_drm_ras_counter *counter;
> @@ -92,10 +120,13 @@ static int assign_node_params(struct xe_device *xe, struct drm_ras_node *node,
>  	if (IS_ERR(ras->info[severity]))
>  		return PTR_ERR(ras->info[severity]);
>  
> -	if (severity == DRM_XE_RAS_ERR_SEV_CORRECTABLE)
> +	if (severity == DRM_XE_RAS_ERR_SEV_CORRECTABLE) {
>  		node->query_error_counter = query_correctable_error_counter;
> -	else
> +		node->clear_error_counter = clear_correctable_error_counter;
> +	} else {
>  		node->query_error_counter = query_uncorrectable_error_counter;
> +		node->clear_error_counter = clear_uncorrectable_error_counter;
> +	}
>  
>  	return 0;
>  }
> -- 
> 2.47.1
> 

^ permalink raw reply	[flat|nested] 9+ messages in thread

* Re: [PATCH 1/4] drm/drm_ras: Add clear-error-counter netlink command to drm_ras
  2026-03-11 10:29 ` [PATCH 1/4] drm/drm_ras: Add clear-error-counter netlink command to drm_ras Riana Tauro
  2026-03-12  0:29   ` Jakub Kicinski
@ 2026-03-25 12:40   ` Raag Jadav
  1 sibling, 0 replies; 9+ messages in thread
From: Raag Jadav @ 2026-03-25 12:40 UTC (permalink / raw)
  To: Riana Tauro
  Cc: intel-xe, dri-devel, netdev, aravind.iddamsetty, anshuman.gupta,
	rodrigo.vivi, joonas.lahtinen, simona.vetter, airlied,
	pratik.bari, joshua.santosh.ranjan, ashwin.kumar.kulkarni,
	shubham.kumar, ravi.kishore.koppuravuri, anvesh.bakwad,
	maarten.lankhorst, Jakub Kicinski, Zack McKevitt, Lijo Lazar,
	Hawking Zhang, David S. Miller, Paolo Abeni, Eric Dumazet

On Wed, Mar 11, 2026 at 03:59:15PM +0530, Riana Tauro wrote:
> Introduce a new 'clear-error-counter' DRM RAS command to reset the counter
> value for a specific error counter of a given node.
> 
> The command is a 'do' netlink request with 'node-id' and 'error-id'
> as parameters with no additional response payload.
> 
> Usage

Missing ":"

> $ sudo ynl --family drm_ras  --do clear-error-counter --json \
> '{"node-id":1, "error-id":1}'
> None
> 
> Cc: Jakub Kicinski <kuba@kernel.org>
> Cc: Zack McKevitt <zachary.mckevitt@oss.qualcomm.com>
> Cc: Lijo Lazar <lijo.lazar@amd.com>
> Cc: Hawking Zhang <Hawking.Zhang@amd.com>
> Cc: David S. Miller <davem@davemloft.net>
> Cc: Paolo Abeni <pabeni@redhat.com>
> Cc: Eric Dumazet <edumazet@google.com>
> Signed-off-by: Riana Tauro <riana.tauro@intel.com>

Reviewed-by: Raag Jadav <raag.jadav@intel.com>

^ permalink raw reply	[flat|nested] 9+ messages in thread

* Re: [PATCH 3/4] drm/drm_ras: Add DRM RAS netlink error event notification
  2026-03-11 10:29 ` [PATCH 3/4] drm/drm_ras: Add DRM RAS netlink error event notification Riana Tauro
@ 2026-03-25 13:31   ` Raag Jadav
  0 siblings, 0 replies; 9+ messages in thread
From: Raag Jadav @ 2026-03-25 13:31 UTC (permalink / raw)
  To: Riana Tauro
  Cc: intel-xe, dri-devel, netdev, aravind.iddamsetty, anshuman.gupta,
	rodrigo.vivi, joonas.lahtinen, simona.vetter, airlied,
	pratik.bari, joshua.santosh.ranjan, ashwin.kumar.kulkarni,
	shubham.kumar, ravi.kishore.koppuravuri, anvesh.bakwad,
	maarten.lankhorst, Jakub Kicinski, Zack McKevitt, Lijo Lazar,
	Hawking Zhang, David S. Miller, Paolo Abeni, Eric Dumazet

On Wed, Mar 11, 2026 at 03:59:17PM +0530, Riana Tauro wrote:
> Add support for asynchronous error notifications in drm_ras.

It's either drm_ras or DRM RAS, make it consistent in all patches
(both commit message and subject).

> Define a new `error-event` netlink event and a new multicast
> group `error-notify` in drm_ras spec. Each event contains
> a node-id and error-id to identify the type and source
> of error.
> 
> Add drm_ras_error_notify() to trigger this event from drivers.
> Userspace can receive this event by subscribing to the
> multicast group error-notify.
> 
> Example: Using ynl tool

Ditto. Either Usage or Example, make it consistent in all patches.

Also, please utilize the full 75 character space where possible.

> $ sudo ynl --family drm_ras --subscribe error-notify
> 
> Cc: Jakub Kicinski <kuba@kernel.org>
> Cc: Zack McKevitt <zachary.mckevitt@oss.qualcomm.com>
> Cc: Lijo Lazar <lijo.lazar@amd.com>
> Cc: Hawking Zhang <Hawking.Zhang@amd.com>
> Cc: David S. Miller <davem@davemloft.net>
> Cc: Paolo Abeni <pabeni@redhat.com>
> Cc: Eric Dumazet <edumazet@google.com>
> Signed-off-by: Riana Tauro <riana.tauro@intel.com>
> ---
>  Documentation/gpu/drm-ras.rst            |  9 +++++
>  Documentation/netlink/specs/drm_ras.yaml | 14 +++++++
>  drivers/gpu/drm/drm_ras.c                | 48 ++++++++++++++++++++++++
>  drivers/gpu/drm/drm_ras_nl.c             |  6 +++
>  drivers/gpu/drm/drm_ras_nl.h             |  4 ++
>  include/drm/drm_ras.h                    |  2 +
>  include/uapi/drm/drm_ras.h               |  3 ++
>  7 files changed, 86 insertions(+)
> 
> diff --git a/Documentation/gpu/drm-ras.rst b/Documentation/gpu/drm-ras.rst
> index 4636e68f5678..09b2918f67bd 100644
> --- a/Documentation/gpu/drm-ras.rst
> +++ b/Documentation/gpu/drm-ras.rst
> @@ -54,6 +54,8 @@ User space tools can:
>    ``node-id`` and ``error-id`` as parameters.
>  * Clear specific error counters with the ``clear-error-counter`` command, using both
>    ``node-id`` and ``error-id`` as parameters.
> +* Listen to ``error-event`` notifications for error events by subscribing to the
> +  ``error-notify`` multicast group.
>  
>  YAML-based Interface
>  --------------------
> @@ -109,3 +111,10 @@ Example: Clear an error counter for a given node
>  
>      sudo ynl --family drm_ras --do clear-error-counter --json '{"node-id":0, "error-id":1}'
>      None
> +
> +Example: Listen to error events
> +
> +.. code-block:: bash
> +
> +    sudo ynl --family drm_ras --subscribe error-notify
> +    {'msg': {'error-id': 1, 'node-id': 1}, 'name': 'error-event'}

Can we also have error-name and node-name? I'd be pulling my hair off
if I need to remember all the ids.

On that note, I think it'll be good to have them as part of request
attributes as an alternative to ids (also for existing commands) but
that can done as a follow up.

Also, what if I have multiple devices with multiple nodes. Do they need
separate subscription?

Raag

^ permalink raw reply	[flat|nested] 9+ messages in thread

end of thread, other threads:[~2026-03-25 13:31 UTC | newest]

Thread overview: 9+ messages (download: mbox.gz follow: Atom feed
-- links below jump to the message on this page --
2026-03-11 10:29 [PATCH 0/4] Add support for clear counter and error event in DRM RAS Riana Tauro
2026-03-11 10:29 ` [PATCH 1/4] drm/drm_ras: Add clear-error-counter netlink command to drm_ras Riana Tauro
2026-03-12  0:29   ` Jakub Kicinski
2026-03-25 12:40   ` Raag Jadav
2026-03-11 10:29 ` [PATCH 2/4] drm/xe/xe_drm_ras: Add support for clear-error-counter in XE DRM RAS Riana Tauro
2026-03-12 10:17   ` Raag Jadav
2026-03-11 10:29 ` [PATCH 3/4] drm/drm_ras: Add DRM RAS netlink error event notification Riana Tauro
2026-03-25 13:31   ` Raag Jadav
2026-03-11 10:29 ` [PATCH 4/4] drm/xe/xe_drm_ras: Add error-event support in XE DRM RAS Riana Tauro

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox