* [PATCH 0/4] Add support for clear counter and error event in DRM RAS
@ 2026-03-11 10:29 Riana Tauro
2026-03-11 10:29 ` [PATCH 1/4] drm/drm_ras: Add clear-error-counter netlink command to drm_ras Riana Tauro
` (3 more replies)
0 siblings, 4 replies; 11+ messages in thread
From: Riana Tauro @ 2026-03-11 10:29 UTC
To: intel-xe, dri-devel, netdev
Cc: aravind.iddamsetty, anshuman.gupta, rodrigo.vivi, joonas.lahtinen,
simona.vetter, airlied, pratik.bari, joshua.santosh.ranjan,
ashwin.kumar.kulkarni, shubham.kumar, ravi.kishore.koppuravuri,
raag.jadav, anvesh.bakwad, maarten.lankhorst, Riana Tauro
Clear Error Counter: Add a clear-error-counter command to DRM RAS to clear
a specific error counter of a node. Implement the callback in the XE driver
to demonstrate usage.
Usage with both get-error-counter and clear-error-counter:
$ sudo ynl --family drm_ras --dump get-error-counter --json '{"node-id":1}'
[{'error-id': 1, 'error-name': 'core-compute', 'error-value': 0},
{'error-id': 2, 'error-name': 'soc-internal', 'error-value': 3}]
$ sudo ynl --family drm_ras --do clear-error-counter --json \
'{"node-id":1, "error-id":2}'
None
$ sudo ynl --family drm_ras --dump get-error-counter --json '{"node-id":1}'
[{'error-id': 1, 'error-name': 'core-compute', 'error-value': 0},
{'error-id': 2, 'error-name': 'soc-internal', 'error-value': 0}]
Error Event Support: Introduce `error-event` support in DRM RAS to notify
userspace whenever an error occurs.
Each notification includes the node-id and error-id to identify
the source and type of the error. To receive notifications,
userspace must subscribe to the 'error-notify' multicast group.
$ sudo ynl --family drm_ras --subscribe error-notify
{'msg': {'error-id': 2, 'node-id': 1}, 'name': 'error-event'}
Riana Tauro (4):
drm/drm_ras: Add clear-error-counter netlink command to drm_ras
drm/xe/xe_drm_ras: Add support for clear-error-counter in XE DRM RAS
drm/drm_ras: Add DRM RAS netlink error event notification
drm/xe/xe_drm_ras: Add error-event support in XE DRM RAS
Documentation/gpu/drm-ras.rst | 17 +++++
Documentation/netlink/specs/drm_ras.yaml | 27 ++++++-
drivers/gpu/drm/drm_ras.c | 91 +++++++++++++++++++++++-
drivers/gpu/drm/drm_ras_nl.c | 19 +++++
drivers/gpu/drm/drm_ras_nl.h | 6 ++
drivers/gpu/drm/xe/xe_drm_ras.c | 52 +++++++++++++-
drivers/gpu/drm/xe/xe_drm_ras.h | 7 ++
drivers/gpu/drm/xe/xe_hw_error.c | 5 ++
include/drm/drm_ras.h | 13 ++++
include/uapi/drm/drm_ras.h | 4 ++
10 files changed, 237 insertions(+), 4 deletions(-)
--
2.47.1
* [PATCH 1/4] drm/drm_ras: Add clear-error-counter netlink command to drm_ras
2026-03-11 10:29 [PATCH 0/4] Add support for clear counter and error event in DRM RAS Riana Tauro
@ 2026-03-11 10:29 ` Riana Tauro
2026-03-12 0:29 ` Jakub Kicinski
2026-03-25 12:40 ` Raag Jadav
2026-03-11 10:29 ` [PATCH 2/4] drm/xe/xe_drm_ras: Add support for clear-error-counter in XE DRM RAS Riana Tauro
` (2 subsequent siblings)
3 siblings, 2 replies; 11+ messages in thread
From: Riana Tauro @ 2026-03-11 10:29 UTC
To: intel-xe, dri-devel, netdev
Cc: aravind.iddamsetty, anshuman.gupta, rodrigo.vivi, joonas.lahtinen,
simona.vetter, airlied, pratik.bari, joshua.santosh.ranjan,
ashwin.kumar.kulkarni, shubham.kumar, ravi.kishore.koppuravuri,
raag.jadav, anvesh.bakwad, maarten.lankhorst, Riana Tauro,
Jakub Kicinski, Zack McKevitt, Lijo Lazar, Hawking Zhang,
David S. Miller, Paolo Abeni, Eric Dumazet
Introduce a new 'clear-error-counter' DRM RAS command to reset the counter
value for a specific error counter of a given node.
The command is a 'do' netlink request with 'node-id' and 'error-id'
as parameters with no additional response payload.
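The order of checks performed by the new doit handler can be sketched in
plain userspace C. This is only a model under stated assumptions, not the
kernel code: `struct model_node`, `model_clear()` and `model_driver_clear()`
are hypothetical names, and the xarray lookup is replaced by a pointer that
may be NULL.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical stand-in for struct drm_ras_node; only the fields that
 * the clear path reads are modelled here. */
struct model_node {
	uint32_t first;		/* lowest valid error-id */
	uint32_t last;		/* highest valid error-id */
	int (*clear_error_counter)(struct model_node *node, uint32_t error_id);
	uint32_t *counters;	/* indexed by error-id, owned by the "driver" */
};

/* Hypothetical driver callback: zero exactly one counter. */
static int model_driver_clear(struct model_node *node, uint32_t error_id)
{
	assert(node->counters != NULL);
	node->counters[error_id] = 0;
	return 0;
}

/* Same order of checks as drm_ras_nl_clear_error_counter_doit():
 * missing node or missing callback gives -ENOENT, an error-id outside
 * the node's range gives -EINVAL, otherwise the callback decides. */
static int model_clear(struct model_node *node, uint32_t error_id)
{
	if (!node || !node->clear_error_counter)
		return -ENOENT;

	if (error_id < node->first || error_id > node->last)
		return -EINVAL;

	return node->clear_error_counter(node, error_id);
}
```

Because the range check happens before the callback runs, a driver callback
in this model only ever sees an error-id between `first` and `last`.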
Usage
$ sudo ynl --family drm_ras --do clear-error-counter --json \
'{"node-id":1, "error-id":1}'
None
Cc: Jakub Kicinski <kuba@kernel.org>
Cc: Zack McKevitt <zachary.mckevitt@oss.qualcomm.com>
Cc: Lijo Lazar <lijo.lazar@amd.com>
Cc: Hawking Zhang <Hawking.Zhang@amd.com>
Cc: David S. Miller <davem@davemloft.net>
Cc: Paolo Abeni <pabeni@redhat.com>
Cc: Eric Dumazet <edumazet@google.com>
Signed-off-by: Riana Tauro <riana.tauro@intel.com>
---
Documentation/gpu/drm-ras.rst | 8 +++++
Documentation/netlink/specs/drm_ras.yaml | 13 ++++++-
drivers/gpu/drm/drm_ras.c | 43 +++++++++++++++++++++++-
drivers/gpu/drm/drm_ras_nl.c | 13 +++++++
drivers/gpu/drm/drm_ras_nl.h | 2 ++
include/drm/drm_ras.h | 11 ++++++
include/uapi/drm/drm_ras.h | 1 +
7 files changed, 89 insertions(+), 2 deletions(-)
diff --git a/Documentation/gpu/drm-ras.rst b/Documentation/gpu/drm-ras.rst
index 70b246a78fc8..4636e68f5678 100644
--- a/Documentation/gpu/drm-ras.rst
+++ b/Documentation/gpu/drm-ras.rst
@@ -52,6 +52,8 @@ User space tools can:
as a parameter.
* Query specific error counter values with the ``get-error-counter`` command, using both
``node-id`` and ``error-id`` as parameters.
+* Clear specific error counters with the ``clear-error-counter`` command, using both
+ ``node-id`` and ``error-id`` as parameters.
YAML-based Interface
--------------------
@@ -101,3 +103,9 @@ Example: Query an error counter for a given node
sudo ynl --family drm_ras --do get-error-counter --json '{"node-id":0, "error-id":1}'
{'error-id': 1, 'error-name': 'error_name1', 'error-value': 0}
+Example: Clear an error counter for a given node
+
+.. code-block:: bash
+
+ sudo ynl --family drm_ras --do clear-error-counter --json '{"node-id":0, "error-id":1}'
+ None
diff --git a/Documentation/netlink/specs/drm_ras.yaml b/Documentation/netlink/specs/drm_ras.yaml
index 79af25dac3c5..e113056f8c01 100644
--- a/Documentation/netlink/specs/drm_ras.yaml
+++ b/Documentation/netlink/specs/drm_ras.yaml
@@ -99,7 +99,7 @@ operations:
flags: [admin-perm]
do:
request:
- attributes:
+ attributes: &id-attrs
- node-id
- error-id
reply:
@@ -113,3 +113,14 @@ operations:
- node-id
reply:
attributes: *errorinfo
+ -
+ name: clear-error-counter
+ doc: >-
+ Clear error counter for a given node.
+ The request includes the error-id and node-id of the
+ counter to be cleared.
+ attribute-set: error-counter-attrs
+ flags: [admin-perm]
+ do:
+ request:
+ attributes: *id-attrs
diff --git a/drivers/gpu/drm/drm_ras.c b/drivers/gpu/drm/drm_ras.c
index b2fa5ab86d87..d6eab29a1394 100644
--- a/drivers/gpu/drm/drm_ras.c
+++ b/drivers/gpu/drm/drm_ras.c
@@ -26,7 +26,7 @@
* efficient lookup by ID. Nodes can be registered or unregistered
* dynamically at runtime.
*
- * A Generic Netlink family `drm_ras` exposes two main operations to
+ * A Generic Netlink family `drm_ras` exposes the below operations to
* userspace:
*
* 1. LIST_NODES: Dump all currently registered RAS nodes.
@@ -37,6 +37,10 @@
* Returns all counters of a node if only Node ID is provided or specific
* error counters.
*
+ * 3. CLEAR_ERROR_COUNTER: Clear error counter of a given node.
+ * Userspace must provide Node ID, Error ID.
+ * Clears specific error counter of a node if supported.
+ *
* Node registration:
*
* - drm_ras_node_register(): Registers a new node and assigns
@@ -66,6 +70,8 @@
* operation, fetching all counters from a specific node.
* - drm_ras_nl_get_error_counter_doit(): Implements the GET_ERROR_COUNTER doit
* operation, fetching a counter value from a specific node.
+ * - drm_ras_nl_clear_error_counter_doit(): Implements the CLEAR_ERROR_COUNTER doit
+ * operation, clearing a counter value from a specific node.
*/
static DEFINE_XARRAY_ALLOC(drm_ras_xa);
@@ -314,6 +320,41 @@ int drm_ras_nl_get_error_counter_doit(struct sk_buff *skb,
return doit_reply_value(info, node_id, error_id);
}
+/**
+ * drm_ras_nl_clear_error_counter_doit() - Clear an error counter of a node
+ * @skb: Netlink message buffer
+ * @info: Generic Netlink info containing attributes of the request
+ *
+ * Extracts the node ID and error ID from the netlink attributes and
+ * clears the current value.
+ *
+ * Return: 0 on success, or negative errno on failure.
+ */
+int drm_ras_nl_clear_error_counter_doit(struct sk_buff *skb,
+ struct genl_info *info)
+{
+ struct drm_ras_node *node;
+ u32 node_id, error_id;
+
+ if (!info->attrs ||
+ GENL_REQ_ATTR_CHECK(info, DRM_RAS_A_ERROR_COUNTER_ATTRS_NODE_ID) ||
+ GENL_REQ_ATTR_CHECK(info, DRM_RAS_A_ERROR_COUNTER_ATTRS_ERROR_ID))
+ return -EINVAL;
+
+ node_id = nla_get_u32(info->attrs[DRM_RAS_A_ERROR_COUNTER_ATTRS_NODE_ID]);
+ error_id = nla_get_u32(info->attrs[DRM_RAS_A_ERROR_COUNTER_ATTRS_ERROR_ID]);
+
+ node = xa_load(&drm_ras_xa, node_id);
+ if (!node || !node->clear_error_counter)
+ return -ENOENT;
+
+ if (error_id < node->error_counter_range.first ||
+ error_id > node->error_counter_range.last)
+ return -EINVAL;
+
+ return node->clear_error_counter(node, error_id);
+}
+
/**
* drm_ras_node_register() - Register a new RAS node
* @node: Node structure to register
diff --git a/drivers/gpu/drm/drm_ras_nl.c b/drivers/gpu/drm/drm_ras_nl.c
index 16803d0c4a44..dea1c1b2494e 100644
--- a/drivers/gpu/drm/drm_ras_nl.c
+++ b/drivers/gpu/drm/drm_ras_nl.c
@@ -22,6 +22,12 @@ static const struct nla_policy drm_ras_get_error_counter_dump_nl_policy[DRM_RAS_
[DRM_RAS_A_ERROR_COUNTER_ATTRS_NODE_ID] = { .type = NLA_U32, },
};
+/* DRM_RAS_CMD_CLEAR_ERROR_COUNTER - do */
+static const struct nla_policy drm_ras_clear_error_counter_nl_policy[DRM_RAS_A_ERROR_COUNTER_ATTRS_ERROR_ID + 1] = {
+ [DRM_RAS_A_ERROR_COUNTER_ATTRS_NODE_ID] = { .type = NLA_U32, },
+ [DRM_RAS_A_ERROR_COUNTER_ATTRS_ERROR_ID] = { .type = NLA_U32, },
+};
+
/* Ops table for drm_ras */
static const struct genl_split_ops drm_ras_nl_ops[] = {
{
@@ -43,6 +49,13 @@ static const struct genl_split_ops drm_ras_nl_ops[] = {
.maxattr = DRM_RAS_A_ERROR_COUNTER_ATTRS_NODE_ID,
.flags = GENL_ADMIN_PERM | GENL_CMD_CAP_DUMP,
},
+ {
+ .cmd = DRM_RAS_CMD_CLEAR_ERROR_COUNTER,
+ .doit = drm_ras_nl_clear_error_counter_doit,
+ .policy = drm_ras_clear_error_counter_nl_policy,
+ .maxattr = DRM_RAS_A_ERROR_COUNTER_ATTRS_ERROR_ID,
+ .flags = GENL_ADMIN_PERM | GENL_CMD_CAP_DO,
+ },
};
struct genl_family drm_ras_nl_family __ro_after_init = {
diff --git a/drivers/gpu/drm/drm_ras_nl.h b/drivers/gpu/drm/drm_ras_nl.h
index 06ccd9342773..a398643572a5 100644
--- a/drivers/gpu/drm/drm_ras_nl.h
+++ b/drivers/gpu/drm/drm_ras_nl.h
@@ -18,6 +18,8 @@ int drm_ras_nl_get_error_counter_doit(struct sk_buff *skb,
struct genl_info *info);
int drm_ras_nl_get_error_counter_dumpit(struct sk_buff *skb,
struct netlink_callback *cb);
+int drm_ras_nl_clear_error_counter_doit(struct sk_buff *skb,
+ struct genl_info *info);
extern struct genl_family drm_ras_nl_family;
diff --git a/include/drm/drm_ras.h b/include/drm/drm_ras.h
index 5d50209e51db..f2a787bc4f64 100644
--- a/include/drm/drm_ras.h
+++ b/include/drm/drm_ras.h
@@ -58,6 +58,17 @@ struct drm_ras_node {
int (*query_error_counter)(struct drm_ras_node *node, u32 error_id,
const char **name, u32 *val);
+ /**
+ * @clear_error_counter:
+ *
+ * This callback is used by drm_ras to clear a specific error counter.
+ * Driver should implement this callback to support clearing error counters
+ * of a node.
+ *
+ * Returns: 0 on success, negative error code on failure.
+ */
+ int (*clear_error_counter)(struct drm_ras_node *node, u32 error_id);
+
/** @priv: Driver private data */
void *priv;
};
diff --git a/include/uapi/drm/drm_ras.h b/include/uapi/drm/drm_ras.h
index 5f40fa5b869d..218a3ee86805 100644
--- a/include/uapi/drm/drm_ras.h
+++ b/include/uapi/drm/drm_ras.h
@@ -41,6 +41,7 @@ enum {
enum {
DRM_RAS_CMD_LIST_NODES = 1,
DRM_RAS_CMD_GET_ERROR_COUNTER,
+ DRM_RAS_CMD_CLEAR_ERROR_COUNTER,
__DRM_RAS_CMD_MAX,
DRM_RAS_CMD_MAX = (__DRM_RAS_CMD_MAX - 1)
--
2.47.1
* [PATCH 2/4] drm/xe/xe_drm_ras: Add support for clear-error-counter in XE DRM RAS
2026-03-11 10:29 [PATCH 0/4] Add support for clear counter and error event in DRM RAS Riana Tauro
2026-03-11 10:29 ` [PATCH 1/4] drm/drm_ras: Add clear-error-counter netlink command to drm_ras Riana Tauro
@ 2026-03-11 10:29 ` Riana Tauro
2026-03-12 10:17 ` Raag Jadav
2026-03-11 10:29 ` [PATCH 3/4] drm/drm_ras: Add DRM RAS netlink error event notification Riana Tauro
2026-03-11 10:29 ` [PATCH 4/4] drm/xe/xe_drm_ras: Add error-event support in XE DRM RAS Riana Tauro
3 siblings, 1 reply; 11+ messages in thread
From: Riana Tauro @ 2026-03-11 10:29 UTC
To: intel-xe, dri-devel, netdev
Cc: aravind.iddamsetty, anshuman.gupta, rodrigo.vivi, joonas.lahtinen,
simona.vetter, airlied, pratik.bari, joshua.santosh.ranjan,
ashwin.kumar.kulkarni, shubham.kumar, ravi.kishore.koppuravuri,
raag.jadav, anvesh.bakwad, maarten.lankhorst, Riana Tauro
Add support for clear-error-counter command in XE DRM RAS.
This resets the counter value.
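As a rough illustration of what "resets the counter value" means for a
counter that an error handler may bump concurrently, here is a userspace
sketch using C11 atomics in place of the kernel's atomic_t. The names
`struct model_counter`, `model_hw_record()` and `model_hw_clear()` are
hypothetical, and the increment path is an assumption, since it is not part
of this patch.

```c
#include <assert.h>
#include <errno.h>
#include <stdatomic.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical stand-in for struct xe_drm_ras_counter. */
struct model_counter {
	const char *name;	/* NULL marks a slot with no counter */
	atomic_int counter;
};

/* Assumed increment path, as an error handler might do it. */
static void model_hw_record(struct model_counter *info, uint32_t error_id)
{
	assert(info && info[error_id].name);
	atomic_fetch_add(&info[error_id].counter, 1);
}

/* Same shape as hw_clear_error_counter(): reject a missing table or an
 * unnamed slot, otherwise store zero atomically. */
static int model_hw_clear(struct model_counter *info, uint32_t error_id)
{
	if (!info || !info[error_id].name)
		return -ENOENT;

	atomic_store(&info[error_id].counter, 0);

	return 0;
}
```

An atomic store keeps the reset from tearing against a concurrent
increment. An error recorded at the same moment lands either before the
reset (and is cleared with it) or after it (and is counted from zero).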
Usage:
$ sudo ynl --family drm_ras --do clear-error-counter --json \
'{"node-id":1, "error-id":1}'
None
Signed-off-by: Riana Tauro <riana.tauro@intel.com>
---
drivers/gpu/drm/xe/xe_drm_ras.c | 35 +++++++++++++++++++++++++++++++--
1 file changed, 33 insertions(+), 2 deletions(-)
diff --git a/drivers/gpu/drm/xe/xe_drm_ras.c b/drivers/gpu/drm/xe/xe_drm_ras.c
index e07dc23a155e..c21c8b428de6 100644
--- a/drivers/gpu/drm/xe/xe_drm_ras.c
+++ b/drivers/gpu/drm/xe/xe_drm_ras.c
@@ -27,6 +27,16 @@ static int hw_query_error_counter(struct xe_drm_ras_counter *info,
return 0;
}
+static int hw_clear_error_counter(struct xe_drm_ras_counter *info, u32 error_id)
+{
+ if (!info || !info[error_id].name)
+ return -ENOENT;
+
+ atomic_set(&info[error_id].counter, 0);
+
+ return 0;
+}
+
static int query_uncorrectable_error_counter(struct drm_ras_node *ep, u32 error_id,
const char **name, u32 *val)
{
@@ -37,6 +47,15 @@ static int query_uncorrectable_error_counter(struct drm_ras_node *ep, u32 error_
return hw_query_error_counter(info, error_id, name, val);
}
+static int clear_uncorrectable_error_counter(struct drm_ras_node *node, u32 error_id)
+{
+ struct xe_device *xe = node->priv;
+ struct xe_drm_ras *ras = &xe->ras;
+ struct xe_drm_ras_counter *info = ras->info[DRM_XE_RAS_ERR_SEV_UNCORRECTABLE];
+
+ return hw_clear_error_counter(info, error_id);
+}
+
static int query_correctable_error_counter(struct drm_ras_node *ep, u32 error_id,
const char **name, u32 *val)
{
@@ -47,6 +66,15 @@ static int query_correctable_error_counter(struct drm_ras_node *ep, u32 error_id
return hw_query_error_counter(info, error_id, name, val);
}
+static int clear_correctable_error_counter(struct drm_ras_node *node, u32 error_id)
+{
+ struct xe_device *xe = node->priv;
+ struct xe_drm_ras *ras = &xe->ras;
+ struct xe_drm_ras_counter *info = ras->info[DRM_XE_RAS_ERR_SEV_CORRECTABLE];
+
+ return hw_clear_error_counter(info, error_id);
+}
+
static struct xe_drm_ras_counter *allocate_and_copy_counters(struct xe_device *xe)
{
struct xe_drm_ras_counter *counter;
@@ -92,10 +120,13 @@ static int assign_node_params(struct xe_device *xe, struct drm_ras_node *node,
if (IS_ERR(ras->info[severity]))
return PTR_ERR(ras->info[severity]);
- if (severity == DRM_XE_RAS_ERR_SEV_CORRECTABLE)
+ if (severity == DRM_XE_RAS_ERR_SEV_CORRECTABLE) {
node->query_error_counter = query_correctable_error_counter;
- else
+ node->clear_error_counter = clear_correctable_error_counter;
+ } else {
node->query_error_counter = query_uncorrectable_error_counter;
+ node->clear_error_counter = clear_uncorrectable_error_counter;
+ }
return 0;
}
--
2.47.1
* [PATCH 3/4] drm/drm_ras: Add DRM RAS netlink error event notification
2026-03-11 10:29 [PATCH 0/4] Add support for clear counter and error event in DRM RAS Riana Tauro
2026-03-11 10:29 ` [PATCH 1/4] drm/drm_ras: Add clear-error-counter netlink command to drm_ras Riana Tauro
2026-03-11 10:29 ` [PATCH 2/4] drm/xe/xe_drm_ras: Add support for clear-error-counter in XE DRM RAS Riana Tauro
@ 2026-03-11 10:29 ` Riana Tauro
2026-03-25 13:31 ` Raag Jadav
2026-03-11 10:29 ` [PATCH 4/4] drm/xe/xe_drm_ras: Add error-event support in XE DRM RAS Riana Tauro
3 siblings, 1 reply; 11+ messages in thread
From: Riana Tauro @ 2026-03-11 10:29 UTC
To: intel-xe, dri-devel, netdev
Cc: aravind.iddamsetty, anshuman.gupta, rodrigo.vivi, joonas.lahtinen,
simona.vetter, airlied, pratik.bari, joshua.santosh.ranjan,
ashwin.kumar.kulkarni, shubham.kumar, ravi.kishore.koppuravuri,
raag.jadav, anvesh.bakwad, maarten.lankhorst, Riana Tauro,
Jakub Kicinski, Zack McKevitt, Lijo Lazar, Hawking Zhang,
David S. Miller, Paolo Abeni, Eric Dumazet
Add support for asynchronous error notifications in drm_ras.
Define a new `error-event` netlink event and a new multicast
group `error-notify` in drm_ras spec. Each event contains
a node-id and error-id to identify the type and source
of error.
Add drm_ras_error_notify() to trigger this event from drivers.
Userspace can receive this event by subscribing to the
multicast group error-notify.
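For a listener that parses the event by hand rather than through ynl, the
payload is two u32 netlink attributes placed after the generic netlink
header. The sketch below models only that attribute layout in userspace C.
The helper names are hypothetical, and the attribute type numbers 1 and 2
are an assumption for illustration; the real values come from the
DRM_RAS_A_ERROR_COUNTER_ATTRS_* enum in the uapi header.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

#define MODEL_NLA_HDRLEN	4u
#define MODEL_NLA_ALIGN(len)	(((len) + 3u) & ~3u)

/* Assumed attribute type numbers, for illustration only. */
enum { MODEL_ATTR_NODE_ID = 1, MODEL_ATTR_ERROR_ID = 2 };

/* Append one u32 attribute (u16 length, u16 type, payload, all in host
 * byte order) and return the offset where the next attribute starts. */
static size_t model_put_u32(uint8_t *buf, size_t off, size_t cap,
			    uint16_t type, uint32_t val)
{
	uint16_t len = (uint16_t)(MODEL_NLA_HDRLEN + sizeof(val));

	assert(off + MODEL_NLA_ALIGN(len) <= cap);
	memcpy(buf + off, &len, sizeof(len));
	memcpy(buf + off + 2, &type, sizeof(type));
	memcpy(buf + off + MODEL_NLA_HDRLEN, &val, sizeof(val));

	return off + MODEL_NLA_ALIGN(len);
}

/* Walk the attributes and copy out the first u32 of the given type.
 * Returns 0 on success, -1 if the attribute is absent or malformed. */
static int model_get_u32(const uint8_t *buf, size_t size,
			 uint16_t type, uint32_t *val)
{
	size_t off = 0;

	while (off + MODEL_NLA_HDRLEN <= size) {
		uint16_t len, t;

		memcpy(&len, buf + off, sizeof(len));
		memcpy(&t, buf + off + 2, sizeof(t));
		if (len < MODEL_NLA_HDRLEN || off + len > size)
			return -1;
		if (t == type && len == MODEL_NLA_HDRLEN + sizeof(*val)) {
			memcpy(val, buf + off + MODEL_NLA_HDRLEN, sizeof(*val));
			return 0;
		}
		off += MODEL_NLA_ALIGN(len);
	}

	return -1;
}
```

Each u32 attribute occupies 8 bytes (4 bytes of header plus 4 bytes of
payload), so an event carrying only node-id and error-id has a 16-byte
attribute area in this model.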
Example: Using ynl tool
$ sudo ynl --family drm_ras --subscribe error-notify
Cc: Jakub Kicinski <kuba@kernel.org>
Cc: Zack McKevitt <zachary.mckevitt@oss.qualcomm.com>
Cc: Lijo Lazar <lijo.lazar@amd.com>
Cc: Hawking Zhang <Hawking.Zhang@amd.com>
Cc: David S. Miller <davem@davemloft.net>
Cc: Paolo Abeni <pabeni@redhat.com>
Cc: Eric Dumazet <edumazet@google.com>
Signed-off-by: Riana Tauro <riana.tauro@intel.com>
---
Documentation/gpu/drm-ras.rst | 9 +++++
Documentation/netlink/specs/drm_ras.yaml | 14 +++++++
drivers/gpu/drm/drm_ras.c | 48 ++++++++++++++++++++++++
drivers/gpu/drm/drm_ras_nl.c | 6 +++
drivers/gpu/drm/drm_ras_nl.h | 4 ++
include/drm/drm_ras.h | 2 +
include/uapi/drm/drm_ras.h | 3 ++
7 files changed, 86 insertions(+)
diff --git a/Documentation/gpu/drm-ras.rst b/Documentation/gpu/drm-ras.rst
index 4636e68f5678..09b2918f67bd 100644
--- a/Documentation/gpu/drm-ras.rst
+++ b/Documentation/gpu/drm-ras.rst
@@ -54,6 +54,8 @@ User space tools can:
``node-id`` and ``error-id`` as parameters.
* Clear specific error counters with the ``clear-error-counter`` command, using both
``node-id`` and ``error-id`` as parameters.
+* Listen to ``error-event`` notifications for error events by subscribing to the
+ ``error-notify`` multicast group.
YAML-based Interface
--------------------
@@ -109,3 +111,10 @@ Example: Clear an error counter for a given node
sudo ynl --family drm_ras --do clear-error-counter --json '{"node-id":0, "error-id":1}'
None
+
+Example: Listen to error events
+
+.. code-block:: bash
+
+ sudo ynl --family drm_ras --subscribe error-notify
+ {'msg': {'error-id': 1, 'node-id': 1}, 'name': 'error-event'}
diff --git a/Documentation/netlink/specs/drm_ras.yaml b/Documentation/netlink/specs/drm_ras.yaml
index e113056f8c01..4dc047be59e9 100644
--- a/Documentation/netlink/specs/drm_ras.yaml
+++ b/Documentation/netlink/specs/drm_ras.yaml
@@ -124,3 +124,17 @@ operations:
do:
request:
attributes: *id-attrs
+ -
+ name: error-event
+ doc: >-
+ Notify userspace of an error event.
+ The event includes the error-id and node-id of the error
+ that triggered the event.
+ attribute-set: error-counter-attrs
+ event:
+ attributes: *id-attrs
+
+mcast-groups:
+ list:
+ -
+ name: error-notify
diff --git a/drivers/gpu/drm/drm_ras.c b/drivers/gpu/drm/drm_ras.c
index d6eab29a1394..36a3a79cbbea 100644
--- a/drivers/gpu/drm/drm_ras.c
+++ b/drivers/gpu/drm/drm_ras.c
@@ -41,6 +41,10 @@
* Userspace must provide Node ID, Error ID.
* Clears specific error counter of a node if supported.
*
+ * 4. ERROR_EVENT: Notify userspace of an error event.
+ * The event includes the error-id and node-id of the error
+ * that triggered the event.
+ *
* Node registration:
*
* - drm_ras_node_register(): Registers a new node and assigns
@@ -355,6 +359,50 @@ int drm_ras_nl_clear_error_counter_doit(struct sk_buff *skb,
return node->clear_error_counter(node, error_id);
}
+/**
+ * drm_ras_error_notify() - Notify userspace of an error event
+ * @node: Node structure
+ * @error_id: ID of the error counter that triggered the event
+ * @flags: GFP flags for memory allocation
+ *
+ * Notifies userspace of an error event related to a specific RAS node and error counter.
+ */
+void drm_ras_error_notify(struct drm_ras_node *node, u32 error_id, gfp_t flags)
+{
+ struct genl_info info;
+ struct sk_buff *msg;
+ struct nlattr *hdr;
+ int ret;
+
+ genl_info_init_ntf(&info, &drm_ras_nl_family, DRM_RAS_CMD_ERROR_EVENT);
+
+ msg = genlmsg_new(NLMSG_GOODSIZE, flags);
+ if (!msg)
+ return;
+
+ hdr = genlmsg_iput(msg, &info);
+ if (!hdr)
+ goto err_free;
+
+ ret = nla_put_u32(msg, DRM_RAS_A_ERROR_COUNTER_ATTRS_NODE_ID, node->id);
+ if (ret)
+ goto err_cancel;
+
+ ret = nla_put_u32(msg, DRM_RAS_A_ERROR_COUNTER_ATTRS_ERROR_ID, error_id);
+ if (ret)
+ goto err_cancel;
+
+ genlmsg_end(msg, hdr);
+ genlmsg_multicast(&drm_ras_nl_family, msg, 0, DRM_RAS_NLGRP_ERROR_NOTIFY, flags);
+ return;
+
+err_cancel:
+ genlmsg_cancel(msg, hdr);
+err_free:
+ nlmsg_free(msg);
+}
+EXPORT_SYMBOL(drm_ras_error_notify);
+
/**
* drm_ras_node_register() - Register a new RAS node
* @node: Node structure to register
diff --git a/drivers/gpu/drm/drm_ras_nl.c b/drivers/gpu/drm/drm_ras_nl.c
index dea1c1b2494e..ac724bb87a3b 100644
--- a/drivers/gpu/drm/drm_ras_nl.c
+++ b/drivers/gpu/drm/drm_ras_nl.c
@@ -58,6 +58,10 @@ static const struct genl_split_ops drm_ras_nl_ops[] = {
},
};
+static const struct genl_multicast_group drm_ras_nl_mcgrps[] = {
+ [DRM_RAS_NLGRP_ERROR_NOTIFY] = { "error-notify", },
+};
+
struct genl_family drm_ras_nl_family __ro_after_init = {
.name = DRM_RAS_FAMILY_NAME,
.version = DRM_RAS_FAMILY_VERSION,
@@ -66,4 +70,6 @@ struct genl_family drm_ras_nl_family __ro_after_init = {
.module = THIS_MODULE,
.split_ops = drm_ras_nl_ops,
.n_split_ops = ARRAY_SIZE(drm_ras_nl_ops),
+ .mcgrps = drm_ras_nl_mcgrps,
+ .n_mcgrps = ARRAY_SIZE(drm_ras_nl_mcgrps),
};
diff --git a/drivers/gpu/drm/drm_ras_nl.h b/drivers/gpu/drm/drm_ras_nl.h
index a398643572a5..17e1af8cc3b3 100644
--- a/drivers/gpu/drm/drm_ras_nl.h
+++ b/drivers/gpu/drm/drm_ras_nl.h
@@ -21,6 +21,10 @@ int drm_ras_nl_get_error_counter_dumpit(struct sk_buff *skb,
int drm_ras_nl_clear_error_counter_doit(struct sk_buff *skb,
struct genl_info *info);
+enum {
+ DRM_RAS_NLGRP_ERROR_NOTIFY,
+};
+
extern struct genl_family drm_ras_nl_family;
#endif /* _LINUX_DRM_RAS_GEN_H */
diff --git a/include/drm/drm_ras.h b/include/drm/drm_ras.h
index f2a787bc4f64..a2d4f257c9c2 100644
--- a/include/drm/drm_ras.h
+++ b/include/drm/drm_ras.h
@@ -78,9 +78,11 @@ struct drm_device;
#if IS_ENABLED(CONFIG_DRM_RAS)
int drm_ras_node_register(struct drm_ras_node *node);
void drm_ras_node_unregister(struct drm_ras_node *node);
+void drm_ras_error_notify(struct drm_ras_node *node, u32 error_id, gfp_t flags);
#else
static inline int drm_ras_node_register(struct drm_ras_node *node) { return 0; }
static inline void drm_ras_node_unregister(struct drm_ras_node *node) { }
+static inline void drm_ras_error_notify(struct drm_ras_node *node, u32 error_id, gfp_t flags) { }
#endif
#endif
diff --git a/include/uapi/drm/drm_ras.h b/include/uapi/drm/drm_ras.h
index 218a3ee86805..47fafeff93e7 100644
--- a/include/uapi/drm/drm_ras.h
+++ b/include/uapi/drm/drm_ras.h
@@ -42,9 +42,12 @@ enum {
DRM_RAS_CMD_LIST_NODES = 1,
DRM_RAS_CMD_GET_ERROR_COUNTER,
DRM_RAS_CMD_CLEAR_ERROR_COUNTER,
+ DRM_RAS_CMD_ERROR_EVENT,
__DRM_RAS_CMD_MAX,
DRM_RAS_CMD_MAX = (__DRM_RAS_CMD_MAX - 1)
};
+#define DRM_RAS_MCGRP_ERROR_NOTIFY "error-notify"
+
#endif /* _UAPI_LINUX_DRM_RAS_H */
--
2.47.1
* [PATCH 4/4] drm/xe/xe_drm_ras: Add error-event support in XE DRM RAS
2026-03-11 10:29 [PATCH 0/4] Add support for clear counter and error event in DRM RAS Riana Tauro
` (2 preceding siblings ...)
2026-03-11 10:29 ` [PATCH 3/4] drm/drm_ras: Add DRM RAS netlink error event notification Riana Tauro
@ 2026-03-11 10:29 ` Riana Tauro
3 siblings, 0 replies; 11+ messages in thread
From: Riana Tauro @ 2026-03-11 10:29 UTC
To: intel-xe, dri-devel, netdev
Cc: aravind.iddamsetty, anshuman.gupta, rodrigo.vivi, joonas.lahtinen,
simona.vetter, airlied, pratik.bari, joshua.santosh.ranjan,
ashwin.kumar.kulkarni, shubham.kumar, ravi.kishore.koppuravuri,
raag.jadav, anvesh.bakwad, maarten.lankhorst, Riana Tauro
Add error-event support in XE DRM RAS to notify userspace
whenever a GT or SoC error occurs.
$ sudo ynl --family drm_ras --subscribe error-notify
{'msg': {'error-id': 1, 'node-id': 1}, 'name': 'error-event'}
Signed-off-by: Riana Tauro <riana.tauro@intel.com>
---
drivers/gpu/drm/xe/xe_drm_ras.c | 17 +++++++++++++++++
drivers/gpu/drm/xe/xe_drm_ras.h | 7 +++++++
drivers/gpu/drm/xe/xe_hw_error.c | 5 +++++
3 files changed, 29 insertions(+)
diff --git a/drivers/gpu/drm/xe/xe_drm_ras.c b/drivers/gpu/drm/xe/xe_drm_ras.c
index c21c8b428de6..47c040c80175 100644
--- a/drivers/gpu/drm/xe/xe_drm_ras.c
+++ b/drivers/gpu/drm/xe/xe_drm_ras.c
@@ -181,6 +181,23 @@ static void xe_drm_ras_unregister_nodes(struct drm_device *device, void *arg)
}
}
+/**
+ * xe_drm_ras_notify() - Notify userspace of an error event
+ * @ras: ras structure
+ * @error_id: error id
+ * @severity: error severity
+ * @flags: flags for allocation
+ *
+ * Notifies userspace of an error.
+ */
+void xe_drm_ras_notify(struct xe_drm_ras *ras, u32 error_id,
+ const enum drm_xe_ras_error_severity severity, gfp_t flags)
+{
+ struct drm_ras_node *node = &ras->node[severity];
+
+ drm_ras_error_notify(node, error_id, flags);
+}
+
/**
* xe_drm_ras_init() - Initialize DRM RAS
* @xe: xe device instance
diff --git a/drivers/gpu/drm/xe/xe_drm_ras.h b/drivers/gpu/drm/xe/xe_drm_ras.h
index 5cc8f0124411..ac347d0d63eb 100644
--- a/drivers/gpu/drm/xe/xe_drm_ras.h
+++ b/drivers/gpu/drm/xe/xe_drm_ras.h
@@ -5,11 +5,18 @@
#ifndef XE_DRM_RAS_H_
#define XE_DRM_RAS_H_
+#include <linux/types.h>
+
+#include <drm/xe_drm.h>
+
struct xe_device;
+struct xe_drm_ras;
#define for_each_error_severity(i) \
for (i = 0; i < DRM_XE_RAS_ERR_SEV_MAX; i++)
int xe_drm_ras_init(struct xe_device *xe);
+void xe_drm_ras_notify(struct xe_drm_ras *ras, u32 error_id,
+ const enum drm_xe_ras_error_severity severity, gfp_t flags);
#endif
diff --git a/drivers/gpu/drm/xe/xe_hw_error.c b/drivers/gpu/drm/xe/xe_hw_error.c
index 2a31b430570e..17424e07e72c 100644
--- a/drivers/gpu/drm/xe/xe_hw_error.c
+++ b/drivers/gpu/drm/xe/xe_hw_error.c
@@ -332,6 +332,8 @@ static void gt_hw_error_handler(struct xe_tile *tile, const enum hardware_error
xe_mmio_write32(mmio, ERR_STAT_GT_VECTOR_REG(hw_err, i), vector);
}
+
+ xe_drm_ras_notify(ras, error_id, severity, GFP_ATOMIC);
}
static void soc_slave_ieh_handler(struct xe_tile *tile, const enum hardware_error hw_err, u32 error_id)
@@ -368,6 +370,7 @@ static void soc_hw_error_handler(struct xe_tile *tile, const enum hardware_error
{
const enum drm_xe_ras_error_severity severity = hw_err_to_severity(hw_err);
struct xe_device *xe = tile_to_xe(tile);
+ struct xe_drm_ras *ras = &xe->ras;
struct xe_mmio *mmio = &tile->mmio;
unsigned long master_global_errstat, master_local_errstat;
u32 master, slave, regbit;
@@ -418,6 +421,8 @@ static void soc_hw_error_handler(struct xe_tile *tile, const enum hardware_error
for (i = 0; i < XE_SOC_NUM_IEH; i++)
xe_mmio_write32(mmio, SOC_GSYSEVTCTL_REG(master, slave, i),
(HARDWARE_ERROR_MAX << 1) + 1);
+
+ xe_drm_ras_notify(ras, error_id, severity, GFP_ATOMIC);
}
static void hw_error_source_handler(struct xe_tile *tile, const enum hardware_error hw_err)
--
2.47.1
* Re: [PATCH 1/4] drm/drm_ras: Add clear-error-counter netlink command to drm_ras
2026-03-11 10:29 ` [PATCH 1/4] drm/drm_ras: Add clear-error-counter netlink command to drm_ras Riana Tauro
@ 2026-03-12 0:29 ` Jakub Kicinski
2026-03-25 12:40 ` Raag Jadav
1 sibling, 0 replies; 11+ messages in thread
From: Jakub Kicinski @ 2026-03-12 0:29 UTC
To: Riana Tauro
Cc: intel-xe, dri-devel, netdev, aravind.iddamsetty, anshuman.gupta,
rodrigo.vivi, joonas.lahtinen, simona.vetter, airlied,
pratik.bari, joshua.santosh.ranjan, ashwin.kumar.kulkarni,
shubham.kumar, ravi.kishore.koppuravuri, raag.jadav,
anvesh.bakwad, maarten.lankhorst, Zack McKevitt, Lijo Lazar,
Hawking Zhang, David S. Miller, Paolo Abeni, Eric Dumazet
On Wed, 11 Mar 2026 15:59:15 +0530 Riana Tauro wrote:
> Documentation/gpu/drm-ras.rst | 8 +++++
> Documentation/netlink/specs/drm_ras.yaml | 13 ++++++-
Reviewed-by: Jakub Kicinski <kuba@kernel.org>
* Re: [PATCH 2/4] drm/xe/xe_drm_ras: Add support for clear-error-counter in XE DRM RAS
2026-03-11 10:29 ` [PATCH 2/4] drm/xe/xe_drm_ras: Add support for clear-error-counter in XE DRM RAS Riana Tauro
@ 2026-03-12 10:17 ` Raag Jadav
0 siblings, 0 replies; 11+ messages in thread
From: Raag Jadav @ 2026-03-12 10:17 UTC
To: Riana Tauro
Cc: intel-xe, dri-devel, netdev, aravind.iddamsetty, anshuman.gupta,
rodrigo.vivi, joonas.lahtinen, simona.vetter, airlied,
pratik.bari, joshua.santosh.ranjan, ashwin.kumar.kulkarni,
shubham.kumar, ravi.kishore.koppuravuri, anvesh.bakwad,
maarten.lankhorst
On Wed, Mar 11, 2026 at 03:59:16PM +0530, Riana Tauro wrote:
> Add support for clear-error-counter command in XE DRM RAS.
> This resets the counter value.
>
> Usage:
>
> $ sudo ynl --family drm_ras --do clear-error-counter --json \
> '{"node-id":1, "error-id":1}'
> None
>
> Signed-off-by: Riana Tauro <riana.tauro@intel.com>
> ---
> drivers/gpu/drm/xe/xe_drm_ras.c | 35 +++++++++++++++++++++++++++++++--
> 1 file changed, 33 insertions(+), 2 deletions(-)
>
> diff --git a/drivers/gpu/drm/xe/xe_drm_ras.c b/drivers/gpu/drm/xe/xe_drm_ras.c
> index e07dc23a155e..c21c8b428de6 100644
> --- a/drivers/gpu/drm/xe/xe_drm_ras.c
> +++ b/drivers/gpu/drm/xe/xe_drm_ras.c
> @@ -27,6 +27,16 @@ static int hw_query_error_counter(struct xe_drm_ras_counter *info,
> return 0;
> }
>
> +static int hw_clear_error_counter(struct xe_drm_ras_counter *info, u32 error_id)
> +{
> + if (!info || !info[error_id].name)
> + return -ENOENT;
> +
> + atomic_set(&info[error_id].counter, 0);
> +
> + return 0;
> +}
> +
> static int query_uncorrectable_error_counter(struct drm_ras_node *ep, u32 error_id,
> const char **name, u32 *val)
> {
> @@ -37,6 +47,15 @@ static int query_uncorrectable_error_counter(struct drm_ras_node *ep, u32 error_
> return hw_query_error_counter(info, error_id, name, val);
> }
>
> +static int clear_uncorrectable_error_counter(struct drm_ras_node *node, u32 error_id)
> +{
> + struct xe_device *xe = node->priv;
> + struct xe_drm_ras *ras = &xe->ras;
> + struct xe_drm_ras_counter *info = ras->info[DRM_XE_RAS_ERR_SEV_UNCORRECTABLE];
> +
> + return hw_clear_error_counter(info, error_id);
> +}
> +
> static int query_correctable_error_counter(struct drm_ras_node *ep, u32 error_id,
> const char **name, u32 *val)
> {
> @@ -47,6 +66,15 @@ static int query_correctable_error_counter(struct drm_ras_node *ep, u32 error_id
> return hw_query_error_counter(info, error_id, name, val);
> }
>
> +static int clear_correctable_error_counter(struct drm_ras_node *node, u32 error_id)
> +{
> + struct xe_device *xe = node->priv;
> + struct xe_drm_ras *ras = &xe->ras;
> + struct xe_drm_ras_counter *info = ras->info[DRM_XE_RAS_ERR_SEV_CORRECTABLE];
> +
> + return hw_clear_error_counter(info, error_id);
> +}
This would've been much simpler if we had per node info, but for now
Reviewed-by: Raag Jadav <raag.jadav@intel.com>
> static struct xe_drm_ras_counter *allocate_and_copy_counters(struct xe_device *xe)
> {
> struct xe_drm_ras_counter *counter;
> @@ -92,10 +120,13 @@ static int assign_node_params(struct xe_device *xe, struct drm_ras_node *node,
> if (IS_ERR(ras->info[severity]))
> return PTR_ERR(ras->info[severity]);
>
> - if (severity == DRM_XE_RAS_ERR_SEV_CORRECTABLE)
> + if (severity == DRM_XE_RAS_ERR_SEV_CORRECTABLE) {
> node->query_error_counter = query_correctable_error_counter;
> - else
> + node->clear_error_counter = clear_correctable_error_counter;
> + } else {
> node->query_error_counter = query_uncorrectable_error_counter;
> + node->clear_error_counter = clear_uncorrectable_error_counter;
> + }
>
> return 0;
> }
> --
> 2.47.1
>
* Re: [PATCH 1/4] drm/drm_ras: Add clear-error-counter netlink command to drm_ras
2026-03-11 10:29 ` [PATCH 1/4] drm/drm_ras: Add clear-error-counter netlink command to drm_ras Riana Tauro
2026-03-12 0:29 ` Jakub Kicinski
@ 2026-03-25 12:40 ` Raag Jadav
1 sibling, 0 replies; 11+ messages in thread
From: Raag Jadav @ 2026-03-25 12:40 UTC (permalink / raw)
To: Riana Tauro
Cc: intel-xe, dri-devel, netdev, aravind.iddamsetty, anshuman.gupta,
rodrigo.vivi, joonas.lahtinen, simona.vetter, airlied,
pratik.bari, joshua.santosh.ranjan, ashwin.kumar.kulkarni,
shubham.kumar, ravi.kishore.koppuravuri, anvesh.bakwad,
maarten.lankhorst, Jakub Kicinski, Zack McKevitt, Lijo Lazar,
Hawking Zhang, David S. Miller, Paolo Abeni, Eric Dumazet
On Wed, Mar 11, 2026 at 03:59:15PM +0530, Riana Tauro wrote:
> Introduce a new 'clear-error-counter' DRM RAS command to reset the counter
> value for a specific error counter of a given node.
>
> The command is a 'do' netlink request with 'node-id' and 'error-id'
> as parameters with no additional response payload.
>
> Usage
Missing ":"
> $ sudo ynl --family drm_ras --do clear-error-counter --json \
> '{"node-id":1, "error-id":1}'
> None
>
> Cc: Jakub Kicinski <kuba@kernel.org>
> Cc: Zack McKevitt <zachary.mckevitt@oss.qualcomm.com>
> Cc: Lijo Lazar <lijo.lazar@amd.com>
> Cc: Hawking Zhang <Hawking.Zhang@amd.com>
> Cc: David S. Miller <davem@davemloft.net>
> Cc: Paolo Abeni <pabeni@redhat.com>
> Cc: Eric Dumazet <edumazet@google.com>
> Signed-off-by: Riana Tauro <riana.tauro@intel.com>
Reviewed-by: Raag Jadav <raag.jadav@intel.com>
* Re: [PATCH 3/4] drm/drm_ras: Add DRM RAS netlink error event notification
2026-03-11 10:29 ` [PATCH 3/4] drm/drm_ras: Add DRM RAS netlink error event notification Riana Tauro
@ 2026-03-25 13:31 ` Raag Jadav
2026-04-08 14:29 ` Tauro, Riana
0 siblings, 1 reply; 11+ messages in thread
From: Raag Jadav @ 2026-03-25 13:31 UTC (permalink / raw)
To: Riana Tauro
Cc: intel-xe, dri-devel, netdev, aravind.iddamsetty, anshuman.gupta,
rodrigo.vivi, joonas.lahtinen, simona.vetter, airlied,
pratik.bari, joshua.santosh.ranjan, ashwin.kumar.kulkarni,
shubham.kumar, ravi.kishore.koppuravuri, anvesh.bakwad,
maarten.lankhorst, Jakub Kicinski, Zack McKevitt, Lijo Lazar,
Hawking Zhang, David S. Miller, Paolo Abeni, Eric Dumazet
On Wed, Mar 11, 2026 at 03:59:17PM +0530, Riana Tauro wrote:
> Add support for asynchronous error notifications in drm_ras.
It's either drm_ras or DRM RAS, make it consistent in all patches
(both commit message and subject).
> Define a new `error-event` netlink event and a new multicast
> group `error-notify` in drm_ras spec. Each event contains
> a node-id and error-id to identify the type and source
> of error.
>
> Add drm_ras_error_notify() to trigger this event from drivers.
> Userspace can receive this event by subscribing to the
> multicast group error-notify.
>
> Example: Using ynl tool
Ditto. Either Usage or Example, make it consistent in all patches.
Also, please utilize the full 75 character space where possible.
> $ sudo ynl --family drm_ras --subscribe error-notify
>
> Cc: Jakub Kicinski <kuba@kernel.org>
> Cc: Zack McKevitt <zachary.mckevitt@oss.qualcomm.com>
> Cc: Lijo Lazar <lijo.lazar@amd.com>
> Cc: Hawking Zhang <Hawking.Zhang@amd.com>
> Cc: David S. Miller <davem@davemloft.net>
> Cc: Paolo Abeni <pabeni@redhat.com>
> Cc: Eric Dumazet <edumazet@google.com>
> Signed-off-by: Riana Tauro <riana.tauro@intel.com>
> ---
> Documentation/gpu/drm-ras.rst | 9 +++++
> Documentation/netlink/specs/drm_ras.yaml | 14 +++++++
> drivers/gpu/drm/drm_ras.c | 48 ++++++++++++++++++++++++
> drivers/gpu/drm/drm_ras_nl.c | 6 +++
> drivers/gpu/drm/drm_ras_nl.h | 4 ++
> include/drm/drm_ras.h | 2 +
> include/uapi/drm/drm_ras.h | 3 ++
> 7 files changed, 86 insertions(+)
>
> diff --git a/Documentation/gpu/drm-ras.rst b/Documentation/gpu/drm-ras.rst
> index 4636e68f5678..09b2918f67bd 100644
> --- a/Documentation/gpu/drm-ras.rst
> +++ b/Documentation/gpu/drm-ras.rst
> @@ -54,6 +54,8 @@ User space tools can:
> ``node-id`` and ``error-id`` as parameters.
> * Clear specific error counters with the ``clear-error-counter`` command, using both
> ``node-id`` and ``error-id`` as parameters.
> +* Listen to ``error-event`` notifications for error events by subscribing to the
> + ``error-notify`` multicast group.
>
> YAML-based Interface
> --------------------
> @@ -109,3 +111,10 @@ Example: Clear an error counter for a given node
>
> sudo ynl --family drm_ras --do clear-error-counter --json '{"node-id":0, "error-id":1}'
> None
> +
> +Example: Listen to error events
> +
> +.. code-block:: bash
> +
> + sudo ynl --family drm_ras --subscribe error-notify
> + {'msg': {'error-id': 1, 'node-id': 1}, 'name': 'error-event'}
Can we also have error-name and node-name? I'd be pulling my hair out
if I had to remember all the ids.
On that note, I think it'll be good to have them as part of request
attributes as an alternative to ids (also for existing commands), but
that can be done as a follow-up.
Also, what if I have multiple devices with multiple nodes? Do they need
separate subscriptions?
Raag
* Re: [PATCH 3/4] drm/drm_ras: Add DRM RAS netlink error event notification
2026-03-25 13:31 ` Raag Jadav
@ 2026-04-08 14:29 ` Tauro, Riana
2026-04-09 5:35 ` Raag Jadav
0 siblings, 1 reply; 11+ messages in thread
From: Tauro, Riana @ 2026-04-08 14:29 UTC (permalink / raw)
To: Raag Jadav, aravind.iddamsetty, rodrigo.vivi
Cc: intel-xe, dri-devel, netdev, anshuman.gupta, joonas.lahtinen,
simona.vetter, airlied, pratik.bari, joshua.santosh.ranjan,
ashwin.kumar.kulkarni, shubham.kumar, ravi.kishore.koppuravuri,
anvesh.bakwad, maarten.lankhorst, Zack McKevitt, Lijo Lazar,
Hawking Zhang, David S. Miller, Paolo Abeni, Eric Dumazet,
Jakub Kicinski
On 3/25/2026 7:01 PM, Raag Jadav wrote:
> On Wed, Mar 11, 2026 at 03:59:17PM +0530, Riana Tauro wrote:
>> Add support for asynchronous error notifications in drm_ras.
> It's either drm_ras or DRM RAS, make it consistent in all patches
> (both commit message and subject).
Sure.
>
>> Define a new `error-event` netlink event and a new multicast
>> group `error-notify` in drm_ras spec. Each event contains
>> a node-id and error-id to identify the type and source
>> of error.
>>
>> Add drm_ras_error_notify() to trigger this event from drivers.
>> Userspace can receive this event by subscribing to the
>> multicast group error-notify.
>>
>> Example: Using ynl tool
> Ditto. Either Usage or Example, make it consistent in all patches.
>
> Also, please utilize the full 75 character space where possible.
Will fix.
>
>> $ sudo ynl --family drm_ras --subscribe error-notify
>>
>> Cc: Jakub Kicinski <kuba@kernel.org>
>> Cc: Zack McKevitt <zachary.mckevitt@oss.qualcomm.com>
>> Cc: Lijo Lazar <lijo.lazar@amd.com>
>> Cc: Hawking Zhang <Hawking.Zhang@amd.com>
>> Cc: David S. Miller <davem@davemloft.net>
>> Cc: Paolo Abeni <pabeni@redhat.com>
>> Cc: Eric Dumazet <edumazet@google.com>
>> Signed-off-by: Riana Tauro <riana.tauro@intel.com>
>> ---
>> Documentation/gpu/drm-ras.rst | 9 +++++
>> Documentation/netlink/specs/drm_ras.yaml | 14 +++++++
>> drivers/gpu/drm/drm_ras.c | 48 ++++++++++++++++++++++++
>> drivers/gpu/drm/drm_ras_nl.c | 6 +++
>> drivers/gpu/drm/drm_ras_nl.h | 4 ++
>> include/drm/drm_ras.h | 2 +
>> include/uapi/drm/drm_ras.h | 3 ++
>> 7 files changed, 86 insertions(+)
>>
>> diff --git a/Documentation/gpu/drm-ras.rst b/Documentation/gpu/drm-ras.rst
>> index 4636e68f5678..09b2918f67bd 100644
>> --- a/Documentation/gpu/drm-ras.rst
>> +++ b/Documentation/gpu/drm-ras.rst
>> @@ -54,6 +54,8 @@ User space tools can:
>> ``node-id`` and ``error-id`` as parameters.
>> * Clear specific error counters with the ``clear-error-counter`` command, using both
>> ``node-id`` and ``error-id`` as parameters.
>> +* Listen to ``error-event`` notifications for error events by subscribing to the
>> + ``error-notify`` multicast group.
>>
>> YAML-based Interface
>> --------------------
>> @@ -109,3 +111,10 @@ Example: Clear an error counter for a given node
>>
>> sudo ynl --family drm_ras --do clear-error-counter --json '{"node-id":0, "error-id":1}'
>> None
>> +
>> +Example: Listen to error events
>> +
>> +.. code-block:: bash
>> +
>> + sudo ynl --family drm_ras --subscribe error-notify
>> + {'msg': {'error-id': 1, 'node-id': 1}, 'name': 'error-event'}
> Can we also have error-name and node-name? I'd be pulling my hair out
> if I had to remember all the ids.
Yeah, makes sense. We can add node_name and error_name.
Adding device_name to the event would also be useful.
@Rodrigo/@aravind thoughts?
>
> On that note, I think it'll be good to have them as part of request
> attributes as an alternative to ids (also for existing commands), but
> that can be done as a follow-up.
>
We cannot use names as an alternative because they won't work with multiple cards.
Example in xe: suppose there are 2 cards and each has 2 nodes. The node names
repeat across cards, so we cannot query using node_name+error_name.
Also, most netlink implementations use IDs as unique identifiers.
$ sudo ./cli.py --family drm_ras --dump list-nodes
[{'device-name': 'bdf_1', 'node-id': 0, 'node-name': 'correctable-errors',
'node-type': 'error-counter'},
{'device-name': 'bdf_1', 'node-id': 1, 'node-name': 'uncorrectable-errors',
'node-type': 'error-counter'},
{'device-name': 'bdf_2', 'node-id': 2, 'node-name': 'correctable-errors',
'node-type': 'error-counter'},
{'device-name': 'bdf_2', 'node-id': 3, 'node-name': 'uncorrectable-errors',
'node-type': 'error-counter'}]
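
The point about names not being unique can be illustrated with a minimal
Python sketch. It works on the list-nodes dump above as plain data and does
not call ynl; the helper names (build_node_index, resolve_node_id) are
hypothetical and not part of drm_ras or ynl. It assumes device-name plus
node-name is unique, which holds for this dump but is not guaranteed by
anything in this thread.

```python
# Node list copied from the list-nodes dump in this thread (plain data, no ynl call).
NODES = [
    {"device-name": "bdf_1", "node-id": 0, "node-name": "correctable-errors",
     "node-type": "error-counter"},
    {"device-name": "bdf_1", "node-id": 1, "node-name": "uncorrectable-errors",
     "node-type": "error-counter"},
    {"device-name": "bdf_2", "node-id": 2, "node-name": "correctable-errors",
     "node-type": "error-counter"},
    {"device-name": "bdf_2", "node-id": 3, "node-name": "uncorrectable-errors",
     "node-type": "error-counter"},
]


def build_node_index(nodes):
    """Map (device-name, node-name) to node-id (hypothetical helper)."""
    index = {}
    for node in nodes:
        key = (node["device-name"], node["node-name"])
        if key in index:
            # Assumption violated: the pair is not unique in this dump.
            raise ValueError(f"duplicate node key: {key}")
        index[key] = node["node-id"]
    return index


def resolve_node_id(index, device_name, node_name):
    """Return the node-id for a device/node name pair."""
    return index[(device_name, node_name)]


# node-name alone matches one node per card, so it cannot identify a node.
same_name = [n["node-id"] for n in NODES if n["node-name"] == "correctable-errors"]
print(same_name)  # [0, 2]

index = build_node_index(NODES)
print(resolve_node_id(index, "bdf_2", "uncorrectable-errors"))  # 3
```

The resolved node-id is what would then go into the "node-id" field of a
get-error-counter or clear-error-counter request.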
>
> Also, what if I have multiple devices with multiple nodes. Do they need
> separate subscription?
>
No, we subscribe only to the group, not to individual nodes. In this case the
group is 'error-notify'.
$ sudo ./cli.py --family drm_ras --subscribe error-notify
{'msg': {'error-id': 1, 'node-id': 1}, 'name': 'error-event'}
{'msg': {'error-id': 1, 'node-id': 3}, 'name': 'error-event'}
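
Since one subscription delivers events from every node, a listener has to map
each node-id back to its device on its own. A minimal Python sketch of that
step follows, using the two events above as plain data. NODE_INFO and
describe_event are hypothetical names; the table is assumed to have been
built beforehand from a list-nodes dump.

```python
# node-id -> (device-name, node-name); assumed to come from a prior list-nodes dump.
NODE_INFO = {
    0: ("bdf_1", "correctable-errors"),
    1: ("bdf_1", "uncorrectable-errors"),
    2: ("bdf_2", "correctable-errors"),
    3: ("bdf_2", "uncorrectable-errors"),
}

# The two notifications shown above, as received on the error-notify group.
EVENTS = [
    {"msg": {"error-id": 1, "node-id": 1}, "name": "error-event"},
    {"msg": {"error-id": 1, "node-id": 3}, "name": "error-event"},
]


def describe_event(event, node_info):
    """Render an error-event with its device and node names (hypothetical helper)."""
    msg = event["msg"]
    # Fall back to placeholders if the node appeared after the table was built.
    device, node = node_info.get(msg["node-id"], ("unknown-device", "unknown-node"))
    return f"{device}/{node}: error-id {msg['error-id']}"


descriptions = [describe_event(event, NODE_INFO) for event in EVENTS]
for line in descriptions:
    print(line)
# bdf_1/uncorrectable-errors: error-id 1
# bdf_2/uncorrectable-errors: error-id 1
```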
Thanks
Riana
>
> Raag
* Re: [PATCH 3/4] drm/drm_ras: Add DRM RAS netlink error event notification
2026-04-08 14:29 ` Tauro, Riana
@ 2026-04-09 5:35 ` Raag Jadav
0 siblings, 0 replies; 11+ messages in thread
From: Raag Jadav @ 2026-04-09 5:35 UTC (permalink / raw)
To: Tauro, Riana
Cc: aravind.iddamsetty, rodrigo.vivi, intel-xe, dri-devel, netdev,
anshuman.gupta, joonas.lahtinen, simona.vetter, airlied,
pratik.bari, joshua.santosh.ranjan, ashwin.kumar.kulkarni,
shubham.kumar, ravi.kishore.koppuravuri, anvesh.bakwad,
maarten.lankhorst, Zack McKevitt, Lijo Lazar, Hawking Zhang,
David S. Miller, Paolo Abeni, Eric Dumazet, Jakub Kicinski
On Wed, Apr 08, 2026 at 07:59:33PM +0530, Tauro, Riana wrote:
> On 3/25/2026 7:01 PM, Raag Jadav wrote:
> > On Wed, Mar 11, 2026 at 03:59:17PM +0530, Riana Tauro wrote:
...
> > > +Example: Listen to error events
> > > +
> > > +.. code-block:: bash
> > > +
> > > + sudo ynl --family drm_ras --subscribe error-notify
> > > + {'msg': {'error-id': 1, 'node-id': 1}, 'name': 'error-event'}
> > Can we also have error-name and node-name? I'd be pulling my hair out
> > if I had to remember all the ids.
>
> Yeah, makes sense. We can add node_name and error_name.
> Adding device_name to the event would also be useful.
>
> @Rodrigo/@aravind thoughts?
>
> >
> > On that note, I think it'll be good to have them as part of request
> > attributes as an alternative to ids (also for existing commands), but
> > that can be done as a follow-up.
> >
> We cannot use names as an alternative because they won't work with multiple cards.
> Example in xe: suppose there are 2 cards and each has 2 nodes. The node names
> repeat across cards, so we cannot query using node_name+error_name.
> Also, most netlink implementations use IDs as unique identifiers.
>
> $ sudo ./cli.py --family drm_ras --dump list-nodes
> [{'device-name': 'bdf_1', 'node-id': 0, 'node-name': 'correctable-errors',
> 'node-type': 'error-counter'},
> {'device-name': 'bdf_1', 'node-id': 1, 'node-name': 'uncorrectable-errors',
> 'node-type': 'error-counter'},
> {'device-name': 'bdf_2', 'node-id': 2, 'node-name': 'correctable-errors',
> 'node-type': 'error-counter'},
> {'device-name': 'bdf_2', 'node-id': 3, 'node-name': 'uncorrectable-errors',
> 'node-type': 'error-counter'}]
This means they don't persist, so the user needs to figure out all the ids before
anything can happen. In the device node world we have /dev/dri/by-path/<bdf>, which
makes it much easier.
Also, I'm not well informed about the history, and it's still unclear to me what
problem netlink solved here that could not be solved by anything else. But we're
too late for that discussion, and again, not my call.
> > Also, what if I have multiple devices with multiple nodes. Do they need
> > separate subscription?
> >
> No, we subscribe only to the group not the nodes. In this case the group is
> 'error-notify'
>
> $ sudo ./cli.py --family drm_ras --subscribe error-notify
> {'msg': {'error-id': 1, 'node-id': 1}, 'name': 'error-event'}
> {'msg': {'error-id': 1, 'node-id': 3}, 'name': 'error-event'}
Hm, perhaps I need to spend some time wrapping my head around the new concept.
Let's catch up sometime this week.
Raag