public inbox for netdev@vger.kernel.org
* Re: [PATCH v2 1/2] drm/drm_ras: Add clear-error-counter netlink command to drm_ras
  2026-04-09  7:33 ` [PATCH v2 1/2] drm/drm_ras: Add clear-error-counter netlink " Riana Tauro
@ 2026-04-09  7:21   ` Tauro, Riana
  2026-04-09 13:37     ` Rodrigo Vivi
  0 siblings, 1 reply; 5+ messages in thread
From: Tauro, Riana @ 2026-04-09  7:21 UTC
  To: intel-xe, dri-devel, netdev, rodrigo.vivi, Zack McKevitt,
	joonas.lahtinen, aravind.iddamsetty
  Cc: anshuman.gupta, simona.vetter, airlied, pratik.bari,
	joshua.santosh.ranjan, ashwin.kumar.kulkarni, shubham.kumar,
	ravi.kishore.koppuravuri, raag.jadav, anvesh.bakwad,
	maarten.lankhorst, Jakub Kicinski, Lijo Lazar, Hawking Zhang,
	David S. Miller, Paolo Abeni, Eric Dumazet

Hi Zack

Could you please take a look at this patch, if it is applicable to your
use case? Please let me know if any changes are required.

@Rodrigo This has already been reviewed by Jakub and Raag.
If there are no open issues, can this be merged via drm-misc?

Thanks
Riana


* [PATCH v2 0/2] Add clear-error-counter command to drm_ras
@ 2026-04-09  7:33 Riana Tauro
  2026-04-09  7:33 ` [PATCH v2 1/2] drm/drm_ras: Add clear-error-counter netlink " Riana Tauro
  2026-04-09  7:33 ` [PATCH v2 2/2] drm/xe/xe_drm_ras: Add support for clear-error-counter in XE drm_ras Riana Tauro
  0 siblings, 2 replies; 5+ messages in thread
From: Riana Tauro @ 2026-04-09  7:33 UTC
  To: intel-xe, dri-devel, netdev
  Cc: aravind.iddamsetty, anshuman.gupta, rodrigo.vivi, joonas.lahtinen,
	simona.vetter, airlied, pratik.bari, joshua.santosh.ranjan,
	ashwin.kumar.kulkarni, shubham.kumar, ravi.kishore.koppuravuri,
	raag.jadav, anvesh.bakwad, maarten.lankhorst, Riana Tauro

Add a clear-error-counter command to drm_ras to clear a specific error
counter of a node. The command takes node-id and error-id as request
parameters and returns no response payload.
Implement the callback in the XE driver to demonstrate usage.

Usage:

$ sudo ynl --family drm_ras  --dump get-error-counter --json '{"node-id":1}'
[{'error-id': 1, 'error-name': 'core-compute', 'error-value': 0},
 {'error-id': 2, 'error-name': 'soc-internal', 'error-value': 3}]

$ sudo ynl --family drm_ras  --do clear-error-counter --json \
'{"node-id":1, "error-id":2}'
None

$ sudo ynl --family drm_ras  --dump get-error-counter --json '{"node-id":1}'
[{'error-id': 1, 'error-name': 'core-compute', 'error-value': 0},
 {'error-id': 2, 'error-name': 'soc-internal', 'error-value': 0}]

Rev2: Split patches

Riana Tauro (2):
  drm/drm_ras: Add clear-error-counter netlink command to drm_ras
  drm/xe/xe_drm_ras: Add support for clear-error-counter in XE drm_ras

 Documentation/gpu/drm-ras.rst            |  8 +++++
 Documentation/netlink/specs/drm_ras.yaml | 13 ++++++-
 drivers/gpu/drm/drm_ras.c                | 43 +++++++++++++++++++++++-
 drivers/gpu/drm/drm_ras_nl.c             | 13 +++++++
 drivers/gpu/drm/drm_ras_nl.h             |  2 ++
 drivers/gpu/drm/xe/xe_drm_ras.c          | 35 +++++++++++++++++--
 include/drm/drm_ras.h                    | 11 ++++++
 include/uapi/drm/drm_ras.h               |  1 +
 8 files changed, 122 insertions(+), 4 deletions(-)

-- 
2.47.1



* [PATCH v2 1/2] drm/drm_ras: Add clear-error-counter netlink command to drm_ras
  2026-04-09  7:33 [PATCH v2 0/2] Add clear-error-counter command to drm_ras Riana Tauro
@ 2026-04-09  7:33 ` Riana Tauro
  2026-04-09  7:21   ` Tauro, Riana
  2026-04-09  7:33 ` [PATCH v2 2/2] drm/xe/xe_drm_ras: Add support for clear-error-counter in XE drm_ras Riana Tauro
  1 sibling, 1 reply; 5+ messages in thread
From: Riana Tauro @ 2026-04-09  7:33 UTC
  To: intel-xe, dri-devel, netdev
  Cc: aravind.iddamsetty, anshuman.gupta, rodrigo.vivi, joonas.lahtinen,
	simona.vetter, airlied, pratik.bari, joshua.santosh.ranjan,
	ashwin.kumar.kulkarni, shubham.kumar, ravi.kishore.koppuravuri,
	raag.jadav, anvesh.bakwad, maarten.lankhorst, Riana Tauro,
	Jakub Kicinski, Zack McKevitt, Lijo Lazar, Hawking Zhang,
	David S. Miller, Paolo Abeni, Eric Dumazet

Introduce a new 'clear-error-counter' drm_ras command to reset the counter
value for a specific error counter of a given node.

The command is a 'do' netlink request that takes 'node-id' and 'error-id'
as parameters and has no response payload.

Usage:

$ sudo ynl --family drm_ras  --do clear-error-counter --json \
'{"node-id":1, "error-id":1}'
None

Cc: Jakub Kicinski <kuba@kernel.org>
Cc: Zack McKevitt <zachary.mckevitt@oss.qualcomm.com>
Cc: Lijo Lazar <lijo.lazar@amd.com>
Cc: Hawking Zhang <Hawking.Zhang@amd.com>
Cc: David S. Miller <davem@davemloft.net>
Cc: Paolo Abeni <pabeni@redhat.com>
Cc: Eric Dumazet <edumazet@google.com>
Signed-off-by: Riana Tauro <riana.tauro@intel.com>
Reviewed-by: Jakub Kicinski <kuba@kernel.org>
Reviewed-by: Raag Jadav <raag.jadav@intel.com>
---
 Documentation/gpu/drm-ras.rst            |  8 +++++
 Documentation/netlink/specs/drm_ras.yaml | 13 ++++++-
 drivers/gpu/drm/drm_ras.c                | 43 +++++++++++++++++++++++-
 drivers/gpu/drm/drm_ras_nl.c             | 13 +++++++
 drivers/gpu/drm/drm_ras_nl.h             |  2 ++
 include/drm/drm_ras.h                    | 11 ++++++
 include/uapi/drm/drm_ras.h               |  1 +
 7 files changed, 89 insertions(+), 2 deletions(-)
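
For illustration only (not part of the patch): the order of checks in
drm_ras_nl_clear_error_counter_doit() can be modelled in plain userspace C
roughly as follows. This is a sketch under assumptions, not the kernel code:
struct model_node, model_nodes[], model_lookup() and model_clear_cb() are
hypothetical names invented here, a fixed array stands in for the drm_ras_xa
xarray, and netlink attribute parsing is left out entirely.

```c
#include <errno.h>
#include <stddef.h>
#include <stdint.h>

/*
 * Hypothetical userspace stand-in for struct drm_ras_node. Only the
 * fields that the clear path reads are modelled.
 */
struct model_node {
	uint32_t first_error_id;	/* models error_counter_range.first */
	uint32_t last_error_id;		/* models error_counter_range.last */
	int (*clear_error_counter)(struct model_node *node, uint32_t error_id);
	void *priv;
};

#define MODEL_MAX_NODES 4

/* Plain array standing in for the drm_ras_xa xarray (assumption). */
static struct model_node *model_nodes[MODEL_MAX_NODES];

static struct model_node *model_lookup(uint32_t node_id)
{
	return node_id < MODEL_MAX_NODES ? model_nodes[node_id] : NULL;
}

/* Same check order as the doit handler once both attributes are parsed. */
static int model_clear_error_counter(uint32_t node_id, uint32_t error_id)
{
	struct model_node *node = model_lookup(node_id);

	/* Unknown node, or a node whose driver did not provide the callback. */
	if (!node || !node->clear_error_counter)
		return -ENOENT;

	/* The error ID must fall inside the range the node advertises. */
	if (error_id < node->first_error_id || error_id > node->last_error_id)
		return -EINVAL;

	return node->clear_error_counter(node, error_id);
}

/* Example callback: zero one slot of a counter array hung off priv. */
static int model_clear_cb(struct model_node *node, uint32_t error_id)
{
	uint32_t *counters = node->priv;

	counters[error_id] = 0;
	return 0;
}
```

To try it, register a node in model_nodes[] with model_clear_cb as its
callback and call model_clear_error_counter(). In this model a missing node
or missing callback yields -ENOENT and an out-of-range error ID yields
-EINVAL, matching the return values in the handler below.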

diff --git a/Documentation/gpu/drm-ras.rst b/Documentation/gpu/drm-ras.rst
index 70b246a78fc8..4636e68f5678 100644
--- a/Documentation/gpu/drm-ras.rst
+++ b/Documentation/gpu/drm-ras.rst
@@ -52,6 +52,8 @@ User space tools can:
   as a parameter.
 * Query specific error counter values with the ``get-error-counter`` command, using both
   ``node-id`` and ``error-id`` as parameters.
+* Clear specific error counters with the ``clear-error-counter`` command, using both
+  ``node-id`` and ``error-id`` as parameters.
 
 YAML-based Interface
 --------------------
@@ -101,3 +103,9 @@ Example: Query an error counter for a given node
     sudo ynl --family drm_ras --do get-error-counter --json '{"node-id":0, "error-id":1}'
     {'error-id': 1, 'error-name': 'error_name1', 'error-value': 0}
 
+Example: Clear an error counter for a given node
+
+.. code-block:: bash
+
+    sudo ynl --family drm_ras --do clear-error-counter --json '{"node-id":0, "error-id":1}'
+    None
diff --git a/Documentation/netlink/specs/drm_ras.yaml b/Documentation/netlink/specs/drm_ras.yaml
index 79af25dac3c5..e113056f8c01 100644
--- a/Documentation/netlink/specs/drm_ras.yaml
+++ b/Documentation/netlink/specs/drm_ras.yaml
@@ -99,7 +99,7 @@ operations:
       flags: [admin-perm]
       do:
         request:
-          attributes:
+          attributes: &id-attrs
             - node-id
             - error-id
         reply:
@@ -113,3 +113,14 @@ operations:
             - node-id
         reply:
           attributes: *errorinfo
+    -
+      name: clear-error-counter
+      doc: >-
+           Clear error counter for a given node.
+           The request includes the error-id and node-id of the
+           counter to be cleared.
+      attribute-set: error-counter-attrs
+      flags: [admin-perm]
+      do:
+        request:
+          attributes: *id-attrs
diff --git a/drivers/gpu/drm/drm_ras.c b/drivers/gpu/drm/drm_ras.c
index b2fa5ab86d87..d6eab29a1394 100644
--- a/drivers/gpu/drm/drm_ras.c
+++ b/drivers/gpu/drm/drm_ras.c
@@ -26,7 +26,7 @@
  * efficient lookup by ID. Nodes can be registered or unregistered
  * dynamically at runtime.
  *
- * A Generic Netlink family `drm_ras` exposes two main operations to
+ * A Generic Netlink family `drm_ras` exposes the below operations to
  * userspace:
  *
  * 1. LIST_NODES: Dump all currently registered RAS nodes.
@@ -37,6 +37,10 @@
  *    Returns all counters of a node if only Node ID is provided or specific
  *    error counters.
  *
+ * 3. CLEAR_ERROR_COUNTER: Clear error counter of a given node.
+ *    Userspace must provide Node ID, Error ID.
+ *    Clears specific error counter of a node if supported.
+ *
  * Node registration:
  *
  * - drm_ras_node_register(): Registers a new node and assigns
@@ -66,6 +70,8 @@
  *   operation, fetching all counters from a specific node.
  * - drm_ras_nl_get_error_counter_doit(): Implements the GET_ERROR_COUNTER doit
  *   operation, fetching a counter value from a specific node.
+ * - drm_ras_nl_clear_error_counter_doit(): Implements the CLEAR_ERROR_COUNTER doit
+ *   operation, clearing a counter value from a specific node.
  */
 
 static DEFINE_XARRAY_ALLOC(drm_ras_xa);
@@ -314,6 +320,41 @@ int drm_ras_nl_get_error_counter_doit(struct sk_buff *skb,
 	return doit_reply_value(info, node_id, error_id);
 }
 
+/**
+ * drm_ras_nl_clear_error_counter_doit() - Clear an error counter of a node
+ * @skb: Netlink message buffer
+ * @info: Generic Netlink info containing attributes of the request
+ *
+ * Extracts the node ID and error ID from the netlink attributes and
+ * clears the current value.
+ *
+ * Return: 0 on success, or negative errno on failure.
+ */
+int drm_ras_nl_clear_error_counter_doit(struct sk_buff *skb,
+					struct genl_info *info)
+{
+	struct drm_ras_node *node;
+	u32 node_id, error_id;
+
+	if (!info->attrs ||
+	    GENL_REQ_ATTR_CHECK(info, DRM_RAS_A_ERROR_COUNTER_ATTRS_NODE_ID) ||
+	    GENL_REQ_ATTR_CHECK(info, DRM_RAS_A_ERROR_COUNTER_ATTRS_ERROR_ID))
+		return -EINVAL;
+
+	node_id = nla_get_u32(info->attrs[DRM_RAS_A_ERROR_COUNTER_ATTRS_NODE_ID]);
+	error_id = nla_get_u32(info->attrs[DRM_RAS_A_ERROR_COUNTER_ATTRS_ERROR_ID]);
+
+	node = xa_load(&drm_ras_xa, node_id);
+	if (!node || !node->clear_error_counter)
+		return -ENOENT;
+
+	if (error_id < node->error_counter_range.first ||
+	    error_id > node->error_counter_range.last)
+		return -EINVAL;
+
+	return node->clear_error_counter(node, error_id);
+}
+
 /**
  * drm_ras_node_register() - Register a new RAS node
  * @node: Node structure to register
diff --git a/drivers/gpu/drm/drm_ras_nl.c b/drivers/gpu/drm/drm_ras_nl.c
index 16803d0c4a44..dea1c1b2494e 100644
--- a/drivers/gpu/drm/drm_ras_nl.c
+++ b/drivers/gpu/drm/drm_ras_nl.c
@@ -22,6 +22,12 @@ static const struct nla_policy drm_ras_get_error_counter_dump_nl_policy[DRM_RAS_
 	[DRM_RAS_A_ERROR_COUNTER_ATTRS_NODE_ID] = { .type = NLA_U32, },
 };
 
+/* DRM_RAS_CMD_CLEAR_ERROR_COUNTER - do */
+static const struct nla_policy drm_ras_clear_error_counter_nl_policy[DRM_RAS_A_ERROR_COUNTER_ATTRS_ERROR_ID + 1] = {
+	[DRM_RAS_A_ERROR_COUNTER_ATTRS_NODE_ID] = { .type = NLA_U32, },
+	[DRM_RAS_A_ERROR_COUNTER_ATTRS_ERROR_ID] = { .type = NLA_U32, },
+};
+
 /* Ops table for drm_ras */
 static const struct genl_split_ops drm_ras_nl_ops[] = {
 	{
@@ -43,6 +49,13 @@ static const struct genl_split_ops drm_ras_nl_ops[] = {
 		.maxattr	= DRM_RAS_A_ERROR_COUNTER_ATTRS_NODE_ID,
 		.flags		= GENL_ADMIN_PERM | GENL_CMD_CAP_DUMP,
 	},
+	{
+		.cmd		= DRM_RAS_CMD_CLEAR_ERROR_COUNTER,
+		.doit		= drm_ras_nl_clear_error_counter_doit,
+		.policy		= drm_ras_clear_error_counter_nl_policy,
+		.maxattr	= DRM_RAS_A_ERROR_COUNTER_ATTRS_ERROR_ID,
+		.flags		= GENL_ADMIN_PERM | GENL_CMD_CAP_DO,
+	},
 };
 
 struct genl_family drm_ras_nl_family __ro_after_init = {
diff --git a/drivers/gpu/drm/drm_ras_nl.h b/drivers/gpu/drm/drm_ras_nl.h
index 06ccd9342773..a398643572a5 100644
--- a/drivers/gpu/drm/drm_ras_nl.h
+++ b/drivers/gpu/drm/drm_ras_nl.h
@@ -18,6 +18,8 @@ int drm_ras_nl_get_error_counter_doit(struct sk_buff *skb,
 				      struct genl_info *info);
 int drm_ras_nl_get_error_counter_dumpit(struct sk_buff *skb,
 					struct netlink_callback *cb);
+int drm_ras_nl_clear_error_counter_doit(struct sk_buff *skb,
+					struct genl_info *info);
 
 extern struct genl_family drm_ras_nl_family;
 
diff --git a/include/drm/drm_ras.h b/include/drm/drm_ras.h
index 5d50209e51db..f2a787bc4f64 100644
--- a/include/drm/drm_ras.h
+++ b/include/drm/drm_ras.h
@@ -58,6 +58,17 @@ struct drm_ras_node {
 	int (*query_error_counter)(struct drm_ras_node *node, u32 error_id,
 				   const char **name, u32 *val);
 
+	/**
+	 * @clear_error_counter:
+	 *
+	 * This callback is used by drm_ras to clear a specific error counter.
+	 * Driver should implement this callback to support clearing error counters
+	 * of a node.
+	 *
+	 * Returns: 0 on success, negative error code on failure.
+	 */
+	int (*clear_error_counter)(struct drm_ras_node *node, u32 error_id);
+
 	/** @priv: Driver private data */
 	void *priv;
 };
diff --git a/include/uapi/drm/drm_ras.h b/include/uapi/drm/drm_ras.h
index 5f40fa5b869d..218a3ee86805 100644
--- a/include/uapi/drm/drm_ras.h
+++ b/include/uapi/drm/drm_ras.h
@@ -41,6 +41,7 @@ enum {
 enum {
 	DRM_RAS_CMD_LIST_NODES = 1,
 	DRM_RAS_CMD_GET_ERROR_COUNTER,
+	DRM_RAS_CMD_CLEAR_ERROR_COUNTER,
 
 	__DRM_RAS_CMD_MAX,
 	DRM_RAS_CMD_MAX = (__DRM_RAS_CMD_MAX - 1)
-- 
2.47.1



* [PATCH v2 2/2] drm/xe/xe_drm_ras: Add support for clear-error-counter in XE drm_ras
  2026-04-09  7:33 [PATCH v2 0/2] Add clear-error-counter command to drm_ras Riana Tauro
  2026-04-09  7:33 ` [PATCH v2 1/2] drm/drm_ras: Add clear-error-counter netlink " Riana Tauro
@ 2026-04-09  7:33 ` Riana Tauro
  1 sibling, 0 replies; 5+ messages in thread
From: Riana Tauro @ 2026-04-09  7:33 UTC
  To: intel-xe, dri-devel, netdev
  Cc: aravind.iddamsetty, anshuman.gupta, rodrigo.vivi, joonas.lahtinen,
	simona.vetter, airlied, pratik.bari, joshua.santosh.ranjan,
	ashwin.kumar.kulkarni, shubham.kumar, ravi.kishore.koppuravuri,
	raag.jadav, anvesh.bakwad, maarten.lankhorst, Riana Tauro

Add support for the clear-error-counter command in XE drm_ras.
This resets the counter value to zero.

Usage:

$ sudo ynl --family drm_ras  --do clear-error-counter --json \
'{"node-id":1, "error-id":1}'
None

Signed-off-by: Riana Tauro <riana.tauro@intel.com>
Reviewed-by: Raag Jadav <raag.jadav@intel.com>
---
 drivers/gpu/drm/xe/xe_drm_ras.c | 35 +++++++++++++++++++++++++++++++--
 1 file changed, 33 insertions(+), 2 deletions(-)
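
For illustration only (not part of the patch): the per-counter reset done by
hw_clear_error_counter() can be modelled in userspace with C11 atomics
roughly as follows. This is a sketch under assumptions: struct model_counter,
model_hw_clear() and model_hw_query() are hypothetical names, atomic_store()
stands in for the kernel's atomic_set(), and the query helper is a guess at
the read side, which this patch does not show in full.

```c
#include <errno.h>
#include <stdatomic.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical stand-in for struct xe_drm_ras_counter. */
struct model_counter {
	const char *name;	/* NULL marks an unused slot */
	atomic_int counter;
};

/* Models hw_clear_error_counter(): reject unused slots, then zero. */
static int model_hw_clear(struct model_counter *info, uint32_t error_id)
{
	if (!info || !info[error_id].name)
		return -ENOENT;

	/* Userspace analogue of atomic_set(&info[error_id].counter, 0). */
	atomic_store(&info[error_id].counter, 0);

	return 0;
}

/* Assumed read side, used here only to observe the effect of a clear. */
static int model_hw_query(struct model_counter *info, uint32_t error_id,
			  uint32_t *val)
{
	if (!info || !info[error_id].name)
		return -ENOENT;

	*val = (uint32_t)atomic_load(&info[error_id].counter);

	return 0;
}
```

As in the patch, the helper indexes info[] directly with error_id, so the
caller is expected to have range-checked the ID first. In the series that
check is done by the drm_ras core before the driver callback runs.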

diff --git a/drivers/gpu/drm/xe/xe_drm_ras.c b/drivers/gpu/drm/xe/xe_drm_ras.c
index e07dc23a155e..c21c8b428de6 100644
--- a/drivers/gpu/drm/xe/xe_drm_ras.c
+++ b/drivers/gpu/drm/xe/xe_drm_ras.c
@@ -27,6 +27,16 @@ static int hw_query_error_counter(struct xe_drm_ras_counter *info,
 	return 0;
 }
 
+static int hw_clear_error_counter(struct xe_drm_ras_counter *info, u32 error_id)
+{
+	if (!info || !info[error_id].name)
+		return -ENOENT;
+
+	atomic_set(&info[error_id].counter, 0);
+
+	return 0;
+}
+
 static int query_uncorrectable_error_counter(struct drm_ras_node *ep, u32 error_id,
 					     const char **name, u32 *val)
 {
@@ -37,6 +47,15 @@ static int query_uncorrectable_error_counter(struct drm_ras_node *ep, u32 error_
 	return hw_query_error_counter(info, error_id, name, val);
 }
 
+static int clear_uncorrectable_error_counter(struct drm_ras_node *node, u32 error_id)
+{
+	struct xe_device *xe = node->priv;
+	struct xe_drm_ras *ras = &xe->ras;
+	struct xe_drm_ras_counter *info = ras->info[DRM_XE_RAS_ERR_SEV_UNCORRECTABLE];
+
+	return hw_clear_error_counter(info, error_id);
+}
+
 static int query_correctable_error_counter(struct drm_ras_node *ep, u32 error_id,
 					   const char **name, u32 *val)
 {
@@ -47,6 +66,15 @@ static int query_correctable_error_counter(struct drm_ras_node *ep, u32 error_id
 	return hw_query_error_counter(info, error_id, name, val);
 }
 
+static int clear_correctable_error_counter(struct drm_ras_node *node, u32 error_id)
+{
+	struct xe_device *xe = node->priv;
+	struct xe_drm_ras *ras = &xe->ras;
+	struct xe_drm_ras_counter *info = ras->info[DRM_XE_RAS_ERR_SEV_CORRECTABLE];
+
+	return hw_clear_error_counter(info, error_id);
+}
+
 static struct xe_drm_ras_counter *allocate_and_copy_counters(struct xe_device *xe)
 {
 	struct xe_drm_ras_counter *counter;
@@ -92,10 +120,13 @@ static int assign_node_params(struct xe_device *xe, struct drm_ras_node *node,
 	if (IS_ERR(ras->info[severity]))
 		return PTR_ERR(ras->info[severity]);
 
-	if (severity == DRM_XE_RAS_ERR_SEV_CORRECTABLE)
+	if (severity == DRM_XE_RAS_ERR_SEV_CORRECTABLE) {
 		node->query_error_counter = query_correctable_error_counter;
-	else
+		node->clear_error_counter = clear_correctable_error_counter;
+	} else {
 		node->query_error_counter = query_uncorrectable_error_counter;
+		node->clear_error_counter = clear_uncorrectable_error_counter;
+	}
 
 	return 0;
 }
-- 
2.47.1



* Re: [PATCH v2 1/2] drm/drm_ras: Add clear-error-counter netlink command to drm_ras
  2026-04-09  7:21   ` Tauro, Riana
@ 2026-04-09 13:37     ` Rodrigo Vivi
  0 siblings, 0 replies; 5+ messages in thread
From: Rodrigo Vivi @ 2026-04-09 13:37 UTC
  To: Tauro, Riana
  Cc: intel-xe, dri-devel, netdev, Zack McKevitt, joonas.lahtinen,
	aravind.iddamsetty, anshuman.gupta, simona.vetter, airlied,
	pratik.bari, joshua.santosh.ranjan, ashwin.kumar.kulkarni,
	shubham.kumar, ravi.kishore.koppuravuri, raag.jadav,
	anvesh.bakwad, maarten.lankhorst, Jakub Kicinski, Lijo Lazar,
	Hawking Zhang, David S. Miller, Paolo Abeni, Eric Dumazet

On Thu, Apr 09, 2026 at 12:51:44PM +0530, Tauro, Riana wrote:
> Hi Zack
> 
> Could you please take a look at this patch if applicable to your usecase.
> Please let me know if any changes are required
> 
> @Rodrigo This is already reviewed by Jakub and Raag.
> If there are no opens, can this be merged via drm_misc

If we push this to drm-misc-next, it might take a few weeks to propagate
back to drm-xe-next. With other work from you and Raag moving at a fast pace
on drm-xe-next in this area, I'm afraid it could cause some conflicts.

It is definitely fine by me, but another option is to get an ack from the
drm-misc maintainers and take this through drm-xe-next.

So, are you really okay with drm-misc-next?

> 
> Thanks
> Riana
> > +	[DRM_RAS_A_ERROR_COUNTER_ATTRS_NODE_ID] = { .type = NLA_U32, },
> > +	[DRM_RAS_A_ERROR_COUNTER_ATTRS_ERROR_ID] = { .type = NLA_U32, },
> > +};
> > +
> >   /* Ops table for drm_ras */
> >   static const struct genl_split_ops drm_ras_nl_ops[] = {
> >   	{
> > @@ -43,6 +49,13 @@ static const struct genl_split_ops drm_ras_nl_ops[] = {
> >   		.maxattr	= DRM_RAS_A_ERROR_COUNTER_ATTRS_NODE_ID,
> >   		.flags		= GENL_ADMIN_PERM | GENL_CMD_CAP_DUMP,
> >   	},
> > +	{
> > +		.cmd		= DRM_RAS_CMD_CLEAR_ERROR_COUNTER,
> > +		.doit		= drm_ras_nl_clear_error_counter_doit,
> > +		.policy		= drm_ras_clear_error_counter_nl_policy,
> > +		.maxattr	= DRM_RAS_A_ERROR_COUNTER_ATTRS_ERROR_ID,
> > +		.flags		= GENL_ADMIN_PERM | GENL_CMD_CAP_DO,
> > +	},
> >   };
> >   struct genl_family drm_ras_nl_family __ro_after_init = {
> > diff --git a/drivers/gpu/drm/drm_ras_nl.h b/drivers/gpu/drm/drm_ras_nl.h
> > index 06ccd9342773..a398643572a5 100644
> > --- a/drivers/gpu/drm/drm_ras_nl.h
> > +++ b/drivers/gpu/drm/drm_ras_nl.h
> > @@ -18,6 +18,8 @@ int drm_ras_nl_get_error_counter_doit(struct sk_buff *skb,
> >   				      struct genl_info *info);
> >   int drm_ras_nl_get_error_counter_dumpit(struct sk_buff *skb,
> >   					struct netlink_callback *cb);
> > +int drm_ras_nl_clear_error_counter_doit(struct sk_buff *skb,
> > +					struct genl_info *info);
> >   extern struct genl_family drm_ras_nl_family;
> > diff --git a/include/drm/drm_ras.h b/include/drm/drm_ras.h
> > index 5d50209e51db..f2a787bc4f64 100644
> > --- a/include/drm/drm_ras.h
> > +++ b/include/drm/drm_ras.h
> > @@ -58,6 +58,17 @@ struct drm_ras_node {
> >   	int (*query_error_counter)(struct drm_ras_node *node, u32 error_id,
> >   				   const char **name, u32 *val);
> > +	/**
> > +	 * @clear_error_counter:
> > +	 *
> > +	 * This callback is used by drm_ras to clear a specific error counter.
> > +	 * Driver should implement this callback to support clearing error counters
> > +	 * of a node.
> > +	 *
> > +	 * Returns: 0 on success, negative error code on failure.
> > +	 */
> > +	int (*clear_error_counter)(struct drm_ras_node *node, u32 error_id);
> > +
> >   	/** @priv: Driver private data */
> >   	void *priv;
> >   };
> > diff --git a/include/uapi/drm/drm_ras.h b/include/uapi/drm/drm_ras.h
> > index 5f40fa5b869d..218a3ee86805 100644
> > --- a/include/uapi/drm/drm_ras.h
> > +++ b/include/uapi/drm/drm_ras.h
> > @@ -41,6 +41,7 @@ enum {
> >   enum {
> >   	DRM_RAS_CMD_LIST_NODES = 1,
> >   	DRM_RAS_CMD_GET_ERROR_COUNTER,
> > +	DRM_RAS_CMD_CLEAR_ERROR_COUNTER,
> >   	__DRM_RAS_CMD_MAX,
> >   	DRM_RAS_CMD_MAX = (__DRM_RAS_CMD_MAX - 1)
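
For readers following the control flow, here is a minimal userspace sketch of
the validation order used by drm_ras_nl_clear_error_counter_doit() in the
quoted patch: look up the node, return -ENOENT if the node is missing or has
no clear callback, return -EINVAL if the error ID is outside the node's
range, otherwise call the callback. This is not kernel code. struct ras_node,
the fixed-size nodes[] table, lookup_node(), clear_counter(), demo_clear()
and sketch_selfcheck() are hypothetical stand-ins (for struct drm_ras_node,
the drm_ras_xa xarray, xa_load() and the doit handler), and the netlink
attribute parsing is left out entirely.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>
#include <stdint.h>

#define MAX_NODES 4	/* hypothetical table size; the kernel uses an xarray */

struct ras_range {
	uint32_t first, last;
};

/* Hypothetical, trimmed-down stand-in for struct drm_ras_node. */
struct ras_node {
	struct ras_range error_counter_range;
	int (*clear_error_counter)(struct ras_node *node, uint32_t error_id);
	uint32_t counters[8];
};

static struct ras_node *nodes[MAX_NODES];

/* Stand-in for xa_load(&drm_ras_xa, node_id): NULL when nothing is registered. */
static struct ras_node *lookup_node(uint32_t node_id)
{
	return node_id < MAX_NODES ? nodes[node_id] : NULL;
}

/* Same check order as the quoted handler, minus attribute parsing. */
int clear_counter(uint32_t node_id, uint32_t error_id)
{
	struct ras_node *node = lookup_node(node_id);

	/* Unknown node, or a node whose driver does not support clearing. */
	if (!node || !node->clear_error_counter)
		return -ENOENT;

	/* Error ID must lie inside the node's advertised range. */
	if (error_id < node->error_counter_range.first ||
	    error_id > node->error_counter_range.last)
		return -EINVAL;

	return node->clear_error_counter(node, error_id);
}

/* Hypothetical driver callback: zero the counter backing this error ID. */
static int demo_clear(struct ras_node *node, uint32_t error_id)
{
	node->counters[error_id - node->error_counter_range.first] = 0;
	return 0;
}

/* Exercises each return path once. */
void sketch_selfcheck(void)
{
	static struct ras_node with_cb = {
		.error_counter_range = { .first = 1, .last = 3 },
		.clear_error_counter = demo_clear,
		.counters = { 5, 7, 9 },
	};
	static struct ras_node without_cb = {
		.error_counter_range = { .first = 1, .last = 3 },
	};

	nodes[1] = &with_cb;
	nodes[2] = &without_cb;

	assert(clear_counter(1, 2) == 0);	/* cleared via the callback */
	assert(with_cb.counters[1] == 0);
	assert(with_cb.counters[0] == 5);	/* other counters untouched */
	assert(clear_counter(1, 4) == -EINVAL);	/* error ID out of range */
	assert(clear_counter(2, 1) == -ENOENT);	/* node has no clear callback */
	assert(clear_counter(3, 1) == -ENOENT);	/* node not registered */
}
```

Note that, as in the quoted handler, a node without a clear callback is
reported the same way as a node that does not exist (-ENOENT), so a caller
cannot tell the two cases apart from the return value alone.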


end of thread, other threads:[~2026-04-09 13:37 UTC | newest]

Thread overview: 5+ messages
2026-04-09  7:33 [PATCH v2 0/2] Add clear-error-counter command to drm_ras Riana Tauro
2026-04-09  7:33 ` [PATCH v2 1/2] drm/drm_ras: Add clear-error-counter netlink " Riana Tauro
2026-04-09  7:21   ` Tauro, Riana
2026-04-09 13:37     ` Rodrigo Vivi
2026-04-09  7:33 ` [PATCH v2 2/2] drm/xe/xe_drm_ras: Add support for clear-error-counter in XE drm_ras Riana Tauro
