* [PATCH v3 1/4] drm/ras: Introduce error threshold
2026-06-04 18:46 [PATCH v3 0/4] Introduce error threshold to drm_ras Raag Jadav
@ 2026-06-04 18:46 ` Raag Jadav
2026-06-15 8:56 ` Tauro, Riana
2026-06-04 18:46 ` [PATCH v3 2/4] drm/xe/xe_ras: Add support for error counter Raag Jadav
` (2 subsequent siblings)
3 siblings, 1 reply; 7+ messages in thread
From: Raag Jadav @ 2026-06-04 18:46 UTC
To: intel-xe, dri-devel, netdev
Cc: simona.vetter, airlied, kuba, lijo.lazar, Hawking.Zhang, davem,
pabeni, edumazet, dev, zachary.mckevitt, rodrigo.vivi,
riana.tauro, michal.wajdeczko, matthew.d.roper, mallesh.koujalagi,
Raag Jadav
Add get-error-threshold and set-error-threshold commands, which allow
querying and setting the error threshold of a counter. In the RAS context,
the threshold is the number of errors the hardware is expected to
accumulate before it raises them to software. This gives fine-grained
control over the error notifications raised by the hardware.
Signed-off-by: Raag Jadav <raag.jadav@intel.com>
---
v2: Document threshold definition (Riana)
Return -EOPNOTSUPP on threshold callbacks absence (Riana)
Cancel and free genlmsg on failure (Riana)
Document threshold bounds checking responsibility (Riana)
v3: Move documentation from yaml to rst file (Riana)
s/value/threshold (Riana)
Use goto for error handling (Riana)
---
Documentation/gpu/drm-ras.rst | 18 +++
Documentation/netlink/specs/drm_ras.yaml | 32 +++++
drivers/gpu/drm/drm_ras.c | 167 +++++++++++++++++++++++
drivers/gpu/drm/drm_ras_nl.c | 27 ++++
drivers/gpu/drm/drm_ras_nl.h | 4 +
include/drm/drm_ras.h | 29 ++++
include/uapi/drm/drm_ras.h | 3 +
7 files changed, 280 insertions(+)
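The threshold definition in the commit message can be illustrated with a small userspace model. This is only a sketch under stated assumptions, not code from the patch: `struct sketch_counter` and `sketch_record_error()` are hypothetical names, and the model assumes the accumulated count restarts after each notification, which the patch does not specify.

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical model of one hardware error counter; not part of drm_ras. */
struct sketch_counter {
	uint32_t pending;   /* errors accumulated since the last notification */
	uint32_t threshold; /* errors to accumulate before notifying software */
	uint32_t notified;  /* number of notifications raised so far */
};

/*
 * Record one hardware error. Returns true when the accumulated count
 * reaches the threshold, i.e. when this model would notify software.
 * Assumption: the pending count restarts after each notification.
 * In this model a threshold of 0 behaves like 1 (notify on every error);
 * real hardware may define it differently.
 */
static bool sketch_record_error(struct sketch_counter *c)
{
	c->pending++;
	if (c->pending < c->threshold)
		return false;

	c->pending = 0;
	c->notified++;
	return true;
}
```

With a threshold of 3, seven recorded errors produce two notifications in this model and leave one error pending. A larger threshold therefore trades notification latency for fewer interruptions.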
diff --git a/Documentation/gpu/drm-ras.rst b/Documentation/gpu/drm-ras.rst
index 4636e68f5678..178797819d30 100644
--- a/Documentation/gpu/drm-ras.rst
+++ b/Documentation/gpu/drm-ras.rst
@@ -54,6 +54,10 @@ User space tools can:
``node-id`` and ``error-id`` as parameters.
* Clear specific error counters with the ``clear-error-counter`` command, using both
``node-id`` and ``error-id`` as parameters.
+* Query specific error counter threshold with the ``get-error-threshold`` command, using both
+ ``node-id`` and ``error-id`` as parameters.
+* Set specific error counter threshold with the ``set-error-threshold`` command, using
+ ``node-id``, ``error-id`` and ``error-threshold`` as parameters.
YAML-based Interface
--------------------
@@ -109,3 +113,17 @@ Example: Clear an error counter for a given node
sudo ynl --family drm_ras --do clear-error-counter --json '{"node-id":0, "error-id":1}'
None
+
+Example: Query error threshold of a given counter
+
+.. code-block:: bash
+
+ sudo ynl --family drm_ras --do get-error-threshold --json '{"node-id":0, "error-id":1}'
+ {'error-id': 1, 'error-name': 'error_name1', 'error-threshold': 16}
+
+Example: Set error threshold of a given counter
+
+.. code-block:: bash
+
+ sudo ynl --family drm_ras --do set-error-threshold --json '{"node-id":0, "error-id":1, "error-threshold":8}'
+ None
diff --git a/Documentation/netlink/specs/drm_ras.yaml b/Documentation/netlink/specs/drm_ras.yaml
index e113056f8c01..9cf7f9cde242 100644
--- a/Documentation/netlink/specs/drm_ras.yaml
+++ b/Documentation/netlink/specs/drm_ras.yaml
@@ -69,6 +69,10 @@ attribute-sets:
name: error-value
type: u32
doc: Current value of the requested error counter.
+ -
+ name: error-threshold
+ type: u32
+ doc: Error threshold of the counter.
operations:
list:
@@ -124,3 +128,31 @@ operations:
do:
request:
attributes: *id-attrs
+ -
+ name: get-error-threshold
+ doc: >-
+ Retrieve error threshold of a given counter.
+ The response includes the id, the name, and current threshold
+ of the counter.
+ attribute-set: error-counter-attrs
+ flags: [admin-perm]
+ do:
+ request:
+ attributes: *id-attrs
+ reply:
+ attributes:
+ - error-id
+ - error-name
+ - error-threshold
+ -
+ name: set-error-threshold
+ doc: >-
+ Set error threshold of a given counter.
+ attribute-set: error-counter-attrs
+ flags: [admin-perm]
+ do:
+ request:
+ attributes:
+ - node-id
+ - error-id
+ - error-threshold
diff --git a/drivers/gpu/drm/drm_ras.c b/drivers/gpu/drm/drm_ras.c
index 467a169026fc..bcb6e0ef2d67 100644
--- a/drivers/gpu/drm/drm_ras.c
+++ b/drivers/gpu/drm/drm_ras.c
@@ -41,6 +41,13 @@
* Userspace must provide Node ID, Error ID.
* Clears specific error counter of a node if supported.
*
+ * 4. GET_ERROR_THRESHOLD: Query error threshold of a given counter.
+ * Userspace must provide Node ID and Error ID.
+ * Returns the error threshold of a specific counter.
+ *
+ * 5. SET_ERROR_THRESHOLD: Set error threshold of a given counter.
+ * Userspace must provide Node ID, Error ID and threshold to be set.
+ *
* Node registration:
*
* - drm_ras_node_register(): Registers a new node and assigns
@@ -61,6 +68,13 @@
* + The error counters in the driver doesn't need to be contiguous, but the
* driver must return -ENOENT to the query_error_counter as an indication
* that the ID should be skipped and not listed in the netlink API.
+ * + The driver can optionally implement query_error_threshold() and
+ * set_error_threshold() callbacks to facilitate getting/setting error
+ * threshold of the counter. Threshold in RAS context means the number of
+ * errors the hardware is expected to accumulate before it raises them to
+ * software. This is to have a fine grained control over error notifications
+ * that are raised by the hardware.
+ * + The driver is responsible for error threshold bounds checking.
*
* Netlink handlers:
*
@@ -72,6 +86,10 @@
* operation, fetching a counter value from a specific node.
* - drm_ras_nl_clear_error_counter_doit(): Implements the CLEAR_ERROR_COUNTER doit
* operation, clearing a counter value from a specific node.
+ * - drm_ras_nl_get_error_threshold_doit(): Implements the GET_ERROR_THRESHOLD doit
+ * operation, fetching the error threshold of a specific counter.
+ * - drm_ras_nl_set_error_threshold_doit(): Implements the SET_ERROR_THRESHOLD doit
+ * operation, setting the error threshold of a specific counter.
*/
static DEFINE_XARRAY_ALLOC(drm_ras_xa);
@@ -168,6 +186,43 @@ static int get_node_error_counter(u32 node_id, u32 error_id,
return node->query_error_counter(node, error_id, name, value);
}
+static int get_node_error_threshold(u32 node_id, u32 error_id,
+ const char **name, u32 *threshold)
+{
+ struct drm_ras_node *node;
+
+ node = xa_load(&drm_ras_xa, node_id);
+ if (!node)
+ return -ENOENT;
+
+ if (!node->query_error_threshold)
+ return -EOPNOTSUPP;
+
+ if (error_id < node->error_counter_range.first ||
+ error_id > node->error_counter_range.last)
+ return -EINVAL;
+
+ return node->query_error_threshold(node, error_id, name, threshold);
+}
+
+static int set_node_error_threshold(u32 node_id, u32 error_id, u32 threshold)
+{
+ struct drm_ras_node *node;
+
+ node = xa_load(&drm_ras_xa, node_id);
+ if (!node)
+ return -ENOENT;
+
+ if (!node->set_error_threshold)
+ return -EOPNOTSUPP;
+
+ if (error_id < node->error_counter_range.first ||
+ error_id > node->error_counter_range.last)
+ return -EINVAL;
+
+ return node->set_error_threshold(node, error_id, threshold);
+}
+
static int msg_reply_value(struct sk_buff *msg, u32 error_id,
const char *error_name, u32 value)
{
@@ -186,6 +241,24 @@ static int msg_reply_value(struct sk_buff *msg, u32 error_id,
value);
}
+static int msg_reply_threshold(struct sk_buff *msg, u32 error_id,
+ const char *error_name, u32 threshold)
+{
+ int ret;
+
+ ret = nla_put_u32(msg, DRM_RAS_A_ERROR_COUNTER_ATTRS_ERROR_ID, error_id);
+ if (ret)
+ return ret;
+
+ ret = nla_put_string(msg, DRM_RAS_A_ERROR_COUNTER_ATTRS_ERROR_NAME,
+ error_name);
+ if (ret)
+ return ret;
+
+ return nla_put_u32(msg, DRM_RAS_A_ERROR_COUNTER_ATTRS_ERROR_THRESHOLD,
+ threshold);
+}
+
static int doit_reply_value(struct genl_info *info, u32 node_id,
u32 error_id)
{
@@ -225,6 +298,45 @@ static int doit_reply_value(struct genl_info *info, u32 node_id,
return ret;
}
+static int doit_reply_threshold(struct genl_info *info, u32 node_id,
+ u32 error_id)
+{
+ const char *error_name;
+ struct sk_buff *msg;
+ struct nlattr *hdr;
+ u32 threshold;
+ int ret;
+
+ msg = genlmsg_new(NLMSG_GOODSIZE, GFP_KERNEL);
+ if (!msg)
+ return -ENOMEM;
+
+ hdr = genlmsg_iput(msg, info);
+ if (!hdr) {
+ ret = -EMSGSIZE;
+ goto free_msg;
+ }
+
+ ret = get_node_error_threshold(node_id, error_id,
+ &error_name, &threshold);
+ if (ret)
+ goto cancel_msg;
+
+ ret = msg_reply_threshold(msg, error_id, error_name, threshold);
+ if (ret)
+ goto cancel_msg;
+
+ genlmsg_end(msg, hdr);
+
+ return genlmsg_reply(msg, info);
+
+cancel_msg:
+ genlmsg_cancel(msg, hdr);
+free_msg:
+ nlmsg_free(msg);
+ return ret;
+}
+
/**
* drm_ras_nl_get_error_counter_dumpit() - Dump all Error Counters
* @skb: Netlink message buffer
@@ -358,6 +470,61 @@ int drm_ras_nl_clear_error_counter_doit(struct sk_buff *skb,
return node->clear_error_counter(node, error_id);
}
+/**
+ * drm_ras_nl_get_error_threshold_doit() - Query error threshold of a counter
+ * @skb: Netlink message buffer
+ * @info: Generic Netlink info containing attributes of the request
+ *
+ * Extracts the Node ID and Error ID from the netlink attributes and retrieves
+ * the error threshold of the corresponding counter. Sends the result back to
+ * the requesting user via the standard Genl reply.
+ *
+ * Return: 0 on success, or negative errno on failure.
+ */
+int drm_ras_nl_get_error_threshold_doit(struct sk_buff *skb,
+ struct genl_info *info)
+{
+ u32 node_id, error_id;
+
+ if (!info->attrs ||
+ GENL_REQ_ATTR_CHECK(info, DRM_RAS_A_ERROR_COUNTER_ATTRS_NODE_ID) ||
+ GENL_REQ_ATTR_CHECK(info, DRM_RAS_A_ERROR_COUNTER_ATTRS_ERROR_ID))
+ return -EINVAL;
+
+ node_id = nla_get_u32(info->attrs[DRM_RAS_A_ERROR_COUNTER_ATTRS_NODE_ID]);
+ error_id = nla_get_u32(info->attrs[DRM_RAS_A_ERROR_COUNTER_ATTRS_ERROR_ID]);
+
+ return doit_reply_threshold(info, node_id, error_id);
+}
+
+/**
+ * drm_ras_nl_set_error_threshold_doit() - Set error threshold of a counter
+ * @skb: Netlink message buffer
+ * @info: Generic Netlink info containing attributes of the request
+ *
+ * Extracts the Node ID, Error ID and threshold from the netlink attributes and
+ * sets the error threshold of the corresponding counter.
+ *
+ * Return: 0 on success, or negative errno on failure.
+ */
+int drm_ras_nl_set_error_threshold_doit(struct sk_buff *skb,
+ struct genl_info *info)
+{
+ u32 node_id, error_id, threshold;
+
+ if (!info->attrs ||
+ GENL_REQ_ATTR_CHECK(info, DRM_RAS_A_ERROR_COUNTER_ATTRS_NODE_ID) ||
+ GENL_REQ_ATTR_CHECK(info, DRM_RAS_A_ERROR_COUNTER_ATTRS_ERROR_ID) ||
+ GENL_REQ_ATTR_CHECK(info, DRM_RAS_A_ERROR_COUNTER_ATTRS_ERROR_THRESHOLD))
+ return -EINVAL;
+
+ node_id = nla_get_u32(info->attrs[DRM_RAS_A_ERROR_COUNTER_ATTRS_NODE_ID]);
+ error_id = nla_get_u32(info->attrs[DRM_RAS_A_ERROR_COUNTER_ATTRS_ERROR_ID]);
+ threshold = nla_get_u32(info->attrs[DRM_RAS_A_ERROR_COUNTER_ATTRS_ERROR_THRESHOLD]);
+
+ return set_node_error_threshold(node_id, error_id, threshold);
+}
+
/**
* drm_ras_node_register() - Register a new RAS node
* @node: Node structure to register
diff --git a/drivers/gpu/drm/drm_ras_nl.c b/drivers/gpu/drm/drm_ras_nl.c
index dea1c1b2494e..02e8e5054d05 100644
--- a/drivers/gpu/drm/drm_ras_nl.c
+++ b/drivers/gpu/drm/drm_ras_nl.c
@@ -28,6 +28,19 @@ static const struct nla_policy drm_ras_clear_error_counter_nl_policy[DRM_RAS_A_E
[DRM_RAS_A_ERROR_COUNTER_ATTRS_ERROR_ID] = { .type = NLA_U32, },
};
+/* DRM_RAS_CMD_GET_ERROR_THRESHOLD - do */
+static const struct nla_policy drm_ras_get_error_threshold_nl_policy[DRM_RAS_A_ERROR_COUNTER_ATTRS_ERROR_ID + 1] = {
+ [DRM_RAS_A_ERROR_COUNTER_ATTRS_NODE_ID] = { .type = NLA_U32, },
+ [DRM_RAS_A_ERROR_COUNTER_ATTRS_ERROR_ID] = { .type = NLA_U32, },
+};
+
+/* DRM_RAS_CMD_SET_ERROR_THRESHOLD - do */
+static const struct nla_policy drm_ras_set_error_threshold_nl_policy[DRM_RAS_A_ERROR_COUNTER_ATTRS_ERROR_THRESHOLD + 1] = {
+ [DRM_RAS_A_ERROR_COUNTER_ATTRS_NODE_ID] = { .type = NLA_U32, },
+ [DRM_RAS_A_ERROR_COUNTER_ATTRS_ERROR_ID] = { .type = NLA_U32, },
+ [DRM_RAS_A_ERROR_COUNTER_ATTRS_ERROR_THRESHOLD] = { .type = NLA_U32, },
+};
+
/* Ops table for drm_ras */
static const struct genl_split_ops drm_ras_nl_ops[] = {
{
@@ -56,6 +69,20 @@ static const struct genl_split_ops drm_ras_nl_ops[] = {
.maxattr = DRM_RAS_A_ERROR_COUNTER_ATTRS_ERROR_ID,
.flags = GENL_ADMIN_PERM | GENL_CMD_CAP_DO,
},
+ {
+ .cmd = DRM_RAS_CMD_GET_ERROR_THRESHOLD,
+ .doit = drm_ras_nl_get_error_threshold_doit,
+ .policy = drm_ras_get_error_threshold_nl_policy,
+ .maxattr = DRM_RAS_A_ERROR_COUNTER_ATTRS_ERROR_ID,
+ .flags = GENL_ADMIN_PERM | GENL_CMD_CAP_DO,
+ },
+ {
+ .cmd = DRM_RAS_CMD_SET_ERROR_THRESHOLD,
+ .doit = drm_ras_nl_set_error_threshold_doit,
+ .policy = drm_ras_set_error_threshold_nl_policy,
+ .maxattr = DRM_RAS_A_ERROR_COUNTER_ATTRS_ERROR_THRESHOLD,
+ .flags = GENL_ADMIN_PERM | GENL_CMD_CAP_DO,
+ },
};
struct genl_family drm_ras_nl_family __ro_after_init = {
diff --git a/drivers/gpu/drm/drm_ras_nl.h b/drivers/gpu/drm/drm_ras_nl.h
index a398643572a5..57b1e647d833 100644
--- a/drivers/gpu/drm/drm_ras_nl.h
+++ b/drivers/gpu/drm/drm_ras_nl.h
@@ -20,6 +20,10 @@ int drm_ras_nl_get_error_counter_dumpit(struct sk_buff *skb,
struct netlink_callback *cb);
int drm_ras_nl_clear_error_counter_doit(struct sk_buff *skb,
struct genl_info *info);
+int drm_ras_nl_get_error_threshold_doit(struct sk_buff *skb,
+ struct genl_info *info);
+int drm_ras_nl_set_error_threshold_doit(struct sk_buff *skb,
+ struct genl_info *info);
extern struct genl_family drm_ras_nl_family;
diff --git a/include/drm/drm_ras.h b/include/drm/drm_ras.h
index f2a787bc4f64..9cda4bbc9749 100644
--- a/include/drm/drm_ras.h
+++ b/include/drm/drm_ras.h
@@ -69,6 +69,35 @@ struct drm_ras_node {
*/
int (*clear_error_counter)(struct drm_ras_node *node, u32 error_id);
+ /**
+ * @query_error_threshold:
+ *
+ * This callback is used by drm-ras to query error threshold of a
+ * specific counter.
+ *
+ * Driver should expect query_error_threshold() to be called with
+ * error_id from `error_counter_range.first` to
+ * `error_counter_range.last`.
+ *
+ * Returns: 0 on success, negative error code on failure.
+ */
+ int (*query_error_threshold)(struct drm_ras_node *node, u32 error_id,
+ const char **name, u32 *threshold);
+ /**
+ * @set_error_threshold:
+ *
+ * This callback is used by drm-ras to set error threshold of a specific
+ * counter.
+ *
+ * Driver should expect set_error_threshold() to be called with error_id
+ * from `error_counter_range.first` to `error_counter_range.last`.
+ * Driver is responsible for error threshold bounds checking.
+ *
+ * Returns: 0 on success, negative error code on failure.
+ */
+ int (*set_error_threshold)(struct drm_ras_node *node, u32 error_id,
+ u32 threshold);
+
/** @priv: Driver private data */
void *priv;
};
diff --git a/include/uapi/drm/drm_ras.h b/include/uapi/drm/drm_ras.h
index 218a3ee86805..27c68956495f 100644
--- a/include/uapi/drm/drm_ras.h
+++ b/include/uapi/drm/drm_ras.h
@@ -33,6 +33,7 @@ enum {
DRM_RAS_A_ERROR_COUNTER_ATTRS_ERROR_ID,
DRM_RAS_A_ERROR_COUNTER_ATTRS_ERROR_NAME,
DRM_RAS_A_ERROR_COUNTER_ATTRS_ERROR_VALUE,
+ DRM_RAS_A_ERROR_COUNTER_ATTRS_ERROR_THRESHOLD,
__DRM_RAS_A_ERROR_COUNTER_ATTRS_MAX,
DRM_RAS_A_ERROR_COUNTER_ATTRS_MAX = (__DRM_RAS_A_ERROR_COUNTER_ATTRS_MAX - 1)
@@ -42,6 +43,8 @@ enum {
DRM_RAS_CMD_LIST_NODES = 1,
DRM_RAS_CMD_GET_ERROR_COUNTER,
DRM_RAS_CMD_CLEAR_ERROR_COUNTER,
+ DRM_RAS_CMD_GET_ERROR_THRESHOLD,
+ DRM_RAS_CMD_SET_ERROR_THRESHOLD,
__DRM_RAS_CMD_MAX,
DRM_RAS_CMD_MAX = (__DRM_RAS_CMD_MAX - 1)
--
2.43.0
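The order of checks in set_node_error_threshold() (node lookup, callback presence, error ID range, then the driver callback) can be exercised outside the kernel with a mock. The sketch below is a userspace approximation, not the kernel code: `struct sketch_node`, the `sketch_nodes[]` array standing in for the drm_ras_xa xarray, and `sketch_store_threshold()` are all hypothetical.

```c
#include <errno.h>
#include <stddef.h>
#include <stdint.h>

/* Userspace stand-in for the parts of struct drm_ras_node used here. */
struct sketch_node {
	struct { uint32_t first, last; } error_counter_range;
	int (*set_error_threshold)(struct sketch_node *node, uint32_t error_id,
				   uint32_t threshold);
	void *priv;
};

#define SKETCH_MAX_NODES 4

/* Hypothetical replacement for the drm_ras_xa xarray lookup. */
static struct sketch_node *sketch_nodes[SKETCH_MAX_NODES];

/* Mirrors the check order of set_node_error_threshold() in the patch. */
static int sketch_set_node_error_threshold(uint32_t node_id, uint32_t error_id,
					   uint32_t threshold)
{
	struct sketch_node *node;

	node = node_id < SKETCH_MAX_NODES ? sketch_nodes[node_id] : NULL;
	if (!node)
		return -ENOENT;		/* unknown node */

	if (!node->set_error_threshold)
		return -EOPNOTSUPP;	/* driver did not implement the callback */

	if (error_id < node->error_counter_range.first ||
	    error_id > node->error_counter_range.last)
		return -EINVAL;		/* error ID outside the node's range */

	return node->set_error_threshold(node, error_id, threshold);
}

/* Example driver callback: stores the threshold in priv (a uint32_t). */
static int sketch_store_threshold(struct sketch_node *node, uint32_t error_id,
				  uint32_t threshold)
{
	(void)error_id;
	*(uint32_t *)node->priv = threshold;
	return 0;
}
```

Because the callback check comes before the range check, a node without threshold support reports -EOPNOTSUPP even for an out-of-range error ID. The mock does no bounds checking on the threshold value itself, matching the patch's statement that this is the driver's responsibility.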
* Re: [PATCH v3 1/4] drm/ras: Introduce error threshold
2026-06-04 18:46 ` [PATCH v3 1/4] drm/ras: Introduce error threshold Raag Jadav
@ 2026-06-15 8:56 ` Tauro, Riana
0 siblings, 0 replies; 7+ messages in thread
From: Tauro, Riana @ 2026-06-15 8:56 UTC
To: Raag Jadav, intel-xe, dri-devel, netdev
Cc: simona.vetter, airlied, kuba, lijo.lazar, Hawking.Zhang, davem,
pabeni, edumazet, dev, zachary.mckevitt, rodrigo.vivi,
michal.wajdeczko, matthew.d.roper, mallesh.koujalagi
On 05-06-2026 00:16, Raag Jadav wrote:
> Add get-error-threshold and set-error-threshold commands, which allow
> querying and setting the error threshold of a counter. In the RAS context,
> the threshold is the number of errors the hardware is expected to
> accumulate before it raises them to software. This gives fine-grained
> control over the error notifications raised by the hardware.
>
> Signed-off-by: Raag Jadav <raag.jadav@intel.com>
> ---
> v2: Document threshold definition (Riana)
> Return -EOPNOTSUPP on threshold callbacks absence (Riana)
> Cancel and free genlmsg on failure (Riana)
> Document threshold bounds checking responsibility (Riana)
> v3: Move documentation from yaml to rst file (Riana)
> s/value/threshold (Riana)
> Use goto for error handling (Riana)
> ---
> Documentation/gpu/drm-ras.rst | 18 +++
> Documentation/netlink/specs/drm_ras.yaml | 32 +++++
> drivers/gpu/drm/drm_ras.c | 167 +++++++++++++++++++++++
> drivers/gpu/drm/drm_ras_nl.c | 27 ++++
> drivers/gpu/drm/drm_ras_nl.h | 4 +
> include/drm/drm_ras.h | 29 ++++
> include/uapi/drm/drm_ras.h | 3 +
> 7 files changed, 280 insertions(+)
>
> diff --git a/Documentation/gpu/drm-ras.rst b/Documentation/gpu/drm-ras.rst
> index 4636e68f5678..178797819d30 100644
> --- a/Documentation/gpu/drm-ras.rst
> +++ b/Documentation/gpu/drm-ras.rst
> @@ -54,6 +54,10 @@ User space tools can:
> ``node-id`` and ``error-id`` as parameters.
> * Clear specific error counters with the ``clear-error-counter`` command, using both
> ``node-id`` and ``error-id`` as parameters.
> +* Query specific error counter threshold with the ``get-error-threshold`` command, using both
> + ``node-id`` and ``error-id`` as parameters.
> +* Set specific error counter threshold with the ``set-error-threshold`` command, using
> + ``node-id``, ``error-id`` and ``error-threshold`` as parameters.
>
> YAML-based Interface
> --------------------
> @@ -109,3 +113,17 @@ Example: Clear an error counter for a given node
>
> sudo ynl --family drm_ras --do clear-error-counter --json '{"node-id":0, "error-id":1}'
> None
> +
> +Example: Query error threshold of a given counter
> +
> +.. code-block:: bash
> +
> + sudo ynl --family drm_ras --do get-error-threshold --json '{"node-id":0, "error-id":1}'
> + {'error-id': 1, 'error-name': 'error_name1', 'error-threshold': 16}
> +
> +Example: Set error threshold of a given counter
> +
> +.. code-block:: bash
> +
> + sudo ynl --family drm_ras --do set-error-threshold --json '{"node-id":0, "error-id":1, "error-threshold":8}'
> + None
> diff --git a/Documentation/netlink/specs/drm_ras.yaml b/Documentation/netlink/specs/drm_ras.yaml
> index e113056f8c01..9cf7f9cde242 100644
> --- a/Documentation/netlink/specs/drm_ras.yaml
> +++ b/Documentation/netlink/specs/drm_ras.yaml
> @@ -69,6 +69,10 @@ attribute-sets:
> name: error-value
> type: u32
> doc: Current value of the requested error counter.
> + -
> + name: error-threshold
> + type: u32
> + doc: Error threshold of the counter.
>
> operations:
> list:
> @@ -124,3 +128,31 @@ operations:
> do:
> request:
> attributes: *id-attrs
> + -
> + name: get-error-threshold
> + doc: >-
> + Retrieve error threshold of a given counter.
> + The response includes the id, the name, and current threshold
> + of the counter.
> + attribute-set: error-counter-attrs
> + flags: [admin-perm]
> + do:
> + request:
> + attributes: *id-attrs
> + reply:
> + attributes:
> + - error-id
> + - error-name
> + - error-threshold
> + -
> + name: set-error-threshold
> + doc: >-
> + Set error threshold of a given counter.
> + attribute-set: error-counter-attrs
> + flags: [admin-perm]
> + do:
> + request:
> + attributes:
> + - node-id
> + - error-id
> + - error-threshold
> diff --git a/drivers/gpu/drm/drm_ras.c b/drivers/gpu/drm/drm_ras.c
> index 467a169026fc..bcb6e0ef2d67 100644
> --- a/drivers/gpu/drm/drm_ras.c
> +++ b/drivers/gpu/drm/drm_ras.c
> @@ -41,6 +41,13 @@
> * Userspace must provide Node ID, Error ID.
> * Clears specific error counter of a node if supported.
> *
> + * 4. GET_ERROR_THRESHOLD: Query error threshold of a given counter.
> + * Userspace must provide Node ID and Error ID.
> + * Returns the error threshold of a specific counter.
> + *
> + * 5. SET_ERROR_THRESHOLD: Set error threshold of a given counter.
> + * Userspace must provide Node ID, Error ID and threshold to be set.
> + *
> * Node registration:
> *
> * - drm_ras_node_register(): Registers a new node and assigns
> @@ -61,6 +68,13 @@
> * + The error counters in the driver doesn't need to be contiguous, but the
> * driver must return -ENOENT to the query_error_counter as an indication
> * that the ID should be skipped and not listed in the netlink API.
> + * + The driver can optionally implement query_error_threshold() and
> + * set_error_threshold() callbacks to facilitate getting/setting error
> + * threshold of the counter. Threshold in RAS context means the number of
> + * errors the hardware is expected to accumulate before it raises them to
> + * software. This is to have a fine grained control over error notifications
> + * that are raised by the hardware.
> + * + The driver is responsible for error threshold bounds checking.
Can the threshold be set to 0? What should the behaviour be?
> *
> * Netlink handlers:
> *
> @@ -72,6 +86,10 @@
> * operation, fetching a counter value from a specific node.
> * - drm_ras_nl_clear_error_counter_doit(): Implements the CLEAR_ERROR_COUNTER doit
> * operation, clearing a counter value from a specific node.
> + * - drm_ras_nl_get_error_threshold_doit(): Implements the GET_ERROR_THRESHOLD doit
> + * operation, fetching the error threshold of a specific counter.
> + * - drm_ras_nl_set_error_threshold_doit(): Implements the SET_ERROR_THRESHOLD doit
> + * operation, setting the error threshold of a specific counter.
> */
>
> static DEFINE_XARRAY_ALLOC(drm_ras_xa);
> @@ -168,6 +186,43 @@ static int get_node_error_counter(u32 node_id, u32 error_id,
> return node->query_error_counter(node, error_id, name, value);
> }
>
> +static int get_node_error_threshold(u32 node_id, u32 error_id,
> + const char **name, u32 *threshold)
> +{
> + struct drm_ras_node *node;
> +
> + node = xa_load(&drm_ras_xa, node_id);
> + if (!node)
> + return -ENOENT;
> +
> + if (!node->query_error_threshold)
> + return -EOPNOTSUPP;
> +
> + if (error_id < node->error_counter_range.first ||
> + error_id > node->error_counter_range.last)
> + return -EINVAL;
> +
> + return node->query_error_threshold(node, error_id, name, threshold);
> +}
> +
> +static int set_node_error_threshold(u32 node_id, u32 error_id, u32 threshold)
> +{
> + struct drm_ras_node *node;
> +
> + node = xa_load(&drm_ras_xa, node_id);
> + if (!node)
> + return -ENOENT;
> +
> + if (!node->set_error_threshold)
> + return -EOPNOTSUPP;
> +
> + if (error_id < node->error_counter_range.first ||
> + error_id > node->error_counter_range.last)
> + return -EINVAL;
> +
> + return node->set_error_threshold(node, error_id, threshold);
> +}
> +
> static int msg_reply_value(struct sk_buff *msg, u32 error_id,
> const char *error_name, u32 value)
> {
> @@ -186,6 +241,24 @@ static int msg_reply_value(struct sk_buff *msg, u32 error_id,
> value);
> }
>
> +static int msg_reply_threshold(struct sk_buff *msg, u32 error_id,
> + const char *error_name, u32 threshold)
> +{
> + int ret;
> +
> + ret = nla_put_u32(msg, DRM_RAS_A_ERROR_COUNTER_ATTRS_ERROR_ID, error_id);
> + if (ret)
> + return ret;
> +
> + ret = nla_put_string(msg, DRM_RAS_A_ERROR_COUNTER_ATTRS_ERROR_NAME,
> + error_name);
> + if (ret)
> + return ret;
can be in a single line
> +
> + return nla_put_u32(msg, DRM_RAS_A_ERROR_COUNTER_ATTRS_ERROR_THRESHOLD,
> + threshold);
same
> +}
> +
> static int doit_reply_value(struct genl_info *info, u32 node_id,
> u32 error_id)
> {
> @@ -225,6 +298,45 @@ static int doit_reply_value(struct genl_info *info, u32 node_id,
> return ret;
> }
>
> +static int doit_reply_threshold(struct genl_info *info, u32 node_id,
> + u32 error_id)
> +{
> + const char *error_name;
> + struct sk_buff *msg;
> + struct nlattr *hdr;
> + u32 threshold;
> + int ret;
> +
> + msg = genlmsg_new(NLMSG_GOODSIZE, GFP_KERNEL);
> + if (!msg)
> + return -ENOMEM;
> +
> + hdr = genlmsg_iput(msg, info);
> + if (!hdr) {
> + ret = -EMSGSIZE;
> + goto free_msg;
> + }
> +
> + ret = get_node_error_threshold(node_id, error_id,
> + &error_name, &threshold);
same
Thanks
Riana
> + if (ret)
> + goto cancel_msg;
> +
> + ret = msg_reply_threshold(msg, error_id, error_name, threshold);
> + if (ret)
> + goto cancel_msg;
> +
> + genlmsg_end(msg, hdr);
> +
> + return genlmsg_reply(msg, info);
> +
> +cancel_msg:
> + genlmsg_cancel(msg, hdr);
> +free_msg:
> + nlmsg_free(msg);
> + return ret;
> +}
> +
> /**
> * drm_ras_nl_get_error_counter_dumpit() - Dump all Error Counters
> * @skb: Netlink message buffer
> @@ -358,6 +470,61 @@ int drm_ras_nl_clear_error_counter_doit(struct sk_buff *skb,
> return node->clear_error_counter(node, error_id);
> }
>
> +/**
> + * drm_ras_nl_get_error_threshold_doit() - Query error threshold of a counter
> + * @skb: Netlink message buffer
> + * @info: Generic Netlink info containing attributes of the request
> + *
> + * Extracts the Node ID and Error ID from the netlink attributes and retrieves
> + * the error threshold of the corresponding counter. Sends the result back to
> + * the requesting user via the standard Genl reply.
> + *
> + * Return: 0 on success, or negative errno on failure.
> + */
> +int drm_ras_nl_get_error_threshold_doit(struct sk_buff *skb,
> + struct genl_info *info)
> +{
> + u32 node_id, error_id;
> +
> + if (!info->attrs ||
> + GENL_REQ_ATTR_CHECK(info, DRM_RAS_A_ERROR_COUNTER_ATTRS_NODE_ID) ||
> + GENL_REQ_ATTR_CHECK(info, DRM_RAS_A_ERROR_COUNTER_ATTRS_ERROR_ID))
> + return -EINVAL;
> +
> + node_id = nla_get_u32(info->attrs[DRM_RAS_A_ERROR_COUNTER_ATTRS_NODE_ID]);
> + error_id = nla_get_u32(info->attrs[DRM_RAS_A_ERROR_COUNTER_ATTRS_ERROR_ID]);
> +
> + return doit_reply_threshold(info, node_id, error_id);
> +}
> +
> +/**
> + * drm_ras_nl_set_error_threshold_doit() - Set error threshold of a counter
> + * @skb: Netlink message buffer
> + * @info: Generic Netlink info containing attributes of the request
> + *
> + * Extracts the Node ID, Error ID and threshold from the netlink attributes and
> + * sets the error threshold of the corresponding counter.
> + *
> + * Return: 0 on success, or negative errno on failure.
> + */
> +int drm_ras_nl_set_error_threshold_doit(struct sk_buff *skb,
> + struct genl_info *info)
> +{
> + u32 node_id, error_id, threshold;
> +
> + if (!info->attrs ||
> + GENL_REQ_ATTR_CHECK(info, DRM_RAS_A_ERROR_COUNTER_ATTRS_NODE_ID) ||
> + GENL_REQ_ATTR_CHECK(info, DRM_RAS_A_ERROR_COUNTER_ATTRS_ERROR_ID) ||
> + GENL_REQ_ATTR_CHECK(info, DRM_RAS_A_ERROR_COUNTER_ATTRS_ERROR_THRESHOLD))
> + return -EINVAL;
> +
> + node_id = nla_get_u32(info->attrs[DRM_RAS_A_ERROR_COUNTER_ATTRS_NODE_ID]);
> + error_id = nla_get_u32(info->attrs[DRM_RAS_A_ERROR_COUNTER_ATTRS_ERROR_ID]);
> + threshold = nla_get_u32(info->attrs[DRM_RAS_A_ERROR_COUNTER_ATTRS_ERROR_THRESHOLD]);
> +
> + return set_node_error_threshold(node_id, error_id, threshold);
> +}
> +
> /**
> * drm_ras_node_register() - Register a new RAS node
> * @node: Node structure to register
> diff --git a/drivers/gpu/drm/drm_ras_nl.c b/drivers/gpu/drm/drm_ras_nl.c
> index dea1c1b2494e..02e8e5054d05 100644
> --- a/drivers/gpu/drm/drm_ras_nl.c
> +++ b/drivers/gpu/drm/drm_ras_nl.c
> @@ -28,6 +28,19 @@ static const struct nla_policy drm_ras_clear_error_counter_nl_policy[DRM_RAS_A_E
> [DRM_RAS_A_ERROR_COUNTER_ATTRS_ERROR_ID] = { .type = NLA_U32, },
> };
>
> +/* DRM_RAS_CMD_GET_ERROR_THRESHOLD - do */
> +static const struct nla_policy drm_ras_get_error_threshold_nl_policy[DRM_RAS_A_ERROR_COUNTER_ATTRS_ERROR_ID + 1] = {
> + [DRM_RAS_A_ERROR_COUNTER_ATTRS_NODE_ID] = { .type = NLA_U32, },
> + [DRM_RAS_A_ERROR_COUNTER_ATTRS_ERROR_ID] = { .type = NLA_U32, },
> +};
> +
> +/* DRM_RAS_CMD_SET_ERROR_THRESHOLD - do */
> +static const struct nla_policy drm_ras_set_error_threshold_nl_policy[DRM_RAS_A_ERROR_COUNTER_ATTRS_ERROR_THRESHOLD + 1] = {
> + [DRM_RAS_A_ERROR_COUNTER_ATTRS_NODE_ID] = { .type = NLA_U32, },
> + [DRM_RAS_A_ERROR_COUNTER_ATTRS_ERROR_ID] = { .type = NLA_U32, },
> + [DRM_RAS_A_ERROR_COUNTER_ATTRS_ERROR_THRESHOLD] = { .type = NLA_U32, },
> +};
> +
> /* Ops table for drm_ras */
> static const struct genl_split_ops drm_ras_nl_ops[] = {
> {
> @@ -56,6 +69,20 @@ static const struct genl_split_ops drm_ras_nl_ops[] = {
> .maxattr = DRM_RAS_A_ERROR_COUNTER_ATTRS_ERROR_ID,
> .flags = GENL_ADMIN_PERM | GENL_CMD_CAP_DO,
> },
> + {
> + .cmd = DRM_RAS_CMD_GET_ERROR_THRESHOLD,
> + .doit = drm_ras_nl_get_error_threshold_doit,
> + .policy = drm_ras_get_error_threshold_nl_policy,
> + .maxattr = DRM_RAS_A_ERROR_COUNTER_ATTRS_ERROR_ID,
> + .flags = GENL_ADMIN_PERM | GENL_CMD_CAP_DO,
> + },
> + {
> + .cmd = DRM_RAS_CMD_SET_ERROR_THRESHOLD,
> + .doit = drm_ras_nl_set_error_threshold_doit,
> + .policy = drm_ras_set_error_threshold_nl_policy,
> + .maxattr = DRM_RAS_A_ERROR_COUNTER_ATTRS_ERROR_THRESHOLD,
> + .flags = GENL_ADMIN_PERM | GENL_CMD_CAP_DO,
> + },
> };
>
> struct genl_family drm_ras_nl_family __ro_after_init = {
> diff --git a/drivers/gpu/drm/drm_ras_nl.h b/drivers/gpu/drm/drm_ras_nl.h
> index a398643572a5..57b1e647d833 100644
> --- a/drivers/gpu/drm/drm_ras_nl.h
> +++ b/drivers/gpu/drm/drm_ras_nl.h
> @@ -20,6 +20,10 @@ int drm_ras_nl_get_error_counter_dumpit(struct sk_buff *skb,
> struct netlink_callback *cb);
> int drm_ras_nl_clear_error_counter_doit(struct sk_buff *skb,
> struct genl_info *info);
> +int drm_ras_nl_get_error_threshold_doit(struct sk_buff *skb,
> + struct genl_info *info);
> +int drm_ras_nl_set_error_threshold_doit(struct sk_buff *skb,
> + struct genl_info *info);
>
> extern struct genl_family drm_ras_nl_family;
>
> diff --git a/include/drm/drm_ras.h b/include/drm/drm_ras.h
> index f2a787bc4f64..9cda4bbc9749 100644
> --- a/include/drm/drm_ras.h
> +++ b/include/drm/drm_ras.h
> @@ -69,6 +69,35 @@ struct drm_ras_node {
> */
> int (*clear_error_counter)(struct drm_ras_node *node, u32 error_id);
>
> + /**
> + * @query_error_threshold:
> + *
> + * This callback is used by drm-ras to query error threshold of a
> + * specific counter.
> + *
> + * Driver should expect query_error_threshold() to be called with
> + * error_id from `error_counter_range.first` to
> + * `error_counter_range.last`.
> + *
> + * Returns: 0 on success, negative error code on failure.
> + */
> + int (*query_error_threshold)(struct drm_ras_node *node, u32 error_id,
> + const char **name, u32 *threshold);
> + /**
> + * @set_error_threshold:
> + *
> + * This callback is used by drm-ras to set error threshold of a specific
> + * counter.
> + *
> + * Driver should expect set_error_threshold() to be called with error_id
> + * from `error_counter_range.first` to `error_counter_range.last`.
> + * Driver is responsible for error threshold bounds checking.
> + *
> + * Returns: 0 on success, negative error code on failure.
> + */
> + int (*set_error_threshold)(struct drm_ras_node *node, u32 error_id,
> + u32 threshold);
> +
> /** @priv: Driver private data */
> void *priv;
> };
> diff --git a/include/uapi/drm/drm_ras.h b/include/uapi/drm/drm_ras.h
> index 218a3ee86805..27c68956495f 100644
> --- a/include/uapi/drm/drm_ras.h
> +++ b/include/uapi/drm/drm_ras.h
> @@ -33,6 +33,7 @@ enum {
> DRM_RAS_A_ERROR_COUNTER_ATTRS_ERROR_ID,
> DRM_RAS_A_ERROR_COUNTER_ATTRS_ERROR_NAME,
> DRM_RAS_A_ERROR_COUNTER_ATTRS_ERROR_VALUE,
> + DRM_RAS_A_ERROR_COUNTER_ATTRS_ERROR_THRESHOLD,
>
> __DRM_RAS_A_ERROR_COUNTER_ATTRS_MAX,
> DRM_RAS_A_ERROR_COUNTER_ATTRS_MAX = (__DRM_RAS_A_ERROR_COUNTER_ATTRS_MAX - 1)
> @@ -42,6 +43,8 @@ enum {
> DRM_RAS_CMD_LIST_NODES = 1,
> DRM_RAS_CMD_GET_ERROR_COUNTER,
> DRM_RAS_CMD_CLEAR_ERROR_COUNTER,
> + DRM_RAS_CMD_GET_ERROR_THRESHOLD,
> + DRM_RAS_CMD_SET_ERROR_THRESHOLD,
>
> __DRM_RAS_CMD_MAX,
> DRM_RAS_CMD_MAX = (__DRM_RAS_CMD_MAX - 1)
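
The review question about a zero threshold is left open in this thread. Since the patch makes the driver responsible for bounds checking, one possible shape for that check is sketched below. Everything here is an assumption for illustration: the limits, the `sketch_` names, the choice to reject 0, and the -ERANGE return code are not taken from the patch or from any driver.

```c
#include <errno.h>
#include <stdint.h>

/* Hypothetical hardware limits; real values depend on the device. */
#define SKETCH_THRESHOLD_MIN 1u
#define SKETCH_THRESHOLD_MAX 255u

/*
 * Validate a requested threshold before programming hardware.
 * Assumption: 0 is rejected rather than treated as "disable" or
 * "notify on every error"; a driver could just as well return -EINVAL.
 */
static int sketch_validate_threshold(uint32_t threshold)
{
	if (threshold < SKETCH_THRESHOLD_MIN || threshold > SKETCH_THRESHOLD_MAX)
		return -ERANGE;
	return 0;
}
```

A set_error_threshold() callback could call such a helper first and return its error code unchanged, so that userspace sees the rejection through the netlink reply.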
* [PATCH v3 2/4] drm/xe/xe_ras: Add support for error counter
2026-06-04 18:46 [PATCH v3 0/4] Introduce error threshold to drm_ras Raag Jadav
2026-06-04 18:46 ` [PATCH v3 1/4] drm/ras: Introduce error threshold Raag Jadav
@ 2026-06-04 18:46 ` Raag Jadav
2026-06-04 18:46 ` [PATCH v3 3/4] drm/xe/ras: Add support for error threshold Raag Jadav
2026-06-04 18:46 ` [PATCH v3 4/4] drm/xe/drm_ras: Wire up error threshold callbacks Raag Jadav
3 siblings, 0 replies; 7+ messages in thread
From: Raag Jadav @ 2026-06-04 18:46 UTC (permalink / raw)
To: intel-xe, dri-devel, netdev
Cc: simona.vetter, airlied, kuba, lijo.lazar, Hawking.Zhang, davem,
pabeni, edumazet, dev, zachary.mckevitt, rodrigo.vivi,
riana.tauro, michal.wajdeczko, matthew.d.roper, mallesh.koujalagi,
Raag Jadav
From: Riana Tauro <riana.tauro@intel.com>
Do not review, CI only.
Signed-off-by: Riana Tauro <riana.tauro@intel.com>
---
drivers/gpu/drm/xe/xe_device.c | 20 +-
drivers/gpu/drm/xe/xe_device_types.h | 2 +
drivers/gpu/drm/xe/xe_drm_ras.c | 41 ++--
drivers/gpu/drm/xe/xe_hw_error.c | 13 --
drivers/gpu/drm/xe/xe_pci.c | 3 +
drivers/gpu/drm/xe/xe_pci_types.h | 1 +
drivers/gpu/drm/xe/xe_ras.c | 192 ++++++++++++++++++
drivers/gpu/drm/xe/xe_ras.h | 5 +
drivers/gpu/drm/xe/xe_ras_types.h | 51 +++++
drivers/gpu/drm/xe/xe_sysctrl_mailbox.c | 28 +++
drivers/gpu/drm/xe/xe_sysctrl_mailbox.h | 3 +
drivers/gpu/drm/xe/xe_sysctrl_mailbox_types.h | 4 +
include/uapi/drm/xe_drm.h | 11 +-
13 files changed, 337 insertions(+), 37 deletions(-)
diff --git a/drivers/gpu/drm/xe/xe_device.c b/drivers/gpu/drm/xe/xe_device.c
index cea935c3ba67..879023133e46 100644
--- a/drivers/gpu/drm/xe/xe_device.c
+++ b/drivers/gpu/drm/xe/xe_device.c
@@ -62,6 +62,7 @@
#include "xe_psmi.h"
#include "xe_pxp.h"
#include "xe_query.h"
+#include "xe_ras.h"
#include "xe_shrinker.h"
#include "xe_soc_remapper.h"
#include "xe_survivability_mode.h"
@@ -742,6 +743,7 @@ static void vf_update_device_info(struct xe_device *xe)
xe->info.has_late_bind = 0;
xe->info.skip_guc_pc = 1;
xe->info.skip_pcode = 1;
+ xe->info.has_drm_ras = false;
}
static int xe_device_vram_alloc(struct xe_device *xe)
@@ -990,6 +992,16 @@ int xe_device_probe(struct xe_device *xe)
if (err)
return err;
+ err = xe_soc_remapper_init(xe);
+ if (err)
+ return err;
+
+ err = xe_sysctrl_init(xe);
+ if (err)
+ return err;
+
+ xe_ras_init(xe);
+
/*
* Now that GT is initialized (TTM in particular),
* we can try to init display, and inherit the initial fb.
@@ -1030,10 +1042,6 @@ int xe_device_probe(struct xe_device *xe)
xe_nvm_init(xe);
- err = xe_soc_remapper_init(xe);
- if (err)
- return err;
-
err = xe_heci_gsc_init(xe);
if (err)
return err;
@@ -1072,10 +1080,6 @@ int xe_device_probe(struct xe_device *xe)
if (err)
goto err_unregister_display;
- err = xe_sysctrl_init(xe);
- if (err)
- goto err_unregister_display;
-
err = xe_device_sysfs_init(xe);
if (err)
goto err_unregister_display;
diff --git a/drivers/gpu/drm/xe/xe_device_types.h b/drivers/gpu/drm/xe/xe_device_types.h
index 4e7f79c1d9f7..fae72310f060 100644
--- a/drivers/gpu/drm/xe/xe_device_types.h
+++ b/drivers/gpu/drm/xe/xe_device_types.h
@@ -156,6 +156,8 @@ struct xe_device {
u8 has_cached_pt:1;
/** @info.has_device_atomics_on_smem: Supports device atomics on SMEM */
u8 has_device_atomics_on_smem:1;
+ /** @info.has_drm_ras: Device supports drm_ras (Reliability, Availability, Serviceability) */
+ u8 has_drm_ras:1;
/** @info.has_fan_control: Device supports fan control */
u8 has_fan_control:1;
/** @info.has_flat_ccs: Whether flat CCS metadata is used */
diff --git a/drivers/gpu/drm/xe/xe_drm_ras.c b/drivers/gpu/drm/xe/xe_drm_ras.c
index cd236f53699e..7937d8ba0ed9 100644
--- a/drivers/gpu/drm/xe/xe_drm_ras.c
+++ b/drivers/gpu/drm/xe/xe_drm_ras.c
@@ -11,27 +11,46 @@
#include "xe_device_types.h"
#include "xe_drm_ras.h"
+#include "xe_ras.h"
static const char * const error_components[] = DRM_XE_RAS_ERROR_COMPONENT_NAMES;
static const char * const error_severity[] = DRM_XE_RAS_ERROR_SEVERITY_NAMES;
-static int hw_query_error_counter(struct xe_drm_ras_counter *info,
- u32 error_id, const char **name, u32 *val)
+static int query_error_counter(struct xe_device *xe,
+ enum drm_xe_ras_error_severity severity,
+ u32 error_id, const char **name, u32 *val)
{
+ struct xe_drm_ras *ras = &xe->ras;
+ struct xe_drm_ras_counter *info = ras->info[severity];
+
if (!info || !info[error_id].name)
return -ENOENT;
*name = info[error_id].name;
+
+ /* Fetch counter from system controller if supported */
+ if (xe->info.has_sysctrl)
+ return xe_ras_get_counter(xe, severity, error_id, val);
+
*val = atomic_read(&info[error_id].counter);
return 0;
}
-static int hw_clear_error_counter(struct xe_drm_ras_counter *info, u32 error_id)
+static int clear_error_counter(struct xe_device *xe,
+ enum drm_xe_ras_error_severity severity,
+ u32 error_id)
{
+ struct xe_drm_ras *ras = &xe->ras;
+ struct xe_drm_ras_counter *info = ras->info[severity];
+
if (!info || !info[error_id].name)
return -ENOENT;
+ /* Clear counter from system controller if supported */
+ if (xe->info.has_sysctrl)
+ return xe_ras_clear_counter(xe, severity, error_id);
+
atomic_set(&info[error_id].counter, 0);
return 0;
@@ -41,38 +60,30 @@ static int query_uncorrectable_error_counter(struct drm_ras_node *ep, u32 error_
const char **name, u32 *val)
{
struct xe_device *xe = ep->priv;
- struct xe_drm_ras *ras = &xe->ras;
- struct xe_drm_ras_counter *info = ras->info[DRM_XE_RAS_ERR_SEV_UNCORRECTABLE];
- return hw_query_error_counter(info, error_id, name, val);
+ return query_error_counter(xe, DRM_XE_RAS_ERR_SEV_UNCORRECTABLE, error_id, name, val);
}
static int clear_uncorrectable_error_counter(struct drm_ras_node *node, u32 error_id)
{
struct xe_device *xe = node->priv;
- struct xe_drm_ras *ras = &xe->ras;
- struct xe_drm_ras_counter *info = ras->info[DRM_XE_RAS_ERR_SEV_UNCORRECTABLE];
- return hw_clear_error_counter(info, error_id);
+ return clear_error_counter(xe, DRM_XE_RAS_ERR_SEV_UNCORRECTABLE, error_id);
}
static int query_correctable_error_counter(struct drm_ras_node *ep, u32 error_id,
const char **name, u32 *val)
{
struct xe_device *xe = ep->priv;
- struct xe_drm_ras *ras = &xe->ras;
- struct xe_drm_ras_counter *info = ras->info[DRM_XE_RAS_ERR_SEV_CORRECTABLE];
- return hw_query_error_counter(info, error_id, name, val);
+ return query_error_counter(xe, DRM_XE_RAS_ERR_SEV_CORRECTABLE, error_id, name, val);
}
static int clear_correctable_error_counter(struct drm_ras_node *node, u32 error_id)
{
struct xe_device *xe = node->priv;
- struct xe_drm_ras *ras = &xe->ras;
- struct xe_drm_ras_counter *info = ras->info[DRM_XE_RAS_ERR_SEV_CORRECTABLE];
- return hw_clear_error_counter(info, error_id);
+ return clear_error_counter(xe, DRM_XE_RAS_ERR_SEV_CORRECTABLE, error_id);
}
static struct xe_drm_ras_counter *allocate_and_copy_counters(struct xe_device *xe)
diff --git a/drivers/gpu/drm/xe/xe_hw_error.c b/drivers/gpu/drm/xe/xe_hw_error.c
index 4b72959b2276..3c1dc9f83d1a 100644
--- a/drivers/gpu/drm/xe/xe_hw_error.c
+++ b/drivers/gpu/drm/xe/xe_hw_error.c
@@ -516,14 +516,6 @@ void xe_hw_error_irq_handler(struct xe_tile *tile, const u32 master_ctl)
}
}
-static int hw_error_info_init(struct xe_device *xe)
-{
- if (xe->info.platform != XE_PVC)
- return 0;
-
- return xe_drm_ras_init(xe);
-}
-
/*
* Process hardware errors during boot
*/
@@ -550,16 +542,11 @@ static void process_hw_errors(struct xe_device *xe)
void xe_hw_error_init(struct xe_device *xe)
{
struct xe_tile *tile = xe_device_get_root_tile(xe);
- int ret;
if (!IS_DGFX(xe) || IS_SRIOV_VF(xe))
return;
INIT_WORK(&tile->csc_hw_error_work, csc_hw_error_work);
- ret = hw_error_info_init(xe);
- if (ret)
- drm_err(&xe->drm, "Failed to initialize XE DRM RAS (%pe)\n", ERR_PTR(ret));
-
process_hw_errors(xe);
}
diff --git a/drivers/gpu/drm/xe/xe_pci.c b/drivers/gpu/drm/xe/xe_pci.c
index 205ba01e713c..33bd9b9a6451 100644
--- a/drivers/gpu/drm/xe/xe_pci.c
+++ b/drivers/gpu/drm/xe/xe_pci.c
@@ -355,6 +355,7 @@ static const __maybe_unused struct xe_device_desc pvc_desc = {
PLATFORM(PVC),
.dma_mask_size = 52,
.has_display = false,
+ .has_drm_ras = true,
.has_gsc_nvm = 1,
.has_heci_gscfi = 1,
.max_gt_per_tile = 1,
@@ -457,6 +458,7 @@ static const struct xe_device_desc cri_desc = {
PLATFORM(CRESCENTISLAND),
.dma_mask_size = 52,
.has_display = false,
+ .has_drm_ras = true,
.has_flat_ccs = false,
.has_gsc_nvm = 1,
.has_i2c = true,
@@ -760,6 +762,7 @@ static int xe_info_init_early(struct xe_device *xe,
xe->info.is_dgfx = desc->is_dgfx;
xe->info.has_cached_pt = desc->has_cached_pt;
+ xe->info.has_drm_ras = desc->has_drm_ras;
xe->info.has_fan_control = desc->has_fan_control;
/* runtime fusing may force flat_ccs to disabled later */
xe->info.has_flat_ccs = desc->has_flat_ccs;
diff --git a/drivers/gpu/drm/xe/xe_pci_types.h b/drivers/gpu/drm/xe/xe_pci_types.h
index 5b85e2c24b7b..24d4a3d00517 100644
--- a/drivers/gpu/drm/xe/xe_pci_types.h
+++ b/drivers/gpu/drm/xe/xe_pci_types.h
@@ -40,6 +40,7 @@ struct xe_device_desc {
u8 has_cached_pt:1;
u8 has_display:1;
+ u8 has_drm_ras:1;
u8 has_fan_control:1;
u8 has_flat_ccs:1;
u8 has_gsc_nvm:1;
diff --git a/drivers/gpu/drm/xe/xe_ras.c b/drivers/gpu/drm/xe/xe_ras.c
index 4cb16b419b0c..7cb6fcb1254a 100644
--- a/drivers/gpu/drm/xe/xe_ras.c
+++ b/drivers/gpu/drm/xe/xe_ras.c
@@ -4,11 +4,15 @@
*/
#include "xe_device.h"
+#include "xe_drm_ras.h"
+#include "xe_pm.h"
#include "xe_printk.h"
#include "xe_ras.h"
#include "xe_ras_types.h"
#include "xe_sysctrl.h"
#include "xe_sysctrl_event_types.h"
+#include "xe_sysctrl_mailbox.h"
+#include "xe_sysctrl_mailbox_types.h"
/* Severity of detected errors */
enum xe_ras_severity {
@@ -31,6 +35,17 @@ enum xe_ras_component {
XE_RAS_COMP_MAX
};
+/* RAS response status codes */
+enum xe_ras_response_status {
+ XE_RAS_STATUS_SUCCESS = 0,
+ XE_RAS_STATUS_INVALID_PARAM,
+ XE_RAS_STATUS_OP_NOT_SUPPORTED,
+ XE_RAS_STATUS_TIMEOUT,
+ XE_RAS_STATUS_HARDWARE_FAILURE,
+ XE_RAS_STATUS_INSUFFICIENT_RESOURCES,
+ XE_RAS_STATUS_MAX
+};
+
static const char *const xe_ras_severities[] = {
[XE_RAS_SEV_NOT_SUPPORTED] = "Not Supported",
[XE_RAS_SEV_CORRECTABLE] = "Correctable Error",
@@ -50,6 +65,56 @@ static const char *const xe_ras_components[] = {
};
static_assert(ARRAY_SIZE(xe_ras_components) == XE_RAS_COMP_MAX);
+static u8 drm_to_xe_ras_severity(u8 severity)
+{
+ switch (severity) {
+ case DRM_XE_RAS_ERR_SEV_CORRECTABLE:
+ return XE_RAS_SEV_CORRECTABLE;
+ case DRM_XE_RAS_ERR_SEV_UNCORRECTABLE:
+ return XE_RAS_SEV_UNCORRECTABLE;
+ default:
+ return XE_RAS_SEV_NOT_SUPPORTED;
+ }
+}
+
+static u8 drm_to_xe_ras_component(u8 component)
+{
+ switch (component) {
+ case DRM_XE_RAS_ERR_COMP_CORE_COMPUTE:
+ return XE_RAS_COMP_CORE_COMPUTE;
+ case DRM_XE_RAS_ERR_COMP_SOC_INTERNAL:
+ return XE_RAS_COMP_SOC_INTERNAL;
+ case DRM_XE_RAS_ERR_COMP_DEVICE_MEMORY:
+ return XE_RAS_COMP_DEVICE_MEMORY;
+ case DRM_XE_RAS_ERR_COMP_PCIE:
+ return XE_RAS_COMP_PCIE;
+ case DRM_XE_RAS_ERR_COMP_FABRIC:
+ return XE_RAS_COMP_FABRIC;
+ default:
+ return XE_RAS_COMP_NOT_SUPPORTED;
+ }
+}
+
+static int ras_status_to_errno(u32 status)
+{
+ switch (status) {
+ case XE_RAS_STATUS_SUCCESS:
+ return 0;
+ case XE_RAS_STATUS_INVALID_PARAM:
+ return -EINVAL;
+ case XE_RAS_STATUS_OP_NOT_SUPPORTED:
+ return -EOPNOTSUPP;
+ case XE_RAS_STATUS_TIMEOUT:
+ return -ETIMEDOUT;
+ case XE_RAS_STATUS_HARDWARE_FAILURE:
+ return -EIO;
+ case XE_RAS_STATUS_INSUFFICIENT_RESOURCES:
+ return -ENOSPC;
+ default:
+ return -EPROTO;
+ }
+}
+
static inline const char *sev_to_str(u8 severity)
{
if (severity >= XE_RAS_SEV_MAX)
@@ -91,3 +156,130 @@ void xe_ras_counter_threshold_crossed(struct xe_device *xe,
comp_to_str(component), sev_to_str(severity));
}
}
+
+static int get_counter(struct xe_device *xe, struct xe_ras_error_class *counter, u32 *value)
+{
+ struct xe_ras_get_counter_response response = {0};
+ struct xe_ras_get_counter_request request = {0};
+ struct xe_sysctrl_mailbox_command command = {0};
+ struct xe_ras_error_common *common;
+ size_t rlen;
+ int ret;
+
+ request.counter = *counter;
+
+ xe_sysctrl_create_command(&command, XE_SYSCTRL_GROUP_GFSP, XE_SYSCTRL_CMD_GET_COUNTER,
+ &request, sizeof(request), &response, sizeof(response));
+
+ ret = xe_sysctrl_send_command(&xe->sc, &command, &rlen);
+ if (ret) {
+ xe_err(xe, "sysctrl: failed to get counter %d\n", ret);
+ return ret;
+ }
+
+ if (rlen != sizeof(response)) {
+ xe_err(xe, "sysctrl: unexpected get counter response length %zu (expected %zu)\n",
+ rlen, sizeof(response));
+ return -EIO;
+ }
+
+ common = &response.counter.common;
+ *value = response.value;
+
+ xe_dbg(xe, "[RAS]: get counter value %u for %s %s\n", *value,
+ comp_to_str(common->component), sev_to_str(common->severity));
+
+ return 0;
+}
+
+/**
+ * xe_ras_get_counter() - Get error counter value
+ * @xe: Xe device instance
+ * @severity: Error severity to be queried (&enum drm_xe_ras_error_severity)
+ * @component: Error component to be queried (&enum drm_xe_ras_error_component)
+ * @value: Counter value
+ *
+ * This function retrieves the value of a specific error counter based on
+ * the error severity and component.
+ *
+ * Return: 0 on success, negative error code on failure.
+ */
+int xe_ras_get_counter(struct xe_device *xe, u8 severity, u8 component, u32 *value)
+{
+ struct xe_ras_error_class counter = {0};
+
+ counter.common.severity = drm_to_xe_ras_severity(severity);
+ counter.common.component = drm_to_xe_ras_component(component);
+
+ guard(xe_pm_runtime)(xe);
+ return get_counter(xe, &counter, value);
+}
+
+/**
+ * xe_ras_clear_counter() - Clear error counter value
+ * @xe: Xe device instance
+ * @severity: Error severity to be cleared (&enum drm_xe_ras_error_severity)
+ * @component: Error component to be cleared (&enum drm_xe_ras_error_component)
+ *
+ * This function clears the value of a specific error counter based on
+ * the error severity and component.
+ *
+ * Return: 0 on success, negative error code on failure.
+ */
+int xe_ras_clear_counter(struct xe_device *xe, u8 severity, u8 component)
+{
+ struct xe_ras_clear_counter_response response = {0};
+ struct xe_ras_clear_counter_request request = {0};
+ struct xe_sysctrl_mailbox_command command = {0};
+ struct xe_ras_error_class *counter;
+ size_t rlen;
+ int ret;
+
+ counter = &request.counter;
+ counter->common.severity = drm_to_xe_ras_severity(severity);
+ counter->common.component = drm_to_xe_ras_component(component);
+
+ xe_sysctrl_create_command(&command, XE_SYSCTRL_GROUP_GFSP, XE_SYSCTRL_CMD_CLEAR_COUNTER,
+ &request, sizeof(request), &response, sizeof(response));
+
+ guard(xe_pm_runtime)(xe);
+ ret = xe_sysctrl_send_command(&xe->sc, &command, &rlen);
+ if (ret) {
+ xe_err(xe, "sysctrl: failed to clear counter %d\n", ret);
+ return ret;
+ }
+
+ if (rlen != sizeof(response)) {
+ xe_err(xe, "sysctrl: unexpected clear counter response length %zu (expected %zu)\n",
+ rlen, sizeof(response));
+ return -EIO;
+ }
+
+ ret = ras_status_to_errno(response.status);
+ if (ret) {
+ xe_err(xe, "sysctrl: clear counter command failed with status %#x\n",
+ response.status);
+ return ret;
+ }
+
+ counter = &response.counter;
+
+ xe_dbg(xe, "[RAS]: clear counter value for %s %s\n", comp_to_str(counter->common.component),
+ sev_to_str(counter->common.severity));
+
+ return 0;
+}
+
+/**
+ * xe_ras_init - Initialize Xe RAS
+ * @xe: xe device instance
+ *
+ * Register drm_ras nodes
+ */
+void xe_ras_init(struct xe_device *xe)
+{
+ if (!xe->info.has_drm_ras)
+ return;
+
+ xe_drm_ras_init(xe);
+}
diff --git a/drivers/gpu/drm/xe/xe_ras.h b/drivers/gpu/drm/xe/xe_ras.h
index ea90593b62dc..ba0b0224df23 100644
--- a/drivers/gpu/drm/xe/xe_ras.h
+++ b/drivers/gpu/drm/xe/xe_ras.h
@@ -6,10 +6,15 @@
#ifndef _XE_RAS_H_
#define _XE_RAS_H_
+#include <linux/types.h>
+
struct xe_device;
struct xe_sysctrl_event_response;
void xe_ras_counter_threshold_crossed(struct xe_device *xe,
struct xe_sysctrl_event_response *response);
+int xe_ras_get_counter(struct xe_device *xe, u8 severity, u8 component, u32 *value);
+int xe_ras_clear_counter(struct xe_device *xe, u8 severity, u8 component);
+void xe_ras_init(struct xe_device *xe);
#endif
diff --git a/drivers/gpu/drm/xe/xe_ras_types.h b/drivers/gpu/drm/xe/xe_ras_types.h
index 4e63c67f806a..c6392435d1c6 100644
--- a/drivers/gpu/drm/xe/xe_ras_types.h
+++ b/drivers/gpu/drm/xe/xe_ras_types.h
@@ -70,4 +70,55 @@ struct xe_ras_threshold_crossed {
struct xe_ras_error_class counters[XE_RAS_NUM_COUNTERS];
} __packed;
+/**
+ * struct xe_ras_get_counter_request - Request structure for get counter
+ */
+struct xe_ras_get_counter_request {
+ /** @counter: Error counter to be queried */
+ struct xe_ras_error_class counter;
+ /** @reserved: Reserved for future use */
+ u32 reserved;
+} __packed;
+
+/**
+ * struct xe_ras_get_counter_response - Response structure for get counter
+ */
+struct xe_ras_get_counter_response {
+ /** @counter: Error counter that was queried */
+ struct xe_ras_error_class counter;
+ /** @value: Current counter value */
+ u32 value;
+ /** @timestamp: Timestamp when counter was last updated */
+ u64 timestamp;
+ /** @threshold: Threshold value for the counter */
+ u32 threshold;
+ /** @reserved: Reserved */
+ u32 reserved[57];
+} __packed;
+
+/**
+ * struct xe_ras_clear_counter_request - Request structure for clear counter
+ */
+struct xe_ras_clear_counter_request {
+ /** @counter: Counter class to be cleared */
+ struct xe_ras_error_class counter;
+ /** @reserved: Reserved for future use */
+ u32 reserved;
+} __packed;
+
+/**
+ * struct xe_ras_clear_counter_response - Response structure for clear counter
+ */
+struct xe_ras_clear_counter_response {
+ /** @counter: Counter class that was cleared */
+ struct xe_ras_error_class counter;
+ /** @reserved: Reserved */
+ u32 reserved;
+ /** @timestamp: Timestamp when the counter was cleared */
+ u64 timestamp;
+ /** @status: Status of the clear operation */
+ u32 status;
+ /** @reserved1: Reserved for future use */
+ u32 reserved1[3];
+} __packed;
#endif
diff --git a/drivers/gpu/drm/xe/xe_sysctrl_mailbox.c b/drivers/gpu/drm/xe/xe_sysctrl_mailbox.c
index 3caa9f15875f..e13eebaac1d0 100644
--- a/drivers/gpu/drm/xe/xe_sysctrl_mailbox.c
+++ b/drivers/gpu/drm/xe/xe_sysctrl_mailbox.c
@@ -293,6 +293,34 @@ static int sysctrl_send_command(struct xe_sysctrl *sc,
return 0;
}
+/**
+ * xe_sysctrl_create_command() - Create system controller command
+ * @command: Sysctrl command structure
+ * @group_id: Command group ID
+ * @cmd_id: Command ID
+ * @request: Pointer to request buffer (can be NULL)
+ * @request_len: Size of request buffer
+ * @response: Pointer to response buffer
+ * @response_len: Size of response buffer
+ *
+ * Helper function to create sysctrl command to be sent via %xe_sysctrl_send_command()
+ */
+void xe_sysctrl_create_command(struct xe_sysctrl_mailbox_command *command, u8 group_id, u8 cmd_id,
+ void *request, size_t request_len, void *response,
+ size_t response_len)
+{
+ struct xe_sysctrl_app_msg_hdr header = {0};
+
+ header.data = FIELD_PREP(APP_HDR_GROUP_ID_MASK, group_id) |
+ FIELD_PREP(APP_HDR_COMMAND_MASK, cmd_id);
+
+ command->header = header;
+ command->data_in = request;
+ command->data_in_len = request_len;
+ command->data_out = response;
+ command->data_out_len = response_len;
+}
+
/**
* xe_sysctrl_mailbox_init - Initialize System Controller mailbox interface
* @sc: System controller structure
diff --git a/drivers/gpu/drm/xe/xe_sysctrl_mailbox.h b/drivers/gpu/drm/xe/xe_sysctrl_mailbox.h
index f67e9234de48..fb434cc165b2 100644
--- a/drivers/gpu/drm/xe/xe_sysctrl_mailbox.h
+++ b/drivers/gpu/drm/xe/xe_sysctrl_mailbox.h
@@ -23,6 +23,9 @@ struct xe_sysctrl_mailbox_command;
#define XE_SYSCTRL_APP_HDR_VERSION(hdr) \
FIELD_GET(APP_HDR_VERSION_MASK, (hdr)->data)
+void xe_sysctrl_create_command(struct xe_sysctrl_mailbox_command *command, u8 group_id, u8 cmd_id,
+ void *request, size_t request_len, void *response,
+ size_t response_len);
void xe_sysctrl_mailbox_init(struct xe_sysctrl *sc);
int xe_sysctrl_send_command(struct xe_sysctrl *sc,
struct xe_sysctrl_mailbox_command *cmd,
diff --git a/drivers/gpu/drm/xe/xe_sysctrl_mailbox_types.h b/drivers/gpu/drm/xe/xe_sysctrl_mailbox_types.h
index 84d7c647e743..6e3753554510 100644
--- a/drivers/gpu/drm/xe/xe_sysctrl_mailbox_types.h
+++ b/drivers/gpu/drm/xe/xe_sysctrl_mailbox_types.h
@@ -22,9 +22,13 @@ enum xe_sysctrl_group {
/**
* enum xe_sysctrl_gfsp_cmd - Commands supported by GFSP group
*
+ * @XE_SYSCTRL_CMD_GET_COUNTER: Get error counter value
+ * @XE_SYSCTRL_CMD_CLEAR_COUNTER: Clear error counter value
* @XE_SYSCTRL_CMD_GET_PENDING_EVENT: Retrieve pending event
*/
enum xe_sysctrl_gfsp_cmd {
+ XE_SYSCTRL_CMD_GET_COUNTER = 0x03,
+ XE_SYSCTRL_CMD_CLEAR_COUNTER = 0x04,
XE_SYSCTRL_CMD_GET_PENDING_EVENT = 0x07,
};
diff --git a/include/uapi/drm/xe_drm.h b/include/uapi/drm/xe_drm.h
index 48e9f1fdb78d..50c80af4ad4e 100644
--- a/include/uapi/drm/xe_drm.h
+++ b/include/uapi/drm/xe_drm.h
@@ -2589,6 +2589,12 @@ enum drm_xe_ras_error_component {
DRM_XE_RAS_ERR_COMP_CORE_COMPUTE = 1,
/** @DRM_XE_RAS_ERR_COMP_SOC_INTERNAL: SoC Internal Error */
DRM_XE_RAS_ERR_COMP_SOC_INTERNAL,
+ /** @DRM_XE_RAS_ERR_COMP_DEVICE_MEMORY: Device Memory Error */
+ DRM_XE_RAS_ERR_COMP_DEVICE_MEMORY,
+ /** @DRM_XE_RAS_ERR_COMP_PCIE: PCIe Subsystem Error */
+ DRM_XE_RAS_ERR_COMP_PCIE,
+ /** @DRM_XE_RAS_ERR_COMP_FABRIC: Fabric Subsystem Error */
+ DRM_XE_RAS_ERR_COMP_FABRIC,
/** @DRM_XE_RAS_ERR_COMP_MAX: Max Error */
DRM_XE_RAS_ERR_COMP_MAX /* non-ABI */
};
@@ -2606,7 +2612,10 @@ enum drm_xe_ras_error_component {
*/
#define DRM_XE_RAS_ERROR_COMPONENT_NAMES { \
[DRM_XE_RAS_ERR_COMP_CORE_COMPUTE] = "core-compute", \
- [DRM_XE_RAS_ERR_COMP_SOC_INTERNAL] = "soc-internal" \
+ [DRM_XE_RAS_ERR_COMP_SOC_INTERNAL] = "soc-internal", \
+ [DRM_XE_RAS_ERR_COMP_DEVICE_MEMORY] = "device-memory", \
+ [DRM_XE_RAS_ERR_COMP_PCIE] = "pcie", \
+ [DRM_XE_RAS_ERR_COMP_FABRIC] = "fabric", \
}
#if defined(__cplusplus)
--
2.43.0
* [PATCH v3 3/4] drm/xe/ras: Add support for error threshold
2026-06-04 18:46 [PATCH v3 0/4] Introduce error threshold to drm_ras Raag Jadav
2026-06-04 18:46 ` [PATCH v3 1/4] drm/ras: Introduce error threshold Raag Jadav
2026-06-04 18:46 ` [PATCH v3 2/4] drm/xe/xe_ras: Add support for error counter Raag Jadav
@ 2026-06-04 18:46 ` Raag Jadav
2026-06-15 8:17 ` Tauro, Riana
2026-06-04 18:46 ` [PATCH v3 4/4] drm/xe/drm_ras: Wire up error threshold callbacks Raag Jadav
3 siblings, 1 reply; 7+ messages in thread
From: Raag Jadav @ 2026-06-04 18:46 UTC (permalink / raw)
To: intel-xe, dri-devel, netdev
Cc: simona.vetter, airlied, kuba, lijo.lazar, Hawking.Zhang, davem,
pabeni, edumazet, dev, zachary.mckevitt, rodrigo.vivi,
riana.tauro, michal.wajdeczko, matthew.d.roper, mallesh.koujalagi,
Raag Jadav
System controller allows getting/setting per counter threshold, which it
uses to raise error events to the driver. Get/set it using the respective
mailbox command.
Signed-off-by: Raag Jadav <raag.jadav@intel.com>
---
v2: Add RAS operation status codes (Riana)
v3: Reuse status codes and uapi mapping from counter series (Riana)
Access request/response counter using local pointer (Riana)
Mark unused field as reserved (Riana)
---
drivers/gpu/drm/xe/xe_ras.c | 105 ++++++++++++++++++
drivers/gpu/drm/xe/xe_ras.h | 2 +
drivers/gpu/drm/xe/xe_ras_types.h | 51 +++++++++
drivers/gpu/drm/xe/xe_sysctrl_mailbox_types.h | 4 +
4 files changed, 162 insertions(+)
diff --git a/drivers/gpu/drm/xe/xe_ras.c b/drivers/gpu/drm/xe/xe_ras.c
index 7cb6fcb1254a..d6f89b429cec 100644
--- a/drivers/gpu/drm/xe/xe_ras.c
+++ b/drivers/gpu/drm/xe/xe_ras.c
@@ -270,6 +270,111 @@ int xe_ras_clear_counter(struct xe_device *xe, u8 severity, u8 component)
return 0;
}
+/**
+ * xe_ras_get_threshold() - Get error counter threshold
+ * @xe: Xe device instance
+ * @severity: Error severity to be queried (&enum drm_xe_ras_error_severity)
+ * @component: Error component to be queried (&enum drm_xe_ras_error_component)
+ * @threshold: Counter threshold
+ *
+ * This function retrieves the error threshold of a specific counter based on
+ * severity and component.
+ *
+ * Return: 0 on success, negative error code on failure.
+ */
+int xe_ras_get_threshold(struct xe_device *xe, u8 severity, u8 component, u32 *threshold)
+{
+ struct xe_ras_get_threshold_response response = {};
+ struct xe_ras_get_threshold_request request = {};
+ struct xe_sysctrl_mailbox_command command = {};
+ struct xe_ras_error_class *counter;
+ size_t len;
+ int ret;
+
+ counter = &request.counter;
+ counter->common.severity = drm_to_xe_ras_severity(severity);
+ counter->common.component = drm_to_xe_ras_component(component);
+
+ xe_sysctrl_create_command(&command, XE_SYSCTRL_GROUP_GFSP, XE_SYSCTRL_CMD_GET_THRESHOLD,
+ &request, sizeof(request), &response, sizeof(response));
+
+ guard(xe_pm_runtime)(xe);
+ ret = xe_sysctrl_send_command(&xe->sc, &command, &len);
+ if (ret) {
+ xe_err(xe, "sysctrl: failed to get threshold %d\n", ret);
+ return ret;
+ }
+
+ if (len != sizeof(response)) {
+ xe_err(xe, "sysctrl: unexpected get threshold response length %zu (expected %zu)\n",
+ len, sizeof(response));
+ return -EIO;
+ }
+
+ counter = &response.counter;
+ *threshold = response.threshold;
+
+ xe_dbg(xe, "[RAS]: get counter threshold %u for %s %s\n", *threshold,
+ comp_to_str(counter->common.component), sev_to_str(counter->common.severity));
+ return 0;
+}
+
+/**
+ * xe_ras_set_threshold() - Set error counter threshold
+ * @xe: Xe device instance
+ * @severity: Error severity to be set (&enum drm_xe_ras_error_severity)
+ * @component: Error component to be set (&enum drm_xe_ras_error_component)
+ * @threshold: Counter threshold
+ *
+ * This function sets the error threshold of a specific counter based on
+ * severity and component.
+ *
+ * Return: 0 on success, negative error code on failure.
+ */
+int xe_ras_set_threshold(struct xe_device *xe, u8 severity, u8 component, u32 threshold)
+{
+ struct xe_ras_set_threshold_response response = {};
+ struct xe_ras_set_threshold_request request = {};
+ struct xe_sysctrl_mailbox_command command = {};
+ struct xe_ras_error_class *counter;
+ size_t len;
+ int ret;
+
+ counter = &request.counter;
+ counter->common.severity = drm_to_xe_ras_severity(severity);
+ counter->common.component = drm_to_xe_ras_component(component);
+ request.threshold = threshold;
+
+ xe_sysctrl_create_command(&command, XE_SYSCTRL_GROUP_GFSP, XE_SYSCTRL_CMD_SET_THRESHOLD,
+ &request, sizeof(request), &response, sizeof(response));
+
+ guard(xe_pm_runtime)(xe);
+ ret = xe_sysctrl_send_command(&xe->sc, &command, &len);
+ if (ret) {
+ xe_err(xe, "sysctrl: failed to set threshold %d\n", ret);
+ return ret;
+ }
+
+ if (len != sizeof(response)) {
+ xe_err(xe, "sysctrl: unexpected set threshold response length %zu (expected %zu)\n",
+ len, sizeof(response));
+ return -EIO;
+ }
+
+ ret = ras_status_to_errno(response.status);
+ if (ret) {
+ xe_err(xe, "sysctrl: set threshold command failed with status %#x\n",
+ response.status);
+ return ret;
+ }
+
+ counter = &response.counter;
+
+ xe_dbg(xe, "[RAS]: set counter threshold %u for %s %s\n", response.threshold,
+ comp_to_str(counter->common.component), sev_to_str(counter->common.severity));
+ return 0;
+}
+
/**
* xe_ras_init - Initialize Xe RAS
* @xe: xe device instance
diff --git a/drivers/gpu/drm/xe/xe_ras.h b/drivers/gpu/drm/xe/xe_ras.h
index ba0b0224df23..1aa43c54b710 100644
--- a/drivers/gpu/drm/xe/xe_ras.h
+++ b/drivers/gpu/drm/xe/xe_ras.h
@@ -15,6 +15,8 @@ void xe_ras_counter_threshold_crossed(struct xe_device *xe,
struct xe_sysctrl_event_response *response);
int xe_ras_get_counter(struct xe_device *xe, u8 severity, u8 component, u32 *value);
int xe_ras_clear_counter(struct xe_device *xe, u8 severity, u8 component);
+int xe_ras_get_threshold(struct xe_device *xe, u8 severity, u8 component, u32 *threshold);
+int xe_ras_set_threshold(struct xe_device *xe, u8 severity, u8 component, u32 threshold);
void xe_ras_init(struct xe_device *xe);
#endif
diff --git a/drivers/gpu/drm/xe/xe_ras_types.h b/drivers/gpu/drm/xe/xe_ras_types.h
index c6392435d1c6..8ea817583eed 100644
--- a/drivers/gpu/drm/xe/xe_ras_types.h
+++ b/drivers/gpu/drm/xe/xe_ras_types.h
@@ -121,4 +121,55 @@ struct xe_ras_clear_counter_response {
/** @reserved1: Reserved for future use */
u32 reserved1[3];
} __packed;
+
+/**
+ * struct xe_ras_get_threshold_request - Request structure for get threshold
+ */
+struct xe_ras_get_threshold_request {
+ /** @counter: Counter to get threshold for */
+ struct xe_ras_error_class counter;
+ /** @reserved: Reserved for future use */
+ u32 reserved;
+} __packed;
+
+/**
+ * struct xe_ras_get_threshold_response - Response structure for get threshold
+ */
+struct xe_ras_get_threshold_response {
+ /** @counter: Counter ID */
+ struct xe_ras_error_class counter;
+ /** @threshold: Threshold value */
+ u32 threshold;
+ /** @reserved: Reserved for future use */
+ u32 reserved[4];
+} __packed;
+
+/**
+ * struct xe_ras_set_threshold_request - Request structure for set threshold
+ */
+struct xe_ras_set_threshold_request {
+ /** @counter: Counter to set threshold for */
+ struct xe_ras_error_class counter;
+ /** @threshold: Threshold value to set */
+ u32 threshold;
+ /** @reserved: Reserved for future use */
+ u32 reserved;
+} __packed;
+
+/**
+ * struct xe_ras_set_threshold_response - Response structure for set threshold
+ */
+struct xe_ras_set_threshold_response {
+ /** @counter: Counter ID */
+ struct xe_ras_error_class counter;
+ /** @reserved: Reserved */
+ u32 reserved;
+ /** @threshold: Updated threshold value */
+ u32 threshold;
+ /** @status: Set threshold operation status */
+ u32 status;
+ /** @reserved1: Reserved for future use */
+ u32 reserved1[2];
+} __packed;
+
#endif
diff --git a/drivers/gpu/drm/xe/xe_sysctrl_mailbox_types.h b/drivers/gpu/drm/xe/xe_sysctrl_mailbox_types.h
index 6e3753554510..10f06aa5c4b5 100644
--- a/drivers/gpu/drm/xe/xe_sysctrl_mailbox_types.h
+++ b/drivers/gpu/drm/xe/xe_sysctrl_mailbox_types.h
@@ -24,11 +24,15 @@ enum xe_sysctrl_group {
*
* @XE_SYSCTRL_CMD_GET_COUNTER: Get error counter value
* @XE_SYSCTRL_CMD_CLEAR_COUNTER: Clear error counter value
+ * @XE_SYSCTRL_CMD_GET_THRESHOLD: Retrieve error threshold
+ * @XE_SYSCTRL_CMD_SET_THRESHOLD: Set error threshold
* @XE_SYSCTRL_CMD_GET_PENDING_EVENT: Retrieve pending event
*/
enum xe_sysctrl_gfsp_cmd {
XE_SYSCTRL_CMD_GET_COUNTER = 0x03,
XE_SYSCTRL_CMD_CLEAR_COUNTER = 0x04,
+ XE_SYSCTRL_CMD_GET_THRESHOLD = 0x05,
+ XE_SYSCTRL_CMD_SET_THRESHOLD = 0x06,
XE_SYSCTRL_CMD_GET_PENDING_EVENT = 0x07,
};
--
2.43.0
* Re: [PATCH v3 3/4] drm/xe/ras: Add support for error threshold
2026-06-04 18:46 ` [PATCH v3 3/4] drm/xe/ras: Add support for error threshold Raag Jadav
@ 2026-06-15 8:17 ` Tauro, Riana
0 siblings, 0 replies; 7+ messages in thread
From: Tauro, Riana @ 2026-06-15 8:17 UTC (permalink / raw)
To: Raag Jadav, intel-xe, dri-devel, netdev
Cc: simona.vetter, airlied, kuba, lijo.lazar, Hawking.Zhang, davem,
pabeni, edumazet, dev, zachary.mckevitt, rodrigo.vivi,
michal.wajdeczko, matthew.d.roper, mallesh.koujalagi
On 05-06-2026 00:16, Raag Jadav wrote:
> System controller allows getting/setting per counter threshold, which it
Please add "for correctable errors" after "per counter threshold".
> uses to raise error events to the driver. Get/set it using the respective
> mailbox command.
>
> Signed-off-by: Raag Jadav <raag.jadav@intel.com>
> ---
> v2: Add RAS operation status codes (Riana)
> v3: Reuse status codes and uapi mapping from counter series (Riana)
> Access request/response counter using local pointer (Riana)
> Mark unused field as reserved (Riana)
> ---
> drivers/gpu/drm/xe/xe_ras.c | 105 ++++++++++++++++++
> drivers/gpu/drm/xe/xe_ras.h | 2 +
> drivers/gpu/drm/xe/xe_ras_types.h | 51 +++++++++
> drivers/gpu/drm/xe/xe_sysctrl_mailbox_types.h | 4 +
> 4 files changed, 162 insertions(+)
>
> diff --git a/drivers/gpu/drm/xe/xe_ras.c b/drivers/gpu/drm/xe/xe_ras.c
> index 7cb6fcb1254a..d6f89b429cec 100644
> --- a/drivers/gpu/drm/xe/xe_ras.c
> +++ b/drivers/gpu/drm/xe/xe_ras.c
> @@ -270,6 +270,111 @@ int xe_ras_clear_counter(struct xe_device *xe, u8 severity, u8 component)
> return 0;
> }
>
> +/**
> + * xe_ras_get_threshold() - Get error counter threshold
> + * @xe: Xe device instance
> + * @severity: Error severity to be queried (&enum drm_xe_ras_error_severity)
> + * @component: Error component to be queried (&enum drm_xe_ras_error_component)
> + * @threshold: Counter threshold
> + *
> + * This function retrieves the error threshold of a specific counter based on
> + * severity and component.
> + *
> + * Return: 0 on success, negative error code on failure.
> + */
> +int xe_ras_get_threshold(struct xe_device *xe, u8 severity, u8 component, u32 *threshold)
> +{
> + struct xe_ras_get_threshold_response response = {};
> + struct xe_ras_get_threshold_request request = {};
> + struct xe_sysctrl_mailbox_command command = {};
> + struct xe_ras_error_class *counter;
> + size_t len;
> + int ret;
> +
> + counter = &request.counter;
> + counter->common.severity = drm_to_xe_ras_severity(severity);
> + counter->common.component = drm_to_xe_ras_component(component);
> +
> + xe_sysctrl_create_command(&command, XE_SYSCTRL_GROUP_GFSP, XE_SYSCTRL_CMD_GET_THRESHOLD,
> + &request, sizeof(request), &response, sizeof(response));
> +
> + guard(xe_pm_runtime)(xe);
> + ret = xe_sysctrl_send_command(&xe->sc, &command, &len);
> + if (ret) {
> + xe_err(xe, "sysctrl: failed to get threshold %d\n", ret);
> + return ret;
> + }
> +
> + if (len != sizeof(response)) {
> + xe_err(xe, "sysctrl: unexpected get threshold response length %zu (expected %zu)\n",
> + len, sizeof(response));
> + return -EIO;
> + }
> +
> + counter = &response.counter;
> + *threshold = response.threshold;
> +
> + xe_dbg(xe, "[RAS]: get counter threshold %u for %s %s\n", *threshold,
> + comp_to_str(counter->common.component), sev_to_str(counter->common.severity));
Use "get threshold" here, to be consistent with the
<operation> <value> <component> <severity> format of the other prints.
> + return 0;
> +}
> +
> +/**
> + * xe_ras_set_threshold() - Set error counter threshold
> + * @xe: Xe device instance
> + * @severity: Error severity to be set (&enum drm_xe_ras_error_severity)
> + * @component: Error component to be set (&enum drm_xe_ras_error_component)
> + * @threshold: Counter threshold
> + *
> + * This function sets the error threshold of a specific counter based on
> + * severity and component.
> + *
> + * Return: 0 on success, negative error code on failure.
> + */
> +int xe_ras_set_threshold(struct xe_device *xe, u8 severity, u8 component, u32 threshold)
> +{
> + struct xe_ras_set_threshold_response response = {};
> + struct xe_ras_set_threshold_request request = {};
> + struct xe_sysctrl_mailbox_command command = {};
> + struct xe_ras_error_class *counter;
> + size_t len;
> + int ret;
> +
> + counter = &request.counter;
> + counter->common.severity = drm_to_xe_ras_severity(severity);
> + counter->common.component = drm_to_xe_ras_component(component);
> + request.threshold = threshold;
> +
> + xe_sysctrl_create_command(&command, XE_SYSCTRL_GROUP_GFSP, XE_SYSCTRL_CMD_SET_THRESHOLD,
> + &request, sizeof(request), &response, sizeof(response));
> +
> + guard(xe_pm_runtime)(xe);
> + ret = xe_sysctrl_send_command(&xe->sc, &command, &len);
> + if (ret) {
> + xe_err(xe, "sysctrl: failed to set threshold %d\n", ret);
> + return ret;
> + }
> +
> + if (len != sizeof(response)) {
> + xe_err(xe, "sysctrl: unexpected set threshold response length %zu (expected %zu)\n",
> + len, sizeof(response));
> + return -EIO;
> + }
> +
> + ret = ras_status_to_errno(response.status);
> + if (ret) {
> + xe_err(xe, "sysctrl: set threshold command failed with status %#x\n",
> + response.status);
> + return ret;
> + }
> +
> + counter = &response.counter;
> +
> + xe_dbg(xe, "[RAS]: set counter threshold %u for %s %s\n", response.threshold,
Same here: "set threshold".
> + comp_to_str(counter->common.component), sev_to_str(counter->common.severity));
> + return 0;
> +}
> +
> /**
> * xe_ras_init - Initialize Xe RAS
> * @xe: xe device instance
> diff --git a/drivers/gpu/drm/xe/xe_ras.h b/drivers/gpu/drm/xe/xe_ras.h
> index ba0b0224df23..1aa43c54b710 100644
> --- a/drivers/gpu/drm/xe/xe_ras.h
> +++ b/drivers/gpu/drm/xe/xe_ras.h
> @@ -15,6 +15,8 @@ void xe_ras_counter_threshold_crossed(struct xe_device *xe,
> struct xe_sysctrl_event_response *response);
> int xe_ras_get_counter(struct xe_device *xe, u8 severity, u8 component, u32 *value);
> int xe_ras_clear_counter(struct xe_device *xe, u8 severity, u8 component);
> +int xe_ras_get_threshold(struct xe_device *xe, u8 severity, u8 component, u32 *threshold);
> +int xe_ras_set_threshold(struct xe_device *xe, u8 severity, u8 component, u32 threshold);
> void xe_ras_init(struct xe_device *xe);
>
> #endif
> diff --git a/drivers/gpu/drm/xe/xe_ras_types.h b/drivers/gpu/drm/xe/xe_ras_types.h
> index c6392435d1c6..8ea817583eed 100644
> --- a/drivers/gpu/drm/xe/xe_ras_types.h
> +++ b/drivers/gpu/drm/xe/xe_ras_types.h
> @@ -121,4 +121,55 @@ struct xe_ras_clear_counter_response {
> /** @reserved1: Reserved for future use */
> u32 reserved1[3];
> } __packed;
> +
> +/**
> + * struct xe_ras_get_threshold_request - Request structure for get threshold
> + */
> +struct xe_ras_get_threshold_request {
> + /** @counter: Counter to get threshold for */
> + struct xe_ras_error_class counter;
> + /** @reserved: Reserved for future use */
> + u32 reserved;
> +} __packed;
> +
> +/**
> + * struct xe_ras_get_threshold_response - Response structure for get threshold
> + */
> +struct xe_ras_get_threshold_response {
> + /** @counter: Counter ID */
> + struct xe_ras_error_class counter;
> + /** @threshold: Threshold value */
> + u32 threshold;
> + /** @reserved: Reserved for future use */
> + u32 reserved[4];
> +} __packed;
> +
> +/**
> + * struct xe_ras_set_threshold_request - Request structure for set threshold
> + */
> +struct xe_ras_set_threshold_request {
> + /** @counter: Counter to set threshold for */
> + struct xe_ras_error_class counter;
> + /** @threshold: Threshold value to set */
> + u32 threshold;
> + /** @reserved: Reserved for future use */
> + u32 reserved;
> +} __packed;
> +
> +/**
> + * struct xe_ras_set_threshold_response - Response structure for set threshold
> + */
> +struct xe_ras_set_threshold_response {
> + /** @counter: Counter ID */
> + struct xe_ras_error_class counter;
> + /** @reserved: Reserved */
> + u32 reserved;
> + /** @threshold: Updated threshold value */
> + u32 threshold;
> + /** @status: Set threshold operation status */
Nit: This is already part of the set threshold response, so it can just be "Operation status".
Thanks
Riana
> + u32 status;
> + /** @reserved1: Reserved for future use */
> + u32 reserved1[2];
> +} __packed;
> +
> #endif
> diff --git a/drivers/gpu/drm/xe/xe_sysctrl_mailbox_types.h b/drivers/gpu/drm/xe/xe_sysctrl_mailbox_types.h
> index 6e3753554510..10f06aa5c4b5 100644
> --- a/drivers/gpu/drm/xe/xe_sysctrl_mailbox_types.h
> +++ b/drivers/gpu/drm/xe/xe_sysctrl_mailbox_types.h
> @@ -24,11 +24,15 @@ enum xe_sysctrl_group {
> *
> * @XE_SYSCTRL_CMD_GET_COUNTER: Get error counter value
> * @XE_SYSCTRL_CMD_CLEAR_COUNTER: Clear error counter value
> + * @XE_SYSCTRL_CMD_GET_THRESHOLD: Retrieve error threshold
> + * @XE_SYSCTRL_CMD_SET_THRESHOLD: Set error threshold
> * @XE_SYSCTRL_CMD_GET_PENDING_EVENT: Retrieve pending event
> */
> enum xe_sysctrl_gfsp_cmd {
> XE_SYSCTRL_CMD_GET_COUNTER = 0x03,
> XE_SYSCTRL_CMD_CLEAR_COUNTER = 0x04,
> + XE_SYSCTRL_CMD_GET_THRESHOLD = 0x05,
> + XE_SYSCTRL_CMD_SET_THRESHOLD = 0x06,
> XE_SYSCTRL_CMD_GET_PENDING_EVENT = 0x07,
> };
>
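The response handling in xe_ras_set_threshold() checks the length first and only then looks at the status. A minimal userspace sketch of that ordering follows. It is not the driver code: the status values, model_status_to_errno() and the flattened response layout are hypothetical stand-ins, since the real status codes and struct xe_ras_error_class come from the counter series and are not shown in this patch.

```c
#include <errno.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical status codes, for illustration only. */
enum model_status {
	MODEL_STATUS_OK = 0,
	MODEL_STATUS_INVALID = 1,
	MODEL_STATUS_UNSUPPORTED = 2,
};

/* Stand-in for the set threshold response; counter is flattened to a u32. */
struct model_set_threshold_response {
	uint32_t counter;
	uint32_t reserved;
	uint32_t threshold;
	uint32_t status;
	uint32_t reserved1[2];
};

/* Hypothetical mapping; unknown status values are treated as I/O errors. */
static int model_status_to_errno(uint32_t status)
{
	switch (status) {
	case MODEL_STATUS_OK:
		return 0;
	case MODEL_STATUS_INVALID:
		return -EINVAL;
	case MODEL_STATUS_UNSUPPORTED:
		return -EOPNOTSUPP;
	default:
		return -EIO;
	}
}

/* A short or oversized reply is rejected before any field is trusted. */
static int model_check_response(const struct model_set_threshold_response *rsp,
				size_t len)
{
	if (len != sizeof(*rsp))
		return -EIO;

	return model_status_to_errno(rsp->status);
}
```

The point of the ordering is that the status field is only meaningful once the reply is known to have the expected size.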
* [PATCH v3 4/4] drm/xe/drm_ras: Wire up error threshold callbacks
2026-06-04 18:46 [PATCH v3 0/4] Introduce error threshold to drm_ras Raag Jadav
` (2 preceding siblings ...)
2026-06-04 18:46 ` [PATCH v3 3/4] drm/xe/ras: Add support for error threshold Raag Jadav
@ 2026-06-04 18:46 ` Raag Jadav
3 siblings, 0 replies; 7+ messages in thread
From: Raag Jadav @ 2026-06-04 18:46 UTC (permalink / raw)
To: intel-xe, dri-devel, netdev
Cc: simona.vetter, airlied, kuba, lijo.lazar, Hawking.Zhang, davem,
pabeni, edumazet, dev, zachary.mckevitt, rodrigo.vivi,
riana.tauro, michal.wajdeczko, matthew.d.roper, mallesh.koujalagi,
Raag Jadav
Now that the xe driver supports getting and setting the error threshold,
wire the callbacks up to drm_ras so that userspace can make use of them.
$ sudo ynl --family drm_ras --do get-error-threshold \
--json '{"node-id":0, "error-id":2}'
{'error-id': 2, 'error-name': 'soc-internal', 'error-threshold': 16}
$ sudo ynl --family drm_ras --do set-error-threshold \
--json '{"node-id":0, "error-id":2, "error-threshold":8}'
None
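For illustration, the wiring can be modelled outside the kernel. This is a minimal userspace sketch under stated assumptions, not the drm_ras implementation: struct ras_node_model, model_query(), model_set(), core_get_threshold() and core_set_threshold() are hypothetical names. It relies only on what this series states, namely that the threshold callbacks are optional per node and that a node without them yields -EOPNOTSUPP.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical stand-in for struct drm_ras_node: only the threshold hooks. */
struct ras_node_model {
	int (*query_error_threshold)(struct ras_node_model *node, uint32_t error_id,
				     const char **name, uint32_t *val);
	int (*set_error_threshold)(struct ras_node_model *node, uint32_t error_id,
				   uint32_t val);
	uint32_t threshold;	/* pretend hardware state, one counter only */
};

static int model_query(struct ras_node_model *node, uint32_t error_id,
		       const char **name, uint32_t *val)
{
	(void)error_id;
	*name = "soc-internal";
	*val = node->threshold;
	return 0;
}

static int model_set(struct ras_node_model *node, uint32_t error_id, uint32_t val)
{
	(void)error_id;
	node->threshold = val;
	return 0;
}

/* Core-side dispatch: a node without the hook does not support the command. */
static int core_get_threshold(struct ras_node_model *node, uint32_t error_id,
			      const char **name, uint32_t *val)
{
	assert(node && name && val);
	if (!node->query_error_threshold)
		return -EOPNOTSUPP;
	return node->query_error_threshold(node, error_id, name, val);
}

static int core_set_threshold(struct ras_node_model *node, uint32_t error_id,
			      uint32_t val)
{
	assert(node);
	if (!node->set_error_threshold)
		return -EOPNOTSUPP;
	return node->set_error_threshold(node, error_id, val);
}
```

A node initialised with both hooks and a threshold of 16 reports 16 on get and 8 after a set of 8, mirroring the ynl session above. A node with NULL hooks, like the uncorrectable node in this patch, returns -EOPNOTSUPP for both commands.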
Signed-off-by: Raag Jadav <raag.jadav@intel.com>
Reviewed-by: Riana Tauro <riana.tauro@intel.com>
---
v3: Return -ENOENT on info absence (Riana)
---
drivers/gpu/drm/xe/xe_drm_ras.c | 34 +++++++++++++++++++++++++++++++++
1 file changed, 34 insertions(+)
diff --git a/drivers/gpu/drm/xe/xe_drm_ras.c b/drivers/gpu/drm/xe/xe_drm_ras.c
index 7937d8ba0ed9..24e5082add37 100644
--- a/drivers/gpu/drm/xe/xe_drm_ras.c
+++ b/drivers/gpu/drm/xe/xe_drm_ras.c
@@ -86,6 +86,38 @@ static int clear_correctable_error_counter(struct drm_ras_node *node, u32 error_
return clear_error_counter(xe, DRM_XE_RAS_ERR_SEV_CORRECTABLE, error_id);
}
+static int query_correctable_error_threshold(struct drm_ras_node *ep, u32 error_id,
+ const char **name, u32 *val)
+{
+ struct xe_device *xe = ep->priv;
+ struct xe_drm_ras *ras = &xe->ras;
+ struct xe_drm_ras_counter *info = ras->info[DRM_XE_RAS_ERR_SEV_CORRECTABLE];
+
+ if (!info || !info[error_id].name)
+ return -ENOENT;
+
+ if (!xe->info.has_sysctrl)
+ return -EOPNOTSUPP;
+
+ *name = info[error_id].name;
+ return xe_ras_get_threshold(xe, DRM_XE_RAS_ERR_SEV_CORRECTABLE, error_id, val);
+}
+
+static int set_correctable_error_threshold(struct drm_ras_node *ep, u32 error_id, u32 val)
+{
+ struct xe_device *xe = ep->priv;
+ struct xe_drm_ras *ras = &xe->ras;
+ struct xe_drm_ras_counter *info = ras->info[DRM_XE_RAS_ERR_SEV_CORRECTABLE];
+
+ if (!info || !info[error_id].name)
+ return -ENOENT;
+
+ if (!xe->info.has_sysctrl)
+ return -EOPNOTSUPP;
+
+ return xe_ras_set_threshold(xe, DRM_XE_RAS_ERR_SEV_CORRECTABLE, error_id, val);
+}
+
static struct xe_drm_ras_counter *allocate_and_copy_counters(struct xe_device *xe)
{
struct xe_drm_ras_counter *counter;
@@ -134,6 +166,8 @@ static int assign_node_params(struct xe_device *xe, struct drm_ras_node *node,
if (severity == DRM_XE_RAS_ERR_SEV_CORRECTABLE) {
node->query_error_counter = query_correctable_error_counter;
node->clear_error_counter = clear_correctable_error_counter;
+ node->query_error_threshold = query_correctable_error_threshold;
+ node->set_error_threshold = set_correctable_error_threshold;
} else {
node->query_error_counter = query_uncorrectable_error_counter;
node->clear_error_counter = clear_uncorrectable_error_counter;
--
2.43.0