From: "Tauro, Riana" <riana.tauro@intel.com>
To: "Mallesh, Koujalagi" <mallesh.koujalagi@intel.com>
Cc: <anshuman.gupta@intel.com>, <rodrigo.vivi@intel.com>,
<aravind.iddamsetty@linux.intel.com>, <badal.nilawar@intel.com>,
<raag.jadav@intel.com>, <ravi.kishore.koppuravuri@intel.com>,
<soham.purkait@intel.com>, <intel-xe@lists.freedesktop.org>
Subject: Re: [PATCH v5 2/6] drm/xe/xe_ras: Add support to get error counter in CRI
Date: Wed, 6 May 2026 14:29:02 +0530
Message-ID: <b47bd299-b538-49de-b8a3-bfc24f72ff74@intel.com>
In-Reply-To: <8c81ee71-14c1-49f2-ba7f-479ed9fabfc0@intel.com>
On 5/6/2026 1:33 PM, Mallesh, Koujalagi wrote:
>
> On 04-05-2026 12:26 pm, Riana Tauro wrote:
>> Add request/response structures and helper functions to query system
>> controller to get error counter value.
>>
>> Signed-off-by: Riana Tauro <riana.tauro@intel.com>
>> ---
>> v2: add structures for clear counter
>> move commands to sysctrl file
>> split functions
>> fix commit message (Raag)
>>
>> v3: fix log message
>> squash patches
>> change error code for sysctrl error (Raag)
>>
>> v4: rename function
>> remove unnecessary macro (Raag)
>> add documentation for enum
>>
>> v5: rebase
>> ---
>> drivers/gpu/drm/xe/xe_ras.c | 91 +++++++++++++++++++
>> drivers/gpu/drm/xe/xe_ras.h | 4 +
>> drivers/gpu/drm/xe/xe_ras_types.h | 30 ++++++
>> drivers/gpu/drm/xe/xe_sysctrl_mailbox_types.h | 2 +
>> 4 files changed, 127 insertions(+)
>>
>> diff --git a/drivers/gpu/drm/xe/xe_ras.c b/drivers/gpu/drm/xe/xe_ras.c
>> index 4cb16b419b0c..47a58ce3b3ca 100644
>> --- a/drivers/gpu/drm/xe/xe_ras.c
>> +++ b/drivers/gpu/drm/xe/xe_ras.c
>> @@ -4,11 +4,14 @@
>> */
>> #include "xe_device.h"
>> +#include "xe_pm.h"
>> #include "xe_printk.h"
>> #include "xe_ras.h"
>> #include "xe_ras_types.h"
>> #include "xe_sysctrl.h"
>> #include "xe_sysctrl_event_types.h"
>> +#include "xe_sysctrl_mailbox.h"
>> +#include "xe_sysctrl_mailbox_types.h"
>> /* Severity of detected errors */
>> enum xe_ras_severity {
>> @@ -50,6 +53,23 @@ static const char *const xe_ras_components[] = {
>> };
>> static_assert(ARRAY_SIZE(xe_ras_components) == XE_RAS_COMP_MAX);
>> +/* Mapping from drm_xe_ras_error_component to xe_ras_component */
>> +static const int drm_to_xe_ras_component[] = {
>> + [DRM_XE_RAS_ERR_COMP_CORE_COMPUTE] = XE_RAS_COMP_CORE_COMPUTE,
>> + [DRM_XE_RAS_ERR_COMP_SOC_INTERNAL] = XE_RAS_COMP_SOC_INTERNAL,
>> + [DRM_XE_RAS_ERR_COMP_DEVICE_MEMORY] = XE_RAS_COMP_DEVICE_MEMORY,
>> + [DRM_XE_RAS_ERR_COMP_PCIE] = XE_RAS_COMP_PCIE,
>> + [DRM_XE_RAS_ERR_COMP_FABRIC] = XE_RAS_COMP_FABRIC
>> +};
>> +static_assert(ARRAY_SIZE(drm_to_xe_ras_component) ==
>> DRM_XE_RAS_ERR_COMP_MAX);
>
> The component ordering is inconsistent, for example:
>
> In UAPI:
>
> CORE_COMPUTE(1)
>
> Internal order:
>
> DEVICE_MEMORY(1)
> CORE_COMPUTE(2)
>
The ordering here is intentional. Different devices expose the errors
through different mechanisms: PVC uses registers, while CRI uses the
system controller.
The uAPI ordering was introduced when the initial netlink patch was
posted for PVC. Changing the uAPI would break PVC, and changing the
system controller firmware ordering is not possible.
That is the reason for having mappings for both.
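As a rough illustration of why a translation table is needed when two
orderings cannot be changed, here is a minimal standalone sketch. The
demo_* enums, their values, and demo_translate() are hypothetical and do
not match the real uAPI or firmware definitions; only the technique
(a designated-initializer lookup table guarded by a static_assert) mirrors
the patch. The range check is an addition for the sketch.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical uAPI-side ordering (values are illustrative only). */
enum demo_uapi_comp {
	DEMO_UAPI_COMP_CORE_COMPUTE = 1,
	DEMO_UAPI_COMP_SOC_INTERNAL,
	DEMO_UAPI_COMP_DEVICE_MEMORY,
	DEMO_UAPI_COMP_MAX
};

/* Hypothetical firmware-side ordering (values are illustrative only). */
enum demo_fw_comp {
	DEMO_FW_COMP_DEVICE_MEMORY = 1,
	DEMO_FW_COMP_CORE_COMPUTE,
	DEMO_FW_COMP_SOC_INTERNAL,
	DEMO_FW_COMP_MAX
};

#define DEMO_ARRAY_SIZE(a) (sizeof(a) / sizeof((a)[0]))

/* Lookup table indexed by the uAPI id; slot 0 stays 0 and means "unmapped". */
static const int demo_uapi_to_fw[] = {
	[DEMO_UAPI_COMP_CORE_COMPUTE]  = DEMO_FW_COMP_CORE_COMPUTE,
	[DEMO_UAPI_COMP_SOC_INTERNAL]  = DEMO_FW_COMP_SOC_INTERNAL,
	[DEMO_UAPI_COMP_DEVICE_MEMORY] = DEMO_FW_COMP_DEVICE_MEMORY,
};
/* Fails the build if a uAPI component is added past the end of the table. */
static_assert(DEMO_ARRAY_SIZE(demo_uapi_to_fw) == DEMO_UAPI_COMP_MAX,
	      "mapping table must cover every uAPI component");

/* Returns the firmware-side id, or -1 if the uAPI id is out of range or unmapped. */
static int demo_translate(unsigned int uapi_comp)
{
	if (uapi_comp >= DEMO_ARRAY_SIZE(demo_uapi_to_fw) ||
	    !demo_uapi_to_fw[uapi_comp])
		return -1;

	return demo_uapi_to_fw[uapi_comp];
}
```

With these made-up values, demo_translate(DEMO_UAPI_COMP_CORE_COMPUTE)
yields DEMO_FW_COMP_CORE_COMPUTE even though the two enums number that
component differently, so neither side has to be reordered.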
Thanks
Riana
> Can you please make the ordering consistent, so that it is easier to
> maintain and less confusing?
>
> Thanks,
>
> -/Mallesh
>
>> +
>> +/* Mapping from drm_xe_ras_error_severity to xe_ras_severity */
>> +static const int drm_to_xe_ras_severity[] = {
>> + [DRM_XE_RAS_ERR_SEV_CORRECTABLE] = XE_RAS_SEV_CORRECTABLE,
>> + [DRM_XE_RAS_ERR_SEV_UNCORRECTABLE] = XE_RAS_SEV_UNCORRECTABLE
>> +};
>> +static_assert(ARRAY_SIZE(drm_to_xe_ras_severity) ==
>> DRM_XE_RAS_ERR_SEV_MAX);
>> +
>> static inline const char *sev_to_str(u8 severity)
>> {
>> if (severity >= XE_RAS_SEV_MAX)
>> @@ -66,6 +86,22 @@ static inline const char *comp_to_str(u8 component)
>> return xe_ras_components[component];
>> }
>> +static void prepare_ras_command(struct xe_sysctrl_mailbox_command
>> *command,
>> + u32 cmd, void *request, size_t request_len,
>> + void *response, size_t response_len)
>> +{
>> + struct xe_sysctrl_app_msg_hdr header = {0};
>> +
>> + header.data = FIELD_PREP(APP_HDR_GROUP_ID_MASK,
>> XE_SYSCTRL_GROUP_GFSP) |
>> + FIELD_PREP(APP_HDR_COMMAND_MASK, cmd);
>> +
>> + command->header = header;
>> + command->data_in = request;
>> + command->data_in_len = request_len;
>> + command->data_out = response;
>> + command->data_out_len = response_len;
>> +}
>> +
>> void xe_ras_counter_threshold_crossed(struct xe_device *xe,
>> struct xe_sysctrl_event_response *response)
>> {
>> @@ -91,3 +127,58 @@ void xe_ras_counter_threshold_crossed(struct
>> xe_device *xe,
>> comp_to_str(component), sev_to_str(severity));
>> }
>> }
>> +
>> +static int get_counter(struct xe_device *xe, struct
>> xe_ras_error_class *error_class,
>> + u32 *value)
>> +{
>> + struct xe_ras_get_counter_response response = {0};
>> + struct xe_ras_get_counter_request request = {0};
>> + struct xe_sysctrl_mailbox_command command = {0};
>> + size_t rlen;
>> + int ret;
>> +
>> + request.error_class = *error_class;
>> +
>> + prepare_ras_command(&command, XE_SYSCTRL_CMD_GET_COUNTER,
>> &request, sizeof(request),
>> + &response, sizeof(response));
>> +
>> + ret = xe_sysctrl_send_command(&xe->sc, &command, &rlen);
>> + if (ret) {
>> + xe_err(xe, "sysctrl: failed to get counter %d\n", ret);
>> + return ret;
>> + }
>> +
>> + if (rlen != sizeof(response)) {
>> + xe_err(xe, "sysctrl: unexpected get counter response length
>> %zu (expected %zu)\n",
>> + rlen, sizeof(response));
>> + return -EIO;
>> + }
>> +
>> + *value = response.counter_value;
>> +
>> + return 0;
>> +}
>> +
>> +/**
>> + * xe_ras_get_counter() - Get error counter value
>> + * @xe: xe device instance
>> + * @severity: Error severity level to be queried
>> + * @error_id: Error component to be queried
>> + * @value: Counter value
>> + *
>> + * This function retrieves the value of a specific error counter
>> based on
>> + * the error severity and component.
>> + *
>> + * Return: 0 on success, negative error code on failure.
>> + */
>> +int xe_ras_get_counter(struct xe_device *xe, enum
>> drm_xe_ras_error_severity severity,
>> + u32 error_id, u32 *value)
>> +{
>> + struct xe_ras_error_class error_class = {0};
>> +
>> + error_class.common.severity = drm_to_xe_ras_severity[severity];
>> + error_class.common.component = drm_to_xe_ras_component[error_id];
>> +
>> + guard(xe_pm_runtime)(xe);
>> + return get_counter(xe, &error_class, value);
>> +}
>> diff --git a/drivers/gpu/drm/xe/xe_ras.h b/drivers/gpu/drm/xe/xe_ras.h
>> index ea90593b62dc..74582c911b02 100644
>> --- a/drivers/gpu/drm/xe/xe_ras.h
>> +++ b/drivers/gpu/drm/xe/xe_ras.h
>> @@ -6,10 +6,14 @@
>> #ifndef _XE_RAS_H_
>> #define _XE_RAS_H_
>> +#include <uapi/drm/xe_drm.h>
>> +
>> struct xe_device;
>> struct xe_sysctrl_event_response;
>> void xe_ras_counter_threshold_crossed(struct xe_device *xe,
>> struct xe_sysctrl_event_response *response);
>> +int xe_ras_get_counter(struct xe_device *xe, enum
>> drm_xe_ras_error_severity severity,
>> + u32 error_id, u32 *value);
>> #endif
>> diff --git a/drivers/gpu/drm/xe/xe_ras_types.h
>> b/drivers/gpu/drm/xe/xe_ras_types.h
>> index 4e63c67f806a..74d85875cd63 100644
>> --- a/drivers/gpu/drm/xe/xe_ras_types.h
>> +++ b/drivers/gpu/drm/xe/xe_ras_types.h
>> @@ -70,4 +70,34 @@ struct xe_ras_threshold_crossed {
>> struct xe_ras_error_class counters[XE_RAS_NUM_COUNTERS];
>> } __packed;
>> +/**
>> + * struct xe_ras_get_counter_request - Request for get error counter
>> + */
>> +struct xe_ras_get_counter_request {
>> + /** @error_class: Error class counter to be queried */
>> + struct xe_ras_error_class error_class;
>> + /** @reserved: Reserved for future use */
>> + u32 reserved;
>> +} __packed;
>> +
>> +/**
>> + * struct xe_ras_get_counter_response - Response for get error counter
>> + */
>> +struct xe_ras_get_counter_response {
>> + /** @error_class: Error class counter that was queried */
>> + struct xe_ras_error_class error_class;
>> + /** @counter_value: Current counter value */
>> + u32 counter_value;
>> + /** @timestamp: Timestamp when counter was last updated */
>> + u64 timestamp;
>> + /** @threshold_value: Threshold value for the counter */
>> + u32 threshold_value;
>> + /** @counter_status: Status of the counter */
>> + u32 counter_status:8;
>> + /** @reserved: Reserved for future use */
>> + u32 reserved:24;
>> + /** @reserved1: Reserved for future use */
>> + u32 reserved1[56];
>> +} __packed;
>> +
>> #endif
>> diff --git a/drivers/gpu/drm/xe/xe_sysctrl_mailbox_types.h
>> b/drivers/gpu/drm/xe/xe_sysctrl_mailbox_types.h
>> index 84d7c647e743..b315847cbf64 100644
>> --- a/drivers/gpu/drm/xe/xe_sysctrl_mailbox_types.h
>> +++ b/drivers/gpu/drm/xe/xe_sysctrl_mailbox_types.h
>> @@ -22,9 +22,11 @@ enum xe_sysctrl_group {
>> /**
>> * enum xe_sysctrl_gfsp_cmd - Commands supported by GFSP group
>> *
>> + * @XE_SYSCTRL_CMD_GET_COUNTER: Get error counter value
>> * @XE_SYSCTRL_CMD_GET_PENDING_EVENT: Retrieve pending event
>> */
>> enum xe_sysctrl_gfsp_cmd {
>> + XE_SYSCTRL_CMD_GET_COUNTER = 0x03,
>> XE_SYSCTRL_CMD_GET_PENDING_EVENT = 0x07,
>> };