From: "Tauro, Riana" <riana.tauro@intel.com>
To: Raag Jadav <raag.jadav@intel.com>
Cc: <intel-xe@lists.freedesktop.org>, <anshuman.gupta@intel.com>,
<rodrigo.vivi@intel.com>, <aravind.iddamsetty@linux.intel.com>,
<badal.nilawar@intel.com>, <ravi.kishore.koppuravuri@intel.com>,
<mallesh.koujalagi@intel.com>, <soham.purkait@intel.com>
Subject: Re: [PATCH v5 4/6] drm/xe/xe_drm_ras: Wire get-error-counter and clear-error-counter support for CRI
Date: Tue, 12 May 2026 10:38:50 +0530
Message-ID: <433010fb-0133-4d0e-9214-4002442fee40@intel.com>
In-Reply-To: <agH3Gaw434-z508w@black.igk.intel.com>
On 5/11/2026 9:04 PM, Raag Jadav wrote:
> On Mon, May 04, 2026 at 12:26:19PM +0530, Riana Tauro wrote:
>> Hook CRI get-error-counter and clear-error-counter support to
>> xe_drm_ras to allow userspace to query and clear counters if supported.
>> When userspace requests for a drm_ras error counter, query the system
>> controller to get/clear the value.
>>
>> Integrate this with xe_drm_ras.
>>
>> Usage :
>>
>> Query all error counter value using ynl
>>
>> $ sudo ynl --family drm_ras --dump get-error-counter --json \
>> '{"node-id":0}'
>> [{'error-id': 1, 'error-name': 'core-compute', 'error-value': 0},
>> {'error-id': 2, 'error-name': 'soc-internal', 'error-value': 0},
>> {'error-id': 3, 'error-name': 'device-memory', 'error-value': 0},
>> {'error-id': 4, 'error-name': 'pcie', 'error-value': 0},
>> {'error-id': 5, 'error-name': 'fabric', 'error-value': 0}]
>>
>> Query single error counter value using ynl
>>
>> $ sudo ynl --family drm_ras --do get-error-counter --json \
>> '{"node-id":1, "error-id":1}'
>> {'error-id': 1, 'error-name': 'core-compute', 'error-value': 2}
>>
>> Clear counter using ynl
>>
>> $ sudo ynl --family drm_ras --do clear-error-counter --json '\
>> {"node-id":1, "error-id":1}'
>> None
>>
>> Signed-off-by: Riana Tauro <riana.tauro@intel.com>
>> ---
>> v2: split patches (Raag)
>>
>> v3: fix early return
>> align spacing in commit message (Raag)
>> integrate clear counter with drm_ras
>>
>> v4: rebase
>> ---
>> drivers/gpu/drm/xe/xe_drm_ras.c | 39 +++++++++++++++++++++------------
>> 1 file changed, 25 insertions(+), 14 deletions(-)
>>
>> diff --git a/drivers/gpu/drm/xe/xe_drm_ras.c b/drivers/gpu/drm/xe/xe_drm_ras.c
>> index c21c8b428de6..e7bd3f09a762 100644
>> --- a/drivers/gpu/drm/xe/xe_drm_ras.c
>> +++ b/drivers/gpu/drm/xe/xe_drm_ras.c
>> @@ -11,27 +11,46 @@
>>
>> #include "xe_device_types.h"
>> #include "xe_drm_ras.h"
>> +#include "xe_ras.h"
>>
>> static const char * const error_components[] = DRM_XE_RAS_ERROR_COMPONENT_NAMES;
>> static const char * const error_severity[] = DRM_XE_RAS_ERROR_SEVERITY_NAMES;
>>
>> -static int hw_query_error_counter(struct xe_drm_ras_counter *info,
>> +static int hw_query_error_counter(struct xe_device *xe,
>> + enum drm_xe_ras_error_severity severity,
>> u32 error_id, const char **name, u32 *val)
>> {
>> + struct xe_drm_ras *ras = &xe->ras;
>> + struct xe_drm_ras_counter *info = ras->info[severity];
>> +
>> if (!info || !info[error_id].name)
>> return -ENOENT;
>>
>> *name = info[error_id].name;
>> +
>> + /* Fetch counter from system controller if supported */
>> + if (xe->info.has_sysctrl)
> Curious. Should we check this inside xe_ras_get_counter()? We'll probably
> end up doing it at some point anyway.
We also have PVC counters. If the check is moved, the hw_error counters
would also have to be moved under the ras function. Let's keep hw_error
and ras separated. If we need to move it sometime in the future, we can
add a generic check. For now, we can retain the current approach.
Thanks
Riana
>
>> + return xe_ras_get_counter(xe, severity, error_id, val);
>> +
>> *val = atomic_read(&info[error_id].counter);
>>
>> return 0;
>> }
>>
>> -static int hw_clear_error_counter(struct xe_drm_ras_counter *info, u32 error_id)
>> +static int hw_clear_error_counter(struct xe_device *xe,
>> + enum drm_xe_ras_error_severity severity,
>> + u32 error_id)
>> {
>> + struct xe_drm_ras *ras = &xe->ras;
>> + struct xe_drm_ras_counter *info = ras->info[severity];
>> +
>> if (!info || !info[error_id].name)
>> return -ENOENT;
>>
>> + /* Clear counter from system controller if supported */
>> + if (xe->info.has_sysctrl)
> Ditto.
>
> Raag
>
>> + return xe_ras_clear_counter(xe, severity, error_id);
>> +
>> atomic_set(&info[error_id].counter, 0);
>>
>> return 0;
>> @@ -41,38 +60,30 @@ static int query_uncorrectable_error_counter(struct drm_ras_node *ep, u32 error_
>> const char **name, u32 *val)
>> {
>> struct xe_device *xe = ep->priv;
>> - struct xe_drm_ras *ras = &xe->ras;
>> - struct xe_drm_ras_counter *info = ras->info[DRM_XE_RAS_ERR_SEV_UNCORRECTABLE];
>>
>> - return hw_query_error_counter(info, error_id, name, val);
>> + return hw_query_error_counter(xe, DRM_XE_RAS_ERR_SEV_UNCORRECTABLE, error_id, name, val);
>> }
>>
>> static int clear_uncorrectable_error_counter(struct drm_ras_node *node, u32 error_id)
>> {
>> struct xe_device *xe = node->priv;
>> - struct xe_drm_ras *ras = &xe->ras;
>> - struct xe_drm_ras_counter *info = ras->info[DRM_XE_RAS_ERR_SEV_UNCORRECTABLE];
>>
>> - return hw_clear_error_counter(info, error_id);
>> + return hw_clear_error_counter(xe, DRM_XE_RAS_ERR_SEV_UNCORRECTABLE, error_id);
>> }
>>
>> static int query_correctable_error_counter(struct drm_ras_node *ep, u32 error_id,
>> const char **name, u32 *val)
>> {
>> struct xe_device *xe = ep->priv;
>> - struct xe_drm_ras *ras = &xe->ras;
>> - struct xe_drm_ras_counter *info = ras->info[DRM_XE_RAS_ERR_SEV_CORRECTABLE];
>>
>> - return hw_query_error_counter(info, error_id, name, val);
>> + return hw_query_error_counter(xe, DRM_XE_RAS_ERR_SEV_CORRECTABLE, error_id, name, val);
>> }
>>
>> static int clear_correctable_error_counter(struct drm_ras_node *node, u32 error_id)
>> {
>> struct xe_device *xe = node->priv;
>> - struct xe_drm_ras *ras = &xe->ras;
>> - struct xe_drm_ras_counter *info = ras->info[DRM_XE_RAS_ERR_SEV_CORRECTABLE];
>>
>> - return hw_clear_error_counter(info, error_id);
>> + return hw_clear_error_counter(xe, DRM_XE_RAS_ERR_SEV_CORRECTABLE, error_id);
>> }
>>
>> static struct xe_drm_ras_counter *allocate_and_copy_counters(struct xe_device *xe)
>> --
>> 2.47.1
>>
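
For reference, the split discussed above (the caller checks has_sysctrl, and the ras helper stays unaware of it) can be sketched in plain userspace C. This is only an illustration under stated assumptions: every mock_* name, the MOCK_MAX_ERRORS bound, and the plain integer counters are hypothetical stand-ins, not the real xe structures or helpers, and the bounds check is an addition that the quoted patch does not have.

```c
#include <assert.h>
#include <errno.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical stand-ins for illustration; none of these are real xe types. */
#define MOCK_MAX_ERRORS 4

struct mock_counter {
	const char *name;	/* NULL means this error id is not populated */
	uint32_t counter;	/* driver-side count (an atomic_t in the patch) */
};

struct mock_device {
	bool has_sysctrl;				/* plays the role of xe->info.has_sysctrl */
	struct mock_counter info[MOCK_MAX_ERRORS];	/* driver-maintained counters */
	uint32_t sysctrl[MOCK_MAX_ERRORS];		/* pretend system controller counters */
};

/* Plays the role of xe_ras_get_counter(): no has_sysctrl check inside. */
static int mock_ras_get_counter(struct mock_device *dev, uint32_t error_id,
				uint32_t *val)
{
	*val = dev->sysctrl[error_id];
	return 0;
}

/* Plays the role of xe_ras_clear_counter(): no has_sysctrl check inside. */
static int mock_ras_clear_counter(struct mock_device *dev, uint32_t error_id)
{
	dev->sysctrl[error_id] = 0;
	return 0;
}

/* The caller decides which backend owns the counter. */
static int mock_query_error_counter(struct mock_device *dev, uint32_t error_id,
				    const char **name, uint32_t *val)
{
	assert(name && val);

	/* Bounds check added for this sketch only. */
	if (error_id >= MOCK_MAX_ERRORS || !dev->info[error_id].name)
		return -ENOENT;

	*name = dev->info[error_id].name;

	/* Fetch from the (mock) system controller if supported */
	if (dev->has_sysctrl)
		return mock_ras_get_counter(dev, error_id, val);

	/* Otherwise fall back to the driver-maintained counter */
	*val = dev->info[error_id].counter;
	return 0;
}

static int mock_clear_error_counter(struct mock_device *dev, uint32_t error_id)
{
	if (error_id >= MOCK_MAX_ERRORS || !dev->info[error_id].name)
		return -ENOENT;

	if (dev->has_sysctrl)
		return mock_ras_clear_counter(dev, error_id);

	dev->info[error_id].counter = 0;
	return 0;
}
```

With the check kept in the caller, a device without a system controller never reaches the ras helpers, so the driver-maintained counters and the ras path stay independent of each other.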
Thread overview: 32+ messages
2026-05-04 6:56 [PATCH v5 0/6] Add get-error-counter and clear-error-counter support for CRI Riana Tauro
2026-05-04 6:43 ` ✗ CI.checkpatch: warning for Add get-error-counter and clear-error-counter support for CRI (rev4) Patchwork
2026-05-04 6:45 ` ✓ CI.KUnit: success " Patchwork
2026-05-04 6:56 ` [PATCH v5 1/6] drm/xe/uapi: Add additional error components to xe drm_ras Riana Tauro
2026-05-08 6:37 ` Mallesh, Koujalagi
2026-05-12 6:58 ` Tauro, Riana
2026-05-04 6:56 ` [PATCH v5 2/6] drm/xe/xe_ras: Add support to get error counter in CRI Riana Tauro
2026-05-06 8:03 ` Mallesh, Koujalagi
2026-05-06 8:59 ` Tauro, Riana
2026-05-11 15:27 ` Raag Jadav
2026-05-12 5:27 ` Tauro, Riana
2026-05-12 5:47 ` Raag Jadav
2026-05-13 8:43 ` Tauro, Riana
2026-05-04 6:56 ` [PATCH v5 3/6] drm/xe/xe_ras: Add helper to clear error counter Riana Tauro
2026-05-08 7:50 ` Mallesh, Koujalagi
2026-05-11 6:20 ` Tauro, Riana
2026-05-11 7:42 ` Mallesh, Koujalagi
2026-05-11 7:49 ` Tauro, Riana
2026-05-11 15:32 ` Raag Jadav
2026-05-12 6:48 ` Tauro, Riana
2026-05-04 6:56 ` [PATCH v5 4/6] drm/xe/xe_drm_ras: Wire get-error-counter and clear-error-counter support for CRI Riana Tauro
2026-05-11 15:34 ` Raag Jadav
2026-05-12 5:08 ` Tauro, Riana [this message]
2026-05-04 6:56 ` [PATCH v5 5/6] drm/xe/xe_ras: Move xe drm_ras registration Riana Tauro
2026-05-04 10:53 ` Tauro, Riana
2026-05-04 16:22 ` Raag Jadav
2026-05-12 5:04 ` Tauro, Riana
2026-05-12 16:19 ` Anoop Vijay
2026-05-11 15:36 ` Raag Jadav
2026-05-04 6:56 ` [PATCH v5 6/6] drm/xe/xe_ras: Control xe drm_ras registration with a flag Riana Tauro
2026-05-11 15:46 ` Raag Jadav
2026-05-04 8:00 ` ✓ Xe.CI.BAT: success for Add get-error-counter and clear-error-counter support for CRI (rev4) Patchwork