From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: X-Spam-Checker-Version: SpamAssassin 3.4.0 (2014-02-07) on aws-us-west-2-korg-lkml-1.web.codeaurora.org Received: from gabe.freedesktop.org (gabe.freedesktop.org [131.252.210.177]) (using TLSv1.2 with cipher ECDHE-RSA-AES256-GCM-SHA384 (256/256 bits)) (No client certificate requested) by smtp.lore.kernel.org (Postfix) with ESMTPS id F2960EF06ED for ; Tue, 10 Feb 2026 11:45:35 +0000 (UTC) Received: from gabe.freedesktop.org (localhost [127.0.0.1]) by gabe.freedesktop.org (Postfix) with ESMTP id B7D0110E559; Tue, 10 Feb 2026 11:45:35 +0000 (UTC) Authentication-Results: gabe.freedesktop.org; dkim=pass (2048-bit key; unprotected) header.d=intel.com header.i=@intel.com header.b="LNcHGnqZ"; dkim-atps=neutral Received: from mgamail.intel.com (mgamail.intel.com [198.175.65.17]) by gabe.freedesktop.org (Postfix) with ESMTPS id BF37610E559; Tue, 10 Feb 2026 11:45:34 +0000 (UTC) DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/simple; d=intel.com; i=@intel.com; q=dns/txt; s=Intel; t=1770723934; x=1802259934; h=date:from:to:cc:subject:message-id:references: mime-version:content-transfer-encoding:in-reply-to; bh=Q/Zxmc6fEqGwUJ7tZP+TOeuVaonZG+7hO4gpr3z1jAc=; b=LNcHGnqZUIyiW+nl458oewU/BKhC/A96/tNKndiwlLf1BVNvm+Zy5UOu tv0XrLHTI8SqITfWoDB4O3ix7PeyHHe4KlC4e+5+pBzKrDqV8EAXIWzlX kl1Qw+ufcFBGNvMhAMmVazf3xai5SpfdMCsHKJw5s9lVGbrG3d/lITFSH HNyc3jIQINJdAoWKzOYCSF1IgLJVKGBAoFkuKFgJANlGCo7RoTIgbWBJJ VyDFjZsivT02j+v0RwWOpNqGX1MzpZKdbkDBhMKBjjFvLqp7OAnmZoQp/ IyyAeg4BNQM7BrBY7/7YtT2QrBPKa8whfSJReug1zrdYhpKyH0w6ltUK5 Q==; X-CSE-ConnectionGUID: sG3vuiJdRrWHuQtrZ21E5g== X-CSE-MsgGUID: E9UbUYcqS+qK6vTZdf9pdQ== X-IronPort-AV: E=McAfee;i="6800,10657,11696"; a="71836157" X-IronPort-AV: E=Sophos;i="6.21,283,1763452800"; d="scan'208";a="71836157" Received: from fmviesa002.fm.intel.com ([10.60.135.142]) by orvoesa109.jf.intel.com with ESMTP/TLS/ECDHE-RSA-AES256-GCM-SHA384; 10 Feb 2026 03:45:34 -0800 X-CSE-ConnectionGUID: pnRa/TdgSjipBHrPhxBOYQ== X-CSE-MsgGUID: MTem0qG7TaicjjI5FjogHw== X-ExtLoop1: 1 X-IronPort-AV: E=Sophos;i="6.21,283,1763452800"; d="scan'208";a="234866500" Received: from black.igk.intel.com ([10.91.253.5]) by fmviesa002.fm.intel.com with ESMTP/TLS/ECDHE-RSA-AES256-GCM-SHA384; 10 Feb 2026 03:45:31 -0800 Date: Tue, 10 Feb 2026 12:45:28 +0100 From: Raag Jadav To: Riana Tauro Cc: intel-xe@lists.freedesktop.org, dri-devel@lists.freedesktop.org, aravind.iddamsetty@linux.intel.com, anshuman.gupta@intel.com, rodrigo.vivi@intel.com, joonas.lahtinen@linux.intel.com, simona.vetter@ffwll.ch, airlied@gmail.com, pratik.bari@intel.com, joshua.santosh.ranjan@intel.com, ashwin.kumar.kulkarni@intel.com, shubham.kumar@intel.com, ravi.kishore.koppuravuri@intel.com, Himal Prasad Ghimiray Subject: Re: [PATCH v5 4/5] drm/xe/xe_hw_error: Add support for Core-Compute errors Message-ID: References: <20260202064356.286243-7-riana.tauro@intel.com> <20260202064356.286243-11-riana.tauro@intel.com> <90eaf920-2bc3-4d33-9a64-f6373d704dc0@intel.com> MIME-Version: 1.0 Content-Type: text/plain; charset=iso-8859-1 Content-Disposition: inline Content-Transfer-Encoding: 8bit In-Reply-To: <90eaf920-2bc3-4d33-9a64-f6373d704dc0@intel.com> X-BeenThere: intel-xe@lists.freedesktop.org X-Mailman-Version: 2.1.29 Precedence: list List-Id: Intel Xe graphics driver List-Unsubscribe: , List-Archive: List-Post: List-Help: List-Subscribe: , Errors-To: intel-xe-bounces@lists.freedesktop.org Sender: "Intel-xe" On Tue, Feb 10, 2026 at 11:28:39AM +0530, Riana Tauro wrote: > On 2/5/2026 9:00 PM, Raag Jadav wrote: > > On Mon, Feb 02, 2026 at 12:14:00PM +0530, Riana Tauro wrote: > > > PVC supports GT error reporting via vector registers along with > > > error status register. Add support to report these errors and > > > update respective counters. Incase of Subslice error reported > > > by vector register, process the error status register > > > for applicable bits. > > > > > > The counter is embedded in the xe drm ras structure and is > > > exposed to the userspace using the drm_ras generic netlink > > > interface. > > > > > > $ sudo ynl --family drm_ras --do query-error-counter --json > > > > We usually add '\' at the end for wrapping commands so that they're easy > > to apply directly (and same for all other patches where applicable). > > > > > '{"node-id":0, "error-id":1}' > > > > Ditto. > > Will fix this Thank you. > > > {'error-id': 1, 'error-name': 'core-compute', 'error-value': 0} > > > > > > Co-developed-by: Himal Prasad Ghimiray > > > Signed-off-by: Himal Prasad Ghimiray > > > Signed-off-by: Riana Tauro > > > --- > > > v2: Add ID's and names as uAPI (Rodrigo) > > > Add documentation > > > Modify commit message > > > > > > v3: remove 'error' from counters > > > use drmm_kcalloc > > > add a for_each for severity > > > differentitate error classes and severity in UAPI(Raag) > > > Use correctable and uncorrectable in uapi (Pratik / Aravind) > > > > > > v4: modify enums in UAPI > > > improve comments > > > add bounds check in handler > > > add error mask macro (Raag) > > > use atomic_t > > > add null pointer checks > > > --- > > > drivers/gpu/drm/xe/regs/xe_hw_error_regs.h | 62 ++++++- > > > drivers/gpu/drm/xe/xe_hw_error.c | 199 +++++++++++++++++++-- > > > 2 files changed, 241 insertions(+), 20 deletions(-) > > > > > > diff --git a/drivers/gpu/drm/xe/regs/xe_hw_error_regs.h b/drivers/gpu/drm/xe/regs/xe_hw_error_regs.h > > > index c146b9ef44eb..17982a335941 100644 > > > --- a/drivers/gpu/drm/xe/regs/xe_hw_error_regs.h > > > +++ b/drivers/gpu/drm/xe/regs/xe_hw_error_regs.h > > > @@ -6,15 +6,59 @@ > > > #ifndef _XE_HW_ERROR_REGS_H_ > > > #define _XE_HW_ERROR_REGS_H_ > > > -#define HEC_UNCORR_ERR_STATUS(base) XE_REG((base) + 0x118) > > > -#define UNCORR_FW_REPORTED_ERR BIT(6) > > > +#define HEC_UNCORR_ERR_STATUS(base) XE_REG((base) + 0x118) > > > +#define UNCORR_FW_REPORTED_ERR REG_BIT(6) > > > -#define HEC_UNCORR_FW_ERR_DW0(base) XE_REG((base) + 0x124) > > > +#define HEC_UNCORR_FW_ERR_DW0(base) XE_REG((base) + 0x124) > > > + > > > +#define ERR_STAT_GT_COR 0x100160 > > > +#define EU_GRF_COR_ERR REG_BIT(15) > > > +#define EU_IC_COR_ERR REG_BIT(14) > > > +#define SLM_COR_ERR REG_BIT(13) > > > +#define GUC_COR_ERR REG_BIT(1) > > > + > > > +#define ERR_STAT_GT_NONFATAL 0x100164 > > > +#define ERR_STAT_GT_FATAL 0x100168 > > > +#define EU_GRF_FAT_ERR REG_BIT(15) > > > +#define SLM_FAT_ERR REG_BIT(13) > > > +#define GUC_FAT_ERR REG_BIT(6) > > > +#define FPU_FAT_ERR REG_BIT(3) > > > + > > > +#define ERR_STAT_GT_REG(x) XE_REG(_PICK_EVEN((x), \ > > > + ERR_STAT_GT_COR, \ > > > + ERR_STAT_GT_NONFATAL)) > > > + > > > +#define PVC_COR_ERR_MASK (GUC_COR_ERR | SLM_COR_ERR | \ > > > + EU_IC_COR_ERR | EU_GRF_COR_ERR) > > > + > > > +#define PVC_FAT_ERR_MASK (FPU_FAT_ERR | GUC_FAT_ERR | \ > > > + EU_GRF_FAT_ERR | SLM_FAT_ERR) > > > > Nit: Whitespace please! > > alignment? Yes please! > > > +#define DEV_ERR_STAT_NONFATAL 0x100178 > > > +#define DEV_ERR_STAT_CORRECTABLE 0x10017c > > > +#define DEV_ERR_STAT_REG(x) XE_REG(_PICK_EVEN((x), \ > > > + DEV_ERR_STAT_CORRECTABLE, \ > > > + DEV_ERR_STAT_NONFATAL)) > > > > I know it was already like this but how does this evaluate for FATAL? > > #define _PICK_EVEN(__index, __a, __b) ((__a) + (__index) * ((__b) - (__a))) > (index, 0x10017c, 0x100178) = (0x10017c + index * (0x100178 - 0x10017c)); > 0 = 0x10017c > 1 = 0x100178 > 2 = 0x100174 The addresses are usually unsigned, so I got lost there a bit. > > > +#define XE_CSC_ERROR 17 > > > +#define XE_GT_ERROR 0 > > > + > > > +#define ERR_STAT_GT_FATAL_VECTOR_0 0x100260 > > > +#define ERR_STAT_GT_FATAL_VECTOR_1 0x100264 > > > + > > > +#define ERR_STAT_GT_FATAL_VECTOR_REG(x) XE_REG(_PICK_EVEN((x), \ > > > + ERR_STAT_GT_FATAL_VECTOR_0, \ > > > + ERR_STAT_GT_FATAL_VECTOR_1)) > > > + > > > +#define ERR_STAT_GT_COR_VECTOR_0 0x1002a0 > > > +#define ERR_STAT_GT_COR_VECTOR_1 0x1002a4 > > > + > > > +#define ERR_STAT_GT_COR_VECTOR_REG(x) XE_REG(_PICK_EVEN((x), \ > > > + ERR_STAT_GT_COR_VECTOR_0, \ > > > + ERR_STAT_GT_COR_VECTOR_1)) > > > + > > > +#define ERR_STAT_GT_VECTOR_REG(hw_err, x) (hw_err == HARDWARE_ERROR_CORRECTABLE ? \ > > > + ERR_STAT_GT_COR_VECTOR_REG(x) : \ > > > + ERR_STAT_GT_FATAL_VECTOR_REG(x)) > > > > Ditto for whitespace. > > > > > -#define DEV_ERR_STAT_NONFATAL 0x100178 > > > -#define DEV_ERR_STAT_CORRECTABLE 0x10017c > > > -#define DEV_ERR_STAT_REG(x) XE_REG(_PICK_EVEN((x), \ > > > - DEV_ERR_STAT_CORRECTABLE, \ > > > - DEV_ERR_STAT_NONFATAL)) > > > -#define XE_CSC_ERROR BIT(17) > > > #endif > > > diff --git a/drivers/gpu/drm/xe/xe_hw_error.c b/drivers/gpu/drm/xe/xe_hw_error.c > > > index 2019aaaa1ebe..ff31fb322c8a 100644 > > > --- a/drivers/gpu/drm/xe/xe_hw_error.c > > > +++ b/drivers/gpu/drm/xe/xe_hw_error.c > > > @@ -3,6 +3,7 @@ > > > * Copyright © 2025 Intel Corporation > > > */ > > > +#include > > > #include > > > #include "regs/xe_gsc_regs.h" > > > @@ -15,7 +16,13 @@ > > > #include "xe_mmio.h" > > > #include "xe_survivability_mode.h" > > > -#define HEC_UNCORR_FW_ERR_BITS 4 > > > +#define GT_HW_ERROR_MAX_ERR_BITS 16 > > > +#define HEC_UNCORR_FW_ERR_BITS 4 > > > +#define XE_RAS_REG_SIZE 32 > > > + > > > +#define PVC_ERROR_MASK_SET(hw_err, err_bit) \ > > > + ((hw_err == HARDWARE_ERROR_CORRECTABLE) ? (BIT(err_bit) & PVC_COR_ERR_MASK) : \ > > > + (BIT(err_bit) & PVC_FAT_ERR_MASK)) > > > > I'd write this as below and move it to xe_hw_error_regs.h > > This is not specific to register selection or defining bits. It's related to > mask. So .c should be the right place Don't the mask bits come from register def? > > #define PVC_COR_ERR_MASK_SET(err_bit) (PVC_COR_ERR_MASK & REG_BIT(err_bit)) > > #define PVC_FAT_ERR_MASK_SET(err_bit) (PVC_FAT_ERR_MASK & REG_BIT(err_bit)) > > > > #define PVC_ERR_MASK_SET(hw_err, err_bit) ((hw_err == HARDWARE_ERROR_CORRECTABLE) ? \ > > PVC_COR_ERR_MASK_SET(err_bit) : \ > > PVC_FAT_ERR_MASK_SET(err_bit) > > > > ... > > > > > +static void gt_hw_error_handler(struct xe_tile *tile, const enum hardware_error hw_err, > > > + u32 error_id) > > > +{ > > > + const enum drm_xe_ras_error_severity severity = hw_err_to_severity(hw_err); > > > + struct xe_device *xe = tile_to_xe(tile); > > > + struct xe_drm_ras *ras = &xe->ras; > > > + struct xe_drm_ras_counter *info = ras->info[severity]; > > > + struct xe_mmio *mmio = &tile->mmio; > > > + unsigned long err_stat = 0; > > > + int i, len; > > > + > > > + if (xe->info.platform != XE_PVC) > > > + return; > > > + > > > + if (!info) > > > + return; > > > > Since info allocation is not related to hardware, we shouldn't even be > > at this point without it. So let's not hide bugs and fail probe instead. > > yes currently it is supported only on PVC. I can remove this here as there > is a PVC check but cannot remove the one suggested below. Fair, but please also return the allocation failure. With that perhaps xe_hw_error_init() will be int now. Raag > > > + if (hw_err == HARDWARE_ERROR_NONFATAL) { > > > + atomic_inc(&info[error_id].counter); > > > + log_hw_error(tile, info[error_id].name, severity); > > > + return; > > > + } > > > > ... > > > > > static void hw_error_source_handler(struct xe_tile *tile, const enum hardware_error hw_err) > > > { > > > const enum drm_xe_ras_error_severity severity = hw_err_to_severity(hw_err); > > > const char *severity_str = error_severity[severity]; > > > struct xe_device *xe = tile_to_xe(tile); > > > - unsigned long flags; > > > - u32 err_src; > > > + struct xe_drm_ras *ras = &xe->ras; > > > + struct xe_drm_ras_counter *info = ras->info[severity]; > > > + unsigned long flags, err_src; > > > + u32 err_bit; > > > - if (xe->info.platform != XE_BATTLEMAGE) > > > + if (!IS_DGFX(xe)) > > > return; > > > spin_lock_irqsave(&xe->irq.lock, flags); > > > @@ -108,11 +242,53 @@ static void hw_error_source_handler(struct xe_tile *tile, const enum hardware_er > > > goto unlock; > > > } > > > - if (err_src & XE_CSC_ERROR) > > > + /* > > > + * On encountering CSC firmware errors, the graphics device becomes unrecoverable > > > + * so return immediately on error. The only way to recover from these errors is > > > + * firmware flash. The device will enter Runtime Survivability mode when such > > > + * errors are detected. > > > + */ > > > + if (err_src & XE_CSC_ERROR) { > > > csc_hw_error_handler(tile, hw_err); > > > + goto clear_reg; > > > + } > > > - xe_mmio_write32(&tile->mmio, DEV_ERR_STAT_REG(hw_err), err_src); > > > + if (!info) > > > + goto clear_reg; > > > > Same as above. > > > > Raag