From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: X-Spam-Checker-Version: SpamAssassin 3.4.0 (2014-02-07) on aws-us-west-2-korg-lkml-1.web.codeaurora.org Received: from gabe.freedesktop.org (gabe.freedesktop.org [131.252.210.177]) (using TLSv1.2 with cipher ECDHE-RSA-AES256-GCM-SHA384 (256/256 bits)) (No client certificate requested) by smtp.lore.kernel.org (Postfix) with ESMTPS id 813B7C001B0 for ; Thu, 10 Aug 2023 07:58:55 +0000 (UTC) Received: from gabe.freedesktop.org (localhost [127.0.0.1]) by gabe.freedesktop.org (Postfix) with ESMTP id 585A510E4EB; Thu, 10 Aug 2023 07:58:55 +0000 (UTC) Received: from mgamail.intel.com (mgamail.intel.com [134.134.136.65]) by gabe.freedesktop.org (Postfix) with ESMTPS id 17EC810E4EB for ; Thu, 10 Aug 2023 07:58:53 +0000 (UTC) DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/simple; d=intel.com; i=@intel.com; q=dns/txt; s=Intel; t=1691654333; x=1723190333; h=from:to:cc:subject:in-reply-to:references:date: message-id:mime-version:content-transfer-encoding; bh=PFDp0HmM8BuMvlMM55gifVAr64+iyV1mWfUXMVI0I4o=; b=IIz4gtQkSXRzm+Sqh9ORr3mmFoBjKPUXIRvdSMo1Bd8FS98rS+TKKAOw podIspCg2MD8TNzAndk32b0sBaCacK+zu+cLO/acJlUSAmVa0qFUGjT9F kLEbOeylKZTxXDBVoPMU18VHMLTBvU3ErBK5788wnapPQ2ZEtxJE8itnq P/zSDGDdlFlytULzBfZsFiB63ricusVeGAIgygeFVxr8ITfqkrWxd4jj+ wshRCuTvUzVnKDm7/AsCNG/GjQmQKaFZCqs7RWgPI6RZsrHdZdHfL+L8x Wt53RAABQJAwVGlED7SRo9N89SV7oIYKAjeqvvuwALZO9kF8aOATOdDRN Q==; X-IronPort-AV: E=McAfee;i="6600,9927,10797"; a="375046923" X-IronPort-AV: E=Sophos;i="6.01,161,1684825200"; d="scan'208";a="375046923" Received: from fmsmga003.fm.intel.com ([10.253.24.29]) by orsmga103.jf.intel.com with ESMTP/TLS/ECDHE-RSA-AES256-GCM-SHA384; 10 Aug 2023 00:55:02 -0700 X-ExtLoop1: 1 X-IronPort-AV: E=McAfee;i="6600,9927,10797"; a="822145255" X-IronPort-AV: E=Sophos;i="6.01,161,1684825200"; d="scan'208";a="822145255" Received: from haijindi-mobl2.ger.corp.intel.com (HELO localhost) ([10.252.49.164]) by fmsmga003-auth.fm.intel.com with ESMTP/TLS/ECDHE-RSA-AES256-GCM-SHA384; 10 Aug 2023 00:54:59 -0700 From: Jani Nikula To: Himal Prasad Ghimiray , intel-xe@lists.freedesktop.org In-Reply-To: <20230810050755.715775-2-himal.prasad.ghimiray@intel.com> Organization: Intel Finland Oy - BIC 0357606-4 - Westendinkatu 7, 02160 Espoo References: <20230810050755.715775-1-himal.prasad.ghimiray@intel.com> <20230810050755.715775-2-himal.prasad.ghimiray@intel.com> Date: Thu, 10 Aug 2023 10:54:56 +0300 Message-ID: <877cq3s567.fsf@intel.com> MIME-Version: 1.0 Content-Type: text/plain; charset=utf-8 Content-Transfer-Encoding: quoted-printable Subject: Re: [Intel-xe] [PATCH v2 1/4] drm/xe: Handle errors from various components. X-BeenThere: intel-xe@lists.freedesktop.org X-Mailman-Version: 2.1.29 Precedence: list List-Id: Intel Xe graphics driver List-Unsubscribe: , List-Archive: List-Post: List-Help: List-Subscribe: , Cc: Joonas Lahtinen , Rodrigo Vivi , Matt Roper Errors-To: intel-xe-bounces@lists.freedesktop.org Sender: "Intel-xe" On Thu, 10 Aug 2023, Himal Prasad Ghimiray wrote: > The GFX device can generate numbers of classes of error under the new > infrastructure: correctable, non-fatal, and fatal errors. > > The non-fatal and fatal error classes distinguish between levels of > severity for uncorrectable errors. Driver will only handle logging > of errors and updating counters from various components within the > graphics device. Anything more will be handled at system level. > > For errors that will route as interrupts, three bits in the Master > Interrupt Register will be used to convey the class of error. > > For each class of error: Determine source of error (IP block) by reading > the Device Error Source Register (RW1C) that > corresponds to the class of error being serviced. > > Bspec: 50875, 53073, 53074, 53075 > > Cc: Rodrigo Vivi > Cc: Aravind Iddamsetty > Cc: Matthew Brost > Cc: Matt Roper > Cc: Joonas Lahtinen > Cc: Jani Nikula > Signed-off-by: Himal Prasad Ghimiray > --- > drivers/gpu/drm/xe/regs/xe_regs.h | 7 + > drivers/gpu/drm/xe/regs/xe_tile_error_regs.h | 108 +++++++++ > drivers/gpu/drm/xe/xe_device_types.h | 6 + > drivers/gpu/drm/xe/xe_irq.c | 220 +++++++++++++++++++ > 4 files changed, 341 insertions(+) > create mode 100644 drivers/gpu/drm/xe/regs/xe_tile_error_regs.h > > diff --git a/drivers/gpu/drm/xe/regs/xe_regs.h b/drivers/gpu/drm/xe/regs/= xe_regs.h > index ec45b1ba9db1..9901e55fc89c 100644 > --- a/drivers/gpu/drm/xe/regs/xe_regs.h > +++ b/drivers/gpu/drm/xe/regs/xe_regs.h > @@ -87,7 +87,14 @@ > #define GU_MISC_IRQ REG_BIT(29) > #define DISPLAY_IRQ REG_BIT(16) > #define GT_DW_IRQ(x) REG_BIT(x) > +#define XE_ERROR_IRQ(x) REG_BIT(26 + (x)) >=20=20 > #define PVC_RP_STATE_CAP XE_REG(0x281014) >=20=20 > +enum hardware_error { > + HARDWARE_ERROR_CORRECTABLE =3D 0, > + HARDWARE_ERROR_NONFATAL =3D 1, > + HARDWARE_ERROR_FATAL =3D 2, > + HARDWARE_ERROR_MAX, > +}; This file is about registers. IMO enums belong somewhere else. Define hardware registers using macros. > #endif > diff --git a/drivers/gpu/drm/xe/regs/xe_tile_error_regs.h b/drivers/gpu/d= rm/xe/regs/xe_tile_error_regs.h > new file mode 100644 > index 000000000000..fbb794b2f183 > --- /dev/null > +++ b/drivers/gpu/drm/xe/regs/xe_tile_error_regs.h > @@ -0,0 +1,108 @@ > +/* SPDX-License-Identifier: MIT */ > +/* > + * Copyright =C2=A9 2023 Intel Corporation > + */ > +#ifndef XE_TILE_ERROR_REGS_H_ > +#define XE_TILE_ERROR_REGS_H_ > + > +#include > + > +#define _DEV_ERR_STAT_NONFATAL 0x100178 > +#define _DEV_ERR_STAT_CORRECTABLE 0x10017c > +#define DEV_ERR_STAT_REG(x) XE_REG(_PICK_EVEN= ((x), \ > + _DEV_ERR_STAT_CORRECTABLE, \ > + _DEV_ERR_STAT_NONFATAL)) > + > +#define DEV_ERR_STAT_MAX_ERROR_BIT (21) > + > +/* Count of Correctable and Uncorrectable errors reported on tile */ > +enum xe_tile_hw_errors {g > + XE_TILE_HW_ERR_GT_FATAL =3D 0, > + XE_TILE_HW_ERR_SGGI_FATAL, > + XE_TILE_HW_ERR_DISPLAY_FATAL, > + XE_TILE_HW_ERR_SGDI_FATAL, > + XE_TILE_HW_ERR_SGLI_FATAL, > + XE_TILE_HW_ERR_SGUNIT_FATAL, > + XE_TILE_HW_ERR_SGCI_FATAL, > + XE_TILE_HW_ERR_GSC_FATAL, > + XE_TILE_HW_ERR_SOC_FATAL, > + XE_TILE_HW_ERR_MERT_FATAL, > + XE_TILE_HW_ERR_SGMI_FATAL, > + XE_TILE_HW_ERR_UNKNOWN_FATAL, > + XE_TILE_HW_ERR_SGGI_NONFATAL, > + XE_TILE_HW_ERR_DISPLAY_NONFATAL, > + XE_TILE_HW_ERR_SGDI_NONFATAL, > + XE_TILE_HW_ERR_SGLI_NONFATAL, > + XE_TILE_HW_ERR_GT_NONFATAL, > + XE_TILE_HW_ERR_SGUNIT_NONFATAL, > + XE_TILE_HW_ERR_SGCI_NONFATAL, > + XE_TILE_HW_ERR_GSC_NONFATAL, > + XE_TILE_HW_ERR_SOC_NONFATAL, > + XE_TILE_HW_ERR_MERT_NONFATAL, > + XE_TILE_HW_ERR_SGMI_NONFATAL, > + XE_TILE_HW_ERR_UNKNOWN_NONFATAL, > + XE_TILE_HW_ERR_GT_CORR, > + XE_TILE_HW_ERR_DISPLAY_CORR, > + XE_TILE_HW_ERR_SGUNIT_CORR, > + XE_TILE_HW_ERR_GSC_CORR, > + XE_TILE_HW_ERR_SOC_CORR, > + XE_TILE_HW_ERR_UNKNOWN_CORR, > +}; Ditto about enums and regs. > + > +#define XE_TILE_HW_ERROR_MAX (XE_TILE_HW_ERR_UNKNOWN_CORR + 1) If it's an enum, adding that last in the enum does the trick. > + > +#define PVC_DEV_ERR_STAT_FATAL_MASK \ > + (REG_BIT(0) | \ > + REG_BIT(1) | \ > + REG_BIT(8) | \ > + REG_BIT(9) | \ > + REG_BIT(13) | \ > + REG_BIT(16) | \ > + REG_BIT(20)) > + > +#define PVC_DEV_ERR_STAT_NONFATAL_MASK \ > + (REG_BIT(0) | \ > + REG_BIT(1) | \ > + REG_BIT(8) | \ > + REG_BIT(9) | \ > + REG_BIT(13) | \ > + REG_BIT(16) | \ > + REG_BIT(20)) > + > +#define PVC_DEV_ERR_STAT_CORRECTABLE_MASK \ > + (REG_BIT(0) | \ > + REG_BIT(8)) > + > +#define DG2_DEV_ERR_STAT_FATAL_MASK \ > + (REG_BIT(0) | \ > + REG_BIT(4) | \ > + REG_BIT(8) | \ > + REG_BIT(12) | \ > + REG_BIT(16)) > + > +#define DG2_DEV_ERR_STAT_NONFATAL_MASK \ > + (REG_BIT(0) | \ > + REG_BIT(4) | \ > + REG_BIT(8) | \ > + REG_BIT(12) | \ > + REG_BIT(16) | \ > + REG_BIT(20)) > + > +#define DG2_DEV_ERR_STAT_CORRECTABLE_MASK \ > + (REG_BIT(0) | \ > + REG_BIT(4) | \ > + REG_BIT(8) | \ > + REG_BIT(12) | \ > + REG_BIT(16)) Are the above supposed to match what's in xe_tile_hw_errors? Seems rather unmaintainable. > + > +#define REG_SIZE 32 > + > +#define xe_tile_log_hw_err(tile, fmt, ...) \ > + drm_err_ratelimited(&tile_to_xe(tile)->drm, HW_ERR "TILE%d detected " f= mt, \ > + tile->id, ##__VA_ARGS__) > + > +#define xe_tile_log_hw_warn(tile, fmt, ...) \ > + drm_warn(&tile_to_xe(tile)->drm, HW_ERR "TILE%d detected " fmt, \ > + tile->id, ##__VA_ARGS__) Do we really want to keep adding new macros for all possible scenarios in the driver? This is getting out of hand. Where's HW_ERR defined? > + > +#endif > diff --git a/drivers/gpu/drm/xe/xe_device_types.h b/drivers/gpu/drm/xe/xe= _device_types.h > index f84ecb976f5d..1335ba74981a 100644 > --- a/drivers/gpu/drm/xe/xe_device_types.h > +++ b/drivers/gpu/drm/xe/xe_device_types.h > @@ -16,6 +16,7 @@ > #include "xe_gt_types.h" > #include "xe_platform_types.h" > #include "xe_step_types.h" > +#include "regs/xe_tile_error_regs.h" >=20=20 > #if IS_ENABLED(CONFIG_DRM_XE_DISPLAY) > #include "ext/intel_device_info.h" > @@ -166,6 +167,11 @@ struct xe_tile { >=20=20 > /** @sysfs: sysfs' kobj used by xe_tile_sysfs */ > struct kobject *sysfs; > + > + /** @tile_hw_errors: hardware errors reported for the tile */ > + struct tile_hw_errors { > + unsigned long hw[XE_TILE_HW_ERROR_MAX]; Even with the documentation comment, I have to look up the source code to realize this is the *number* of errors for each class. Maybe "count" is more informative than "hw". > + } errors; > }; >=20=20 > /** > diff --git a/drivers/gpu/drm/xe/xe_irq.c b/drivers/gpu/drm/xe/xe_irq.c > index 2022a5643e01..04a665faea23 100644 > --- a/drivers/gpu/drm/xe/xe_irq.c > +++ b/drivers/gpu/drm/xe/xe_irq.c > @@ -362,6 +362,223 @@ static void dg1_intr_enable(struct xe_device *xe, b= ool stall) > xe_mmio_read32(mmio, DG1_MSTR_TILE_INTR); > } >=20=20 > +static const char * > +hardware_error_type_to_str(const enum hardware_error hw_err) > +{ > + switch (hw_err) { > + case HARDWARE_ERROR_CORRECTABLE: > + return "CORRECTABLE"; > + case HARDWARE_ERROR_NONFATAL: > + return "NONFATAL"; > + case HARDWARE_ERROR_FATAL: > + return "FATAL"; > + default: > + return "UNKNOWN"; > + } > +} > + > +struct error_msg_counter_pair { > + const char *errmsg; > + int errcounter; Counter? Or type/class/whatever? > +}; > + > +struct error_msg_counter_pair dev_err_stat_fatal_reg[] =3D { > + {"GT", XE_TILE_HW_ERR_GT_FATAL /* Bit Pos 0 */}, Does this again tie the enums and the bit positions together, similar to how the mask macros also do above? There needs to be a single point of truth for all of this. I think this needs a redesign. BR, Jani. > + {"SGGI Cmd Parity", XE_TILE_HW_ERR_SGGI_FATAL /* Bit Pos 1 */}, > + {"Undefined", XE_TILE_HW_ERR_UNKNOWN_FATAL}, > + {"Undefined", XE_TILE_HW_ERR_UNKNOWN_FATAL}, > + {"DISPLAY", XE_TILE_HW_ERR_DISPLAY_FATAL /* Bit Pos 4 */}, > + {"Undefined", XE_TILE_HW_ERR_UNKNOWN_FATAL}, > + {"Undefined", XE_TILE_HW_ERR_UNKNOWN_FATAL}, > + {"Undefined", XE_TILE_HW_ERR_UNKNOWN_FATAL}, > + {"GSC error", XE_TILE_HW_ERR_GSC_FATAL /* Bit Pos 8 */}, > + {"SGLI Cmd Parity", XE_TILE_HW_ERR_SGLI_FATAL /* Bit Pos 9 */}, > + {"Undefined", XE_TILE_HW_ERR_UNKNOWN_FATAL}, > + {"Undefined", XE_TILE_HW_ERR_UNKNOWN_FATAL}, > + {"SGUNIT", XE_TILE_HW_ERR_SGUNIT_FATAL /* Bit Pos 12 */}, > + {"SGCI Cmd Parity", XE_TILE_HW_ERR_SGCI_FATAL /* Bit Pos 13 */}, > + {"Undefined", XE_TILE_HW_ERR_UNKNOWN_FATAL}, > + {"Undefined", XE_TILE_HW_ERR_UNKNOWN_FATAL}, > + {"SOC ERROR", XE_TILE_HW_ERR_SOC_FATAL /* Bit Pos 16 */}, > + {"Undefined", XE_TILE_HW_ERR_UNKNOWN_FATAL}, > + {"Undefined", XE_TILE_HW_ERR_UNKNOWN_FATAL}, > + {"Undefined", XE_TILE_HW_ERR_UNKNOWN_FATAL}, > + {"MERT Cmd Parity", XE_TILE_HW_ERR_MERT_FATAL /* Bit Pos 20 */}, > +}; > + > +struct error_msg_counter_pair dev_err_stat_nonfatal_reg[] =3D { > + {"GT", XE_TILE_HW_ERR_GT_NONFATAL /* Bit Pos 0 */}, > + {"SGGI Data Parity", XE_TILE_HW_ERR_SGGI_NONFATAL /* Bit Pos 1 */}, > + {"Undefined", XE_TILE_HW_ERR_UNKNOWN_NONFATAL}, > + {"Undefined", XE_TILE_HW_ERR_UNKNOWN_NONFATAL}, > + {"DISPLAY", XE_TILE_HW_ERR_DISPLAY_NONFATAL /* Bit Pos 4 */}, > + {"SGDI Data Parity", XE_TILE_HW_ERR_SGDI_NONFATAL /* Bit Pos 5 */}, > + {"Undefined", XE_TILE_HW_ERR_UNKNOWN_NONFATAL}, > + {"Undefined", XE_TILE_HW_ERR_UNKNOWN_NONFATAL}, > + {"GSC", XE_TILE_HW_ERR_GSC_NONFATAL /* Bit Pos 8 */}, > + {"SGLI Data Parity", XE_TILE_HW_ERR_SGLI_NONFATAL /* Bit Pos 9 */}, > + {"Undefined", XE_TILE_HW_ERR_UNKNOWN_NONFATAL}, > + {"Undefined", XE_TILE_HW_ERR_UNKNOWN_NONFATAL}, > + {"SGUNIT", XE_TILE_HW_ERR_SGUNIT_NONFATAL /* Bit Pos 12 */}, > + {"SGCI Data Parity", XE_TILE_HW_ERR_SGCI_NONFATAL /* Bit Pos 13 */}, > + {"Undefined", XE_TILE_HW_ERR_UNKNOWN_NONFATAL}, > + {"Undefined", XE_TILE_HW_ERR_UNKNOWN_NONFATAL}, > + {"SOC", XE_TILE_HW_ERR_SOC_NONFATAL /* Bit Pos 16 */}, > + {"Undefined", XE_TILE_HW_ERR_UNKNOWN_NONFATAL /* Bit Pos 17 */}, > + {"Undefined", XE_TILE_HW_ERR_UNKNOWN_NONFATAL}, > + {"Undefined", XE_TILE_HW_ERR_UNKNOWN_NONFATAL}, > + {"MERT Data Parity", XE_TILE_HW_ERR_MERT_NONFATAL /* Bit Pos 20 */}, > +}; > + > +struct error_msg_counter_pair dev_err_stat_correctable_reg[] =3D { > + {"GT", XE_TILE_HW_ERR_GT_CORR /* Bit Pos 0 */}, > + {"Undefined", XE_TILE_HW_ERR_UNKNOWN_CORR}, > + {"Undefined", XE_TILE_HW_ERR_UNKNOWN_CORR}, > + {"Undefined", XE_TILE_HW_ERR_UNKNOWN_CORR}, > + {"DISPLAY", XE_TILE_HW_ERR_DISPLAY_CORR /* Bit Pos 4 */}, > + {"Undefined", XE_TILE_HW_ERR_UNKNOWN_CORR}, > + {"Undefined", XE_TILE_HW_ERR_UNKNOWN_CORR}, > + {"Undefined", XE_TILE_HW_ERR_UNKNOWN_CORR}, > + {"GSC", XE_TILE_HW_ERR_GSC_CORR /* Bit Pos 8 */}, > + {"Undefined", XE_TILE_HW_ERR_UNKNOWN_CORR}, > + {"Undefined", XE_TILE_HW_ERR_UNKNOWN_CORR}, > + {"Undefined", XE_TILE_HW_ERR_UNKNOWN_CORR}, > + {"SGUNIT", XE_TILE_HW_ERR_SGUNIT_CORR /* Bit Pos 12 */}, > + {"Undefined", XE_TILE_HW_ERR_UNKNOWN_CORR}, > + {"Undefined", XE_TILE_HW_ERR_UNKNOWN_CORR}, > + {"Undefined", XE_TILE_HW_ERR_UNKNOWN_CORR}, > + {"SOC", XE_TILE_HW_ERR_SOC_CORR /* Bit Pos 16 */}, > + {"Undefined", XE_TILE_HW_ERR_UNKNOWN_CORR}, > + {"Undefined", XE_TILE_HW_ERR_UNKNOWN_CORR}, > + {"Undefined", XE_TILE_HW_ERR_UNKNOWN_CORR}, > + {"Undefined", XE_TILE_HW_ERR_UNKNOWN_CORR}, > +}; > + > +static void update_valid_error_regs(struct xe_device *xe) > +{ > + unsigned long mask =3D 0; > + > + u32 i; > + > + if (xe->info.platform =3D=3D XE_DG2) { > + mask =3D ~(0 | DG2_DEV_ERR_STAT_FATAL_MASK); > + for_each_set_bit(i, &mask, DEV_ERR_STAT_MAX_ERROR_BIT) > + dev_err_stat_fatal_reg[i] =3D (struct error_msg_counter_pair) > + {.errmsg =3D "Undefined", .errcounter =3D XE_TILE_HW_ERR_UNKNOWN_FATA= L}; Nope. For one thing, the arrays really should be static const, placed in rodata, and not mutable. For another, if you have a platform with two or more different devices, whichever gets probed last clobbers the data. > + > + mask =3D ~(0 | DG2_DEV_ERR_STAT_NONFATAL_MASK); > + for_each_set_bit(i, &mask, DEV_ERR_STAT_MAX_ERROR_BIT) > + dev_err_stat_nonfatal_reg[i] =3D (struct error_msg_counter_pair) > + {.errmsg =3D "Undefined", .errcounter =3D XE_TILE_HW_ERR_UNKNOWN_NONF= ATAL}; > + > + mask =3D ~(0 | DG2_DEV_ERR_STAT_CORRECTABLE_MASK); > + for_each_set_bit(i, &mask, DEV_ERR_STAT_MAX_ERROR_BIT) > + dev_err_stat_correctable_reg[i] =3D (struct error_msg_counter_pair) > + {.errmsg =3D "Undefined", .errcounter =3D XE_TILE_HW_ERR_UNKNOWN_CORR= }; > + } else if (xe->info.platform =3D=3D XE_PVC) { > + mask =3D ~(0 | PVC_DEV_ERR_STAT_FATAL_MASK); > + for_each_set_bit(i, &mask, DEV_ERR_STAT_MAX_ERROR_BIT) > + dev_err_stat_fatal_reg[i] =3D (struct error_msg_counter_pair) > + {.errmsg =3D "Undefined", .errcounter =3D XE_TILE_HW_ERR_UNKNOWN_FATA= L}; > + > + mask =3D ~(0 | PVC_DEV_ERR_STAT_NONFATAL_MASK); > + for_each_set_bit(i, &mask, DEV_ERR_STAT_MAX_ERROR_BIT) > + dev_err_stat_nonfatal_reg[i] =3D (struct error_msg_counter_pair) > + {.errmsg =3D "Undefined", .errcounter =3D XE_TILE_HW_ERR_UNKNOWN_NONF= ATAL}; > + > + mask =3D ~(0 | PVC_DEV_ERR_STAT_CORRECTABLE_MASK); > + for_each_set_bit(i, &mask, DEV_ERR_STAT_MAX_ERROR_BIT) > + dev_err_stat_correctable_reg[i] =3D (struct error_msg_counter_pair) > + {.errmsg =3D "Undefined", .errcounter =3D XE_TILE_HW_ERR_UNKNOWN_CORR= }; > + } > +} > + > +static void > +xe_hw_error_source_handler(struct xe_tile *tile, const enum hardware_err= or hw_err) > +{ > + const char *hw_err_str =3D hardware_error_type_to_str(hw_err); > + struct error_msg_counter_pair *errstat; > + unsigned long errsrc; > + unsigned long flags; > + const char *errmsg; > + struct xe_gt *mmio; > + u32 counter; > + u32 errcntr; > + u32 errbit; > + > + switch (hw_err) { > + case HARDWARE_ERROR_FATAL: > + errstat =3D (struct error_msg_counter_pair *)dev_err_stat_fatal_reg; Why the casts? > + counter =3D XE_TILE_HW_ERR_UNKNOWN_FATAL; > + break; > + case HARDWARE_ERROR_NONFATAL: > + errstat =3D (struct error_msg_counter_pair *)dev_err_stat_nonfatal_reg; > + counter =3D XE_TILE_HW_ERR_UNKNOWN_NONFATAL; > + break; > + case HARDWARE_ERROR_CORRECTABLE: > + errstat =3D (struct error_msg_counter_pair *)dev_err_stat_correctable_= reg; > + counter =3D XE_TILE_HW_ERR_UNKNOWN_CORR; > + break; > + default: > + return; > + } > + > + spin_lock_irqsave(&tile_to_xe(tile)->irq.lock, flags); > + mmio =3D tile->primary_gt; > + errsrc =3D xe_mmio_read32(mmio, DEV_ERR_STAT_REG(hw_err)); > + > + if (!errsrc) { > + xe_tile_log_hw_err(tile, "DEV_ERR_STAT_REG_%s blank!\n", hw_err_str); > + goto unlock; > + } > + > + for_each_set_bit(errbit, &errsrc, REG_SIZE) { > + if (errbit < DEV_ERR_STAT_MAX_ERROR_BIT) { > + errmsg =3D errstat[errbit].errmsg; > + errcntr =3D errstat[errbit].errcounter; > + } else { > + errmsg =3D "Undefined"; > + errcntr =3D counter; > + } > + > + if (hw_err =3D=3D HARDWARE_ERROR_CORRECTABLE) > + xe_tile_log_hw_warn(tile, "%s %s error bit[%d] is set\n", > + errmsg, hw_err_str, errbit); > + else > + xe_tile_log_hw_err(tile, "%s %s error bit[%d] is set\n", > + errmsg, hw_err_str, errbit); > + > + tile->errors.hw[errcntr]++; > + } > + > + xe_mmio_write32(mmio, DEV_ERR_STAT_REG(hw_err), errsrc); > +unlock: > + spin_unlock_irqrestore(&tile_to_xe(tile)->irq.lock, flags); > +} > + > +/* > + * XE Platforms adds three Error bits to the Master Interrupt > + * Register to support error handling. These three bits are > + * used to convey the class of error: > + * FATAL, NONFATAL, or CORRECTABLE. > + * > + * To process an interrupt: > + * Determine source of error (IP block) by reading > + * the Device Error Source Register (RW1C) that > + * corresponds to the class of error being serviced > + * and log the error. > + */ > +static void > +xe_hw_error_irq_handler(struct xe_tile *tile, const u32 master_ctl) > +{ > + enum hardware_error hw_err; > + > + for (hw_err =3D 0; hw_err < HARDWARE_ERROR_MAX; hw_err++) { > + if (master_ctl & XE_ERROR_IRQ(hw_err)) > + xe_hw_error_source_handler(tile, hw_err); > + } > +} > + > /* > * Top-level interrupt handler for Xe_LP+ and beyond. These platforms h= ave > * a "master tile" interrupt register which must be consulted before the > @@ -413,6 +630,7 @@ static irqreturn_t dg1_irq_handler(int irq, void *arg) > xe_mmio_write32(mmio, GFX_MSTR_IRQ, master_ctl); >=20=20 > gt_irq_handler(tile, master_ctl, intr_dw, identity); > + xe_hw_error_irq_handler(tile, master_ctl); >=20=20 > /* > * Display interrupts (including display backlight operations > @@ -561,6 +779,8 @@ int xe_irq_install(struct xe_device *xe) > return -EINVAL; > } >=20=20 > + update_valid_error_regs(xe); > + > xe->irq.enabled =3D true; >=20=20 > xe_irq_reset(xe); --=20 Jani Nikula, Intel Open Source Graphics Center