Re: [Intel-xe] [PATCH v5 4/4] drm/xe: Process fatal hardware errors.

Intel-XE Archive on lore.kernel.org
 help / color / mirror / Atom feed

From: Aravind Iddamsetty <aravind.iddamsetty@linux.intel.com>
To: Himal Prasad Ghimiray <himal.prasad.ghimiray@intel.com>,
	intel-xe@lists.freedesktop.org
Cc: Matt Roper <matthew.d.roper@intel.com>,
	Rodrigo Vivi <rodrigo.vivi@intel.com>
Subject: Re: [Intel-xe] [PATCH v5 4/4] drm/xe: Process fatal hardware errors.
Date: Tue, 26 Sep 2023 09:51:49 +0530	[thread overview]
Message-ID: <9d63f599-3cf5-f1c4-8ade-a4a808c78cd2@linux.intel.com> (raw)
In-Reply-To: <20230823085842.1440523-5-himal.prasad.ghimiray@intel.com>


On 23/08/23 14:28, Himal Prasad Ghimiray wrote:
> Fatal errors are reported as PCIe errors. When a PCIe error is asserted,
> the OS will perform a device warm reset which causes the driver to reload.
> The error registers are sticky and the values are maintained through a
> warm reset. We read these registers during the boot flow of the driver and
> increment the respective error counters.
>
> Bspec: 53076
please mention 50875 as well.
>
> Cc: Rodrigo Vivi <rodrigo.vivi@intel.com>
> Cc: Aravind Iddamsetty <aravind.iddamsetty@intel.com>
> Cc: Matthew Brost <matthew.brost@intel.com>
> Cc: Matt Roper <matthew.d.roper@intel.com>
> Cc: Joonas Lahtinen <joonas.lahtinen@linux.intel.com>
> Signed-off-by: Himal Prasad Ghimiray <himal.prasad.ghimiray@intel.com>
> ---
>  drivers/gpu/drm/xe/regs/xe_regs.h |  3 +++
>  drivers/gpu/drm/xe/xe_hw_error.c  | 37 ++++++++++++++++++++++++++++++-
>  drivers/gpu/drm/xe/xe_hw_error.h  |  3 ++-
>  drivers/gpu/drm/xe/xe_irq.c       |  2 +-
>  4 files changed, 42 insertions(+), 3 deletions(-)
>
> diff --git a/drivers/gpu/drm/xe/regs/xe_regs.h b/drivers/gpu/drm/xe/regs/xe_regs.h
> index e223975a5acf..b8f2b1762d3f 100644
> --- a/drivers/gpu/drm/xe/regs/xe_regs.h
> +++ b/drivers/gpu/drm/xe/regs/xe_regs.h
> @@ -90,5 +90,8 @@
>  #define   GT_DW_IRQ(x)				REG_BIT(x)
>  #define   XE_ERROR_IRQ(x)			REG_BIT(26 + (x))
>  
> +#define DEV_PCIEERR_STATUS			XE_REG(0x100180)
> +#define   DEV_PCIEERR_IS_FATAL(x)		REG_BIT(x * 4 + 2)
place it as per the address.
> +
>  #define PVC_RP_STATE_CAP			XE_REG(0x281014)
>  #endif
> diff --git a/drivers/gpu/drm/xe/xe_hw_error.c b/drivers/gpu/drm/xe/xe_hw_error.c
> index deb020a509d2..9595e3369656 100644
> --- a/drivers/gpu/drm/xe/xe_hw_error.c
> +++ b/drivers/gpu/drm/xe/xe_hw_error.c
> @@ -183,7 +183,7 @@ static const struct err_msg_cntr_pair err_stat_gt_correctable_vectr_reg[] = {
>  	[2 ... 3]         = {"L3BANK",		XE_GT_HW_ERR_L3BANK_CORR},
>  };
>  
> -void xe_assign_hw_err_regs(struct xe_device *xe)
> +static void xe_assign_hw_err_regs(struct xe_device *xe)

change it in the patch that introduced it.
>  {
>  	const struct err_msg_cntr_pair **dev_err_stat = xe->hw_err_regs.dev_err_stat;
>  	const struct err_msg_cntr_pair **err_stat_gt = xe->hw_err_regs.err_stat_gt;
> @@ -417,3 +417,38 @@ xe_hw_error_irq_handler(struct xe_tile *tile, const u32 master_ctl)
>  			xe_hw_error_source_handler(tile, hw_err);
>  	}
>  }
> +
> +/**
> + * process_hw_errors - checks for the occurrence of HW errors
> + *
> + * This checks for the HW Errors including FATAL errors that might
> + * have occurred in the previous boot of the driver which will
> + * initiate PCIe FLR reset of the device and cause the

this is not right, in fact it is better to rephrase as:

"fatal will result in a card warm reset and driver will be reloaded."

> + * driver to reload.
> + */
> +void xe_process_hw_errors(struct xe_device *xe)
> +{
> +	struct xe_tile *root_tile = xe_device_get_root_tile(xe);
> +	struct xe_gt *root_mmio = root_tile->primary_gt;
> +
> +	u32 dev_pcieerr_status, master_ctl;
> +	struct xe_tile *tile;
> +	int i;
> +
> +	xe_assign_hw_err_regs(xe);

lets still have this in xe_irq_install, as it serves for a different purpose.
> +
> +	dev_pcieerr_status = xe_mmio_read32(root_mmio, DEV_PCIEERR_STATUS);
> +
> +	for_each_tile(tile, xe, i) {
> +		struct xe_gt *mmio = tile->primary_gt;
> +
> +		if (dev_pcieerr_status & DEV_PCIEERR_IS_FATAL(i))
> +			xe_hw_error_source_handler(tile, HARDWARE_ERROR_FATAL);
> +
> +		master_ctl = xe_mmio_read32(mmio, GFX_MSTR_IRQ);
> +		xe_hw_error_irq_handler(tile, master_ctl);
> +		xe_mmio_write32(mmio, GFX_MSTR_IRQ, master_ctl);
> +	}
> +	if (dev_pcieerr_status)
> +		xe_mmio_write32(root_mmio, DEV_PCIEERR_STATUS, dev_pcieerr_status);
> +}
> diff --git a/drivers/gpu/drm/xe/xe_hw_error.h b/drivers/gpu/drm/xe/xe_hw_error.h
> index 3fcbbcc338fe..2812407dd4bf 100644
> --- a/drivers/gpu/drm/xe/xe_hw_error.h
> +++ b/drivers/gpu/drm/xe/xe_hw_error.h
> @@ -104,5 +104,6 @@ struct xe_device;
>  struct xe_tile;
>  
>  void xe_hw_error_irq_handler(struct xe_tile *tile, const u32 master_ctl);
> -void xe_assign_hw_err_regs(struct xe_device *xe);
> +void xe_process_hw_errors(struct xe_device *xe);
> +
>  #endif
> diff --git a/drivers/gpu/drm/xe/xe_irq.c b/drivers/gpu/drm/xe/xe_irq.c
> index 48b933234342..be152ebd6ce9 100644
> --- a/drivers/gpu/drm/xe/xe_irq.c
> +++ b/drivers/gpu/drm/xe/xe_irq.c
> @@ -573,7 +573,7 @@ int xe_irq_install(struct xe_device *xe)
>  		return -EINVAL;
>  	}
>  
> -	xe_assign_hw_err_regs(xe);
> +	xe_process_hw_errors(xe);

this shall be called before xe_irq_reset.
>  
>  	xe->irq.enabled = true;
>  
Thanks,
Aravind.

next prev parent reply	other threads:[~2023-09-26  4:19 UTC|newest]

Thread overview: 21+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2023-08-23  8:58 [Intel-xe] [PATCH v5 0/4] Supporting RAS on XE Himal Prasad Ghimiray
2023-08-23  8:58 ` [Intel-xe] [PATCH v5 1/4] drm/xe: Handle errors from various components Himal Prasad Ghimiray
2023-09-26  4:20   ` Aravind Iddamsetty
2023-09-26  4:57     ` Ghimiray, Himal Prasad
2023-09-26 10:09       ` Aravind Iddamsetty
2023-10-04 12:07   ` Aravind Iddamsetty
2023-10-05  4:01     ` Aravind Iddamsetty
2023-08-23  8:58 ` [Intel-xe] [PATCH v5 2/4] drm/xe: Log and count the GT hardware errors Himal Prasad Ghimiray
2023-09-26  4:20   ` Aravind Iddamsetty
2023-09-26  5:08     ` Ghimiray, Himal Prasad
2023-08-23  8:58 ` [Intel-xe] [PATCH v5 3/4] drm/xe: Support GT hardware error reporting for PVC Himal Prasad Ghimiray
2023-09-26  4:21   ` Aravind Iddamsetty
2023-09-26  5:11     ` Ghimiray, Himal Prasad
2023-08-23  8:58 ` [Intel-xe] [PATCH v5 4/4] drm/xe: Process fatal hardware errors Himal Prasad Ghimiray
2023-09-26  4:21   ` Aravind Iddamsetty [this message]
2023-09-26 10:24     ` Ghimiray, Himal Prasad
2023-08-23  9:00 ` [Intel-xe] ✓ CI.Patch_applied: success for Supporting RAS on XE (rev4) Patchwork
2023-08-23  9:00 ` [Intel-xe] ✗ CI.checkpatch: warning " Patchwork
2023-08-23  9:01 ` [Intel-xe] ✓ CI.KUnit: success " Patchwork
2023-08-23  9:05 ` [Intel-xe] ✓ CI.Build: " Patchwork
2023-08-23  9:05 ` [Intel-xe] ✗ CI.Hooks: failure " Patchwork

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=9d63f599-3cf5-f1c4-8ade-a4a808c78cd2@linux.intel.com \
    --to=aravind.iddamsetty@linux.intel.com \
    --cc=himal.prasad.ghimiray@intel.com \
    --cc=intel-xe@lists.freedesktop.org \
    --cc=matthew.d.roper@intel.com \
    --cc=rodrigo.vivi@intel.com \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link

Be sure your reply has a Subject: header at the top and a blank line before the message body.

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox