Re: [PATCH v5 03/14] drm/xe/xe_pci_error: Implement PCI error recovery callbacks

All of lore.kernel.org
 help / color / mirror / Atom feed

From: "Mallesh, Koujalagi" <mallesh.koujalagi@intel.com>
To: Riana Tauro <riana.tauro@intel.com>, <intel-xe@lists.freedesktop.org>
Cc: <anshuman.gupta@intel.com>, <rodrigo.vivi@intel.com>,
	<aravind.iddamsetty@linux.intel.com>, <badal.nilawar@intel.com>,
	<raag.jadav@intel.com>, <ravi.kishore.koppuravuri@intel.com>,
	<soham.purkait@intel.com>,
	Michal Wajdeczko <michal.wajdeczko@intel.com>,
	Matthew Brost <matthew.brost@intel.com>,
	Matt Roper <matthew.d.roper@intel.com>
Subject: Re: [PATCH v5 03/14] drm/xe/xe_pci_error: Implement PCI error recovery callbacks
Date: Thu, 14 May 2026 18:45:24 +0530	[thread overview]
Message-ID: <77411bf1-83a6-4270-ad37-dfacdb3d6b3a@intel.com> (raw)
In-Reply-To: <20260511172908.1122252-19-riana.tauro@intel.com>


On 11-05-2026 10:59 pm, Riana Tauro wrote:
> Add error_detected, mmio_enabled, slot_reset and resume recovery callbacks
> to handle PCIe Advanced Error Reporting (AER) errors.
>
> For fatal errors, the device is wedged and becomes inaccessible. Return
> PCI_ERS_RESULT_SLOT_RESET from error_detected to request a Secondary
> Bus Reset (SBR).
>
> For non-fatal errors, return PCI_ERS_RESULT_CAN_RECOVER from
> error_detected to trigger the mmio_enabled callback. In this callback, the
> device is queried to determine the error cause and attempt recovery based
> on the error type.
>
> Once the secondary bus reset(SBR) is completed the slot_reset callback
> cleanly removes and reprobe the device to restore functionality.
>
> Cc: Michal Wajdeczko <michal.wajdeczko@intel.com>
> Cc: Matthew Brost <matthew.brost@intel.com>
> Cc: Matt Roper <matthew.d.roper@intel.com>
> Signed-off-by: Riana Tauro <riana.tauro@intel.com>
LGTM,
Reviewed-by: Mallesh Koujalagi <mallesh.koujalagi@intel.com>
> ---
> v2: re-order linux headers
>      reword error messages
>      do not clear in_recovery after remove
>      return PCI_ERS_RESULT_DISCONNECT if probe fails (Michal)
>      only wedge device do not send uevent (Raag)
>      set recovery flag in error_detected and clear on resume
>      add default switch case (Mallesh)
>
> v3: do not set in_recovery for disconnect (Mallesh)
>      return if already wedged or in survivability mode
>
> v4: Add comment (Matthew)
>      Fix tab (Mallesh)
>
> v5: remove in_reset
>      disconnect if already in survivability mode or wedged
>      block I/O operations in slot reset (Raag)
>
> Note: The re-probe in this patch will be replaced by
> minimal re-initalization once below patch is merged
>   https://lore.kernel.org/intel-xe/f642453c-f657-41c7-a01b-5a0baf886cd3@intel.com/
>
> ---
>   drivers/gpu/drm/xe/Makefile       |   1 +
>   drivers/gpu/drm/xe/xe_pci.c       |   3 +
>   drivers/gpu/drm/xe/xe_pci_error.c | 115 ++++++++++++++++++++++++++++++
>   3 files changed, 119 insertions(+)
>   create mode 100644 drivers/gpu/drm/xe/xe_pci_error.c
>
> diff --git a/drivers/gpu/drm/xe/Makefile b/drivers/gpu/drm/xe/Makefile
> index 09661f079d03..091872771e98 100644
> --- a/drivers/gpu/drm/xe/Makefile
> +++ b/drivers/gpu/drm/xe/Makefile
> @@ -101,6 +101,7 @@ xe-y += xe_bb.o \
>   	xe_page_reclaim.o \
>   	xe_pat.o \
>   	xe_pci.o \
> +	xe_pci_error.o \
>   	xe_pci_rebar.o \
>   	xe_pcode.o \
>   	xe_pm.o \
> diff --git a/drivers/gpu/drm/xe/xe_pci.c b/drivers/gpu/drm/xe/xe_pci.c
> index d55e5af4f4b7..d970c27f5570 100644
> --- a/drivers/gpu/drm/xe/xe_pci.c
> +++ b/drivers/gpu/drm/xe/xe_pci.c
> @@ -1320,6 +1320,8 @@ static const struct dev_pm_ops xe_pm_ops = {
>   };
>   #endif
>   
> +extern const struct pci_error_handlers xe_pci_error_handlers;
> +
>   static struct pci_driver xe_pci_driver = {
>   	.name = DRIVER_NAME,
>   	.id_table = pciidlist,
> @@ -1327,6 +1329,7 @@ static struct pci_driver xe_pci_driver = {
>   	.remove = xe_pci_remove,
>   	.shutdown = xe_pci_shutdown,
>   	.sriov_configure = xe_pci_sriov_configure,
> +	.err_handler = &xe_pci_error_handlers,
>   #ifdef CONFIG_PM_SLEEP
>   	.driver.pm = &xe_pm_ops,
>   #endif
> diff --git a/drivers/gpu/drm/xe/xe_pci_error.c b/drivers/gpu/drm/xe/xe_pci_error.c
> new file mode 100644
> index 000000000000..42a821ca1a04
> --- /dev/null
> +++ b/drivers/gpu/drm/xe/xe_pci_error.c
> @@ -0,0 +1,115 @@
> +// SPDX-License-Identifier: MIT
> +/*
> + * Copyright © 2026 Intel Corporation
> + */
> +#include <linux/pci.h>
> +
> +#include <drm/drm_drv.h>
> +
> +#include "xe_device.h"
> +#include "xe_gt.h"
> +#include "xe_pci.h"
> +#include "xe_survivability_mode.h"
> +#include "xe_uc.h"
> +
> +static void xe_pci_error_handling(struct pci_dev *pdev)
> +{
> +	struct xe_device *xe = pdev_to_xe_device(pdev);
> +	struct xe_gt *gt;
> +	u8 id;
> +
> +	/*
> +	 * Wedge the device to prevent userspace access but don't send the event yet.
> +	 * Runtime PM ref is taken by PCI core for the duration of error handling.
> +	 */
> +	atomic_set(&xe->wedged.flag, 1);
> +
> +	for_each_gt(gt, xe, id)
> +		xe_gt_declare_wedged(gt);
> +
> +	pci_disable_device(pdev);
> +}
> +
> +static pci_ers_result_t xe_pci_error_detected(struct pci_dev *pdev, pci_channel_state_t state)
> +{
> +	struct xe_device *xe = pdev_to_xe_device(pdev);
> +
> +	dev_err(&pdev->dev, "Xe Pci error recovery: error detected state %d\n", state);
> +
> +	if (state == pci_channel_io_perm_failure)
> +		return PCI_ERS_RESULT_DISCONNECT;
> +
> +	/* If the device is already wedged or in survivability mode, do not attempt recovery */
> +	if (xe_survivability_mode_is_boot_enabled(xe) || xe_device_wedged(xe))
> +		return PCI_ERS_RESULT_DISCONNECT;
> +
> +	switch (state) {
> +	case pci_channel_io_normal:
> +		return PCI_ERS_RESULT_CAN_RECOVER;
> +	case pci_channel_io_frozen:
> +		xe_pci_error_handling(pdev);
> +		return PCI_ERS_RESULT_NEED_RESET;
> +	default:
> +		dev_err(&pdev->dev, "Unknown state %d\n", state);
> +		return PCI_ERS_RESULT_NEED_RESET;
> +	}
> +}
> +
> +static pci_ers_result_t xe_pci_error_mmio_enabled(struct pci_dev *pdev)
> +{
> +	dev_err(&pdev->dev, "Xe Pci error recovery: MMIO enabled\n");
> +
> +	return PCI_ERS_RESULT_NEED_RESET;
> +}
> +
> +static pci_ers_result_t xe_pci_error_slot_reset(struct pci_dev *pdev)
> +{
> +	const struct pci_device_id *ent = pci_match_id(pdev->driver->id_table, pdev);
> +	struct xe_device *xe;
> +
> +	dev_err(&pdev->dev, "Xe Pci error recovery: Slot reset\n");
> +
> +	pci_restore_state(pdev);
> +
> +	if (pci_enable_device(pdev)) {
> +		dev_err(&pdev->dev,
> +			"Cannot re-enable PCI device after reset\n");
> +		return PCI_ERS_RESULT_DISCONNECT;
> +	}
> +
> +	/*
> +	 * Secondary Bus Reset causes all VRAM state to be lost along with
> +	 * hardware state. As an initial step, re-probe the device to
> +	 * re-initialize the driver and hardware.
> +	 * TODO: optimize by re-initializing only the hardware state and re-creating
> +	 * kernel BOs.
> +	 */
> +	pdev->driver->remove(pdev);
> +
> +	if (pdev->driver->probe(pdev, ent))
> +		return PCI_ERS_RESULT_DISCONNECT;
> +
> +	xe = pdev_to_xe_device(pdev);
> +
> +	/* Wedge the device to prevent I/O operations till the resume callback */
> +	atomic_set(&xe->wedged.flag, 1);
> +
> +	return PCI_ERS_RESULT_RECOVERED;
> +}
> +
> +static void xe_pci_error_resume(struct pci_dev *pdev)
> +{
> +	struct xe_device *xe = pdev_to_xe_device(pdev);
> +
> +	dev_info(&pdev->dev, "Xe Pci error recovery: Recovered\n");
> +
> +	/* Resume I/O operations */
> +	atomic_set(&xe->wedged.flag, 0);
> +}
> +
> +const struct pci_error_handlers xe_pci_error_handlers = {
> +	.error_detected	= xe_pci_error_detected,
> +	.mmio_enabled	= xe_pci_error_mmio_enabled,
> +	.slot_reset	= xe_pci_error_slot_reset,
> +	.resume		= xe_pci_error_resume,
> +};

next prev parent reply	other threads:[~2026-05-14 13:15 UTC|newest]

Thread overview: 26+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2026-05-11 17:29 [PATCH v5 00/14] Introduce Xe Uncorrectable Error Handling Riana Tauro
2026-05-11 17:29 ` [PATCH v5 01/14] drm/xe/xe_survivability: Decouple survivability info from boot survivability Riana Tauro
2026-05-11 17:29 ` [PATCH v5 02/14] drm/xe/xe_sysctrl: Make sysctrl flood limit reusable Riana Tauro
2026-05-14 12:51   ` Mallesh, Koujalagi
2026-05-15  7:46   ` Raag Jadav
2026-05-11 17:29 ` [PATCH v5 03/14] drm/xe/xe_pci_error: Implement PCI error recovery callbacks Riana Tauro
2026-05-14 13:15   ` Mallesh, Koujalagi [this message]
2026-05-11 17:29 ` [PATCH v5 04/14] drm/xe/xe_pci_error: Group all devres to release them on PCIe slot reset Riana Tauro
2026-05-11 17:29 ` [PATCH v5 05/14] drm/xe: Skip device access during PCI error recovery Riana Tauro
2026-05-11 17:29 ` [PATCH v5 06/14] drm/xe/xe_ras: Initialize Uncorrectable AER Registers Riana Tauro
2026-05-14 17:40   ` Raag Jadav
2026-05-15  8:30   ` Mallesh, Koujalagi
2026-05-11 17:29 ` [PATCH v5 07/14] drm/xe/xe_ras: Add support for uncorrectable core-compute errors Riana Tauro
2026-05-11 17:29 ` [PATCH v5 08/14] drm/xe/xe_ras: Handle uncorrectable SoC Internal errors Riana Tauro
2026-05-15  8:10   ` Mallesh, Koujalagi
2026-05-11 17:29 ` [PATCH v5 09/14] drm/xe/xe_ras: Add support to query device memory errors Riana Tauro
2026-05-11 17:29 ` [PATCH v5 10/14] drm/xe/xe_ras: Add support to query page offline queue and list Riana Tauro
2026-05-11 17:29 ` [PATCH v5 11/14] drm/xe/xe_ras: Query errors from system controller on probe Riana Tauro
2026-05-11 21:56   ` Umesh Nerlige Ramappa
2026-05-11 17:29 ` [PATCH v5 12/14] drm/xe/xe_pci_error: Process errors in mmio_enabled Riana Tauro
2026-05-11 17:29 ` [RFC PATCH v5 13/14] drm/xe/xe_ras: Add support to offline/decline a page address Riana Tauro
2026-05-11 17:29 ` [RFC PATCH v5 14/14] drm/xe/xe_ras: Process pages from offlined list and queue Riana Tauro
2026-05-12  1:05 ` ✗ CI.checkpatch: warning for Introduce Xe Uncorrectable Error Handling (rev5) Patchwork
2026-05-12  1:06 ` ✓ CI.KUnit: success " Patchwork
2026-05-12  2:29 ` ✓ Xe.CI.BAT: " Patchwork
2026-05-12  6:26 ` ✗ Xe.CI.FULL: failure " Patchwork

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=77411bf1-83a6-4270-ad37-dfacdb3d6b3a@intel.com \
    --to=mallesh.koujalagi@intel.com \
    --cc=anshuman.gupta@intel.com \
    --cc=aravind.iddamsetty@linux.intel.com \
    --cc=badal.nilawar@intel.com \
    --cc=intel-xe@lists.freedesktop.org \
    --cc=matthew.brost@intel.com \
    --cc=matthew.d.roper@intel.com \
    --cc=michal.wajdeczko@intel.com \
    --cc=raag.jadav@intel.com \
    --cc=ravi.kishore.koppuravuri@intel.com \
    --cc=riana.tauro@intel.com \
    --cc=rodrigo.vivi@intel.com \
    --cc=soham.purkait@intel.com \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link

Be sure your reply has a Subject: header at the top and a blank line before the message body.

This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.