Date: Mon, 2 Mar 2026 18:37:14 +0100
From: Raag Jadav
To: Riana Tauro
Cc: intel-xe@lists.freedesktop.org, anshuman.gupta@intel.com,
	rodrigo.vivi@intel.com, aravind.iddamsetty@linux.intel.com,
	badal.nilawar@intel.com, ravi.kishore.koppuravuri@intel.com,
	mallesh.koujalagi@intel.com, Michal Wajdeczko, Matthew Brost,
	Matt Roper
Subject: Re: [PATCH v2 03/11] drm/xe/xe_pci_error: Implement PCI error recovery callbacks
References: <20260302102155.4074630-13-riana.tauro@intel.com> <20260302102155.4074630-16-riana.tauro@intel.com>
In-Reply-To: <20260302102155.4074630-16-riana.tauro@intel.com>

On Mon, Mar 02, 2026 at 03:51:58PM +0530, Riana Tauro wrote:
> Add error_detected, mmio_enabled, slot_reset and resume
> recovery callbacks to handle PCIe Advanced Error Reporting
> (AER) errors.
>
> For fatal errors, the device is wedged and becomes
> inaccessible. Return PCI_ERS_RESULT_SLOT_RESET from
> error_detected to request a Secondary Bus Reset (SBR).
>
> For non-fatal errors, return PCI_ERS_RESULT_CAN_RECOVER from
> error_detected to trigger the mmio_enabled callback. In this callback,
> the device is queried to determine the error cause and attempt
> recovery based on the error type.
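
As a reading aid for the mapping described above, here is a minimal
userspace sketch of the decision error_detected makes. It is not the
driver code: the sketch_* enums and sketch_error_detected() are
hypothetical stand-ins for pci_channel_state_t, pci_ers_result_t and the
callback, and the wedged flag is a plain bool. Note that the code quoted
further down returns PCI_ERS_RESULT_NEED_RESET for the frozen state,
which is what the sketch mirrors.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical stand-ins for the kernel's pci_channel_state_t values. */
enum sketch_channel_state {
	SKETCH_IO_NORMAL,       /* non-fatal error, I/O still possible */
	SKETCH_IO_FROZEN,       /* fatal error, I/O is blocked */
	SKETCH_IO_PERM_FAILURE, /* device is considered lost */
};

/* Hypothetical stand-ins for the kernel's pci_ers_result_t values. */
enum sketch_ers_result {
	SKETCH_CAN_RECOVER,
	SKETCH_NEED_RESET,
	SKETCH_DISCONNECT,
};

/*
 * Map a channel state to a recovery request, following the switch
 * statement in the quoted patch. Only the frozen case wedges the device.
 */
static enum sketch_ers_result
sketch_error_detected(enum sketch_channel_state state, bool *wedged)
{
	assert(wedged != NULL);

	switch (state) {
	case SKETCH_IO_NORMAL:
		return SKETCH_CAN_RECOVER;
	case SKETCH_IO_FROZEN:
		*wedged = true;
		return SKETCH_NEED_RESET;
	case SKETCH_IO_PERM_FAILURE:
		return SKETCH_DISCONNECT;
	default:
		/* Unknown state: fall back to requesting a reset. */
		return SKETCH_NEED_RESET;
	}
}
```

Calling sketch_error_detected(SKETCH_IO_FROZEN, &wedged) sets wedged and
asks for a reset, while the normal and permanent-failure states leave
wedged untouched.
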
>
> Once the secondary bus reset(SBR) is completed the slot_reset callback
> cleanly removes and reprobe the device to restore functionality.
>
> Cc: Michal Wajdeczko
> Cc: Matthew Brost
> Cc: Matt Roper
> Signed-off-by: Riana Tauro
> ---
> v2: re-order linux headers
>     reword error messages
>     do not clear in_recovery after remove
>     return PCI_ERS_RESULT_DISCONNECT if probe fails (Michal)
>     only wedge device do not send uevent (Raag)
>     set recovery flag in error_detected and clear on resume
>     add default switch case (Mallesh)
> ---
>  drivers/gpu/drm/xe/Makefile          |  1 +
>  drivers/gpu/drm/xe/xe_device.h       | 15 +++++
>  drivers/gpu/drm/xe/xe_device_types.h |  3 +
>  drivers/gpu/drm/xe/xe_pci.c          |  3 +
>  drivers/gpu/drm/xe/xe_pci_error.c    | 99 ++++++++++++++++++++++++++++
>  5 files changed, 121 insertions(+)
>  create mode 100644 drivers/gpu/drm/xe/xe_pci_error.c
>
> diff --git a/drivers/gpu/drm/xe/Makefile b/drivers/gpu/drm/xe/Makefile
> index 1890bbd1b28d..417b030e5ce7 100644
> --- a/drivers/gpu/drm/xe/Makefile
> +++ b/drivers/gpu/drm/xe/Makefile
> @@ -99,6 +99,7 @@ xe-y += xe_bb.o \
>  	xe_page_reclaim.o \
>  	xe_pat.o \
>  	xe_pci.o \
> +	xe_pci_error.o \
>  	xe_pci_rebar.o \
>  	xe_pcode.o \
>  	xe_pm.o \
> diff --git a/drivers/gpu/drm/xe/xe_device.h b/drivers/gpu/drm/xe/xe_device.h
> index 39464650533b..972f43d20f1a 100644
> --- a/drivers/gpu/drm/xe/xe_device.h
> +++ b/drivers/gpu/drm/xe/xe_device.h
> @@ -43,6 +43,21 @@ static inline struct xe_device *ttm_to_xe_device(struct ttm_device *ttm)
>  	return container_of(ttm, struct xe_device, ttm);
>  }
>
> +static inline bool xe_device_is_in_recovery(struct xe_device *xe)
> +{
> +	return atomic_read(&xe->in_recovery);
> +}
> +
> +static inline void xe_device_set_in_recovery(struct xe_device *xe)
> +{
> +	atomic_set(&xe->in_recovery, 1);
> +}
> +
> +static inline void xe_device_clear_in_recovery(struct xe_device *xe)
> +{
> +	atomic_set(&xe->in_recovery, 0);
> +}
> +
>  struct xe_device *xe_device_create(struct pci_dev *pdev,
>  				   const struct pci_device_id *ent);
>  int xe_device_probe_early(struct xe_device *xe);
> diff --git a/drivers/gpu/drm/xe/xe_device_types.h b/drivers/gpu/drm/xe/xe_device_types.h
> index 5599534384fa..616d74792902 100644
> --- a/drivers/gpu/drm/xe/xe_device_types.h
> +++ b/drivers/gpu/drm/xe/xe_device_types.h
> @@ -504,6 +504,9 @@ struct xe_device {
>  		bool inconsistent_reset;
>  	} wedged;
>
> +	/** @in_recovery: Indicates if device is in recovery */
> +	atomic_t in_recovery;
> +
>  	/** @bo_device: Struct to control async free of BOs */
>  	struct xe_bo_dev {
>  		/** @bo_device.async_free: Free worker */
> diff --git a/drivers/gpu/drm/xe/xe_pci.c b/drivers/gpu/drm/xe/xe_pci.c
> index ad1e5ef2ee89..825489287f28 100644
> --- a/drivers/gpu/drm/xe/xe_pci.c
> +++ b/drivers/gpu/drm/xe/xe_pci.c
> @@ -1313,6 +1313,8 @@ static const struct dev_pm_ops xe_pm_ops = {
>  };
>  #endif
>
> +extern const struct pci_error_handlers xe_pci_error_handlers;
> +
>  static struct pci_driver xe_pci_driver = {
>  	.name = DRIVER_NAME,
>  	.id_table = pciidlist,
> @@ -1320,6 +1322,7 @@ static struct pci_driver xe_pci_driver = {
>  	.remove = xe_pci_remove,
>  	.shutdown = xe_pci_shutdown,
>  	.sriov_configure = xe_pci_sriov_configure,
> +	.err_handler = &xe_pci_error_handlers,
>  #ifdef CONFIG_PM_SLEEP
>  	.driver.pm = &xe_pm_ops,
>  #endif
> diff --git a/drivers/gpu/drm/xe/xe_pci_error.c b/drivers/gpu/drm/xe/xe_pci_error.c
> new file mode 100644
> index 000000000000..d4896a4a5014
> --- /dev/null
> +++ b/drivers/gpu/drm/xe/xe_pci_error.c
> @@ -0,0 +1,99 @@
> +// SPDX-License-Identifier: MIT
> +/*
> + * Copyright © 2026 Intel Corporation
> + */
> +#include
> +
> +#include
> +
> +#include "xe_device.h"
> +#include "xe_gt.h"
> +#include "xe_pci.h"
> +#include "xe_uc.h"
> +
> +static void xe_pci_error_handling(struct pci_dev *pdev)
> +{
> +	struct xe_device *xe = pdev_to_xe_device(pdev);
> +	struct xe_gt *gt;
> +	u8 id;
> +
> +	/* Wedge the device to prevent userspace access but don't send the event yet */
> +	atomic_set(&xe->wedged.flag, 1);
> +
> +	for_each_gt(gt, xe, id)
> +		xe_gt_declare_wedged(gt);
> +
> +	pci_disable_device(pdev);
> +}
> +
> +static pci_ers_result_t xe_pci_error_detected(struct pci_dev *pdev, pci_channel_state_t state)
> +{
> +	struct xe_device *xe = pdev_to_xe_device(pdev);
> +
> +	dev_err(&pdev->dev, "Xe Pci error recovery: error detected state %d\n", state);
> +
> +	xe_device_set_in_recovery(xe);

This looks similar to wedged.flag. If we instead stop exec queues and
cancel/flush all pending work properly, perhaps we won't need this.

Let me explore what can be done here.

> +	switch (state) {
> +	case pci_channel_io_normal:
> +		return PCI_ERS_RESULT_CAN_RECOVER;
> +	case pci_channel_io_frozen:
> +		xe_pci_error_handling(pdev);
> +		return PCI_ERS_RESULT_NEED_RESET;
> +	case pci_channel_io_perm_failure:
> +		return PCI_ERS_RESULT_DISCONNECT;
> +	default:
> +		dev_err(&pdev->dev, "Unknown state %d\n", state);
> +		return PCI_ERS_RESULT_NEED_RESET;
> +	}
> +}
> +
> +static pci_ers_result_t xe_pci_error_mmio_enabled(struct pci_dev *pdev)
> +{
> +	dev_err(&pdev->dev, "Xe Pci error recovery: MMIO enabled\n");
> +
> +	return PCI_ERS_RESULT_NEED_RESET;
> +}
> +
> +static pci_ers_result_t xe_pci_error_slot_reset(struct pci_dev *pdev)
> +{
> +	const struct pci_device_id *ent = pci_match_id(pdev->driver->id_table, pdev);
> +	struct xe_device *xe = pdev_to_xe_device(pdev);
> +
> +	dev_err(&pdev->dev, "Xe Pci error recovery: Slot reset\n");
> +
> +	pci_restore_state(pdev);
> +
> +	if (pci_enable_device(pdev)) {
> +		dev_err(&pdev->dev,
> +			"Cannot re-enable PCI device after reset\n");
> +		return PCI_ERS_RESULT_DISCONNECT;
> +	}
> +
> +	/*
> +	 * Secondary Bus Reset wipes out all device memory
> +	 * requiring XE KMD to perform a device removal and reprobe.
> +	 */
> +	pdev->driver->remove(pdev);

A bit fishy, but does the job for now ;)

Raag

> +	if (!pdev->driver->probe(pdev, ent))
> +		return PCI_ERS_RESULT_RECOVERED;
> +
> +	return PCI_ERS_RESULT_DISCONNECT;
> +}
> +
> +static void xe_pci_error_resume(struct pci_dev *pdev)
> +{
> +	struct xe_device *xe = pdev_to_xe_device(pdev);
> +
> +	dev_info(&pdev->dev, "Xe Pci error recovery: Recovered\n");
> +
> +	xe_device_clear_in_recovery(xe);
> +}
> +
> +const struct pci_error_handlers xe_pci_error_handlers = {
> +	.error_detected = xe_pci_error_detected,
> +	.mmio_enabled = xe_pci_error_mmio_enabled,
> +	.slot_reset = xe_pci_error_slot_reset,
> +	.resume = xe_pci_error_resume,
> +};
> --
> 2.47.1
>
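
For anyone following the remove-and-reprobe discussion, here is a minimal
userspace sketch of the control flow in the quoted slot_reset callback:
re-enable the device, tear the driver state down, probe again, and report
the outcome. Everything named sketch_* is hypothetical and exists only
for illustration. The struct fields simulate failures and count calls;
they do not correspond to anything in the xe driver or the PCI core.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical stand-ins for the kernel's pci_ers_result_t values. */
enum sketch_ers_result {
	SKETCH_ERS_RESULT_DISCONNECT,
	SKETCH_ERS_RESULT_RECOVERED,
};

/* Hypothetical device model used only by this sketch. */
struct sketch_dev {
	bool enable_fails; /* simulate the re-enable step failing */
	bool probe_fails;  /* simulate the driver's probe failing */
	bool bound;        /* true while driver state exists */
	int remove_calls;
	int probe_calls;
};

static int sketch_enable(const struct sketch_dev *dev)
{
	return dev->enable_fails ? -1 : 0;
}

static void sketch_remove(struct sketch_dev *dev)
{
	dev->remove_calls++;
	dev->bound = false;
}

static int sketch_probe(struct sketch_dev *dev)
{
	dev->probe_calls++;
	if (dev->probe_fails)
		return -1;
	dev->bound = true;
	return 0;
}

/* Same ordering as the quoted callback: enable, remove, probe. */
static enum sketch_ers_result sketch_slot_reset(struct sketch_dev *dev)
{
	assert(dev != NULL);

	if (sketch_enable(dev))
		return SKETCH_ERS_RESULT_DISCONNECT;

	/* The reset wiped device state, so rebuild it from scratch. */
	sketch_remove(dev);
	if (!sketch_probe(dev))
		return SKETCH_ERS_RESULT_RECOVERED;

	return SKETCH_ERS_RESULT_DISCONNECT;
}
```

In this model a failed probe leaves the device unbound and reports
disconnect, which matches the v2 changelog item about returning
PCI_ERS_RESULT_DISCONNECT if probe fails. A failed re-enable returns
before remove is ever called.
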