Re: [PATCH v8 04/15] drm/xe/xe_pci_error: Implement PCI error recovery callbacks

All of lore.kernel.org
 help / color / mirror / Atom feed

From: Raag Jadav <raag.jadav@intel.com>
To: Riana Tauro <riana.tauro@intel.com>
Cc: intel-xe@lists.freedesktop.org, anshuman.gupta@intel.com,
	rodrigo.vivi@intel.com, aravind.iddamsetty@linux.intel.com,
	badal.nilawar@intel.com, ravi.kishore.koppuravuri@intel.com,
	mallesh.koujalagi@intel.com, soham.purkait@intel.com,
	Michal Wajdeczko <michal.wajdeczko@intel.com>,
	Matthew Brost <matthew.brost@intel.com>,
	Matt Roper <matthew.d.roper@intel.com>
Subject: Re: [PATCH v8 04/15] drm/xe/xe_pci_error: Implement PCI error recovery callbacks
Date: Fri, 19 Jun 2026 12:47:11 +0200	[thread overview]
Message-ID: <ajUeL4DutqTpg3HS@black.igk.intel.com> (raw)
In-Reply-To: <20260608084700.640376-21-riana.tauro@intel.com>

On Mon, Jun 08, 2026 at 02:17:05PM +0530, Riana Tauro wrote:
> Add error_detected, mmio_enabled, slot_reset and resume recovery callbacks
> to handle PCIe Advanced Error Reporting (AER) errors.
> 
> For fatal errors, the device is wedged and becomes inaccessible. Return
> PCI_ERS_RESULT_NEED_RESET from error_detected to request a Secondary
> Bus Reset (SBR).
> 
> For non-fatal errors, return PCI_ERS_RESULT_CAN_RECOVER from
> error_detected to trigger the mmio_enabled callback. In this callback, the
> device is queried to determine the error cause and attempt recovery based
> on the error type.
> 
> Once the secondary bus reset(SBR) is completed the slot_reset callback
> cleanly removes and reprobe the device to restore functionality.

...

> +	/*
> +	 * Secondary Bus Reset causes all VRAM state to be lost along with
> +	 * hardware state. As an initial step, re-probe the device to
> +	 * re-initialize the driver and hardware.
> +	 * TODO: optimize by re-initializing only the hardware state and re-creating
> +	 * kernel BOs.
> +	 */
> +	pdev->driver->remove(pdev);

Curious, how does this effect drm_ras nodes? Do they persist?
If no, is it reasonable to have them recreated on every single error?

Raag

> +	if (pdev->driver->probe(pdev, ent))
> +		return PCI_ERS_RESULT_DISCONNECT;

next prev parent reply	other threads:[~2026-06-19 10:47 UTC|newest]

Thread overview: 26+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2026-06-08  8:47 [PATCH v8 00/15] Introduce Xe Uncorrectable Error Handling Riana Tauro
2026-06-08  8:47 ` [PATCH v8 01/15] drm/xe/xe_survivability: Decouple survivability info from boot survivability Riana Tauro
2026-06-08  8:47 ` [PATCH v8 02/15] drm/xe/xe_sysctrl: Make sysctrl flood limit reusable Riana Tauro
2026-06-08  8:47 ` [PATCH v8 03/15] drm/xe: Improve wedged state management Riana Tauro
2026-06-08  8:47 ` [PATCH v8 04/15] drm/xe/xe_pci_error: Implement PCI error recovery callbacks Riana Tauro
2026-06-19 10:47   ` Raag Jadav [this message]
2026-06-19 11:22     ` Tauro, Riana
2026-06-08  8:47 ` [PATCH v8 05/15] drm/xe/xe_pci_error: Group all devres to release them on PCIe slot reset Riana Tauro
2026-06-08  8:47 ` [PATCH v8 06/15] drm/xe: Skip device access during PCI error recovery Riana Tauro
2026-06-08  8:47 ` [PATCH v8 07/15] drm/xe/xe_ras: Initialize Uncorrectable AER Registers Riana Tauro
2026-06-08  8:47 ` [PATCH v8 08/15] drm/xe/xe_ras: Add support for uncorrectable core-compute errors Riana Tauro
2026-06-12  1:43   ` Mallesh, Koujalagi
2026-06-08  8:47 ` [PATCH v8 09/15] drm/xe/xe_ras: Handle uncorrectable SoC Internal errors Riana Tauro
2026-06-08  8:47 ` [PATCH v8 10/15] drm/xe/xe_ras: Query errors from system controller on probe Riana Tauro
2026-06-08  8:47 ` [PATCH v8 11/15] drm/xe/xe_pci_error: Process errors in mmio_enabled Riana Tauro
2026-06-08 10:18   ` Mallesh, Koujalagi
2026-06-08  8:47 ` [PATCH v8 12/15] drm/xe/xe_ras: Add support to query device memory errors Riana Tauro
2026-06-08  8:47 ` [PATCH v8 13/15] drm/xe/xe_ras: Add support to query page offline queue and list Riana Tauro
2026-06-08  8:47 ` [RFC PATCH v8 14/15] drm/xe/xe_ras: Add support to offline and decline a page address Riana Tauro
2026-06-08  8:47 ` [RFC PATCH v8 15/15] drm/xe/xe_ras: Process pages from offlined list and queue Riana Tauro
2026-06-08 12:50 ` ✗ CI.checkpatch: warning for Introduce Xe Uncorrectable Error Handling (rev8) Patchwork
2026-06-08 12:52 ` ✓ CI.KUnit: success " Patchwork
2026-06-09  5:28 ` ✗ CI.checkpatch: warning for Introduce Xe Uncorrectable Error Handling (rev9) Patchwork
2026-06-09  5:29 ` ✓ CI.KUnit: success " Patchwork
2026-06-09  6:07 ` ✓ Xe.CI.BAT: " Patchwork
2026-06-09 14:53 ` ✗ Xe.CI.FULL: failure " Patchwork

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=ajUeL4DutqTpg3HS@black.igk.intel.com \
    --to=raag.jadav@intel.com \
    --cc=anshuman.gupta@intel.com \
    --cc=aravind.iddamsetty@linux.intel.com \
    --cc=badal.nilawar@intel.com \
    --cc=intel-xe@lists.freedesktop.org \
    --cc=mallesh.koujalagi@intel.com \
    --cc=matthew.brost@intel.com \
    --cc=matthew.d.roper@intel.com \
    --cc=michal.wajdeczko@intel.com \
    --cc=ravi.kishore.koppuravuri@intel.com \
    --cc=riana.tauro@intel.com \
    --cc=rodrigo.vivi@intel.com \
    --cc=soham.purkait@intel.com \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link

Be sure your reply has a Subject: header at the top and a blank line before the message body.

This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.