From: Raag Jadav <raag.jadav@intel.com>
To: Mallesh Koujalagi <mallesh.koujalagi@intel.com>
Cc: intel-xe@lists.freedesktop.org, dri-devel@lists.freedesktop.org,
rodrigo.vivi@intel.com, andrealmeid@igalia.com,
christian.koenig@amd.com, airlied@gmail.com,
simona.vetter@ffwll.ch, mripard@kernel.org,
maarten.lankhorst@linux.intel.com, tzimmermann@suse.de,
anshuman.gupta@intel.com, badal.nilawar@intel.com,
riana.tauro@intel.com, karthik.poosa@intel.com,
sk.anirban@intel.com
Subject: Re: [PATCH v8 5/6] drm/xe: Suppress Surprise Link Down on device
Date: Thu, 18 Jun 2026 15:24:53 +0200
Message-ID: <ajPxpevzpsUemOaL@black.igk.intel.com>
In-Reply-To: <20260612080722.26726-13-mallesh.koujalagi@intel.com>

On Fri, Jun 12, 2026 at 01:37:28PM +0530, Mallesh Koujalagi wrote:
> PUNIT errors can only be recovered using a power-cycle. Xe KMD
> sends a uevent to notify userspace to trigger a power cycle.
> On some platforms, the link drop caused by powering the device off and
> back on is reported by hardware as a Surprise Link Down (SLD), which
> AER then escalates as an Uncorrectable Fatal Error. That error fires
> before the device finishes coming back up and defeats the very
> recovery we are attempting.
>
> To keep the expected, recovery-induced link drop from being raised as
> a fatal AER event, mask the Surprise Link Down bit
> (PCI_ERR_UNC_SURPDN) in the upstream port's AER Uncorrectable Error
> Mask register before punit_error_handler() requests the cold reset.
>
> Signed-off-by: Mallesh Koujalagi <mallesh.koujalagi@intel.com>
> ---
> v6:
> - Expand commit message to explain why SUR_DN is masked. (Raag/Riana)
> - Check Slot Implemented bit before reading Slot Capabilities, per
> PCIe spec. (Riana)
> - Add debug log.
>
> v7:
> - Handle surprise link down event properly. (Aravind/Riana)
> - Update commit message. (Riana)
> - Correct log message.
>
> v8:
> - Use find_usp_dev() in punit_error_handler() function.
> ---
> drivers/gpu/drm/xe/xe_ras.c | 65 ++++++++++++++++++++++++++++---------
> 1 file changed, 49 insertions(+), 16 deletions(-)
>
> diff --git a/drivers/gpu/drm/xe/xe_ras.c b/drivers/gpu/drm/xe/xe_ras.c
> index 93a56a0269f1..15c2fa0d323a 100644
> --- a/drivers/gpu/drm/xe/xe_ras.c
> +++ b/drivers/gpu/drm/xe/xe_ras.c
> @@ -209,8 +209,57 @@ static enum xe_ras_recovery_action handle_core_compute_errors(struct xe_ras_erro
>  	return XE_RAS_RECOVERY_ACTION_RECOVERED;
>  }
>
> +static struct pci_dev *find_usp_dev(struct pci_dev *pdev)
> +{
> +	struct pci_dev *vsp;
> +
> +	/*
> +	 * Device Hierarchy:
> +	 *
> +	 * Upstream Switch Port (USP) --> Virtual Switch Port (VSP) --> SGunit (GPU endpoint)
> +	 */
> +	vsp = pci_upstream_bridge(pdev);
> +	if (!vsp)
> +		return NULL;
> +
> +	return pci_upstream_bridge(vsp);
> +}

Unneeded churn. Please make sure the function is already at the top of
the file in the patch that introduces it, so it need not be moved here.

> +#ifdef CONFIG_PCIEAER
> +static void pcie_suppress_surprise_link_down(struct pci_dev *usp)
> +{
> +	u32 aer_uncorr_mask;
> +	u16 aer_cap;
> +
> +	aer_cap = usp->aer_cap;
> +	if (!aer_cap) {
> +		dev_dbg(&usp->dev,
> +			"AER capability not present\n");
> +		return;
> +	}
> +
> +	pci_read_config_dword(usp, aer_cap + PCI_ERR_UNCOR_MASK, &aer_uncorr_mask);
> +	aer_uncorr_mask |= PCI_ERR_UNC_SURPDN;
> +	pci_write_config_dword(usp, aer_cap + PCI_ERR_UNCOR_MASK, aer_uncorr_mask);
> +	dev_dbg(&usp->dev, "Surprise Link Down masked for cold reset\n");

What happens when the device comes back up after a successful recovery?
Do we need to unmask it?
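
To make the question concrete, here is a minimal userspace sketch of a
mask/restore pair. It is not kernel code and is untested against the
series: config space is modelled as a plain byte array, and fake_port,
cfg_read32(), cfg_write32(), sld_mask_save() and sld_mask_restore() are
hypothetical names that exist only for illustration. Only the
PCI_ERR_UNCOR_MASK offset and the PCI_ERR_UNC_SURPDN bit follow
include/uapi/linux/pci_regs.h. It also assumes the mask register keeps
its value across the power cycle, which may not hold on real hardware.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Values as defined in include/uapi/linux/pci_regs.h */
#define PCI_ERR_UNCOR_MASK	0x08		/* Uncorrectable Error Mask */
#define PCI_ERR_UNC_SURPDN	0x00000020	/* Surprise Down */

#define FAKE_CFG_SIZE		4096

/* Hypothetical stand-in for struct pci_dev: config space is just memory. */
struct fake_port {
	uint8_t cfg[FAKE_CFG_SIZE];
	uint16_t aer_cap;	/* offset of the AER capability, 0 if absent */
	bool sld_masked_by_us;	/* true only if this code set the bit */
};

static uint32_t cfg_read32(const struct fake_port *p, size_t off)
{
	uint32_t v;

	assert(off + sizeof(v) <= FAKE_CFG_SIZE);
	memcpy(&v, &p->cfg[off], sizeof(v));
	return v;
}

static void cfg_write32(struct fake_port *p, size_t off, uint32_t v)
{
	assert(off + sizeof(v) <= FAKE_CFG_SIZE);
	memcpy(&p->cfg[off], &v, sizeof(v));
}

/* Mask Surprise Link Down and remember whether the bit was clear before. */
bool sld_mask_save(struct fake_port *p)
{
	uint32_t mask;

	if (!p->aer_cap)
		return false;

	mask = cfg_read32(p, p->aer_cap + PCI_ERR_UNCOR_MASK);
	p->sld_masked_by_us = !(mask & PCI_ERR_UNC_SURPDN);
	cfg_write32(p, p->aer_cap + PCI_ERR_UNCOR_MASK,
		    mask | PCI_ERR_UNC_SURPDN);
	return true;
}

/* Unmask again, but only if sld_mask_save() was the one to set the bit. */
void sld_mask_restore(struct fake_port *p)
{
	uint32_t mask;

	if (!p->aer_cap || !p->sld_masked_by_us)
		return;

	mask = cfg_read32(p, p->aer_cap + PCI_ERR_UNCOR_MASK);
	cfg_write32(p, p->aer_cap + PCI_ERR_UNCOR_MASK,
		    mask & ~(uint32_t)PCI_ERR_UNC_SURPDN);
	p->sld_masked_by_us = false;
}
```

The point of tracking sld_masked_by_us is that a bit which firmware or
the platform had already masked stays masked after the restore. Where
such a restore would be called from in the driver, if it is needed at
all, is the open question.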

Raag

> +}
> +#endif /* CONFIG_PCIEAER */
> +
>  static void punit_error_handler(struct xe_device *xe)
>  {
> +#ifdef CONFIG_PCIEAER
> +	struct pci_dev *pdev = to_pci_dev(xe->drm.dev);
> +	struct pci_dev *usp;
> +
> +	/*
> +	 * Cold reset power-cycles the slot, dropping the PCIe link. The
> +	 * slot triggers a spurious Surprise Link Down AER event on the USP.
> +	 */
> +	usp = find_usp_dev(pdev);
> +
> +	if (usp)
> +		pcie_suppress_surprise_link_down(usp);
> +#endif
>  	xe_device_set_wedged_method(xe, DRM_WEDGE_RECOVERY_COLD_RESET);
>  	xe_device_declare_wedged(xe);
>  }
> @@ -503,22 +552,6 @@ enum xe_ras_recovery_action xe_ras_process_errors(struct xe_device *xe)
>  	return XE_RAS_RECOVERY_ACTION_RESET;
>  }
>
> -static struct pci_dev *find_usp_dev(struct pci_dev *pdev)
> -{
> -	struct pci_dev *vsp;
> -
> -	/*
> -	 * Device Hierarchy:
> -	 *
> -	 * Upstream Switch Port (USP) --> Virtual Switch Port (VSP) --> SGunit (GPU endpoint)
> -	 */
> -	vsp = pci_upstream_bridge(pdev);
> -	if (!vsp)
> -		return NULL;
> -
> -	return pci_upstream_bridge(vsp);
> -}
> -
>  static void aer_unmask_and_downgrade_internal_error(struct xe_device *xe)
>  {
>  	struct pci_dev *pdev = to_pci_dev(xe->drm.dev);
> --
> 2.34.1
>