Re: [PATCH v9 5/6] drm/xe: Suppress Surprise Link Down on device

All of lore.kernel.org
 help / color / mirror / Atom feed

From: sashiko-bot@kernel.org
To: "Mallesh Koujalagi" <mallesh.koujalagi@intel.com>
Cc: dri-devel@lists.freedesktop.org
Subject: Re: [PATCH v9 5/6] drm/xe: Suppress Surprise Link Down on device
Date: Tue, 23 Jun 2026 22:47:14 +0000	[thread overview]
Message-ID: <20260623224715.43DB11F00A3A@smtp.kernel.org> (raw)
In-Reply-To: <20260622131458.144873-13-mallesh.koujalagi@intel.com>

Thank you for your contribution! Sashiko AI review found 6 potential issue(s) to consider:

New issues:
- [High] VF driver modifies physical Upstream Switch Port (USP) AER configuration, violating SR-IOV isolation.
- [High] Unconditional unmasking of Surprise Link Down overrides platform/firmware AER configurations.
- [Medium] Logic asymmetry: Surprise Link Down is masked unconditionally but unmasked conditionally.
- [Medium] Unbalanced lifecycle operation: AER mask is modified before an asynchronous userspace recovery request.
- [Medium] Race condition / TOCTOU when modifying shared USP AER registers.

Pre-existing issues:
- [High] Hardware topology assumption in find_usp_dev() may modify system bridges instead of GPU internal switches.
--

commit ddd656b7786f19f5a390609ee31b584f88c5c3be
Author: Mallesh Koujalagi <mallesh.koujalagi@intel.com>
Subject: drm/xe: Suppress Surprise Link Down on device

PUNIT errors can only be recovered using a power-cycle. Xe KMD sends a
uevent to notify userspace to trigger a power cycle. On platforms where link
drop caused by powering the device off and back on is reported by hardware
as a Surprise Link Down (SLD), which AER then escalates as an Uncorrectable
Fatal Error. That error fires before the device finishes coming back up and
defeats the very recovery we are attempting.

> diff --git a/drivers/gpu/drm/xe/xe_ras.c b/drivers/gpu/drm/xe/xe_ras.c
> index 1eb2bbaccd9bd..5a2fee5a13083 100644
> --- a/drivers/gpu/drm/xe/xe_ras.c
> +++ b/drivers/gpu/drm/xe/xe_ras.c
> @@ -256,8 +256,37 @@ static struct pci_dev *find_usp_dev(struct pci_dev *pdev)
>  	return pci_upstream_bridge(vsp);
>  }

[Severity: High]
This is a pre-existing issue, but does find_usp_dev() make an unchecked
assumption about the hardware topology? If the GPU is not connected via an
internal PCIe switch, navigating two levels up could return a generic
motherboard root port or external switch port. Modifying the AER registers
of this shared system bridge might affect error handling for other connected
devices.

>  
> +static void pcie_suppress_surprise_link_down(struct pci_dev *usp)
> +{
> +	u16 aer_cap;
> +
> +	/*
> +	 * Cold reset power-cycles the slot, dropping the PCIe link.
> +	 * This triggers a spurious Surprise Link Down AER event on the
> +	 * Upstream Switch Port (USP). Mask this error to avoid false
> +	 * error reporting during recovery.
> +	 */
> +	aer_cap = pci_find_ext_capability(usp, PCI_EXT_CAP_ID_ERR);
> +	if (!aer_cap) {
> +		dev_dbg(&usp->dev,
> +			"AER capability not present\n");
> +		return;
> +	}
> +
> +	pci_clear_and_set_config_dword(usp, aer_cap + PCI_ERR_UNCOR_MASK, 0, PCI_ERR_UNC_SURPDN);

[Severity: Medium]
Could this lead to a race condition or TOCTOU when modifying shared USP AER
registers? pci_clear_and_set_config_dword() performs a read, modifies the
value, and writes it back without holding a synchronization lock. If multiple
endpoints on the switch experience errors concurrently, their asynchronous
error handlers might race here.

> +	dev_dbg(&usp->dev, "Surprise Link Down masked for cold reset\n");
> +}
> +
>  static void punit_error_handler(struct xe_device *xe)
>  {
> +	struct pci_dev *pdev = to_pci_dev(xe->drm.dev);
> +	struct pci_dev *usp;
> +
> +	usp = find_usp_dev(pdev);
> +
> +	if (usp)
> +		pcie_suppress_surprise_link_down(usp);

[Severity: High]
Does modifying a shared physical resource (the USP) from a Virtual Function
(VF) break SR-IOV isolation? The VF driver attempts to mask the Surprise Link
Down error on the physical USP without an IS_SRIOV_VF() check. Following
subsystem guidelines, VF drivers must not access PF-only resources and must
guard such paths.

[Severity: Medium]
Also, could this cause a logic asymmetry where the Surprise Link Down is
masked unconditionally, but unmasked conditionally? If the kernel is
compiled without CONFIG_PCIEAER, the driver will mask SLD during an error
but never unmask it upon recovery, potentially disabling SLD reporting
permanently.

> +
>  	xe_device_set_wedged_method(xe, DRM_WEDGE_RECOVERY_COLD_RESET);
>  	xe_device_declare_wedged(xe);

[Severity: Medium]
If userspace ignores the uevent, crashes, or the driver unbinds before the
reset completes, will the AER configuration remain permanently modified
without cleanup? This modifies the AER mask before an asynchronous userspace
recovery request, leaving no teardown mechanism to revert the mask on
failure.

>  }

[ ... ]

> @@ -604,20 +644,22 @@ static void aer_unmask_and_downgrade_internal_error(struct xe_device *xe)
>  	 * The Uncorrectable Error Severity Register has the 'Uncorrectable
>  	 * Internal Error Severity' set to fatal by default. Set this to
>  	 * non-fatal and unmask the error.
> +	 *
> +	 * Also restore the Surprise Link Down mask that was set in
> +	 * pcie_suppress_surprise_link_down() before the cold reset. The USP is
> +	 * never power-cycled, so the mask bit persists and must be cleared here
> +	 * to ensure link-down events are reported normally going forward.
>  	 */
>  
>  	/* Initialize Uncorrectable Error Severity Register */
> -	pci_read_config_dword(usp, aer_cap + PCI_ERR_UNCOR_SEVER, &aer_uncorr_sev);
> -	aer_uncorr_sev &= ~PCI_ERR_UNC_INTN;
> -	pci_write_config_dword(usp, aer_cap + PCI_ERR_UNCOR_SEVER, aer_uncorr_sev);
> +	pci_clear_and_set_config_dword(usp, aer_cap + PCI_ERR_UNCOR_SEVER, PCI_ERR_UNC_INTN, 0);
>  
> -	/* Initialize Uncorrectable Error Mask Register */
> -	pci_read_config_dword(usp, aer_cap + PCI_ERR_UNCOR_MASK, &aer_uncorr_mask);
> -	aer_uncorr_mask &= ~PCI_ERR_UNC_INTN;
> -	pci_write_config_dword(usp, aer_cap + PCI_ERR_UNCOR_MASK, aer_uncorr_mask);
> +	/* Unmask Uncorrectable Internal Error and restore Surprise Link Down to unmasked */
> +	pci_clear_and_set_config_dword(usp, aer_cap + PCI_ERR_UNCOR_MASK,
> +				       PCI_ERR_UNC_INTN | PCI_ERR_UNC_SURPDN, 0);

[Severity: High]
Does this unconditional unmasking of Surprise Link Down override the platform
or firmware AER configurations? Forcibly unmasking PCI_ERR_UNC_SURPDN without
restoring its original state can trigger spurious fatal AER events if the
platform firmware, BIOS, or hotplug driver intentionally masked the bit by
default to handle hardware errata.

>  
>  	pci_save_state(usp);
> -	dev_dbg(&usp->dev, "Uncorrectable Internal Errors downgraded and unmasked\n");
> +	dev_dbg(&usp->dev, "AER: INTN downgraded to non-fatal, INTN and SLD unmasked\n");
>  }
>  
>  /**

-- 
Sashiko AI review · https://sashiko.dev/#/patchset/20260622131458.144873-8-mallesh.koujalagi@intel.com?part=5

next prev parent reply	other threads:[~2026-06-23 22:47 UTC|newest]

Thread overview: 16+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2026-06-22 13:14 [PATCH v9 0/6] Introduce cold reset recovery method Mallesh Koujalagi
2026-06-22 13:15 ` [PATCH v9 1/6] Introduce Xe Uncorrectable Error Handling Mallesh Koujalagi
2026-06-23 20:54   ` sashiko-bot
2026-06-22 13:15 ` [PATCH v9 2/6] drm: Add DRM_WEDGE_RECOVERY_COLD_RESET recovery method Mallesh Koujalagi
2026-06-23 21:04   ` sashiko-bot
2026-06-22 13:15 ` [PATCH v9 3/6] drm/doc: Document " Mallesh Koujalagi
2026-06-23 21:11   ` sashiko-bot
2026-06-22 13:15 ` [PATCH v9 4/6] drm/xe: Handle PUNIT errors by requesting cold-reset recovery Mallesh Koujalagi
2026-06-22 13:15 ` [PATCH v9 5/6] drm/xe: Suppress Surprise Link Down on device Mallesh Koujalagi
2026-06-23 22:47   ` sashiko-bot [this message]
2026-06-22 13:15 ` [PATCH v9 6/6] drm/xe/ras: Use fault-inject to trigger punit error handler Mallesh Koujalagi
2026-06-23 22:47   ` sashiko-bot
2026-06-22 17:20 ` ✗ CI.checkpatch: warning for Introduce cold reset recovery method (rev9) Patchwork
2026-06-22 17:22 ` ✓ CI.KUnit: success " Patchwork
2026-06-22 18:09 ` ✓ Xe.CI.BAT: " Patchwork
2026-06-22 22:03 ` ✗ Xe.CI.FULL: failure " Patchwork

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=20260623224715.43DB11F00A3A@smtp.kernel.org \
    --to=sashiko-bot@kernel.org \
    --cc=dri-devel@lists.freedesktop.org \
    --cc=mallesh.koujalagi@intel.com \
    --cc=sashiko-reviews@lists.linux.dev \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link

Be sure your reply has a Subject: header at the top and a blank line before the message body.

This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.