From: sashiko-bot@kernel.org
To: "Mallesh Koujalagi" <mallesh.koujalagi@intel.com>
Cc: dri-devel@lists.freedesktop.org
Subject: Re: [PATCH v8 5/6] drm/xe: Suppress Surprise Link Down on device
Date: Fri, 12 Jun 2026 08:21:07 +0000 [thread overview]
Message-ID: <20260612082107.8B5531F00A3A@smtp.kernel.org> (raw)
In-Reply-To: <20260612080722.26726-13-mallesh.koujalagi@intel.com>
Thank you for your contribution! Sashiko AI review found 5 potential issue(s) to consider:
- [High] Masking Surprise Link Down on the Upstream Switch Port violates PCIe AER specifications.
- [High] Directly mutating AER registers without verifying OS native AER ownership.
- [High] Permanent masking of PCI_ERR_UNC_SURPDN with no restoration mechanism.
- [High] Blind PCIe hierarchy traversal modifies unrelated upstream bridges.
- [Medium] Data race due to unprotected read-modify-write on a shared AER mask register.
--
commit 7a2dc1d13a9370ceb4dd62a9d15aa26d2212058a
Author: Mallesh Koujalagi <mallesh.koujalagi@intel.com>
drm/xe: Suppress Surprise Link Down on device
This commit attempts to prevent fatal AER events during a recovery-induced
cold reset. It masks the Surprise Link Down (SLD) bit in the upstream
port's AER Uncorrectable Error Mask register before triggering the reset.
> diff --git a/drivers/gpu/drm/xe/xe_ras.c b/drivers/gpu/drm/xe/xe_ras.c
> index 93a56a0269f1e..15c2fa0d323a6 100644
> --- a/drivers/gpu/drm/xe/xe_ras.c
> +++ b/drivers/gpu/drm/xe/xe_ras.c
> @@ -209,8 +209,57 @@ static enum xe_ras_recovery_action handle_core_compute_errors(struct xe_ras_erro
> return XE_RAS_RECOVERY_ACTION_RECOVERED;
> }
>
> +static struct pci_dev *find_usp_dev(struct pci_dev *pdev)
> +{
> + struct pci_dev *vsp;
> +
> + /*
> + * Device Hierarchy:
> + *
> + * Upstream Switch Port (USP) --> Virtual Switch Port (VSP) --> SGunit (GPU endpoint)
> + */
> + vsp = pci_upstream_bridge(pdev);
> + if (!vsp)
> + return NULL;
> +
> + return pci_upstream_bridge(vsp);
[Severity: High]
Does this blind traversal safely handle diverse PCIe topologies? If the GPU
is connected to a standard motherboard switch, passed through a hypervisor, or
connected directly to a Root Port, might this return an unrelated bridge
device owned by the host system? Should this verify the Vendor ID or Device
ID of the USP before proceeding?
> +}
> +
> +#ifdef CONFIG_PCIEAER
> +static void pcie_suppress_surprise_link_down(struct pci_dev *usp)
> +{
> + u32 aer_uncorr_mask;
> + u16 aer_cap;
> +
> + aer_cap = usp->aer_cap;
> + if (!aer_cap) {
> + dev_dbg(&usp->dev,
> + "AER capability not present\n");
> + return;
> + }
> +
> + pci_read_config_dword(usp, aer_cap + PCI_ERR_UNCOR_MASK, &aer_uncorr_mask);
[Severity: Medium]
Since the Upstream Switch Port is a shared bridge managed by the pcieport
driver, could this lockless read-modify-write sequence race with the AER
driver handling other hotplug events, or with other endpoints on the same
switch experiencing concurrent errors?
> + aer_uncorr_mask |= PCI_ERR_UNC_SURPDN;
[Severity: High]
Is there a mechanism to unmask this bit after the cold reset completes? If
the Surprise Link Down bit is set here but never cleared during driver
probe, resume, or a post-reset callback, will it permanently disable AER
reporting for genuine link failures on this port?
> + pci_write_config_dword(usp, aer_cap + PCI_ERR_UNCOR_MASK, aer_uncorr_mask);
[Severity: High]
According to the PCIe Base Specification, Surprise Down errors are exclusively
detected and reported by Downstream Ports. On Upstream Ports, the Surprise
Down Error Status and Mask bits are reserved and hardwired to 0. Does masking
PCI_ERR_UNC_SURPDN on the Upstream Switch Port actually prevent the downstream
port (the VSP or Root Port) that detects the link drop from generating the
fatal AER event?
[Severity: High]
Is it safe to mutate the AER mask registers directly without verifying OS
native AER ownership? On platforms where firmware retains control of AER
(such as ACPI APEI Firmware First setups), could mutating these registers
directly from the OS conflict with platform management firmware? Is a check
like pcie_aer_is_native(usp) needed here?
> + dev_dbg(&usp->dev, "Surprise Link Down masked for cold reset\n");
> +}
> +#endif /* CONFIG_PCIEAER */
[ ... ]
--
Sashiko AI review · https://sashiko.dev/#/patchset/20260612080722.26726-8-mallesh.koujalagi@intel.com?part=5
next prev parent reply other threads:[~2026-06-12 8:21 UTC|newest]
Thread overview: 17+ messages / expand[flat|nested] mbox.gz Atom feed top
2026-06-12 8:07 [PATCH v8 0/6] Introduce cold reset recovery method Mallesh Koujalagi
2026-06-12 8:07 ` [PATCH v8 1/6] Introduce Xe Uncorrectable Error Handling Mallesh Koujalagi
2026-06-12 8:24 ` sashiko-bot
2026-06-12 8:07 ` [PATCH v8 2/6] drm: Add DRM_WEDGE_RECOVERY_COLD_RESET recovery method Mallesh Koujalagi
2026-06-12 8:07 ` [PATCH v8 3/6] drm/doc: Document " Mallesh Koujalagi
2026-06-12 8:07 ` [PATCH v8 4/6] drm/xe: Handle PUNIT errors by requesting cold-reset recovery Mallesh Koujalagi
2026-06-12 8:27 ` sashiko-bot
2026-06-12 8:07 ` [PATCH v8 5/6] drm/xe: Suppress Surprise Link Down on device Mallesh Koujalagi
2026-06-12 8:21 ` sashiko-bot [this message]
2026-06-15 8:06 ` Tauro, Riana
2026-06-18 13:24 ` Raag Jadav
2026-06-12 8:07 ` [PATCH v8 6/6] drm/xe/ras: Add debugfs entry to inject punit error Mallesh Koujalagi
2026-06-12 8:23 ` sashiko-bot
2026-06-12 8:16 ` ✗ CI.checkpatch: warning for Introduce cold reset recovery method (rev8) Patchwork
2026-06-12 8:18 ` ✓ CI.KUnit: success " Patchwork
2026-06-12 9:03 ` ✓ Xe.CI.BAT: " Patchwork
2026-06-13 1:18 ` ✓ Xe.CI.FULL: " Patchwork
Reply instructions:
You may reply publicly to this message via plain-text email
using any one of the following methods:
* Save the following mbox file, import it into your mail client,
and reply-to-all from there: mbox
Avoid top-posting and favor interleaved quoting:
https://en.wikipedia.org/wiki/Posting_style#Interleaved_style
* Reply using the --to, --cc, and --in-reply-to
switches of git-send-email(1):
git send-email \
--in-reply-to=20260612082107.8B5531F00A3A@smtp.kernel.org \
--to=sashiko-bot@kernel.org \
--cc=dri-devel@lists.freedesktop.org \
--cc=mallesh.koujalagi@intel.com \
--cc=sashiko-reviews@lists.linux.dev \
/path/to/YOUR_REPLY
https://kernel.org/pub/software/scm/git/docs/git-send-email.html
* If your mail client supports setting the In-Reply-To header
via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line
before the message body.
This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.