All of lore.kernel.org
 help / color / mirror / Atom feed
From: sashiko-bot@kernel.org
To: "Mallesh Koujalagi" <mallesh.koujalagi@intel.com>
Cc: dri-devel@lists.freedesktop.org
Subject: Re: [PATCH v7 5/6] drm/xe: Suppress Surprise Link Down on device
Date: Fri, 05 Jun 2026 12:41:20 +0000	[thread overview]
Message-ID: <20260605124121.316BF1F00893@smtp.kernel.org> (raw)
In-Reply-To: <20260605122544.1021695-13-mallesh.koujalagi@intel.com>

Thank you for your contribution! Sashiko AI review found 4 potential issue(s) to consider:
- [High] The driver permanently masks the Surprise Link Down AER event (PCI_ERR_UNC_SURPDN) on the upstream switch port without ever restoring it.
- [High] Lockless read-modify-write of the shared Upstream Switch Port's AER mask register creates a race condition.
- [High] Unchecked pci_read_config_dword return value may corrupt the AER mask register.
- [High] Attempting to mask Surprise Link Down on an Upstream Port violates the PCIe specification and fails to suppress the error.
--

commit b4b812ffca5eaea995b148e8a053e6c4fd45bfe9
Author: Mallesh Koujalagi <mallesh.koujalagi@intel.com>

drm/xe: Suppress Surprise Link Down on device

This patch attempts to suppress spurious Surprise Link Down AER events during
a recovery cold reset. It does so by masking the PCI_ERR_UNC_SURPDN bit in
the upstream switch port's AER Uncorrectable Error Mask register before
triggering the cold reset.

> diff --git a/drivers/gpu/drm/xe/xe_ras.c b/drivers/gpu/drm/xe/xe_ras.c
> index 9a90a7118e896..acdedf403649f 100644
> --- a/drivers/gpu/drm/xe/xe_ras.c
> +++ b/drivers/gpu/drm/xe/xe_ras.c

[ ... ]

> +#ifdef CONFIG_PCIEAER
> +static void pcie_suppress_surprise_link_down(struct pci_dev *usp)
> +{
> +	u32 aer_uncorr_mask;
> +	u16 aer_cap;
> +
> +	aer_cap = usp->aer_cap;
> +	if (!aer_cap) {
> +		dev_dbg(&usp->dev,
> +			"AER capability not present\n");
> +		return;
> +	}
> +
> +	pci_read_config_dword(usp, aer_cap + PCI_ERR_UNCOR_MASK, &aer_uncorr_mask);

[Severity: High]
Does this code risk corrupting the AER mask register if the read fails?

If pci_read_config_dword() fails, for example due to temporary bus
unavailability, it can leave aer_uncorr_mask uninitialized or set it to all
1s. Writing that value back could inadvertently mask all uncorrectable errors
or write garbage to the register.

> +	aer_uncorr_mask |= PCI_ERR_UNC_SURPDN;
> +	pci_write_config_dword(usp, aer_cap + PCI_ERR_UNCOR_MASK, aer_uncorr_mask);

[Severity: High]
Is it safe to perform a lockless read-modify-write on a shared parent
bridge's configuration space?

Since the upstream switch port is outside the xe driver's ownership, if the
AER core or another driver concurrently modifies this mask register without
holding pci_dev_lock() on the upstream port, could one of the updates be
silently lost?

[Severity: High]
Will this permanently mask the Surprise Link Down AER event on the upstream
switch port?

The configuration space of the port above the cold reset boundary isn't
reset by the slot power cycle, and there does not appear to be any
corresponding code to unmask this bit later. Could this permanently blind
the port to legitimate Surprise Link Down errors until the next system
reboot?

> +	dev_dbg(&usp->dev, "Surprise Link Down masked for cold reset\n");
> +}
> +#endif /* CONFIG_PCIEAER */
> +
> +static void punit_error_handler(struct xe_device *xe)
> +{
> +#ifdef CONFIG_PCIEAER
> +	struct pci_dev *pdev = to_pci_dev(xe->drm.dev);
> +	struct pci_dev *vsp, *usp;
> +
> +	/*
> +	 * Device Hierarchy:
> +	 *
> +	 * Root Port --> Upstream Switch Port (USP) --> Virtual Switch Port (VSP) --> SGunit
> +	 *
> +	 * Cold reset power-cycles the slot, dropping the PCIe link. The
> +	 * slot triggers a spurious Surprise Link Down AER event on the USP.
> +	 */
> +	vsp = pci_upstream_bridge(pdev);
> +	usp = vsp ? pci_upstream_bridge(vsp) : NULL;
> +
> +	if (usp)
> +		pcie_suppress_surprise_link_down(usp);

[Severity: High]
Will masking the Surprise Link Down bit on the Upstream Switch Port actually
suppress the intended error?

According to the PCIe Base Specification, the Surprise Down Error Status bit
is mandatory for Downstream Ports and reserved for all other port types.

Since the Downstream Port (vsp) facing the slot is what actually detects the
link drop and triggers the AER event, does writing to a reserved bit in the
upstream port's config space fail to suppress the spurious AER event?

> +#endif
>  	xe_device_set_wedged_method(xe, DRM_WEDGE_RECOVERY_COLD_RESET);
>  	xe_device_declare_wedged(xe);
>  }

-- 
Sashiko AI review · https://sashiko.dev/#/patchset/20260605122544.1021695-8-mallesh.koujalagi@intel.com?part=5

  reply	other threads:[~2026-06-05 12:41 UTC|newest]

Thread overview: 13+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2026-06-05 12:25 [PATCH v7 0/6] Introduce cold reset recovery method Mallesh Koujalagi
2026-06-05 12:25 ` [PATCH v7 1/6] Introduce Xe Uncorrectable Error Handling Mallesh Koujalagi
2026-06-05 12:43   ` sashiko-bot
2026-06-05 12:25 ` [PATCH v7 2/6] drm: Add DRM_WEDGE_RECOVERY_COLD_RESET recovery method Mallesh Koujalagi
2026-06-05 12:25 ` [PATCH v7 3/6] drm/doc: Document " Mallesh Koujalagi
2026-06-08  6:30   ` Raag Jadav
2026-06-05 12:25 ` [PATCH v7 4/6] drm/xe: Handle PUNIT errors by requesting cold-reset recovery Mallesh Koujalagi
2026-06-05 12:48   ` sashiko-bot
2026-06-05 12:25 ` [PATCH v7 5/6] drm/xe: Suppress Surprise Link Down on device Mallesh Koujalagi
2026-06-05 12:41   ` sashiko-bot [this message]
2026-06-05 12:25 ` [PATCH v7 6/6] drm/xe/ras: Add debugfs entry to inject punit error Mallesh Koujalagi
2026-06-05 13:00 ` ✗ CI.checkpatch: warning for Introduce cold reset recovery method (rev7) Patchwork
2026-06-05 13:00 ` ✗ CI.KUnit: failure " Patchwork

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=20260605124121.316BF1F00893@smtp.kernel.org \
    --to=sashiko-bot@kernel.org \
    --cc=dri-devel@lists.freedesktop.org \
    --cc=mallesh.koujalagi@intel.com \
    --cc=sashiko-reviews@lists.linux.dev \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line before the message body.
This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.