All of lore.kernel.org
 help / color / mirror / Atom feed
From: Raag Jadav <raag.jadav@intel.com>
To: Mallesh Koujalagi <mallesh.koujalagi@intel.com>
Cc: intel-xe@lists.freedesktop.org, dri-devel@lists.freedesktop.org,
	rodrigo.vivi@intel.com, andrealmeid@igalia.com,
	christian.koenig@amd.com, airlied@gmail.com,
	simona.vetter@ffwll.ch, mripard@kernel.org,
	maarten.lankhorst@linux.intel.com, tzimmermann@suse.de,
	anshuman.gupta@intel.com, badal.nilawar@intel.com,
	riana.tauro@intel.com, karthik.poosa@intel.com,
	sk.anirban@intel.com
Subject: Re: [PATCH v5 5/5] drm/xe: Suppress Surprise Link Down on non-hotplug device
Date: Thu, 14 May 2026 10:35:57 +0200	[thread overview]
Message-ID: <agWJbWQXA9XJnool@black.igk.intel.com> (raw)
In-Reply-To: <20260512132614.1793083-12-mallesh.koujalagi@intel.com>

On Tue, May 12, 2026 at 06:56:20PM +0530, Mallesh Koujalagi wrote:
> If the slot is not hotplug capable, pcie_suppress_surprise_link_down()
> masks the Surprise Link Down bit (PCI_ERR_UNC_SURPDN) in the USP's AER
> Uncorrectable Error Mask register before punit_error_handler()
> triggers the cold reset.

Can you please elaborate on the "why" part? Is this something Intel
specific?

> Signed-off-by: Mallesh Koujalagi <mallesh.koujalagi@intel.com>
> ---
>  drivers/gpu/drm/xe/xe_ras.c | 51 +++++++++++++++++++++++++++++++++++++
>  1 file changed, 51 insertions(+)
> 
> diff --git a/drivers/gpu/drm/xe/xe_ras.c b/drivers/gpu/drm/xe/xe_ras.c
> index 604470565bf3..67b4f25370c9 100644
> --- a/drivers/gpu/drm/xe/xe_ras.c
> +++ b/drivers/gpu/drm/xe/xe_ras.c
> @@ -224,8 +224,59 @@ static enum xe_ras_recovery_action handle_core_compute_errors(struct xe_device *
>  	return XE_RAS_RECOVERY_ACTION_RECOVERED;
>  }
>  
> +#ifdef CONFIG_PCIEAER
> +static bool pcie_slot_is_hotplug_capable(struct pci_dev *usp)

Shouldn't all of it be part of xe_pci_error.c?

> +{
> +	struct pci_dev *root_port = pci_upstream_bridge(usp);
> +	u32 sltcap;
> +
> +	if (!root_port)
> +		return false;
> +
> +	if (pcie_capability_read_dword(root_port, PCI_EXP_SLTCAP, &sltcap))
> +		return false;
> +
> +	return (sltcap & (PCI_EXP_SLTCAP_HPC | PCI_EXP_SLTCAP_PCP)) ==
> +		(PCI_EXP_SLTCAP_HPC | PCI_EXP_SLTCAP_PCP);
> +}
> +
> +static void pcie_suppress_surprise_link_down(struct pci_dev *usp)
> +{
> +	u32 aer_uncorr_mask;
> +	u16 aer_cap;
> +
> +	aer_cap = usp->aer_cap;
> +	if (!aer_cap)
> +		return;
> +
> +	pci_read_config_dword(usp, aer_cap + PCI_ERR_UNCOR_MASK, &aer_uncorr_mask);
> +	aer_uncorr_mask |= PCI_ERR_UNC_SURPDN;
> +	pci_write_config_dword(usp, aer_cap + PCI_ERR_UNCOR_MASK, aer_uncorr_mask);
> +	dev_dbg(&usp->dev, "Non-hotplug slot: Surprise Link Down masked for cold reset\n");

So is it required for all devices that want to use cold-reset method
generically? If yes, shouldn't this be part of recovery script or atleast
documented somewhere?

Raag

> +}
> +#endif /* CONFIG_PCIEAER */
> +
>  static void punit_error_handler(struct xe_device *xe)
>  {
> +#ifdef CONFIG_PCIEAER
> +	struct pci_dev *pdev = to_pci_dev(xe->drm.dev);
> +	struct pci_dev *vsp, *usp;
> +
> +	/*
> +	 * Device Hierarchy:
> +	 *
> +	 * Root Port --> Upstream Switch Port (USP) --> Virtual Switch Port (VSP) --> SGunit
> +	 *
> +	 * Cold reset power-cycles the slot, dropping the PCIe link. On a non-hotplug
> +	 * slot this triggers a spurious Surprise Link Down AER event on the USP.
> +	 * Suppress it if the slot is not hotplug capable.
> +	 */
> +	vsp = pci_upstream_bridge(pdev);
> +	usp = vsp ? pci_upstream_bridge(vsp) : NULL;
> +
> +	if (usp && !pcie_slot_is_hotplug_capable(usp))
> +		pcie_suppress_surprise_link_down(usp);
> +#endif
>  	xe_device_set_wedged_method(xe, DRM_WEDGE_RECOVERY_COLD_RESET);
>  	xe_device_declare_wedged(xe);
>  }
> -- 
> 2.34.1
> 

  reply	other threads:[~2026-05-14  8:36 UTC|newest]

Thread overview: 16+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2026-05-12 13:26 [PATCH v5 0/5] Introduce cold reset recovery method Mallesh Koujalagi
2026-05-12 13:26 ` [PATCH v5 1/5] Introduce Xe Uncorrectable Error Handling Mallesh Koujalagi
2026-05-12 13:26 ` [PATCH v5 2/5] drm: Add DRM_WEDGE_RECOVERY_COLD_RESET recovery method Mallesh Koujalagi
2026-05-14  7:59   ` Raag Jadav
2026-05-14  9:12   ` Tauro, Riana
2026-05-12 13:26 ` [PATCH v5 3/5] drm/doc: Document " Mallesh Koujalagi
2026-05-14  8:50   ` Raag Jadav
2026-05-12 13:26 ` [PATCH v5 4/5] drm/xe: Handle PUNIT errors by requesting cold-reset recovery Mallesh Koujalagi
2026-05-14  8:13   ` Raag Jadav
2026-05-12 13:26 ` [PATCH v5 5/5] drm/xe: Suppress Surprise Link Down on non-hotplug device Mallesh Koujalagi
2026-05-14  8:35   ` Raag Jadav [this message]
2026-05-14  9:36   ` Tauro, Riana
2026-05-12 20:01 ` ✗ CI.checkpatch: warning for Introduce cold reset recovery method (rev4) Patchwork
2026-05-12 20:03 ` ✓ CI.KUnit: success " Patchwork
2026-05-12 21:42 ` ✓ Xe.CI.BAT: " Patchwork
2026-05-13 12:34 ` ✗ Xe.CI.FULL: failure " Patchwork

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=agWJbWQXA9XJnool@black.igk.intel.com \
    --to=raag.jadav@intel.com \
    --cc=airlied@gmail.com \
    --cc=andrealmeid@igalia.com \
    --cc=anshuman.gupta@intel.com \
    --cc=badal.nilawar@intel.com \
    --cc=christian.koenig@amd.com \
    --cc=dri-devel@lists.freedesktop.org \
    --cc=intel-xe@lists.freedesktop.org \
    --cc=karthik.poosa@intel.com \
    --cc=maarten.lankhorst@linux.intel.com \
    --cc=mallesh.koujalagi@intel.com \
    --cc=mripard@kernel.org \
    --cc=riana.tauro@intel.com \
    --cc=rodrigo.vivi@intel.com \
    --cc=simona.vetter@ffwll.ch \
    --cc=sk.anirban@intel.com \
    --cc=tzimmermann@suse.de \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line before the message body.
This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.