Re: [PATCH v7 1/6] Introduce Xe Uncorrectable Error Handling

All of lore.kernel.org
 help / color / mirror / Atom feed

From: sashiko-bot@kernel.org
To: "Mallesh Koujalagi" <mallesh.koujalagi@intel.com>
Cc: dri-devel@lists.freedesktop.org
Subject: Re: [PATCH v7 1/6] Introduce Xe Uncorrectable Error Handling
Date: Fri, 05 Jun 2026 12:43:11 +0000	[thread overview]
Message-ID: <20260605124311.A8F0D1F00893@smtp.kernel.org> (raw)
In-Reply-To: <20260605122544.1021695-9-mallesh.koujalagi@intel.com>

Thank you for your contribution! Sashiko AI review found 4 potential issue(s) to consider:
- [High] Bypassing the Linux driver core by manually invoking probe and remove callbacks causes Use-After-Free on failure.
- [High] Unbalanced decrement of the `xe->wedged.ref` counter incorrectly permanently wedges the device after a non-fatal error recovery.
- [High] Unprotected Read-Modify-Write of Upstream Switch Port (USP) PCIe AER registers causes race conditions and corrupts switch state.
- [Medium] Unbounded firmware query loops in `get_queued_pages()` and `get_offlined_list()` allow a malfunctioning sysctrl firmware to hang the kernel.
--

commit 1dea0dcd39b5b10ff25f34fd2fa55710387abdc7
Author: Riana Tauro <riana.tauro@intel.com>
Subject: Introduce Xe Uncorrectable Error Handling

This patch introduces PCI error handling to the Xe driver to process
and recover from hardware and firmware errors. It queries the system
controller for RAS error details and handles device teardown and re-probe
sequences during slot resets.

Link: https://patchwork.freedesktop.org/series/160482/

> diff --git a/drivers/gpu/drm/xe/xe_pci_error.c b/drivers/gpu/drm/xe/xe_pci_error.c
> new file mode 100644
> --- /dev/null
> +++ b/drivers/gpu/drm/xe/xe_pci_error.c
[ ... ]
> +static pci_ers_result_t xe_pci_error_detected(struct pci_dev *pdev, pci_channel_state_t state)
> +{
> +	struct xe_device *xe = pdev_to_xe_device(pdev);
> +
> +	xe_err(xe, "PCI error: detected state = %u\n", state);
> +
> +	if (state == pci_channel_io_perm_failure)
> +		return PCI_ERS_RESULT_DISCONNECT;
> +
> +	/* If the device is already wedged or in survivability mode, do not attempt recovery */
> +	if (xe_survivability_mode_is_boot_enabled(xe) || xe_device_wedged(xe))
> +		return PCI_ERS_RESULT_DISCONNECT;
> +
> +	switch (state) {
> +	case pci_channel_io_normal:
> +		return PCI_ERS_RESULT_CAN_RECOVER;
[ ... ]
> +static pci_ers_result_t xe_pci_error_slot_reset(struct pci_dev *pdev)
> +{
[ ... ]
> +	/*
> +	 * Secondary Bus Reset causes all VRAM state to be lost along with
> +	 * hardware state. As an initial step, re-probe the device to
> +	 * re-initialize the driver and hardware.
> +	 * TODO: optimize by re-initializing only the hardware state and re-creating
> +	 * kernel BOs.
> +	 */
> +	xe_device_clear_in_reset(xe);
> +	pdev->driver->remove(pdev);
> +	devres_release_group(&pdev->dev, xe->devres_group_id);
> +
> +	if (pdev->driver->probe(pdev, ent))
> +		return PCI_ERS_RESULT_DISCONNECT;

[Severity: High]
Can this manual driver teardown cause a use-after-free?

If pdev->driver->probe() fails before updating the driver data pointer,
pdev->dev.driver_data will remain pointing to the memory freed by
devres_release_group().

The PCI core might later call xe_pci_remove(), which will dereference
this stale pointer. Would it be safer to avoid manually calling driver
lifecycle callbacks?

> +
> +	xe = pdev_to_xe_device(pdev);
> +
> +	/* Wedge the device to prevent I/O operations till the resume callback */
> +	xe_device_wedged_get(xe);
> +
> +	return PCI_ERS_RESULT_RECOVERED;
> +}
> +
> +static void xe_pci_error_resume(struct pci_dev *pdev)
> +{
> +	struct xe_device *xe = pdev_to_xe_device(pdev);
> +
> +	xe_err(xe, "PCI error: resume\n");
> +
> +	/* Resume I/O operations */
> +	xe_device_wedged_put(xe);

[Severity: High]
Does this unconditional put cause an underflow when recovering from non-fatal
errors?

In xe_pci_error_detected(), the pci_channel_io_normal path returns
PCI_ERS_RESULT_CAN_RECOVER without calling prepare_device_for_reset(), so
the wedged refcount is never incremented.

When xe_pci_error_resume() is later called, this put will decrement the
counter to -1, which incorrectly evaluates as true in xe_device_wedged()
and permanently wedges the device.

> +}
> diff --git a/drivers/gpu/drm/xe/xe_ras.c b/drivers/gpu/drm/xe/xe_ras.c
> new file mode 100644
> --- /dev/null
> +++ b/drivers/gpu/drm/xe/xe_ras.c
[ ... ]
> +static void get_queued_pages(struct xe_device *xe)
> +{
> +	struct xe_sysctrl_mailbox_command command = {0};
> +	struct xe_ras_page_offline_queue response = {0};
> +	u32 count = 0;
> +	size_t rlen;
> +	int ret, i;
[ ... ]
> +	do {
> +		memset(&response, 0, sizeof(response));
> +
> +		ret = xe_sysctrl_send_command(&xe->sc, &command, &rlen);
[ ... ]
> +	} while (response.additional_data);

[Severity: Medium]
Could a malfunctioning sysctrl firmware cause a soft lockup here?

The loop relies entirely on the firmware's additional_data flag and
total_pages limit. Unlike xe_ras_process_errors() which enforces an
XE_SYSCTRL_FLOOD_LIMIT, this loop appears to lack a driver-enforced
iteration ceiling.

> +}
> +
> +static void get_offlined_list(struct xe_device *xe)
> +{
> +	struct xe_sysctrl_mailbox_command command = {0};
> +	struct xe_ras_page_offline_list response = {0};
> +	u32 count = 0;
> +	size_t rlen;
> +	int ret, i;
[ ... ]
> +	do {
> +		memset(&response, 0, sizeof(response));
> +
> +		ret = xe_sysctrl_send_command(&xe->sc, &command, &rlen);
[ ... ]
> +	} while (response.additional_data);

[Severity: Medium]
Is there a similar risk of a soft lockup in this loop if the firmware
continuously returns additional_data?

> +}
[ ... ]
> +static void aer_unmask_and_downgrade_internal_error(struct xe_device *xe)
> +{
> +	struct pci_dev *pdev = to_pci_dev(xe->drm.dev);
> +	u32 aer_uncorr_mask, aer_uncorr_sev, aer_uncorr_status;
> +	struct pci_dev *usp;
> +	u16 aer_cap;
> +
> +	usp = find_usp_dev(pdev);
> +	if (!usp)
> +		return;
[ ... ]
> +	/* Initialize Uncorrectable Error Severity Register */
> +	pci_read_config_dword(usp, aer_cap + PCI_ERR_UNCOR_SEVER, &aer_uncorr_sev);
> +	aer_uncorr_sev &= ~PCI_ERR_UNC_INTN;
> +	pci_write_config_dword(usp, aer_cap + PCI_ERR_UNCOR_SEVER, aer_uncorr_sev);
> +
> +	/* Initialize Uncorrectable Error Mask Register */
> +	pci_read_config_dword(usp, aer_cap + PCI_ERR_UNCOR_MASK, &aer_uncorr_mask);
> +	aer_uncorr_mask &= ~PCI_ERR_UNC_INTN;
> +	pci_write_config_dword(usp, aer_cap + PCI_ERR_UNCOR_MASK, aer_uncorr_mask);
> +
> +	pci_save_state(usp);

[Severity: High]
Can this read-modify-write sequence corrupt the switch state if multiple
endpoints probe concurrently?

The upstream switch port is a shared resource, and modifying its AER
capability registers without synchronization could lead to race conditions
when other child devices (such as SR-IOV VFs) execute this same path.

> +	dev_dbg(&usp->dev, "Uncorrectable Internal Errors downgraded and unmasked\n");
> +}

-- 
Sashiko AI review · https://sashiko.dev/#/patchset/20260605122544.1021695-8-mallesh.koujalagi@intel.com?part=1

next prev parent reply	other threads:[~2026-06-05 12:43 UTC|newest]

Thread overview: 13+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2026-06-05 12:25 [PATCH v7 0/6] Introduce cold reset recovery method Mallesh Koujalagi
2026-06-05 12:25 ` [PATCH v7 1/6] Introduce Xe Uncorrectable Error Handling Mallesh Koujalagi
2026-06-05 12:43   ` sashiko-bot [this message]
2026-06-05 12:25 ` [PATCH v7 2/6] drm: Add DRM_WEDGE_RECOVERY_COLD_RESET recovery method Mallesh Koujalagi
2026-06-05 12:25 ` [PATCH v7 3/6] drm/doc: Document " Mallesh Koujalagi
2026-06-08  6:30   ` Raag Jadav
2026-06-05 12:25 ` [PATCH v7 4/6] drm/xe: Handle PUNIT errors by requesting cold-reset recovery Mallesh Koujalagi
2026-06-05 12:48   ` sashiko-bot
2026-06-05 12:25 ` [PATCH v7 5/6] drm/xe: Suppress Surprise Link Down on device Mallesh Koujalagi
2026-06-05 12:41   ` sashiko-bot
2026-06-05 12:25 ` [PATCH v7 6/6] drm/xe/ras: Add debugfs entry to inject punit error Mallesh Koujalagi
2026-06-05 13:00 ` ✗ CI.checkpatch: warning for Introduce cold reset recovery method (rev7) Patchwork
2026-06-05 13:00 ` ✗ CI.KUnit: failure " Patchwork

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=20260605124311.A8F0D1F00893@smtp.kernel.org \
    --to=sashiko-bot@kernel.org \
    --cc=dri-devel@lists.freedesktop.org \
    --cc=mallesh.koujalagi@intel.com \
    --cc=sashiko-reviews@lists.linux.dev \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link

Be sure your reply has a Subject: header at the top and a blank line before the message body.

This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.