Re: [PATCH v9 1/6] Introduce Xe Uncorrectable Error Handling

All of lore.kernel.org
 help / color / mirror / Atom feed

From: sashiko-bot@kernel.org
To: "Mallesh Koujalagi" <mallesh.koujalagi@intel.com>
Cc: dri-devel@lists.freedesktop.org
Subject: Re: [PATCH v9 1/6] Introduce Xe Uncorrectable Error Handling
Date: Tue, 23 Jun 2026 20:54:31 +0000	[thread overview]
Message-ID: <20260623205431.ED2401F000E9@smtp.kernel.org> (raw)
In-Reply-To: <20260622131458.144873-9-mallesh.koujalagi@intel.com>

Thank you for your contribution! Sashiko AI review found 10 potential issue(s) to consider:
- [High] Manual invocation of driver remove and probe callbacks in the PCI error slot_reset handler bypasses the driver core.
- [High] Unclosed devres group captures unintended resources and its forceful release causes Use-After-Free.
- [High] Endpoint driver directly modifies AER registers of the Upstream Switch Port via unsynchronized RMW.
- [High] Unbalanced `xe_device_wedged_put()` during non-fatal PCI error recovery causes refcount underflow and permanently wedges the device.
- [High] Clearing `xe->in_reset` prematurely and manually destroying devres causes TDR and async workers to dereference NULL MMIO pointers.
- [High] Unsynchronized checks of `xe_device_is_in_reset()` cause Time-of-Check to Time-of-Use (TOCTOU) races with PCI reset.
- [High] Calling `pci_match_id()` manually in `xe_pci_error_slot_reset()` returns NULL for devices bound via dynamic IDs, causing a NULL pointer dereference.
- [High] `prepare_device_for_reset()` disables the PCI device without cancelling or flushing active GT reset workers, risking MMIO timeouts.
- [Medium] Flawed accumulation logic and missing flood limits in `get_queued_pages` and `get_offlined_list` cause premature aborts or infinite loops.
- [Medium] Double removal of sysfs group `survivability_info_group` via both `devm_device_add_group` and `xe_survivability_mode_fini`.
--

commit 5c76039736fbbd64a08449cce81a6b27917ce366
Author: Riana Tauro <riana.tauro@intel.com>

Introduce Xe Uncorrectable Error Handling

This commit introduces uncorrectable error handling for the Xe driver, adding
PCI error handlers to recover from AER and system controller errors. It 
includes device wedging logic and hooks into the PCI error recovery flow.

> diff --git a/drivers/gpu/drm/xe/xe_pci.c b/drivers/gpu/drm/xe/xe_pci.c
> index c9d4fb6c4ff6a..352e7eb008d90 100644
> --- a/drivers/gpu/drm/xe/xe_pci.c
> +++ b/drivers/gpu/drm/xe/xe_pci.c
> @@ -1078,6 +1079,7 @@ static int xe_pci_probe(struct pci_dev *pdev, const struct pci_device_id *ent)
>  	const struct xe_device_desc *desc = (const void *)ent->driver_data;
>  	const struct xe_subplatform_desc *subplatform_desc;
>  	struct xe_device *xe;
> +	void *devres_id;
>  	int err;
>  
>  	subplatform_desc = find_subplatform(desc, pdev->device);
> @@ -1105,6 +1107,10 @@ static int xe_pci_probe(struct pci_dev *pdev, const struct pci_device_id *ent)
>  	if (xe_display_driver_probe_defer(pdev))
>  		return -EPROBE_DEFER;
>  
> +	devres_id = devres_open_group(&pdev->dev, NULL, GFP_KERNEL);
> +	if (!devres_id)
> +		return -ENOMEM;
> +

[Severity: High]
Does this code capture all future devm allocations unintentionally?

A devres group is opened here via devres_open_group() but never closed.
Consequently, all subsequent devres allocations by the driver and subsystem
will be silently added to this group. When devres_release_group() is manually
called later in xe_pci_error_slot_reset(), it forcefully destroys all
devm-managed resources, including the drm_device memory. If userspace holds
an open FD to the DRM device, could subsequent IOCTLs or FD closures access
the freed memory and cause a use-after-free?

> diff --git a/drivers/gpu/drm/xe/xe_pci_error.c b/drivers/gpu/drm/xe/xe_pci_error.c
> new file mode 100644
> index 0000000000000..b08601f470d63
> --- /dev/null
> +++ b/drivers/gpu/drm/xe/xe_pci_error.c
> @@ -0,0 +1,135 @@
> +// SPDX-License-Identifier: MIT
> +/*
> + * Copyright © 2026 Intel Corporation
> + */
> +
> +#include <linux/pci.h>
> +
> +#include "xe_device.h"
> +#include "xe_gt.h"
> +#include "xe_pci.h"
> +#include "xe_printk.h"
> +#include "xe_ras.h"
> +#include "xe_survivability_mode.h"
> +
> +static void prepare_device_for_reset(struct pci_dev *pdev)
> +{
> +	struct xe_device *xe = pdev_to_xe_device(pdev);
> +	struct xe_gt *gt;
> +	u8 id;
> +
> +	xe_device_set_in_reset(xe);
> +
> +	/* Wedge the device to prevent userspace access during reset */
> +	xe_device_wedged_get(xe);
> +
> +	for_each_gt(gt, xe, id)
> +		xe_gt_declare_wedged(gt);
> +
> +	pci_disable_device(pdev);
> +}

[Severity: High]
Are active GT reset workers properly synchronized before disabling the device?

This calls pci_disable_device(pdev) while gt_reset_worker or other
hardware-accessing work items may be actively executing. Although
xe_device_set_in_reset() is called, this only prevents new workers from
starting.

If a worker is actively polling MMIO registers when pci_disable_device drops
the memory decode enable bit, the MMIO reads will instantly return 0xFFFFFFFF.
Can this cause polling loops to time out, delay the reset process, or trigger
unexpected hardware warnings?

[ ... ]

> +static pci_ers_result_t xe_pci_error_detected(struct pci_dev *pdev, pci_channel_state_t state)
> +{
> +	struct xe_device *xe = pdev_to_xe_device(pdev);
> +
> +	xe_err(xe, "PCI error: detected state = %u\n", state);
> +
> +	if (state == pci_channel_io_perm_failure)
> +		return PCI_ERS_RESULT_DISCONNECT;
> +
> +	/* If the device is already wedged or in survivability mode, do not attempt recovery */
> +	if (xe_survivability_mode_is_boot_enabled(xe) || xe_device_wedged(xe))
> +		return PCI_ERS_RESULT_DISCONNECT;
> +
> +	switch (state) {
> +	case pci_channel_io_normal:
> +		return PCI_ERS_RESULT_CAN_RECOVER;

[Severity: High]
Will this cause a refcount underflow for non-fatal errors?

When receiving pci_channel_io_normal, this returns PCI_ERS_RESULT_CAN_RECOVER
without calling prepare_device_for_reset(), which means xe_device_wedged_get()
is never called.

If the subsequent mmio_enabled handler successfully recovers the error, the
PCI core directly invokes xe_pci_error_resume(), which unconditionally calls
xe_device_wedged_put(xe).

Could this underflow the xe->wedged.ref atomic counter to -1, permanently
wedging the device and incorrectly blocking all subsequent IOCTLs and GuC
submissions?

[ ... ]

> +static pci_ers_result_t xe_pci_error_slot_reset(struct pci_dev *pdev)
> +{
> +	const struct pci_device_id *ent = pci_match_id(pdev->driver->id_table, pdev);
> +	struct xe_device *xe = pdev_to_xe_device(pdev);

[Severity: High]
Could this return NULL for devices bound via dynamic IDs?

This retrieves the device ID table entry by manually calling pci_match_id(),
which exclusively searches the driver's statically compiled id_table. If the
device was bound using the dynamic ID mechanism via sysfs new_id,
pci_match_id() returns NULL.

> +
> +	xe_err(xe, "PCI error: slot reset\n");
> +
> +	pci_restore_state(pdev);
> +
> +	if (pci_enable_device(pdev)) {
> +		xe_err(xe, "Cannot re-enable PCI device after reset\n");
> +		return PCI_ERS_RESULT_DISCONNECT;
> +	}
> +
> +	/*
> +	 * Secondary Bus Reset causes all VRAM state to be lost along with
> +	 * hardware state. As an initial step, re-probe the device to
> +	 * re-initialize the driver and hardware.
> +	 * TODO: optimize by re-initializing only the hardware state and re-creating
> +	 * kernel BOs.
> +	 */
> +	xe_device_clear_in_reset(xe);
> +	pdev->driver->remove(pdev);
> +	devres_release_group(&pdev->dev, xe->devres_group_id);

[Severity: High]
Does clearing xe->in_reset prematurely cause a race condition with async
workers?

xe_device_clear_in_reset(xe) is called immediately before tearing down the
driver and calling devres_release_group(). The in_reset flag is meant to
prevent asynchronous workers (like the TDR timer) from accessing hardware
during reset.

By clearing it too early, TDR workers will pass the !xe_device_is_in_reset(xe)
check and execute concurrently with driver teardown. Since
devres_release_group() synchronously unmaps MMIO, could a racing TDR worker
attempt to read hardware registers using unmapped pointers (for example,
during xe_devcoredump()), leading to a panic?

> +
> +	if (pdev->driver->probe(pdev, ent))
> +		return PCI_ERS_RESULT_DISCONNECT;

[Severity: High]
Is it safe to pass a potentially NULL 'ent' pointer to the probe function?

If pci_match_id() returned NULL earlier, passing it here will cause
xe_pci_probe() to immediately dereference ent->driver_data, resulting in a
deterministic NULL pointer dereference.

[Severity: High]
Does manually invoking the remove and probe callbacks bypass the Linux
driver core?

Directly calling pdev->driver->remove(pdev) and pdev->driver->probe(pdev, ent)
bypasses the unbind/bind state machine of the Linux driver core. This
detaches the device from proper lifecycle management.

If the manual probe fails, devres is not automatically cleaned up.
Furthermore, the driver core still considers the device bound. Upon eventual
module unload or unbind, the driver core will execute remove() a second time
on the already-freed xe structure, potentially resulting in a use-after-free
or double-free regression.

> diff --git a/drivers/gpu/drm/xe/xe_guc_submit.c b/drivers/gpu/drm/xe/xe_guc_submit.c
> index 9458bf477fa6e..12416bfa32557 100644
> --- a/drivers/gpu/drm/xe/xe_guc_submit.c
> +++ b/drivers/gpu/drm/xe/xe_guc_submit.c
> @@ -1532,7 +1532,7 @@ guc_exec_queue_timedout_job(struct drm_sched_job *drm_job)
>  	 * If devcoredump not captured and GuC capture for the job is not ready
>  	 * do manual capture first and decide later if we need to use it
>  	 */
> -	if (!exec_queue_killed(q) && !xe->devcoredump.captured &&
> +	if (!xe_device_is_in_reset(xe) && !exec_queue_killed(q) && !xe->devcoredump.captured &&
>  	    !xe_guc_capture_get_matching_and_lock(q)) {
>  		/* take force wake before engine register manual capture */
>  		CLASS(xe_force_wake, fw_ref)(gt_to_fw(q->gt), XE_FORCEWAKE_ALL);

[Severity: High]
Is there a time-of-check to time-of-use (TOCTOU) race condition here?

Asynchronous workers rely on the atomic check xe_device_is_in_reset(xe) to
avoid interacting with hardware during PCI error recovery, but this flag is
checked without holding any synchronization lock.

Immediately after the worker verifies the device is not in reset, the PCI
error handler could concurrently invoke prepare_device_for_reset(), setting
the flag and disabling the PCI device. The worker would then blindly proceed
to access MMIO or manipulate firmware state on a disabled device. Could this
cause MMIO timeouts or undefined hardware behavior?

> diff --git a/drivers/gpu/drm/xe/xe_ras.c b/drivers/gpu/drm/xe/xe_ras.c
> index 44f4e1a3455b7..d6db0e98586fe 100644
> --- a/drivers/gpu/drm/xe/xe_ras.c
> +++ b/drivers/gpu/drm/xe/xe_ras.c
> @@ -131,6 +126,305 @@ static inline const char *comp_to_str(u8 component)
>  	return xe_ras_components[component];
>  }
>  
> +static int send_page_offline(struct xe_device *xe, enum xe_ras_page_action action, u64 page_address)

[ ... ]

> +static void get_queued_pages(struct xe_device *xe)
> +{
> +	struct xe_sysctrl_mailbox_command command = {0};
> +	struct xe_ras_page_offline_queue response = {0};
> +	u32 count = 0;
> +	size_t rlen;
> +	int ret, i;
> +
> +	/* Supported only on platforms with system controller */
> +	if (!xe->info.has_sysctrl)
> +		return;
> +
> +	xe_sysctrl_create_command(&command, XE_SYSCTRL_GROUP_GFSP,
> +				  XE_SYSCTRL_CMD_GET_OFFLINE_QUEUE, NULL, 0, &response,
> +				  sizeof(response));
> +
> +	do {
> +		memset(&response, 0, sizeof(response));
> +
> +		ret = xe_sysctrl_send_command(&xe->sc, &command, &rlen);
> +		if (ret) {
> +			xe_err(xe, "sysctrl: failed to get page offline queue %d\n", ret);
> +			return;
> +		}
> +
> +		if (rlen != sizeof(response)) {
> +			xe_err(xe, "sysctrl: unexpected page offline queue response length %zu (expected %zu)\n",
> +			       rlen, sizeof(response));
> +			return;
> +		}
> +
> +		for (i = 0; i < response.pages_returned && i < XE_RAS_NUM_PAGES; i++)
> +			handle_page_offline(xe, response.page_addresses[i], true);
> +
> +		count += response.pages_returned;
> +		if (!response.pages_returned)
> +			break;
> +
> +		if (count > response.total_pages) {
> +			xe_err(xe, "sysctrl: Pages returned from queue exceed total pages %u, returned %u\n",
> +			       response.total_pages, count);
> +			return;
> +		}
> +	} while (response.additional_data);

[Severity: Medium]
Could the accumulation logic here lead to a premature abort or infinite loop?

The loop monotonically accumulates response.pages_returned into count, and
compares it against response.total_pages. If the firmware concurrently updates
total_pages to a smaller value as pages are dequeued, this check could falsely
trigger and prematurely abort the recovery.

Additionally, there is no maximum iteration limit (like
XE_SYSCTRL_FLOOD_LIMIT). If a buggy firmware implementation keeps setting
additional_data = 1 indefinitely, could this force the kernel into an infinite
loop or cause the u32 count to overflow?

[ ... ]

> +static void aer_unmask_and_downgrade_internal_error(struct xe_device *xe)
> +{
> +	struct pci_dev *pdev = to_pci_dev(xe->drm.dev);
> +	u32 aer_uncorr_mask, aer_uncorr_sev, aer_uncorr_status;
> +	struct pci_dev *usp;
> +	u16 aer_cap;
> +
> +	usp = find_usp_dev(pdev);
> +	if (!usp)
> +		return;
> +
> +	aer_cap = pci_find_ext_capability(usp, PCI_EXT_CAP_ID_ERR);
> +	if (!aer_cap) {
> +		dev_info(&usp->dev, "AER capability unavailable\n");
> +		return;
> +	}
> +
> +	/*
> +	 * Clear any stale Uncorrectable Internal Error Status event in Uncorrectable Error
> +	 * Status Register.
> +	 */
> +	pci_read_config_dword(usp, aer_cap + PCI_ERR_UNCOR_STATUS, &aer_uncorr_status);
> +	if (aer_uncorr_status & PCI_ERR_UNC_INTN)
> +		pci_write_config_dword(usp, aer_cap + PCI_ERR_UNCOR_STATUS, PCI_ERR_UNC_INTN);
> +
> +	/*
> +	 * All errors are steered to USP which is a PCIe AER Compliant device.
> +	 * Downgrade all the errors to non-fatal to prevent PCIe bus driver
> +	 * from triggering a Secondary Bus Reset (SBR). This allows error
> +	 * detection, containment and recovery in the driver.
> +	 *
> +	 * The Uncorrectable Error Severity Register has the 'Uncorrectable
> +	 * Internal Error Severity' set to fatal by default. Set this to
> +	 * non-fatal and unmask the error.
> +	 */
> +
> +	/* Initialize Uncorrectable Error Severity Register */
> +	pci_read_config_dword(usp, aer_cap + PCI_ERR_UNCOR_SEVER, &aer_uncorr_sev);
> +	aer_uncorr_sev &= ~PCI_ERR_UNC_INTN;
> +	pci_write_config_dword(usp, aer_cap + PCI_ERR_UNCOR_SEVER, aer_uncorr_sev);
> +
> +	/* Initialize Uncorrectable Error Mask Register */
> +	pci_read_config_dword(usp, aer_cap + PCI_ERR_UNCOR_MASK, &aer_uncorr_mask);
> +	aer_uncorr_mask &= ~PCI_ERR_UNC_INTN;
> +	pci_write_config_dword(usp, aer_cap + PCI_ERR_UNCOR_MASK, aer_uncorr_mask);

[Severity: High]
Is it safe for the endpoint driver to directly modify AER registers of the
Upstream Switch Port using an unsynchronized read-modify-write sequence?

Retrieving the parent Upstream Switch Port (USP) device and directly modifying
its AER Configuration space bypasses the pcieport driver which manages it.

Because the modification uses an unlocked read-modify-write sequence,
concurrent probing of multiple endpoint functions or concurrent AER event
handling by the PCI core could result in lost updates and permanently
corrupted AER configurations on the PCIe switch.

> diff --git a/drivers/gpu/drm/xe/xe_survivability_mode.c b/drivers/gpu/drm/xe/xe_survivability_mode.c
> index 427afd144f3ac..4c506027fa94c 100644
> --- a/drivers/gpu/drm/xe/xe_survivability_mode.c
> +++ b/drivers/gpu/drm/xe/xe_survivability_mode.c
> @@ -296,7 +304,8 @@ static int create_survivability_sysfs(struct pci_dev *pdev)
>  	if (ret)
>  		return ret;
>  
> -	if (check_boot_failure(xe)) {
> +	/* Survivability info is not required if enabled via configfs */
> +	if (!xe_configfs_get_survivability_mode(pdev)) {
>  		ret = devm_device_add_group(dev, &survivability_info_group);
>  		if (ret)
>  			return ret;

[Severity: Medium]
Will this result in a double-removal of the sysfs group?

The survivability_info_group is registered using devm_device_add_group(),
which sets up an automatic devres cleanup action. Additionally,
xe_survivability_mode_fini() is registered via devm_add_action_or_reset(),
which calls device_remove_group() for the exact same group.

During device teardown, devres executes the automatically added
devm_device_remove_group first. It then executes xe_survivability_mode_fini,
which removes the group again. Furthermore, if the configfs check prevents the
group from being added initially, xe_survivability_mode_fini will still
unconditionally attempt to remove it.

-- 
Sashiko AI review · https://sashiko.dev/#/patchset/20260622131458.144873-8-mallesh.koujalagi@intel.com?part=1

next prev parent reply	other threads:[~2026-06-23 20:54 UTC|newest]

Thread overview: 16+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2026-06-22 13:14 [PATCH v9 0/6] Introduce cold reset recovery method Mallesh Koujalagi
2026-06-22 13:15 ` [PATCH v9 1/6] Introduce Xe Uncorrectable Error Handling Mallesh Koujalagi
2026-06-23 20:54   ` sashiko-bot [this message]
2026-06-22 13:15 ` [PATCH v9 2/6] drm: Add DRM_WEDGE_RECOVERY_COLD_RESET recovery method Mallesh Koujalagi
2026-06-23 21:04   ` sashiko-bot
2026-06-22 13:15 ` [PATCH v9 3/6] drm/doc: Document " Mallesh Koujalagi
2026-06-23 21:11   ` sashiko-bot
2026-06-22 13:15 ` [PATCH v9 4/6] drm/xe: Handle PUNIT errors by requesting cold-reset recovery Mallesh Koujalagi
2026-06-22 13:15 ` [PATCH v9 5/6] drm/xe: Suppress Surprise Link Down on device Mallesh Koujalagi
2026-06-23 22:47   ` sashiko-bot
2026-06-22 13:15 ` [PATCH v9 6/6] drm/xe/ras: Use fault-inject to trigger punit error handler Mallesh Koujalagi
2026-06-23 22:47   ` sashiko-bot
2026-06-22 17:20 ` ✗ CI.checkpatch: warning for Introduce cold reset recovery method (rev9) Patchwork
2026-06-22 17:22 ` ✓ CI.KUnit: success " Patchwork
2026-06-22 18:09 ` ✓ Xe.CI.BAT: " Patchwork
2026-06-22 22:03 ` ✗ Xe.CI.FULL: failure " Patchwork

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=20260623205431.ED2401F000E9@smtp.kernel.org \
    --to=sashiko-bot@kernel.org \
    --cc=dri-devel@lists.freedesktop.org \
    --cc=mallesh.koujalagi@intel.com \
    --cc=sashiko-reviews@lists.linux.dev \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link

Be sure your reply has a Subject: header at the top and a blank line before the message body.

This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.