Re: [PATCH v7 4/6] drm/xe: Handle PUNIT errors by requesting cold-reset recovery

All of lore.kernel.org
 help / color / mirror / Atom feed

From: sashiko-bot@kernel.org
To: "Mallesh Koujalagi" <mallesh.koujalagi@intel.com>
Cc: dri-devel@lists.freedesktop.org
Subject: Re: [PATCH v7 4/6] drm/xe: Handle PUNIT errors by requesting cold-reset recovery
Date: Fri, 05 Jun 2026 12:48:22 +0000	[thread overview]
Message-ID: <20260605124822.81BEE1F00893@smtp.kernel.org> (raw)
In-Reply-To: <20260605122544.1021695-12-mallesh.koujalagi@intel.com>

Thank you for your contribution! Sashiko AI review found 2 potential issue(s) to consider:

Pre-existing issues:
- [High] Emitting a DRM wedged uevent before the device is registered in sysfs causes the uevent to be silently dropped, losing the cold-reset notification.
- [High] Concurrent modification of unprotected `xe->wedged.method` can cause the `COLD_RESET` recovery method to be overwritten.
--

commit d8f11e3cba2116967c1b1af489f30a2fd996f664
Author: Mallesh Koujalagi <mallesh.koujalagi@intel.com>

drm/xe: Handle PUNIT errors by requesting cold-reset recovery

When PUNIT errors are detected that persist across warm resets, this
commit marks the device as wedged with DRM_WEDGE_RECOVERY_COLD_RESET
and notifies userspace that a complete device power cycle is required.

> diff --git a/drivers/gpu/drm/xe/xe_ras.c b/drivers/gpu/drm/xe/xe_ras.c
> --- a/drivers/gpu/drm/xe/xe_ras.c
> +++ b/drivers/gpu/drm/xe/xe_ras.c

[ ... ]

> @@ -207,6 +209,12 @@ static enum xe_ras_recovery_action handle_core_compute_errors(struct xe_ras_erro
>  	return XE_RAS_RECOVERY_ACTION_RECOVERED;
>  }
>  
> +static void punit_error_handler(struct xe_device *xe)
> +{
> +	xe_device_set_wedged_method(xe, DRM_WEDGE_RECOVERY_COLD_RESET);

[Severity: High]
This is a pre-existing issue, but does concurrent modification of the
unprotected xe->wedged.method here cause the COLD_RESET recovery method to
be overwritten?

If a concurrent event (such as a GT hang timeout) triggers
xe_device_declare_wedged() concurrently, there appears to be a race
condition. The concurrent thread may evaluate if (!xe->wedged.method) as
true just before punit_error_handler() executes xe_device_set_wedged_method()
with DRM_WEDGE_RECOVERY_COLD_RESET.

The concurrent thread would then execute:

xe_device_set_wedged_method(xe, DRM_WEDGE_RECOVERY_REBIND |
                                DRM_WEDGE_RECOVERY_BUS_RESET);

Will this overwrite the COLD_RESET method and result in an incorrect
recovery uevent being broadcast to userspace?

> +	xe_device_declare_wedged(xe);

[Severity: High]
This isn't a bug introduced by this patch, but does emitting a DRM wedged
uevent before the device is registered in sysfs cause the uevent to be
silently dropped?

During driver initialization, xe_device_probe() calls xe_ras_init(), which
processes preexisting hardware errors. If a PUNIT error is detected at this
stage, punit_error_handler() marks the device as wedged and calls
xe_device_declare_wedged(), which in turn calls drm_dev_wedged_event().

Since drm_dev_register() has not yet been called at this stage of the probe,
device_add() has not executed, leaving state_in_sysfs = 0 for the kobject.
The core kobject_uevent_env() explicitly drops events for unlinked kobjects.

Does this cause the WEDGED=cold-reset uevent notification intended for
userspace to be permanently lost, leaving the device unusable without
userspace automation knowing it needs to power-cycle the slot?

> +}
> +
>  static enum xe_ras_recovery_action handle_soc_internal_errors(struct xe_device *xe,

-- 
Sashiko AI review · https://sashiko.dev/#/patchset/20260605122544.1021695-8-mallesh.koujalagi@intel.com?part=4

next prev parent reply	other threads:[~2026-06-05 12:48 UTC|newest]

Thread overview: 13+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2026-06-05 12:25 [PATCH v7 0/6] Introduce cold reset recovery method Mallesh Koujalagi
2026-06-05 12:25 ` [PATCH v7 1/6] Introduce Xe Uncorrectable Error Handling Mallesh Koujalagi
2026-06-05 12:43   ` sashiko-bot
2026-06-05 12:25 ` [PATCH v7 2/6] drm: Add DRM_WEDGE_RECOVERY_COLD_RESET recovery method Mallesh Koujalagi
2026-06-05 12:25 ` [PATCH v7 3/6] drm/doc: Document " Mallesh Koujalagi
2026-06-08  6:30   ` Raag Jadav
2026-06-05 12:25 ` [PATCH v7 4/6] drm/xe: Handle PUNIT errors by requesting cold-reset recovery Mallesh Koujalagi
2026-06-05 12:48   ` sashiko-bot [this message]
2026-06-05 12:25 ` [PATCH v7 5/6] drm/xe: Suppress Surprise Link Down on device Mallesh Koujalagi
2026-06-05 12:41   ` sashiko-bot
2026-06-05 12:25 ` [PATCH v7 6/6] drm/xe/ras: Add debugfs entry to inject punit error Mallesh Koujalagi
2026-06-05 13:00 ` ✗ CI.checkpatch: warning for Introduce cold reset recovery method (rev7) Patchwork
2026-06-05 13:00 ` ✗ CI.KUnit: failure " Patchwork

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=20260605124822.81BEE1F00893@smtp.kernel.org \
    --to=sashiko-bot@kernel.org \
    --cc=dri-devel@lists.freedesktop.org \
    --cc=mallesh.koujalagi@intel.com \
    --cc=sashiko-reviews@lists.linux.dev \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link

Be sure your reply has a Subject: header at the top and a blank line before the message body.

This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.