From: Rodrigo Vivi <rodrigo.vivi@intel.com>
To: Raag Jadav <raag.jadav@intel.com>
Cc: <intel-xe@lists.freedesktop.org>, <matthew.brost@intel.com>,
<riana.tauro@intel.com>, <michal.wajdeczko@intel.com>,
<matthew.d.roper@intel.com>, <mallesh.koujalagi@intel.com>
Subject: Re: [PATCH v1] drm/xe: Improve wedged state management
Date: Wed, 17 Jun 2026 10:06:18 -0400
Message-ID: <ajKp2t9kdlAQ1NdJ@intel.com>
In-Reply-To: <20260617120542.96444-1-raag.jadav@intel.com>
On Wed, Jun 17, 2026 at 05:33:45PM +0530, Raag Jadav wrote:
> Currently, wedged state is serving a single usecase where the device is
> permanently declared wedged, but this doesn't allow any wedged state
> management for runtime usecases. In preparation of usecases which require
> to facilitate temporary device wedging, convert wedged.flag to wedged.ref
> which serves as a driver internal refcount for wedged state and blocks
> critical path execution during device lifetime. While at it, introduce
> wedged.perm which signifies permanent device wedging and operates
> independent of the refcount allowing relevant cleanup action on unwind
> path.
>
> Signed-off-by: Raag Jadav <raag.jadav@intel.com>
> Reviewed-by: Rodrigo Vivi <rodrigo.vivi@intel.com>
> ---
> Split from FLR series[1].
>
> [1] https://lore.kernel.org/intel-xe/20260603101814.916948-9-raag.jadav@intel.com/
> ---
> drivers/gpu/drm/xe/xe_device.c | 5 +++--
> drivers/gpu/drm/xe/xe_device.h | 18 +++++++++++++++++-
> drivers/gpu/drm/xe/xe_device_types.h | 6 ++++--
> 3 files changed, 24 insertions(+), 5 deletions(-)
>
> diff --git a/drivers/gpu/drm/xe/xe_device.c b/drivers/gpu/drm/xe/xe_device.c
> index ef730f2bdf32..00ade433a23b 100644
> --- a/drivers/gpu/drm/xe/xe_device.c
> +++ b/drivers/gpu/drm/xe/xe_device.c
> @@ -916,7 +916,7 @@ static void xe_device_wedged_fini(struct drm_device *drm, void *arg)
> {
> struct xe_device *xe = arg;
>
> - if (atomic_read(&xe->wedged.flag))
> + if (atomic_read(&xe->wedged.perm))
> xe_pm_runtime_put(xe);
> }
>
> @@ -1421,7 +1421,8 @@ void xe_device_declare_wedged(struct xe_device *xe)
> return;
> }
>
> - if (!atomic_xchg(&xe->wedged.flag, 1)) {
> + if (!atomic_xchg(&xe->wedged.perm, 1)) {
Sashiko doesn't like this change, especially as a standalone patch:
https://sashiko.dev/#/patchset/20260617120542.96444-1-raag.jadav%40intel.com
Opus-4.8 is trying to convince me that, although this race does exist, it
is not problematic.
I'm wondering if we should first increase the refcount and then have an
aux function such as get_permanent() to differentiate the cases. Opus
believes it is not necessary.
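
To make the ordering idea above concrete, here is a rough userspace sketch using C11 atomics in place of the kernel's atomic_t. Everything in it is hypothetical: struct wedged_state, wedged_get_permanent() and the other helper names are illustrative stand-ins, not existing xe functions. It also assumes the race in question is the window between setting wedged.perm and taking the reference, which this mail does not spell out.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>

/* Hypothetical userspace stand-in for the wedged sub-struct of xe_device. */
struct wedged_state {
	atomic_int perm; /* permanently wedged, needs cleanup on fini */
	atomic_int ref;  /* refcount, non-zero blocks critical paths */
};

/* Mirrors xe_device_wedged_get() from the patch, with assert() for xe_assert(). */
static inline void wedged_get(struct wedged_state *w)
{
	int ref = atomic_fetch_add(&w->ref, 1) + 1;

	assert(ref > 0);
}

/* Mirrors xe_device_wedged_put() from the patch. */
static inline void wedged_put(struct wedged_state *w)
{
	int ref = atomic_fetch_sub(&w->ref, 1) - 1;

	assert(ref >= 0);
}

/* Mirrors xe_device_wedged(): any outstanding reference means wedged. */
static inline bool wedged(struct wedged_state *w)
{
	return atomic_load(&w->ref) != 0;
}

/*
 * Hypothetical helper: take the reference first, then publish perm.
 * A reader that observes perm == 1 therefore also observes ref > 0.
 * Returns true only for the caller that made the wedge permanent;
 * a later caller drops its extra reference and returns false.
 */
static inline bool wedged_get_permanent(struct wedged_state *w)
{
	wedged_get(w);
	if (atomic_exchange(&w->perm, 1)) {
		/* Someone else already holds the permanent reference. */
		wedged_put(w);
		return false;
	}
	return true;
}
```

With this shape, xe_device_declare_wedged() would presumably run its one-time setup only when the helper returns true. Whether the extra get/put on the losing path is acceptable in the real driver is an open question for the series refresh.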
Anyway, let's hold this patch for now. Please at least think about this case
when you are refreshing the whole series, and let's merge it together with the
series.
If you decide to go with this patch as is, you still have my Reviewed-by
and my ack to ignore Sashiko.
Thanks,
Rodrigo.
> + xe_device_wedged_get(xe);
> xe->needs_flr_on_fini = true;
> xe_pm_runtime_get_noresume(xe);
> drm_err(&xe->drm,
> diff --git a/drivers/gpu/drm/xe/xe_device.h b/drivers/gpu/drm/xe/xe_device.h
> index 975768a6a9c8..1aea83e3517c 100644
> --- a/drivers/gpu/drm/xe/xe_device.h
> +++ b/drivers/gpu/drm/xe/xe_device.h
> @@ -192,9 +192,25 @@ bool xe_device_is_l2_flush_optimized(struct xe_device *xe);
> void xe_device_td_flush(struct xe_device *xe);
> void xe_device_l2_flush(struct xe_device *xe);
>
> +static inline void xe_device_wedged_get(struct xe_device *xe)
> +{
> + int ref;
> +
> + ref = atomic_inc_return(&xe->wedged.ref);
> + xe_assert(xe, ref > 0);
> +}
> +
> +static inline void xe_device_wedged_put(struct xe_device *xe)
> +{
> + int ref;
> +
> + ref = atomic_dec_return(&xe->wedged.ref);
> + xe_assert(xe, ref >= 0);
> +}
> +
> static inline bool xe_device_wedged(struct xe_device *xe)
> {
> - return atomic_read(&xe->wedged.flag);
> + return atomic_read(&xe->wedged.ref);
> }
>
> void xe_device_set_wedged_method(struct xe_device *xe, unsigned long method);
> diff --git a/drivers/gpu/drm/xe/xe_device_types.h b/drivers/gpu/drm/xe/xe_device_types.h
> index 32dd2ffbc796..f13e0fb2f18e 100644
> --- a/drivers/gpu/drm/xe/xe_device_types.h
> +++ b/drivers/gpu/drm/xe/xe_device_types.h
> @@ -485,8 +485,10 @@ struct xe_device {
>
> /** @wedged: Struct to control Wedged States and mode */
> struct {
> - /** @wedged.flag: Xe device faced a critical error and is now blocked. */
> - atomic_t flag;
> + /** @wedged.perm: Permanently wedged, needs cleanup on fini */
> + atomic_t perm;
> + /** @wedged.ref: Refcount for wedged device, blocks critical path execution */
> + atomic_t ref;
> /** @wedged.mode: Mode controlled by kernel parameter and debugfs */
> enum xe_wedged_mode mode;
> /** @wedged.method: Recovery method to be sent in the drm device wedged uevent */
> --
> 2.43.0
>