From: Raag Jadav <raag.jadav@intel.com>
To: "Tauro, Riana" <riana.tauro@intel.com>
Cc: Rodrigo Vivi <rodrigo.vivi@intel.com>,
	intel-xe@lists.freedesktop.org, matthew.brost@intel.com,
	thomas.hellstrom@linux.intel.com, michal.wajdeczko@intel.com,
	matthew.d.roper@intel.com, michal.winiarski@intel.com,
	matthew.auld@intel.com, dev@lankhorst.se, jani.nikula@intel.com,
	lukasz.laguna@intel.com, zhanjun.dong@intel.com, lukas@wunner.de,
	daniele.ceraolospurio@intel.com, badal.nilawar@intel.com
Subject: Re: [PATCH v8 08/10] drm/xe: Improve wedged state management
Date: Thu, 4 Jun 2026 10:39:54 +0200
Message-ID: <aiE52oh3F-5OOqLh@black.igk.intel.com>
In-Reply-To: <9da04b52-c43e-4afd-9d1b-8248034f58ab@intel.com>

On Thu, Jun 04, 2026 at 12:22:35PM +0530, Tauro, Riana wrote:
> On 6/3/2026 4:26 PM, Rodrigo Vivi wrote:
> > On Wed, Jun 03, 2026 at 03:47:08PM +0530, Raag Jadav wrote:
> > > Currently, wedged state is serving a single usecase where the device is
> > > declared wedged, but this doesn't allow any wedged state management for
> > > runtime usecases. In preparation of usecases which requires to facilitate
> > > temporary device wedging, convert wedged.flag to wedged.ref which serves
> > > as a driver internal refcount and blocks critical path execution during
> > > device lifetime and while at it, introduce wedged.fini and use it as a
> > > cleanup indicator during driver unwind which operates independent of the
> > > refcount.
> > >
> > > Signed-off-by: Raag Jadav <raag.jadav@intel.com>
> > > ---
> > > drivers/gpu/drm/xe/xe_device.c | 5 +++--
> > > drivers/gpu/drm/xe/xe_device.h | 12 +++++++++++-
> > > drivers/gpu/drm/xe/xe_device_types.h | 6 ++++--
> > > 3 files changed, 18 insertions(+), 5 deletions(-)
> > >
> > > diff --git a/drivers/gpu/drm/xe/xe_device.c b/drivers/gpu/drm/xe/xe_device.c
> > > index 9e0cbad8adb0..d77f8f054a1a 100644
> > > --- a/drivers/gpu/drm/xe/xe_device.c
> > > +++ b/drivers/gpu/drm/xe/xe_device.c
> > > @@ -917,7 +917,7 @@ static void xe_device_wedged_fini(struct drm_device *drm, void *arg)
> > >  {
> > >  	struct xe_device *xe = arg;
> > > -	if (atomic_read(&xe->wedged.flag))
> > > +	if (atomic_read(&xe->wedged.fini))
> > >  		xe_pm_runtime_put(xe);
> > >  }
> > > @@ -1413,7 +1413,8 @@ void xe_device_declare_wedged(struct xe_device *xe)
> > > return;
> > > }
> > > - if (!atomic_xchg(&xe->wedged.flag, 1)) {
> > > + if (!atomic_xchg(&xe->wedged.fini, 1)) {
>
> Just curious. Why do we need runtime pm reference held throughout till the
> end of driver unload?
> Even if this is needed why not associate releasing runtime pm wakeref with
> the flr_on_fini flag.
I guess needs_flr_on_fini has multiple setters, and it's not always clear how many.
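
To illustrate the concern, here is a minimal userspace sketch, not driver code. It assumes C11 <stdatomic.h> in place of the kernel atomic_t API, and every model_* name, plus the plain pm_refs counter standing in for the runtime PM usage count, is hypothetical. It shows why the put is keyed off a one-shot exchange rather than a plain flag that several paths may set:

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>

/* Hypothetical stand-in for the device; this is NOT struct xe_device. */
struct model_dev {
	atomic_int fini;	/* one-shot "needs cleanup on fini" indicator */
	bool needs_flr;		/* plain flag that several paths may set */
	int pm_refs;		/* stand-in for the runtime PM usage count */
};

static void model_dev_init(struct model_dev *dev)
{
	atomic_init(&dev->fini, 0);
	dev->needs_flr = false;
	dev->pm_refs = 0;
}

/* Models the guarded block in xe_device_declare_wedged(). */
static void model_declare_wedged(struct model_dev *dev)
{
	/* atomic_exchange() returns the old value: only the first caller sees 0. */
	if (!atomic_exchange(&dev->fini, 1)) {
		dev->needs_flr = true;
		dev->pm_refs++;		/* taken exactly once */
	}
}

/* A second, hypothetical setter of the plain flag that takes no reference. */
static void model_other_setter(struct model_dev *dev)
{
	dev->needs_flr = true;
}

/* Models xe_device_wedged_fini(): drop the reference only if it was taken. */
static void model_fini(struct model_dev *dev)
{
	if (atomic_load(&dev->fini)) {
		assert(dev->pm_refs > 0);	/* a put without a get is a bug */
		dev->pm_refs--;
	}
}
```

If model_fini() keyed the put off needs_flr instead, a call to model_other_setter() alone would trigger a put with no matching get. That is the imbalance I'd want to rule out before tying the wakeref to that flag.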
> I was facing a similar issue for uncorrectable series and i was thinking of
> releasing runtime pm wakeref based on flr_fini.
Sure, but I'm not touching any existing behaviour in this series. It's probably
worth exploring as a follow-up.
Raag
> > > + xe_device_wedged_get(xe);
> > > xe->needs_flr_on_fini = true;
> > > xe_pm_runtime_get_noresume(xe);
> > > drm_err(&xe->drm,
> > > diff --git a/drivers/gpu/drm/xe/xe_device.h b/drivers/gpu/drm/xe/xe_device.h
> > > index d61fdb362f91..14677729aa24 100644
> > > --- a/drivers/gpu/drm/xe/xe_device.h
> > > +++ b/drivers/gpu/drm/xe/xe_device.h
> > > @@ -194,9 +194,19 @@ bool xe_device_is_l2_flush_optimized(struct xe_device *xe);
> > > void xe_device_td_flush(struct xe_device *xe);
> > > void xe_device_l2_flush(struct xe_device *xe);
> > > +static inline void xe_device_wedged_get(struct xe_device *xe)
> > > +{
> > > +	atomic_inc(&xe->wedged.ref);
> > > +}
> > > +
> > > +static inline void xe_device_wedged_put(struct xe_device *xe)
> > > +{
> > > +	atomic_dec(&xe->wedged.ref);
> > > +}
> > > +
> > >  static inline bool xe_device_wedged(struct xe_device *xe)
> > >  {
> > > -	return atomic_read(&xe->wedged.flag);
> > > +	return atomic_read(&xe->wedged.ref);
> > >  }
> > > void xe_device_set_wedged_method(struct xe_device *xe, unsigned long method);
> > > diff --git a/drivers/gpu/drm/xe/xe_device_types.h b/drivers/gpu/drm/xe/xe_device_types.h
> > > index 32dd2ffbc796..66e673e4e3e7 100644
> > > --- a/drivers/gpu/drm/xe/xe_device_types.h
> > > +++ b/drivers/gpu/drm/xe/xe_device_types.h
> > > @@ -485,8 +485,10 @@ struct xe_device {
> > > /** @wedged: Struct to control Wedged States and mode */
> > > struct {
> > > - /** @wedged.flag: Xe device faced a critical error and is now blocked. */
> > > - atomic_t flag;
> > > + /** @wedged.fini: Needs cleanup on fini */
> > > + atomic_t fini;
> > to me it is easier to see this as a 'permanent' flag than
> > a flag to 'clean' on 'fini'.
> >
> > but up to you...
> >
> > either way:
> >
> > Reviewed-by: Rodrigo Vivi <rodrigo.vivi@intel.com>
> >
> > > + /** @wedged.ref: Refcount for wedged device, blocks critical path execution */
> > > + atomic_t ref;
> > > /** @wedged.mode: Mode controlled by kernel parameter and debugfs */
> > > enum xe_wedged_mode mode;
> > > /** @wedged.method: Recovery method to be sent in the drm device wedged uevent */
> > > --
> > > 2.43.0
> > >
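
For reference, the intended wedged.ref / wedged.fini semantics from the patch quoted above can be modelled in userspace roughly as follows. This is only a sketch: it assumes C11 <stdatomic.h> instead of the kernel atomic_t API, all model_* names are hypothetical, and the underflow assert is an addition for illustration that the kernel helpers do not have.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>

/* Hypothetical model of the two fields; this is NOT the real xe->wedged struct. */
struct model_wedged {
	atomic_int ref;		/* nonzero blocks critical path execution */
	atomic_int fini;	/* one-shot cleanup indicator, independent of ref */
};

static void model_wedged_init(struct model_wedged *w)
{
	atomic_init(&w->ref, 0);
	atomic_init(&w->fini, 0);
}

/* Mirrors xe_device_wedged_get(): take a wedged reference. */
static void model_wedged_get(struct model_wedged *w)
{
	atomic_fetch_add(&w->ref, 1);
}

/* Mirrors xe_device_wedged_put(), with an added underflow check. */
static void model_wedged_put(struct model_wedged *w)
{
	int old = atomic_fetch_sub(&w->ref, 1);

	assert(old > 0);
}

/* Mirrors xe_device_wedged(): wedged while any reference is held. */
static bool model_wedged(struct model_wedged *w)
{
	return atomic_load(&w->ref) != 0;
}

/* Permanent declaration: takes one reference that is never dropped. */
static void model_declare_permanent(struct model_wedged *w)
{
	if (!atomic_exchange(&w->fini, 1))
		model_wedged_get(w);
}
```

A temporary user pairs model_wedged_get() with model_wedged_put() and the device reads as unwedged again afterwards. After model_declare_permanent(), the count never returns to zero, however many temporary get/put pairs follow.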