From: "Thomas Hellström" <thomas.hellstrom@linux.intel.com>
To: Matthew Auld <matthew.auld@intel.com>, intel-xe@lists.freedesktop.org
Cc: Matthew Brost <matthew.brost@intel.com>,
Tomasz Lis <tomasz.lis@intel.com>,
Rodrigo Vivi <rodrigo.vivi@intel.com>,
stable@vger.kernel.org,
Francois Dugast <francois.dugast@intel.com>,
Maarten Lankhorst <maarten.lankhorst@linux.intel.com>
Subject: Re: [PATCH v3 2/5] drm/xe/guc: Don't ban LR VM exec queues on PM suspend
Date: Wed, 27 May 2026 12:19:45 +0200
Message-ID: <b3a30dfcf5eb035aca3e7e836b985155c419d9fe.camel@linux.intel.com>
In-Reply-To: <f0df867c-d4f4-4a9f-b2f0-58d05e5f8926@intel.com>
On Tue, 2026-05-26 at 16:38 +0100, Matthew Auld wrote:
> On 25/05/2026 14:30, Thomas Hellström wrote:
> > When xe_guc_submit_stop() is called during an S3/S4 suspend or GT
> > reset, guc_exec_queue_stop() bans any user exec queue that has a job
> > which has started but not yet completed. For normal (non-LR) exec
> > queues this is the correct behaviour: a started-but-incomplete job at
> > reset time may indicate a hung workload.
>
> Is it not too harsh to ban the user job for that? Say you are a
> well-behaved 3D workload and a forced suspend is triggered by the user:
> if you are very unlucky and hit the queue_stop flow with a WIP job, you
> can get banned?
Actually (and I think this also has a bearing on patch 1), suspend /
resume must not sit in the critical path of any dma-fence job. That means
we have to explicitly wait for all outstanding dma-fences before
suspending, and add that wait if it is not already done.
/Thomas
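For illustration only, and not taken from the patch series: a minimal
userspace sketch of the ordering described above, where every outstanding
fence is waited on before the device is marked suspended. The names
model_fence, model_device, model_fence_wait() and model_suspend() are
hypothetical stand-ins, and the "wait" is modelled as simply letting the
remaining work run to completion; the real driver would use the kernel's
dma-fence primitives instead.

```c
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical stand-in for a dma-fence: signals once its work is done. */
struct model_fence {
	unsigned int work_left;	/* units of work left before the fence signals */
};

/* Hypothetical device model; this is not the xe_device layout. */
struct model_device {
	struct model_fence *fences;	/* outstanding fences */
	size_t nr_fences;
	bool suspended;
};

static bool model_fence_signaled(const struct model_fence *f)
{
	return f->work_left == 0;
}

/* Model of a blocking wait: let the job run until its fence signals. */
static void model_fence_wait(struct model_fence *f)
{
	while (!model_fence_signaled(f))
		f->work_left--;
}

/* Drain every outstanding fence first, and only then enter suspend. */
static void model_suspend(struct model_device *dev)
{
	for (size_t i = 0; i < dev->nr_fences; i++)
		model_fence_wait(&dev->fences[i]);

	dev->suspended = true;
}
```

With this ordering no fence-backed job is still in flight when the queues
are stopped, so in the model the suspend path never observes a
started-but-incomplete job.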
>
> >
> > For exec queues attached to Long Running (LR) VMs the same condition
> > is always true during normal operation: LR jobs are designed to run
> > indefinitely and are never "completed" in the DRM scheduler sense —
> > they are preempted and resumed via the preempt-fence mechanism.
> > Banning such an exec queue on PM suspend permanently prevents the job
> > from restarting after resume, causing the userspace compute workload
> > to fail silently.
> >
> > Fix this by not banning LR VM exec queues when a system suspend or
> > hibernation is in progress, while preserving the ban for GT reset where
> > a started-but-incomplete job is a legitimate indicator of a hang.
> >
> > Fixes: f6375fb3aa94 ("drm/xe: Track LR jobs in DRM scheduler pending list")
> > Cc: Matthew Brost <matthew.brost@intel.com>
> > Cc: Tomasz Lis <tomasz.lis@intel.com>
> > Cc: Rodrigo Vivi <rodrigo.vivi@intel.com>
> > Cc: <stable@vger.kernel.org> # v6.19+
> > Signed-off-by: Thomas Hellström <thomas.hellstrom@linux.intel.com>
> > Assisted-by: GitHub_Copilot:claude-sonnet-4.6
> > ---
> > drivers/gpu/drm/xe/xe_device_types.h | 8 ++++++++
> > drivers/gpu/drm/xe/xe_guc_submit.c | 10 +++++++++-
> > drivers/gpu/drm/xe/xe_pm.c | 5 ++++-
> > 3 files changed, 21 insertions(+), 2 deletions(-)
> >
> > diff --git a/drivers/gpu/drm/xe/xe_device_types.h b/drivers/gpu/drm/xe/xe_device_types.h
> > index 32dd2ffbc796..9dbf7b3a0c49 100644
> > --- a/drivers/gpu/drm/xe/xe_device_types.h
> > +++ b/drivers/gpu/drm/xe/xe_device_types.h
> > @@ -433,6 +433,14 @@ struct xe_device {
> > struct notifier_block pm_notifier;
> > /** @pm_block: Completion to block validating tasks on suspend / hibernate prepare */
> > struct completion pm_block;
> > + /**
> > + * @pm_suspend_in_progress: True while the device is going through
> > + * system suspend or hibernation (set at xe_pm_suspend() entry, cleared
> > + * at xe_pm_resume() entry or on suspend error). Used to suppress exec
> > + * queue bans that should only apply during GT reset, not PM suspend.
> > + * Serialised by the PM suspend sequence; no lock required.
> > + */
> > + bool pm_suspend_in_progress;
> > /** @rebind_resume_list: List of wq items to kick on resume. */
> > struct list_head rebind_resume_list;
> > /** @rebind_resume_lock: Lock to protect the rebind_resume_list */
> > diff --git a/drivers/gpu/drm/xe/xe_guc_submit.c b/drivers/gpu/drm/xe/xe_guc_submit.c
> > index 2b8b316c0ca3..f1a6f13011b5 100644
> > --- a/drivers/gpu/drm/xe/xe_guc_submit.c
> > +++ b/drivers/gpu/drm/xe/xe_guc_submit.c
> > @@ -2268,8 +2268,16 @@ static void guc_exec_queue_stop(struct xe_guc *guc, struct xe_exec_queue *q)
> > * Ban any engine (aside from kernel and engines used for VM ops) with a
> > * started but not complete job or if a job has gone through a GT reset
> > * more than twice.
> > + *
> > + * LR VM exec queues are excluded from this ban during PM suspend: their
> > + * jobs are intentionally long-running and are preempted and resumed via
> > + * the preempt-fence mechanism. Banning them on PM suspend would
> > + * permanently prevent the job from restarting after resume.
> > + * On GT reset however we do want to ban them, as that may indicate a
> > + * genuinely hung workload.
> > */
> > - if (!(q->flags & (EXEC_QUEUE_FLAG_KERNEL | EXEC_QUEUE_FLAG_VM))) {
> > + if (!(q->flags & (EXEC_QUEUE_FLAG_KERNEL | EXEC_QUEUE_FLAG_VM)) &&
> > + !(q->vm && xe_vm_in_lr_mode(q->vm) && guc_to_xe(guc)->pm_suspend_in_progress)) {
> > struct xe_sched_job *job = xe_sched_first_pending_job(sched);
> > bool ban = false;
> >
> > diff --git a/drivers/gpu/drm/xe/xe_pm.c b/drivers/gpu/drm/xe/xe_pm.c
> > index c203a59d7000..76d211986822 100644
> > --- a/drivers/gpu/drm/xe/xe_pm.c
> > +++ b/drivers/gpu/drm/xe/xe_pm.c
> > @@ -176,6 +176,7 @@ int xe_pm_suspend(struct xe_device *xe)
> > int err;
> >
> > drm_dbg(&xe->drm, "Suspending device\n");
> > + xe->pm_suspend_in_progress = true;
> > xe_pm_block_begin_signalling();
> > trace_xe_pm_suspend(xe, __builtin_return_address(0));
> >
> > @@ -217,6 +218,7 @@ int xe_pm_suspend(struct xe_device *xe)
> > xe_pxp_pm_resume(xe->pxp);
> > err:
> > drm_dbg(&xe->drm, "Device suspend failed %d\n", err);
> > + xe->pm_suspend_in_progress = false;
> > xe_pm_block_end_signalling();
> > return err;
> > }
> > @@ -234,8 +236,9 @@ int xe_pm_resume(struct xe_device *xe)
> > u8 id;
> > int err;
> >
> > - xe_pm_block_begin_signalling();
> > + xe->pm_suspend_in_progress = false;
> > drm_dbg(&xe->drm, "Resuming device\n");
> > + xe_pm_block_begin_signalling();
> > trace_xe_pm_resume(xe, __builtin_return_address(0));
> >
> > for_each_gt(gt, xe, id)
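
As a reading aid, the condition added to guc_exec_queue_stop() in the quoted
hunk can be modelled in plain userspace C. This is only a sketch under stated
assumptions: the MODEL_* flag values, struct model_queue and
model_queue_ban_eligible() are hypothetical names that mirror the shape of
the patched if-condition, not the driver's actual definitions.

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical flag bits; the values are illustrative, not the driver's. */
#define MODEL_QUEUE_FLAG_KERNEL	(1u << 0)
#define MODEL_QUEUE_FLAG_VM	(1u << 1)

/* Hypothetical, flattened view of the state the patched condition reads. */
struct model_queue {
	uint32_t flags;
	bool has_vm;		/* models q->vm != NULL */
	bool vm_in_lr_mode;	/* models xe_vm_in_lr_mode(q->vm) */
};

/*
 * Returns true if the queue is still subject to the
 * started-but-incomplete ban check when it is stopped.
 */
static bool model_queue_ban_eligible(const struct model_queue *q,
				     bool pm_suspend_in_progress)
{
	/* Kernel queues and VM-op queues are never considered. */
	if (q->flags & (MODEL_QUEUE_FLAG_KERNEL | MODEL_QUEUE_FLAG_VM))
		return false;

	/* New in the patch: skip LR VM queues, but only during PM suspend. */
	if (q->has_vm && q->vm_in_lr_mode && pm_suspend_in_progress)
		return false;

	return true;
}
```

In this model an LR VM queue is exempt only while pm_suspend_in_progress is
set, so a GT reset outside suspend still reaches the ban check, and non-LR
user queues are unaffected by the new term.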
Thread overview: 16+ messages
2026-05-25 13:30 [PATCH v3 0/5] drm/xe: Fix LR exec queue suspend/resume for S3/S4 Thomas Hellström
2026-05-25 13:30 ` [PATCH v3 1/5] drm/xe/guc: Defer user exec queue scheduler start until after page table restore Thomas Hellström
2026-05-26 15:27 ` Matthew Auld
2026-05-27 10:15 ` Thomas Hellström
2026-05-25 13:30 ` [PATCH v3 2/5] drm/xe/guc: Don't ban LR VM exec queues on PM suspend Thomas Hellström
2026-05-26 15:38 ` Matthew Auld
2026-05-27 10:19 ` Thomas Hellström [this message]
2026-05-27 16:35 ` Matthew Auld
2026-05-25 13:30 ` [PATCH v3 3/5] drm/xe/guc: Add suspend refcount to exec queue ops Thomas Hellström
2026-05-25 13:30 ` [PATCH v3 4/5] drm/xe: Rename EXEC_MODE_LR to EXEC_MODE_FAULT in hw engine group Thomas Hellström
2026-05-28 17:06 ` Rodrigo Vivi
2026-05-28 17:33 ` Francois Dugast
2026-05-25 13:30 ` [PATCH v3 5/5] drm/xe: Suspend fault-mode LR jobs before VRAM eviction on S3/S4 Thomas Hellström
2026-05-25 15:56 ` ✓ CI.KUnit: success for drm/xe: Fix LR exec queue suspend/resume for S3/S4 (rev3) Patchwork
2026-05-25 16:35 ` ✓ Xe.CI.BAT: " Patchwork
2026-05-25 20:38 ` ✗ Xe.CI.FULL: failure " Patchwork