Intel-XE Archive on lore.kernel.org
* [PATCH 0/4] drm/xe: Fix LR exec queue suspend/resume for S3/S4
@ 2026-05-21 14:48 Thomas Hellström
  2026-05-21 14:48 ` [PATCH 1/4] drm/xe/guc: Add suspend refcount to exec queue ops Thomas Hellström
                   ` (4 more replies)
  0 siblings, 5 replies; 10+ messages in thread
From: Thomas Hellström @ 2026-05-21 14:48 UTC
  To: intel-xe
  Cc: Thomas Hellström, Matthew Brost, Francois Dugast,
	Matthew Auld, Rodrigo Vivi, Maarten Lankhorst

Long Running (LR) exec queues, used by compute workloads in fault mode
(SVM) and in preempt-fence mode, did not survive S3/S4 suspend/resume
correctly.  This series addresses four distinct problems:

1. Exec queue ops (guc_exec_queue_suspend/resume) lacked coordination
   when multiple paths (PM, mode switching, preempt fences) needed to
   hold the queue suspended simultaneously.  A suspend refcount ensures
   the GuC SUSPEND message is only sent when the first caller suspends,
   and the RESUME message only when the last caller resumes.

2. During PM suspend, guc_exec_queue_stop() banned any user exec queue
   that had a started-but-incomplete job.  For LR queues this is always
   true — their jobs are designed to run indefinitely — so every PM
   suspend permanently banned the queue.  The ban is now suppressed for
   LR VM exec queues during PM suspend or hibernation while being
   preserved for GT reset (legitimate hang detection).

3. Userspace LRC buffer objects carried XE_BO_FLAG_PINNED_LATE_RESTORE,
   deferring their VRAM restore to after xe_gt_resume().  However,
   xe_gt_resume() drives context registration, which requires valid LRC
   VRAM.  Dropping the flag moves the restore to xe_bo_restore_early(),
   a CPU/BAR copy that runs before xe_gt_resume(), fixing the ordering.

4. Fault-mode (SVM) VMs use GPU page faults to access memory.  A
   running fault-mode job can re-fault pages torn down by VRAM eviction,
   racing with the eviction.  A new xe_suspend_all_faulting_lr_jobs()
   call in the PM notifier stops all fault-mode queues and waits for GuC
   acknowledgement before eviction begins.  On resume,
   xe_resume_all_faulting_lr_jobs() mirrors the same iteration to
   re-register and resume exactly those queues.  A per-group pm_suspended
   flag (protected by mode_sem) prevents new fault-mode exec queues from
   slipping through unsuspended while PM suspend is in progress.
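
The suspend refcount described in item 1 can be sketched as a small
userspace model.  This is only an illustration, assuming a plain counter
per queue: the struct, its fields and the model_* function names are
hypothetical and do not come from the series, and the locking and GuC
message plumbing the driver needs are left out.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical model of one exec queue; not the driver's data structure. */
struct model_queue {
	unsigned int suspend_count;	/* callers currently holding the queue suspended */
	unsigned int suspend_msgs;	/* SUSPEND messages "sent" to the firmware */
	unsigned int resume_msgs;	/* RESUME messages "sent" to the firmware */
};

/*
 * Take a suspend reference. Only the 0 -> 1 transition sends SUSPEND.
 * Returns true if this call sent the message.
 */
static bool model_queue_suspend(struct model_queue *q)
{
	if (q->suspend_count++ == 0) {
		q->suspend_msgs++;
		return true;
	}
	return false;
}

/*
 * Drop a suspend reference. Only the 1 -> 0 transition sends RESUME.
 * Returns true if this call sent the message.
 */
static bool model_queue_resume(struct model_queue *q)
{
	assert(q->suspend_count > 0);	/* unbalanced resume is a caller bug */
	if (--q->suspend_count == 0) {
		q->resume_msgs++;
		return true;
	}
	return false;
}
```

With this shape, a PM suspend that overlaps a mode switch sends one
SUSPEND and one RESUME in total, regardless of the order in which the
two paths release the queue.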

Note: A prerequisite revert ("Revert drm/xe: Skip exec queue schedule
toggle if queue is idle during suspend") was already sent as a separate
patch and is not included here.
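
For the ban policy in item 2 above, the intended decision can be written
as a single predicate.  This is a sketch under stated assumptions, not
the code in guc_exec_queue_stop(): the enum, the struct and
model_should_ban() are hypothetical names, and the user/kernel queue
distinction is left out.

```c
#include <stdbool.h>

/* Hypothetical reasons for stopping a queue; not the driver's enum. */
enum model_stop_reason {
	MODEL_STOP_GT_RESET,	/* hang recovery: banning stays in effect */
	MODEL_STOP_PM_SUSPEND,	/* S3/S4: LR VM queues must not be banned */
};

/* Hypothetical snapshot of the queue state that matters for the decision. */
struct model_queue_state {
	bool is_lr_vm_queue;			/* queue belongs to an LR VM */
	bool has_started_incomplete_job;	/* a job started but has not completed */
};

/* Decide whether stopping the queue should also ban it. */
static bool model_should_ban(const struct model_queue_state *s,
			     enum model_stop_reason reason)
{
	/* Nothing in flight: no reason to ban. */
	if (!s->has_started_incomplete_job)
		return false;

	/* LR jobs are always in flight, so PM suspend must not count as a hang. */
	if (s->is_lr_vm_queue && reason == MODEL_STOP_PM_SUSPEND)
		return false;

	return true;
}
```

The GT reset case is unchanged by the LR check, so a queue with a
started-but-incomplete job is still banned there.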

Thomas Hellström (4):
  drm/xe/guc: Add suspend refcount to exec queue ops
  drm/xe/guc: Don't ban LR VM exec queues on PM suspend
  drm/xe: Restore userspace LRC BOs early on resume
  drm/xe: Suspend fault-mode LR jobs before VRAM eviction on S3/S4

 drivers/gpu/drm/xe/xe_exec_queue_types.h      |   7 +
 drivers/gpu/drm/xe/xe_guc_exec_queue_types.h  |   7 +
 drivers/gpu/drm/xe/xe_guc_submit.c            |  60 +++++--
 drivers/gpu/drm/xe/xe_guc_submit.h            |   1 +
 drivers/gpu/drm/xe/xe_hw_engine_group.c       | 158 +++++++++++++++++-
 drivers/gpu/drm/xe/xe_hw_engine_group.h       |   3 +
 drivers/gpu/drm/xe/xe_hw_engine_group_types.h |   7 +
 drivers/gpu/drm/xe/xe_lrc.c                   |   2 +-
 drivers/gpu/drm/xe/xe_pm.c                    |  15 +-
 9 files changed, 239 insertions(+), 21 deletions(-)

-- 
2.54.0



Thread overview: 10+ messages
2026-05-21 14:48 [PATCH 0/4] drm/xe: Fix LR exec queue suspend/resume for S3/S4 Thomas Hellström
2026-05-21 14:48 ` [PATCH 1/4] drm/xe/guc: Add suspend refcount to exec queue ops Thomas Hellström
2026-05-21 14:48 ` [PATCH 2/4] drm/xe/guc: Don't ban LR VM exec queues on PM suspend Thomas Hellström
2026-05-21 14:48 ` [PATCH 3/4] drm/xe: Restore userspace LRC BOs early on resume Thomas Hellström
2026-05-21 16:09   ` Matthew Auld
2026-05-21 16:31     ` Thomas Hellström
2026-05-22  9:51       ` Thomas Hellström
2026-05-22 10:05         ` Matthew Auld
2026-05-21 14:48 ` [PATCH 4/4] drm/xe: Suspend fault-mode LR jobs before VRAM eviction on S3/S4 Thomas Hellström
2026-05-21 14:56 ` ✓ CI.KUnit: success for drm/xe: Fix LR exec queue suspend/resume for S3/S4 Patchwork
