Intel-XE Archive on lore.kernel.org
* [PATCH 0/5] drm/xe: Fix LR exec queue suspend/resume for S3/S4
From: Thomas Hellström @ 2026-05-22 16:43 UTC
  To: intel-xe; +Cc: Thomas Hellström

Long Running (LR) exec queues, used by compute workloads in fault mode
(SVM) and in preempt-fence mode, did not survive S3/S4 suspend/resume
correctly.  This series addresses five distinct problems:

1. Exec queue scheduler start was not deferred during resume: user exec
   queue schedulers were started before page table BOs and LRC BOs were
   restored.  A job in this window would cause GuC to load a context
   from stale or invalid VRAM.  Starting user exec queue schedulers is
   now deferred until after page tables and LRC BOs are restored.
   Migrate and kernel VM queues are still started immediately as they
   are required by the restore process itself.

2. Exec queue suspend/resume lacked coordination when multiple paths
   (PM, mode switching, preempt fences) needed to hold the queue
   suspended simultaneously.  A resume from one path could prematurely
   re-enable a queue still held suspended by another.  Each caller can
   now independently hold a suspend; the queue resumes only when all
   callers have released it.

3. During PM suspend, any user exec queue with a started-but-incomplete
   job was banned.  For LR queues this is always true — their jobs are
   designed to run indefinitely — so every PM suspend permanently
   banned the queue.  The ban is now suppressed for LR VM exec queues
   during PM suspend or hibernation while being preserved for GT reset
   (legitimate hang detection).

4. The execution mode constant EXEC_MODE_LR in xe_hw_engine_group was
   misleading since not all long-running queues use fault mode.  It is
   renamed to EXEC_MODE_FAULT.  No functional change.

5. Fault-mode (SVM) VMs use GPU page faults to access memory.  A
   running fault-mode job can re-fault pages torn down by VRAM
   eviction, racing with the eviction.  Fault-mode exec queues are now
   suspended and drained before any VRAM eviction begins.  On resume,
   they are re-registered and restarted once hardware is restored.
   Exec queues created concurrently with PM suspend are immediately
   suspended so the resume path picks them up.
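
The suspend coordination described in item 2 can be pictured as a
per-queue hold count.  What follows is a minimal user-space sketch of
that idea only, not the code in this series: struct model_queue,
model_queue_suspend() and model_queue_resume() are hypothetical names,
and the sketch assumes a single thread, so it leaves out the locking
and the asynchronous GuC handshake a real driver would need.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical stand-in for an exec queue; not struct xe_exec_queue. */
struct model_queue {
	unsigned int suspend_count;	/* callers currently holding a suspend */
	bool running;			/* true while scheduling is enabled */
};

/* Take one suspend hold.  Only the first hold stops the queue. */
static void model_queue_suspend(struct model_queue *q)
{
	if (q->suspend_count++ == 0)
		q->running = false;
}

/* Drop one suspend hold.  Only the last release restarts the queue. */
static void model_queue_resume(struct model_queue *q)
{
	assert(q->suspend_count > 0);	/* unbalanced resume is a caller bug */
	if (--q->suspend_count == 0)
		q->running = true;
}
```

With this model, if the PM path and a mode switch both call
model_queue_suspend(), a model_queue_resume() from the mode switch
leaves the queue stopped; it restarts only when the PM path also
releases its hold.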

Note: A prerequisite revert ("Revert drm/xe: Skip exec queue schedule
toggle if queue is idle during suspend") was already sent as a separate
patch and is not included here.

v2:
 - Dropped "Restore userspace LRC BOs early on resume": replaced by
   patch 1/5 which defers user exec queue scheduler start until after
   page tables are restored, achieving the same ordering guarantee.
 - Added patch 1/5: Defer user exec queue scheduler start until after
   page table restore.
 - Added patch 4/5: Rename EXEC_MODE_LR to EXEC_MODE_FAULT.
 - Patch 5/5: see per-patch v2 changelog.

Thomas Hellström (5):
  drm/xe/guc: Defer user exec queue scheduler start until after page
    table restore
  drm/xe/guc: Don't ban LR VM exec queues on PM suspend
  drm/xe/guc: Add suspend refcount to exec queue ops
  drm/xe: Rename EXEC_MODE_LR to EXEC_MODE_FAULT in hw engine group
  drm/xe: Suspend fault-mode LR jobs before VRAM eviction on S3/S4

 drivers/gpu/drm/xe/xe_device_types.h          |   8 +
 drivers/gpu/drm/xe/xe_exec.c                  |   2 +-
 drivers/gpu/drm/xe/xe_exec_queue_types.h      |   7 +
 drivers/gpu/drm/xe/xe_gt.c                    |  16 ++
 drivers/gpu/drm/xe/xe_gt.h                    |   2 +
 drivers/gpu/drm/xe/xe_guc.c                   |  13 ++
 drivers/gpu/drm/xe/xe_guc.h                   |   1 +
 drivers/gpu/drm/xe/xe_guc_exec_queue_types.h  |   7 +
 drivers/gpu/drm/xe/xe_guc_submit.c            | 103 ++++++++++-
 drivers/gpu/drm/xe/xe_guc_submit.h            |   2 +
 drivers/gpu/drm/xe/xe_hw_engine_group.c       | 171 ++++++++++++++++--
 drivers/gpu/drm/xe/xe_hw_engine_group.h       |   3 +
 drivers/gpu/drm/xe/xe_hw_engine_group_types.h |  11 +-
 drivers/gpu/drm/xe/xe_pm.c                    |  26 ++-
 drivers/gpu/drm/xe/xe_uc.c                    |  16 ++
 drivers/gpu/drm/xe/xe_uc.h                    |   1 +
 16 files changed, 357 insertions(+), 32 deletions(-)

-- 
2.54.0


Thread overview: 9+ messages
2026-05-22 16:43 [PATCH 0/5] drm/xe: Fix LR exec queue suspend/resume for S3/S4 Thomas Hellström
2026-05-22 16:43 ` [PATCH 1/5] drm/xe/guc: Defer user exec queue scheduler start until after page table restore Thomas Hellström
2026-05-22 16:43 ` [PATCH 2/5] drm/xe/guc: Don't ban LR VM exec queues on PM suspend Thomas Hellström
2026-05-22 16:43 ` [PATCH 3/5] drm/xe/guc: Add suspend refcount to exec queue ops Thomas Hellström
2026-05-22 16:43 ` [PATCH 4/5] drm/xe: Rename EXEC_MODE_LR to EXEC_MODE_FAULT in hw engine group Thomas Hellström
2026-05-22 16:43 ` [PATCH 5/5] drm/xe: Suspend fault-mode LR jobs before VRAM eviction on S3/S4 Thomas Hellström
2026-05-22 18:46 ` ✓ CI.KUnit: success for drm/xe: Fix LR exec queue suspend/resume for S3/S4 (rev2) Patchwork
2026-05-22 19:23 ` ✗ Xe.CI.BAT: failure " Patchwork
2026-05-23  3:23 ` ✗ Xe.CI.FULL: " Patchwork
