From: "Thomas Hellström" <thomas.hellstrom@linux.intel.com>
To: intel-xe@lists.freedesktop.org
Cc: "Thomas Hellström" <thomas.hellstrom@linux.intel.com>
Subject: [PATCH 0/5] drm/xe: Fix LR exec queue suspend/resume for S3/S4
Date: Fri, 22 May 2026 18:43:50 +0200
Message-ID: <20260522164355.2773-1-thomas.hellstrom@linux.intel.com>

Long Running (LR) exec queues, used by compute workloads with SVM
(fault mode) and by preempt-fence mode, were not surviving S3/S4
suspend/resume correctly. This series addresses five distinct problems:

1. Exec queue scheduler start during resume was not deferred: user exec
   queue schedulers were started before page table BOs and LRC BOs were
   restored. A job in this window would cause GuC to load a context
   from stale or invalid VRAM. User exec queue schedulers are now
   deferred until after page tables and LRC BOs are restored. Migrate
   and kernel VM queues are still started immediately as they are
   required by the restore process itself.

2. Exec queue suspend/resume lacked coordination when multiple paths
   (PM, mode switching, preempt fences) needed to hold the queue
   suspended simultaneously. A resume from one path could prematurely
   re-enable a queue still held suspended by another. Each caller can
   now independently hold a suspend; the queue resumes only when all
   callers have released it.

3. During PM suspend, any user exec queue with a started-but-incomplete
   job was banned. For LR queues this is always true, because their
   jobs are designed to run indefinitely, so every PM suspend
   permanently banned the queue. The ban is now suppressed for LR VM
   exec queues during PM suspend or hibernation while being preserved
   for GT reset (legitimate hang detection).

4. The execution mode constant EXEC_MODE_LR in xe_hw_engine_group was
   misleading since not all long-running queues use fault mode. It is
   renamed to EXEC_MODE_FAULT. No functional change.

5. Fault-mode (SVM) VMs use GPU page faults to access memory. A
   running fault-mode job can re-fault pages torn down by VRAM
   eviction, racing with the eviction. Fault-mode exec queues are now
   suspended and drained before any VRAM eviction begins. On resume,
   they are re-registered and restarted once hardware is restored.
   Exec queues created concurrently with PM suspend are immediately
   suspended so the resume path picks them up.
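
As a rough illustration of points 2 and 3 only, here is a minimal
userspace sketch. It is not code from this series: every name in it
(sketch_queue, sketch_suspend_get(), sketch_suspend_put(),
sketch_should_ban(), the SKETCH_STOP_* reasons) is hypothetical, and the
locking and GuC interaction a real driver needs are left out. It shows
a per-queue count of suspend holders, where the queue only runs again
once the last holder releases it, and a ban decision that skips
long-running queues when the stop reason is PM rather than GT reset.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical stand-in for an exec queue; not the driver's real struct. */
struct sketch_queue {
	unsigned int suspend_count;	/* callers currently holding a suspend */
	bool running;			/* true while the queue may be scheduled */
	bool long_running;		/* LR queue: jobs never complete by design */
};

/* Hypothetical reasons for stopping a queue. */
enum sketch_stop_reason {
	SKETCH_STOP_PM,		/* S3/S4 suspend or hibernation */
	SKETCH_STOP_GT_RESET,	/* hang recovery */
};

static void sketch_queue_init(struct sketch_queue *q, bool long_running)
{
	q->suspend_count = 0;
	q->running = true;
	q->long_running = long_running;
}

/* Take a suspend hold; only the first holder actually stops the queue. */
static void sketch_suspend_get(struct sketch_queue *q)
{
	if (q->suspend_count++ == 0)
		q->running = false;
}

/* Drop a suspend hold; only the last holder actually restarts the queue. */
static void sketch_suspend_put(struct sketch_queue *q)
{
	assert(q->suspend_count > 0);
	if (--q->suspend_count == 0)
		q->running = true;
}

/*
 * Decide whether a queue with a started-but-incomplete job is banned.
 * An LR queue always has such a job, so it is spared on PM suspend but
 * still banned on GT reset, where the unfinished job may be a real hang.
 */
static bool sketch_should_ban(const struct sketch_queue *q,
			      enum sketch_stop_reason reason,
			      bool job_started_incomplete)
{
	if (!job_started_incomplete)
		return false;
	if (q->long_running && reason == SKETCH_STOP_PM)
		return false;
	return true;
}
```

With two holders (say PM and a mode switch), the first
sketch_suspend_put() leaves the queue stopped and the second one lets it
run again, which is the property point 2 describes.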
Note: A prerequisite revert ("Revert drm/xe: Skip exec queue schedule
toggle if queue is idle during suspend") was already sent as a separate
patch and is not included here.
v2:
- Dropped "Restore userspace LRC BOs early on resume": replaced by
patch 1/5 which defers user exec queue scheduler start until after
page tables are restored, achieving the same ordering guarantee.
- Added patch 1/5: Defer user exec queue scheduler start until after
page table restore.
- Added patch 4/5: Rename EXEC_MODE_LR to EXEC_MODE_FAULT.
- Patch 5/5: see per-patch v2 changelog.

Thomas Hellström (5):
  drm/xe/guc: Defer user exec queue scheduler start until after page
    table restore
  drm/xe/guc: Don't ban LR VM exec queues on PM suspend
  drm/xe/guc: Add suspend refcount to exec queue ops
  drm/xe: Rename EXEC_MODE_LR to EXEC_MODE_FAULT in hw engine group
  drm/xe: Suspend fault-mode LR jobs before VRAM eviction on S3/S4

 drivers/gpu/drm/xe/xe_device_types.h          |   8 +
 drivers/gpu/drm/xe/xe_exec.c                  |   2 +-
 drivers/gpu/drm/xe/xe_exec_queue_types.h      |   7 +
 drivers/gpu/drm/xe/xe_gt.c                    |  16 ++
 drivers/gpu/drm/xe/xe_gt.h                    |   2 +
 drivers/gpu/drm/xe/xe_guc.c                   |  13 ++
 drivers/gpu/drm/xe/xe_guc.h                   |   1 +
 drivers/gpu/drm/xe/xe_guc_exec_queue_types.h  |   7 +
 drivers/gpu/drm/xe/xe_guc_submit.c            | 103 ++++++++++-
 drivers/gpu/drm/xe/xe_guc_submit.h            |   2 +
 drivers/gpu/drm/xe/xe_hw_engine_group.c       | 171 ++++++++++++++++--
 drivers/gpu/drm/xe/xe_hw_engine_group.h       |   3 +
 drivers/gpu/drm/xe/xe_hw_engine_group_types.h |  11 +-
 drivers/gpu/drm/xe/xe_pm.c                    |  26 ++-
 drivers/gpu/drm/xe/xe_uc.c                    |  16 ++
 drivers/gpu/drm/xe/xe_uc.h                    |   1 +
 16 files changed, 357 insertions(+), 32 deletions(-)
--
2.54.0

Thread overview: 9+ messages
2026-05-22 16:43 Thomas Hellström [this message]
2026-05-22 16:43 ` [PATCH 1/5] drm/xe/guc: Defer user exec queue scheduler start until after page table restore Thomas Hellström
2026-05-22 16:43 ` [PATCH 2/5] drm/xe/guc: Don't ban LR VM exec queues on PM suspend Thomas Hellström
2026-05-22 16:43 ` [PATCH 3/5] drm/xe/guc: Add suspend refcount to exec queue ops Thomas Hellström
2026-05-22 16:43 ` [PATCH 4/5] drm/xe: Rename EXEC_MODE_LR to EXEC_MODE_FAULT in hw engine group Thomas Hellström
2026-05-22 16:43 ` [PATCH 5/5] drm/xe: Suspend fault-mode LR jobs before VRAM eviction on S3/S4 Thomas Hellström
2026-05-22 18:46 ` ✓ CI.KUnit: success for drm/xe: Fix LR exec queue suspend/resume for S3/S4 (rev2) Patchwork
2026-05-22 19:23 ` ✗ Xe.CI.BAT: failure " Patchwork
2026-05-23 3:23 ` ✗ Xe.CI.FULL: " Patchwork