Intel-XE Archive on lore.kernel.org
From: "Thomas Hellström" <thomas.hellstrom@linux.intel.com>
To: intel-xe@lists.freedesktop.org
Cc: "Thomas Hellström" <thomas.hellstrom@linux.intel.com>,
	"Matthew Brost" <matthew.brost@intel.com>,
	"Francois Dugast" <francois.dugast@intel.com>,
	"Matthew Auld" <matthew.auld@intel.com>,
	"Rodrigo Vivi" <rodrigo.vivi@intel.com>,
	"Maarten Lankhorst" <maarten.lankhorst@linux.intel.com>
Subject: [PATCH 0/4] drm/xe: Fix LR exec queue suspend/resume for S3/S4
Date: Thu, 21 May 2026 16:48:33 +0200
Message-ID: <20260521144837.7363-1-thomas.hellstrom@linux.intel.com>

Long Running (LR) exec queues, used by compute workloads with SVM
(fault mode) and by preempt-fence-mode workloads, were not surviving
S3/S4 suspend/resume correctly.  Four distinct problems are addressed:

1. Exec queue ops (guc_exec_queue_suspend/resume) lacked coordination
   when multiple paths (PM, mode switching, preempt fences) needed to
   hold the queue suspended simultaneously.  A suspend refcount ensures
   the GuC SUSPEND message is only sent when the first caller suspends,
   and the RESUME message only when the last caller resumes.

2. During PM suspend, guc_exec_queue_stop() banned any user exec queue
   that had a started-but-incomplete job.  For LR queues this is always
   true — their jobs are designed to run indefinitely — so every PM
   suspend permanently banned the queue.  The ban is now suppressed for
   LR VM exec queues during PM suspend or hibernation while being
   preserved for GT reset (legitimate hang detection).

3. Userspace LRC buffer objects carried XE_BO_FLAG_PINNED_LATE_RESTORE,
   deferring their VRAM restore to after xe_gt_resume().  However,
   xe_gt_resume() drives context registration, which requires valid LRC
   VRAM.  Dropping the flag moves the restore to xe_bo_restore_early(),
   a CPU/BAR copy that runs before xe_gt_resume(), fixing the ordering.

4. Fault-mode (SVM) VMs use GPU page faults to access memory.  A
   running fault-mode job can re-fault pages torn down by VRAM eviction,
   racing with the eviction.  A new xe_suspend_all_faulting_lr_jobs()
   call in the PM notifier stops all fault-mode queues and waits for GuC
   acknowledgement before eviction begins.  On resume,
   xe_resume_all_faulting_lr_jobs() mirrors the same iteration to
   re-register and resume exactly those queues.  A per-group pm_suspended
   flag (protected by mode_sem) prevents new fault-mode exec queues from
   slipping through unsuspended while PM suspend is in progress.
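
As an illustration of the suspend refcount described in item 1, the
following is a minimal user-space sketch, not the code from the patch.
The names model_queue, model_queue_suspend() and model_queue_resume()
are hypothetical, the message counters stand in for the GuC SUSPEND and
RESUME messages, and the locking that a real driver needs to serialize
callers is left out.

```c
#include <assert.h>

/* Hypothetical model; these are not the xe driver's structures. */
struct model_queue {
	int suspend_count;	/* callers currently holding the queue suspended */
	int suspend_msgs;	/* stand-in for SUSPEND messages sent to the GuC */
	int resume_msgs;	/* stand-in for RESUME messages sent to the GuC */
};

/* Only the first caller (0 -> 1 transition) triggers a SUSPEND message. */
static void model_queue_suspend(struct model_queue *q)
{
	if (q->suspend_count++ == 0)
		q->suspend_msgs++;
}

/* Only the last caller (1 -> 0 transition) triggers a RESUME message. */
static void model_queue_resume(struct model_queue *q)
{
	/* An unbalanced resume is a caller bug in this model. */
	assert(q->suspend_count > 0);

	if (--q->suspend_count == 0)
		q->resume_msgs++;
}
```

With two overlapping holders, for example PM and a mode switch, the
model sends one SUSPEND when the first holder suspends and one RESUME
only after both holders have resumed.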

Note: A prerequisite revert ("Revert drm/xe: Skip exec queue schedule
toggle if queue is idle during suspend") was already sent as a separate
patch and is not included here.
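
The ban policy from item 2 can be summarized as a small predicate.  This
is only a sketch under stated assumptions: the enum, the struct and
model_should_ban() are hypothetical names invented for illustration, and
the actual check in guc_exec_queue_stop() may be structured differently.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical: why the queue is being stopped. */
enum model_stop_reason {
	MODEL_STOP_GT_RESET,	/* hang detection path, ban is preserved */
	MODEL_STOP_PM_SUSPEND,	/* S3/S4 suspend or hibernation */
};

/* Hypothetical snapshot of the queue state relevant to the decision. */
struct model_queue_state {
	bool is_user_queue;	/* only user exec queues are ban candidates */
	bool is_lr_vm;		/* queue belongs to a Long Running VM */
	bool job_started;
	bool job_completed;
};

static bool model_should_ban(const struct model_queue_state *q,
			     enum model_stop_reason reason)
{
	bool incomplete;

	assert(q != NULL);
	incomplete = q->job_started && !q->job_completed;

	/* No started-but-incomplete job on a user queue: nothing to ban. */
	if (!q->is_user_queue || !incomplete)
		return false;

	/* LR jobs never complete by design, so PM suspend must not ban them. */
	if (q->is_lr_vm && reason == MODEL_STOP_PM_SUSPEND)
		return false;

	return true;
}
```

In this model an LR VM queue with a running job is left alone on PM
suspend but is still banned on GT reset, while a non-LR user queue with
an incomplete job is banned in both cases.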

Thomas Hellström (4):
  drm/xe/guc: Add suspend refcount to exec queue ops
  drm/xe/guc: Don't ban LR VM exec queues on PM suspend
  drm/xe: Restore userspace LRC BOs early on resume
  drm/xe: Suspend fault-mode LR jobs before VRAM eviction on S3/S4

 drivers/gpu/drm/xe/xe_exec_queue_types.h      |   7 +
 drivers/gpu/drm/xe/xe_guc_exec_queue_types.h  |   7 +
 drivers/gpu/drm/xe/xe_guc_submit.c            |  60 +++++--
 drivers/gpu/drm/xe/xe_guc_submit.h            |   1 +
 drivers/gpu/drm/xe/xe_hw_engine_group.c       | 158 +++++++++++++++++-
 drivers/gpu/drm/xe/xe_hw_engine_group.h       |   3 +
 drivers/gpu/drm/xe/xe_hw_engine_group_types.h |   7 +
 drivers/gpu/drm/xe/xe_lrc.c                   |   2 +-
 drivers/gpu/drm/xe/xe_pm.c                    |  15 +-
 9 files changed, 239 insertions(+), 21 deletions(-)

-- 
2.54.0



Thread overview: 10+ messages
2026-05-21 14:48 Thomas Hellström [this message]
2026-05-21 14:48 ` [PATCH 1/4] drm/xe/guc: Add suspend refcount to exec queue ops Thomas Hellström
2026-05-21 14:48 ` [PATCH 2/4] drm/xe/guc: Don't ban LR VM exec queues on PM suspend Thomas Hellström
2026-05-21 14:48 ` [PATCH 3/4] drm/xe: Restore userspace LRC BOs early on resume Thomas Hellström
2026-05-21 16:09   ` Matthew Auld
2026-05-21 16:31     ` Thomas Hellström
2026-05-22  9:51       ` Thomas Hellström
2026-05-22 10:05         ` Matthew Auld
2026-05-21 14:48 ` [PATCH 4/4] drm/xe: Suspend fault-mode LR jobs before VRAM eviction on S3/S4 Thomas Hellström
2026-05-21 14:56 ` ✓ CI.KUnit: success for drm/xe: Fix LR exec queue suspend/resume for S3/S4 Patchwork
