Linux kernel -stable discussions
From: "Thomas Hellström" <thomas.hellstrom@linux.intel.com>
To: intel-xe@lists.freedesktop.org
Cc: "Thomas Hellström" <thomas.hellstrom@linux.intel.com>,
	"Matthew Brost" <matthew.brost@intel.com>,
	"Tomasz Lis" <tomasz.lis@intel.com>,
	"Rodrigo Vivi" <rodrigo.vivi@intel.com>,
	stable@vger.kernel.org
Subject: [PATCH 2/5] drm/xe/guc: Don't ban LR VM exec queues on PM suspend
Date: Fri, 22 May 2026 18:43:52 +0200
Message-ID: <20260522164355.2773-3-thomas.hellstrom@linux.intel.com>
In-Reply-To: <20260522164355.2773-1-thomas.hellstrom@linux.intel.com>

When xe_guc_submit_stop() is called during an S3/S4 suspend or GT
reset, guc_exec_queue_stop() bans any user exec queue that has a job
which has started but not yet completed.  For normal (non-LR) exec
queues this is the correct behaviour: a started-but-incomplete job at
reset time may indicate a hung workload.

For exec queues attached to Long Running (LR) VMs the same condition
is always true during normal operation: LR jobs are designed to run
indefinitely and are never "completed" in the DRM scheduler sense.
Instead they are preempted and resumed via the preempt-fence mechanism.
Banning such an exec queue on PM suspend permanently prevents the job
from restarting after resume, causing the userspace compute workload to
fail silently.

Fix this by not banning LR VM exec queues when a system suspend or
hibernation is in progress, while preserving the ban for GT reset where
a started-but-incomplete job is a legitimate indicator of a hang.

Fixes: f6375fb3aa94 ("drm/xe: Track LR jobs in DRM scheduler pending list")
Cc: Matthew Brost <matthew.brost@intel.com>
Cc: Tomasz Lis <tomasz.lis@intel.com>
Cc: Rodrigo Vivi <rodrigo.vivi@intel.com>
Cc: <stable@vger.kernel.org> # v6.19+
Signed-off-by: Thomas Hellström <thomas.hellstrom@linux.intel.com>
Assisted-by: GitHub_Copilot:claude-sonnet-4.6
---
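For reviewers who want to reason about the new condition outside the
driver, here is a minimal userspace sketch of the decision and of the
flag lifecycle described above. It is not driver code: every model_*
and MODEL_* name is hypothetical, the flag values are illustrative, and
the two booleans in struct model_queue merely stand in for q->vm and
xe_vm_in_lr_mode(q->vm).

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical stand-ins for EXEC_QUEUE_FLAG_KERNEL / EXEC_QUEUE_FLAG_VM. */
#define MODEL_QUEUE_FLAG_KERNEL (1u << 0)
#define MODEL_QUEUE_FLAG_VM     (1u << 1)

struct model_queue {
	unsigned int flags;
	bool has_vm;        /* models q->vm != NULL */
	bool vm_in_lr_mode; /* models xe_vm_in_lr_mode(q->vm) */
};

struct model_device {
	bool pm_suspend_in_progress;
};

/* True if the queue is subject to the started-but-incomplete ban check. */
static bool model_ban_check_applies(const struct model_queue *q,
				    bool pm_suspend_in_progress)
{
	/* Kernel and VM-op queues are never considered for the ban. */
	if (q->flags & (MODEL_QUEUE_FLAG_KERNEL | MODEL_QUEUE_FLAG_VM))
		return false;
	/* LR VM queues are skipped, but only while PM suspend is in progress. */
	if (q->has_vm && q->vm_in_lr_mode && pm_suspend_in_progress)
		return false;
	return true;
}

/* Flag is set on suspend entry and cleared again only if suspend fails. */
static int model_pm_suspend(struct model_device *xe, int err)
{
	xe->pm_suspend_in_progress = true;
	/* The real suspend steps would run here and observe the flag. */
	if (err) {
		xe->pm_suspend_in_progress = false;
		return err;
	}
	return 0;
}

/* Flag is cleared on resume entry, before any other resume work. */
static void model_pm_resume(struct model_device *xe)
{
	xe->pm_suspend_in_progress = false;
}

/* Walks the cases named in the commit message. */
void model_selftest(void)
{
	struct model_device xe = { .pm_suspend_in_progress = false };
	struct model_queue lr = { .flags = 0, .has_vm = true, .vm_in_lr_mode = true };
	struct model_queue user = { .flags = 0, .has_vm = true, .vm_in_lr_mode = false };
	struct model_queue kern = { .flags = MODEL_QUEUE_FLAG_KERNEL };

	/* GT reset outside suspend: LR queues are still checked. */
	assert(model_ban_check_applies(&lr, xe.pm_suspend_in_progress));

	/* Successful suspend: LR queues are skipped, non-LR ones are not. */
	assert(model_pm_suspend(&xe, 0) == 0);
	assert(!model_ban_check_applies(&lr, xe.pm_suspend_in_progress));
	assert(model_ban_check_applies(&user, xe.pm_suspend_in_progress));
	assert(!model_ban_check_applies(&kern, xe.pm_suspend_in_progress));

	/* Resume restores the GT reset behaviour. */
	model_pm_resume(&xe);
	assert(model_ban_check_applies(&lr, xe.pm_suspend_in_progress));

	/* A failed suspend must not leave the flag set. */
	assert(model_pm_suspend(&xe, -5) == -5);
	assert(!xe.pm_suspend_in_progress);
}
```

Calling model_selftest() from a test main() exercises all four cases.
The sketch deliberately leaves out the pending-job inspection that
follows the condition in guc_exec_queue_stop(); it only models which
queues reach that inspection.
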
 drivers/gpu/drm/xe/xe_device_types.h |  8 ++++++++
 drivers/gpu/drm/xe/xe_guc_submit.c   | 10 +++++++++-
 drivers/gpu/drm/xe/xe_pm.c           |  5 ++++-
 3 files changed, 21 insertions(+), 2 deletions(-)

diff --git a/drivers/gpu/drm/xe/xe_device_types.h b/drivers/gpu/drm/xe/xe_device_types.h
index 32dd2ffbc796..9dbf7b3a0c49 100644
--- a/drivers/gpu/drm/xe/xe_device_types.h
+++ b/drivers/gpu/drm/xe/xe_device_types.h
@@ -433,6 +433,14 @@ struct xe_device {
 	struct notifier_block pm_notifier;
 	/** @pm_block: Completion to block validating tasks on suspend / hibernate prepare */
 	struct completion pm_block;
+	/**
+	 * @pm_suspend_in_progress: True while the device is going through
+	 * system suspend or hibernation (set at xe_pm_suspend() entry, cleared
+	 * at xe_pm_resume() entry or on suspend error). Used to suppress exec
+	 * queue bans that should only apply during GT reset, not PM suspend.
+	 * Serialised by the PM suspend sequence; no lock required.
+	 */
+	bool pm_suspend_in_progress;
 	/** @rebind_resume_list: List of wq items to kick on resume. */
 	struct list_head rebind_resume_list;
 	/** @rebind_resume_lock: Lock to protect the rebind_resume_list */
diff --git a/drivers/gpu/drm/xe/xe_guc_submit.c b/drivers/gpu/drm/xe/xe_guc_submit.c
index 084ecc8e7efa..42bc7425de0d 100644
--- a/drivers/gpu/drm/xe/xe_guc_submit.c
+++ b/drivers/gpu/drm/xe/xe_guc_submit.c
@@ -2268,8 +2268,16 @@ static void guc_exec_queue_stop(struct xe_guc *guc, struct xe_exec_queue *q)
 	 * Ban any engine (aside from kernel and engines used for VM ops) with a
 	 * started but not complete job or if a job has gone through a GT reset
 	 * more than twice.
+	 *
+	 * LR VM exec queues are excluded from this ban during PM suspend: their
+	 * jobs are intentionally long-running and are preempted and resumed via
+	 * the preempt-fence mechanism. Banning them on PM suspend would
+	 * permanently prevent the job from restarting after resume.
+	 * On GT reset however we do want to ban them, as that may indicate a
+	 * genuinely hung workload.
 	 */
-	if (!(q->flags & (EXEC_QUEUE_FLAG_KERNEL | EXEC_QUEUE_FLAG_VM))) {
+	if (!(q->flags & (EXEC_QUEUE_FLAG_KERNEL | EXEC_QUEUE_FLAG_VM)) &&
+	    !(q->vm && xe_vm_in_lr_mode(q->vm) && guc_to_xe(guc)->pm_suspend_in_progress)) {
 		struct xe_sched_job *job = xe_sched_first_pending_job(sched);
 		bool ban = false;
 
diff --git a/drivers/gpu/drm/xe/xe_pm.c b/drivers/gpu/drm/xe/xe_pm.c
index c203a59d7000..76d211986822 100644
--- a/drivers/gpu/drm/xe/xe_pm.c
+++ b/drivers/gpu/drm/xe/xe_pm.c
@@ -176,6 +176,7 @@ int xe_pm_suspend(struct xe_device *xe)
 	int err;
 
 	drm_dbg(&xe->drm, "Suspending device\n");
+	xe->pm_suspend_in_progress = true;
 	xe_pm_block_begin_signalling();
 	trace_xe_pm_suspend(xe, __builtin_return_address(0));
 
@@ -217,6 +218,7 @@ int xe_pm_suspend(struct xe_device *xe)
 	xe_pxp_pm_resume(xe->pxp);
 err:
 	drm_dbg(&xe->drm, "Device suspend failed %d\n", err);
+	xe->pm_suspend_in_progress = false;
 	xe_pm_block_end_signalling();
 	return err;
 }
@@ -234,8 +236,9 @@ int xe_pm_resume(struct xe_device *xe)
 	u8 id;
 	int err;
 
-	xe_pm_block_begin_signalling();
+	xe->pm_suspend_in_progress = false;
 	drm_dbg(&xe->drm, "Resuming device\n");
+	xe_pm_block_begin_signalling();
 	trace_xe_pm_resume(xe, __builtin_return_address(0));
 
 	for_each_gt(gt, xe, id)
-- 
2.54.0

