Intel-XE Archive on lore.kernel.org
 help / color / mirror / Atom feed
* [PATCH 1/2] drm/xe: fix job timeout recovery for unstarted jobs and kernel queues
@ 2026-06-09 14:44 Rodrigo Vivi
  2026-06-09 14:44 ` [PATCH 2/2] drm/xe/lrc: fix spurious warning when reading context timestamp Rodrigo Vivi
                   ` (7 more replies)
  0 siblings, 8 replies; 17+ messages in thread
From: Rodrigo Vivi @ 2026-06-09 14:44 UTC (permalink / raw)
  To: intel-xe
  Cc: Rodrigo Vivi, Matthew Auld, Matthew Brost, Sanjay Yadav,
	Himal Prasad Ghimiray

Jobs that GuC never scheduled were silently errored out instead of
triggering a GT reset. Kernel jobs that exhaust all recovery attempts
should wedge the device rather than silently fail, and userspace VM bind
queues should stay permanently banned rather than being reset and retried.

The queue is banned early in the timeout handler to signal the G2H
scheduling-done handler so it wakes the disable-scheduling waiter; without
it the waiter sleeps the full 5s timeout. For not started works the ban is
cleared before rearming so that guc_exec_queue_start() can resubmit jobs
after the GT reset — a banned queue would block resubmission and cause an
infinite TDR loop.

v2: (Himal) Do it for any queue type, not just kernel/migration

Cc: Matthew Auld <matthew.auld@intel.com>
Cc: Matthew Brost <matthew.brost@intel.com>
Cc: Sanjay Yadav <sanjay.kumar.yadav@intel.com>
Cc: Himal Prasad Ghimiray <himal.prasad.ghimiray@intel.com>
Assisted-by: GitHub-Copilot:claude-sonnet-4.6
Assisted-by: GitHub-Copilot:claude-opus-4.8
Signed-off-by: Rodrigo Vivi <rodrigo.vivi@intel.com>
---
 drivers/gpu/drm/xe/xe_guc_submit.c | 41 ++++++++++++++++++++----------
 1 file changed, 27 insertions(+), 14 deletions(-)

diff --git a/drivers/gpu/drm/xe/xe_guc_submit.c b/drivers/gpu/drm/xe/xe_guc_submit.c
index 4b247a3019d2..5c40eee41103 100644
--- a/drivers/gpu/drm/xe/xe_guc_submit.c
+++ b/drivers/gpu/drm/xe/xe_guc_submit.c
@@ -157,6 +157,11 @@ static void set_exec_queue_banned(struct xe_exec_queue *q)
 	atomic_or(EXEC_QUEUE_STATE_BANNED, &q->guc->state);
 }
 
+static void clear_exec_queue_banned(struct xe_exec_queue *q)
+{
+	atomic_andnot(EXEC_QUEUE_STATE_BANNED, &q->guc->state);
+}
+
 static bool exec_queue_suspended(struct xe_exec_queue *q)
 {
 	return atomic_read(&q->guc->state) & EXEC_QUEUE_STATE_SUSPENDED;
@@ -1363,7 +1368,8 @@ static bool check_timeout(struct xe_exec_queue *q, struct xe_sched_job *job)
 			   xe_sched_job_seqno(job), xe_sched_job_lrc_seqno(job),
 			   q->guc->id);
 
-		return xe_sched_invalidate_job(job, 2);
+		/* GuC never scheduled this job - let the caller trigger a GT reset. */
+		return true;
 	}
 
 	ctx_timestamp = lower_32_bits(xe_lrc_timestamp(q->lrc[0]));
@@ -1460,6 +1466,12 @@ static void disable_scheduling(struct xe_exec_queue *q, bool immediate)
 			       G2H_LEN_DW_SCHED_CONTEXT_MODE_SET, 1);
 }
 
+/* Unstarted jobs (GuC scheduling failure) and kernel queues recover via GT reset */
+static bool timeout_needs_gt_reset(struct xe_exec_queue *q, struct xe_sched_job *job)
+{
+	return !xe_sched_job_started(job) || (q->flags & EXEC_QUEUE_FLAG_KERNEL);
+}
+
 static enum drm_gpu_sched_stat
 guc_exec_queue_timedout_job(struct drm_sched_job *drm_job)
 {
@@ -1608,19 +1620,20 @@ guc_exec_queue_timedout_job(struct drm_sched_job *drm_job)
 			       xe_sched_job_seqno(job), xe_sched_job_lrc_seqno(job),
 			       q->guc->id, q->flags);
 
-	/*
-	 * Kernel jobs should never fail, nor should VM jobs if they do
-	 * somethings has gone wrong and the GT needs a reset
-	 */
-	xe_gt_WARN(q->gt, q->flags & EXEC_QUEUE_FLAG_KERNEL,
-		   "Kernel-submitted job timed out\n");
-	xe_gt_WARN(q->gt, q->flags & EXEC_QUEUE_FLAG_VM && !exec_queue_killed(q),
-		   "VM job timed out on non-killed execqueue\n");
-	if (!wedged && (q->flags & EXEC_QUEUE_FLAG_KERNEL ||
-			(q->flags & EXEC_QUEUE_FLAG_VM && !exec_queue_killed(q)))) {
-		if (!xe_sched_invalidate_job(job, 2)) {
-			xe_gt_reset_async(q->gt);
-			goto rearm;
+	if (!wedged) {
+		if (timeout_needs_gt_reset(q, job)) {
+			/* Retry after a GT reset; wedge a kernel queue once karma is exhausted */
+			if (!xe_sched_invalidate_job(job, 2)) {
+				clear_exec_queue_banned(q);
+				xe_gt_reset_async(q->gt);
+				goto rearm;
+			}
+			if (q->flags & EXEC_QUEUE_FLAG_KERNEL) {
+				xe_gt_WARN(q->gt, true, "Kernel-submitted job timed out\n");
+				xe_device_declare_wedged(gt_to_xe(q->gt));
+			}
+		} else if (q->flags & EXEC_QUEUE_FLAG_VM && !exec_queue_killed(q)) {
+			xe_gt_WARN(q->gt, true, "VM job timed out on non-killed execqueue\n");
 		}
 	}
 
-- 
2.54.0


^ permalink raw reply related	[flat|nested] 17+ messages in thread
* [PATCH 1/2] drm/xe: fix job timeout recovery for unstarted jobs and kernel queues
@ 2026-06-10 15:25 Rodrigo Vivi
  2026-06-10 16:07 ` Ghimiray, Himal Prasad
                   ` (2 more replies)
  0 siblings, 3 replies; 17+ messages in thread
From: Rodrigo Vivi @ 2026-06-10 15:25 UTC (permalink / raw)
  To: intel-xe
  Cc: Rodrigo Vivi, Matthew Auld, Matthew Brost, Sanjay Yadav,
	Himal Prasad Ghimiray

A job that GuC never scheduled (never started) indicates a GuC
scheduling failure; previously such jobs were silently errored out
instead of triggering a GT reset to recover. Trigger a GT reset and
resubmit them, but only when the queue was not already killed or banned:
an unstarted job on an already banned queue is the ban working as
intended and must neither clear the ban nor kick off a reset, otherwise
a banned userspace queue could be resurrected and spam GT resets.

Kernel queues are always recovered this way and wedge the device once
recovery attempts are exhausted, since kernel work must not silently
fail. A started job that times out on a userspace VM bind queue stays
banned rather than being reset and retried.

The queue is banned early in the timeout handler to signal the G2H
scheduling-done handler so it wakes the disable-scheduling waiter;
without it the waiter sleeps the full 5s timeout. When a reset is
warranted the ban is cleared before rearming so that
guc_exec_queue_start() can resubmit jobs after the GT reset - a
still-banned queue would block resubmission and cause an infinite TDR
loop. The already-banned case is gated out before this point via
skip_timeout_check, so it is unaffected.

v2: (Himal) Do it for any queue type, not just kernel/migration
v3: - (Sashiko and Sanjay): don't clear the ban / GT reset for already
      killed/banned queues on unstarted-job timeout
    - Update commit message
    - (Matt) Add Fixes tag

Fixes: fe05cee4d953 ("drm/xe: Don't short circuit TDR on jobs not started")
Cc: Matthew Auld <matthew.auld@intel.com>
Cc: Matthew Brost <matthew.brost@intel.com>
Cc: Sanjay Yadav <sanjay.kumar.yadav@intel.com>
Cc: Himal Prasad Ghimiray <himal.prasad.ghimiray@intel.com>
Assisted-by: GitHub-Copilot:claude-sonnet-4.6
Assisted-by: GitHub-Copilot:claude-opus-4.8
Signed-off-by: Rodrigo Vivi <rodrigo.vivi@intel.com>
---
 drivers/gpu/drm/xe/xe_guc_submit.c | 49 +++++++++++++++++++++---------
 1 file changed, 35 insertions(+), 14 deletions(-)

diff --git a/drivers/gpu/drm/xe/xe_guc_submit.c b/drivers/gpu/drm/xe/xe_guc_submit.c
index b29cc08e6291..e82018445b7c 100644
--- a/drivers/gpu/drm/xe/xe_guc_submit.c
+++ b/drivers/gpu/drm/xe/xe_guc_submit.c
@@ -157,6 +157,11 @@ static void set_exec_queue_banned(struct xe_exec_queue *q)
 	atomic_or(EXEC_QUEUE_STATE_BANNED, &q->guc->state);
 }
 
+static void clear_exec_queue_banned(struct xe_exec_queue *q)
+{
+	atomic_andnot(EXEC_QUEUE_STATE_BANNED, &q->guc->state);
+}
+
 static bool exec_queue_suspended(struct xe_exec_queue *q)
 {
 	return atomic_read(&q->guc->state) & EXEC_QUEUE_STATE_SUSPENDED;
@@ -1363,7 +1368,8 @@ static bool check_timeout(struct xe_exec_queue *q, struct xe_sched_job *job)
 			   xe_sched_job_seqno(job), xe_sched_job_lrc_seqno(job),
 			   q->guc->id);
 
-		return xe_sched_invalidate_job(job, 2);
+		/* GuC never scheduled this job - let the caller trigger a GT reset. */
+		return true;
 	}
 
 	ctx_timestamp = lower_32_bits(xe_lrc_timestamp(q->lrc[0]));
@@ -1460,6 +1466,21 @@ static void disable_scheduling(struct xe_exec_queue *q, bool immediate)
 			       G2H_LEN_DW_SCHED_CONTEXT_MODE_SET, 1);
 }
 
+/*
+ * Recover via GT reset for a kernel queue, or for a GuC scheduling failure (job
+ * never started) on a queue that was not already killed or banned. An already
+ * banned queue must stay banned, so its unstarted jobs do not clear the ban or
+ * trigger a reset.
+ */
+static bool timeout_needs_gt_reset(struct xe_exec_queue *q, struct xe_sched_job *job,
+				   bool skip_timeout_check)
+{
+	if (q->flags & EXEC_QUEUE_FLAG_KERNEL)
+		return true;
+
+	return !skip_timeout_check && !xe_sched_job_started(job);
+}
+
 static enum drm_gpu_sched_stat
 guc_exec_queue_timedout_job(struct drm_sched_job *drm_job)
 {
@@ -1608,19 +1629,19 @@ guc_exec_queue_timedout_job(struct drm_sched_job *drm_job)
 			       xe_sched_job_seqno(job), xe_sched_job_lrc_seqno(job),
 			       q->guc->id, q->flags);
 
-	/*
-	 * Kernel jobs should never fail, nor should VM jobs if they do
-	 * somethings has gone wrong and the GT needs a reset
-	 */
-	xe_gt_WARN(q->gt, q->flags & EXEC_QUEUE_FLAG_KERNEL,
-		   "Kernel-submitted job timed out\n");
-	xe_gt_WARN(q->gt, q->flags & EXEC_QUEUE_FLAG_VM && !exec_queue_killed(q),
-		   "VM job timed out on non-killed execqueue\n");
-	if (!wedged && (q->flags & EXEC_QUEUE_FLAG_KERNEL ||
-			(q->flags & EXEC_QUEUE_FLAG_VM && !exec_queue_killed(q)))) {
-		if (!xe_sched_invalidate_job(job, 2)) {
-			xe_gt_reset_async(q->gt);
-			goto rearm;
+	if (!wedged) {
+		if (timeout_needs_gt_reset(q, job, skip_timeout_check)) {
+			if (!xe_sched_invalidate_job(job, 2)) {
+				clear_exec_queue_banned(q);
+				xe_gt_reset_async(q->gt);
+				goto rearm;
+			}
+			if (q->flags & EXEC_QUEUE_FLAG_KERNEL) {
+				xe_gt_WARN(q->gt, true, "Kernel-submitted job timed out\n");
+				xe_device_declare_wedged(gt_to_xe(q->gt));
+			}
+		} else if (q->flags & EXEC_QUEUE_FLAG_VM && !exec_queue_killed(q)) {
+			xe_gt_WARN(q->gt, true, "VM job timed out on non-killed execqueue\n");
 		}
 	}
 
-- 
2.54.0


^ permalink raw reply related	[flat|nested] 17+ messages in thread

end of thread, other threads:[~2026-06-11  4:55 UTC | newest]

Thread overview: 17+ messages (download: mbox.gz follow: Atom feed
-- links below jump to the message on this page --
2026-06-09 14:44 [PATCH 1/2] drm/xe: fix job timeout recovery for unstarted jobs and kernel queues Rodrigo Vivi
2026-06-09 14:44 ` [PATCH 2/2] drm/xe/lrc: fix spurious warning when reading context timestamp Rodrigo Vivi
2026-06-09 16:23   ` Ghimiray, Himal Prasad
2026-06-09 14:49 ` ✗ CI.checkpatch: warning for series starting with [1/2] drm/xe: fix job timeout recovery for unstarted jobs and kernel queues Patchwork
2026-06-09 14:51 ` ✓ CI.KUnit: success " Patchwork
2026-06-09 15:34 ` [PATCH 1/2] " Matthew Brost
2026-06-09 16:05   ` Rodrigo Vivi
2026-06-09 16:12     ` Matthew Brost
2026-06-09 15:46 ` ✓ Xe.CI.BAT: success for series starting with [1/2] " Patchwork
2026-06-09 16:16 ` [PATCH 1/2] " Ghimiray, Himal Prasad
2026-06-10  3:38 ` ✗ Xe.CI.FULL: failure for series starting with [1/2] " Patchwork
2026-06-10  9:47 ` [PATCH 1/2] " Yadav, Sanjay Kumar
2026-06-10 15:24   ` Rodrigo Vivi
  -- strict thread matches above, loose matches on Subject: below --
2026-06-10 15:25 Rodrigo Vivi
2026-06-10 16:07 ` Ghimiray, Himal Prasad
2026-06-10 16:30 ` Matthew Brost
2026-06-11  4:55 ` Yadav, Sanjay Kumar

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox