dri-devel.lists.freedesktop.org archive mirror
* [PATCH v7 0/9] Fix DRM scheduler layering violations in Xe
@ 2025-12-01 18:39 Matthew Brost
  2025-12-01 18:39 ` [PATCH v7 1/9] drm/sched: Add several job helpers to avoid drivers touching scheduler state Matthew Brost
                   ` (9 more replies)
  0 siblings, 10 replies; 27+ messages in thread
From: Matthew Brost @ 2025-12-01 18:39 UTC (permalink / raw)
  To: intel-xe; +Cc: dri-devel

At XDC, we agreed that drivers should avoid accessing DRM scheduler
internals and misusing DRM scheduler locks, and should instead adopt a
well-defined pending job list iterator. This series proposes the necessary
DRM scheduler changes to bring Xe in line with that agreement and updates Xe
to use the new DRM scheduler API.

While here, clean up LR queue handling and simplify the GuC state machine in
Xe. Also rework LRC timestamp sampling so the TDR no longer needs to toggle
scheduling state.

v2:
 - Fix checkpatch / naming issues
v3:
 - Only allow pending job list iterator to be called on stopped schedulers
 - Clean up LR queue handling / fix a few miscellaneous Xe scheduler issues
v4:
 - Address Niranjana's feedback
 - Add patch to avoid toggling scheduler state in the TDR
v5:
 - Rebase
 - Fixup LRC timeout check (Umesh)
v6:
 - Fix VF bugs (Testing)
v7:
 - Disable timestamp WA on VF

Matt

Matthew Brost (9):
  drm/sched: Add several job helpers to avoid drivers touching scheduler
    state
  drm/sched: Add pending job list iterator
  drm/xe: Add dedicated message lock
  drm/xe: Stop abusing DRM scheduler internals
  drm/xe: Only toggle scheduling in TDR if GuC is running
  drm/xe: Do not deregister queues in TDR
  drm/xe: Remove special casing for LR queues in submission
  drm/xe: Disable timestamp WA on VFs
  drm/xe: Avoid toggling schedule state to check LRC timestamp in TDR

 drivers/gpu/drm/scheduler/sched_main.c       |   4 +-
 drivers/gpu/drm/xe/xe_gpu_scheduler.c        |   9 +-
 drivers/gpu/drm/xe/xe_gpu_scheduler.h        |  37 +-
 drivers/gpu/drm/xe/xe_gpu_scheduler_types.h  |   2 +
 drivers/gpu/drm/xe/xe_guc_exec_queue_types.h |   2 -
 drivers/gpu/drm/xe/xe_guc_submit.c           | 362 +++----------------
 drivers/gpu/drm/xe/xe_guc_submit_types.h     |  11 -
 drivers/gpu/drm/xe/xe_hw_fence.c             |  16 -
 drivers/gpu/drm/xe/xe_hw_fence.h             |   2 -
 drivers/gpu/drm/xe/xe_lrc.c                  |  45 ++-
 drivers/gpu/drm/xe/xe_lrc.h                  |   3 +-
 drivers/gpu/drm/xe/xe_ring_ops.c             |  25 +-
 drivers/gpu/drm/xe/xe_sched_job.c            |   1 +
 drivers/gpu/drm/xe/xe_sched_job_types.h      |   2 +
 drivers/gpu/drm/xe/xe_trace.h                |   5 -
 include/drm/gpu_scheduler.h                  |  82 +++++
 16 files changed, 211 insertions(+), 397 deletions(-)

-- 
2.34.1


* [PATCH v7 1/9] drm/sched: Add several job helpers to avoid drivers touching scheduler state
  2025-12-01 18:39 [PATCH v7 0/9] Fix DRM scheduler layering violations in Xe Matthew Brost
@ 2025-12-01 18:39 ` Matthew Brost
  2025-12-03  8:56   ` Philipp Stanner
  2025-12-01 18:39 ` [PATCH v7 2/9] drm/sched: Add pending job list iterator Matthew Brost
                   ` (8 subsequent siblings)
  9 siblings, 1 reply; 27+ messages in thread
From: Matthew Brost @ 2025-12-01 18:39 UTC (permalink / raw)
  To: intel-xe; +Cc: dri-devel

Add helpers to check whether a scheduler is stopped and whether a job is
signaled. These are expected to be used driver side in recovery and debug
flows.
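
A minimal sketch of the intended driver-side usage (illustrative only, not
part of this patch; handle_unfinished_job() is a hypothetical driver hook,
and xe_sched_submission_stop()/start() are the existing Xe wrappers that
pause/resume the scheduler):

	/* Pause submission so job/fence state is a stable snapshot */
	xe_sched_submission_stop(sched);

	/* drm_sched_job_is_signaled() WARNs if the scheduler is still running */
	if (!drm_sched_job_is_signaled(&job->drm))
		handle_unfinished_job(job);

	xe_sched_submission_start(sched);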

v4:
 - Reorder patch to first in series (Niranjana)
 - Also check parent fence for signaling (Niranjana)

Signed-off-by: Matthew Brost <matthew.brost@intel.com>
Reviewed-by: Niranjana Vishwanathapura <niranjana.vishwanathapura@intel.com>
---
 drivers/gpu/drm/scheduler/sched_main.c |  4 ++--
 include/drm/gpu_scheduler.h            | 32 ++++++++++++++++++++++++++
 2 files changed, 34 insertions(+), 2 deletions(-)

diff --git a/drivers/gpu/drm/scheduler/sched_main.c b/drivers/gpu/drm/scheduler/sched_main.c
index 1d4f1b822e7b..cf40c18ab433 100644
--- a/drivers/gpu/drm/scheduler/sched_main.c
+++ b/drivers/gpu/drm/scheduler/sched_main.c
@@ -344,7 +344,7 @@ drm_sched_rq_select_entity_fifo(struct drm_gpu_scheduler *sched,
  */
 static void drm_sched_run_job_queue(struct drm_gpu_scheduler *sched)
 {
-	if (!READ_ONCE(sched->pause_submit))
+	if (!drm_sched_is_stopped(sched))
 		queue_work(sched->submit_wq, &sched->work_run_job);
 }
 
@@ -354,7 +354,7 @@ static void drm_sched_run_job_queue(struct drm_gpu_scheduler *sched)
  */
 static void drm_sched_run_free_queue(struct drm_gpu_scheduler *sched)
 {
-	if (!READ_ONCE(sched->pause_submit))
+	if (!drm_sched_is_stopped(sched))
 		queue_work(sched->submit_wq, &sched->work_free_job);
 }
 
diff --git a/include/drm/gpu_scheduler.h b/include/drm/gpu_scheduler.h
index fb88301b3c45..385bf34e76fe 100644
--- a/include/drm/gpu_scheduler.h
+++ b/include/drm/gpu_scheduler.h
@@ -698,4 +698,36 @@ void drm_sched_entity_modify_sched(struct drm_sched_entity *entity,
 				   struct drm_gpu_scheduler **sched_list,
 				   unsigned int num_sched_list);
 
+/* Inlines */
+
+/**
+ * drm_sched_is_stopped() - DRM scheduler is stopped
+ * @sched: DRM scheduler
+ *
+ * Return: True if sched is stopped, False otherwise
+ */
+static inline bool drm_sched_is_stopped(struct drm_gpu_scheduler *sched)
+{
+	return READ_ONCE(sched->pause_submit);
+}
+
+/**
+ * drm_sched_job_is_signaled() - DRM scheduler job is signaled
+ * @job: DRM scheduler job
+ *
+ * Determine if DRM scheduler job is signaled. DRM scheduler should be stopped
+ * to obtain a stable snapshot of state. Both parent fence (hardware fence) and
+ * finished fence (software fence) are checked to determine signaling state.
+ *
+ * Return: True if job is signaled, False otherwise
+ */
+static inline bool drm_sched_job_is_signaled(struct drm_sched_job *job)
+{
+	struct drm_sched_fence *s_fence = job->s_fence;
+
+	WARN_ON(!drm_sched_is_stopped(job->sched));
+	return (s_fence->parent && dma_fence_is_signaled(s_fence->parent)) ||
+		dma_fence_is_signaled(&s_fence->finished);
+}
+
 #endif
-- 
2.34.1


* [PATCH v7 2/9] drm/sched: Add pending job list iterator
  2025-12-01 18:39 [PATCH v7 0/9] Fix DRM scheduler layering violations in Xe Matthew Brost
  2025-12-01 18:39 ` [PATCH v7 1/9] drm/sched: Add several job helpers to avoid drivers touching scheduler state Matthew Brost
@ 2025-12-01 18:39 ` Matthew Brost
  2025-12-03  9:07   ` Philipp Stanner
  2025-12-01 18:39 ` [PATCH v7 3/9] drm/xe: Add dedicated message lock Matthew Brost
                   ` (7 subsequent siblings)
  9 siblings, 1 reply; 27+ messages in thread
From: Matthew Brost @ 2025-12-01 18:39 UTC (permalink / raw)
  To: intel-xe; +Cc: dri-devel

Stop open coding pending job list walks in drivers. Add a pending job list
iterator which safely walks the DRM scheduler's pending list while asserting
that the scheduler is stopped.
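
A minimal usage sketch (illustrative only, not part of this patch; the third
argument filters on a single entity, or NULL for all entities, and the
iterator WARNs if the scheduler is not stopped):

	struct drm_sched_job *s_job;

	/* Scheduler already stopped by the caller */
	drm_sched_for_each_pending_job(s_job, &sched->base, NULL)
		xe_sched_job_set_error(to_xe_sched_job(s_job), -ECANCELED);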

v2:
 - Fix checkpatch (CI)
v3:
 - Drop locked version (Christian)
v4:
 - Reorder patch (Niranjana)

Signed-off-by: Matthew Brost <matthew.brost@intel.com>
Reviewed-by: Niranjana Vishwanathapura <niranjana.vishwanathapura@intel.com>
---
 include/drm/gpu_scheduler.h | 50 +++++++++++++++++++++++++++++++++++++
 1 file changed, 50 insertions(+)

diff --git a/include/drm/gpu_scheduler.h b/include/drm/gpu_scheduler.h
index 385bf34e76fe..9d228513d06c 100644
--- a/include/drm/gpu_scheduler.h
+++ b/include/drm/gpu_scheduler.h
@@ -730,4 +730,54 @@ static inline bool drm_sched_job_is_signaled(struct drm_sched_job *job)
 		dma_fence_is_signaled(&s_fence->finished);
 }
 
+/**
+ * struct drm_sched_pending_job_iter - DRM scheduler pending job iterator state
+ * @sched: DRM scheduler associated with pending job iterator
+ */
+struct drm_sched_pending_job_iter {
+	struct drm_gpu_scheduler *sched;
+};
+
+/* Drivers should never call this directly */
+static inline struct drm_sched_pending_job_iter
+__drm_sched_pending_job_iter_begin(struct drm_gpu_scheduler *sched)
+{
+	struct drm_sched_pending_job_iter iter = {
+		.sched = sched,
+	};
+
+	WARN_ON(!drm_sched_is_stopped(sched));
+	return iter;
+}
+
+/* Drivers should never call this directly */
+static inline void
+__drm_sched_pending_job_iter_end(const struct drm_sched_pending_job_iter iter)
+{
+	WARN_ON(!drm_sched_is_stopped(iter.sched));
+}
+
+DEFINE_CLASS(drm_sched_pending_job_iter, struct drm_sched_pending_job_iter,
+	     __drm_sched_pending_job_iter_end(_T),
+	     __drm_sched_pending_job_iter_begin(__sched),
+	     struct drm_gpu_scheduler *__sched);
+static inline void *
+class_drm_sched_pending_job_iter_lock_ptr(class_drm_sched_pending_job_iter_t *_T)
+{ return _T; }
+#define class_drm_sched_pending_job_iter_is_conditional false
+
+/**
+ * drm_sched_for_each_pending_job() - Iterator for each pending job in scheduler
+ * @__job: Current pending job being iterated over
+ * @__sched: DRM scheduler to iterate over pending jobs
+ * @__entity: DRM scheduler entity to filter jobs, NULL indicates no filter
+ *
+ * Iterator for each pending job in the scheduler, optionally filtering on an
+ * entity, and enforcing that the scheduler is fully stopped.
+ */
+#define drm_sched_for_each_pending_job(__job, __sched, __entity)		\
+	scoped_guard(drm_sched_pending_job_iter, (__sched))			\
+		list_for_each_entry((__job), &(__sched)->pending_list, list)	\
+			for_each_if(!(__entity) || (__job)->entity == (__entity))
+
 #endif
-- 
2.34.1


* [PATCH v7 3/9] drm/xe: Add dedicated message lock
  2025-12-01 18:39 [PATCH v7 0/9] Fix DRM scheduler layering violations in Xe Matthew Brost
  2025-12-01 18:39 ` [PATCH v7 1/9] drm/sched: Add several job helpers to avoid drivers touching scheduler state Matthew Brost
  2025-12-01 18:39 ` [PATCH v7 2/9] drm/sched: Add pending job list iterator Matthew Brost
@ 2025-12-01 18:39 ` Matthew Brost
  2025-12-03  9:38   ` Philipp Stanner
  2025-12-01 18:39 ` [PATCH v7 4/9] drm/xe: Stop abusing DRM scheduler internals Matthew Brost
                   ` (6 subsequent siblings)
  9 siblings, 1 reply; 27+ messages in thread
From: Matthew Brost @ 2025-12-01 18:39 UTC (permalink / raw)
  To: intel-xe; +Cc: dri-devel

Stop abusing the DRM scheduler's job list lock for messages; add a dedicated
message lock instead.
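
Callers keep using the existing xe_sched_msg_lock()/xe_sched_msg_unlock()
helpers; only the lock they map to changes. A minimal sketch of the pattern
the lock protects (illustrative only; the queued check stands in for a
hypothetical driver-side condition):

	xe_sched_msg_lock(sched);
	if (!msg_already_queued)
		xe_sched_add_msg_locked(sched, msg);
	xe_sched_msg_unlock(sched);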

Signed-off-by: Matthew Brost <matthew.brost@intel.com>
Reviewed-by: Niranjana Vishwanathapura <niranjana.vishwanathapura@intel.com>
---
 drivers/gpu/drm/xe/xe_gpu_scheduler.c       | 5 +++--
 drivers/gpu/drm/xe/xe_gpu_scheduler.h       | 4 ++--
 drivers/gpu/drm/xe/xe_gpu_scheduler_types.h | 2 ++
 3 files changed, 7 insertions(+), 4 deletions(-)

diff --git a/drivers/gpu/drm/xe/xe_gpu_scheduler.c b/drivers/gpu/drm/xe/xe_gpu_scheduler.c
index f91e06d03511..f4f23317191f 100644
--- a/drivers/gpu/drm/xe/xe_gpu_scheduler.c
+++ b/drivers/gpu/drm/xe/xe_gpu_scheduler.c
@@ -77,6 +77,7 @@ int xe_sched_init(struct xe_gpu_scheduler *sched,
 	};
 
 	sched->ops = xe_ops;
+	spin_lock_init(&sched->msg_lock);
 	INIT_LIST_HEAD(&sched->msgs);
 	INIT_WORK(&sched->work_process_msg, xe_sched_process_msg_work);
 
@@ -117,7 +118,7 @@ void xe_sched_add_msg(struct xe_gpu_scheduler *sched,
 void xe_sched_add_msg_locked(struct xe_gpu_scheduler *sched,
 			     struct xe_sched_msg *msg)
 {
-	lockdep_assert_held(&sched->base.job_list_lock);
+	lockdep_assert_held(&sched->msg_lock);
 
 	list_add_tail(&msg->link, &sched->msgs);
 	xe_sched_process_msg_queue(sched);
@@ -131,7 +132,7 @@ void xe_sched_add_msg_locked(struct xe_gpu_scheduler *sched,
 void xe_sched_add_msg_head(struct xe_gpu_scheduler *sched,
 			   struct xe_sched_msg *msg)
 {
-	lockdep_assert_held(&sched->base.job_list_lock);
+	lockdep_assert_held(&sched->msg_lock);
 
 	list_add(&msg->link, &sched->msgs);
 	xe_sched_process_msg_queue(sched);
diff --git a/drivers/gpu/drm/xe/xe_gpu_scheduler.h b/drivers/gpu/drm/xe/xe_gpu_scheduler.h
index c7a77a3a9681..dceb2cd0ee5b 100644
--- a/drivers/gpu/drm/xe/xe_gpu_scheduler.h
+++ b/drivers/gpu/drm/xe/xe_gpu_scheduler.h
@@ -33,12 +33,12 @@ void xe_sched_add_msg_head(struct xe_gpu_scheduler *sched,
 
 static inline void xe_sched_msg_lock(struct xe_gpu_scheduler *sched)
 {
-	spin_lock(&sched->base.job_list_lock);
+	spin_lock(&sched->msg_lock);
 }
 
 static inline void xe_sched_msg_unlock(struct xe_gpu_scheduler *sched)
 {
-	spin_unlock(&sched->base.job_list_lock);
+	spin_unlock(&sched->msg_lock);
 }
 
 static inline void xe_sched_stop(struct xe_gpu_scheduler *sched)
diff --git a/drivers/gpu/drm/xe/xe_gpu_scheduler_types.h b/drivers/gpu/drm/xe/xe_gpu_scheduler_types.h
index 6731b13da8bb..63d9bf92583c 100644
--- a/drivers/gpu/drm/xe/xe_gpu_scheduler_types.h
+++ b/drivers/gpu/drm/xe/xe_gpu_scheduler_types.h
@@ -47,6 +47,8 @@ struct xe_gpu_scheduler {
 	const struct xe_sched_backend_ops	*ops;
 	/** @msgs: list of messages to be processed in @work_process_msg */
 	struct list_head			msgs;
+	/** @msg_lock: Message lock */
+	spinlock_t				msg_lock;
 	/** @work_process_msg: processes messages */
 	struct work_struct		work_process_msg;
 };
-- 
2.34.1


* [PATCH v7 4/9] drm/xe: Stop abusing DRM scheduler internals
  2025-12-01 18:39 [PATCH v7 0/9] Fix DRM scheduler layering violations in Xe Matthew Brost
                   ` (2 preceding siblings ...)
  2025-12-01 18:39 ` [PATCH v7 3/9] drm/xe: Add dedicated message lock Matthew Brost
@ 2025-12-01 18:39 ` Matthew Brost
  2025-12-03 10:56   ` Philipp Stanner
  2025-12-01 18:39 ` [PATCH v7 5/9] drm/xe: Only toggle scheduling in TDR if GuC is running Matthew Brost
                   ` (5 subsequent siblings)
  9 siblings, 1 reply; 27+ messages in thread
From: Matthew Brost @ 2025-12-01 18:39 UTC (permalink / raw)
  To: intel-xe; +Cc: dri-devel

Use the new pending job list iterator and the new helper functions in Xe to
avoid reaching into DRM scheduler internals.

Part of this change involves removing pending-job debug information
from debugfs and devcoredump. As agreed, the pending job list should
only be accessed when the scheduler is stopped. However, it's not
straightforward to determine whether the scheduler is stopped from the
shared debugfs/devcoredump code path. Additionally, the pending job list
provides little useful information, as pending jobs can be inferred from
seqnos and ring head/tail positions. Therefore, this debug information
is being removed.

v4:
 - Add comment around DRM_GPU_SCHED_STAT_NO_HANG (Niranjana)

Signed-off-by: Matthew Brost <matthew.brost@intel.com>
Reviewed-by: Niranjana Vishwanathapura <niranjana.vishwanathapura@intel.com>
---
 drivers/gpu/drm/xe/xe_gpu_scheduler.c    |  4 +-
 drivers/gpu/drm/xe/xe_gpu_scheduler.h    | 33 ++--------
 drivers/gpu/drm/xe/xe_guc_submit.c       | 81 ++++++------------------
 drivers/gpu/drm/xe/xe_guc_submit_types.h | 11 ----
 drivers/gpu/drm/xe/xe_hw_fence.c         | 16 -----
 drivers/gpu/drm/xe/xe_hw_fence.h         |  2 -
 6 files changed, 27 insertions(+), 120 deletions(-)

diff --git a/drivers/gpu/drm/xe/xe_gpu_scheduler.c b/drivers/gpu/drm/xe/xe_gpu_scheduler.c
index f4f23317191f..9c8004d5dd91 100644
--- a/drivers/gpu/drm/xe/xe_gpu_scheduler.c
+++ b/drivers/gpu/drm/xe/xe_gpu_scheduler.c
@@ -7,7 +7,7 @@
 
 static void xe_sched_process_msg_queue(struct xe_gpu_scheduler *sched)
 {
-	if (!READ_ONCE(sched->base.pause_submit))
+	if (!drm_sched_is_stopped(&sched->base))
 		queue_work(sched->base.submit_wq, &sched->work_process_msg);
 }
 
@@ -43,7 +43,7 @@ static void xe_sched_process_msg_work(struct work_struct *w)
 		container_of(w, struct xe_gpu_scheduler, work_process_msg);
 	struct xe_sched_msg *msg;
 
-	if (READ_ONCE(sched->base.pause_submit))
+	if (drm_sched_is_stopped(&sched->base))
 		return;
 
 	msg = xe_sched_get_msg(sched);
diff --git a/drivers/gpu/drm/xe/xe_gpu_scheduler.h b/drivers/gpu/drm/xe/xe_gpu_scheduler.h
index dceb2cd0ee5b..664c2db56af3 100644
--- a/drivers/gpu/drm/xe/xe_gpu_scheduler.h
+++ b/drivers/gpu/drm/xe/xe_gpu_scheduler.h
@@ -56,12 +56,9 @@ static inline void xe_sched_resubmit_jobs(struct xe_gpu_scheduler *sched)
 	struct drm_sched_job *s_job;
 	bool restore_replay = false;
 
-	list_for_each_entry(s_job, &sched->base.pending_list, list) {
-		struct drm_sched_fence *s_fence = s_job->s_fence;
-		struct dma_fence *hw_fence = s_fence->parent;
-
+	drm_sched_for_each_pending_job(s_job, &sched->base, NULL) {
 		restore_replay |= to_xe_sched_job(s_job)->restore_replay;
-		if (restore_replay || (hw_fence && !dma_fence_is_signaled(hw_fence)))
+		if (restore_replay || !drm_sched_job_is_signaled(s_job))
 			sched->base.ops->run_job(s_job);
 	}
 }
@@ -72,14 +69,6 @@ xe_sched_invalidate_job(struct xe_sched_job *job, int threshold)
 	return drm_sched_invalidate_job(&job->drm, threshold);
 }
 
-static inline void xe_sched_add_pending_job(struct xe_gpu_scheduler *sched,
-					    struct xe_sched_job *job)
-{
-	spin_lock(&sched->base.job_list_lock);
-	list_add(&job->drm.list, &sched->base.pending_list);
-	spin_unlock(&sched->base.job_list_lock);
-}
-
 /**
  * xe_sched_first_pending_job() - Find first pending job which is unsignaled
  * @sched: Xe GPU scheduler
@@ -89,21 +78,13 @@ static inline void xe_sched_add_pending_job(struct xe_gpu_scheduler *sched,
 static inline
 struct xe_sched_job *xe_sched_first_pending_job(struct xe_gpu_scheduler *sched)
 {
-	struct xe_sched_job *job, *r_job = NULL;
-
-	spin_lock(&sched->base.job_list_lock);
-	list_for_each_entry(job, &sched->base.pending_list, drm.list) {
-		struct drm_sched_fence *s_fence = job->drm.s_fence;
-		struct dma_fence *hw_fence = s_fence->parent;
+	struct drm_sched_job *job;
 
-		if (hw_fence && !dma_fence_is_signaled(hw_fence)) {
-			r_job = job;
-			break;
-		}
-	}
-	spin_unlock(&sched->base.job_list_lock);
+	drm_sched_for_each_pending_job(job, &sched->base, NULL)
+		if (!drm_sched_job_is_signaled(job))
+			return to_xe_sched_job(job);
 
-	return r_job;
+	return NULL;
 }
 
 static inline int
diff --git a/drivers/gpu/drm/xe/xe_guc_submit.c b/drivers/gpu/drm/xe/xe_guc_submit.c
index 3ca2558c8c96..c8027ccaec81 100644
--- a/drivers/gpu/drm/xe/xe_guc_submit.c
+++ b/drivers/gpu/drm/xe/xe_guc_submit.c
@@ -1032,7 +1032,7 @@ static void xe_guc_exec_queue_lr_cleanup(struct work_struct *w)
 	struct xe_exec_queue *q = ge->q;
 	struct xe_guc *guc = exec_queue_to_guc(q);
 	struct xe_gpu_scheduler *sched = &ge->sched;
-	struct xe_sched_job *job;
+	struct drm_sched_job *job;
 	bool wedged = false;
 
 	xe_gt_assert(guc_to_gt(guc), xe_exec_queue_is_lr(q));
@@ -1091,16 +1091,10 @@ static void xe_guc_exec_queue_lr_cleanup(struct work_struct *w)
 	if (!exec_queue_killed(q) && !xe_lrc_ring_is_idle(q->lrc[0]))
 		xe_devcoredump(q, NULL, "LR job cleanup, guc_id=%d", q->guc->id);
 
-	xe_hw_fence_irq_stop(q->fence_irq);
+	drm_sched_for_each_pending_job(job, &sched->base, NULL)
+		xe_sched_job_set_error(to_xe_sched_job(job), -ECANCELED);
 
 	xe_sched_submission_start(sched);
-
-	spin_lock(&sched->base.job_list_lock);
-	list_for_each_entry(job, &sched->base.pending_list, drm.list)
-		xe_sched_job_set_error(job, -ECANCELED);
-	spin_unlock(&sched->base.job_list_lock);
-
-	xe_hw_fence_irq_start(q->fence_irq);
 }
 
 #define ADJUST_FIVE_PERCENT(__t)	mul_u64_u32_div(__t, 105, 100)
@@ -1219,7 +1213,7 @@ static enum drm_gpu_sched_stat
 guc_exec_queue_timedout_job(struct drm_sched_job *drm_job)
 {
 	struct xe_sched_job *job = to_xe_sched_job(drm_job);
-	struct xe_sched_job *tmp_job;
+	struct drm_sched_job *tmp_job;
 	struct xe_exec_queue *q = job->q;
 	struct xe_gpu_scheduler *sched = &q->guc->sched;
 	struct xe_guc *guc = exec_queue_to_guc(q);
@@ -1227,7 +1221,6 @@ guc_exec_queue_timedout_job(struct drm_sched_job *drm_job)
 	struct xe_device *xe = guc_to_xe(guc);
 	int err = -ETIME;
 	pid_t pid = -1;
-	int i = 0;
 	bool wedged = false, skip_timeout_check;
 
 	xe_gt_assert(guc_to_gt(guc), !xe_exec_queue_is_lr(q));
@@ -1392,28 +1385,19 @@ guc_exec_queue_timedout_job(struct drm_sched_job *drm_job)
 		__deregister_exec_queue(guc, q);
 	}
 
-	/* Stop fence signaling */
-	xe_hw_fence_irq_stop(q->fence_irq);
+	/* Mark all outstanding jobs as bad, thus completing them */
+	xe_sched_job_set_error(job, err);
+	drm_sched_for_each_pending_job(tmp_job, &sched->base, NULL)
+		xe_sched_job_set_error(to_xe_sched_job(tmp_job), -ECANCELED);
 
-	/*
-	 * Fence state now stable, stop / start scheduler which cleans up any
-	 * fences that are complete
-	 */
-	xe_sched_add_pending_job(sched, job);
 	xe_sched_submission_start(sched);
-
 	xe_guc_exec_queue_trigger_cleanup(q);
 
-	/* Mark all outstanding jobs as bad, thus completing them */
-	spin_lock(&sched->base.job_list_lock);
-	list_for_each_entry(tmp_job, &sched->base.pending_list, drm.list)
-		xe_sched_job_set_error(tmp_job, !i++ ? err : -ECANCELED);
-	spin_unlock(&sched->base.job_list_lock);
-
-	/* Start fence signaling */
-	xe_hw_fence_irq_start(q->fence_irq);
-
-	return DRM_GPU_SCHED_STAT_RESET;
+	/*
+	 * We want the job added back to the pending list so it gets freed; this
+	 * is what DRM_GPU_SCHED_STAT_NO_HANG does.
+	 */
+	return DRM_GPU_SCHED_STAT_NO_HANG;
 
 sched_enable:
 	set_exec_queue_pending_tdr_exit(q);
@@ -2265,9 +2249,12 @@ static void guc_exec_queue_unpause_prepare(struct xe_guc *guc,
 {
 	struct xe_gpu_scheduler *sched = &q->guc->sched;
 	struct xe_sched_job *job = NULL;
+	struct drm_sched_job *s_job;
 	bool restore_replay = false;
 
-	list_for_each_entry(job, &sched->base.pending_list, drm.list) {
+	drm_sched_for_each_pending_job(s_job, &sched->base, NULL) {
+		job = to_xe_sched_job(s_job);
+
 		restore_replay |= job->restore_replay;
 		if (restore_replay) {
 			xe_gt_dbg(guc_to_gt(guc), "Replay JOB - guc_id=%d, seqno=%d",
@@ -2391,7 +2378,7 @@ void xe_guc_submit_unpause_vf(struct xe_guc *guc)
 		 * created after resfix done.
 		 */
 		if (q->guc->id != index ||
-		    !READ_ONCE(q->guc->sched.base.pause_submit))
+		    !drm_sched_is_stopped(&q->guc->sched.base))
 			continue;
 
 		guc_exec_queue_unpause(guc, q);
@@ -2813,30 +2800,6 @@ xe_guc_exec_queue_snapshot_capture(struct xe_exec_queue *q)
 	if (snapshot->parallel_execution)
 		guc_exec_queue_wq_snapshot_capture(q, snapshot);
 
-	spin_lock(&sched->base.job_list_lock);
-	snapshot->pending_list_size = list_count_nodes(&sched->base.pending_list);
-	snapshot->pending_list = kmalloc_array(snapshot->pending_list_size,
-					       sizeof(struct pending_list_snapshot),
-					       GFP_ATOMIC);
-
-	if (snapshot->pending_list) {
-		struct xe_sched_job *job_iter;
-
-		i = 0;
-		list_for_each_entry(job_iter, &sched->base.pending_list, drm.list) {
-			snapshot->pending_list[i].seqno =
-				xe_sched_job_seqno(job_iter);
-			snapshot->pending_list[i].fence =
-				dma_fence_is_signaled(job_iter->fence) ? 1 : 0;
-			snapshot->pending_list[i].finished =
-				dma_fence_is_signaled(&job_iter->drm.s_fence->finished)
-				? 1 : 0;
-			i++;
-		}
-	}
-
-	spin_unlock(&sched->base.job_list_lock);
-
 	return snapshot;
 }
 
@@ -2894,13 +2857,6 @@ xe_guc_exec_queue_snapshot_print(struct xe_guc_submit_exec_queue_snapshot *snaps
 
 	if (snapshot->parallel_execution)
 		guc_exec_queue_wq_snapshot_print(snapshot, p);
-
-	for (i = 0; snapshot->pending_list && i < snapshot->pending_list_size;
-	     i++)
-		drm_printf(p, "\tJob: seqno=%d, fence=%d, finished=%d\n",
-			   snapshot->pending_list[i].seqno,
-			   snapshot->pending_list[i].fence,
-			   snapshot->pending_list[i].finished);
 }
 
 /**
@@ -2923,7 +2879,6 @@ void xe_guc_exec_queue_snapshot_free(struct xe_guc_submit_exec_queue_snapshot *s
 			xe_lrc_snapshot_free(snapshot->lrc[i]);
 		kfree(snapshot->lrc);
 	}
-	kfree(snapshot->pending_list);
 	kfree(snapshot);
 }
 
diff --git a/drivers/gpu/drm/xe/xe_guc_submit_types.h b/drivers/gpu/drm/xe/xe_guc_submit_types.h
index dc7456c34583..0b08c79cf3b9 100644
--- a/drivers/gpu/drm/xe/xe_guc_submit_types.h
+++ b/drivers/gpu/drm/xe/xe_guc_submit_types.h
@@ -61,12 +61,6 @@ struct guc_submit_parallel_scratch {
 	u32 wq[WQ_SIZE / sizeof(u32)];
 };
 
-struct pending_list_snapshot {
-	u32 seqno;
-	bool fence;
-	bool finished;
-};
-
 /**
  * struct xe_guc_submit_exec_queue_snapshot - Snapshot for devcoredump
  */
@@ -134,11 +128,6 @@ struct xe_guc_submit_exec_queue_snapshot {
 		/** @wq: Workqueue Items */
 		u32 wq[WQ_SIZE / sizeof(u32)];
 	} parallel;
-
-	/** @pending_list_size: Size of the pending list snapshot array */
-	int pending_list_size;
-	/** @pending_list: snapshot of the pending list info */
-	struct pending_list_snapshot *pending_list;
 };
 
 #endif
diff --git a/drivers/gpu/drm/xe/xe_hw_fence.c b/drivers/gpu/drm/xe/xe_hw_fence.c
index b2a0c46dfcd4..e65dfcdfdbc5 100644
--- a/drivers/gpu/drm/xe/xe_hw_fence.c
+++ b/drivers/gpu/drm/xe/xe_hw_fence.c
@@ -110,22 +110,6 @@ void xe_hw_fence_irq_run(struct xe_hw_fence_irq *irq)
 	irq_work_queue(&irq->work);
 }
 
-void xe_hw_fence_irq_stop(struct xe_hw_fence_irq *irq)
-{
-	spin_lock_irq(&irq->lock);
-	irq->enabled = false;
-	spin_unlock_irq(&irq->lock);
-}
-
-void xe_hw_fence_irq_start(struct xe_hw_fence_irq *irq)
-{
-	spin_lock_irq(&irq->lock);
-	irq->enabled = true;
-	spin_unlock_irq(&irq->lock);
-
-	irq_work_queue(&irq->work);
-}
-
 void xe_hw_fence_ctx_init(struct xe_hw_fence_ctx *ctx, struct xe_gt *gt,
 			  struct xe_hw_fence_irq *irq, const char *name)
 {
diff --git a/drivers/gpu/drm/xe/xe_hw_fence.h b/drivers/gpu/drm/xe/xe_hw_fence.h
index f13a1c4982c7..599492c13f80 100644
--- a/drivers/gpu/drm/xe/xe_hw_fence.h
+++ b/drivers/gpu/drm/xe/xe_hw_fence.h
@@ -17,8 +17,6 @@ void xe_hw_fence_module_exit(void);
 void xe_hw_fence_irq_init(struct xe_hw_fence_irq *irq);
 void xe_hw_fence_irq_finish(struct xe_hw_fence_irq *irq);
 void xe_hw_fence_irq_run(struct xe_hw_fence_irq *irq);
-void xe_hw_fence_irq_stop(struct xe_hw_fence_irq *irq);
-void xe_hw_fence_irq_start(struct xe_hw_fence_irq *irq);
 
 void xe_hw_fence_ctx_init(struct xe_hw_fence_ctx *ctx, struct xe_gt *gt,
 			  struct xe_hw_fence_irq *irq, const char *name);
-- 
2.34.1


* [PATCH v7 5/9] drm/xe: Only toggle scheduling in TDR if GuC is running
  2025-12-01 18:39 [PATCH v7 0/9] Fix DRM scheduler layering violations in Xe Matthew Brost
                   ` (3 preceding siblings ...)
  2025-12-01 18:39 ` [PATCH v7 4/9] drm/xe: Stop abusing DRM scheduler internals Matthew Brost
@ 2025-12-01 18:39 ` Matthew Brost
  2025-12-01 18:39 ` [PATCH v7 6/9] drm/xe: Do not deregister queues in TDR Matthew Brost
                   ` (4 subsequent siblings)
  9 siblings, 0 replies; 27+ messages in thread
From: Matthew Brost @ 2025-12-01 18:39 UTC (permalink / raw)
  To: intel-xe; +Cc: dri-devel

If the firmware is not running during TDR (e.g., when the driver is
unloading), there's no need to toggle scheduling in the GuC. In such
cases, skip this step.

v4:
 - Bail out of the wait if the UC is not running (Niranjana)

Signed-off-by: Matthew Brost <matthew.brost@intel.com>
Reviewed-by: Niranjana Vishwanathapura <niranjana.vishwanathapura@intel.com>
---
 drivers/gpu/drm/xe/xe_guc_submit.c | 3 ++-
 1 file changed, 2 insertions(+), 1 deletion(-)

diff --git a/drivers/gpu/drm/xe/xe_guc_submit.c b/drivers/gpu/drm/xe/xe_guc_submit.c
index c8027ccaec81..d432a39cfbfd 100644
--- a/drivers/gpu/drm/xe/xe_guc_submit.c
+++ b/drivers/gpu/drm/xe/xe_guc_submit.c
@@ -1274,7 +1274,7 @@ guc_exec_queue_timedout_job(struct drm_sched_job *drm_job)
 		if (exec_queue_reset(q))
 			err = -EIO;
 
-		if (!exec_queue_destroyed(q)) {
+		if (!exec_queue_destroyed(q) && xe_uc_fw_is_running(&guc->fw)) {
 			/*
 			 * Wait for any pending G2H to flush out before
 			 * modifying state
@@ -1309,6 +1309,7 @@ guc_exec_queue_timedout_job(struct drm_sched_job *drm_job)
 		 */
 		smp_rmb();
 		ret = wait_event_timeout(guc->ct.wq,
+					 !xe_uc_fw_is_running(&guc->fw) ||
 					 !exec_queue_pending_disable(q) ||
 					 xe_guc_read_stopped(guc) ||
 					 vf_recovery(guc), HZ * 5);
-- 
2.34.1


* [PATCH v7 6/9] drm/xe: Do not deregister queues in TDR
  2025-12-01 18:39 [PATCH v7 0/9] Fix DRM scheduler layering violations in Xe Matthew Brost
                   ` (4 preceding siblings ...)
  2025-12-01 18:39 ` [PATCH v7 5/9] drm/xe: Only toggle scheduling in TDR if GuC is running Matthew Brost
@ 2025-12-01 18:39 ` Matthew Brost
  2025-12-01 18:39 ` [PATCH v7 7/9] drm/xe: Remove special casing for LR queues in submission Matthew Brost
                   ` (3 subsequent siblings)
  9 siblings, 0 replies; 27+ messages in thread
From: Matthew Brost @ 2025-12-01 18:39 UTC (permalink / raw)
  To: intel-xe; +Cc: dri-devel

Deregistering queues in the TDR introduces unnecessary complexity,
requiring reference-counting techniques to function correctly,
particularly to prevent use-after-free (UAF) issues while a
deregistration initiated from the TDR is in progress.

All that's needed in the TDR is to kick the queue off the hardware,
which is achieved by disabling scheduling. Queue deregistration should
be handled in a single, well-defined point in the cleanup path, tied to
the queue's reference count.

v4:
 - Explain why the extra ref was needed prior to this patch (Niranjana)

Signed-off-by: Matthew Brost <matthew.brost@intel.com>
Reviewed-by: Niranjana Vishwanathapura <niranjana.vishwanathapura@intel.com>
---
 drivers/gpu/drm/xe/xe_guc_submit.c | 65 +++++-------------------------
 1 file changed, 9 insertions(+), 56 deletions(-)

diff --git a/drivers/gpu/drm/xe/xe_guc_submit.c b/drivers/gpu/drm/xe/xe_guc_submit.c
index d432a39cfbfd..622b3d92ba41 100644
--- a/drivers/gpu/drm/xe/xe_guc_submit.c
+++ b/drivers/gpu/drm/xe/xe_guc_submit.c
@@ -69,9 +69,8 @@ exec_queue_to_guc(struct xe_exec_queue *q)
 #define EXEC_QUEUE_STATE_WEDGED			(1 << 8)
 #define EXEC_QUEUE_STATE_BANNED			(1 << 9)
 #define EXEC_QUEUE_STATE_CHECK_TIMEOUT		(1 << 10)
-#define EXEC_QUEUE_STATE_EXTRA_REF		(1 << 11)
-#define EXEC_QUEUE_STATE_PENDING_RESUME		(1 << 12)
-#define EXEC_QUEUE_STATE_PENDING_TDR_EXIT	(1 << 13)
+#define EXEC_QUEUE_STATE_PENDING_RESUME		(1 << 11)
+#define EXEC_QUEUE_STATE_PENDING_TDR_EXIT	(1 << 12)
 
 static bool exec_queue_registered(struct xe_exec_queue *q)
 {
@@ -218,21 +217,6 @@ static void clear_exec_queue_check_timeout(struct xe_exec_queue *q)
 	atomic_and(~EXEC_QUEUE_STATE_CHECK_TIMEOUT, &q->guc->state);
 }
 
-static bool exec_queue_extra_ref(struct xe_exec_queue *q)
-{
-	return atomic_read(&q->guc->state) & EXEC_QUEUE_STATE_EXTRA_REF;
-}
-
-static void set_exec_queue_extra_ref(struct xe_exec_queue *q)
-{
-	atomic_or(EXEC_QUEUE_STATE_EXTRA_REF, &q->guc->state);
-}
-
-static void clear_exec_queue_extra_ref(struct xe_exec_queue *q)
-{
-	atomic_and(~EXEC_QUEUE_STATE_EXTRA_REF, &q->guc->state);
-}
-
 static bool exec_queue_pending_resume(struct xe_exec_queue *q)
 {
 	return atomic_read(&q->guc->state) & EXEC_QUEUE_STATE_PENDING_RESUME;
@@ -1190,25 +1174,6 @@ static void disable_scheduling(struct xe_exec_queue *q, bool immediate)
 		       G2H_LEN_DW_SCHED_CONTEXT_MODE_SET, 1);
 }
 
-static void __deregister_exec_queue(struct xe_guc *guc, struct xe_exec_queue *q)
-{
-	u32 action[] = {
-		XE_GUC_ACTION_DEREGISTER_CONTEXT,
-		q->guc->id,
-	};
-
-	xe_gt_assert(guc_to_gt(guc), !exec_queue_destroyed(q));
-	xe_gt_assert(guc_to_gt(guc), exec_queue_registered(q));
-	xe_gt_assert(guc_to_gt(guc), !exec_queue_pending_enable(q));
-	xe_gt_assert(guc_to_gt(guc), !exec_queue_pending_disable(q));
-
-	set_exec_queue_destroyed(q);
-	trace_xe_exec_queue_deregister(q);
-
-	xe_guc_ct_send(&guc->ct, action, ARRAY_SIZE(action),
-		       G2H_LEN_DW_DEREGISTER_CONTEXT, 1);
-}
-
 static enum drm_gpu_sched_stat
 guc_exec_queue_timedout_job(struct drm_sched_job *drm_job)
 {
@@ -1224,6 +1189,7 @@ guc_exec_queue_timedout_job(struct drm_sched_job *drm_job)
 	bool wedged = false, skip_timeout_check;
 
 	xe_gt_assert(guc_to_gt(guc), !xe_exec_queue_is_lr(q));
+	xe_gt_assert(guc_to_gt(guc), !exec_queue_destroyed(q));
 
 	/*
 	 * TDR has fired before free job worker. Common if exec queue
@@ -1240,8 +1206,7 @@ guc_exec_queue_timedout_job(struct drm_sched_job *drm_job)
 
 	/* Must check all state after stopping scheduler */
 	skip_timeout_check = exec_queue_reset(q) ||
-		exec_queue_killed_or_banned_or_wedged(q) ||
-		exec_queue_destroyed(q);
+		exec_queue_killed_or_banned_or_wedged(q);
 
 	/*
 	 * If devcoredump not captured and GuC capture for the job is not ready
@@ -1268,13 +1233,13 @@ guc_exec_queue_timedout_job(struct drm_sched_job *drm_job)
 		wedged = guc_submit_hint_wedged(exec_queue_to_guc(q));
 
 	/* Engine state now stable, disable scheduling to check timestamp */
-	if (!wedged && exec_queue_registered(q)) {
+	if (!wedged && (exec_queue_enabled(q) || exec_queue_pending_disable(q))) {
 		int ret;
 
 		if (exec_queue_reset(q))
 			err = -EIO;
 
-		if (!exec_queue_destroyed(q) && xe_uc_fw_is_running(&guc->fw)) {
+		if (xe_uc_fw_is_running(&guc->fw)) {
 			/*
 			 * Wait for any pending G2H to flush out before
 			 * modifying state
@@ -1324,8 +1289,6 @@ guc_exec_queue_timedout_job(struct drm_sched_job *drm_job)
 			xe_devcoredump(q, job,
 				       "Schedule disable failed to respond, guc_id=%d, ret=%d, guc_read=%d",
 				       q->guc->id, ret, xe_guc_read_stopped(guc));
-			set_exec_queue_extra_ref(q);
-			xe_exec_queue_get(q);	/* GT reset owns this */
 			set_exec_queue_banned(q);
 			xe_gt_reset_async(q->gt);
 			xe_sched_tdr_queue_imm(sched);
@@ -1378,13 +1341,7 @@ guc_exec_queue_timedout_job(struct drm_sched_job *drm_job)
 		}
 	}
 
-	/* Finish cleaning up exec queue via deregister */
 	set_exec_queue_banned(q);
-	if (!wedged && exec_queue_registered(q) && !exec_queue_destroyed(q)) {
-		set_exec_queue_extra_ref(q);
-		xe_exec_queue_get(q);
-		__deregister_exec_queue(guc, q);
-	}
 
 	/* Mark all outstanding jobs as bad, thus completing them */
 	xe_sched_job_set_error(job, err);
@@ -1928,7 +1885,7 @@ static void guc_exec_queue_stop(struct xe_guc *guc, struct xe_exec_queue *q)
 
 	/* Clean up lost G2H + reset engine state */
 	if (exec_queue_registered(q)) {
-		if (exec_queue_extra_ref(q) || xe_exec_queue_is_lr(q))
+		if (xe_exec_queue_is_lr(q))
 			xe_exec_queue_put(q);
 		else if (exec_queue_destroyed(q))
 			__guc_exec_queue_destroy(guc, q);
@@ -2062,11 +2019,7 @@ static void guc_exec_queue_revert_pending_state_change(struct xe_guc *guc,
 
 	if (exec_queue_destroyed(q) && exec_queue_registered(q)) {
 		clear_exec_queue_destroyed(q);
-		if (exec_queue_extra_ref(q))
-			xe_exec_queue_put(q);
-		else
-			q->guc->needs_cleanup = true;
-		clear_exec_queue_extra_ref(q);
+		q->guc->needs_cleanup = true;
 		xe_gt_dbg(guc_to_gt(guc), "Replay CLEANUP - guc_id=%d",
 			  q->guc->id);
 	}
@@ -2533,7 +2486,7 @@ static void handle_deregister_done(struct xe_guc *guc, struct xe_exec_queue *q)
 
 	clear_exec_queue_registered(q);
 
-	if (exec_queue_extra_ref(q) || xe_exec_queue_is_lr(q))
+	if (xe_exec_queue_is_lr(q))
 		xe_exec_queue_put(q);
 	else
 		__guc_exec_queue_destroy(guc, q);
-- 
2.34.1


* [PATCH v7 7/9] drm/xe: Remove special casing for LR queues in submission
  2025-12-01 18:39 [PATCH v7 0/9] Fix DRM scheduler layering violations in Xe Matthew Brost
                   ` (5 preceding siblings ...)
  2025-12-01 18:39 ` [PATCH v7 6/9] drm/xe: Do not deregister queues in TDR Matthew Brost
@ 2025-12-01 18:39 ` Matthew Brost
  2025-12-01 18:39 ` [PATCH v7 8/9] drm/xe: Disable timestamp WA on VFs Matthew Brost
                   ` (2 subsequent siblings)
  9 siblings, 0 replies; 27+ messages in thread
From: Matthew Brost @ 2025-12-01 18:39 UTC (permalink / raw)
  To: intel-xe; +Cc: dri-devel

Now that LR jobs are tracked by the DRM scheduler, there's no longer a
need to special-case LR queues. This change removes all LR
queue-specific handling, including dedicated TDR logic, reference
counting schemes, and other related mechanisms.

v4:
 - Remove xe_exec_queue_lr_cleanup tracepoint (Niranjana)

Signed-off-by: Matthew Brost <matthew.brost@intel.com>
Reviewed-by: Niranjana Vishwanathapura <niranjana.vishwanathapura@intel.com>
---
 drivers/gpu/drm/xe/xe_guc_exec_queue_types.h |   2 -
 drivers/gpu/drm/xe/xe_guc_submit.c           | 132 ++-----------------
 drivers/gpu/drm/xe/xe_trace.h                |   5 -
 3 files changed, 11 insertions(+), 128 deletions(-)

diff --git a/drivers/gpu/drm/xe/xe_guc_exec_queue_types.h b/drivers/gpu/drm/xe/xe_guc_exec_queue_types.h
index a3b034e4b205..fd0915ed8eb1 100644
--- a/drivers/gpu/drm/xe/xe_guc_exec_queue_types.h
+++ b/drivers/gpu/drm/xe/xe_guc_exec_queue_types.h
@@ -33,8 +33,6 @@ struct xe_guc_exec_queue {
 	 */
 #define MAX_STATIC_MSG_TYPE	3
 	struct xe_sched_msg static_msgs[MAX_STATIC_MSG_TYPE];
-	/** @lr_tdr: long running TDR worker */
-	struct work_struct lr_tdr;
 	/** @destroy_async: do final destroy async from this worker */
 	struct work_struct destroy_async;
 	/** @resume_time: time of last resume */
diff --git a/drivers/gpu/drm/xe/xe_guc_submit.c b/drivers/gpu/drm/xe/xe_guc_submit.c
index 622b3d92ba41..8190f2afbaed 100644
--- a/drivers/gpu/drm/xe/xe_guc_submit.c
+++ b/drivers/gpu/drm/xe/xe_guc_submit.c
@@ -674,14 +674,6 @@ static void register_exec_queue(struct xe_exec_queue *q, int ctx_type)
 		parallel_write(xe, map, wq_desc.wq_status, WQ_STATUS_ACTIVE);
 	}
 
-	/*
-	 * We must keep a reference for LR engines if engine is registered with
-	 * the GuC as jobs signal immediately and can't destroy an engine if the
-	 * GuC has a reference to it.
-	 */
-	if (xe_exec_queue_is_lr(q))
-		xe_exec_queue_get(q);
-
 	set_exec_queue_registered(q);
 	trace_xe_exec_queue_register(q);
 	if (xe_exec_queue_is_parallel(q))
@@ -854,7 +846,7 @@ guc_exec_queue_run_job(struct drm_sched_job *drm_job)
 	struct xe_sched_job *job = to_xe_sched_job(drm_job);
 	struct xe_exec_queue *q = job->q;
 	struct xe_guc *guc = exec_queue_to_guc(q);
-	bool lr = xe_exec_queue_is_lr(q), killed_or_banned_or_wedged =
+	bool killed_or_banned_or_wedged =
 		exec_queue_killed_or_banned_or_wedged(q);
 
 	xe_gt_assert(guc_to_gt(guc), !(exec_queue_destroyed(q) || exec_queue_pending_disable(q)) ||
@@ -871,15 +863,6 @@ guc_exec_queue_run_job(struct drm_sched_job *drm_job)
 		job->restore_replay = false;
 	}
 
-	/*
-	 * We don't care about job-fence ordering in LR VMs because these fences
-	 * are never exported; they are used solely to keep jobs on the pending
-	 * list. Once a queue enters an error state, there's no need to track
-	 * them.
-	 */
-	if (killed_or_banned_or_wedged && lr)
-		xe_sched_job_set_error(job, -ECANCELED);
-
 	return job->fence;
 }
 
@@ -923,8 +906,7 @@ static void disable_scheduling_deregister(struct xe_guc *guc,
 		xe_gt_warn(q->gt, "Pending enable/disable failed to respond\n");
 		xe_sched_submission_start(sched);
 		xe_gt_reset_async(q->gt);
-		if (!xe_exec_queue_is_lr(q))
-			xe_sched_tdr_queue_imm(sched);
+		xe_sched_tdr_queue_imm(sched);
 		return;
 	}
 
@@ -950,10 +932,7 @@ static void xe_guc_exec_queue_trigger_cleanup(struct xe_exec_queue *q)
 	/** to wakeup xe_wait_user_fence ioctl if exec queue is reset */
 	wake_up_all(&xe->ufence_wq);
 
-	if (xe_exec_queue_is_lr(q))
-		queue_work(guc_to_gt(guc)->ordered_wq, &q->guc->lr_tdr);
-	else
-		xe_sched_tdr_queue_imm(&q->guc->sched);
+	xe_sched_tdr_queue_imm(&q->guc->sched);
 }
 
 /**
@@ -1009,78 +988,6 @@ static bool guc_submit_hint_wedged(struct xe_guc *guc)
 	return true;
 }
 
-static void xe_guc_exec_queue_lr_cleanup(struct work_struct *w)
-{
-	struct xe_guc_exec_queue *ge =
-		container_of(w, struct xe_guc_exec_queue, lr_tdr);
-	struct xe_exec_queue *q = ge->q;
-	struct xe_guc *guc = exec_queue_to_guc(q);
-	struct xe_gpu_scheduler *sched = &ge->sched;
-	struct drm_sched_job *job;
-	bool wedged = false;
-
-	xe_gt_assert(guc_to_gt(guc), xe_exec_queue_is_lr(q));
-
-	if (vf_recovery(guc))
-		return;
-
-	trace_xe_exec_queue_lr_cleanup(q);
-
-	if (!exec_queue_killed(q))
-		wedged = guc_submit_hint_wedged(exec_queue_to_guc(q));
-
-	/* Kill the run_job / process_msg entry points */
-	xe_sched_submission_stop(sched);
-
-	/*
-	 * Engine state now mostly stable, disable scheduling / deregister if
-	 * needed. This cleanup routine might be called multiple times, where
-	 * the actual async engine deregister drops the final engine ref.
-	 * Calling disable_scheduling_deregister will mark the engine as
-	 * destroyed and fire off the CT requests to disable scheduling /
-	 * deregister, which we only want to do once. We also don't want to mark
-	 * the engine as pending_disable again as this may race with the
-	 * xe_guc_deregister_done_handler() which treats it as an unexpected
-	 * state.
-	 */
-	if (!wedged && exec_queue_registered(q) && !exec_queue_destroyed(q)) {
-		struct xe_guc *guc = exec_queue_to_guc(q);
-		int ret;
-
-		set_exec_queue_banned(q);
-		disable_scheduling_deregister(guc, q);
-
-		/*
-		 * Must wait for scheduling to be disabled before signalling
-		 * any fences, if GT broken the GT reset code should signal us.
-		 */
-		ret = wait_event_timeout(guc->ct.wq,
-					 !exec_queue_pending_disable(q) ||
-					 xe_guc_read_stopped(guc) ||
-					 vf_recovery(guc), HZ * 5);
-		if (vf_recovery(guc))
-			return;
-
-		if (!ret) {
-			xe_gt_warn(q->gt, "Schedule disable failed to respond, guc_id=%d\n",
-				   q->guc->id);
-			xe_devcoredump(q, NULL, "Schedule disable failed to respond, guc_id=%d\n",
-				       q->guc->id);
-			xe_sched_submission_start(sched);
-			xe_gt_reset_async(q->gt);
-			return;
-		}
-	}
-
-	if (!exec_queue_killed(q) && !xe_lrc_ring_is_idle(q->lrc[0]))
-		xe_devcoredump(q, NULL, "LR job cleanup, guc_id=%d", q->guc->id);
-
-	drm_sched_for_each_pending_job(job, &sched->base, NULL)
-		xe_sched_job_set_error(to_xe_sched_job(job), -ECANCELED);
-
-	xe_sched_submission_start(sched);
-}
-
 #define ADJUST_FIVE_PERCENT(__t)	mul_u64_u32_div(__t, 105, 100)
 
 static bool check_timeout(struct xe_exec_queue *q, struct xe_sched_job *job)
@@ -1150,8 +1057,7 @@ static void enable_scheduling(struct xe_exec_queue *q)
 		xe_gt_warn(guc_to_gt(guc), "Schedule enable failed to respond");
 		set_exec_queue_banned(q);
 		xe_gt_reset_async(q->gt);
-		if (!xe_exec_queue_is_lr(q))
-			xe_sched_tdr_queue_imm(&q->guc->sched);
+		xe_sched_tdr_queue_imm(&q->guc->sched);
 	}
 }
 
@@ -1188,7 +1094,6 @@ guc_exec_queue_timedout_job(struct drm_sched_job *drm_job)
 	pid_t pid = -1;
 	bool wedged = false, skip_timeout_check;
 
-	xe_gt_assert(guc_to_gt(guc), !xe_exec_queue_is_lr(q));
 	xe_gt_assert(guc_to_gt(guc), !exec_queue_destroyed(q));
 
 	/*
@@ -1208,6 +1113,10 @@ guc_exec_queue_timedout_job(struct drm_sched_job *drm_job)
 	skip_timeout_check = exec_queue_reset(q) ||
 		exec_queue_killed_or_banned_or_wedged(q);
 
+	/* LR jobs can only get here if queue has been killed or hit an error */
+	if (xe_exec_queue_is_lr(q))
+		xe_gt_assert(guc_to_gt(guc), skip_timeout_check);
+
 	/*
 	 * If devcoredump not captured and GuC capture for the job is not ready
 	 * do manual capture first and decide later if we need to use it
@@ -1397,8 +1306,6 @@ static void __guc_exec_queue_destroy_async(struct work_struct *w)
 	guard(xe_pm_runtime)(guc_to_xe(guc));
 	trace_xe_exec_queue_destroy(q);
 
-	if (xe_exec_queue_is_lr(q))
-		cancel_work_sync(&ge->lr_tdr);
 	/* Confirm no work left behind accessing device structures */
 	cancel_delayed_work_sync(&ge->sched.base.work_tdr);
 
@@ -1629,9 +1536,6 @@ static int guc_exec_queue_init(struct xe_exec_queue *q)
 	if (err)
 		goto err_sched;
 
-	if (xe_exec_queue_is_lr(q))
-		INIT_WORK(&q->guc->lr_tdr, xe_guc_exec_queue_lr_cleanup);
-
 	mutex_lock(&guc->submission_state.lock);
 
 	err = alloc_guc_id(guc, q);
@@ -1885,9 +1789,7 @@ static void guc_exec_queue_stop(struct xe_guc *guc, struct xe_exec_queue *q)
 
 	/* Clean up lost G2H + reset engine state */
 	if (exec_queue_registered(q)) {
-		if (xe_exec_queue_is_lr(q))
-			xe_exec_queue_put(q);
-		else if (exec_queue_destroyed(q))
+		if (exec_queue_destroyed(q))
 			__guc_exec_queue_destroy(guc, q);
 	}
 	if (q->guc->suspend_pending) {
@@ -1917,9 +1819,6 @@ static void guc_exec_queue_stop(struct xe_guc *guc, struct xe_exec_queue *q)
 				trace_xe_sched_job_ban(job);
 				ban = true;
 			}
-		} else if (xe_exec_queue_is_lr(q) &&
-			   !xe_lrc_ring_is_idle(q->lrc[0])) {
-			ban = true;
 		}
 
 		if (ban) {
@@ -2002,8 +1901,6 @@ static void guc_exec_queue_revert_pending_state_change(struct xe_guc *guc,
 	if (pending_enable && !pending_resume &&
 	    !exec_queue_pending_tdr_exit(q)) {
 		clear_exec_queue_registered(q);
-		if (xe_exec_queue_is_lr(q))
-			xe_exec_queue_put(q);
 		xe_gt_dbg(guc_to_gt(guc), "Replay REGISTER - guc_id=%d",
 			  q->guc->id);
 	}
@@ -2072,10 +1969,7 @@ static void guc_exec_queue_pause(struct xe_guc *guc, struct xe_exec_queue *q)
 
 	/* Stop scheduling + flush any DRM scheduler operations */
 	xe_sched_submission_stop(sched);
-	if (xe_exec_queue_is_lr(q))
-		cancel_work_sync(&q->guc->lr_tdr);
-	else
-		cancel_delayed_work_sync(&sched->base.work_tdr);
+	cancel_delayed_work_sync(&sched->base.work_tdr);
 
 	guc_exec_queue_revert_pending_state_change(guc, q);
 
@@ -2485,11 +2379,7 @@ static void handle_deregister_done(struct xe_guc *guc, struct xe_exec_queue *q)
 	trace_xe_exec_queue_deregister_done(q);
 
 	clear_exec_queue_registered(q);
-
-	if (xe_exec_queue_is_lr(q))
-		xe_exec_queue_put(q);
-	else
-		__guc_exec_queue_destroy(guc, q);
+	__guc_exec_queue_destroy(guc, q);
 }
 
 int xe_guc_deregister_done_handler(struct xe_guc *guc, u32 *msg, u32 len)
diff --git a/drivers/gpu/drm/xe/xe_trace.h b/drivers/gpu/drm/xe/xe_trace.h
index 79a97b086cb2..cf2ef70fb7ce 100644
--- a/drivers/gpu/drm/xe/xe_trace.h
+++ b/drivers/gpu/drm/xe/xe_trace.h
@@ -182,11 +182,6 @@ DEFINE_EVENT(xe_exec_queue, xe_exec_queue_resubmit,
 	     TP_ARGS(q)
 );
 
-DEFINE_EVENT(xe_exec_queue, xe_exec_queue_lr_cleanup,
-	     TP_PROTO(struct xe_exec_queue *q),
-	     TP_ARGS(q)
-);
-
 DECLARE_EVENT_CLASS(xe_sched_job,
 		    TP_PROTO(struct xe_sched_job *job),
 		    TP_ARGS(job),
-- 
2.34.1


* [PATCH v7 8/9] drm/xe: Disable timestamp WA on VFs
  2025-12-01 18:39 [PATCH v7 0/9] Fix DRM scheduler layering violations in Xe Matthew Brost
                   ` (6 preceding siblings ...)
  2025-12-01 18:39 ` [PATCH v7 7/9] drm/xe: Remove special casing for LR queues in submission Matthew Brost
@ 2025-12-01 18:39 ` Matthew Brost
  2025-12-02  6:42   ` Umesh Nerlige Ramappa
  2025-12-01 18:39 ` [PATCH v7 9/9] drm/xe: Avoid toggling schedule state to check LRC timestamp in TDR Matthew Brost
  2025-12-03  1:23 ` [PATCH v7 0/9] Fix DRM scheduler layering violations in Xe Matthew Brost
  9 siblings, 1 reply; 27+ messages in thread
From: Matthew Brost @ 2025-12-01 18:39 UTC (permalink / raw)
  To: intel-xe; +Cc: dri-devel

The timestamp WA does not work on a VF because it requires reading MMIO
registers, which are inaccessible on a VF. This timestamp WA confuses
LRC sampling on a VF during TDR, as the LRC timestamp would always read
as 1 for any active context. Disable the timestamp WA on VFs to avoid
this confusion.

Signed-off-by: Matthew Brost <matthew.brost@intel.com>
---
 drivers/gpu/drm/xe/xe_lrc.c | 3 +++
 1 file changed, 3 insertions(+)

diff --git a/drivers/gpu/drm/xe/xe_lrc.c b/drivers/gpu/drm/xe/xe_lrc.c
index a05060f75e7e..166353455f8f 100644
--- a/drivers/gpu/drm/xe/xe_lrc.c
+++ b/drivers/gpu/drm/xe/xe_lrc.c
@@ -1063,6 +1063,9 @@ static ssize_t setup_utilization_wa(struct xe_lrc *lrc,
 {
 	u32 *cmd = batch;
 
+	if (IS_SRIOV_VF(gt_to_xe(lrc->gt)))
+		return 0;
+
 	if (xe_gt_WARN_ON(lrc->gt, max_len < 12))
 		return -ENOSPC;
 
-- 
2.34.1


* [PATCH v7 9/9] drm/xe: Avoid toggling schedule state to check LRC timestamp in TDR
  2025-12-01 18:39 [PATCH v7 0/9] Fix DRM scheduler layering violations in Xe Matthew Brost
                   ` (7 preceding siblings ...)
  2025-12-01 18:39 ` [PATCH v7 8/9] drm/xe: Disable timestamp WA on VFs Matthew Brost
@ 2025-12-01 18:39 ` Matthew Brost
  2025-12-02  7:31   ` Umesh Nerlige Ramappa
  2025-12-03  1:23 ` [PATCH v7 0/9] Fix DRM scheduler layering violations in Xe Matthew Brost
  9 siblings, 1 reply; 27+ messages in thread
From: Matthew Brost @ 2025-12-01 18:39 UTC (permalink / raw)
  To: intel-xe; +Cc: dri-devel

On non-VFs, we now have proper infrastructure to accurately check the LRC
timestamp without toggling the scheduling state. For VFs, it is still
possible to get an inaccurate view if the context is on hardware. We
guard against free-running contexts on VFs by banning jobs whose
timestamps are not moving. In addition, VFs have a timeslice quantum
that naturally triggers context switches when more than one VF is
running, thus updating the LRC timestamp.

For multi-queue, it is desirable to avoid scheduling toggling in the TDR
because this scheduling state is shared among many queues. Furthermore,
this change simplifies the GuC state machine. The trade-off for VF cases
seems worthwhile.

v5:
 - Add xe_lrc_timestamp helper (Umesh)
v6:
 - Reduce number of tries on stuck timestamp (VF testing)
 - Convert job timestamp save to a memory copy (VF testing)
v7:
 - Save ctx timestamp to LRC when start VF job (VF testing)

Signed-off-by: Matthew Brost <matthew.brost@intel.com>
---
 drivers/gpu/drm/xe/xe_guc_submit.c      | 97 ++++++-------------------
 drivers/gpu/drm/xe/xe_lrc.c             | 42 +++++++----
 drivers/gpu/drm/xe/xe_lrc.h             |  3 +-
 drivers/gpu/drm/xe/xe_ring_ops.c        | 25 +++++--
 drivers/gpu/drm/xe/xe_sched_job.c       |  1 +
 drivers/gpu/drm/xe/xe_sched_job_types.h |  2 +
 6 files changed, 76 insertions(+), 94 deletions(-)

diff --git a/drivers/gpu/drm/xe/xe_guc_submit.c b/drivers/gpu/drm/xe/xe_guc_submit.c
index 8190f2afbaed..dc4bf3126450 100644
--- a/drivers/gpu/drm/xe/xe_guc_submit.c
+++ b/drivers/gpu/drm/xe/xe_guc_submit.c
@@ -68,9 +68,7 @@ exec_queue_to_guc(struct xe_exec_queue *q)
 #define EXEC_QUEUE_STATE_KILLED			(1 << 7)
 #define EXEC_QUEUE_STATE_WEDGED			(1 << 8)
 #define EXEC_QUEUE_STATE_BANNED			(1 << 9)
-#define EXEC_QUEUE_STATE_CHECK_TIMEOUT		(1 << 10)
-#define EXEC_QUEUE_STATE_PENDING_RESUME		(1 << 11)
-#define EXEC_QUEUE_STATE_PENDING_TDR_EXIT	(1 << 12)
+#define EXEC_QUEUE_STATE_PENDING_RESUME		(1 << 10)
 
 static bool exec_queue_registered(struct xe_exec_queue *q)
 {
@@ -202,21 +200,6 @@ static void set_exec_queue_wedged(struct xe_exec_queue *q)
 	atomic_or(EXEC_QUEUE_STATE_WEDGED, &q->guc->state);
 }
 
-static bool exec_queue_check_timeout(struct xe_exec_queue *q)
-{
-	return atomic_read(&q->guc->state) & EXEC_QUEUE_STATE_CHECK_TIMEOUT;
-}
-
-static void set_exec_queue_check_timeout(struct xe_exec_queue *q)
-{
-	atomic_or(EXEC_QUEUE_STATE_CHECK_TIMEOUT, &q->guc->state);
-}
-
-static void clear_exec_queue_check_timeout(struct xe_exec_queue *q)
-{
-	atomic_and(~EXEC_QUEUE_STATE_CHECK_TIMEOUT, &q->guc->state);
-}
-
 static bool exec_queue_pending_resume(struct xe_exec_queue *q)
 {
 	return atomic_read(&q->guc->state) & EXEC_QUEUE_STATE_PENDING_RESUME;
@@ -232,21 +215,6 @@ static void clear_exec_queue_pending_resume(struct xe_exec_queue *q)
 	atomic_and(~EXEC_QUEUE_STATE_PENDING_RESUME, &q->guc->state);
 }
 
-static bool exec_queue_pending_tdr_exit(struct xe_exec_queue *q)
-{
-	return atomic_read(&q->guc->state) & EXEC_QUEUE_STATE_PENDING_TDR_EXIT;
-}
-
-static void set_exec_queue_pending_tdr_exit(struct xe_exec_queue *q)
-{
-	atomic_or(EXEC_QUEUE_STATE_PENDING_TDR_EXIT, &q->guc->state);
-}
-
-static void clear_exec_queue_pending_tdr_exit(struct xe_exec_queue *q)
-{
-	atomic_and(~EXEC_QUEUE_STATE_PENDING_TDR_EXIT, &q->guc->state);
-}
-
 static bool exec_queue_killed_or_banned_or_wedged(struct xe_exec_queue *q)
 {
 	return (atomic_read(&q->guc->state) &
@@ -1006,7 +974,16 @@ static bool check_timeout(struct xe_exec_queue *q, struct xe_sched_job *job)
 		return xe_sched_invalidate_job(job, 2);
 	}
 
-	ctx_timestamp = lower_32_bits(xe_lrc_ctx_timestamp(q->lrc[0]));
+	ctx_timestamp = lower_32_bits(xe_lrc_timestamp(q->lrc[0]));
+	if (ctx_timestamp == job->sample_timestamp) {
+		xe_gt_warn(gt, "Check job timeout: seqno=%u, lrc_seqno=%u, guc_id=%d, timestamp stuck",
+			   xe_sched_job_seqno(job), xe_sched_job_lrc_seqno(job),
+			   q->guc->id);
+
+		return xe_sched_invalidate_job(job, 0);
+	}
+
+	job->sample_timestamp = ctx_timestamp;
 	ctx_job_timestamp = xe_lrc_ctx_job_timestamp(q->lrc[0]);
 
 	/*
@@ -1132,16 +1109,17 @@ guc_exec_queue_timedout_job(struct drm_sched_job *drm_job)
 	}
 
 	/*
-	 * XXX: Sampling timeout doesn't work in wedged mode as we have to
-	 * modify scheduling state to read timestamp. We could read the
-	 * timestamp from a register to accumulate current running time but this
-	 * doesn't work for SRIOV. For now assuming timeouts in wedged mode are
-	 * genuine timeouts.
+	 * Check if job is actually timed out, if so restart job execution and TDR
 	 */
+	if (!skip_timeout_check && !check_timeout(q, job))
+		goto rearm;
+
 	if (!exec_queue_killed(q))
 		wedged = guc_submit_hint_wedged(exec_queue_to_guc(q));
 
-	/* Engine state now stable, disable scheduling to check timestamp */
+	set_exec_queue_banned(q);
+
+	/* Kick job / queue off hardware */
 	if (!wedged && (exec_queue_enabled(q) || exec_queue_pending_disable(q))) {
 		int ret;
 
@@ -1163,13 +1141,6 @@ guc_exec_queue_timedout_job(struct drm_sched_job *drm_job)
 			if (!ret || xe_guc_read_stopped(guc))
 				goto trigger_reset;
 
-			/*
-			 * Flag communicates to G2H handler that schedule
-			 * disable originated from a timeout check. The G2H then
-			 * avoid triggering cleanup or deregistering the exec
-			 * queue.
-			 */
-			set_exec_queue_check_timeout(q);
 			disable_scheduling(q, skip_timeout_check);
 		}
 
@@ -1198,22 +1169,12 @@ guc_exec_queue_timedout_job(struct drm_sched_job *drm_job)
 			xe_devcoredump(q, job,
 				       "Schedule disable failed to respond, guc_id=%d, ret=%d, guc_read=%d",
 				       q->guc->id, ret, xe_guc_read_stopped(guc));
-			set_exec_queue_banned(q);
 			xe_gt_reset_async(q->gt);
 			xe_sched_tdr_queue_imm(sched);
 			goto rearm;
 		}
 	}
 
-	/*
-	 * Check if job is actually timed out, if so restart job execution and TDR
-	 */
-	if (!wedged && !skip_timeout_check && !check_timeout(q, job) &&
-	    !exec_queue_reset(q) && exec_queue_registered(q)) {
-		clear_exec_queue_check_timeout(q);
-		goto sched_enable;
-	}
-
 	if (q->vm && q->vm->xef) {
 		process_name = q->vm->xef->process_name;
 		pid = q->vm->xef->pid;
@@ -1244,14 +1205,11 @@ guc_exec_queue_timedout_job(struct drm_sched_job *drm_job)
 	if (!wedged && (q->flags & EXEC_QUEUE_FLAG_KERNEL ||
 			(q->flags & EXEC_QUEUE_FLAG_VM && !exec_queue_killed(q)))) {
 		if (!xe_sched_invalidate_job(job, 2)) {
-			clear_exec_queue_check_timeout(q);
 			xe_gt_reset_async(q->gt);
 			goto rearm;
 		}
 	}
 
-	set_exec_queue_banned(q);
-
 	/* Mark all outstanding jobs as bad, thus completing them */
 	xe_sched_job_set_error(job, err);
 	drm_sched_for_each_pending_job(tmp_job, &sched->base, NULL)
@@ -1266,9 +1224,6 @@ guc_exec_queue_timedout_job(struct drm_sched_job *drm_job)
 	 */
 	return DRM_GPU_SCHED_STAT_NO_HANG;
 
-sched_enable:
-	set_exec_queue_pending_tdr_exit(q);
-	enable_scheduling(q);
 rearm:
 	/*
 	 * XXX: Ideally want to adjust timeout based on current execution time
@@ -1898,8 +1853,7 @@ static void guc_exec_queue_revert_pending_state_change(struct xe_guc *guc,
 			  q->guc->id);
 	}
 
-	if (pending_enable && !pending_resume &&
-	    !exec_queue_pending_tdr_exit(q)) {
+	if (pending_enable && !pending_resume) {
 		clear_exec_queue_registered(q);
 		xe_gt_dbg(guc_to_gt(guc), "Replay REGISTER - guc_id=%d",
 			  q->guc->id);
@@ -1908,7 +1862,6 @@ static void guc_exec_queue_revert_pending_state_change(struct xe_guc *guc,
 	if (pending_enable) {
 		clear_exec_queue_enabled(q);
 		clear_exec_queue_pending_resume(q);
-		clear_exec_queue_pending_tdr_exit(q);
 		clear_exec_queue_pending_enable(q);
 		xe_gt_dbg(guc_to_gt(guc), "Replay ENABLE - guc_id=%d",
 			  q->guc->id);
@@ -1934,7 +1887,6 @@ static void guc_exec_queue_revert_pending_state_change(struct xe_guc *guc,
 		if (!pending_enable)
 			set_exec_queue_enabled(q);
 		clear_exec_queue_pending_disable(q);
-		clear_exec_queue_check_timeout(q);
 		xe_gt_dbg(guc_to_gt(guc), "Replay DISABLE - guc_id=%d",
 			  q->guc->id);
 	}
@@ -2308,13 +2260,10 @@ static void handle_sched_done(struct xe_guc *guc, struct xe_exec_queue *q,
 
 		q->guc->resume_time = ktime_get();
 		clear_exec_queue_pending_resume(q);
-		clear_exec_queue_pending_tdr_exit(q);
 		clear_exec_queue_pending_enable(q);
 		smp_wmb();
 		wake_up_all(&guc->ct.wq);
 	} else {
-		bool check_timeout = exec_queue_check_timeout(q);
-
 		xe_gt_assert(guc_to_gt(guc), runnable_state == 0);
 		xe_gt_assert(guc_to_gt(guc), exec_queue_pending_disable(q));
 
@@ -2322,11 +2271,11 @@ static void handle_sched_done(struct xe_guc *guc, struct xe_exec_queue *q,
 			suspend_fence_signal(q);
 			clear_exec_queue_pending_disable(q);
 		} else {
-			if (exec_queue_banned(q) || check_timeout) {
+			if (exec_queue_banned(q)) {
 				smp_wmb();
 				wake_up_all(&guc->ct.wq);
 			}
-			if (!check_timeout && exec_queue_destroyed(q)) {
+			if (exec_queue_destroyed(q)) {
 				/*
 				 * Make sure to clear the pending_disable only
 				 * after sampling the destroyed state. We want
@@ -2436,7 +2385,7 @@ int xe_guc_exec_queue_reset_handler(struct xe_guc *guc, u32 *msg, u32 len)
 	 * guc_exec_queue_timedout_job.
 	 */
 	set_exec_queue_reset(q);
-	if (!exec_queue_banned(q) && !exec_queue_check_timeout(q))
+	if (!exec_queue_banned(q))
 		xe_guc_exec_queue_trigger_cleanup(q);
 
 	return 0;
@@ -2517,7 +2466,7 @@ int xe_guc_exec_queue_memory_cat_error_handler(struct xe_guc *guc, u32 *msg,
 
 	/* Treat the same as engine reset */
 	set_exec_queue_reset(q);
-	if (!exec_queue_banned(q) && !exec_queue_check_timeout(q))
+	if (!exec_queue_banned(q))
 		xe_guc_exec_queue_trigger_cleanup(q);
 
 	return 0;
diff --git a/drivers/gpu/drm/xe/xe_lrc.c b/drivers/gpu/drm/xe/xe_lrc.c
index 166353455f8f..38b0c536f6fb 100644
--- a/drivers/gpu/drm/xe/xe_lrc.c
+++ b/drivers/gpu/drm/xe/xe_lrc.c
@@ -852,7 +852,7 @@ u32 xe_lrc_ctx_timestamp_udw_ggtt_addr(struct xe_lrc *lrc)
  *
  * Returns: ctx timestamp value
  */
-u64 xe_lrc_ctx_timestamp(struct xe_lrc *lrc)
+static u64 xe_lrc_ctx_timestamp(struct xe_lrc *lrc)
 {
 	struct xe_device *xe = lrc_to_xe(lrc);
 	struct iosys_map map;
@@ -2380,35 +2380,31 @@ static int get_ctx_timestamp(struct xe_lrc *lrc, u32 engine_id, u64 *reg_ctx_ts)
 }
 
 /**
- * xe_lrc_update_timestamp() - Update ctx timestamp
+ * xe_lrc_timestamp() - Current ctx timestamp
  * @lrc: Pointer to the lrc.
- * @old_ts: Old timestamp value
  *
- * Populate @old_ts current saved ctx timestamp, read new ctx timestamp and
- * update saved value. With support for active contexts, the calculation may be
- * slightly racy, so follow a read-again logic to ensure that the context is
- * still active before returning the right timestamp.
+ * Return latest ctx timestamp. With support for active contexts, the
+ * calculation may be slightly racy, so follow a read-again logic to ensure that
+ * the context is still active before returning the right timestamp.
  *
  * Returns: New ctx timestamp value
  */
-u64 xe_lrc_update_timestamp(struct xe_lrc *lrc, u64 *old_ts)
+u64 xe_lrc_timestamp(struct xe_lrc *lrc)
 {
-	u64 lrc_ts, reg_ts;
+	u64 lrc_ts, reg_ts, new_ts;
 	u32 engine_id;
 
-	*old_ts = lrc->ctx_timestamp;
-
 	lrc_ts = xe_lrc_ctx_timestamp(lrc);
 	/* CTX_TIMESTAMP mmio read is invalid on VF, so return the LRC value */
 	if (IS_SRIOV_VF(lrc_to_xe(lrc))) {
-		lrc->ctx_timestamp = lrc_ts;
+		new_ts = lrc_ts;
 		goto done;
 	}
 
 	if (lrc_ts == CONTEXT_ACTIVE) {
 		engine_id = xe_lrc_engine_id(lrc);
 		if (!get_ctx_timestamp(lrc, engine_id, &reg_ts))
-			lrc->ctx_timestamp = reg_ts;
+			new_ts = reg_ts;
 
 		/* read lrc again to ensure context is still active */
 		lrc_ts = xe_lrc_ctx_timestamp(lrc);
@@ -2419,9 +2415,27 @@ u64 xe_lrc_update_timestamp(struct xe_lrc *lrc, u64 *old_ts)
 	 * be a separate if condition.
 	 */
 	if (lrc_ts != CONTEXT_ACTIVE)
-		lrc->ctx_timestamp = lrc_ts;
+		new_ts = lrc_ts;
 
 done:
+	return new_ts;
+}
+
+/**
+ * xe_lrc_update_timestamp() - Update ctx timestamp
+ * @lrc: Pointer to the lrc.
+ * @old_ts: Old timestamp value
+ *
+ * Populate @old_ts current saved ctx timestamp, read new ctx timestamp and
+ * update saved value.
+ *
+ * Returns: New ctx timestamp value
+ */
+u64 xe_lrc_update_timestamp(struct xe_lrc *lrc, u64 *old_ts)
+{
+	*old_ts = lrc->ctx_timestamp;
+	lrc->ctx_timestamp = xe_lrc_timestamp(lrc);
+
 	trace_xe_lrc_update_timestamp(lrc, *old_ts);
 
 	return lrc->ctx_timestamp;
diff --git a/drivers/gpu/drm/xe/xe_lrc.h b/drivers/gpu/drm/xe/xe_lrc.h
index a32472b92242..93c1234e2706 100644
--- a/drivers/gpu/drm/xe/xe_lrc.h
+++ b/drivers/gpu/drm/xe/xe_lrc.h
@@ -142,7 +142,6 @@ void xe_lrc_snapshot_free(struct xe_lrc_snapshot *snapshot);
 
 u32 xe_lrc_ctx_timestamp_ggtt_addr(struct xe_lrc *lrc);
 u32 xe_lrc_ctx_timestamp_udw_ggtt_addr(struct xe_lrc *lrc);
-u64 xe_lrc_ctx_timestamp(struct xe_lrc *lrc);
 u32 xe_lrc_ctx_job_timestamp_ggtt_addr(struct xe_lrc *lrc);
 u32 xe_lrc_ctx_job_timestamp(struct xe_lrc *lrc);
 int xe_lrc_setup_wa_bb_with_scratch(struct xe_lrc *lrc, struct xe_hw_engine *hwe,
@@ -162,4 +161,6 @@ int xe_lrc_setup_wa_bb_with_scratch(struct xe_lrc *lrc, struct xe_hw_engine *hwe
  */
 u64 xe_lrc_update_timestamp(struct xe_lrc *lrc, u64 *old_ts);
 
+u64 xe_lrc_timestamp(struct xe_lrc *lrc);
+
 #endif
diff --git a/drivers/gpu/drm/xe/xe_ring_ops.c b/drivers/gpu/drm/xe/xe_ring_ops.c
index ac0c6dcffe15..3dacfc2da75c 100644
--- a/drivers/gpu/drm/xe/xe_ring_ops.c
+++ b/drivers/gpu/drm/xe/xe_ring_ops.c
@@ -233,13 +233,26 @@ static u32 get_ppgtt_flag(struct xe_sched_job *job)
 	return 0;
 }
 
-static int emit_copy_timestamp(struct xe_lrc *lrc, u32 *dw, int i)
+static int emit_copy_timestamp(struct xe_device *xe, struct xe_lrc *lrc,
+			       u32 *dw, int i)
 {
 	dw[i++] = MI_STORE_REGISTER_MEM | MI_SRM_USE_GGTT | MI_SRM_ADD_CS_OFFSET;
 	dw[i++] = RING_CTX_TIMESTAMP(0).addr;
 	dw[i++] = xe_lrc_ctx_job_timestamp_ggtt_addr(lrc);
 	dw[i++] = 0;
 
+	/*
+	 * Ensure CTX timestamp >= Job timestamp during VF sampling to avoid
+	 * arithmetic wraparound in TDR.
+	 */
+	if (IS_SRIOV_VF(xe)) {
+		dw[i++] = MI_STORE_REGISTER_MEM | MI_SRM_USE_GGTT |
+			MI_SRM_ADD_CS_OFFSET;
+		dw[i++] = RING_CTX_TIMESTAMP(0).addr;
+		dw[i++] = xe_lrc_ctx_timestamp_ggtt_addr(lrc);
+		dw[i++] = 0;
+	}
+
 	return i;
 }
 
@@ -253,7 +266,7 @@ static void __emit_job_gen12_simple(struct xe_sched_job *job, struct xe_lrc *lrc
 
 	*head = lrc->ring.tail;
 
-	i = emit_copy_timestamp(lrc, dw, i);
+	i = emit_copy_timestamp(gt_to_xe(gt), lrc, dw, i);
 
 	if (job->ring_ops_flush_tlb) {
 		dw[i++] = preparser_disable(true);
@@ -308,7 +321,7 @@ static void __emit_job_gen12_video(struct xe_sched_job *job, struct xe_lrc *lrc,
 
 	*head = lrc->ring.tail;
 
-	i = emit_copy_timestamp(lrc, dw, i);
+	i = emit_copy_timestamp(xe, lrc, dw, i);
 
 	dw[i++] = preparser_disable(true);
 
@@ -362,7 +375,7 @@ static void __emit_job_gen12_render_compute(struct xe_sched_job *job,
 
 	*head = lrc->ring.tail;
 
-	i = emit_copy_timestamp(lrc, dw, i);
+	i = emit_copy_timestamp(xe, lrc, dw, i);
 
 	dw[i++] = preparser_disable(true);
 	if (lacks_render)
@@ -406,12 +419,14 @@ static void emit_migration_job_gen12(struct xe_sched_job *job,
 				     struct xe_lrc *lrc, u32 *head,
 				     u32 seqno)
 {
+	struct xe_gt *gt = job->q->gt;
+	struct xe_device *xe = gt_to_xe(gt);
 	u32 saddr = xe_lrc_start_seqno_ggtt_addr(lrc);
 	u32 dw[MAX_JOB_SIZE_DW], i = 0;
 
 	*head = lrc->ring.tail;
 
-	i = emit_copy_timestamp(lrc, dw, i);
+	i = emit_copy_timestamp(xe, lrc, dw, i);
 
 	i = emit_store_imm_ggtt(saddr, seqno, dw, i);
 
diff --git a/drivers/gpu/drm/xe/xe_sched_job.c b/drivers/gpu/drm/xe/xe_sched_job.c
index cb674a322113..39aec7f6d86d 100644
--- a/drivers/gpu/drm/xe/xe_sched_job.c
+++ b/drivers/gpu/drm/xe/xe_sched_job.c
@@ -110,6 +110,7 @@ struct xe_sched_job *xe_sched_job_create(struct xe_exec_queue *q,
 		return ERR_PTR(-ENOMEM);
 
 	job->q = q;
+	job->sample_timestamp = U64_MAX;
 	kref_init(&job->refcount);
 	xe_exec_queue_get(job->q);
 
diff --git a/drivers/gpu/drm/xe/xe_sched_job_types.h b/drivers/gpu/drm/xe/xe_sched_job_types.h
index 7c4c54fe920a..13c2970e81a8 100644
--- a/drivers/gpu/drm/xe/xe_sched_job_types.h
+++ b/drivers/gpu/drm/xe/xe_sched_job_types.h
@@ -59,6 +59,8 @@ struct xe_sched_job {
 	u32 lrc_seqno;
 	/** @migrate_flush_flags: Additional flush flags for migration jobs */
 	u32 migrate_flush_flags;
+	/** @sample_timestamp: Sampling of job timestamp in TDR */
+	u64 sample_timestamp;
 	/** @ring_ops_flush_tlb: The ring ops need to flush TLB before payload. */
 	bool ring_ops_flush_tlb;
 	/** @ggtt: mapped in ggtt. */
-- 
2.34.1


^ permalink raw reply related	[flat|nested] 27+ messages in thread

* Re: [PATCH v7 8/9] drm/xe: Disable timestamp WA on VFs
  2025-12-01 18:39 ` [PATCH v7 8/9] drm/xe: Disable timestamp WA on VFs Matthew Brost
@ 2025-12-02  6:42   ` Umesh Nerlige Ramappa
  0 siblings, 0 replies; 27+ messages in thread
From: Umesh Nerlige Ramappa @ 2025-12-02  6:42 UTC (permalink / raw)
  To: Matthew Brost; +Cc: intel-xe, dri-devel

On Mon, Dec 01, 2025 at 10:39:53AM -0800, Matthew Brost wrote:
>The timestamp WA does not work on a VF because it requires reading MMIO
>registers, which are inaccessible on a VF. This timestamp WA confuses
>LRC sampling on a VF during TDR, as the LRC timestamp would always read
>as 1 for any active context. Disable the timestamp WA on VFs to avoid
>this confusion.
>
>Signed-off-by: Matthew Brost <matthew.brost@intel.com>

Thanks for fixing this.

Reviewed-by: Umesh Nerlige Ramappa <umesh.nerlige.ramappa@intel.com>

I think this needs a Fixes tag.

Fixes: 617d824c5323 ("drm/xe: Add WA BB to capture active context utilization")

This patch may not backport easily since the functions are renamed in a
later refactor. It may need to be sent to stable manually, or needs the
maintainers' input.

Regards,
Umesh


>---
> drivers/gpu/drm/xe/xe_lrc.c | 3 +++
> 1 file changed, 3 insertions(+)
>
>diff --git a/drivers/gpu/drm/xe/xe_lrc.c b/drivers/gpu/drm/xe/xe_lrc.c
>index a05060f75e7e..166353455f8f 100644
>--- a/drivers/gpu/drm/xe/xe_lrc.c
>+++ b/drivers/gpu/drm/xe/xe_lrc.c
>@@ -1063,6 +1063,9 @@ static ssize_t setup_utilization_wa(struct xe_lrc *lrc,
> {
> 	u32 *cmd = batch;
>
>+	if (IS_SRIOV_VF(gt_to_xe(lrc->gt)))
>+		return 0;
>+
> 	if (xe_gt_WARN_ON(lrc->gt, max_len < 12))
> 		return -ENOSPC;
>
>-- 
>2.34.1
>

^ permalink raw reply	[flat|nested] 27+ messages in thread

* Re: [PATCH v7 9/9] drm/xe: Avoid toggling schedule state to check LRC timestamp in TDR
  2025-12-01 18:39 ` [PATCH v7 9/9] drm/xe: Avoid toggling schedule state to check LRC timestamp in TDR Matthew Brost
@ 2025-12-02  7:31   ` Umesh Nerlige Ramappa
  2025-12-02 15:14     ` Matthew Brost
  0 siblings, 1 reply; 27+ messages in thread
From: Umesh Nerlige Ramappa @ 2025-12-02  7:31 UTC (permalink / raw)
  To: Matthew Brost; +Cc: intel-xe, dri-devel

On Mon, Dec 01, 2025 at 10:39:54AM -0800, Matthew Brost wrote:
>We now have proper infrastructure to accurately check the LRC timestamp
>without toggling the scheduling state for non-VFs. For VFs, it is still
>possible to get an inaccurate view if the context is on hardware. We
>guard against free-running contexts on VFs by banning jobs whose
>timestamps are not moving. In addition, VFs have a timeslice quantum
>that naturally triggers context switches when more than one VF is
>running, thus updating the LRC timestamp.

I guess some workloads are configured to just keep running on VFs 
without switching. I am assuming they are classified as Long Running and 
are not affected by TDR. If so, the timeouts should still work 
reasonably on a VF considering switching is usually in milliseconds and 
timeouts are much larger.

>
>For multi-queue, it is desirable to avoid scheduling toggling in the TDR
>because this scheduling state is shared among many queues. Furthermore,
>this change simplifies the GuC state machine. The trade-off for VF cases
>seems worthwhile.
>
>v5:
> - Add xe_lrc_timestamp helper (Umesh)
>v6:
> - Reduce number of tries on stuck timestamp (VF testing)
> - Convert job timestamp save to a memory copy (VF testing)
>v7:
> - Save ctx timestamp to LRC when start VF job (VF testing)
>
>Signed-off-by: Matthew Brost <matthew.brost@intel.com>
>---
> drivers/gpu/drm/xe/xe_guc_submit.c      | 97 ++++++-------------------
> drivers/gpu/drm/xe/xe_lrc.c             | 42 +++++++----
> drivers/gpu/drm/xe/xe_lrc.h             |  3 +-
> drivers/gpu/drm/xe/xe_ring_ops.c        | 25 +++++--
> drivers/gpu/drm/xe/xe_sched_job.c       |  1 +
> drivers/gpu/drm/xe/xe_sched_job_types.h |  2 +
> 6 files changed, 76 insertions(+), 94 deletions(-)
>
>diff --git a/drivers/gpu/drm/xe/xe_guc_submit.c b/drivers/gpu/drm/xe/xe_guc_submit.c
>index 8190f2afbaed..dc4bf3126450 100644
>--- a/drivers/gpu/drm/xe/xe_guc_submit.c
>+++ b/drivers/gpu/drm/xe/xe_guc_submit.c
>@@ -68,9 +68,7 @@ exec_queue_to_guc(struct xe_exec_queue *q)
> #define EXEC_QUEUE_STATE_KILLED			(1 << 7)
> #define EXEC_QUEUE_STATE_WEDGED			(1 << 8)
> #define EXEC_QUEUE_STATE_BANNED			(1 << 9)
>-#define EXEC_QUEUE_STATE_CHECK_TIMEOUT		(1 << 10)
>-#define EXEC_QUEUE_STATE_PENDING_RESUME		(1 << 11)
>-#define EXEC_QUEUE_STATE_PENDING_TDR_EXIT	(1 << 12)
>+#define EXEC_QUEUE_STATE_PENDING_RESUME		(1 << 10)
>
> static bool exec_queue_registered(struct xe_exec_queue *q)
> {
>@@ -202,21 +200,6 @@ static void set_exec_queue_wedged(struct xe_exec_queue *q)
> 	atomic_or(EXEC_QUEUE_STATE_WEDGED, &q->guc->state);
> }
>
>-static bool exec_queue_check_timeout(struct xe_exec_queue *q)
>-{
>-	return atomic_read(&q->guc->state) & EXEC_QUEUE_STATE_CHECK_TIMEOUT;
>-}
>-
>-static void set_exec_queue_check_timeout(struct xe_exec_queue *q)
>-{
>-	atomic_or(EXEC_QUEUE_STATE_CHECK_TIMEOUT, &q->guc->state);
>-}
>-
>-static void clear_exec_queue_check_timeout(struct xe_exec_queue *q)
>-{
>-	atomic_and(~EXEC_QUEUE_STATE_CHECK_TIMEOUT, &q->guc->state);
>-}
>-
> static bool exec_queue_pending_resume(struct xe_exec_queue *q)
> {
> 	return atomic_read(&q->guc->state) & EXEC_QUEUE_STATE_PENDING_RESUME;
>@@ -232,21 +215,6 @@ static void clear_exec_queue_pending_resume(struct xe_exec_queue *q)
> 	atomic_and(~EXEC_QUEUE_STATE_PENDING_RESUME, &q->guc->state);
> }
>
>-static bool exec_queue_pending_tdr_exit(struct xe_exec_queue *q)
>-{
>-	return atomic_read(&q->guc->state) & EXEC_QUEUE_STATE_PENDING_TDR_EXIT;
>-}
>-
>-static void set_exec_queue_pending_tdr_exit(struct xe_exec_queue *q)
>-{
>-	atomic_or(EXEC_QUEUE_STATE_PENDING_TDR_EXIT, &q->guc->state);
>-}
>-
>-static void clear_exec_queue_pending_tdr_exit(struct xe_exec_queue *q)
>-{
>-	atomic_and(~EXEC_QUEUE_STATE_PENDING_TDR_EXIT, &q->guc->state);
>-}
>-
> static bool exec_queue_killed_or_banned_or_wedged(struct xe_exec_queue *q)
> {
> 	return (atomic_read(&q->guc->state) &
>@@ -1006,7 +974,16 @@ static bool check_timeout(struct xe_exec_queue *q, struct xe_sched_job *job)
> 		return xe_sched_invalidate_job(job, 2);
> 	}
>
>-	ctx_timestamp = lower_32_bits(xe_lrc_ctx_timestamp(q->lrc[0]));
>+	ctx_timestamp = lower_32_bits(xe_lrc_timestamp(q->lrc[0]));
>+	if (ctx_timestamp == job->sample_timestamp) {
>+		xe_gt_warn(gt, "Check job timeout: seqno=%u, lrc_seqno=%u, guc_id=%d, timestamp stuck",
>+			   xe_sched_job_seqno(job), xe_sched_job_lrc_seqno(job),
>+			   q->guc->id);
>+
>+		return xe_sched_invalidate_job(job, 0);
>+	}
>+
>+	job->sample_timestamp = ctx_timestamp;
> 	ctx_job_timestamp = xe_lrc_ctx_job_timestamp(q->lrc[0]);
>
> 	/*
>@@ -1132,16 +1109,17 @@ guc_exec_queue_timedout_job(struct drm_sched_job *drm_job)
> 	}
>
> 	/*
>-	 * XXX: Sampling timeout doesn't work in wedged mode as we have to
>-	 * modify scheduling state to read timestamp. We could read the
>-	 * timestamp from a register to accumulate current running time but this
>-	 * doesn't work for SRIOV. For now assuming timeouts in wedged mode are
>-	 * genuine timeouts.
>+	 * Check if job is actually timed out, if so restart job execution and TDR
> 	 */
>+	if (!skip_timeout_check && !check_timeout(q, job))
>+		goto rearm;
>+
> 	if (!exec_queue_killed(q))
> 		wedged = guc_submit_hint_wedged(exec_queue_to_guc(q));
>
>-	/* Engine state now stable, disable scheduling to check timestamp */
>+	set_exec_queue_banned(q);
>+
>+	/* Kick job / queue off hardware */
> 	if (!wedged && (exec_queue_enabled(q) || exec_queue_pending_disable(q))) {
> 		int ret;
>
>@@ -1163,13 +1141,6 @@ guc_exec_queue_timedout_job(struct drm_sched_job *drm_job)
> 			if (!ret || xe_guc_read_stopped(guc))
> 				goto trigger_reset;
>
>-			/*
>-			 * Flag communicates to G2H handler that schedule
>-			 * disable originated from a timeout check. The G2H then
>-			 * avoid triggering cleanup or deregistering the exec
>-			 * queue.
>-			 */
>-			set_exec_queue_check_timeout(q);
> 			disable_scheduling(q, skip_timeout_check);
> 		}
>
>@@ -1198,22 +1169,12 @@ guc_exec_queue_timedout_job(struct drm_sched_job *drm_job)
> 			xe_devcoredump(q, job,
> 				       "Schedule disable failed to respond, guc_id=%d, ret=%d, guc_read=%d",
> 				       q->guc->id, ret, xe_guc_read_stopped(guc));
>-			set_exec_queue_banned(q);
> 			xe_gt_reset_async(q->gt);
> 			xe_sched_tdr_queue_imm(sched);
> 			goto rearm;
> 		}
> 	}
>
>-	/*
>-	 * Check if job is actually timed out, if so restart job execution and TDR
>-	 */
>-	if (!wedged && !skip_timeout_check && !check_timeout(q, job) &&
>-	    !exec_queue_reset(q) && exec_queue_registered(q)) {
>-		clear_exec_queue_check_timeout(q);
>-		goto sched_enable;
>-	}
>-
> 	if (q->vm && q->vm->xef) {
> 		process_name = q->vm->xef->process_name;
> 		pid = q->vm->xef->pid;
>@@ -1244,14 +1205,11 @@ guc_exec_queue_timedout_job(struct drm_sched_job *drm_job)
> 	if (!wedged && (q->flags & EXEC_QUEUE_FLAG_KERNEL ||
> 			(q->flags & EXEC_QUEUE_FLAG_VM && !exec_queue_killed(q)))) {
> 		if (!xe_sched_invalidate_job(job, 2)) {
>-			clear_exec_queue_check_timeout(q);
> 			xe_gt_reset_async(q->gt);
> 			goto rearm;
> 		}
> 	}
>
>-	set_exec_queue_banned(q);
>-
> 	/* Mark all outstanding jobs as bad, thus completing them */
> 	xe_sched_job_set_error(job, err);
> 	drm_sched_for_each_pending_job(tmp_job, &sched->base, NULL)
>@@ -1266,9 +1224,6 @@ guc_exec_queue_timedout_job(struct drm_sched_job *drm_job)
> 	 */
> 	return DRM_GPU_SCHED_STAT_NO_HANG;
>
>-sched_enable:
>-	set_exec_queue_pending_tdr_exit(q);
>-	enable_scheduling(q);
> rearm:
> 	/*
> 	 * XXX: Ideally want to adjust timeout based on current execution time
>@@ -1898,8 +1853,7 @@ static void guc_exec_queue_revert_pending_state_change(struct xe_guc *guc,
> 			  q->guc->id);
> 	}
>
>-	if (pending_enable && !pending_resume &&
>-	    !exec_queue_pending_tdr_exit(q)) {
>+	if (pending_enable && !pending_resume) {
> 		clear_exec_queue_registered(q);
> 		xe_gt_dbg(guc_to_gt(guc), "Replay REGISTER - guc_id=%d",
> 			  q->guc->id);
>@@ -1908,7 +1862,6 @@ static void guc_exec_queue_revert_pending_state_change(struct xe_guc *guc,
> 	if (pending_enable) {
> 		clear_exec_queue_enabled(q);
> 		clear_exec_queue_pending_resume(q);
>-		clear_exec_queue_pending_tdr_exit(q);
> 		clear_exec_queue_pending_enable(q);
> 		xe_gt_dbg(guc_to_gt(guc), "Replay ENABLE - guc_id=%d",
> 			  q->guc->id);
>@@ -1934,7 +1887,6 @@ static void guc_exec_queue_revert_pending_state_change(struct xe_guc *guc,
> 		if (!pending_enable)
> 			set_exec_queue_enabled(q);
> 		clear_exec_queue_pending_disable(q);
>-		clear_exec_queue_check_timeout(q);
> 		xe_gt_dbg(guc_to_gt(guc), "Replay DISABLE - guc_id=%d",
> 			  q->guc->id);
> 	}
>@@ -2308,13 +2260,10 @@ static void handle_sched_done(struct xe_guc *guc, struct xe_exec_queue *q,
>
> 		q->guc->resume_time = ktime_get();
> 		clear_exec_queue_pending_resume(q);
>-		clear_exec_queue_pending_tdr_exit(q);
> 		clear_exec_queue_pending_enable(q);
> 		smp_wmb();
> 		wake_up_all(&guc->ct.wq);
> 	} else {
>-		bool check_timeout = exec_queue_check_timeout(q);
>-
> 		xe_gt_assert(guc_to_gt(guc), runnable_state == 0);
> 		xe_gt_assert(guc_to_gt(guc), exec_queue_pending_disable(q));
>
>@@ -2322,11 +2271,11 @@ static void handle_sched_done(struct xe_guc *guc, struct xe_exec_queue *q,
> 			suspend_fence_signal(q);
> 			clear_exec_queue_pending_disable(q);
> 		} else {
>-			if (exec_queue_banned(q) || check_timeout) {
>+			if (exec_queue_banned(q)) {
> 				smp_wmb();
> 				wake_up_all(&guc->ct.wq);
> 			}
>-			if (!check_timeout && exec_queue_destroyed(q)) {
>+			if (exec_queue_destroyed(q)) {
> 				/*
> 				 * Make sure to clear the pending_disable only
> 				 * after sampling the destroyed state. We want
>@@ -2436,7 +2385,7 @@ int xe_guc_exec_queue_reset_handler(struct xe_guc *guc, u32 *msg, u32 len)
> 	 * guc_exec_queue_timedout_job.
> 	 */
> 	set_exec_queue_reset(q);
>-	if (!exec_queue_banned(q) && !exec_queue_check_timeout(q))
>+	if (!exec_queue_banned(q))
> 		xe_guc_exec_queue_trigger_cleanup(q);
>
> 	return 0;
>@@ -2517,7 +2466,7 @@ int xe_guc_exec_queue_memory_cat_error_handler(struct xe_guc *guc, u32 *msg,
>
> 	/* Treat the same as engine reset */
> 	set_exec_queue_reset(q);
>-	if (!exec_queue_banned(q) && !exec_queue_check_timeout(q))
>+	if (!exec_queue_banned(q))
> 		xe_guc_exec_queue_trigger_cleanup(q);
>
> 	return 0;
>diff --git a/drivers/gpu/drm/xe/xe_lrc.c b/drivers/gpu/drm/xe/xe_lrc.c
>index 166353455f8f..38b0c536f6fb 100644
>--- a/drivers/gpu/drm/xe/xe_lrc.c
>+++ b/drivers/gpu/drm/xe/xe_lrc.c
>@@ -852,7 +852,7 @@ u32 xe_lrc_ctx_timestamp_udw_ggtt_addr(struct xe_lrc *lrc)
>  *
>  * Returns: ctx timestamp value
>  */
>-u64 xe_lrc_ctx_timestamp(struct xe_lrc *lrc)
>+static u64 xe_lrc_ctx_timestamp(struct xe_lrc *lrc)
> {
> 	struct xe_device *xe = lrc_to_xe(lrc);
> 	struct iosys_map map;
>@@ -2380,35 +2380,31 @@ static int get_ctx_timestamp(struct xe_lrc *lrc, u32 engine_id, u64 *reg_ctx_ts)
> }
>
> /**
>- * xe_lrc_update_timestamp() - Update ctx timestamp
>+ * xe_lrc_timestamp() - Current ctx timestamp
>  * @lrc: Pointer to the lrc.
>- * @old_ts: Old timestamp value
>  *
>- * Populate @old_ts current saved ctx timestamp, read new ctx timestamp and
>- * update saved value. With support for active contexts, the calculation may be
>- * slightly racy, so follow a read-again logic to ensure that the context is
>- * still active before returning the right timestamp.
>+ * Return latest ctx timestamp. With support for active contexts, the
>+ * calculation may be slightly racy, so follow a read-again logic to ensure that
>+ * the context is still active before returning the right timestamp.
>  *
>  * Returns: New ctx timestamp value
>  */
>-u64 xe_lrc_update_timestamp(struct xe_lrc *lrc, u64 *old_ts)
>+u64 xe_lrc_timestamp(struct xe_lrc *lrc)
> {
>-	u64 lrc_ts, reg_ts;
>+	u64 lrc_ts, reg_ts, new_ts;
> 	u32 engine_id;
>
>-	*old_ts = lrc->ctx_timestamp;
>-
> 	lrc_ts = xe_lrc_ctx_timestamp(lrc);
> 	/* CTX_TIMESTAMP mmio read is invalid on VF, so return the LRC value */
> 	if (IS_SRIOV_VF(lrc_to_xe(lrc))) {
>-		lrc->ctx_timestamp = lrc_ts;
>+		new_ts = lrc_ts;
> 		goto done;
> 	}
>
> 	if (lrc_ts == CONTEXT_ACTIVE) {
> 		engine_id = xe_lrc_engine_id(lrc);
> 		if (!get_ctx_timestamp(lrc, engine_id, &reg_ts))
>-			lrc->ctx_timestamp = reg_ts;
>+			new_ts = reg_ts;
>
> 		/* read lrc again to ensure context is still active */
> 		lrc_ts = xe_lrc_ctx_timestamp(lrc);
>@@ -2419,9 +2415,27 @@ u64 xe_lrc_update_timestamp(struct xe_lrc *lrc, u64 *old_ts)
> 	 * be a separate if condition.
> 	 */
> 	if (lrc_ts != CONTEXT_ACTIVE)
>-		lrc->ctx_timestamp = lrc_ts;
>+		new_ts = lrc_ts;
>
> done:
>+	return new_ts;
>+}
>+
>+/**
>+ * xe_lrc_update_timestamp() - Update ctx timestamp
>+ * @lrc: Pointer to the lrc.
>+ * @old_ts: Old timestamp value
>+ *
>+ * Populate @old_ts current saved ctx timestamp, read new ctx timestamp and
>+ * update saved value.
>+ *
>+ * Returns: New ctx timestamp value
>+ */
>+u64 xe_lrc_update_timestamp(struct xe_lrc *lrc, u64 *old_ts)
>+{
>+	*old_ts = lrc->ctx_timestamp;
>+	lrc->ctx_timestamp = xe_lrc_timestamp(lrc);
>+
> 	trace_xe_lrc_update_timestamp(lrc, *old_ts);
>
> 	return lrc->ctx_timestamp;
>diff --git a/drivers/gpu/drm/xe/xe_lrc.h b/drivers/gpu/drm/xe/xe_lrc.h
>index a32472b92242..93c1234e2706 100644
>--- a/drivers/gpu/drm/xe/xe_lrc.h
>+++ b/drivers/gpu/drm/xe/xe_lrc.h
>@@ -142,7 +142,6 @@ void xe_lrc_snapshot_free(struct xe_lrc_snapshot *snapshot);
>
> u32 xe_lrc_ctx_timestamp_ggtt_addr(struct xe_lrc *lrc);
> u32 xe_lrc_ctx_timestamp_udw_ggtt_addr(struct xe_lrc *lrc);
>-u64 xe_lrc_ctx_timestamp(struct xe_lrc *lrc);
> u32 xe_lrc_ctx_job_timestamp_ggtt_addr(struct xe_lrc *lrc);
> u32 xe_lrc_ctx_job_timestamp(struct xe_lrc *lrc);
> int xe_lrc_setup_wa_bb_with_scratch(struct xe_lrc *lrc, struct xe_hw_engine *hwe,
>@@ -162,4 +161,6 @@ int xe_lrc_setup_wa_bb_with_scratch(struct xe_lrc *lrc, struct xe_hw_engine *hwe
>  */
> u64 xe_lrc_update_timestamp(struct xe_lrc *lrc, u64 *old_ts);
>
>+u64 xe_lrc_timestamp(struct xe_lrc *lrc);
>+
> #endif
>diff --git a/drivers/gpu/drm/xe/xe_ring_ops.c b/drivers/gpu/drm/xe/xe_ring_ops.c
>index ac0c6dcffe15..3dacfc2da75c 100644
>--- a/drivers/gpu/drm/xe/xe_ring_ops.c
>+++ b/drivers/gpu/drm/xe/xe_ring_ops.c
>@@ -233,13 +233,26 @@ static u32 get_ppgtt_flag(struct xe_sched_job *job)
> 	return 0;
> }
>
>-static int emit_copy_timestamp(struct xe_lrc *lrc, u32 *dw, int i)
>+static int emit_copy_timestamp(struct xe_device *xe, struct xe_lrc *lrc,
>+			       u32 *dw, int i)
> {
> 	dw[i++] = MI_STORE_REGISTER_MEM | MI_SRM_USE_GGTT | MI_SRM_ADD_CS_OFFSET;
> 	dw[i++] = RING_CTX_TIMESTAMP(0).addr;
> 	dw[i++] = xe_lrc_ctx_job_timestamp_ggtt_addr(lrc);
> 	dw[i++] = 0;
>
>+	/*
>+	 * Ensure CTX timestamp >= Job timestamp during VF sampling to avoid
>+	 * arithmetic wraparound in TDR.
>+	 */
>+	if (IS_SRIOV_VF(xe)) {
>+		dw[i++] = MI_STORE_REGISTER_MEM | MI_SRM_USE_GGTT |
>+			MI_SRM_ADD_CS_OFFSET;
>+		dw[i++] = RING_CTX_TIMESTAMP(0).addr;
>+		dw[i++] = xe_lrc_ctx_timestamp_ggtt_addr(lrc);
>+		dw[i++] = 0;
>+	}

Is this change for a different issue OR is it the same issue that is 
fixed in patch 8?

otherwise, LGTM,

Reviewed-by: Umesh Nerlige Ramappa <umesh.nerlige.ramappa@intel.com>

Thanks,
Umesh

>+
> 	return i;
> }
>
>@@ -253,7 +266,7 @@ static void __emit_job_gen12_simple(struct xe_sched_job *job, struct xe_lrc *lrc
>
> 	*head = lrc->ring.tail;
>
>-	i = emit_copy_timestamp(lrc, dw, i);
>+	i = emit_copy_timestamp(gt_to_xe(gt), lrc, dw, i);
>
> 	if (job->ring_ops_flush_tlb) {
> 		dw[i++] = preparser_disable(true);
>@@ -308,7 +321,7 @@ static void __emit_job_gen12_video(struct xe_sched_job *job, struct xe_lrc *lrc,
>
> 	*head = lrc->ring.tail;
>
>-	i = emit_copy_timestamp(lrc, dw, i);
>+	i = emit_copy_timestamp(xe, lrc, dw, i);
>
> 	dw[i++] = preparser_disable(true);
>
>@@ -362,7 +375,7 @@ static void __emit_job_gen12_render_compute(struct xe_sched_job *job,
>
> 	*head = lrc->ring.tail;
>
>-	i = emit_copy_timestamp(lrc, dw, i);
>+	i = emit_copy_timestamp(xe, lrc, dw, i);
>
> 	dw[i++] = preparser_disable(true);
> 	if (lacks_render)
>@@ -406,12 +419,14 @@ static void emit_migration_job_gen12(struct xe_sched_job *job,
> 				     struct xe_lrc *lrc, u32 *head,
> 				     u32 seqno)
> {
>+	struct xe_gt *gt = job->q->gt;
>+	struct xe_device *xe = gt_to_xe(gt);
> 	u32 saddr = xe_lrc_start_seqno_ggtt_addr(lrc);
> 	u32 dw[MAX_JOB_SIZE_DW], i = 0;
>
> 	*head = lrc->ring.tail;
>
>-	i = emit_copy_timestamp(lrc, dw, i);
>+	i = emit_copy_timestamp(xe, lrc, dw, i);
>
> 	i = emit_store_imm_ggtt(saddr, seqno, dw, i);
>
>diff --git a/drivers/gpu/drm/xe/xe_sched_job.c b/drivers/gpu/drm/xe/xe_sched_job.c
>index cb674a322113..39aec7f6d86d 100644
>--- a/drivers/gpu/drm/xe/xe_sched_job.c
>+++ b/drivers/gpu/drm/xe/xe_sched_job.c
>@@ -110,6 +110,7 @@ struct xe_sched_job *xe_sched_job_create(struct xe_exec_queue *q,
> 		return ERR_PTR(-ENOMEM);
>
> 	job->q = q;
>+	job->sample_timestamp = U64_MAX;
> 	kref_init(&job->refcount);
> 	xe_exec_queue_get(job->q);
>
>diff --git a/drivers/gpu/drm/xe/xe_sched_job_types.h b/drivers/gpu/drm/xe/xe_sched_job_types.h
>index 7c4c54fe920a..13c2970e81a8 100644
>--- a/drivers/gpu/drm/xe/xe_sched_job_types.h
>+++ b/drivers/gpu/drm/xe/xe_sched_job_types.h
>@@ -59,6 +59,8 @@ struct xe_sched_job {
> 	u32 lrc_seqno;
> 	/** @migrate_flush_flags: Additional flush flags for migration jobs */
> 	u32 migrate_flush_flags;
>+	/** @sample_timestamp: Sampling of job timestamp in TDR */
>+	u64 sample_timestamp;
> 	/** @ring_ops_flush_tlb: The ring ops need to flush TLB before payload. */
> 	bool ring_ops_flush_tlb;
> 	/** @ggtt: mapped in ggtt. */
>-- 
>2.34.1
>

^ permalink raw reply	[flat|nested] 27+ messages in thread

* Re: [PATCH v7 9/9] drm/xe: Avoid toggling schedule state to check LRC timestamp in TDR
  2025-12-02  7:31   ` Umesh Nerlige Ramappa
@ 2025-12-02 15:14     ` Matthew Brost
  0 siblings, 0 replies; 27+ messages in thread
From: Matthew Brost @ 2025-12-02 15:14 UTC (permalink / raw)
  To: Umesh Nerlige Ramappa; +Cc: intel-xe, dri-devel

On Mon, Dec 01, 2025 at 11:31:57PM -0800, Umesh Nerlige Ramappa wrote:
> On Mon, Dec 01, 2025 at 10:39:54AM -0800, Matthew Brost wrote:
> > We now have proper infrastructure to accurately check the LRC timestamp
> > without toggling the scheduling state for non-VFs. For VFs, it is still
> > possible to get an inaccurate view if the context is on hardware. We
> > guard against free-running contexts on VFs by banning jobs whose
> > timestamps are not moving. In addition, VFs have a timeslice quantum
> > that naturally triggers context switches when more than one VF is
> > running, thus updating the LRC timestamp.
> 
> I guess some workloads are configured to just keep running on VFs without
> switching. I am assuming they are classified as Long Running and are not
> affected by TDR. If so, the timeouts should still work reasonably on a VF
> considering switching is usually in milliseconds and timeouts are much
> larger.
> 
> > 
> > For multi-queue, it is desirable to avoid scheduling toggling in the TDR
> > because this scheduling state is shared among many queues. Furthermore,
> > this change simplifies the GuC state machine. The trade-off for VF cases
> > seems worthwhile.
> > 
> > v5:
> > - Add xe_lrc_timestamp helper (Umesh)
> > v6:
> > - Reduce number of tries on stuck timestamp (VF testing)
> > - Convert job timestamp save to a memory copy (VF testing)
> > v7:
> > - Save ctx timestamp to LRC when start VF job (VF testing)
> > 
> > Signed-off-by: Matthew Brost <matthew.brost@intel.com>
> > ---
> > drivers/gpu/drm/xe/xe_guc_submit.c      | 97 ++++++-------------------
> > drivers/gpu/drm/xe/xe_lrc.c             | 42 +++++++----
> > drivers/gpu/drm/xe/xe_lrc.h             |  3 +-
> > drivers/gpu/drm/xe/xe_ring_ops.c        | 25 +++++--
> > drivers/gpu/drm/xe/xe_sched_job.c       |  1 +
> > drivers/gpu/drm/xe/xe_sched_job_types.h |  2 +
> > 6 files changed, 76 insertions(+), 94 deletions(-)
> > 
> > diff --git a/drivers/gpu/drm/xe/xe_guc_submit.c b/drivers/gpu/drm/xe/xe_guc_submit.c
> > index 8190f2afbaed..dc4bf3126450 100644
> > --- a/drivers/gpu/drm/xe/xe_guc_submit.c
> > +++ b/drivers/gpu/drm/xe/xe_guc_submit.c
> > @@ -68,9 +68,7 @@ exec_queue_to_guc(struct xe_exec_queue *q)
> > #define EXEC_QUEUE_STATE_KILLED			(1 << 7)
> > #define EXEC_QUEUE_STATE_WEDGED			(1 << 8)
> > #define EXEC_QUEUE_STATE_BANNED			(1 << 9)
> > -#define EXEC_QUEUE_STATE_CHECK_TIMEOUT		(1 << 10)
> > -#define EXEC_QUEUE_STATE_PENDING_RESUME		(1 << 11)
> > -#define EXEC_QUEUE_STATE_PENDING_TDR_EXIT	(1 << 12)
> > +#define EXEC_QUEUE_STATE_PENDING_RESUME		(1 << 10)
> > 
> > static bool exec_queue_registered(struct xe_exec_queue *q)
> > {
> > @@ -202,21 +200,6 @@ static void set_exec_queue_wedged(struct xe_exec_queue *q)
> > 	atomic_or(EXEC_QUEUE_STATE_WEDGED, &q->guc->state);
> > }
> > 
> > -static bool exec_queue_check_timeout(struct xe_exec_queue *q)
> > -{
> > -	return atomic_read(&q->guc->state) & EXEC_QUEUE_STATE_CHECK_TIMEOUT;
> > -}
> > -
> > -static void set_exec_queue_check_timeout(struct xe_exec_queue *q)
> > -{
> > -	atomic_or(EXEC_QUEUE_STATE_CHECK_TIMEOUT, &q->guc->state);
> > -}
> > -
> > -static void clear_exec_queue_check_timeout(struct xe_exec_queue *q)
> > -{
> > -	atomic_and(~EXEC_QUEUE_STATE_CHECK_TIMEOUT, &q->guc->state);
> > -}
> > -
> > static bool exec_queue_pending_resume(struct xe_exec_queue *q)
> > {
> > 	return atomic_read(&q->guc->state) & EXEC_QUEUE_STATE_PENDING_RESUME;
> > @@ -232,21 +215,6 @@ static void clear_exec_queue_pending_resume(struct xe_exec_queue *q)
> > 	atomic_and(~EXEC_QUEUE_STATE_PENDING_RESUME, &q->guc->state);
> > }
> > 
> > -static bool exec_queue_pending_tdr_exit(struct xe_exec_queue *q)
> > -{
> > -	return atomic_read(&q->guc->state) & EXEC_QUEUE_STATE_PENDING_TDR_EXIT;
> > -}
> > -
> > -static void set_exec_queue_pending_tdr_exit(struct xe_exec_queue *q)
> > -{
> > -	atomic_or(EXEC_QUEUE_STATE_PENDING_TDR_EXIT, &q->guc->state);
> > -}
> > -
> > -static void clear_exec_queue_pending_tdr_exit(struct xe_exec_queue *q)
> > -{
> > -	atomic_and(~EXEC_QUEUE_STATE_PENDING_TDR_EXIT, &q->guc->state);
> > -}
> > -
> > static bool exec_queue_killed_or_banned_or_wedged(struct xe_exec_queue *q)
> > {
> > 	return (atomic_read(&q->guc->state) &
> > @@ -1006,7 +974,16 @@ static bool check_timeout(struct xe_exec_queue *q, struct xe_sched_job *job)
> > 		return xe_sched_invalidate_job(job, 2);
> > 	}
> > 
> > -	ctx_timestamp = lower_32_bits(xe_lrc_ctx_timestamp(q->lrc[0]));
> > +	ctx_timestamp = lower_32_bits(xe_lrc_timestamp(q->lrc[0]));
> > +	if (ctx_timestamp == job->sample_timestamp) {
> > +		xe_gt_warn(gt, "Check job timeout: seqno=%u, lrc_seqno=%u, guc_id=%d, timestamp stuck",
> > +			   xe_sched_job_seqno(job), xe_sched_job_lrc_seqno(job),
> > +			   q->guc->id);
> > +
> > +		return xe_sched_invalidate_job(job, 0);
> > +	}
> > +
> > +	job->sample_timestamp = ctx_timestamp;
> > 	ctx_job_timestamp = xe_lrc_ctx_job_timestamp(q->lrc[0]);
> > 
> > 	/*
> > @@ -1132,16 +1109,17 @@ guc_exec_queue_timedout_job(struct drm_sched_job *drm_job)
> > 	}
> > 
> > 	/*
> > -	 * XXX: Sampling timeout doesn't work in wedged mode as we have to
> > -	 * modify scheduling state to read timestamp. We could read the
> > -	 * timestamp from a register to accumulate current running time but this
> > -	 * doesn't work for SRIOV. For now assuming timeouts in wedged mode are
> > -	 * genuine timeouts.
> > +	 * Check if job is actually timed out, if so restart job execution and TDR
> > 	 */
> > +	if (!skip_timeout_check && !check_timeout(q, job))
> > +		goto rearm;
> > +
> > 	if (!exec_queue_killed(q))
> > 		wedged = guc_submit_hint_wedged(exec_queue_to_guc(q));
> > 
> > -	/* Engine state now stable, disable scheduling to check timestamp */
> > +	set_exec_queue_banned(q);
> > +
> > +	/* Kick job / queue off hardware */
> > 	if (!wedged && (exec_queue_enabled(q) || exec_queue_pending_disable(q))) {
> > 		int ret;
> > 
> > @@ -1163,13 +1141,6 @@ guc_exec_queue_timedout_job(struct drm_sched_job *drm_job)
> > 			if (!ret || xe_guc_read_stopped(guc))
> > 				goto trigger_reset;
> > 
> > -			/*
> > -			 * Flag communicates to G2H handler that schedule
> > -			 * disable originated from a timeout check. The G2H then
> > -			 * avoid triggering cleanup or deregistering the exec
> > -			 * queue.
> > -			 */
> > -			set_exec_queue_check_timeout(q);
> > 			disable_scheduling(q, skip_timeout_check);
> > 		}
> > 
> > @@ -1198,22 +1169,12 @@ guc_exec_queue_timedout_job(struct drm_sched_job *drm_job)
> > 			xe_devcoredump(q, job,
> > 				       "Schedule disable failed to respond, guc_id=%d, ret=%d, guc_read=%d",
> > 				       q->guc->id, ret, xe_guc_read_stopped(guc));
> > -			set_exec_queue_banned(q);
> > 			xe_gt_reset_async(q->gt);
> > 			xe_sched_tdr_queue_imm(sched);
> > 			goto rearm;
> > 		}
> > 	}
> > 
> > -	/*
> > -	 * Check if job is actually timed out, if so restart job execution and TDR
> > -	 */
> > -	if (!wedged && !skip_timeout_check && !check_timeout(q, job) &&
> > -	    !exec_queue_reset(q) && exec_queue_registered(q)) {
> > -		clear_exec_queue_check_timeout(q);
> > -		goto sched_enable;
> > -	}
> > -
> > 	if (q->vm && q->vm->xef) {
> > 		process_name = q->vm->xef->process_name;
> > 		pid = q->vm->xef->pid;
> > @@ -1244,14 +1205,11 @@ guc_exec_queue_timedout_job(struct drm_sched_job *drm_job)
> > 	if (!wedged && (q->flags & EXEC_QUEUE_FLAG_KERNEL ||
> > 			(q->flags & EXEC_QUEUE_FLAG_VM && !exec_queue_killed(q)))) {
> > 		if (!xe_sched_invalidate_job(job, 2)) {
> > -			clear_exec_queue_check_timeout(q);
> > 			xe_gt_reset_async(q->gt);
> > 			goto rearm;
> > 		}
> > 	}
> > 
> > -	set_exec_queue_banned(q);
> > -
> > 	/* Mark all outstanding jobs as bad, thus completing them */
> > 	xe_sched_job_set_error(job, err);
> > 	drm_sched_for_each_pending_job(tmp_job, &sched->base, NULL)
> > @@ -1266,9 +1224,6 @@ guc_exec_queue_timedout_job(struct drm_sched_job *drm_job)
> > 	 */
> > 	return DRM_GPU_SCHED_STAT_NO_HANG;
> > 
> > -sched_enable:
> > -	set_exec_queue_pending_tdr_exit(q);
> > -	enable_scheduling(q);
> > rearm:
> > 	/*
> > 	 * XXX: Ideally want to adjust timeout based on current execution time
> > @@ -1898,8 +1853,7 @@ static void guc_exec_queue_revert_pending_state_change(struct xe_guc *guc,
> > 			  q->guc->id);
> > 	}
> > 
> > -	if (pending_enable && !pending_resume &&
> > -	    !exec_queue_pending_tdr_exit(q)) {
> > +	if (pending_enable && !pending_resume) {
> > 		clear_exec_queue_registered(q);
> > 		xe_gt_dbg(guc_to_gt(guc), "Replay REGISTER - guc_id=%d",
> > 			  q->guc->id);
> > @@ -1908,7 +1862,6 @@ static void guc_exec_queue_revert_pending_state_change(struct xe_guc *guc,
> > 	if (pending_enable) {
> > 		clear_exec_queue_enabled(q);
> > 		clear_exec_queue_pending_resume(q);
> > -		clear_exec_queue_pending_tdr_exit(q);
> > 		clear_exec_queue_pending_enable(q);
> > 		xe_gt_dbg(guc_to_gt(guc), "Replay ENABLE - guc_id=%d",
> > 			  q->guc->id);
> > @@ -1934,7 +1887,6 @@ static void guc_exec_queue_revert_pending_state_change(struct xe_guc *guc,
> > 		if (!pending_enable)
> > 			set_exec_queue_enabled(q);
> > 		clear_exec_queue_pending_disable(q);
> > -		clear_exec_queue_check_timeout(q);
> > 		xe_gt_dbg(guc_to_gt(guc), "Replay DISABLE - guc_id=%d",
> > 			  q->guc->id);
> > 	}
> > @@ -2308,13 +2260,10 @@ static void handle_sched_done(struct xe_guc *guc, struct xe_exec_queue *q,
> > 
> > 		q->guc->resume_time = ktime_get();
> > 		clear_exec_queue_pending_resume(q);
> > -		clear_exec_queue_pending_tdr_exit(q);
> > 		clear_exec_queue_pending_enable(q);
> > 		smp_wmb();
> > 		wake_up_all(&guc->ct.wq);
> > 	} else {
> > -		bool check_timeout = exec_queue_check_timeout(q);
> > -
> > 		xe_gt_assert(guc_to_gt(guc), runnable_state == 0);
> > 		xe_gt_assert(guc_to_gt(guc), exec_queue_pending_disable(q));
> > 
> > @@ -2322,11 +2271,11 @@ static void handle_sched_done(struct xe_guc *guc, struct xe_exec_queue *q,
> > 			suspend_fence_signal(q);
> > 			clear_exec_queue_pending_disable(q);
> > 		} else {
> > -			if (exec_queue_banned(q) || check_timeout) {
> > +			if (exec_queue_banned(q)) {
> > 				smp_wmb();
> > 				wake_up_all(&guc->ct.wq);
> > 			}
> > -			if (!check_timeout && exec_queue_destroyed(q)) {
> > +			if (exec_queue_destroyed(q)) {
> > 				/*
> > 				 * Make sure to clear the pending_disable only
> > 				 * after sampling the destroyed state. We want
> > @@ -2436,7 +2385,7 @@ int xe_guc_exec_queue_reset_handler(struct xe_guc *guc, u32 *msg, u32 len)
> > 	 * guc_exec_queue_timedout_job.
> > 	 */
> > 	set_exec_queue_reset(q);
> > -	if (!exec_queue_banned(q) && !exec_queue_check_timeout(q))
> > +	if (!exec_queue_banned(q))
> > 		xe_guc_exec_queue_trigger_cleanup(q);
> > 
> > 	return 0;
> > @@ -2517,7 +2466,7 @@ int xe_guc_exec_queue_memory_cat_error_handler(struct xe_guc *guc, u32 *msg,
> > 
> > 	/* Treat the same as engine reset */
> > 	set_exec_queue_reset(q);
> > -	if (!exec_queue_banned(q) && !exec_queue_check_timeout(q))
> > +	if (!exec_queue_banned(q))
> > 		xe_guc_exec_queue_trigger_cleanup(q);
> > 
> > 	return 0;
> > diff --git a/drivers/gpu/drm/xe/xe_lrc.c b/drivers/gpu/drm/xe/xe_lrc.c
> > index 166353455f8f..38b0c536f6fb 100644
> > --- a/drivers/gpu/drm/xe/xe_lrc.c
> > +++ b/drivers/gpu/drm/xe/xe_lrc.c
> > @@ -852,7 +852,7 @@ u32 xe_lrc_ctx_timestamp_udw_ggtt_addr(struct xe_lrc *lrc)
> >  *
> >  * Returns: ctx timestamp value
> >  */
> > -u64 xe_lrc_ctx_timestamp(struct xe_lrc *lrc)
> > +static u64 xe_lrc_ctx_timestamp(struct xe_lrc *lrc)
> > {
> > 	struct xe_device *xe = lrc_to_xe(lrc);
> > 	struct iosys_map map;
> > @@ -2380,35 +2380,31 @@ static int get_ctx_timestamp(struct xe_lrc *lrc, u32 engine_id, u64 *reg_ctx_ts)
> > }
> > 
> > /**
> > - * xe_lrc_update_timestamp() - Update ctx timestamp
> > + * xe_lrc_timestamp() - Current ctx timestamp
> >  * @lrc: Pointer to the lrc.
> > - * @old_ts: Old timestamp value
> >  *
> > - * Populate @old_ts current saved ctx timestamp, read new ctx timestamp and
> > - * update saved value. With support for active contexts, the calculation may be
> > - * slightly racy, so follow a read-again logic to ensure that the context is
> > - * still active before returning the right timestamp.
> > + * Return latest ctx timestamp. With support for active contexts, the
> > + * calculation may be slightly racy, so follow a read-again logic to ensure that
> > + * the context is still active before returning the right timestamp.
> >  *
> >  * Returns: New ctx timestamp value
> >  */
> > -u64 xe_lrc_update_timestamp(struct xe_lrc *lrc, u64 *old_ts)
> > +u64 xe_lrc_timestamp(struct xe_lrc *lrc)
> > {
> > -	u64 lrc_ts, reg_ts;
> > +	u64 lrc_ts, reg_ts, new_ts;
> > 	u32 engine_id;
> > 
> > -	*old_ts = lrc->ctx_timestamp;
> > -
> > 	lrc_ts = xe_lrc_ctx_timestamp(lrc);
> > 	/* CTX_TIMESTAMP mmio read is invalid on VF, so return the LRC value */
> > 	if (IS_SRIOV_VF(lrc_to_xe(lrc))) {
> > -		lrc->ctx_timestamp = lrc_ts;
> > +		new_ts = lrc_ts;
> > 		goto done;
> > 	}
> > 
> > 	if (lrc_ts == CONTEXT_ACTIVE) {
> > 		engine_id = xe_lrc_engine_id(lrc);
> > 		if (!get_ctx_timestamp(lrc, engine_id, &reg_ts))
> > -			lrc->ctx_timestamp = reg_ts;
> > +			new_ts = reg_ts;
> > 
> > 		/* read lrc again to ensure context is still active */
> > 		lrc_ts = xe_lrc_ctx_timestamp(lrc);
> > @@ -2419,9 +2415,27 @@ u64 xe_lrc_update_timestamp(struct xe_lrc *lrc, u64 *old_ts)
> > 	 * be a separate if condition.
> > 	 */
> > 	if (lrc_ts != CONTEXT_ACTIVE)
> > -		lrc->ctx_timestamp = lrc_ts;
> > +		new_ts = lrc_ts;
> > 
> > done:
> > +	return new_ts;
> > +}
> > +
> > +/**
> > + * xe_lrc_update_timestamp() - Update ctx timestamp
> > + * @lrc: Pointer to the lrc.
> > + * @old_ts: Old timestamp value
> > + *
> > + * Populate @old_ts current saved ctx timestamp, read new ctx timestamp and
> > + * update saved value.
> > + *
> > + * Returns: New ctx timestamp value
> > + */
> > +u64 xe_lrc_update_timestamp(struct xe_lrc *lrc, u64 *old_ts)
> > +{
> > +	*old_ts = lrc->ctx_timestamp;
> > +	lrc->ctx_timestamp = xe_lrc_timestamp(lrc);
> > +
> > 	trace_xe_lrc_update_timestamp(lrc, *old_ts);
> > 
> > 	return lrc->ctx_timestamp;
> > diff --git a/drivers/gpu/drm/xe/xe_lrc.h b/drivers/gpu/drm/xe/xe_lrc.h
> > index a32472b92242..93c1234e2706 100644
> > --- a/drivers/gpu/drm/xe/xe_lrc.h
> > +++ b/drivers/gpu/drm/xe/xe_lrc.h
> > @@ -142,7 +142,6 @@ void xe_lrc_snapshot_free(struct xe_lrc_snapshot *snapshot);
> > 
> > u32 xe_lrc_ctx_timestamp_ggtt_addr(struct xe_lrc *lrc);
> > u32 xe_lrc_ctx_timestamp_udw_ggtt_addr(struct xe_lrc *lrc);
> > -u64 xe_lrc_ctx_timestamp(struct xe_lrc *lrc);
> > u32 xe_lrc_ctx_job_timestamp_ggtt_addr(struct xe_lrc *lrc);
> > u32 xe_lrc_ctx_job_timestamp(struct xe_lrc *lrc);
> > int xe_lrc_setup_wa_bb_with_scratch(struct xe_lrc *lrc, struct xe_hw_engine *hwe,
> > @@ -162,4 +161,6 @@ int xe_lrc_setup_wa_bb_with_scratch(struct xe_lrc *lrc, struct xe_hw_engine *hwe
> >  */
> > u64 xe_lrc_update_timestamp(struct xe_lrc *lrc, u64 *old_ts);
> > 
> > +u64 xe_lrc_timestamp(struct xe_lrc *lrc);
> > +
> > #endif
> > diff --git a/drivers/gpu/drm/xe/xe_ring_ops.c b/drivers/gpu/drm/xe/xe_ring_ops.c
> > index ac0c6dcffe15..3dacfc2da75c 100644
> > --- a/drivers/gpu/drm/xe/xe_ring_ops.c
> > +++ b/drivers/gpu/drm/xe/xe_ring_ops.c
> > @@ -233,13 +233,26 @@ static u32 get_ppgtt_flag(struct xe_sched_job *job)
> > 	return 0;
> > }
> > 
> > -static int emit_copy_timestamp(struct xe_lrc *lrc, u32 *dw, int i)
> > +static int emit_copy_timestamp(struct xe_device *xe, struct xe_lrc *lrc,
> > +			       u32 *dw, int i)
> > {
> > 	dw[i++] = MI_STORE_REGISTER_MEM | MI_SRM_USE_GGTT | MI_SRM_ADD_CS_OFFSET;
> > 	dw[i++] = RING_CTX_TIMESTAMP(0).addr;
> > 	dw[i++] = xe_lrc_ctx_job_timestamp_ggtt_addr(lrc);
> > 	dw[i++] = 0;
> > 
> > +	/*
> > +	 * Ensure CTX timestamp >= Job timestamp during VF sampling to avoid
> > +	 * arithmetic wraparound in TDR.
> > +	 */
> > +	if (IS_SRIOV_VF(xe)) {
> > +		dw[i++] = MI_STORE_REGISTER_MEM | MI_SRM_USE_GGTT |
> > +			MI_SRM_ADD_CS_OFFSET;
> > +		dw[i++] = RING_CTX_TIMESTAMP(0).addr;
> > +		dw[i++] = xe_lrc_ctx_timestamp_ggtt_addr(lrc);
> > +		dw[i++] = 0;
> > +	}
> 
> Is this change for a different issue OR is it the same issue that is fixed
> in patch 8?
> 

This covers the case where the LRC timestamp is less than the job
timestamp. Consider a context that switches in with a timestamp of 1;
by the time the job timestamp is saved, the value is 2. The TDR would
see this as a wrap and immediately time out the job on the first TDR
fire, even though the job may only have switched onto the hardware at
the very end of the TDR period.

This code ensures the LRC timestamp >= job timestamp on first switch
in, so it is needed in addition to the previous patch.
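
To make the wrap concrete, here is a tiny stand-alone sketch of the
arithmetic. The values and the subtraction are made up purely to
illustrate how the unsigned delta goes wrong; this is not the actual
check_timeout() code:

#include <stdint.h>
#include <stdio.h>

int main(void)
{
	uint32_t ctx_timestamp = 1;	/* context switched in at timestamp 1 */
	uint32_t ctx_job_timestamp = 2;	/* job timestamp saved slightly later */

	/* Unsigned delta the TDR would treat as time executed */
	uint32_t delta = ctx_timestamp - ctx_job_timestamp;

	/* Wraps to 0xffffffff -> looks like an enormous runtime -> timeout */
	printf("delta = 0x%08x\n", (unsigned int)delta);

	return 0;
}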

Matt

> otherwise, LGTM,
> 
> Reviewed-by: Umesh Nerlige Ramappa <umesh.nerlige.ramappa@intel.com>
> 
> Thanks,
> Umesh
> 
> > +
> > 	return i;
> > }
> > 
> > @@ -253,7 +266,7 @@ static void __emit_job_gen12_simple(struct xe_sched_job *job, struct xe_lrc *lrc
> > 
> > 	*head = lrc->ring.tail;
> > 
> > -	i = emit_copy_timestamp(lrc, dw, i);
> > +	i = emit_copy_timestamp(gt_to_xe(gt), lrc, dw, i);
> > 
> > 	if (job->ring_ops_flush_tlb) {
> > 		dw[i++] = preparser_disable(true);
> > @@ -308,7 +321,7 @@ static void __emit_job_gen12_video(struct xe_sched_job *job, struct xe_lrc *lrc,
> > 
> > 	*head = lrc->ring.tail;
> > 
> > -	i = emit_copy_timestamp(lrc, dw, i);
> > +	i = emit_copy_timestamp(xe, lrc, dw, i);
> > 
> > 	dw[i++] = preparser_disable(true);
> > 
> > @@ -362,7 +375,7 @@ static void __emit_job_gen12_render_compute(struct xe_sched_job *job,
> > 
> > 	*head = lrc->ring.tail;
> > 
> > -	i = emit_copy_timestamp(lrc, dw, i);
> > +	i = emit_copy_timestamp(xe, lrc, dw, i);
> > 
> > 	dw[i++] = preparser_disable(true);
> > 	if (lacks_render)
> > @@ -406,12 +419,14 @@ static void emit_migration_job_gen12(struct xe_sched_job *job,
> > 				     struct xe_lrc *lrc, u32 *head,
> > 				     u32 seqno)
> > {
> > +	struct xe_gt *gt = job->q->gt;
> > +	struct xe_device *xe = gt_to_xe(gt);
> > 	u32 saddr = xe_lrc_start_seqno_ggtt_addr(lrc);
> > 	u32 dw[MAX_JOB_SIZE_DW], i = 0;
> > 
> > 	*head = lrc->ring.tail;
> > 
> > -	i = emit_copy_timestamp(lrc, dw, i);
> > +	i = emit_copy_timestamp(xe, lrc, dw, i);
> > 
> > 	i = emit_store_imm_ggtt(saddr, seqno, dw, i);
> > 
> > diff --git a/drivers/gpu/drm/xe/xe_sched_job.c b/drivers/gpu/drm/xe/xe_sched_job.c
> > index cb674a322113..39aec7f6d86d 100644
> > --- a/drivers/gpu/drm/xe/xe_sched_job.c
> > +++ b/drivers/gpu/drm/xe/xe_sched_job.c
> > @@ -110,6 +110,7 @@ struct xe_sched_job *xe_sched_job_create(struct xe_exec_queue *q,
> > 		return ERR_PTR(-ENOMEM);
> > 
> > 	job->q = q;
> > +	job->sample_timestamp = U64_MAX;
> > 	kref_init(&job->refcount);
> > 	xe_exec_queue_get(job->q);
> > 
> > diff --git a/drivers/gpu/drm/xe/xe_sched_job_types.h b/drivers/gpu/drm/xe/xe_sched_job_types.h
> > index 7c4c54fe920a..13c2970e81a8 100644
> > --- a/drivers/gpu/drm/xe/xe_sched_job_types.h
> > +++ b/drivers/gpu/drm/xe/xe_sched_job_types.h
> > @@ -59,6 +59,8 @@ struct xe_sched_job {
> > 	u32 lrc_seqno;
> > 	/** @migrate_flush_flags: Additional flush flags for migration jobs */
> > 	u32 migrate_flush_flags;
> > +	/** @sample_timestamp: Sampling of job timestamp in TDR */
> > +	u64 sample_timestamp;
> > 	/** @ring_ops_flush_tlb: The ring ops need to flush TLB before payload. */
> > 	bool ring_ops_flush_tlb;
> > 	/** @ggtt: mapped in ggtt. */
> > -- 
> > 2.34.1
> > 

^ permalink raw reply	[flat|nested] 27+ messages in thread

* Re: [PATCH v7 0/9] Fix DRM scheduler layering violations in Xe
  2025-12-01 18:39 [PATCH v7 0/9] Fix DRM scheduler layering violations in Xe Matthew Brost
                   ` (8 preceding siblings ...)
  2025-12-01 18:39 ` [PATCH v7 9/9] drm/xe: Avoid toggling schedule state to check LRC timestamp in TDR Matthew Brost
@ 2025-12-03  1:23 ` Matthew Brost
  2025-12-03  8:33   ` Philipp Stanner
  9 siblings, 1 reply; 27+ messages in thread
From: Matthew Brost @ 2025-12-03  1:23 UTC (permalink / raw)
  To: intel-xe; +Cc: dri-devel, phasta, christian.koenig, dakr

On Mon, Dec 01, 2025 at 10:39:45AM -0800, Matthew Brost wrote:

Fellow DRM sched maintainers - going to merge the first two patches in
this series to drm-misc-next two days from now unless I hear an
objection.

Matt 

> At XDC, we discussed that drivers should avoid accessing DRM scheduler
> internals, misusing DRM scheduler locks, and adopt a well-defined
> pending job list iterator. This series proposes the necessary changes to
> the DRM scheduler to bring Xe in line with that agreement and updates Xe
> to use the new DRM scheduler API.
> 
> While here, cleanup LR queue handling and simplify GuC state machine in
> Xe too. Also rework LRC timestamp sampling to avoid scheduling toggle.
> 
> v2:
>  - Fix checkpatch / naming issues
> v3:
>  - Only allow pending job list iterator to be called on stopped schedulers
>  - Cleanup LR queue handling / fix a few misselanous Xe scheduler issues
> v4:
>  - Address Niranjana's feedback
>  - Add patch to avoid toggling scheduler state in the TDR
> v5:
>  - Rebase
>  - Fixup LRC timeout check (Umesh)
> v6:
>  - Fix VF bugs (Testing)
> v7:
>  - Disable timestamp WA on VF
> 
> Matt
> 
> Matthew Brost (9):
>   drm/sched: Add several job helpers to avoid drivers touching scheduler
>     state
>   drm/sched: Add pending job list iterator
>   drm/xe: Add dedicated message lock
>   drm/xe: Stop abusing DRM scheduler internals
>   drm/xe: Only toggle scheduling in TDR if GuC is running
>   drm/xe: Do not deregister queues in TDR
>   drm/xe: Remove special casing for LR queues in submission
>   drm/xe: Disable timestamp WA on VFs
>   drm/xe: Avoid toggling schedule state to check LRC timestamp in TDR
> 
>  drivers/gpu/drm/scheduler/sched_main.c       |   4 +-
>  drivers/gpu/drm/xe/xe_gpu_scheduler.c        |   9 +-
>  drivers/gpu/drm/xe/xe_gpu_scheduler.h        |  37 +-
>  drivers/gpu/drm/xe/xe_gpu_scheduler_types.h  |   2 +
>  drivers/gpu/drm/xe/xe_guc_exec_queue_types.h |   2 -
>  drivers/gpu/drm/xe/xe_guc_submit.c           | 362 +++----------------
>  drivers/gpu/drm/xe/xe_guc_submit_types.h     |  11 -
>  drivers/gpu/drm/xe/xe_hw_fence.c             |  16 -
>  drivers/gpu/drm/xe/xe_hw_fence.h             |   2 -
>  drivers/gpu/drm/xe/xe_lrc.c                  |  45 ++-
>  drivers/gpu/drm/xe/xe_lrc.h                  |   3 +-
>  drivers/gpu/drm/xe/xe_ring_ops.c             |  25 +-
>  drivers/gpu/drm/xe/xe_sched_job.c            |   1 +
>  drivers/gpu/drm/xe/xe_sched_job_types.h      |   2 +
>  drivers/gpu/drm/xe/xe_trace.h                |   5 -
>  include/drm/gpu_scheduler.h                  |  82 +++++
>  16 files changed, 211 insertions(+), 397 deletions(-)
> 
> -- 
> 2.34.1
> 

^ permalink raw reply	[flat|nested] 27+ messages in thread

* Re: [PATCH v7 0/9] Fix DRM scheduler layering violations in Xe
  2025-12-03  1:23 ` [PATCH v7 0/9] Fix DRM scheduler layering violations in Xe Matthew Brost
@ 2025-12-03  8:33   ` Philipp Stanner
  0 siblings, 0 replies; 27+ messages in thread
From: Philipp Stanner @ 2025-12-03  8:33 UTC (permalink / raw)
  To: Matthew Brost, intel-xe; +Cc: dri-devel, phasta, christian.koenig, dakr

On Tue, 2025-12-02 at 17:23 -0800, Matthew Brost wrote:
> On Mon, Dec 01, 2025 at 10:39:45AM -0800, Matthew Brost wrote:
> 
> Fellow DRM sched maintainers - going to merge the first two patches in
> this series to drm-misc-next two days from now unless I hear an
> objection.

Fellow maintainer Matt, it seems that none of us has ever been on Cc
for this series or patch? I can't find it in my inbox. Did you forget
to add us for 7 revisions, or did you omit us on purpose?

I look occasionally at dri-devel, but you know how huge the list is.

P.


> 
> Matt 
> 
> > At XDC, we discussed that drivers should avoid accessing DRM scheduler
> > internals, misusing DRM scheduler locks, and adopt a well-defined
> > pending job list iterator. This series proposes the necessary changes to
> > the DRM scheduler to bring Xe in line with that agreement and updates Xe
> > to use the new DRM scheduler API.
> > 
> > While here, cleanup LR queue handling and simplify GuC state machine in
> > Xe too. Also rework LRC timestamp sampling to avoid scheduling toggle.
> > 
> > v2:
> >  - Fix checkpatch / naming issues
> > v3:
> >  - Only allow pending job list iterator to be called on stopped schedulers
> >  - Cleanup LR queue handling / fix a few misselanous Xe scheduler issues
> > v4:
> >  - Address Niranjana's feedback
> >  - Add patch to avoid toggling scheduler state in the TDR
> > v5:
> >  - Rebase
> >  - Fixup LRC timeout check (Umesh)
> > v6:
> >  - Fix VF bugs (Testing)
> > v7:
> >  - Disable timestamp WA on VF
> > 
> > Matt
> > 
> > Matthew Brost (9):
> >   drm/sched: Add several job helpers to avoid drivers touching scheduler
> >     state
> >   drm/sched: Add pending job list iterator
> >   drm/xe: Add dedicated message lock
> >   drm/xe: Stop abusing DRM scheduler internals
> >   drm/xe: Only toggle scheduling in TDR if GuC is running
> >   drm/xe: Do not deregister queues in TDR
> >   drm/xe: Remove special casing for LR queues in submission
> >   drm/xe: Disable timestamp WA on VFs
> >   drm/xe: Avoid toggling schedule state to check LRC timestamp in TDR
> > 
> >  drivers/gpu/drm/scheduler/sched_main.c       |   4 +-
> >  drivers/gpu/drm/xe/xe_gpu_scheduler.c        |   9 +-
> >  drivers/gpu/drm/xe/xe_gpu_scheduler.h        |  37 +-
> >  drivers/gpu/drm/xe/xe_gpu_scheduler_types.h  |   2 +
> >  drivers/gpu/drm/xe/xe_guc_exec_queue_types.h |   2 -
> >  drivers/gpu/drm/xe/xe_guc_submit.c           | 362 +++----------------
> >  drivers/gpu/drm/xe/xe_guc_submit_types.h     |  11 -
> >  drivers/gpu/drm/xe/xe_hw_fence.c             |  16 -
> >  drivers/gpu/drm/xe/xe_hw_fence.h             |   2 -
> >  drivers/gpu/drm/xe/xe_lrc.c                  |  45 ++-
> >  drivers/gpu/drm/xe/xe_lrc.h                  |   3 +-
> >  drivers/gpu/drm/xe/xe_ring_ops.c             |  25 +-
> >  drivers/gpu/drm/xe/xe_sched_job.c            |   1 +
> >  drivers/gpu/drm/xe/xe_sched_job_types.h      |   2 +
> >  drivers/gpu/drm/xe/xe_trace.h                |   5 -
> >  include/drm/gpu_scheduler.h                  |  82 +++++
> >  16 files changed, 211 insertions(+), 397 deletions(-)
> > 
> > -- 
> > 2.34.1
> > 


^ permalink raw reply	[flat|nested] 27+ messages in thread

* Re: [PATCH v7 1/9] drm/sched: Add several job helpers to avoid drivers touching scheduler state
  2025-12-01 18:39 ` [PATCH v7 1/9] drm/sched: Add several job helpers to avoid drivers touching scheduler state Matthew Brost
@ 2025-12-03  8:56   ` Philipp Stanner
  2025-12-03 21:10     ` Matthew Brost
  0 siblings, 1 reply; 27+ messages in thread
From: Philipp Stanner @ 2025-12-03  8:56 UTC (permalink / raw)
  To: Matthew Brost, intel-xe
  Cc: dri-devel, dakr, Christian König, Alex Deucher

+Cc Christian, Alex, Danilo


On Mon, 2025-12-01 at 10:39 -0800, Matthew Brost wrote:
> Add helpers to see if scheduler is stopped and a jobs signaled state.
> Expected to be used driver side on recovery and debug flows.

First, thanks for working on this.

This is a big and significant change because it moves towards ending
the 10-year practice of accessing internal locks etc. – I think this
should have a long(er) and detailed commit message aka "In the past
drivers used to … this must end because … to do so we need to provide
those new functions: …"

> 
> v4:
>  - Reorder patch to first in series (Niranjana)
>  - Also check parent fence for signaling (Niranjana)

"We" mostly agreed of not adding changelogs to commit messages anymore
and either have them in the cover letter or in the patche's comment
section below ---
The commit changelog comments are not canonical in the kernel and don't
provide any value IMO.

> 
> Signed-off-by: Matthew Brost <matthew.brost@intel.com>
> Reviewed-by: Niranjana Vishwanathapura <niranjana.vishwanathapura@intel.com>
> ---
>  drivers/gpu/drm/scheduler/sched_main.c |  4 ++--
>  include/drm/gpu_scheduler.h            | 32 ++++++++++++++++++++++++++
>  2 files changed, 34 insertions(+), 2 deletions(-)
> 
> diff --git a/drivers/gpu/drm/scheduler/sched_main.c b/drivers/gpu/drm/scheduler/sched_main.c
> index 1d4f1b822e7b..cf40c18ab433 100644
> --- a/drivers/gpu/drm/scheduler/sched_main.c
> +++ b/drivers/gpu/drm/scheduler/sched_main.c
> @@ -344,7 +344,7 @@ drm_sched_rq_select_entity_fifo(struct drm_gpu_scheduler *sched,
>   */
>  static void drm_sched_run_job_queue(struct drm_gpu_scheduler *sched)
>  {
> -	if (!READ_ONCE(sched->pause_submit))
> +	if (!drm_sched_is_stopped(sched))
>  		queue_work(sched->submit_wq, &sched->work_run_job);
>  }
>  
> @@ -354,7 +354,7 @@ static void drm_sched_run_job_queue(struct drm_gpu_scheduler *sched)
>   */
>  static void drm_sched_run_free_queue(struct drm_gpu_scheduler *sched)
>  {
> -	if (!READ_ONCE(sched->pause_submit))
> +	if (!drm_sched_is_stopped(sched))
>  		queue_work(sched->submit_wq, &sched->work_free_job);
>  }
>  
> diff --git a/include/drm/gpu_scheduler.h b/include/drm/gpu_scheduler.h
> index fb88301b3c45..385bf34e76fe 100644
> --- a/include/drm/gpu_scheduler.h
> +++ b/include/drm/gpu_scheduler.h
> @@ -698,4 +698,36 @@ void drm_sched_entity_modify_sched(struct drm_sched_entity *entity,
>  				   struct drm_gpu_scheduler **sched_list,
>  				   unsigned int num_sched_list);
>  
> +/* Inlines */

Surplus comment, everyone immediately sees by the keyword that the
functions are inline.

But why do you want to provide them here instead of in sched_main.c in
the first place?


> +
> +/**
> + * drm_sched_is_stopped() - DRM is stopped

Well no, I doubt the entire DRM subsystem is stopped ;)

"Checks whether drm_sched is stopped"

> + * @sched: DRM scheduler
> + *
> + * Return: True if sched is stopped, False otherwise
> + */
> +static inline bool drm_sched_is_stopped(struct drm_gpu_scheduler *sched)
> +{
> +	return READ_ONCE(sched->pause_submit);

I am by the way suspecting since a long time

> +}
> +
> +/**
> + * drm_sched_job_is_signaled() - DRM scheduler job is signaled
> + * @job: DRM scheduler job
> + *
> + * Determine if DRM scheduler job is signaled. DRM scheduler should be stopped
> + * to obtain a stable snapshot of state. Both parent fence (hardware fence) and
> + * finished fence (software fence) are check to determine signaling state.

s/check/checked

I can roughly understand why you need the start/stop checkers for your
list iterator, but what is this function's purpose? The commit message
should explain that.

Do you need them in Xe? Do all drivers need them?

I think it's very cool that you provide this series and are working on
all that, but at XDC I think the important point was that we determined
that AMD and Intel basically do the same trick for GPU resets.

So our desire was not only to prevent folks from accessing the
scheduler's internals, but, ideally, also provide a well documented,
centralized and canonical mechanisms to do GPU resets.

So I think this drm/sched code must be discussed with AMD and we should
see whether it would be sufficient for them, too. And if yes, we need
to properly document that new way of GPU resets and tell users what
those functions are for. The docstrings so far just highlight that
those functions exist and how they are used, but not *why* they exist.

> + *
> + * Return: True if job is signaled, False otherwise

True and False should be lower case I think. At least I've never seen
them upper case in docstrings so far?


P.

> + */
> +static inline bool drm_sched_job_is_signaled(struct drm_sched_job *job)
> +{
> +	struct drm_sched_fence *s_fence = job->s_fence;
> +
> +	WARN_ON(!drm_sched_is_stopped(job->sched));
> +	return (s_fence->parent && dma_fence_is_signaled(s_fence->parent)) ||
> +		dma_fence_is_signaled(&s_fence->finished);
> +}
> +
>  #endif


^ permalink raw reply	[flat|nested] 27+ messages in thread

* Re: [PATCH v7 2/9] drm/sched: Add pending job list iterator
  2025-12-01 18:39 ` [PATCH v7 2/9] drm/sched: Add pending job list iterator Matthew Brost
@ 2025-12-03  9:07   ` Philipp Stanner
  2025-12-03 10:28     ` Philipp Stanner
  2025-12-04 16:04     ` Alex Deucher
  0 siblings, 2 replies; 27+ messages in thread
From: Philipp Stanner @ 2025-12-03  9:07 UTC (permalink / raw)
  To: Matthew Brost, intel-xe
  Cc: dri-devel, Alex Deucher, Christian König, dakr

+Cc Alex, Christian, Danilo


On Mon, 2025-12-01 at 10:39 -0800, Matthew Brost wrote:
> Stop open coding pending job list in drivers. Add pending job list
> iterator which safely walks DRM scheduler list asserting DRM scheduler
> is stopped.
> 
> v2:
>  - Fix checkpatch (CI)
> v3:
>  - Drop locked version (Christian)
> v4:
>  - Reorder patch (Niranjana)

Same with the changelog.

> 
> Signed-off-by: Matthew Brost <matthew.brost@intel.com>
> Reviewed-by: Niranjana Vishwanathapura <niranjana.vishwanathapura@intel.com>
> ---
>  include/drm/gpu_scheduler.h | 50 +++++++++++++++++++++++++++++++++++++
>  1 file changed, 50 insertions(+)
> 
> diff --git a/include/drm/gpu_scheduler.h b/include/drm/gpu_scheduler.h
> index 385bf34e76fe..9d228513d06c 100644
> --- a/include/drm/gpu_scheduler.h
> +++ b/include/drm/gpu_scheduler.h
> @@ -730,4 +730,54 @@ static inline bool drm_sched_job_is_signaled(struct drm_sched_job *job)
>  		dma_fence_is_signaled(&s_fence->finished);
>  }
>  
> +/**
> + * struct drm_sched_pending_job_iter - DRM scheduler pending job iterator state
> + * @sched: DRM scheduler associated with pending job iterator
> + */
> +struct drm_sched_pending_job_iter {
> +	struct drm_gpu_scheduler *sched;
> +};
> +
> +/* Drivers should never call this directly */
> +static inline struct drm_sched_pending_job_iter
> +__drm_sched_pending_job_iter_begin(struct drm_gpu_scheduler *sched)
> +{
> +	struct drm_sched_pending_job_iter iter = {
> +		.sched = sched,
> +	};
> +
> +	WARN_ON(!drm_sched_is_stopped(sched));
> +	return iter;
> +}
> +
> +/* Drivers should never call this directly */
> +static inline void
> +__drm_sched_pending_job_iter_end(const struct drm_sched_pending_job_iter iter)
> +{
> +	WARN_ON(!drm_sched_is_stopped(iter.sched));
> +}
> +
> +DEFINE_CLASS(drm_sched_pending_job_iter, struct drm_sched_pending_job_iter,
> +	     __drm_sched_pending_job_iter_end(_T),
> +	     __drm_sched_pending_job_iter_begin(__sched),
> +	     struct drm_gpu_scheduler *__sched);
> +static inline void *
> +class_drm_sched_pending_job_iter_lock_ptr(class_drm_sched_pending_job_iter_t *_T)
> +{ return _T; }
> +#define class_drm_sched_pending_job_iter_is_conditional false
> +
> +/**
> + * drm_sched_for_each_pending_job() - Iterator for each pending job in scheduler
> + * @__job: Current pending job being iterated over
> + * @__sched: DRM scheduler to iterate over pending jobs
> + * @__entity: DRM scheduler entity to filter jobs, NULL indicates no filter
> + *
> + * Iterator for each pending job in scheduler, filtering on an entity, and
> + * enforcing scheduler is fully stopped
> + */
> +#define drm_sched_for_each_pending_job(__job, __sched, __entity)		\
> +	scoped_guard(drm_sched_pending_job_iter, (__sched))			\
> +		list_for_each_entry((__job), &(__sched)->pending_list, list)	\
> +			for_each_if(!(__entity) || (__job)->entity == (__entity))
> +
>  #endif


See my comments in the first patch. The docu doesn't mention at all why
this new functionality exists and when and why users would be expected
to use it.

As far as I remember from XDC, both AMD and Intel overwrite a timed out
jobs buffer data in the rings on GPU reset. To do so, the driver needs
the timedout job (passed through timedout_job() callback) and then
needs all the pending non-broken jobs.

AFAICS your patch provides a generic iterator over the entire
pending_list. How is a driver then supposed to determine which are the
non-broken jobs (just asking, but that needs to be documented)?

Could it make sense to use a different iterator which only returns jobs
of not belonging to the same context as the timedout-one?

Those are important questions that need to be addressed before merging
that.

And if this works canonically (i.e., for basically everyone), it needs
to be documented in drm_sched_resubmit_jobs() that this iterator is now
the canonical way of handling timeouts.

Moreover, btw, just yesterday I added an entry to the DRM todo list
which addresses drm_sched_resubmit_jobs(). If we merge this, that entry
would have to be removed, too.


@AMD: Would the code Matthew provides work for you? Please give your
input. This is very important common infrastructure.


Philipp


^ permalink raw reply	[flat|nested] 27+ messages in thread

* Re: [PATCH v7 3/9] drm/xe: Add dedicated message lock
  2025-12-01 18:39 ` [PATCH v7 3/9] drm/xe: Add dedicated message lock Matthew Brost
@ 2025-12-03  9:38   ` Philipp Stanner
  0 siblings, 0 replies; 27+ messages in thread
From: Philipp Stanner @ 2025-12-03  9:38 UTC (permalink / raw)
  To: Matthew Brost, intel-xe; +Cc: dri-devel

On Mon, 2025-12-01 at 10:39 -0800, Matthew Brost wrote:
> Stop abusing DRM scheduler job list lock for messages, add dedicated
> message lock.
> 
> Signed-off-by: Matthew Brost <matthew.brost@intel.com>
> Reviewed-by: Niranjana Vishwanathapura <niranjana.vishwanathapura@intel.com>

Nice! Me gusta.


Acked-by: Philipp Stanner <phasta@kernel.org>

> ---
>  drivers/gpu/drm/xe/xe_gpu_scheduler.c       | 5 +++--
>  drivers/gpu/drm/xe/xe_gpu_scheduler.h       | 4 ++--
>  drivers/gpu/drm/xe/xe_gpu_scheduler_types.h | 2 ++
>  3 files changed, 7 insertions(+), 4 deletions(-)
> 
> diff --git a/drivers/gpu/drm/xe/xe_gpu_scheduler.c b/drivers/gpu/drm/xe/xe_gpu_scheduler.c
> index f91e06d03511..f4f23317191f 100644
> --- a/drivers/gpu/drm/xe/xe_gpu_scheduler.c
> +++ b/drivers/gpu/drm/xe/xe_gpu_scheduler.c
> @@ -77,6 +77,7 @@ int xe_sched_init(struct xe_gpu_scheduler *sched,
>  	};
>  
>  	sched->ops = xe_ops;
> +	spin_lock_init(&sched->msg_lock);
>  	INIT_LIST_HEAD(&sched->msgs);
>  	INIT_WORK(&sched->work_process_msg, xe_sched_process_msg_work);
>  
> @@ -117,7 +118,7 @@ void xe_sched_add_msg(struct xe_gpu_scheduler *sched,
>  void xe_sched_add_msg_locked(struct xe_gpu_scheduler *sched,
>  			     struct xe_sched_msg *msg)
>  {
> -	lockdep_assert_held(&sched->base.job_list_lock);
> +	lockdep_assert_held(&sched->msg_lock);
>  
>  	list_add_tail(&msg->link, &sched->msgs);
>  	xe_sched_process_msg_queue(sched);
> @@ -131,7 +132,7 @@ void xe_sched_add_msg_locked(struct xe_gpu_scheduler *sched,
>  void xe_sched_add_msg_head(struct xe_gpu_scheduler *sched,
>  			   struct xe_sched_msg *msg)
>  {
> -	lockdep_assert_held(&sched->base.job_list_lock);
> +	lockdep_assert_held(&sched->msg_lock);
>  
>  	list_add(&msg->link, &sched->msgs);
>  	xe_sched_process_msg_queue(sched);
> diff --git a/drivers/gpu/drm/xe/xe_gpu_scheduler.h b/drivers/gpu/drm/xe/xe_gpu_scheduler.h
> index c7a77a3a9681..dceb2cd0ee5b 100644
> --- a/drivers/gpu/drm/xe/xe_gpu_scheduler.h
> +++ b/drivers/gpu/drm/xe/xe_gpu_scheduler.h
> @@ -33,12 +33,12 @@ void xe_sched_add_msg_head(struct xe_gpu_scheduler *sched,
>  
>  static inline void xe_sched_msg_lock(struct xe_gpu_scheduler *sched)
>  {
> -	spin_lock(&sched->base.job_list_lock);
> +	spin_lock(&sched->msg_lock);
>  }
>  
>  static inline void xe_sched_msg_unlock(struct xe_gpu_scheduler *sched)
>  {
> -	spin_unlock(&sched->base.job_list_lock);
> +	spin_unlock(&sched->msg_lock);
>  }
>  
>  static inline void xe_sched_stop(struct xe_gpu_scheduler *sched)
> diff --git a/drivers/gpu/drm/xe/xe_gpu_scheduler_types.h b/drivers/gpu/drm/xe/xe_gpu_scheduler_types.h
> index 6731b13da8bb..63d9bf92583c 100644
> --- a/drivers/gpu/drm/xe/xe_gpu_scheduler_types.h
> +++ b/drivers/gpu/drm/xe/xe_gpu_scheduler_types.h
> @@ -47,6 +47,8 @@ struct xe_gpu_scheduler {
>  	const struct xe_sched_backend_ops	*ops;
>  	/** @msgs: list of messages to be processed in @work_process_msg */
>  	struct list_head			msgs;
> +	/** @msg_lock: Message lock */
> +	spinlock_t				msg_lock;
>  	/** @work_process_msg: processes messages */
>  	struct work_struct		work_process_msg;
>  };


^ permalink raw reply	[flat|nested] 27+ messages in thread

* Re: [PATCH v7 2/9] drm/sched: Add pending job list iterator
  2025-12-03  9:07   ` Philipp Stanner
@ 2025-12-03 10:28     ` Philipp Stanner
  2025-12-04 16:04     ` Alex Deucher
  1 sibling, 0 replies; 27+ messages in thread
From: Philipp Stanner @ 2025-12-03 10:28 UTC (permalink / raw)
  To: Matthew Brost, intel-xe
  Cc: dri-devel, Alex Deucher, Christian König, dakr

On Wed, 2025-12-03 at 10:07 +0100, Philipp Stanner wrote:
> +Cc Alex, Christian, Danilo
> 
> 
> On Mon, 2025-12-01 at 10:39 -0800, Matthew Brost wrote:
> > 

[…]

> > +
> > +/**
> > + * drm_sched_for_each_pending_job() - Iterator for each pending job in scheduler
> > + * @__job: Current pending job being iterated over
> > + * @__sched: DRM scheduler to iterate over pending jobs
> > + * @__entity: DRM scheduler entity to filter jobs, NULL indicates no filter
> > + *
> > + * Iterator for each pending job in scheduler, filtering on an entity, and
> > + * enforcing scheduler is fully stopped
> > + */
> > +#define drm_sched_for_each_pending_job(__job, __sched, __entity)		\
> > +	scoped_guard(drm_sched_pending_job_iter, (__sched))			\
> > +		list_for_each_entry((__job), &(__sched)->pending_list, list)	\
> > +			for_each_if(!(__entity) || (__job)->entity == (__entity))
> > +
> >  #endif
> 
> 
> See my comments in the first patch. The docu doesn't mention at all why
> this new functionality exists and when and why users would be expected
> to use it.
> 
> As far as I remember from XDC, both AMD and Intel overwrite a timed out
> jobs buffer data in the rings on GPU reset. To do so, the driver needs
> the timedout job (passed through timedout_job() callback) and then
> needs all the pending non-broken jobs.
> 
> AFAICS your patch provides a generic iterator over the entire
> pending_list. How is a driver then supposed to determine which are the
> non-broken jobs (just asking, but that needs to be documented)?
> 
> Could it make sense to use a different iterator which only returns jobs
> of not belonging to the same context as the timedout-one?

(forget about that comment, you do that with the entity-filter
obviously)

P.


^ permalink raw reply	[flat|nested] 27+ messages in thread

* Re: [PATCH v7 4/9] drm/xe: Stop abusing DRM scheduler internals
  2025-12-01 18:39 ` [PATCH v7 4/9] drm/xe: Stop abusing DRM scheduler internals Matthew Brost
@ 2025-12-03 10:56   ` Philipp Stanner
  2025-12-03 20:44     ` Matthew Brost
  0 siblings, 1 reply; 27+ messages in thread
From: Philipp Stanner @ 2025-12-03 10:56 UTC (permalink / raw)
  To: Matthew Brost, intel-xe
  Cc: dri-devel, dakr, Alex Deucher, Christian König

On Mon, 2025-12-01 at 10:39 -0800, Matthew Brost wrote:
> Use new pending job list iterator and new helper functions in Xe to
> avoid reaching into DRM scheduler internals.

Cool.

Obviously this is your driver, but some comments below which you might
want to take into account.

> 
> Part of this change involves removing pending jobs debug information
> from debugfs and devcoredump. As agreed, the pending job list should
> only be accessed when the scheduler is stopped. However, it's not
> straightforward to determine whether the scheduler is stopped from the
> shared debugfs/devcoredump code path. Additionally, the pending job list
> provides little useful information, as pending jobs can be inferred from
> seqnos and ring head/tail positions. Therefore, this debug information
> is being removed.

This reads a bit like a contradiction to the first sentence.

> 
> v4:
>  - Add comment around DRM_GPU_SCHED_STAT_NO_HANG (Niranjana)

Revision info for just one of 7 revisions?

> 
> Signed-off-by: Matthew Brost <matthew.brost@intel.com>
> Reviewed-by: Niranjana Vishwanathapura <niranjana.vishwanathapura@intel.com>
> ---
>  drivers/gpu/drm/xe/xe_gpu_scheduler.c    |  4 +-
>  drivers/gpu/drm/xe/xe_gpu_scheduler.h    | 33 ++--------
>  drivers/gpu/drm/xe/xe_guc_submit.c       | 81 ++++++------------------
>  drivers/gpu/drm/xe/xe_guc_submit_types.h | 11 ----
>  drivers/gpu/drm/xe/xe_hw_fence.c         | 16 -----
>  drivers/gpu/drm/xe/xe_hw_fence.h         |  2 -
>  6 files changed, 27 insertions(+), 120 deletions(-)
> 
> diff --git a/drivers/gpu/drm/xe/xe_gpu_scheduler.c b/drivers/gpu/drm/xe/xe_gpu_scheduler.c
> index f4f23317191f..9c8004d5dd91 100644
> --- a/drivers/gpu/drm/xe/xe_gpu_scheduler.c
> +++ b/drivers/gpu/drm/xe/xe_gpu_scheduler.c
> @@ -7,7 +7,7 @@
>  
>  static void xe_sched_process_msg_queue(struct xe_gpu_scheduler *sched)
>  {
> -	if (!READ_ONCE(sched->base.pause_submit))
> +	if (!drm_sched_is_stopped(&sched->base))
>  		queue_work(sched->base.submit_wq, &sched->work_process_msg);

Sharing the submit_wq is legal. But next-level cleanness would be if
struct drm_gpu_scheduler's internal components wouldn't be touched.
That's kind of a luxury request, though.

>  }
>  
> @@ -43,7 +43,7 @@ static void xe_sched_process_msg_work(struct work_struct *w)
>  		container_of(w, struct xe_gpu_scheduler, work_process_msg);
>  	struct xe_sched_msg *msg;
>  
> -	if (READ_ONCE(sched->base.pause_submit))
> +	if (drm_sched_is_stopped(&sched->base))
>  		return;
>  
>  	msg = xe_sched_get_msg(sched);
> diff --git a/drivers/gpu/drm/xe/xe_gpu_scheduler.h b/drivers/gpu/drm/xe/xe_gpu_scheduler.h
> index dceb2cd0ee5b..664c2db56af3 100644
> --- a/drivers/gpu/drm/xe/xe_gpu_scheduler.h
> +++ b/drivers/gpu/drm/xe/xe_gpu_scheduler.h
> @@ -56,12 +56,9 @@ static inline void xe_sched_resubmit_jobs(struct xe_gpu_scheduler *sched)
>  	struct drm_sched_job *s_job;
>  	bool restore_replay = false;
>  
> -	list_for_each_entry(s_job, &sched->base.pending_list, list) {
> -		struct drm_sched_fence *s_fence = s_job->s_fence;
> -		struct dma_fence *hw_fence = s_fence->parent;
> -
> +	drm_sched_for_each_pending_job(s_job, &sched->base, NULL) {
>  		restore_replay |= to_xe_sched_job(s_job)->restore_replay;
> -		if (restore_replay || (hw_fence && !dma_fence_is_signaled(hw_fence)))
> +		if (restore_replay || !drm_sched_job_is_signaled(s_job))

So that's where this function is needed. You check whether that job in
the pending_list is signaled. 

>  			sched->base.ops->run_job(s_job);

Aaaaaahm. So you invoke your own callback. But basically just to access
the function pointer I suppose?

Since this is effectively your drm_sched_resubmit_jobs(), it is
definitely desirable to provide a text book example of how to do resets
so that others can follow your usage.

Can't you replace ops->run_job() with a call to your functions where
you push the jobs to the ring, directly?


P.


^ permalink raw reply	[flat|nested] 27+ messages in thread

* Re: [PATCH v7 4/9] drm/xe: Stop abusing DRM scheduler internals
  2025-12-03 10:56   ` Philipp Stanner
@ 2025-12-03 20:44     ` Matthew Brost
  2025-12-08 13:44       ` Philipp Stanner
  0 siblings, 1 reply; 27+ messages in thread
From: Matthew Brost @ 2025-12-03 20:44 UTC (permalink / raw)
  To: Philipp Stanner
  Cc: intel-xe, dri-devel, dakr, Alex Deucher, Christian König

On Wed, Dec 03, 2025 at 11:56:01AM +0100, Philipp Stanner wrote:
> On Mon, 2025-12-01 at 10:39 -0800, Matthew Brost wrote:
> > Use new pending job list iterator and new helper functions in Xe to
> > avoid reaching into DRM scheduler internals.
> 
> Cool.
> 
> Obviously this is your driver, but some comments below which you might
> want to take into account.
> 
> > 
> > Part of this change involves removing pending jobs debug information
> > from debugfs and devcoredump. As agreed, the pending job list should
> > only be accessed when the scheduler is stopped. However, it's not
> > straightforward to determine whether the scheduler is stopped from the
> > shared debugfs/devcoredump code path. Additionally, the pending job list
> > provides little useful information, as pending jobs can be inferred from
> > seqnos and ring head/tail positions. Therefore, this debug information
> > is being removed.
> 
> This reads a bit like a contradiction to the first sentence.
> 
> > 
> > v4:
> >  - Add comment around DRM_GPU_SCHED_STAT_NO_HANG (Niranjana)
> 
> Revision info for just one of 7 revisions?
> 

Only v4 changed.

> > 
> > Signed-off-by: Matthew Brost <matthew.brost@intel.com>
> > Reviewed-by: Niranjana Vishwanathapura <niranjana.vishwanathapura@intel.com>
> > ---
> >  drivers/gpu/drm/xe/xe_gpu_scheduler.c    |  4 +-
> >  drivers/gpu/drm/xe/xe_gpu_scheduler.h    | 33 ++--------
> >  drivers/gpu/drm/xe/xe_guc_submit.c       | 81 ++++++------------------
> >  drivers/gpu/drm/xe/xe_guc_submit_types.h | 11 ----
> >  drivers/gpu/drm/xe/xe_hw_fence.c         | 16 -----
> >  drivers/gpu/drm/xe/xe_hw_fence.h         |  2 -
> >  6 files changed, 27 insertions(+), 120 deletions(-)
> > 
> > diff --git a/drivers/gpu/drm/xe/xe_gpu_scheduler.c b/drivers/gpu/drm/xe/xe_gpu_scheduler.c
> > index f4f23317191f..9c8004d5dd91 100644
> > --- a/drivers/gpu/drm/xe/xe_gpu_scheduler.c
> > +++ b/drivers/gpu/drm/xe/xe_gpu_scheduler.c
> > @@ -7,7 +7,7 @@
> >  
> >  static void xe_sched_process_msg_queue(struct xe_gpu_scheduler *sched)
> >  {
> > -	if (!READ_ONCE(sched->base.pause_submit))
> > +	if (!drm_sched_is_stopped(&sched->base))
> >  		queue_work(sched->base.submit_wq, &sched->work_process_msg);
> 
> Sharing the submit_wq is legal. But next-level cleanness would be if
> struct drm_gpu_scheduler's internal components wouldn't be touched.
> That's kind of a luxury request, though.
> 

Yes, perhaps a helper to extract the submit_wq too.
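
Something like this would probably be enough (hypothetical sketch, not
part of this series):

static inline struct workqueue_struct *
drm_sched_submit_wq(struct drm_gpu_scheduler *sched)
{
	return sched->submit_wq;
}

That way Xe could queue its message work without reaching into
sched->base.submit_wq directly.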

> >  }
> >  
> > @@ -43,7 +43,7 @@ static void xe_sched_process_msg_work(struct work_struct *w)
> >  		container_of(w, struct xe_gpu_scheduler, work_process_msg);
> >  	struct xe_sched_msg *msg;
> >  
> > -	if (READ_ONCE(sched->base.pause_submit))
> > +	if (drm_sched_is_stopped(&sched->base))
> >  		return;
> >  
> >  	msg = xe_sched_get_msg(sched);
> > diff --git a/drivers/gpu/drm/xe/xe_gpu_scheduler.h b/drivers/gpu/drm/xe/xe_gpu_scheduler.h
> > index dceb2cd0ee5b..664c2db56af3 100644
> > --- a/drivers/gpu/drm/xe/xe_gpu_scheduler.h
> > +++ b/drivers/gpu/drm/xe/xe_gpu_scheduler.h
> > @@ -56,12 +56,9 @@ static inline void xe_sched_resubmit_jobs(struct xe_gpu_scheduler *sched)
> >  	struct drm_sched_job *s_job;
> >  	bool restore_replay = false;
> >  
> > -	list_for_each_entry(s_job, &sched->base.pending_list, list) {
> > -		struct drm_sched_fence *s_fence = s_job->s_fence;
> > -		struct dma_fence *hw_fence = s_fence->parent;
> > -
> > +	drm_sched_for_each_pending_job(s_job, &sched->base, NULL) {
> >  		restore_replay |= to_xe_sched_job(s_job)->restore_replay;
> > -		if (restore_replay || (hw_fence && !dma_fence_is_signaled(hw_fence)))
> > +		if (restore_replay || !drm_sched_job_is_signaled(s_job))
> 
> So that's where this function is needed. You check whether that job in
> the pending_list is signaled. 
> 

Yes, during GT reset flows (think a device-level reset) it is possible
that we stop the scheduler in the window between a job signaling and
free_job being called. We want to avoid resubmitting jobs which have
already signaled.
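
To illustrate the window (rough sketch of the ordering, not exact code):

  CPU A (completion)                CPU B (GT reset)
  hw fence signals
                                    stop scheduler
  free_job() has not run yet        walk pending list
                                    -> job is still on the list but has
                                       already signaled, so it must not
                                       be re-run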

> >  			sched->base.ops->run_job(s_job);
> 
> Aaaaaahm. So you invoke your own callback. But basically just to access
> the function pointer I suppose?
> 
> Since this is effectively your drm_sched_resubmit_jobs(), it is
> definitely desirable to provide a text book example of how to do resets
> so that others can follow your usage.
> 

Yes, but drm_sched_resubmit_jobs() does some nonsense with dma-fence
that I don’t need here. Honestly, I’m a little unsure what that is
actually doing. We also use this function during VF restore after
migration. This is a multi-step process that needs to operate on the
same set of jobs at each step of the restore. That’s what the
restore_replay variable represents—it marks a job at the very beginning
of the restore process, and each step along the way ensures execution
starts at that job. Technically, once we are here in a VF restore, jobs can
start signaling as the hardware is live. So some of this really is
vendor-specific.
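
To make the sticky flag concrete (hypothetical job names): with pending
jobs A, B, C, D and C marked restore_replay at the start of the restore,
the loop skips A and B if they have already signaled, but from C onwards
restore_replay stays true, so C and D are always re-run even if they
signal mid-restore.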

> Can't you replace ops->run_job() with a call to your functions where
> you push the jobs to the ring, directly?
> 

Yes, we could, but that function isn’t currently exported. Also, in
future products, we may assign a different run_job vfunc based on
hardware generation or queue type. So using a vfunc here makes sense as
a bit of future-proofing. Of course, we could also have a DRM
scheduler-level helper that invokes run_job for us.
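
If that is preferred, something along these lines would do (hypothetical
sketch, not part of this series):

void drm_sched_resubmit_pending_job(struct drm_gpu_scheduler *sched,
				    struct drm_sched_job *job)
{
	WARN_ON(!drm_sched_is_stopped(sched));

	/* driver is expected to re-emit the job and keep the original fence */
	sched->ops->run_job(job);
}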

Matt

> 
> P.
> 

^ permalink raw reply	[flat|nested] 27+ messages in thread

* Re: [PATCH v7 1/9] drm/sched: Add several job helpers to avoid drivers touching scheduler state
  2025-12-03  8:56   ` Philipp Stanner
@ 2025-12-03 21:10     ` Matthew Brost
  0 siblings, 0 replies; 27+ messages in thread
From: Matthew Brost @ 2025-12-03 21:10 UTC (permalink / raw)
  To: Philipp Stanner
  Cc: intel-xe, dri-devel, dakr, Christian König, Alex Deucher

On Wed, Dec 03, 2025 at 09:56:45AM +0100, Philipp Stanner wrote:
> +Cc Christian, Alex, Danilo
> 
> 
> On Mon, 2025-12-01 at 10:39 -0800, Matthew Brost wrote:
> > Add helpers to see if scheduler is stopped and a jobs signaled state.
> > Expected to be used driver side on recovery and debug flows.
> 
> First, thanks for working on this.
> 
> This is a big and significant change because it moves towards ending
> the 10-year practice of accessing internal locks etc. – I think this
> should have a long(er) and detailed commit message aka "In the past
> drivers used to … this must end because … to do so we need to provide
> those new functions: …"
> 

Sure, let me add that.

> > 
> > v4:
> >  - Reorder patch to first in series (Niranjana)
> >  - Also check parent fence for signaling (Niranjana)
> 
> "We" mostly agreed of not adding changelogs to commit messages anymore
> and either have them in the cover letter or in the patche's comment
> section below ---
> The commit changelog comments are not canonical in the kernel and don't
> provide any value IMO.
> 

In Xe we typically keep these, right or wrong? Also, if this is below
---, and I checkout an mbox and apply it, then the next time I send the
patch the changelog is lost unless I add it back in. Maybe there is a git am
option so that doesn't happen? Anyways, I'll fix this up in the next
rev.

> > 
> > Signed-off-by: Matthew Brost <matthew.brost@intel.com>
> > Reviewed-by: Niranjana Vishwanathapura <niranjana.vishwanathapura@intel.com>
> > ---
> >  drivers/gpu/drm/scheduler/sched_main.c |  4 ++--
> >  include/drm/gpu_scheduler.h            | 32 ++++++++++++++++++++++++++
> >  2 files changed, 34 insertions(+), 2 deletions(-)
> > 
> > diff --git a/drivers/gpu/drm/scheduler/sched_main.c b/drivers/gpu/drm/scheduler/sched_main.c
> > index 1d4f1b822e7b..cf40c18ab433 100644
> > --- a/drivers/gpu/drm/scheduler/sched_main.c
> > +++ b/drivers/gpu/drm/scheduler/sched_main.c
> > @@ -344,7 +344,7 @@ drm_sched_rq_select_entity_fifo(struct drm_gpu_scheduler *sched,
> >   */
> >  static void drm_sched_run_job_queue(struct drm_gpu_scheduler *sched)
> >  {
> > -	if (!READ_ONCE(sched->pause_submit))
> > +	if (!drm_sched_is_stopped(sched))
> >  		queue_work(sched->submit_wq, &sched->work_run_job);
> >  }
> >  
> > @@ -354,7 +354,7 @@ static void drm_sched_run_job_queue(struct drm_gpu_scheduler *sched)
> >   */
> >  static void drm_sched_run_free_queue(struct drm_gpu_scheduler *sched)
> >  {
> > -	if (!READ_ONCE(sched->pause_submit))
> > +	if (!drm_sched_is_stopped(sched))
> >  		queue_work(sched->submit_wq, &sched->work_free_job);
> >  }
> >  
> > diff --git a/include/drm/gpu_scheduler.h b/include/drm/gpu_scheduler.h
> > index fb88301b3c45..385bf34e76fe 100644
> > --- a/include/drm/gpu_scheduler.h
> > +++ b/include/drm/gpu_scheduler.h
> > @@ -698,4 +698,36 @@ void drm_sched_entity_modify_sched(struct drm_sched_entity *entity,
> >  				   struct drm_gpu_scheduler **sched_list,
> >  				   unsigned int num_sched_list);
> >  
> > +/* Inlines */

This file already has comments like that, so I just followed the style.
See 'Scheduler operations' and 'Jobs' in this header. But I can remove it.

> 
> Surplus comment, everyone immediately sees by the keyword that the
> functions are inline.
> 
> But why do you want to provide them here instead of in sched_main.c in
> the first place?

They are small functions so I made them inlines, but I can move them to
sched_main.c if needed. The iterator in the following patch needs to be
in the header, though.

> 
> 
> > +
> > +/**
> > + * drm_sched_is_stopped() - DRM is stopped
> 
> Well no, I doubt the entire DRM subsystem is stopped ;)
> 
> "Checks whether drm_sched is stopped"
> 

Sure.

> > + * @sched: DRM scheduler
> > + *
> > + * Return: True if sched is stopped, False otherwise
> > + */
> > +static inline bool drm_sched_is_stopped(struct drm_gpu_scheduler *sched)
> > +{
> > +	return READ_ONCE(sched->pause_submit);
> 
> I am by the way suspecting since a long time
> 

?

> > +}
> > +
> > +/**
> > + * drm_sched_job_is_signaled() - DRM scheduler job is signaled
> > + * @job: DRM scheduler job
> > + *
> > + * Determine if DRM scheduler job is signaled. DRM scheduler should be stopped
> > + * to obtain a stable snapshot of state. Both parent fence (hardware fence) and
> > + * finished fence (software fence) are check to determine signaling state.
> 
> s/check/checked
>

+1
 
> I can roughly understand why you need the start/stop checkers for your
> list iterator, but what is this function's purpose? The commit message
> should explain that.
> 

Sure can adjust the commit message.

> Do you need them in Xe? Do all drivers need them?
> 

I think the Xe question is answered in patch #4. Unsure about other drivers.

> I think it's very cool that you provide this series and are working on
> all that, but at XDC I think the important point was that we determined
> that AMD and Intel basically do the same trick for GPU resets.
> 
> So our desire was not only to prevent folks from accessing the
> scheduler's internals, but, ideally, also provide a well documented,
> centralized and canonical mechanisms to do GPU resets.

See my reply in patch #4. I believe GPU resets could largely be generic.
However, my driver’s VF migration restore use case also calls run_job
again, which is a vendor-specific flow. So I’d prefer to keep that part
on the driver side and just use the functions provided in the first two
patches of this series to avoid touching the internals of the scheduler.
Eventually, I might push some of the logic from my custom function into
my run_job callback, but at the moment the ROI on that is quite low—and
I’m not convinced this can be made completely generic.

> 
> So I think this drm/sched code must be discussed with AMD and we should
> see whether it would be sufficient for them, too. And if yes, we need
> to properly document that new way of GPU resets and tell users what
> those functions are for. The docstrings so far just highlight that
> those functions exist and how they are used, but not *why* they exist.
> 

Again, I really doubt that everything related to GPU resets and
resubmission can be made generic. This series is about providing the
interfaces to do these things and doing so safely (e.g., not walking the
pending job list while the scheduler is running, etc.).

> > + *
> > + * Return: True if job is signaled, False otherwise
> 
> True and False should be lower case I think. At least I've never seen
> them upper case in docstrings so far?
> 

That's typically how we type this in Xe but this is a bikeshed. Can
change if you like.

Matt

> 
> P.
> 
> > + */
> > +static inline bool drm_sched_job_is_signaled(struct drm_sched_job *job)
> > +{
> > +	struct drm_sched_fence *s_fence = job->s_fence;
> > +
> > +	WARN_ON(!drm_sched_is_stopped(job->sched));
> > +	return (s_fence->parent && dma_fence_is_signaled(s_fence->parent)) ||
> > +		dma_fence_is_signaled(&s_fence->finished);
> > +}
> > +
> >  #endif
> 

^ permalink raw reply	[flat|nested] 27+ messages in thread

* Re: [PATCH v7 2/9] drm/sched: Add pending job list iterator
  2025-12-03  9:07   ` Philipp Stanner
  2025-12-03 10:28     ` Philipp Stanner
@ 2025-12-04 16:04     ` Alex Deucher
  2025-12-05  9:19       ` Christian König
  1 sibling, 1 reply; 27+ messages in thread
From: Alex Deucher @ 2025-12-04 16:04 UTC (permalink / raw)
  To: Philipp Stanner
  Cc: Matthew Brost, intel-xe, dri-devel, Alex Deucher,
	Christian König, dakr

On Wed, Dec 3, 2025 at 4:24 AM Philipp Stanner <pstanner@redhat.com> wrote:
>
> +Cc Alex, Christian, Danilo
>
>
> On Mon, 2025-12-01 at 10:39 -0800, Matthew Brost wrote:
> > Stop open coding pending job list in drivers. Add pending job list
> > iterator which safely walks DRM scheduler list asserting DRM scheduler
> > is stopped.
> >
> > v2:
> >  - Fix checkpatch (CI)
> > v3:
> >  - Drop locked version (Christian)
> > v4:
> >  - Reorder patch (Niranjana)
>
> Same with the changelog.
>
> >
> > Signed-off-by: Matthew Brost <matthew.brost@intel.com>
> > Reviewed-by: Niranjana Vishwanathapura <niranjana.vishwanathapura@intel.com>
> > ---
> >  include/drm/gpu_scheduler.h | 50 +++++++++++++++++++++++++++++++++++++
> >  1 file changed, 50 insertions(+)
> >
> > diff --git a/include/drm/gpu_scheduler.h b/include/drm/gpu_scheduler.h
> > index 385bf34e76fe..9d228513d06c 100644
> > --- a/include/drm/gpu_scheduler.h
> > +++ b/include/drm/gpu_scheduler.h
> > @@ -730,4 +730,54 @@ static inline bool drm_sched_job_is_signaled(struct drm_sched_job *job)
> >               dma_fence_is_signaled(&s_fence->finished);
> >  }
> >
> > +/**
> > + * struct drm_sched_pending_job_iter - DRM scheduler pending job iterator state
> > + * @sched: DRM scheduler associated with pending job iterator
> > + */
> > +struct drm_sched_pending_job_iter {
> > +     struct drm_gpu_scheduler *sched;
> > +};
> > +
> > +/* Drivers should never call this directly */
> > +static inline struct drm_sched_pending_job_iter
> > +__drm_sched_pending_job_iter_begin(struct drm_gpu_scheduler *sched)
> > +{
> > +     struct drm_sched_pending_job_iter iter = {
> > +             .sched = sched,
> > +     };
> > +
> > +     WARN_ON(!drm_sched_is_stopped(sched));
> > +     return iter;
> > +}
> > +
> > +/* Drivers should never call this directly */
> > +static inline void
> > +__drm_sched_pending_job_iter_end(const struct drm_sched_pending_job_iter iter)
> > +{
> > +     WARN_ON(!drm_sched_is_stopped(iter.sched));
> > +}
> > +
> > +DEFINE_CLASS(drm_sched_pending_job_iter, struct drm_sched_pending_job_iter,
> > +          __drm_sched_pending_job_iter_end(_T),
> > +          __drm_sched_pending_job_iter_begin(__sched),
> > +          struct drm_gpu_scheduler *__sched);
> > +static inline void *
> > +class_drm_sched_pending_job_iter_lock_ptr(class_drm_sched_pending_job_iter_t *_T)
> > +{ return _T; }
> > +#define class_drm_sched_pending_job_iter_is_conditional false
> > +
> > +/**
> > + * drm_sched_for_each_pending_job() - Iterator for each pending job in scheduler
> > + * @__job: Current pending job being iterated over
> > + * @__sched: DRM scheduler to iterate over pending jobs
> > + * @__entity: DRM scheduler entity to filter jobs, NULL indicates no filter
> > + *
> > + * Iterator for each pending job in scheduler, filtering on an entity, and
> > + * enforcing scheduler is fully stopped
> > + */
> > +#define drm_sched_for_each_pending_job(__job, __sched, __entity)             \
> > +     scoped_guard(drm_sched_pending_job_iter, (__sched))                     \
> > +             list_for_each_entry((__job), &(__sched)->pending_list, list)    \
> > +                     for_each_if(!(__entity) || (__job)->entity == (__entity))
> > +
> >  #endif
>
>
> See my comments in the first patch. The docu doesn't mention at all why
> this new functionality exists and when and why users would be expected
> to use it.
>
> As far as I remember from XDC, both AMD and Intel overwrite a timed out
> jobs buffer data in the rings on GPU reset. To do so, the driver needs
> the timedout job (passed through timedout_job() callback) and then
> needs all the pending non-broken jobs.
>
> AFAICS your patch provides a generic iterator over the entire
> pending_list. How is a driver then supposed to determine which are the
> non-broken jobs (just asking, but that needs to be documented)?
>
> Could it make sense to use a different iterator which only returns jobs
> of not belonging to the same context as the timedout-one?
>
> Those are important questions that need to be addressed before merging
> that.
>
> And if this works canonically (i.e., for basically everyone), it needs
> to be documented in drm_sched_resubmit_jobs() that this iterator is now
> the canonical way of handling timeouts.
>
> Moreover, btw, just yesterday I added an entry to the DRM todo list
> which addresses drm_sched_resubmit_jobs(). If we merge this, that entry
> would have to be removed, too.
>
>
> @AMD: Would the code Matthew provides work for you? Please give your
> input. This is very important common infrastructure.

I don't think drm_sched_resubmit_jobs() can work for us without major
rework.  For our kernel queues, we have a single queue on which jobs
for different clients are scheduled.  When we reset the queue, we lose
all jobs on the queue and have to re-emit the non-guilty ones.  We do
this at the ring level, i.e., we save the packets directly from the
ring and then re-emit the packets for the non-guilty contexts to the
freshly reset ring.  This avoids running run_job() again which would
issue new fences and race with memory management, etc.

I think the following would be workable:
1. driver job_timedout() callback flags the job as bad. resets the bad
queue, and calls drm_sched_resubmit_jobs()
2. drm_sched_resubmit_jobs() walks the pending list and calls
run_job() for every job
2. driver run_job() callback looks to see if we already ran this job
and uses the original fence rather than allocating a new one
3. driver run_job() callback checks to see if the job is guilty or
from the same context and if so, sets an error on the fences and
submits only the fence packet to the queue so that any follow up jobs
will properly synchronize if they need to wait on the fence from the
bad job.
4. driver run_job() callback will submit the full packet stream for
non-guilty contexts

I guess we could use the iterator and implement that logic in the
driver directly rather than using drm_sched_resubmit_jobs().
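
Roughly something like this (hypothetical my_* helpers, not actual
amdgpu code; the scheduler is already stopped and "bad" is the job
passed to timedout_job()):

	struct drm_sched_job *s_job;

	drm_sched_for_each_pending_job(s_job, sched, NULL) {
		if (s_job->entity == bad->entity) {
			/* guilty context: error the fence, emit only the fence packet */
			dma_fence_set_error(&s_job->s_fence->finished, -ETIME);
			my_ring_emit_fence_only(ring, s_job);
		} else {
			/* innocent context: re-emit the saved ring packets */
			my_ring_reemit_packets(ring, s_job);
		}
	}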

Alex

^ permalink raw reply	[flat|nested] 27+ messages in thread

* Re: [PATCH v7 2/9] drm/sched: Add pending job list iterator
  2025-12-04 16:04     ` Alex Deucher
@ 2025-12-05  9:19       ` Christian König
  2025-12-05 18:54         ` Matthew Brost
  2025-12-08 13:33         ` Philipp Stanner
  0 siblings, 2 replies; 27+ messages in thread
From: Christian König @ 2025-12-05  9:19 UTC (permalink / raw)
  To: Alex Deucher, Philipp Stanner
  Cc: Matthew Brost, intel-xe, dri-devel, Alex Deucher, dakr

On 12/4/25 17:04, Alex Deucher wrote:
> On Wed, Dec 3, 2025 at 4:24 AM Philipp Stanner <pstanner@redhat.com> wrote:
>>
>> +Cc Alex, Christian, Danilo
>>
>>
>> On Mon, 2025-12-01 at 10:39 -0800, Matthew Brost wrote:
>>> Stop open coding pending job list in drivers. Add pending job list
>>> iterator which safely walks DRM scheduler list asserting DRM scheduler
>>> is stopped.
>>>
>>> v2:
>>>  - Fix checkpatch (CI)
>>> v3:
>>>  - Drop locked version (Christian)
>>> v4:
>>>  - Reorder patch (Niranjana)
>>
>> Same with the changelog.
>>
>>>
>>> Signed-off-by: Matthew Brost <matthew.brost@intel.com>
>>> Reviewed-by: Niranjana Vishwanathapura <niranjana.vishwanathapura@intel.com>
>>> ---
>>>  include/drm/gpu_scheduler.h | 50 +++++++++++++++++++++++++++++++++++++
>>>  1 file changed, 50 insertions(+)
>>>
>>> diff --git a/include/drm/gpu_scheduler.h b/include/drm/gpu_scheduler.h
>>> index 385bf34e76fe..9d228513d06c 100644
>>> --- a/include/drm/gpu_scheduler.h
>>> +++ b/include/drm/gpu_scheduler.h
>>> @@ -730,4 +730,54 @@ static inline bool drm_sched_job_is_signaled(struct drm_sched_job *job)
>>>               dma_fence_is_signaled(&s_fence->finished);
>>>  }
>>>
>>> +/**
>>> + * struct drm_sched_pending_job_iter - DRM scheduler pending job iterator state
>>> + * @sched: DRM scheduler associated with pending job iterator
>>> + */
>>> +struct drm_sched_pending_job_iter {
>>> +     struct drm_gpu_scheduler *sched;
>>> +};
>>> +
>>> +/* Drivers should never call this directly */
>>> +static inline struct drm_sched_pending_job_iter
>>> +__drm_sched_pending_job_iter_begin(struct drm_gpu_scheduler *sched)
>>> +{
>>> +     struct drm_sched_pending_job_iter iter = {
>>> +             .sched = sched,
>>> +     };
>>> +
>>> +     WARN_ON(!drm_sched_is_stopped(sched));
>>> +     return iter;
>>> +}
>>> +
>>> +/* Drivers should never call this directly */
>>> +static inline void
>>> +__drm_sched_pending_job_iter_end(const struct drm_sched_pending_job_iter iter)
>>> +{
>>> +     WARN_ON(!drm_sched_is_stopped(iter.sched));
>>> +}
>>> +
>>> +DEFINE_CLASS(drm_sched_pending_job_iter, struct drm_sched_pending_job_iter,
>>> +          __drm_sched_pending_job_iter_end(_T),
>>> +          __drm_sched_pending_job_iter_begin(__sched),
>>> +          struct drm_gpu_scheduler *__sched);
>>> +static inline void *
>>> +class_drm_sched_pending_job_iter_lock_ptr(class_drm_sched_pending_job_iter_t *_T)
>>> +{ return _T; }
>>> +#define class_drm_sched_pending_job_iter_is_conditional false
>>> +
>>> +/**
>>> + * drm_sched_for_each_pending_job() - Iterator for each pending job in scheduler
>>> + * @__job: Current pending job being iterated over
>>> + * @__sched: DRM scheduler to iterate over pending jobs
>>> + * @__entity: DRM scheduler entity to filter jobs, NULL indicates no filter
>>> + *
>>> + * Iterator for each pending job in scheduler, filtering on an entity, and
>>> + * enforcing scheduler is fully stopped
>>> + */
>>> +#define drm_sched_for_each_pending_job(__job, __sched, __entity)             \
>>> +     scoped_guard(drm_sched_pending_job_iter, (__sched))                     \
>>> +             list_for_each_entry((__job), &(__sched)->pending_list, list)    \
>>> +                     for_each_if(!(__entity) || (__job)->entity == (__entity))
>>> +
>>>  #endif
>>
>>
>> See my comments in the first patch. The docu doesn't mention at all why
>> this new functionality exists and when and why users would be expected
>> to use it.
>>
>> As far as I remember from XDC, both AMD and Intel overwrite a timed out
>> jobs buffer data in the rings on GPU reset. To do so, the driver needs
>> the timedout job (passed through timedout_job() callback) and then
>> needs all the pending non-broken jobs.
>>
>> AFAICS your patch provides a generic iterator over the entire
>> pending_list. How is a driver then supposed to determine which are the
>> non-broken jobs (just asking, but that needs to be documented)?
>>
>> Could it make sense to use a different iterator which only returns jobs
>> of not belonging to the same context as the timedout-one?
>>
>> Those are important questions that need to be addressed before merging
>> that.
>>
>> And if this works canonically (i.e., for basically everyone), it needs
>> to be documented in drm_sched_resubmit_jobs() that this iterator is now
>> the canonical way of handling timeouts.
>>
>> Moreover, btw, just yesterday I added an entry to the DRM todo list
>> which addresses drm_sched_resubmit_jobs(). If we merge this, that entry
>> would have to be removed, too.
>>
>>
>> @AMD: Would the code Matthew provides work for you? Please give your
>> input. This is very important common infrastructure.
> 
> I don't think drm_sched_resubmit_jobs() can work for us without major
> rework.  For our kernel queues, we have a single queue on which jobs
> for different clients are scheduled.  When we reset the queue, we lose
> all jobs on the queue and have to re-emit the non-guilty ones.  We do
> this at the ring level, i.e., we save the packets directly from the
> ring and then re-emit the packets for the non-guilty contexts to the
> freshly reset ring.  This avoids running run_job() again which would
> issue new fences and race with memory management, etc.
> 
> I think the following would be workable:
> 1. driver job_timedout() callback flags the job as bad. resets the bad
> queue, and calls drm_sched_resubmit_jobs()
> 2. drm_sched_resubmit_jobs() walks the pending list and calls
> run_job() for every job

Calling run_job() multiple times was one of the worst ideas I have ever seen.

The problem here is that you need a transactional approach to the internal driver state which is modified by ->run_job().

So for example if you have:
->run_job(A)
->run_job(B)
->run_job(C)

And after a reset you find that you need to re-submit only job B and A & C are filtered then that means that your driver state needs to get back before running job A.

> 2. driver run_job() callback looks to see if we already ran this job
> and uses the original fence rather than allocating a new one

Nope, the problem is *all* drivers *must* use the original fence. Otherwise you always run into trouble.

We should not promote a driver interface which makes it extremely easy to shoot down the whole system.

> 3. driver run_job() callback checks to see if the job is guilty or
> from the same context and if so, sets an error on the fences and
> submits only the fence packet to the queue so that any follow up jobs
> will properly synchronize if they need to wait on the fence from the
> bad job.
> 4. driver run_job() callback will submit the full packet stream for
> non-guilty contexts
> 
> I guess we could use the iterator and implement that logic in the
> driver directly rather than using drm_sched_resubmit_jobs().

Yeah, exactly that's the way to go.

Christian.

> 
> Alex


^ permalink raw reply	[flat|nested] 27+ messages in thread

* Re: [PATCH v7 2/9] drm/sched: Add pending job list iterator
  2025-12-05  9:19       ` Christian König
@ 2025-12-05 18:54         ` Matthew Brost
  2025-12-08 13:33         ` Philipp Stanner
  1 sibling, 0 replies; 27+ messages in thread
From: Matthew Brost @ 2025-12-05 18:54 UTC (permalink / raw)
  To: Christian König
  Cc: Alex Deucher, Philipp Stanner, intel-xe, dri-devel, Alex Deucher,
	dakr

On Fri, Dec 05, 2025 at 10:19:32AM +0100, Christian König wrote:
> On 12/4/25 17:04, Alex Deucher wrote:
> > On Wed, Dec 3, 2025 at 4:24 AM Philipp Stanner <pstanner@redhat.com> wrote:
> >>
> >> +Cc Alex, Christian, Danilo
> >>
> >>
> >> On Mon, 2025-12-01 at 10:39 -0800, Matthew Brost wrote:
> >>> Stop open coding pending job list in drivers. Add pending job list
> >>> iterator which safely walks DRM scheduler list asserting DRM scheduler
> >>> is stopped.
> >>>
> >>> v2:
> >>>  - Fix checkpatch (CI)
> >>> v3:
> >>>  - Drop locked version (Christian)
> >>> v4:
> >>>  - Reorder patch (Niranjana)
> >>
> >> Same with the changelog.
> >>
> >>>
> >>> Signed-off-by: Matthew Brost <matthew.brost@intel.com>
> >>> Reviewed-by: Niranjana Vishwanathapura <niranjana.vishwanathapura@intel.com>
> >>> ---
> >>>  include/drm/gpu_scheduler.h | 50 +++++++++++++++++++++++++++++++++++++
> >>>  1 file changed, 50 insertions(+)
> >>>
> >>> diff --git a/include/drm/gpu_scheduler.h b/include/drm/gpu_scheduler.h
> >>> index 385bf34e76fe..9d228513d06c 100644
> >>> --- a/include/drm/gpu_scheduler.h
> >>> +++ b/include/drm/gpu_scheduler.h
> >>> @@ -730,4 +730,54 @@ static inline bool drm_sched_job_is_signaled(struct drm_sched_job *job)
> >>>               dma_fence_is_signaled(&s_fence->finished);
> >>>  }
> >>>
> >>> +/**
> >>> + * struct drm_sched_pending_job_iter - DRM scheduler pending job iterator state
> >>> + * @sched: DRM scheduler associated with pending job iterator
> >>> + */
> >>> +struct drm_sched_pending_job_iter {
> >>> +     struct drm_gpu_scheduler *sched;
> >>> +};
> >>> +
> >>> +/* Drivers should never call this directly */
> >>> +static inline struct drm_sched_pending_job_iter
> >>> +__drm_sched_pending_job_iter_begin(struct drm_gpu_scheduler *sched)
> >>> +{
> >>> +     struct drm_sched_pending_job_iter iter = {
> >>> +             .sched = sched,
> >>> +     };
> >>> +
> >>> +     WARN_ON(!drm_sched_is_stopped(sched));
> >>> +     return iter;
> >>> +}
> >>> +
> >>> +/* Drivers should never call this directly */
> >>> +static inline void
> >>> +__drm_sched_pending_job_iter_end(const struct drm_sched_pending_job_iter iter)
> >>> +{
> >>> +     WARN_ON(!drm_sched_is_stopped(iter.sched));
> >>> +}
> >>> +
> >>> +DEFINE_CLASS(drm_sched_pending_job_iter, struct drm_sched_pending_job_iter,
> >>> +          __drm_sched_pending_job_iter_end(_T),
> >>> +          __drm_sched_pending_job_iter_begin(__sched),
> >>> +          struct drm_gpu_scheduler *__sched);
> >>> +static inline void *
> >>> +class_drm_sched_pending_job_iter_lock_ptr(class_drm_sched_pending_job_iter_t *_T)
> >>> +{ return _T; }
> >>> +#define class_drm_sched_pending_job_iter_is_conditional false
> >>> +
> >>> +/**
> >>> + * drm_sched_for_each_pending_job() - Iterator for each pending job in scheduler
> >>> + * @__job: Current pending job being iterated over
> >>> + * @__sched: DRM scheduler to iterate over pending jobs
> >>> + * @__entity: DRM scheduler entity to filter jobs, NULL indicates no filter
> >>> + *
> >>> + * Iterator for each pending job in scheduler, filtering on an entity, and
> >>> + * enforcing scheduler is fully stopped
> >>> + */
> >>> +#define drm_sched_for_each_pending_job(__job, __sched, __entity)             \
> >>> +     scoped_guard(drm_sched_pending_job_iter, (__sched))                     \
> >>> +             list_for_each_entry((__job), &(__sched)->pending_list, list)    \
> >>> +                     for_each_if(!(__entity) || (__job)->entity == (__entity))
> >>> +
> >>>  #endif
> >>
> >>
> >> See my comments in the first patch. The docu doesn't mention at all why
> >> this new functionality exists and when and why users would be expected
> >> to use it.
> >>
> >> As far as I remember from XDC, both AMD and Intel overwrite a timed out
> >> jobs buffer data in the rings on GPU reset. To do so, the driver needs
> >> the timedout job (passed through timedout_job() callback) and then
> >> needs all the pending non-broken jobs.
> >>
> >> AFAICS your patch provides a generic iterator over the entire
> >> pending_list. How is a driver then supposed to determine which are the
> >> non-broken jobs (just asking, but that needs to be documented)?
> >>
> >> Could it make sense to use a different iterator which only returns jobs
> >> of not belonging to the same context as the timedout-one?
> >>
> >> Those are important questions that need to be addressed before merging
> >> that.
> >>
> >> And if this works canonically (i.e., for basically everyone), it needs
> >> to be documented in drm_sched_resubmit_jobs() that this iterator is now
> >> the canonical way of handling timeouts.
> >>
> >> Moreover, btw, just yesterday I added an entry to the DRM todo list
> >> which addresses drm_sched_resubmit_jobs(). If we merge this, that entry
> >> would have to be removed, too.
> >>
> >>
> >> @AMD: Would the code Matthew provides work for you? Please give your
> >> input. This is very important common infrastructure.
> > 
> > I don't think drm_sched_resubmit_jobs() can work for us without major
> > rework.  For our kernel queues, we have a single queue on which jobs
> > for different clients are scheduled.  When we reset the queue, we lose
> > all jobs on the queue and have to re-emit the non-guilty ones.  We do
> > this at the ring level, i.e., we save the packets directly from the
> > ring and then re-emit the packets for the non-guilty contexts to the
> > freshly reset ring.  This avoids running run_job() again which would
> > issue new fences and race with memory management, etc.
> > 
> > I think the following would be workable:
> > 1. driver job_timedout() callback flags the job as bad. resets the bad
> > queue, and calls drm_sched_resubmit_jobs()
> > 2. drm_sched_resubmit_jobs() walks the pending list and calls
> > run_job() for every job
> 
> Calling run_job() multiple times was one of the worst ideas I have ever seen.
> 

I'm not sure who this is directed at but maybe dial back the intensity a
bit here. I really doubt this is one of the worst ideas you've ever
seen.

> The problem here is that you need a transactional approach to the internal driver state which is modified by ->run_job().
> 
> So for example if you have:
> ->run_job(A)
> ->run_job(B)
> ->run_job(C)
> 
> And after a reset you find that you need to re-submit only job B and A & C are filtered then that means that your driver state needs to get back before running job A.
> 

This scenario isn’t possible in Xe due to the 1:1 relationship between
the scheduler and the entity. Jobs execute serially on a single ring. So
if B needs to be resubmitted, so does C. I’m not sure how AMDGPU works,
but it seems like a significant problem if this scenario can occur or
the scheduler is being misused, as AFAIK jobs submitted to a scheduler
instance should signal in order.
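
E.g. with pending jobs A -> B -> C on one ring: if B hung, C cannot have
executed either, so both B and C get replayed.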

> > 2. driver run_job() callback looks to see if we already ran this job
> > and uses the original fence rather than allocating a new one
> 
> Nope, the problem is *all* drivers *must* use the original fence. Otherwise you always run into trouble.
> 

Isn’t that what Alex is saying—always use the original fence? I fully
agree with this; Xe does the same. That’s why drm_sched_resubmit_jobs is
confusing, as the way it’s coded assumes run_job can return a different
fence than the original invocation of run_job. This is part of the
reason I didn’t use this function in Xe—it looked scary.

> We should not promote a driver interface which makes it extremely easy to shoot down the whole system.
> 
> > 3. driver run_job() callback checks to see if the job is guilty or
> > from the same context and if so, sets an error on the fences and
> > submits only the fence packet to the queue so that any follow up jobs
> > will properly synchronize if they need to wait on the fence from the
> > bad job.
> > 4. driver run_job() callback will submit the full packet stream for
> > non-guilty contexts
> > 
> > I guess we could use the iterator and implement that logic in the
> > driver directly rather than using drm_sched_resubmit_jobs().
> 
> Yeah, exactly that's the way to go.
> 

I think we’re in agreement that this patch can be rebased, address any
comments here, and then be merged?

Matt

> Christian.
> 
> > 
> > Alex
> 

^ permalink raw reply	[flat|nested] 27+ messages in thread

* Re: [PATCH v7 2/9] drm/sched: Add pending job list iterator
  2025-12-05  9:19       ` Christian König
  2025-12-05 18:54         ` Matthew Brost
@ 2025-12-08 13:33         ` Philipp Stanner
  1 sibling, 0 replies; 27+ messages in thread
From: Philipp Stanner @ 2025-12-08 13:33 UTC (permalink / raw)
  To: Christian König, Alex Deucher
  Cc: Matthew Brost, intel-xe, dri-devel, Alex Deucher, dakr

On Fri, 2025-12-05 at 10:19 +0100, Christian König wrote:
> On 12/4/25 17:04, Alex Deucher wrote:
> > On Wed, Dec 3, 2025 at 4:24 AM Philipp Stanner <pstanner@redhat.com> wrote:
> > > 
> > > +Cc Alex, Christian, Danilo
> > > 
> > > 
> > > On Mon, 2025-12-01 at 10:39 -0800, Matthew Brost wrote:
> > > > Stop open coding pending job list in drivers. Add pending job list
> > > > iterator which safely walks DRM scheduler list asserting DRM scheduler
> > > > is stopped.
> > > > 
> > > > v2:
> > > >  - Fix checkpatch (CI)
> > > > v3:
> > > >  - Drop locked version (Christian)
> > > > v4:
> > > >  - Reorder patch (Niranjana)
> > > 
> > > Same with the changelog.
> > > 
> > > > 
> > > > Signed-off-by: Matthew Brost <matthew.brost@intel.com>
> > > > Reviewed-by: Niranjana Vishwanathapura <niranjana.vishwanathapura@intel.com>
> > > > ---
> > > >  include/drm/gpu_scheduler.h | 50 +++++++++++++++++++++++++++++++++++++
> > > >  1 file changed, 50 insertions(+)
> > > > 
> > > > diff --git a/include/drm/gpu_scheduler.h b/include/drm/gpu_scheduler.h
> > > > index 385bf34e76fe..9d228513d06c 100644
> > > > --- a/include/drm/gpu_scheduler.h
> > > > +++ b/include/drm/gpu_scheduler.h
> > > > @@ -730,4 +730,54 @@ static inline bool drm_sched_job_is_signaled(struct drm_sched_job *job)
> > > >               dma_fence_is_signaled(&s_fence->finished);
> > > >  }
> > > > 
> > > > +/**
> > > > + * struct drm_sched_pending_job_iter - DRM scheduler pending job iterator state
> > > > + * @sched: DRM scheduler associated with pending job iterator
> > > > + */
> > > > +struct drm_sched_pending_job_iter {
> > > > +     struct drm_gpu_scheduler *sched;
> > > > +};
> > > > +
> > > > +/* Drivers should never call this directly */
> > > > +static inline struct drm_sched_pending_job_iter
> > > > +__drm_sched_pending_job_iter_begin(struct drm_gpu_scheduler *sched)
> > > > +{
> > > > +     struct drm_sched_pending_job_iter iter = {
> > > > +             .sched = sched,
> > > > +     };
> > > > +
> > > > +     WARN_ON(!drm_sched_is_stopped(sched));
> > > > +     return iter;
> > > > +}
> > > > +
> > > > +/* Drivers should never call this directly */
> > > > +static inline void
> > > > +__drm_sched_pending_job_iter_end(const struct drm_sched_pending_job_iter iter)
> > > > +{
> > > > +     WARN_ON(!drm_sched_is_stopped(iter.sched));
> > > > +}
> > > > +
> > > > +DEFINE_CLASS(drm_sched_pending_job_iter, struct drm_sched_pending_job_iter,
> > > > +          __drm_sched_pending_job_iter_end(_T),
> > > > +          __drm_sched_pending_job_iter_begin(__sched),
> > > > +          struct drm_gpu_scheduler *__sched);
> > > > +static inline void *
> > > > +class_drm_sched_pending_job_iter_lock_ptr(class_drm_sched_pending_job_iter_t *_T)
> > > > +{ return _T; }
> > > > +#define class_drm_sched_pending_job_iter_is_conditional false
> > > > +
> > > > +/**
> > > > + * drm_sched_for_each_pending_job() - Iterator for each pending job in scheduler
> > > > + * @__job: Current pending job being iterated over
> > > > + * @__sched: DRM scheduler to iterate over pending jobs
> > > > + * @__entity: DRM scheduler entity to filter jobs, NULL indicates no filter
> > > > + *
> > > > + * Iterator for each pending job in scheduler, filtering on an entity, and
> > > > + * enforcing scheduler is fully stopped
> > > > + */
> > > > +#define drm_sched_for_each_pending_job(__job, __sched, __entity)             \
> > > > +     scoped_guard(drm_sched_pending_job_iter, (__sched))                     \
> > > > +             list_for_each_entry((__job), &(__sched)->pending_list, list)    \
> > > > +                     for_each_if(!(__entity) || (__job)->entity == (__entity))
> > > > +
> > > >  #endif
> > > 
> > > 
> > > See my comments in the first patch. The docu doesn't mention at all why
> > > this new functionality exists and when and why users would be expected
> > > to use it.
> > > 
> > > As far as I remember from XDC, both AMD and Intel overwrite a timed-out
> > > job's buffer data in the rings on GPU reset. To do so, the driver needs
> > > the timedout job (passed through timedout_job() callback) and then
> > > needs all the pending non-broken jobs.
> > > 
> > > AFAICS your patch provides a generic iterator over the entire
> > > pending_list. How is a driver then supposed to determine which are the
> > > non-broken jobs (just asking, but that needs to be documented)?
> > > 
> > > Could it make sense to use a different iterator which only returns jobs
> > > not belonging to the same context as the timed-out one?
> > > 
> > > Those are important questions that need to be addressed before merging
> > > that.
> > > 
> > > And if this works canonically (i.e., for basically everyone), it needs
> > > to be documented in drm_sched_resubmit_jobs() that this iterator is now
> > > the canonical way of handling timeouts.
> > > 
> > > Moreover, btw, just yesterday I added an entry to the DRM todo list
> > > which addresses drm_sched_resubmit_jobs(). If we merge this, that entry
> > > would have to be removed, too.
> > > 
> > > 
> > > @AMD: Would the code Matthew provides work for you? Please give your
> > > input. This is very important common infrastructure.
> > 
> > I don't think drm_sched_resubmit_jobs() can work for us without major
> > rework.  For our kernel queues, we have a single queue on which jobs
> > for different clients are scheduled.  When we reset the queue, we lose
> > all jobs on the queue and have to re-emit the non-guilty ones.  We do
> > this at the ring level, i.e., we save the packets directly from the
> > ring and then re-emit the packets for the non-guilty contexts to the
> > freshly reset ring.  This avoids running run_job() again which would
> > issue new fences and race with memory management, etc.
> > 
> > I think the following would be workable:
> > 1. driver job_timedout() callback flags the job as bad, resets the bad
> > queue, and calls drm_sched_resubmit_jobs()
> > 2. drm_sched_resubmit_jobs() walks the pending list and calls
> > run_job() for every job
> 
> Calling run_job() multiple times was one of the worst ideas I have ever seen.
> 
> The problem here is that you need a transactional approach to the internal driver state which is modified by ->run_job().
> 
> So for example if you have:
> ->run_job(A)
> ->run_job(B)
> ->run_job(C)
> 
> And if, after a reset, you find that you need to re-submit only job B while A & C are filtered out, then that means your driver state needs to get back to where it was before running job A.
> 
> > 3. driver run_job() callback looks to see if we already ran this job
> > and uses the original fence rather than allocating a new one
> 
> Nope, the problem is *all* drivers *must* use the original fence. Otherwise you always run into trouble.
> 
> We should not promote a driver interface which makes it extremely easy to shoot down the whole system.
> 
> > 4. driver run_job() callback checks to see if the job is guilty or
> > from the same context and if so, sets an error on the fences and
> > submits only the fence packet to the queue so that any follow up jobs
> > will properly synchronize if they need to wait on the fence from the
> > bad job.
> > 5. driver run_job() callback will submit the full packet stream for
> > non-guilty contexts
> > 
> > I guess we could use the iterator and implement that logic in the
> > driver directly rather than using drm_sched_resubmit_jobs().
> 
> Yeah, exactly that's the way to go.

Sorry, I guess my message was confusing.

I don't mean for anyone to use drm_sched_resubmit_jobs() at all. That
function is deprecated, and since there are still users, we can't
easily modify it.
I just mentioned that function because its docstring should inform
about what users should do *instead* of calling that function.
Currently, it's just marked as deprecated without providing users with
an alternative.

Anyways.
So can you take a look at Matthew's iterator and see if that will work
for you?

If we'd end up with an at least mostly-generic solution for resubmits
that's in use in both Xe and amdgpu, that would be a huge leap forward.
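
For illustration, the shape I have in mind for the amdgpu side would be
roughly the following. drm_sched_for_each_pending_job() and
drm_sched_job_is_signaled() are from Matthew's series;
amdgpu_ring_reemit_job() and the guilty/context check are placeholders I
made up, so please treat this purely as a sketch:

static void amdgpu_reemit_innocent_jobs(struct drm_gpu_scheduler *sched,
					struct drm_sched_job *bad)
{
	struct drm_sched_job *s_job;

	/* Caller must have fully stopped the scheduler; the iterator WARNs otherwise. */
	drm_sched_for_each_pending_job(s_job, sched, NULL) {
		/* Skip the guilty job and anything from the same context. */
		if (s_job == bad || s_job->entity == bad->entity)
			continue;

		/* Jobs which already completed need no re-emission. */
		if (drm_sched_job_is_signaled(s_job))
			continue;

		/* Re-emit the saved ring packets, reusing the original fence. */
		amdgpu_ring_reemit_job(sched, s_job);
	}
}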


P.

> 
> Christian.
> 
> > 
> > Alex
> 


^ permalink raw reply	[flat|nested] 27+ messages in thread

* Re: [PATCH v7 4/9] drm/xe: Stop abusing DRM scheduler internals
  2025-12-03 20:44     ` Matthew Brost
@ 2025-12-08 13:44       ` Philipp Stanner
  0 siblings, 0 replies; 27+ messages in thread
From: Philipp Stanner @ 2025-12-08 13:44 UTC (permalink / raw)
  To: Matthew Brost
  Cc: intel-xe, dri-devel, dakr, Alex Deucher, Christian König

On Wed, 2025-12-03 at 12:44 -0800, Matthew Brost wrote:
> On Wed, Dec 03, 2025 at 11:56:01AM +0100, Philipp Stanner wrote:
> > On Mon, 2025-12-01 at 10:39 -0800, Matthew Brost wrote:
> > > Use new pending job list iterator and new helper functions in Xe to
> > > avoid reaching into DRM scheduler internals.
> > > 

[…]

> > > Signed-off-by: Matthew Brost <matthew.brost@intel.com>
> > > Reviewed-by: Niranjana Vishwanathapura <niranjana.vishwanathapura@intel.com>
> > > ---
> > >  drivers/gpu/drm/xe/xe_gpu_scheduler.c    |  4 +-
> > >  drivers/gpu/drm/xe/xe_gpu_scheduler.h    | 33 ++--------
> > >  drivers/gpu/drm/xe/xe_guc_submit.c       | 81 ++++++------------------
> > >  drivers/gpu/drm/xe/xe_guc_submit_types.h | 11 ----
> > >  drivers/gpu/drm/xe/xe_hw_fence.c         | 16 -----
> > >  drivers/gpu/drm/xe/xe_hw_fence.h         |  2 -
> > >  6 files changed, 27 insertions(+), 120 deletions(-)
> > > 
> > > diff --git a/drivers/gpu/drm/xe/xe_gpu_scheduler.c b/drivers/gpu/drm/xe/xe_gpu_scheduler.c
> > > index f4f23317191f..9c8004d5dd91 100644
> > > --- a/drivers/gpu/drm/xe/xe_gpu_scheduler.c
> > > +++ b/drivers/gpu/drm/xe/xe_gpu_scheduler.c
> > > @@ -7,7 +7,7 @@
> > >  
> > >  static void xe_sched_process_msg_queue(struct xe_gpu_scheduler *sched)
> > >  {
> > > -	if (!READ_ONCE(sched->base.pause_submit))
> > > +	if (!drm_sched_is_stopped(&sched->base))
> > >  		queue_work(sched->base.submit_wq, &sched->work_process_msg);
> > 
> > Sharing the submit_wq is legal. But next-level cleanness would be if
> > struct drm_gpu_scheduler's internal components wouldn't be touched.
> > That's kind of a luxury request, though.
> > 
> 
> Yes, perhaps a helper to extract the submit_wq too.

Could work; but best would be if drivers store their own pointer.
That's not always possible (Boris recently tried to do it for Panthor),
but often it is.
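
Something along these lines on the Xe side is what I mean (only a sketch;
the msg_wq field is made up by me, the point is just to cache whatever wq
the driver already hands to drm_sched_init() instead of reaching into
sched->base):

struct xe_gpu_scheduler {
	struct drm_gpu_scheduler	base;
	/* hypothetical: driver-owned copy of the wq passed at init time */
	struct workqueue_struct		*msg_wq;
	/* ... existing members ... */
};

static void xe_sched_process_msg_queue(struct xe_gpu_scheduler *sched)
{
	if (!drm_sched_is_stopped(&sched->base))
		queue_work(sched->msg_wq, &sched->work_process_msg);
}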

> 
> > >  }
> > >  
> > > @@ -43,7 +43,7 @@ static void xe_sched_process_msg_work(struct work_struct *w)
> > >  		container_of(w, struct xe_gpu_scheduler, work_process_msg);
> > >  	struct xe_sched_msg *msg;
> > >  
> > > -	if (READ_ONCE(sched->base.pause_submit))
> > > +	if (drm_sched_is_stopped(&sched->base))
> > >  		return;
> > >  
> > >  	msg = xe_sched_get_msg(sched);
> > > diff --git a/drivers/gpu/drm/xe/xe_gpu_scheduler.h b/drivers/gpu/drm/xe/xe_gpu_scheduler.h
> > > index dceb2cd0ee5b..664c2db56af3 100644
> > > --- a/drivers/gpu/drm/xe/xe_gpu_scheduler.h
> > > +++ b/drivers/gpu/drm/xe/xe_gpu_scheduler.h
> > > @@ -56,12 +56,9 @@ static inline void xe_sched_resubmit_jobs(struct xe_gpu_scheduler *sched)
> > >  	struct drm_sched_job *s_job;
> > >  	bool restore_replay = false;
> > >  
> > > -	list_for_each_entry(s_job, &sched->base.pending_list, list) {
> > > -		struct drm_sched_fence *s_fence = s_job->s_fence;
> > > -		struct dma_fence *hw_fence = s_fence->parent;
> > > -
> > > +	drm_sched_for_each_pending_job(s_job, &sched->base, NULL) {
> > >  		restore_replay |= to_xe_sched_job(s_job)->restore_replay;
> > > -		if (restore_replay || (hw_fence && !dma_fence_is_signaled(hw_fence)))
> > > +		if (restore_replay || !drm_sched_job_is_signaled(s_job))
> > 
> > So that's where this function is needed. You check whether that job in
> > the pending_list is signaled. 
> > 
> 
> Yes, during GT reset flows (think a device level reset) it is possible
> we stop the scheduler in the window between a job signaling and
> free_job being called. We want to avoid resubmission of jobs which have
> already signaled.

I'm not so convinced then that the function should be called
drm_sched_job_is_signaled(). A job is also associated with
s_fence.finished.

Why is it that this can race here, btw? Isn't it your driver which
signals the hardware fences? How and where? Interrupts?

> 
> > >  			sched->base.ops->run_job(s_job);
> > 
> > Aaaaaahm. So you invoke your own callback. But basically just to access
> > the function pointer I suppose?
> > 
> > Since this is effectively your drm_sched_resubmit_jobs(), it is
> > definitely desirable to provide a textbook example of how to do resets
> > so that others can follow your usage.
> > 
> 
> Yes, but drm_sched_resubmit_jobs() does some nonsense with dma-fence
> that I don’t need here. Honestly, I’m a little unsure what that is
> actually doing.
> 

Resubmit jobs shouldn't be used anymore, it's deprecated. What I mean
is that your code here effectively is the resubmission code. So if you
think that it's the right way of doing resets, in harmony with
drm_sched, then it would be good if this code is as tidy as possible
and preferably even commented, so that we can point other driver
programmers to this as an example of idiomatic usage.
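
E.g. roughly like this, i.e. the code from your patch with the kind of
comments I have in mind added by me (take it as a sketch, not as a demand
for this exact wording):

static inline void xe_sched_resubmit_jobs(struct xe_gpu_scheduler *sched)
{
	struct drm_sched_job *s_job;
	bool restore_replay = false;

	/* Only legal while the scheduler is stopped; the iterator WARNs otherwise. */
	drm_sched_for_each_pending_job(s_job, &sched->base, NULL) {
		/*
		 * VF restore marks the first job to replay; from that job
		 * onwards everything is re-run regardless of signaled state.
		 */
		restore_replay |= to_xe_sched_job(s_job)->restore_replay;

		/*
		 * Skip jobs which already signaled before the scheduler was
		 * stopped, so a GT reset does not execute them twice.
		 */
		if (restore_replay || !drm_sched_job_is_signaled(s_job))
			sched->base.ops->run_job(s_job);
	}
}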


>  We also use this function during VF restore after
> migration. This is a multi-step process that needs to operate on the
> same set of jobs at each step of the restore. That’s what the
> restore_replay variable represents: it marks a job at the very beginning
> of the restore process, and each step along the way ensures execution
> starts at that job. Technically, once we are here in a VF restore, jobs can
> start signaling as the hardware is live. So some of this really is
> vendor-specific.
> 
> > Can't you replace ops->run_job() with a call to your functions where
> > you push the jobs to the ring, directly?
> > 
> 
> Yes, we could, but that function isn’t currently exported. Also, in
> future products, we may assign a different run_job vfunc based on
> hardware generation or queue type. So using a vfunc here makes sense as
> a bit of future-proofing. Of course, we could also have a DRM
> scheduler-level helper that invokes run_job for us.

OK.

But the same comment as above applies: having distinct pointers would be the cleanest thing.
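
And if we do end up wanting a drm_sched-level helper for invoking run_job,
I'd imagine it being as trivial as this (sketch only, not part of the
series):

static inline struct dma_fence *
drm_sched_job_run(struct drm_sched_job *job)
{
	return job->sched->ops->run_job(job);
}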


P.


^ permalink raw reply	[flat|nested] 27+ messages in thread

end of thread, other threads:[~2025-12-08 13:44 UTC | newest]

Thread overview: 27+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2025-12-01 18:39 [PATCH v7 0/9] Fix DRM scheduler layering violations in Xe Matthew Brost
2025-12-01 18:39 ` [PATCH v7 1/9] drm/sched: Add several job helpers to avoid drivers touching scheduler state Matthew Brost
2025-12-03  8:56   ` Philipp Stanner
2025-12-03 21:10     ` Matthew Brost
2025-12-01 18:39 ` [PATCH v7 2/9] drm/sched: Add pending job list iterator Matthew Brost
2025-12-03  9:07   ` Philipp Stanner
2025-12-03 10:28     ` Philipp Stanner
2025-12-04 16:04     ` Alex Deucher
2025-12-05  9:19       ` Christian König
2025-12-05 18:54         ` Matthew Brost
2025-12-08 13:33         ` Philipp Stanner
2025-12-01 18:39 ` [PATCH v7 3/9] drm/xe: Add dedicated message lock Matthew Brost
2025-12-03  9:38   ` Philipp Stanner
2025-12-01 18:39 ` [PATCH v7 4/9] drm/xe: Stop abusing DRM scheduler internals Matthew Brost
2025-12-03 10:56   ` Philipp Stanner
2025-12-03 20:44     ` Matthew Brost
2025-12-08 13:44       ` Philipp Stanner
2025-12-01 18:39 ` [PATCH v7 5/9] drm/xe: Only toggle scheduling in TDR if GuC is running Matthew Brost
2025-12-01 18:39 ` [PATCH v7 6/9] drm/xe: Do not deregister queues in TDR Matthew Brost
2025-12-01 18:39 ` [PATCH v7 7/9] drm/xe: Remove special casing for LR queues in submission Matthew Brost
2025-12-01 18:39 ` [PATCH v7 8/9] drm/xe: Disable timestamp WA on VFs Matthew Brost
2025-12-02  6:42   ` Umesh Nerlige Ramappa
2025-12-01 18:39 ` [PATCH v7 9/9] drm/xe: Avoid toggling schedule state to check LRC timestamp in TDR Matthew Brost
2025-12-02  7:31   ` Umesh Nerlige Ramappa
2025-12-02 15:14     ` Matthew Brost
2025-12-03  1:23 ` [PATCH v7 0/9] Fix DRM scheduler layering violations in Xe Matthew Brost
2025-12-03  8:33   ` Philipp Stanner

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).