* [PATCH 0/7] Fix a couple of wedge corner-case memory leaks
@ 2025-10-20 21:45 Stuart Summers
2025-10-20 21:45 ` [PATCH 1/7] drm/xe: Add additional trace points for LRCs Stuart Summers
` (6 more replies)
0 siblings, 7 replies; 15+ messages in thread
From: Stuart Summers @ 2025-10-20 21:45 UTC (permalink / raw)
Cc: intel-xe, matthew.brost, niranjana.vishwanathapura, zhanjun.dong,
shuicheng.lin, Stuart Summers
Most of the patches in this series just add some
debug hints to help track these leaks down. I split
them up in case we want to pick and choose which ones
to include in the tree. I found them useful.
Also add a little documentation on the various stages
handled within the GuC exec queue submission path.
The main patch of interest is the wedge cleanup fix
near the end of the series, which addresses corner
cases where the driver becomes wedged in the middle of
communication with the DRM scheduler or when the GuC
becomes unresponsive. In both of these cases there is
a chance we leak memory tied to exec queue members
such as the LRC and the LRC BO. That patch fixes those
scenarios.
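For reference, the teardown flow that gets cut short looks
roughly like this (pseudo-code only, with simplified names,
not the actual call chain):

	schedule_disable(q);		/* H2G, wait for GuC ack */
	deregister(q);			/* H2G, wait for GuC ack */
	__guc_exec_queue_destroy_async(q); /* drops LRC and LRC BO refs */

If the device wedges after an H2G is sent but before the GuC
responds (or before the DRM scheduler is started at all), the
final destroy step never runs and those references are leaked.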
v2: Address feedback from Matt:
- Let the DRM scheduler handle pausing/unpausing
- Still do the wait after scheduling disable/deregister
as with the previous patch, but skip the intermediate
software-based schedule disable using the "banned"
flag and instead jump straight to the deregister
handling, which fully resets the queue state.
Note that for this case I am seeing a hardware failure
after submitting to GuC but before receiving the
response from GuC. So even if we wedge in this case
(monitoring the hardware state change), the queue
itself is not wedged because of the active GuC
submission (CT is not stalled at that point).
v3: Add back in the xe pause checks and instead just kickstart
message handling in the guc_submit_fini() routine before
doing the async wait there.
v4: Handle the CT communication loss during wedge asynchronously
Also combine those last two patches into one to handle
wedge cleanup generally.
v5: Add a new patch with a little documentation on the GuC
submission handling stages.
Move the scheduler kickstart and destruction call on the
dangling queues into the wedged_fini() callback. These
now only get called for queues which are in an error
state - wedge was called, but the queues weren't fully
cleaned up, as indicated by the missing exec_queue
reference at the time of wedging.
Also fix the migration teardown ordering reference mistake
pointed out by Matt in the previous series rev.
Stuart Summers (7):
drm/xe: Add additional trace points for LRCs
drm/xe: Add a trace point for VM close
drm/xe: Add the BO pointer info to the BO trace
drm/xe: Add new exec queue trace points
drm/xe: Correct migration VM teardown order
drm/xe: Clean up GuC software state after a wedge
drm/xe/doc: Add GuC submission kernel-doc
Documentation/gpu/xe/index.rst | 1 +
Documentation/gpu/xe/xe_guc_submit.rst | 8 +++
drivers/gpu/drm/xe/xe_exec_queue.c | 4 ++
drivers/gpu/drm/xe/xe_guc_submit.c | 93 ++++++++++++++++++++++++--
drivers/gpu/drm/xe/xe_lrc.c | 4 ++
drivers/gpu/drm/xe/xe_lrc.h | 3 +
drivers/gpu/drm/xe/xe_migrate.c | 9 +--
drivers/gpu/drm/xe/xe_trace.h | 22 +++++-
drivers/gpu/drm/xe/xe_trace_bo.h | 12 +++-
drivers/gpu/drm/xe/xe_trace_lrc.h | 42 +++++++++++-
drivers/gpu/drm/xe/xe_vm.c | 2 +
11 files changed, 187 insertions(+), 13 deletions(-)
create mode 100644 Documentation/gpu/xe/xe_guc_submit.rst
--
2.34.1
^ permalink raw reply [flat|nested] 15+ messages in thread
* [PATCH 1/7] drm/xe: Add additional trace points for LRCs
2025-10-20 21:45 [PATCH 0/7] Fix a couple of wedge corner-case memory leaks Stuart Summers
@ 2025-10-20 21:45 ` Stuart Summers
2025-10-20 21:45 ` [PATCH 2/7] drm/xe: Add a trace point for VM close Stuart Summers
` (5 subsequent siblings)
6 siblings, 0 replies; 15+ messages in thread
From: Stuart Summers @ 2025-10-20 21:45 UTC (permalink / raw)
Cc: intel-xe, matthew.brost, niranjana.vishwanathapura, zhanjun.dong,
shuicheng.lin, Stuart Summers
Add trace points to indicate when an LRC is created
or destroyed and when a reference is taken (get) or
dropped (put).
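For illustration only (not part of the patch), the new events
pair up around the existing refcounting, so an unbalanced
create/get vs. put/destroy count in the event log points
directly at a leaked LRC reference:

	lrc = xe_lrc_create(hwe, vm, ...);	/* remaining args elided   */
						/* -> xe_lrc_create        */
	xe_lrc_get(lrc);			/* -> xe_lrc_get           */
	...
	xe_lrc_put(lrc);			/* -> xe_lrc_put           */
	xe_lrc_put(lrc);			/* -> xe_lrc_put, then
						 * xe_lrc_destroy on the
						 * final reference         */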
Signed-off-by: Stuart Summers <stuart.summers@intel.com>
---
drivers/gpu/drm/xe/xe_lrc.c | 4 +++
drivers/gpu/drm/xe/xe_lrc.h | 3 +++
drivers/gpu/drm/xe/xe_trace_lrc.h | 42 ++++++++++++++++++++++++++++++-
3 files changed, 48 insertions(+), 1 deletion(-)
diff --git a/drivers/gpu/drm/xe/xe_lrc.c b/drivers/gpu/drm/xe/xe_lrc.c
index b5083c99dd50..42d1c861fe18 100644
--- a/drivers/gpu/drm/xe/xe_lrc.c
+++ b/drivers/gpu/drm/xe/xe_lrc.c
@@ -1575,6 +1575,8 @@ struct xe_lrc *xe_lrc_create(struct xe_hw_engine *hwe, struct xe_vm *vm,
return ERR_PTR(err);
}
+ trace_xe_lrc_create(lrc);
+
return lrc;
}
@@ -1589,6 +1591,8 @@ void xe_lrc_destroy(struct kref *ref)
{
struct xe_lrc *lrc = container_of(ref, struct xe_lrc, refcount);
+ trace_xe_lrc_destroy(lrc);
+
xe_lrc_finish(lrc);
kfree(lrc);
}
diff --git a/drivers/gpu/drm/xe/xe_lrc.h b/drivers/gpu/drm/xe/xe_lrc.h
index 2fb628da5c43..fd67810f9812 100644
--- a/drivers/gpu/drm/xe/xe_lrc.h
+++ b/drivers/gpu/drm/xe/xe_lrc.h
@@ -8,6 +8,7 @@
#include <linux/types.h>
#include "xe_lrc_types.h"
+#include "xe_trace_lrc.h"
struct drm_printer;
struct xe_bb;
@@ -61,6 +62,7 @@ void xe_lrc_destroy(struct kref *ref);
static inline struct xe_lrc *xe_lrc_get(struct xe_lrc *lrc)
{
kref_get(&lrc->refcount);
+ trace_xe_lrc_get(lrc);
return lrc;
}
@@ -73,6 +75,7 @@ static inline struct xe_lrc *xe_lrc_get(struct xe_lrc *lrc)
*/
static inline void xe_lrc_put(struct xe_lrc *lrc)
{
+ trace_xe_lrc_put(lrc);
kref_put(&lrc->refcount, xe_lrc_destroy);
}
diff --git a/drivers/gpu/drm/xe/xe_trace_lrc.h b/drivers/gpu/drm/xe/xe_trace_lrc.h
index d525cbee1e34..e8daa5d323e7 100644
--- a/drivers/gpu/drm/xe/xe_trace_lrc.h
+++ b/drivers/gpu/drm/xe/xe_trace_lrc.h
@@ -13,7 +13,6 @@
#include <linux/types.h>
#include "xe_gt_types.h"
-#include "xe_lrc.h"
#include "xe_lrc_types.h"
#define __dev_name_lrc(lrc) dev_name(gt_to_xe((lrc)->fence_ctx.gt)->drm.dev)
@@ -42,6 +41,47 @@ TRACE_EVENT(xe_lrc_update_timestamp,
__get_str(device_id))
);
+DECLARE_EVENT_CLASS(xe_lrc,
+ TP_PROTO(struct xe_lrc *lrc),
+ TP_ARGS(lrc),
+
+ TP_STRUCT__entry(
+ __field(struct xe_lrc *, lrc)
+ __string(name, lrc->fence_ctx.name)
+ __string(device_id, __dev_name_lrc(lrc))
+ ),
+
+ TP_fast_assign(
+ __entry->lrc = lrc;
+ __assign_str(name);
+ __assign_str(device_id);
+ ),
+
+ TP_printk("lrc=:%p lrc->name=%s device_id:%s",
+ __entry->lrc, __get_str(name),
+ __get_str(device_id))
+);
+
+DEFINE_EVENT(xe_lrc, xe_lrc_create,
+ TP_PROTO(struct xe_lrc *lrc),
+ TP_ARGS(lrc)
+);
+
+DEFINE_EVENT(xe_lrc, xe_lrc_destroy,
+ TP_PROTO(struct xe_lrc *lrc),
+ TP_ARGS(lrc)
+);
+
+DEFINE_EVENT(xe_lrc, xe_lrc_get,
+ TP_PROTO(struct xe_lrc *lrc),
+ TP_ARGS(lrc)
+);
+
+DEFINE_EVENT(xe_lrc, xe_lrc_put,
+ TP_PROTO(struct xe_lrc *lrc),
+ TP_ARGS(lrc)
+);
+
#endif
/* This part must be outside protection */
--
2.34.1
^ permalink raw reply related [flat|nested] 15+ messages in thread
* [PATCH 2/7] drm/xe: Add a trace point for VM close
2025-10-20 21:45 [PATCH 0/7] Fix a couple of wedge corner-case memory leaks Stuart Summers
2025-10-20 21:45 ` [PATCH 1/7] drm/xe: Add additional trace points for LRCs Stuart Summers
@ 2025-10-20 21:45 ` Stuart Summers
2025-10-20 21:45 ` [PATCH 3/7] drm/xe: Add the BO pointer info to the BO trace Stuart Summers
` (4 subsequent siblings)
6 siblings, 0 replies; 15+ messages in thread
From: Stuart Summers @ 2025-10-20 21:45 UTC (permalink / raw)
Cc: intel-xe, matthew.brost, niranjana.vishwanathapura, zhanjun.dong,
shuicheng.lin, Stuart Summers
Allow better tracking of error cases when calling through
xe_vm_close_and_put().
Signed-off-by: Stuart Summers <stuart.summers@intel.com>
---
drivers/gpu/drm/xe/xe_trace_bo.h | 6 ++++++
drivers/gpu/drm/xe/xe_vm.c | 2 ++
2 files changed, 8 insertions(+)
diff --git a/drivers/gpu/drm/xe/xe_trace_bo.h b/drivers/gpu/drm/xe/xe_trace_bo.h
index 86323cf3be2c..238311cfb816 100644
--- a/drivers/gpu/drm/xe/xe_trace_bo.h
+++ b/drivers/gpu/drm/xe/xe_trace_bo.h
@@ -14,6 +14,7 @@
#include "xe_bo.h"
#include "xe_bo_types.h"
+#include "xe_exec_queue_types.h"
#include "xe_vm.h"
#define __dev_name_bo(bo) dev_name(xe_bo_device(bo)->drm.dev)
@@ -223,6 +224,11 @@ DEFINE_EVENT(xe_vm, xe_vm_free,
TP_ARGS(vm)
);
+DEFINE_EVENT(xe_vm, xe_vm_close,
+ TP_PROTO(struct xe_vm *vm),
+ TP_ARGS(vm)
+);
+
DEFINE_EVENT(xe_vm, xe_vm_cpu_bind,
TP_PROTO(struct xe_vm *vm),
TP_ARGS(vm)
diff --git a/drivers/gpu/drm/xe/xe_vm.c b/drivers/gpu/drm/xe/xe_vm.c
index 179758ca7cb8..64b8a170efcf 100644
--- a/drivers/gpu/drm/xe/xe_vm.c
+++ b/drivers/gpu/drm/xe/xe_vm.c
@@ -1681,6 +1681,8 @@ static void xe_vm_close(struct xe_vm *vm)
bool bound;
int idx;
+ trace_xe_vm_close(vm);
+
bound = drm_dev_enter(&xe->drm, &idx);
down_write(&vm->lock);
--
2.34.1
^ permalink raw reply related [flat|nested] 15+ messages in thread
* [PATCH 3/7] drm/xe: Add the BO pointer info to the BO trace
2025-10-20 21:45 [PATCH 0/7] Fix a couple of wedge corner-case memory leaks Stuart Summers
2025-10-20 21:45 ` [PATCH 1/7] drm/xe: Add additional trace points for LRCs Stuart Summers
2025-10-20 21:45 ` [PATCH 2/7] drm/xe: Add a trace point for VM close Stuart Summers
@ 2025-10-20 21:45 ` Stuart Summers
2025-10-20 21:45 ` [PATCH 4/7] drm/xe: Add new exec queue trace points Stuart Summers
` (3 subsequent siblings)
6 siblings, 0 replies; 15+ messages in thread
From: Stuart Summers @ 2025-10-20 21:45 UTC (permalink / raw)
Cc: intel-xe, matthew.brost, niranjana.vishwanathapura, zhanjun.dong,
shuicheng.lin, Stuart Summers
Just a little extra detail to make BOs easier to track
through the trace event log.
Signed-off-by: Stuart Summers <stuart.summers@intel.com>
---
drivers/gpu/drm/xe/xe_trace_bo.h | 6 ++++--
1 file changed, 4 insertions(+), 2 deletions(-)
diff --git a/drivers/gpu/drm/xe/xe_trace_bo.h b/drivers/gpu/drm/xe/xe_trace_bo.h
index 238311cfb816..0d795733e0f0 100644
--- a/drivers/gpu/drm/xe/xe_trace_bo.h
+++ b/drivers/gpu/drm/xe/xe_trace_bo.h
@@ -27,6 +27,7 @@ DECLARE_EVENT_CLASS(xe_bo,
TP_STRUCT__entry(
__string(dev, __dev_name_bo(bo))
+ __field(struct xe_bo *, bo)
__field(size_t, size)
__field(u32, flags)
__field(struct xe_vm *, vm)
@@ -34,13 +35,14 @@ DECLARE_EVENT_CLASS(xe_bo,
TP_fast_assign(
__assign_str(dev);
+ __entry->bo = bo;
__entry->size = xe_bo_size(bo);
__entry->flags = bo->flags;
__entry->vm = bo->vm;
),
- TP_printk("dev=%s, size=%zu, flags=0x%02x, vm=%p",
- __get_str(dev), __entry->size,
+ TP_printk("dev=%s, %p, size=%zu, flags=0x%02x, vm=%p",
+ __get_str(dev), __entry->bo, __entry->size,
__entry->flags, __entry->vm)
);
--
2.34.1
^ permalink raw reply related [flat|nested] 15+ messages in thread
* [PATCH 4/7] drm/xe: Add new exec queue trace points
2025-10-20 21:45 [PATCH 0/7] Fix a couple of wedge corner-case memory leaks Stuart Summers
` (2 preceding siblings ...)
2025-10-20 21:45 ` [PATCH 3/7] drm/xe: Add the BO pointer info to the BO trace Stuart Summers
@ 2025-10-20 21:45 ` Stuart Summers
2025-10-20 21:45 ` [PATCH 5/7] drm/xe: Correct migration VM teardown order Stuart Summers
` (2 subsequent siblings)
6 siblings, 0 replies; 15+ messages in thread
From: Stuart Summers @ 2025-10-20 21:45 UTC (permalink / raw)
Cc: intel-xe, matthew.brost, niranjana.vishwanathapura, zhanjun.dong,
shuicheng.lin, Stuart Summers
Add separate exec queue and GuC exec queue trace points
to distinguish which part of the stack is executing.
This is helpful because several of the GuC-specific
paths rely on responses from the GuC, and it is useful
to view those separately in the event the GuC stops
responding in the middle of an operation that would
otherwise expect a response.
Also add the exec queue pointer to the trace events
for easier tracking. Contexts (guc_ids) can get re-used,
so this makes grepping a little easier in this type
of debug.
Signed-off-by: Stuart Summers <stuart.summers@intel.com>
---
drivers/gpu/drm/xe/xe_exec_queue.c | 4 ++++
drivers/gpu/drm/xe/xe_guc_submit.c | 11 +++++++----
drivers/gpu/drm/xe/xe_trace.h | 22 ++++++++++++++++++++--
3 files changed, 31 insertions(+), 6 deletions(-)
diff --git a/drivers/gpu/drm/xe/xe_exec_queue.c b/drivers/gpu/drm/xe/xe_exec_queue.c
index 90cbc95f8e2e..a2ef381cf6d9 100644
--- a/drivers/gpu/drm/xe/xe_exec_queue.c
+++ b/drivers/gpu/drm/xe/xe_exec_queue.c
@@ -377,6 +377,8 @@ void xe_exec_queue_destroy(struct kref *ref)
struct xe_exec_queue *q = container_of(ref, struct xe_exec_queue, refcount);
struct xe_exec_queue *eq, *next;
+ trace_xe_exec_queue_destroy(q);
+
if (xe_exec_queue_uses_pxp(q))
xe_pxp_exec_queue_remove(gt_to_xe(q->gt)->pxp, q);
@@ -953,6 +955,8 @@ void xe_exec_queue_kill(struct xe_exec_queue *q)
{
struct xe_exec_queue *eq = q, *next;
+ trace_xe_exec_queue_kill(q);
+
list_for_each_entry_safe(eq, next, &eq->multi_gt_list,
multi_gt_link) {
q->ops->kill(eq);
diff --git a/drivers/gpu/drm/xe/xe_guc_submit.c b/drivers/gpu/drm/xe/xe_guc_submit.c
index e9aa0625ce60..5ec1e4a83d68 100644
--- a/drivers/gpu/drm/xe/xe_guc_submit.c
+++ b/drivers/gpu/drm/xe/xe_guc_submit.c
@@ -1003,9 +1003,12 @@ void xe_guc_submit_wedge(struct xe_guc *guc)
}
mutex_lock(&guc->submission_state.lock);
- xa_for_each(&guc->submission_state.exec_queue_lookup, index, q)
- if (xe_exec_queue_get_unless_zero(q))
+ xa_for_each(&guc->submission_state.exec_queue_lookup, index, q) {
+ if (xe_exec_queue_get_unless_zero(q)) {
set_exec_queue_wedged(q);
+ trace_xe_exec_queue_wedge(q);
+ }
+ }
mutex_unlock(&guc->submission_state.lock);
}
@@ -1455,7 +1458,7 @@ static void __guc_exec_queue_destroy_async(struct work_struct *w)
struct xe_guc *guc = exec_queue_to_guc(q);
xe_pm_runtime_get(guc_to_xe(guc));
- trace_xe_exec_queue_destroy(q);
+ trace_xe_guc_exec_queue_destroy(q);
if (xe_exec_queue_is_lr(q))
cancel_work_sync(&ge->lr_tdr);
@@ -1716,7 +1719,7 @@ static int guc_exec_queue_init(struct xe_exec_queue *q)
static void guc_exec_queue_kill(struct xe_exec_queue *q)
{
- trace_xe_exec_queue_kill(q);
+ trace_xe_guc_exec_queue_kill(q);
set_exec_queue_killed(q);
__suspend_fence_signal(q);
xe_guc_exec_queue_trigger_cleanup(q);
diff --git a/drivers/gpu/drm/xe/xe_trace.h b/drivers/gpu/drm/xe/xe_trace.h
index 314f42fcbcbd..a5dd0c48d894 100644
--- a/drivers/gpu/drm/xe/xe_trace.h
+++ b/drivers/gpu/drm/xe/xe_trace.h
@@ -71,6 +71,7 @@ DECLARE_EVENT_CLASS(xe_exec_queue,
TP_STRUCT__entry(
__string(dev, __dev_name_eq(q))
+ __field(struct xe_exec_queue *, q)
__field(enum xe_engine_class, class)
__field(u32, logical_mask)
__field(u8, gt_id)
@@ -82,6 +83,7 @@ DECLARE_EVENT_CLASS(xe_exec_queue,
TP_fast_assign(
__assign_str(dev);
+ __entry->q = q;
__entry->class = q->class;
__entry->logical_mask = q->logical_mask;
__entry->gt_id = q->gt->info.id;
@@ -91,8 +93,9 @@ DECLARE_EVENT_CLASS(xe_exec_queue,
__entry->flags = q->flags;
),
- TP_printk("dev=%s, %d:0x%x, gt=%d, width=%d, guc_id=%d, guc_state=0x%x, flags=0x%x",
- __get_str(dev), __entry->class, __entry->logical_mask,
+ TP_printk("dev=%s, %p, %d:0x%x, gt=%d, width=%d, guc_id=%d, guc_state=0x%x, flags=0x%x",
+ __get_str(dev), __entry->q,
+ __entry->class, __entry->logical_mask,
__entry->gt_id, __entry->width, __entry->guc_id,
__entry->guc_state, __entry->flags)
);
@@ -147,11 +150,21 @@ DEFINE_EVENT(xe_exec_queue, xe_exec_queue_close,
TP_ARGS(q)
);
+DEFINE_EVENT(xe_exec_queue, xe_exec_queue_wedge,
+ TP_PROTO(struct xe_exec_queue *q),
+ TP_ARGS(q)
+);
+
DEFINE_EVENT(xe_exec_queue, xe_exec_queue_kill,
TP_PROTO(struct xe_exec_queue *q),
TP_ARGS(q)
);
+DEFINE_EVENT(xe_exec_queue, xe_guc_exec_queue_kill,
+ TP_PROTO(struct xe_exec_queue *q),
+ TP_ARGS(q)
+);
+
DEFINE_EVENT(xe_exec_queue, xe_exec_queue_cleanup_entity,
TP_PROTO(struct xe_exec_queue *q),
TP_ARGS(q)
@@ -162,6 +175,11 @@ DEFINE_EVENT(xe_exec_queue, xe_exec_queue_destroy,
TP_ARGS(q)
);
+DEFINE_EVENT(xe_exec_queue, xe_guc_exec_queue_destroy,
+ TP_PROTO(struct xe_exec_queue *q),
+ TP_ARGS(q)
+);
+
DEFINE_EVENT(xe_exec_queue, xe_exec_queue_reset,
TP_PROTO(struct xe_exec_queue *q),
TP_ARGS(q)
--
2.34.1
^ permalink raw reply related [flat|nested] 15+ messages in thread
* [PATCH 5/7] drm/xe: Correct migration VM teardown order
2025-10-20 21:45 [PATCH 0/7] Fix a couple of wedge corner-case memory leaks Stuart Summers
` (3 preceding siblings ...)
2025-10-20 21:45 ` [PATCH 4/7] drm/xe: Add new exec queue trace points Stuart Summers
@ 2025-10-20 21:45 ` Stuart Summers
2025-10-22 20:30 ` Matthew Brost
2025-10-20 21:45 ` [PATCH 6/7] drm/xe: Clean up GuC software state after a wedge Stuart Summers
2025-10-20 21:45 ` [PATCH 7/7] drm/xe/doc: Add GuC submission kernel-doc Stuart Summers
6 siblings, 1 reply; 15+ messages in thread
From: Stuart Summers @ 2025-10-20 21:45 UTC (permalink / raw)
Cc: intel-xe, matthew.brost, niranjana.vishwanathapura, zhanjun.dong,
shuicheng.lin, Stuart Summers
Adjust the sequence of the migration teardown to match what
is happening in the init() function.
v2: Take a reference to the migrate queue before put (Matt)
Signed-off-by: Stuart Summers <stuart.summers@intel.com>
---
drivers/gpu/drm/xe/xe_migrate.c | 9 +++++----
1 file changed, 5 insertions(+), 4 deletions(-)
diff --git a/drivers/gpu/drm/xe/xe_migrate.c b/drivers/gpu/drm/xe/xe_migrate.c
index 7345a5b65169..ae42917918e5 100644
--- a/drivers/gpu/drm/xe/xe_migrate.c
+++ b/drivers/gpu/drm/xe/xe_migrate.c
@@ -91,17 +91,18 @@ struct xe_migrate {
static void xe_migrate_fini(void *arg)
{
struct xe_migrate *m = arg;
+ struct xe_exec_queue *q = m->q;
- xe_vm_lock(m->q->vm, false);
+ xe_vm_lock(q->vm, false);
xe_bo_unpin(m->pt_bo);
- xe_vm_unlock(m->q->vm);
+ xe_vm_unlock(q->vm);
dma_fence_put(m->fence);
xe_bo_put(m->pt_bo);
drm_suballoc_manager_fini(&m->vm_update_sa);
mutex_destroy(&m->job_mutex);
- xe_vm_close_and_put(m->q->vm);
- xe_exec_queue_put(m->q);
+ xe_exec_queue_put(q);
+ xe_vm_close_and_put(q->vm);
}
static u64 xe_migrate_vm_addr(u64 slot, u32 level)
--
2.34.1
^ permalink raw reply related [flat|nested] 15+ messages in thread
* [PATCH 6/7] drm/xe: Clean up GuC software state after a wedge
2025-10-20 21:45 [PATCH 0/7] Fix a couple of wedge corner-case memory leaks Stuart Summers
` (4 preceding siblings ...)
2025-10-20 21:45 ` [PATCH 5/7] drm/xe: Correct migration VM teardown order Stuart Summers
@ 2025-10-20 21:45 ` Stuart Summers
2025-10-22 21:15 ` Matthew Brost
2025-10-20 21:45 ` [PATCH 7/7] drm/xe/doc: Add GuC submission kernel-doc Stuart Summers
6 siblings, 1 reply; 15+ messages in thread
From: Stuart Summers @ 2025-10-20 21:45 UTC (permalink / raw)
Cc: intel-xe, matthew.brost, niranjana.vishwanathapura, zhanjun.dong,
shuicheng.lin, Stuart Summers
When the driver is wedged during a hardware failure, there
is a chance the queue kill coming from those events can
race with either the scheduler teardown or the queue
deregistration with GuC. Basically the following two
scenarios can occur (from event trace):
Scheduler start missing:
xe_exec_queue_create
xe_exec_queue_kill
xe_guc_exec_queue_kill
xe_exec_queue_destroy
GuC CT response missing:
xe_exec_queue_create
xe_exec_queue_register
xe_exec_queue_scheduling_enable
xe_exec_queue_scheduling_done
xe_exec_queue_kill
xe_guc_exec_queue_kill
xe_exec_queue_close
xe_exec_queue_destroy
xe_exec_queue_cleanup_entity
xe_exec_queue_scheduling_disable
The above traces depend also on inclusion of [1].
In the first scenario, the queue is created, but killed
prior to completing the message cleanup. In the second,
we go through a full registration before killing. The
CT communication happens in that last call to
xe_exec_queue_scheduling_disable.
We expect to then get a call to xe_guc_exec_queue_destroy
in both cases if the aforementioned scheduler/GuC CT communication
had happened, which we are missing here, hence missing any
LRC/BO cleanup in the exec queues in question.
Since this sequence seems specific to the wedge case
as described above, add a targeted scheduler start
and guc deregistration handler to the wedged_fini()
routine.
Without this change, if we inject wedges in the above scenarios
we can expect the following when the DRM memory tracking is
enabled (see CONFIG_DRM_DEBUG_MM):
[ 129.600285] [drm:drm_mm_takedown] *ERROR* node [00647000 + 00008000]: inserted at
drm_mm_insert_node_in_range+0x2ec/0x4b0
__xe_ggtt_insert_bo_at+0x10f/0x360 [xe]
__xe_bo_create_locked+0x184/0x520 [xe]
xe_bo_create_pin_map_at_aligned+0x3b/0x180 [xe]
xe_bo_create_pin_map+0x13/0x20 [xe]
xe_lrc_create+0x139/0x18e0 [xe]
xe_exec_queue_create+0x22f/0x3e0 [xe]
xe_exec_queue_create_ioctl+0x4e9/0xbf0 [xe]
drm_ioctl_kernel+0x9f/0xf0
drm_ioctl+0x20f/0x440
xe_drm_ioctl+0x121/0x150 [xe]
__x64_sys_ioctl+0x8c/0xe0
do_syscall_64+0x4c/0x1d0
entry_SYSCALL_64_after_hwframe+0x76/0x7e
[ 129.601966] [drm:drm_mm_takedown] *ERROR* node [0064f000 + 00008000]: inserted at
drm_mm_insert_node_in_range+0x2ec/0x4b0
__xe_ggtt_insert_bo_at+0x10f/0x360 [xe]
__xe_bo_create_locked+0x184/0x520 [xe]
xe_bo_create_pin_map_at_aligned+0x3b/0x180 [xe]
xe_bo_create_pin_map+0x13/0x20 [xe]
xe_lrc_create+0x139/0x18e0 [xe]
xe_exec_queue_create+0x22f/0x3e0 [xe]
xe_exec_queue_create_bind+0x7f/0xd0 [xe]
xe_vm_create+0x4aa/0x8b0 [xe]
xe_vm_create_ioctl+0x17b/0x420 [xe]
drm_ioctl_kernel+0x9f/0xf0
drm_ioctl+0x20f/0x440
xe_drm_ioctl+0x121/0x150 [xe]
__x64_sys_ioctl+0x8c/0xe0
do_syscall_64+0x4c/0x1d0
entry_SYSCALL_64_after_hwframe+0x76/0x7e
Signed-off-by: Stuart Summers <stuart.summers@intel.com>
[1] https://patchwork.freedesktop.org/patch/680852/?series=155352&rev=4
---
drivers/gpu/drm/xe/xe_guc_submit.c | 12 ++++++++++++
1 file changed, 12 insertions(+)
diff --git a/drivers/gpu/drm/xe/xe_guc_submit.c b/drivers/gpu/drm/xe/xe_guc_submit.c
index 5ec1e4a83d68..a11ae4e70809 100644
--- a/drivers/gpu/drm/xe/xe_guc_submit.c
+++ b/drivers/gpu/drm/xe/xe_guc_submit.c
@@ -287,6 +287,8 @@ static void guc_submit_fini(struct drm_device *drm, void *arg)
xa_destroy(&guc->submission_state.exec_queue_lookup);
}
+static void __guc_exec_queue_destroy(struct xe_guc *guc, struct xe_exec_queue *q);
+
static void guc_submit_wedged_fini(void *arg)
{
struct xe_guc *guc = arg;
@@ -299,6 +301,16 @@ static void guc_submit_wedged_fini(void *arg)
mutex_unlock(&guc->submission_state.lock);
xe_exec_queue_put(q);
mutex_lock(&guc->submission_state.lock);
+ } else {
+ /*
+ * Make sure queues which were killed as part of a
+ * wedge are cleaned up properly. Clean up any
+ * dangling scheduler tasks and pending exec queue
+ * deregistration.
+ */
+ xe_sched_submission_start(&q->guc->sched);
+ if (exec_queue_pending_disable(q))
+ __guc_exec_queue_destroy(guc, q);
}
}
mutex_unlock(&guc->submission_state.lock);
--
2.34.1
^ permalink raw reply related [flat|nested] 15+ messages in thread
* [PATCH 7/7] drm/xe/doc: Add GuC submission kernel-doc
2025-10-20 21:45 [PATCH 0/7] Fix a couple of wedge corner-case memory leaks Stuart Summers
` (5 preceding siblings ...)
2025-10-20 21:45 ` [PATCH 6/7] drm/xe: Clean up GuC software state after a wedge Stuart Summers
@ 2025-10-20 21:45 ` Stuart Summers
2025-10-20 22:05 ` Matthew Brost
6 siblings, 1 reply; 15+ messages in thread
From: Stuart Summers @ 2025-10-20 21:45 UTC (permalink / raw)
Cc: intel-xe, matthew.brost, niranjana.vishwanathapura, zhanjun.dong,
shuicheng.lin, Stuart Summers
Add some documentation for the different roles played by the XeKMD
in communication with the GuC through exec queue management and
the DRM scheduler.
Signed-off-by: Stuart Summers <stuart.summers@intel.com>
---
Documentation/gpu/xe/index.rst | 1 +
Documentation/gpu/xe/xe_guc_submit.rst | 8 +++
drivers/gpu/drm/xe/xe_guc_submit.c | 70 ++++++++++++++++++++++++++
3 files changed, 79 insertions(+)
create mode 100644 Documentation/gpu/xe/xe_guc_submit.rst
diff --git a/Documentation/gpu/xe/index.rst b/Documentation/gpu/xe/index.rst
index bc432c95d1a3..030268ddba87 100644
--- a/Documentation/gpu/xe/index.rst
+++ b/Documentation/gpu/xe/index.rst
@@ -15,6 +15,7 @@ DG2, etc is provided to prototype the driver.
xe_map
xe_migrate
xe_exec_queue
+ xe_guc_submit
xe_cs
xe_pm
xe_gt_freq
diff --git a/Documentation/gpu/xe/xe_guc_submit.rst b/Documentation/gpu/xe/xe_guc_submit.rst
new file mode 100644
index 000000000000..a82b5d57ee6a
--- /dev/null
+++ b/Documentation/gpu/xe/xe_guc_submit.rst
@@ -0,0 +1,8 @@
+.. SPDX-License-Identifier: (GPL-2.0+ OR MIT)
+
+==============
+GuC Submission
+==============
+
+.. kernel-doc:: drivers/gpu/drm/xe/xe_guc_submit.c
+ :doc: GuC Submission
diff --git a/drivers/gpu/drm/xe/xe_guc_submit.c b/drivers/gpu/drm/xe/xe_guc_submit.c
index a11ae4e70809..fd05f73cd96b 100644
--- a/drivers/gpu/drm/xe/xe_guc_submit.c
+++ b/drivers/gpu/drm/xe/xe_guc_submit.c
@@ -46,6 +46,76 @@
#include "xe_trace.h"
#include "xe_vm.h"
+/**
+ * DOC: GuC Submission
+ *
+ * The primary submission vehicle for the XeKMD is the GuC firmware.
+ * This includes creating, registering, and submitting exec queues
+ * (see :ref:`execution-queue`). Once submitted, actual hardware scheduling
+ * is handled fully by the GuC.
+ *
+ * To facilitate these operations, XeKMD makes use of the DRM scheduler
+ * to handle state changes and determine when jobs should be scheduled
+ * with the GuC.
+ *
+ * Queue Registration
+ * ==================
+ *
+ * Registration and deregistration of exec queues with the GuC involves
+ * creation of the ring (LRC) for that context. These are asynchronous
+ * actions by the XeKMD which completes that registration/deregistration
+ * process once the GuC responds that the hardware registration has
+ * completed.
+ *
+ * Queue Submission
+ * ================
+ *
+ * The GuC has a parameter for each context indicating whether that
+ * context can be scheduled with the associated engine. After registration
+ * has completed, the KMD, typically as a reaction to a request from
+ * the user exec IOCTL call, will issue a schedule request with the
+ * GuC. The GuC will then add that to its schedule for contexts
+ * and will in turn schedule this with the command streamer based
+ * on time slice availability.
+ *
+ * Queue Destruction
+ * =================
+ *
+ * As mentioned above, before destroying a queue, it must first be
+ * deregistered. In the typical case, this happens first by sending
+ * a schedule disable request, then on confirmation from the GuC,
+ * a deregistration request. Only after that deregistration response
+ * will the XeKMD free resources associated with that queue such as
+ * the LRC.
+ *
+ * Engine Resets
+ * =============
+ *
+ * Any time more than one context is being scheduled by the GuC, the GuC
+ * will give each context a certain configurable time slice. If a context
+ * exceeds this time slice, the GuC will reset the engine and notify the XeKMD
+ * along with an engine state at the time of the reset. The XeKMD will then
+ * resubmit any outstanding jobs to that engine. If a particular job has been
+ * resubmitted in this way more than once, the associated exec queue will be
+ * marked banned and will no longer be schedulable on that engine.
+ *
+ * Queue Wedging
+ * =============
+ *
+ * On device wedging (see :ref:`device-wedging`), each exec queue is also marked
+ * as wedged. This includes taking a reference to that device and adding a
+ * dereference on driver teardown. This additional reference allows a user
+ * to collect certain debug information about the exec queue prior to driver
+ * unbind.
+ *
+ * There are some instances where the device wedge might not take a reference
+ * in this way, for instance, if the context is in the process of being
+ * deregistered with the GuC at the time of the wedge. In this case, a
+ * reference to that exec queue should already be available and any
+ * associated data structures will be cleaned up later in the teardown
+ * sequence.
+ */
+
static struct xe_guc *
exec_queue_to_guc(struct xe_exec_queue *q)
{
--
2.34.1
^ permalink raw reply related [flat|nested] 15+ messages in thread
* Re: [PATCH 7/7] drm/xe/doc: Add GuC submission kernel-doc
2025-10-20 21:45 ` [PATCH 7/7] drm/xe/doc: Add GuC submission kernel-doc Stuart Summers
@ 2025-10-20 22:05 ` Matthew Brost
2025-10-20 22:07 ` Summers, Stuart
0 siblings, 1 reply; 15+ messages in thread
From: Matthew Brost @ 2025-10-20 22:05 UTC (permalink / raw)
To: Stuart Summers
Cc: intel-xe, niranjana.vishwanathapura, zhanjun.dong, shuicheng.lin
On Mon, Oct 20, 2025 at 09:45:29PM +0000, Stuart Summers wrote:
> Add some documentation for the different roles played by the XeKMD
> in communication with the GuC through exec queue management and
> the DRM scheduler.
>
I've written fairly detailed GuC submission documentation but it's stuck
in review purgatory. Here is the most recent post [1].
Matt
[1] https://patchwork.freedesktop.org/patch/677980/?series=154627&rev=4
> Signed-off-by: Stuart Summers <stuart.summers@intel.com>
> ---
> Documentation/gpu/xe/index.rst | 1 +
> Documentation/gpu/xe/xe_guc_submit.rst | 8 +++
> drivers/gpu/drm/xe/xe_guc_submit.c | 70 ++++++++++++++++++++++++++
> 3 files changed, 79 insertions(+)
> create mode 100644 Documentation/gpu/xe/xe_guc_submit.rst
>
> diff --git a/Documentation/gpu/xe/index.rst b/Documentation/gpu/xe/index.rst
> index bc432c95d1a3..030268ddba87 100644
> --- a/Documentation/gpu/xe/index.rst
> +++ b/Documentation/gpu/xe/index.rst
> @@ -15,6 +15,7 @@ DG2, etc is provided to prototype the driver.
> xe_map
> xe_migrate
> xe_exec_queue
> + xe_guc_submit
> xe_cs
> xe_pm
> xe_gt_freq
> diff --git a/Documentation/gpu/xe/xe_guc_submit.rst b/Documentation/gpu/xe/xe_guc_submit.rst
> new file mode 100644
> index 000000000000..a82b5d57ee6a
> --- /dev/null
> +++ b/Documentation/gpu/xe/xe_guc_submit.rst
> @@ -0,0 +1,8 @@
> +.. SPDX-License-Identifier: (GPL-2.0+ OR MIT)
> +
> +==============
> +GuC Submission
> +==============
> +
> +.. kernel-doc:: drivers/gpu/drm/xe/xe_guc_submit.c
> + :doc: GuC Submission
> diff --git a/drivers/gpu/drm/xe/xe_guc_submit.c b/drivers/gpu/drm/xe/xe_guc_submit.c
> index a11ae4e70809..fd05f73cd96b 100644
> --- a/drivers/gpu/drm/xe/xe_guc_submit.c
> +++ b/drivers/gpu/drm/xe/xe_guc_submit.c
> @@ -46,6 +46,76 @@
> #include "xe_trace.h"
> #include "xe_vm.h"
>
> +/**
> + * DOC: GuC Submission
> + *
> + * The primary submission vehicle for the XeKMD is the GuC firmware.
> + * This includes creating, registering, and submitting exec queues
> + * (see :ref:`execution-queue`). Once submitted, actual hardware scheduling
> + * is handled fully by the GuC.
> + *
> + * To facilitate these operations, XeKMD makes use of the DRM scheduler
> + * to handle state changes and determine when jobs should be scheduled
> + * with the GuC.
> + *
> + * Queue Registration
> + * ==================
> + *
> + * Registration and deregistration of exec queues with the GuC involves
> + * creation of the ring (LRC) for that context. These are asynchronous
> + * actions by the XeKMD which completes that registration/deregistration
> + * process once the GuC responds that the hardware registration has
> + * completed.
> + *
> + * Queue Submission
> + * ================
> + *
> + * The GuC has a parameter for each context indicating whether that
> + * context can be scheduled with the associated engine. After registration
> + * has completed, the KMD, typically as a reaction to a request from
> + * the user exec IOCTL call, will issue a schedule request with the
> + * GuC. The GuC will then add that to its schedule for contexts
> + * and will in turn schedule this with the command streamer based
> + * on time slice availability.
> + *
> + * Queue Destruction
> + * =================
> + *
> + * As mentioned above, before destroying a queue, it must first be
> + * deregistered. In the typical case, this happens first by sending
> + * a schedule disable request, then on confirmation from the GuC,
> + * a deregistration request. Only after that deregistration response
> + * will the XeKMD free resources associated with that queue such as
> + * the LRC.
> + *
> + * Engine Resets
> + * =============
> + *
> + * Any time more than one context is being scheduled by the GuC, the GuC
> + * will give each context a certain configurable time slice. If a context
> + * exceeds this time slice, the GuC will reset the engine and notify the XeKMD
> + * along with an engine state at the time of the reset. The XeKMD will then
> + * resubmit any outstanding jobs to that engine. If a particular job has been
> + * resubmitted in this way more than once, the associated exec queue will be
> + * marked banned and will no longer be schedulable on that engine.
> + *
> + * Queue Wedging
> + * =============
> + *
> + * On device wedging (see :ref:`device-wedging`), each exec queue is also marked
> + * as wedged. This includes taking a reference to that device and adding a
> + * dereference on driver teardown. This additional reference allows a user
> + * to collect certain debug information about the exec queue prior to driver
> + * unbind.
> + *
> + * There are some instances where the device wedge might not take a reference
> + * in this way, for instance, if the context is in the process of being
> + * deregistered with the GuC at the time of the wedge. In this case, a
> + * reference to that exec queue should already be available and any
> + * associated data structures will be cleaned up later in the teardown
> + * sequence.
> + */
> +
> static struct xe_guc *
> exec_queue_to_guc(struct xe_exec_queue *q)
> {
> --
> 2.34.1
>
^ permalink raw reply [flat|nested] 15+ messages in thread
* Re: [PATCH 7/7] drm/xe/doc: Add GuC submission kernel-doc
2025-10-20 22:05 ` Matthew Brost
@ 2025-10-20 22:07 ` Summers, Stuart
0 siblings, 0 replies; 15+ messages in thread
From: Summers, Stuart @ 2025-10-20 22:07 UTC (permalink / raw)
To: Brost, Matthew
Cc: intel-xe@lists.freedesktop.org, Dong, Zhanjun, Lin, Shuicheng,
Vishwanathapura, Niranjana
On Mon, 2025-10-20 at 15:05 -0700, Matthew Brost wrote:
> On Mon, Oct 20, 2025 at 09:45:29PM +0000, Stuart Summers wrote:
> > Add some documentation for the different roles played by the XeKMD
> > in communication with the GuC through exec queue management and
> > the DRM scheduler.
> >
>
> I've written fairly detailed GuC submission documentation but it's
> stuck
> in review purgatory. Here is the most recent post [1].
Oh excellent I hadn't realized this was part of that series. Happy to
drop mine... I'll also take a look at your patch.
Thanks,
Stuart
>
> Matt
>
> [1]
> https://patchwork.freedesktop.org/patch/677980/?series=154627&rev=4
>
> > Signed-off-by: Stuart Summers <stuart.summers@intel.com>
> > ---
> > Documentation/gpu/xe/index.rst | 1 +
> > Documentation/gpu/xe/xe_guc_submit.rst | 8 +++
> > drivers/gpu/drm/xe/xe_guc_submit.c | 70
> > ++++++++++++++++++++++++++
> > 3 files changed, 79 insertions(+)
> > create mode 100644 Documentation/gpu/xe/xe_guc_submit.rst
> >
> > diff --git a/Documentation/gpu/xe/index.rst
> > b/Documentation/gpu/xe/index.rst
> > index bc432c95d1a3..030268ddba87 100644
> > --- a/Documentation/gpu/xe/index.rst
> > +++ b/Documentation/gpu/xe/index.rst
> > @@ -15,6 +15,7 @@ DG2, etc is provided to prototype the driver.
> > xe_map
> > xe_migrate
> > xe_exec_queue
> > + xe_guc_submit
> > xe_cs
> > xe_pm
> > xe_gt_freq
> > diff --git a/Documentation/gpu/xe/xe_guc_submit.rst
> > b/Documentation/gpu/xe/xe_guc_submit.rst
> > new file mode 100644
> > index 000000000000..a82b5d57ee6a
> > --- /dev/null
> > +++ b/Documentation/gpu/xe/xe_guc_submit.rst
> > @@ -0,0 +1,8 @@
> > +.. SPDX-License-Identifier: (GPL-2.0+ OR MIT)
> > +
> > +==============
> > +GuC Submission
> > +==============
> > +
> > +.. kernel-doc:: drivers/gpu/drm/xe/xe_guc_submit.c
> > + :doc: GuC Submission
> > diff --git a/drivers/gpu/drm/xe/xe_guc_submit.c
> > b/drivers/gpu/drm/xe/xe_guc_submit.c
> > index a11ae4e70809..fd05f73cd96b 100644
> > --- a/drivers/gpu/drm/xe/xe_guc_submit.c
> > +++ b/drivers/gpu/drm/xe/xe_guc_submit.c
> > @@ -46,6 +46,76 @@
> > #include "xe_trace.h"
> > #include "xe_vm.h"
> >
> > +/**
> > + * DOC: GuC Submission
> > + *
> > + * The primary submission vehicle for the XeKMD is the GuC
> > firmware.
> > + * This includes creating, registering, and submitting exec queues
> > + * (see :ref:`execution-queue`). Once submitted, actual hardware
> > scheduling
> > + * is handled fully by the GuC.
> > + *
> > + * To facilitate these operations, XeKMD makes use of the DRM
> > scheduler
> > + * to handle state changes and determine when jobs should be
> > scheduled
> > + * with the GuC.
> > + *
> > + * Queue Registration
> > + * ==================
> > + *
> > + * Registration and deregistration of exec queues with the GuC
> > involves
> > + * creation of the ring (LRC) for that context. These are
> > asynchronous
> > + * actions by the XeKMD which completes that
> > registration/deregistration
> > + * process once the GuC responds that the hardware registration
> > has
> > + * completed.
> > + *
> > + * Queue Submission
> > + * ================
> > + *
> > + * The GuC has a parameter for each context indicating whether
> > that
> > + * context can be scheduled with the associated engine. After
> > registration
> > + * has completed, the KMD, typically as a reaction to a request
> > from
> > + * the user exec IOCTL call, will issue a schedule request with
> > the
> > + * GuC. The GuC will then add that to its schedule for contexts
> > + * and will in turn schedule this with the command streamer based
> > + * on time slice availability.
> > + *
> > + * Queue Destruction
> > + * =================
> > + *
> > + * As mentioned above, before destroying a queue, it must first be
> > + * deregistered. In the typical case, this happens first by
> > sending
> > + * a schedule disable request, then on confirmation from the GuC,
> > + * a deregistration request. Only after that deregistration
> > response
> > + * will the XeKMD free resources associated with that queue such
> > as
> > + * the LRC.
> > + *
> > + * Engine Resets
> > + * =============
> > + *
> > + * Any time more than one context is being scheduled by the GuC,
> > the GuC
> > + * will give each context a certain configurable time slice. If a
> > context
> > + * exceeds this time slice, the GuC will reset the engine and
> > notify the XeKMD
> > + * along with an engine state at the time of the reset. The XeKMD
> > will then
> > + * resubmit any outstanding jobs to that engine. If a particular
> > job has been
> > + * resubmitted in this way more than once, the associated exec
> > queue will be
> > + * marked banned and will no longer be schedulable on that engine.
> > + *
> > + * Queue Wedging
> > + * =============
> > + *
> > + * On device wedging (see :ref:`device-wedging`), each exec queue
> > is also marked
> > + * as wedged. This includes taking a reference to that device and
> > adding a
> > + * dereference on driver teardown. This additional reference
> > allows a user
> > + * to collect certain debug information about the exec queue prior
> > to driver
> > + * unbind.
> > + *
> > + * There are some instances where the device wedge might not take
> > a reference
> > + * in this way, for instance, if the context is in the process of
> > being
> > + * deregistered with the GuC at the time of the wedge. In this
> > case, a
> > + * reference to that exec queue should already be available and
> > any
> > + * associated data structures will be cleaned up later in the
> > teardown
> > + * sequence.
> > + */
> > +
> > static struct xe_guc *
> > exec_queue_to_guc(struct xe_exec_queue *q)
> > {
> > --
> > 2.34.1
> >
^ permalink raw reply [flat|nested] 15+ messages in thread
* Re: [PATCH 5/7] drm/xe: Correct migration VM teardown order
2025-10-20 21:45 ` [PATCH 5/7] drm/xe: Correct migration VM teardown order Stuart Summers
@ 2025-10-22 20:30 ` Matthew Brost
2025-10-23 17:18 ` Summers, Stuart
0 siblings, 1 reply; 15+ messages in thread
From: Matthew Brost @ 2025-10-22 20:30 UTC (permalink / raw)
To: Stuart Summers
Cc: intel-xe, niranjana.vishwanathapura, zhanjun.dong, shuicheng.lin
On Mon, Oct 20, 2025 at 09:45:27PM +0000, Stuart Summers wrote:
> Adjust the sequence of the migration teardown to match what
> is happening in the init() function.
>
> v2: Take a reference to the migrate queue before put (Matt)
>
You need to store the VM on the stack, not the queue;
> Signed-off-by: Stuart Summers <stuart.summers@intel.com>
> ---
> drivers/gpu/drm/xe/xe_migrate.c | 9 +++++----
> 1 file changed, 5 insertions(+), 4 deletions(-)
>
> diff --git a/drivers/gpu/drm/xe/xe_migrate.c b/drivers/gpu/drm/xe/xe_migrate.c
> index 7345a5b65169..ae42917918e5 100644
> --- a/drivers/gpu/drm/xe/xe_migrate.c
> +++ b/drivers/gpu/drm/xe/xe_migrate.c
> @@ -91,17 +91,18 @@ struct xe_migrate {
> static void xe_migrate_fini(void *arg)
> {
> struct xe_migrate *m = arg;
> + struct xe_exec_queue *q = m->q;
>
> - xe_vm_lock(m->q->vm, false);
> + xe_vm_lock(q->vm, false);
> xe_bo_unpin(m->pt_bo);
> - xe_vm_unlock(m->q->vm);
> + xe_vm_unlock(q->vm);
>
> dma_fence_put(m->fence);
> xe_bo_put(m->pt_bo);
> drm_suballoc_manager_fini(&m->vm_update_sa);
> mutex_destroy(&m->job_mutex);
> - xe_vm_close_and_put(m->q->vm);
> - xe_exec_queue_put(m->q);
> + xe_exec_queue_put(q);
^^^ This can destroy the 'q'.
> + xe_vm_close_and_put(q->vm);
^^^ Now 'q->' can dereference invalid memory.
Matt
> }
>
> static u64 xe_migrate_vm_addr(u64 slot, u32 level)
> --
> 2.34.1
>
^ permalink raw reply [flat|nested] 15+ messages in thread
* Re: [PATCH 6/7] drm/xe: Clean up GuC software state after a wedge
2025-10-20 21:45 ` [PATCH 6/7] drm/xe: Clean up GuC software state after a wedge Stuart Summers
@ 2025-10-22 21:15 ` Matthew Brost
2025-10-23 17:43 ` Summers, Stuart
0 siblings, 1 reply; 15+ messages in thread
From: Matthew Brost @ 2025-10-22 21:15 UTC (permalink / raw)
To: Stuart Summers
Cc: intel-xe, niranjana.vishwanathapura, zhanjun.dong, shuicheng.lin
On Mon, Oct 20, 2025 at 09:45:28PM +0000, Stuart Summers wrote:
> When the driver is wedged during a hardware failure, there
> is a chance the queue kill coming from those events can
> race with either the scheduler teardown or the queue
> deregistration with GuC. Basically the following two
> scenarios can occur (from event trace):
>
> Scheduler start missing:
> xe_exec_queue_create
The queues should be initialized in a started state unless a GT reset or
VF migration is in progress. In both cases, upon successful completion,
all queues will be restarted.
I did spot a bug in GT resets — if those fail, we don’t properly restart
the queues. That should be fixed.
Also, I think xe_guc_declare_wedged is incorrect now that I’m looking at
it.
It probably should be:
void xe_guc_declare_wedged(struct xe_guc *guc)
{
xe_gt_assert(guc_to_gt(guc), guc_to_xe(guc)->wedged.mode);
xe_guc_ct_stop(&guc->ct);
xe_guc_submit_wedge(guc);
xe_guc_sanitize(guc);
}
> xe_exec_queue_kill
> xe_guc_exec_queue_kill
> xe_exec_queue_destroy
>
> GuC CT response missing:
> xe_exec_queue_create
> xe_exec_queue_register
> xe_exec_queue_scheduling_enable
> xe_exec_queue_scheduling_done
> xe_exec_queue_kill
> xe_guc_exec_queue_kill
> xe_exec_queue_close
> xe_exec_queue_destroy
> xe_exec_queue_cleanup_entity
> xe_exec_queue_scheduling_disable
The ref count should be zero here — xe_exec_queue_scheduling_disable is
called after this series [1]. I think we need to get this series in
before making changes to the GuC submission state machine. Technically,
all we need are the last three patches from that series, as they
simplify some things. I believe an upcoming Xe3 feature would also
benefit from getting these patches in too.
So that means in xe_guc_submit_wedge() the below if statement is going
to fail.
1006 mutex_lock(&guc->submission_state.lock);
1007 xa_for_each(&guc->submission_state.exec_queue_lookup, index, q)
1008 if (xe_exec_queue_get_unless_zero(q))
1009 set_exec_queue_wedged(q);
1010 mutex_unlock(&guc->submission_state.lock);
I think we need...
else if (exec_queue_register(q))
__guc_exec_queue_destroy(guc, q);
We also need to clean up suspend fences, as those could be lost under
the right race condition.
So prior to existing if statement, we also need:
if (q->guc->suspend_pending)
suspend_fence_signal(q);
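Putting those together, the loop in xe_guc_submit_wedge() would look
roughly like this (untested sketch; the trace point added earlier in
this series is elided, and the registered check is whatever helper that
ends up being):

	mutex_lock(&guc->submission_state.lock);
	xa_for_each(&guc->submission_state.exec_queue_lookup, index, q) {
		if (q->guc->suspend_pending)
			suspend_fence_signal(q);

		if (xe_exec_queue_get_unless_zero(q))
			set_exec_queue_wedged(q);
		else if (exec_queue_register(q))
			__guc_exec_queue_destroy(guc, q);
	}
	mutex_unlock(&guc->submission_state.lock);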
[1] https://patchwork.freedesktop.org/series/155315/
>
> The above traces depend also on inclusion of [1].
>
> In the first scenario, the queue is created, but killed
> prior to completing the message cleanup. In the second,
> we go through a full registration before killing. The
> CT communication happens in that last call to
> xe_exec_queue_scheduling_disable.
>
> We expect to then get a call to xe_guc_exec_queue_destroy
> in both cases if the aforementioned scheduler/GuC CT communication
> had happened, which we are missing here, hence missing any
> LRC/BO cleanup in the exec queues in question.
>
> Since this sequence seems specific to the wedge case
> as described above, add a targeted scheduler start
> and guc deregistration handler to the wedged_fini()
> routine.
>
> Without this change, if we inject wedges in the above scenarios
> we can expect the following when the DRM memory tracking is
> enabled (see CONFIG_DRM_DEBUG_MM):
> [ 129.600285] [drm:drm_mm_takedown] *ERROR* node [00647000 + 00008000]: inserted at
> drm_mm_insert_node_in_range+0x2ec/0x4b0
> __xe_ggtt_insert_bo_at+0x10f/0x360 [xe]
> __xe_bo_create_locked+0x184/0x520 [xe]
> xe_bo_create_pin_map_at_aligned+0x3b/0x180 [xe]
> xe_bo_create_pin_map+0x13/0x20 [xe]
> xe_lrc_create+0x139/0x18e0 [xe]
> xe_exec_queue_create+0x22f/0x3e0 [xe]
> xe_exec_queue_create_ioctl+0x4e9/0xbf0 [xe]
> drm_ioctl_kernel+0x9f/0xf0
> drm_ioctl+0x20f/0x440
> xe_drm_ioctl+0x121/0x150 [xe]
> __x64_sys_ioctl+0x8c/0xe0
> do_syscall_64+0x4c/0x1d0
> entry_SYSCALL_64_after_hwframe+0x76/0x7e
> [ 129.601966] [drm:drm_mm_takedown] *ERROR* node [0064f000 + 00008000]: inserted at
> drm_mm_insert_node_in_range+0x2ec/0x4b0
> __xe_ggtt_insert_bo_at+0x10f/0x360 [xe]
> __xe_bo_create_locked+0x184/0x520 [xe]
> xe_bo_create_pin_map_at_aligned+0x3b/0x180 [xe]
> xe_bo_create_pin_map+0x13/0x20 [xe]
> xe_lrc_create+0x139/0x18e0 [xe]
> xe_exec_queue_create+0x22f/0x3e0 [xe]
> xe_exec_queue_create_bind+0x7f/0xd0 [xe]
> xe_vm_create+0x4aa/0x8b0 [xe]
> xe_vm_create_ioctl+0x17b/0x420 [xe]
> drm_ioctl_kernel+0x9f/0xf0
> drm_ioctl+0x20f/0x440
> xe_drm_ioctl+0x121/0x150 [xe]
> __x64_sys_ioctl+0x8c/0xe0
> do_syscall_64+0x4c/0x1d0
> entry_SYSCALL_64_after_hwframe+0x76/0x7e
>
> Signed-off-by: Stuart Summers <stuart.summers@intel.com>
>
> [1] https://patchwork.freedesktop.org/patch/680852/?series=155352&rev=4
> ---
> drivers/gpu/drm/xe/xe_guc_submit.c | 12 ++++++++++++
> 1 file changed, 12 insertions(+)
>
> diff --git a/drivers/gpu/drm/xe/xe_guc_submit.c b/drivers/gpu/drm/xe/xe_guc_submit.c
> index 5ec1e4a83d68..a11ae4e70809 100644
> --- a/drivers/gpu/drm/xe/xe_guc_submit.c
> +++ b/drivers/gpu/drm/xe/xe_guc_submit.c
> @@ -287,6 +287,8 @@ static void guc_submit_fini(struct drm_device *drm, void *arg)
> xa_destroy(&guc->submission_state.exec_queue_lookup);
> }
>
> +static void __guc_exec_queue_destroy(struct xe_guc *guc, struct xe_exec_queue *q);
> +
> static void guc_submit_wedged_fini(void *arg)
> {
> struct xe_guc *guc = arg;
> @@ -299,6 +301,16 @@ static void guc_submit_wedged_fini(void *arg)
> mutex_unlock(&guc->submission_state.lock);
> xe_exec_queue_put(q);
> mutex_lock(&guc->submission_state.lock);
With everything above I don't think this new code below is needed.
But to make sure we know what we are doing, how about this from [2]
before the xe_exec_queue_put.
xe_gt_assert(..., !drm_sched_is_stopped(sched));
Wanna try out these suggestions? It is always possible I made a mistake
here.
Matt
[2] https://patchwork.freedesktop.org/patch/681606/?series=155315&rev=3
> + } else {
> + /*
> + * Make sure queues which were killed as part of a
> + * wedge are cleaned up properly. Clean up any
> + * dangling scheduler tasks and pending exec queue
> + * deregistration.
> + */
> + xe_sched_submission_start(&q->guc->sched);
> + if (exec_queue_pending_disable(q))
> + __guc_exec_queue_destroy(guc, q);
> }
> }
> mutex_unlock(&guc->submission_state.lock);
> --
> 2.34.1
>
^ permalink raw reply [flat|nested] 15+ messages in thread
* Re: [PATCH 5/7] drm/xe: Correct migration VM teardown order
2025-10-22 20:30 ` Matthew Brost
@ 2025-10-23 17:18 ` Summers, Stuart
0 siblings, 0 replies; 15+ messages in thread
From: Summers, Stuart @ 2025-10-23 17:18 UTC (permalink / raw)
To: Brost, Matthew
Cc: intel-xe@lists.freedesktop.org, Dong, Zhanjun, Lin, Shuicheng,
Vishwanathapura, Niranjana
On Wed, 2025-10-22 at 13:30 -0700, Matthew Brost wrote:
> On Mon, Oct 20, 2025 at 09:45:27PM +0000, Stuart Summers wrote:
> > Adjust the sequence of the migration teardown to match what
> > is happening in the init() function.
> >
> > v2: Take a reference to the migrate queue before put (Matt)
> >
>
> You need to store the VM on the stack, not the queue;
>
> > Signed-off-by: Stuart Summers <stuart.summers@intel.com>
> > ---
> > drivers/gpu/drm/xe/xe_migrate.c | 9 +++++----
> > 1 file changed, 5 insertions(+), 4 deletions(-)
> >
> > diff --git a/drivers/gpu/drm/xe/xe_migrate.c
> > b/drivers/gpu/drm/xe/xe_migrate.c
> > index 7345a5b65169..ae42917918e5 100644
> > --- a/drivers/gpu/drm/xe/xe_migrate.c
> > +++ b/drivers/gpu/drm/xe/xe_migrate.c
> > @@ -91,17 +91,18 @@ struct xe_migrate {
> > static void xe_migrate_fini(void *arg)
> > {
> > struct xe_migrate *m = arg;
> > + struct xe_exec_queue *q = m->q;
> >
> > - xe_vm_lock(m->q->vm, false);
> > + xe_vm_lock(q->vm, false);
> > xe_bo_unpin(m->pt_bo);
> > - xe_vm_unlock(m->q->vm);
> > + xe_vm_unlock(q->vm);
> >
> > dma_fence_put(m->fence);
> > xe_bo_put(m->pt_bo);
> > drm_suballoc_manager_fini(&m->vm_update_sa);
> > mutex_destroy(&m->job_mutex);
> > - xe_vm_close_and_put(m->q->vm);
> > - xe_exec_queue_put(m->q);
> > + xe_exec_queue_put(q);
>
> ^^^ This can destroy the 'q'.
>
> > + xe_vm_close_and_put(q->vm);
>
> ^^^ Now 'q->' can dereference invalid memory.
Bah you're right, sorry about that :(. Thanks for the feedback!
-Stuart
>
> Matt
>
> > }
> >
> > static u64 xe_migrate_vm_addr(u64 slot, u32 level)
> > --
> > 2.34.1
> >
^ permalink raw reply [flat|nested] 15+ messages in thread
* Re: [PATCH 6/7] drm/xe: Clean up GuC software state after a wedge
2025-10-22 21:15 ` Matthew Brost
@ 2025-10-23 17:43 ` Summers, Stuart
2025-10-23 18:26 ` Matthew Brost
0 siblings, 1 reply; 15+ messages in thread
From: Summers, Stuart @ 2025-10-23 17:43 UTC (permalink / raw)
To: Brost, Matthew
Cc: intel-xe@lists.freedesktop.org, Dong, Zhanjun, Lin, Shuicheng,
Vishwanathapura, Niranjana
On Wed, 2025-10-22 at 14:15 -0700, Matthew Brost wrote:
> On Mon, Oct 20, 2025 at 09:45:28PM +0000, Stuart Summers wrote:
> > When the driver is wedged during a hardware failure, there
> > is a chance the queue kill coming from those events can
> > race with either the scheduler teardown or the queue
> > deregistration with GuC. Basically the following two
> > scenarios can occur (from event trace):
> >
> > Scheduler start missing:
> > xe_exec_queue_create
>
> The queues should be initialized in a started state unless a GT reset
> or
> VF migration is in progress. In both cases, upon successful
> completion,
> all queues will be restarted.
>
> I did spot a bug in GT resets — if those fail, we don’t properly
> restart
> the queues. That should be fixed.
>
> Also, I think xe_guc_declare_wedged is incorrect now that I’m looking
> at
> it.
>
> It probably should be:
>
> void xe_guc_declare_wedged(struct xe_guc *guc)
> {
> xe_gt_assert(guc_to_gt(guc), guc_to_xe(guc)->wedged.mode);
>
> xe_guc_ct_stop(&guc->ct);
> xe_guc_submit_wedge(guc);
> xe_guc_sanitize(guc);
> }
It's a good point, but to be clear, I'm not doing a GT reset here.
Actually in the case I'm testing, there is an explicit PCIe FLR and
then I'm explicitly wedging after and making sure the unbind completes
error-free. Not something I'm necessarily planning on driving into the
tree here, but doing this for some internal testing.
But yeah I'll give this a try with a GT reset in the mix to make sure
that cleans up the way you're suggesting. And thanks for that wedge
revision, I'll give that a try too.
>
> > xe_exec_queue_kill
> > xe_guc_exec_queue_kill
> > xe_exec_queue_destroy
> >
> > GuC CT response missing:
> > xe_exec_queue_create
> > xe_exec_queue_register
> > xe_exec_queue_scheduling_enable
> > xe_exec_queue_scheduling_done
> > xe_exec_queue_kill
> > xe_guc_exec_queue_kill
> > xe_exec_queue_close
> > xe_exec_queue_destroy
> > xe_exec_queue_cleanup_entity
> > xe_exec_queue_scheduling_disable
>
> The ref count should be zero here — xe_exec_queue_scheduling_disable
I did confirm that it is (at least per the get_unless_zero() call
below).
> is
> called after this series [1]. I think we need to get this series in
Let me pull in that series. I'm still going through it on my side...
thanks for the link!
> before making changes to the GuC submission state machine.
> Technically,
> all we need are the last three patches from that series, as they
> simplify some things. I believe an upcoming Xe3 feature would also
> benefit from getting these patches in too.
>
> So that means in xe_guc_submit_wedge() the below if statement is
> going
> to fail.
>
> 1006 mutex_lock(&guc->submission_state.lock);
> 1007 xa_for_each(&guc->submission_state.exec_queue_lookup,
> index, q)
> 1008 if (xe_exec_queue_get_unless_zero(q))
> 1009 set_exec_queue_wedged(q);
> 1010 mutex_unlock(&guc->submission_state.lock);
>
> I think we need...
>
> else if (exec_queue_register(q))
> __guc_exec_queue_destroy(guc, q);
Right.. I remember you mentioning that also in a prior rev... let me
confirm here. When I was testing, this wasn't working in all cases, but
I'll double check and get back.
Also this was the point of the pending_disable here. We do explicitly
set that in this flow whereas registered has a bunch of entry points
and I was trying to isolate to the case of GuC dying mid-CT send. It
seems to me if we have registered but not pending_disable, we have a
bug in the sequence somewhere rather than an outside error injection
(like GuC dying).
>
> We also need to clean up suspend fences, as those could be lost under
> the right race condition.
>
> So prior to existing if statement, we also need:
>
> if (q->guc->suspend_pending)
> suspend_fence_signal(q);
Ok
>
> [1] https://patchwork.freedesktop.org/series/155315/
>
> >
> > The above traces depend also on inclusion of [1].
> >
> > In the first scenario, the queue is created, but killed
> > prior to completing the message cleanup. In the second,
> > we go through a full registration before killing. The
> > CT communication happens in that last call to
> > xe_exec_queue_scheduling_disable.
> >
> > We expect to then get a call to xe_guc_exec_queue_destroy
> > in both cases if the aforementioned scheduler/GuC CT communication
> > had happened, which we are missing here, hence missing any
> > LRC/BO cleanup in the exec queues in question.
> >
> > Since this sequence seems specific to the wedge case
> > as described above, add a targeted scheduler start
> > and guc deregistration handler to the wedged_fini()
> > routine.
> >
> > Without this change, if we inject wedges in the above scenarios
> > we can expect the following when the DRM memory tracking is
> > enabled (see CONFIG_DRM_DEBUG_MM):
> > [ 129.600285] [drm:drm_mm_takedown] *ERROR* node [00647000 +
> > 00008000]: inserted at
> > drm_mm_insert_node_in_range+0x2ec/0x4b0
> > __xe_ggtt_insert_bo_at+0x10f/0x360 [xe]
> > __xe_bo_create_locked+0x184/0x520 [xe]
> > xe_bo_create_pin_map_at_aligned+0x3b/0x180 [xe]
> > xe_bo_create_pin_map+0x13/0x20 [xe]
> > xe_lrc_create+0x139/0x18e0 [xe]
> > xe_exec_queue_create+0x22f/0x3e0 [xe]
> > xe_exec_queue_create_ioctl+0x4e9/0xbf0 [xe]
> > drm_ioctl_kernel+0x9f/0xf0
> > drm_ioctl+0x20f/0x440
> > xe_drm_ioctl+0x121/0x150 [xe]
> > __x64_sys_ioctl+0x8c/0xe0
> > do_syscall_64+0x4c/0x1d0
> > entry_SYSCALL_64_after_hwframe+0x76/0x7e
> > [ 129.601966] [drm:drm_mm_takedown] *ERROR* node [0064f000 +
> > 00008000]: inserted at
> > drm_mm_insert_node_in_range+0x2ec/0x4b0
> > __xe_ggtt_insert_bo_at+0x10f/0x360 [xe]
> > __xe_bo_create_locked+0x184/0x520 [xe]
> > xe_bo_create_pin_map_at_aligned+0x3b/0x180 [xe]
> > xe_bo_create_pin_map+0x13/0x20 [xe]
> > xe_lrc_create+0x139/0x18e0 [xe]
> > xe_exec_queue_create+0x22f/0x3e0 [xe]
> > xe_exec_queue_create_bind+0x7f/0xd0 [xe]
> > xe_vm_create+0x4aa/0x8b0 [xe]
> > xe_vm_create_ioctl+0x17b/0x420 [xe]
> > drm_ioctl_kernel+0x9f/0xf0
> > drm_ioctl+0x20f/0x440
> > xe_drm_ioctl+0x121/0x150 [xe]
> > __x64_sys_ioctl+0x8c/0xe0
> > do_syscall_64+0x4c/0x1d0
> > entry_SYSCALL_64_after_hwframe+0x76/0x7e
> >
> > Signed-off-by: Stuart Summers <stuart.summers@intel.com>
> >
> > [1]
> > https://patchwork.freedesktop.org/patch/680852/?series=155352&rev=4
> > ---
> > drivers/gpu/drm/xe/xe_guc_submit.c | 12 ++++++++++++
> > 1 file changed, 12 insertions(+)
> >
> > diff --git a/drivers/gpu/drm/xe/xe_guc_submit.c b/drivers/gpu/drm/xe/xe_guc_submit.c
> > index 5ec1e4a83d68..a11ae4e70809 100644
> > --- a/drivers/gpu/drm/xe/xe_guc_submit.c
> > +++ b/drivers/gpu/drm/xe/xe_guc_submit.c
> > @@ -287,6 +287,8 @@ static void guc_submit_fini(struct drm_device *drm, void *arg)
> >  	xa_destroy(&guc->submission_state.exec_queue_lookup);
> >  }
> >
> > +static void __guc_exec_queue_destroy(struct xe_guc *guc, struct xe_exec_queue *q);
> > +
> >  static void guc_submit_wedged_fini(void *arg)
> >  {
> >  	struct xe_guc *guc = arg;
> > @@ -299,6 +301,16 @@ static void guc_submit_wedged_fini(void *arg)
> >  		mutex_unlock(&guc->submission_state.lock);
> >  		xe_exec_queue_put(q);
> >  		mutex_lock(&guc->submission_state.lock);
>
> With everything above I don't think this new code below is needed.
>
> But to make sure we know what we are doing, how about this from [2]
> before the xe_exec_queue_put.
>
> xe_gt_assert(..., !drm_sched_is_stopped(sched));
Yeah I agree this seems like a good idea. It also follows what we're
doing in the other state changes.
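Something like this in the get_unless_zero() branch, I think - sketch
only, assuming drm_sched_is_stopped() is the helper from [2] and
guessing that it takes the queue's scheduler handle directly:

	/* the queue must have been restarted by the time we drop the ref */
	xe_gt_assert(guc_to_gt(guc), !drm_sched_is_stopped(&q->guc->sched));

	mutex_unlock(&guc->submission_state.lock);
	xe_exec_queue_put(q);
	mutex_lock(&guc->submission_state.lock);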
>
> Wanna try out these suggestions? It is always possible I made a
> mistake
> here.
Really appreciate the feedback Matt. Yeah I'll take a look and get back
if it still doesn't work here.
Thanks,
Stuart
>
> Matt
>
> [2]
> https://patchwork.freedesktop.org/patch/681606/?series=155315&rev=3
>
> > +		} else {
> > +			/*
> > +			 * Make sure queues which were killed as part of a
> > +			 * wedge are cleaned up properly. Clean up any
> > +			 * dangling scheduler tasks and pending exec queue
> > +			 * deregistration.
> > +			 */
> > +			xe_sched_submission_start(&q->guc->sched);
> > +			if (exec_queue_pending_disable(q))
> > +				__guc_exec_queue_destroy(guc, q);
> >  		}
> >  	}
> >  	mutex_unlock(&guc->submission_state.lock);
> > --
> > 2.34.1
> >
^ permalink raw reply [flat|nested] 15+ messages in thread
* Re: [PATCH 6/7] drm/xe: Clean up GuC software state after a wedge
2025-10-23 17:43 ` Summers, Stuart
@ 2025-10-23 18:26 ` Matthew Brost
0 siblings, 0 replies; 15+ messages in thread
From: Matthew Brost @ 2025-10-23 18:26 UTC (permalink / raw)
To: Summers, Stuart
Cc: intel-xe@lists.freedesktop.org, Dong, Zhanjun, Lin, Shuicheng,
Vishwanathapura, Niranjana
On Thu, Oct 23, 2025 at 11:43:34AM -0600, Summers, Stuart wrote:
> On Wed, 2025-10-22 at 14:15 -0700, Matthew Brost wrote:
> > On Mon, Oct 20, 2025 at 09:45:28PM +0000, Stuart Summers wrote:
> > > When the driver is wedged during a hardware failure, there
> > > is a chance the queue kill coming from those events can
> > > race with either the scheduler teardown or the queue
> > > deregistration with GuC. Basically the following two
> > > scenarios can occur (from event trace):
> > >
> > > Scheduler start missing:
> > > xe_exec_queue_create
> >
> > The queues should be initialized in a started state unless a GT reset
> > or
> > VF migration is in progress. In both cases, upon successful
> > completion,
> > all queues will be restarted.
> >
> > I did spot a bug in GT resets — if those fail, we don’t properly
> > restart
> > the queues. That should be fixed.
> >
> > Also, I think xe_guc_declare_wedged is incorrect now that I’m looking
> > at
> > it.
> >
> > It probably should be:
> >
> > void xe_guc_declare_wedged(struct xe_guc *guc)
> > {
> > 	xe_gt_assert(guc_to_gt(guc), guc_to_xe(guc)->wedged.mode);
> >
> > 	xe_guc_ct_stop(&guc->ct);
> > 	xe_guc_submit_wedge(guc);
> > 	xe_guc_sanitize(guc);
> > }
>
> It's a good point, but to be clear, I'm not doing a GT reset here.
> Actually in the case I'm testing, there is an explicit PCIe FLR and
> then I'm explicitly wedging after and making sure the unbind completes
> error-free. Not something I'm necessarily planning on driving into the
> tree here, but doing this for some internal testing.
>
> But yeah I'll give this a try with a GT reset in the mix to make sure
> that cleans up the way you're suggesting. And thanks for that wedge
> revision, I'll give that a try too.
>
> >
> > > xe_exec_queue_kill
> > > xe_guc_exec_queue_kill
> > > xe_exec_queue_destroy
> > >
> > > GuC CT response missing:
> > > xe_exec_queue_create
> > > xe_exec_queue_register
> > > xe_exec_queue_scheduling_enable
> > > xe_exec_queue_scheduling_done
> > > xe_exec_queue_kill
> > > xe_guc_exec_queue_kill
> > > xe_exec_queue_close
> > > xe_exec_queue_destroy
> > > xe_exec_queue_cleanup_entity
> > > xe_exec_queue_scheduling_disable
> >
> > The ref count should be zero here — xe_exec_queue_scheduling_disable
>
> I did confirm that it is (at least per the get_unless_zero() call
> below).
>
> > is
> > called after this series [1]. I think we need to get this series in
>
> Let me pull in that series. I'm still going through it on my side...
> thanks for the link!
>
> > before making changes to the GuC submission state machine.
> > Technically,
> > all we need are the last three patches from that series, as they
> > simplify some things. I believe an upcoming Xe3 feature would also
> > benefit from getting these patches in too.
> >
> > So that means in xe_guc_submit_wedge() the below if statement is
> > going
> > to fail.
> >
> > 1006		mutex_lock(&guc->submission_state.lock);
> > 1007		xa_for_each(&guc->submission_state.exec_queue_lookup, index, q)
> > 1008			if (xe_exec_queue_get_unless_zero(q))
> > 1009				set_exec_queue_wedged(q);
> > 1010		mutex_unlock(&guc->submission_state.lock);
> >
> > I think we need...
> >
> > 	else if (exec_queue_registered(q))
> > 		__guc_exec_queue_destroy(guc, q);
>
> Right.. I remember you mentioning that also in a prior rev... let me
> confirm here. When I was testing, this wasn't working in all cases, but
> I'll double check and get back.
>
> Also, this was the point of checking pending_disable here. We
> explicitly set that flag in this flow, whereas registered has a bunch
> of entry points, and I was trying to isolate the case of the GuC dying
> mid-CT send. It seems to me that if we have registered set but not
> pending_disable, we have a bug in the sequence somewhere rather than
> an outside error injection (like the GuC dying).
>
Part of my changes ties the deregistration process directly to the
refcount, i.e. we deregister the queue when its refcount reaches zero.
Previously, it was possible to deregister during TDRs, but that's no
longer the case.
So, if we reach this point and the exec queue is present in the lookup
array with a refcount of zero, then either:
- The G2H for deregistration was lost as part of wedging, or
- The final worker for destruction is queued or currently running.
The exec_queue_registered flag will tell us whether the G2H was
processed, as we clear this bit upon receiving the G2H.
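Putting that together, the wedge loop would roughly become the below
(sketch only, on top of [1]; helper names are the ones already in
xe_guc_submit.c, plus the suspend fence signaling I mention further
down):

	mutex_lock(&guc->submission_state.lock);
	xa_for_each(&guc->submission_state.exec_queue_lookup, index, q) {
		/* suspend fences lost in the race would otherwise never signal */
		if (q->guc->suspend_pending)
			suspend_fence_signal(q);

		if (xe_exec_queue_get_unless_zero(q)) {
			set_exec_queue_wedged(q);
		} else if (exec_queue_registered(q)) {
			/* the deregister G2H was lost as part of wedging */
			__guc_exec_queue_destroy(guc, q);
		}
		/* else: the destruction worker is already queued or running */
	}
	mutex_unlock(&guc->submission_state.lock);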
> >
> > We also need to cleanup suspend fences too as those could be lost
> > under
> > the right race condition.
> >
> > So prior to existing if statement, we also need:
> >
> > 	if (q->guc->suspend_pending)
> > 		suspend_fence_signal(q);
>
> Ok
>
> >
> > [1] https://patchwork.freedesktop.org/series/155315/
> >
> > >
> > > The above traces depend also on inclusion of [1].
> > >
> > > In the first scenario, the queue is created, but killed
> > > prior to completing the message cleanup. In the second,
> > > we go through a full registration before killing. The
> > > CT communication happens in that last call to
> > > xe_exec_queue_scheduling_disable.
> > >
> > > We expect to then get a call to xe_guc_exec_queue_destroy
> > > in both cases if the aforementioned scheduler/GuC CT communication
> > > had happened, which we are missing here, hence missing any
> > > LRC/BO cleanup in the exec queues in question.
> > >
> > > Since this sequence seems specific to the wedge case
> > > as described above, add a targeted scheduler start
> > > and guc deregistration handler to the wedged_fini()
> > > routine.
> > >
> > > Without this change, if we inject wedges in the above scenarios
> > > we can expect the following when the DRM memory tracking is
> > > enabled (see CONFIG_DRM_DEBUG_MM):
> > > [ 129.600285] [drm:drm_mm_takedown] *ERROR* node [00647000 +
> > > 00008000]: inserted at
> > > drm_mm_insert_node_in_range+0x2ec/0x4b0
> > > __xe_ggtt_insert_bo_at+0x10f/0x360 [xe]
> > > __xe_bo_create_locked+0x184/0x520 [xe]
> > > xe_bo_create_pin_map_at_aligned+0x3b/0x180 [xe]
> > > xe_bo_create_pin_map+0x13/0x20 [xe]
> > > xe_lrc_create+0x139/0x18e0 [xe]
> > > xe_exec_queue_create+0x22f/0x3e0 [xe]
> > > xe_exec_queue_create_ioctl+0x4e9/0xbf0 [xe]
> > > drm_ioctl_kernel+0x9f/0xf0
> > > drm_ioctl+0x20f/0x440
> > > xe_drm_ioctl+0x121/0x150 [xe]
> > > __x64_sys_ioctl+0x8c/0xe0
> > > do_syscall_64+0x4c/0x1d0
> > > entry_SYSCALL_64_after_hwframe+0x76/0x7e
> > > [ 129.601966] [drm:drm_mm_takedown] *ERROR* node [0064f000 +
> > > 00008000]: inserted at
> > > drm_mm_insert_node_in_range+0x2ec/0x4b0
> > > __xe_ggtt_insert_bo_at+0x10f/0x360 [xe]
> > > __xe_bo_create_locked+0x184/0x520 [xe]
> > > xe_bo_create_pin_map_at_aligned+0x3b/0x180 [xe]
> > > xe_bo_create_pin_map+0x13/0x20 [xe]
> > > xe_lrc_create+0x139/0x18e0 [xe]
> > > xe_exec_queue_create+0x22f/0x3e0 [xe]
> > > xe_exec_queue_create_bind+0x7f/0xd0 [xe]
> > > xe_vm_create+0x4aa/0x8b0 [xe]
> > > xe_vm_create_ioctl+0x17b/0x420 [xe]
> > > drm_ioctl_kernel+0x9f/0xf0
> > > drm_ioctl+0x20f/0x440
> > > xe_drm_ioctl+0x121/0x150 [xe]
> > > __x64_sys_ioctl+0x8c/0xe0
> > > do_syscall_64+0x4c/0x1d0
> > > entry_SYSCALL_64_after_hwframe+0x76/0x7e
> > >
> > > Signed-off-by: Stuart Summers <stuart.summers@intel.com>
> > >
> > > [1]
> > > https://patchwork.freedesktop.org/patch/680852/?series=155352&rev=4
> > > ---
> > > drivers/gpu/drm/xe/xe_guc_submit.c | 12 ++++++++++++
> > > 1 file changed, 12 insertions(+)
> > >
> > > diff --git a/drivers/gpu/drm/xe/xe_guc_submit.c b/drivers/gpu/drm/xe/xe_guc_submit.c
> > > index 5ec1e4a83d68..a11ae4e70809 100644
> > > --- a/drivers/gpu/drm/xe/xe_guc_submit.c
> > > +++ b/drivers/gpu/drm/xe/xe_guc_submit.c
> > > @@ -287,6 +287,8 @@ static void guc_submit_fini(struct drm_device *drm, void *arg)
> > >  	xa_destroy(&guc->submission_state.exec_queue_lookup);
> > >  }
> > >
> > > +static void __guc_exec_queue_destroy(struct xe_guc *guc, struct xe_exec_queue *q);
> > > +
> > >  static void guc_submit_wedged_fini(void *arg)
> > >  {
> > >  	struct xe_guc *guc = arg;
> > > @@ -299,6 +301,16 @@ static void guc_submit_wedged_fini(void *arg)
> > >  		mutex_unlock(&guc->submission_state.lock);
> > >  		xe_exec_queue_put(q);
> > >  		mutex_lock(&guc->submission_state.lock);
> >
> > With everything above I don't think this new code below is needed.
> >
> > But to make sure we know what we are doing, how about this from [2]
> > before the xe_exec_queue_put.
> >
> > xe_gt_assert(..., !drm_sched_is_stopped(sched));
>
> Yeah I agree this seems like a good idea. It also follows what we're
> doing in the other state changes.
>
> >
> > Wanna try out these suggestions? It is always possible I made a
> > mistake
> > here.
>
> Really appreciate the feedback Matt. Yeah I'll take a look and get back
> if it still doesn't work here.
>
Yeah, the state machine is a little hard to wrap your head around.
Happy to help.
Matt
> Thanks,
> Stuart
>
> >
> > Matt
> >
> > [2]
> > https://patchwork.freedesktop.org/patch/681606/?series=155315&rev=3
> >
> > > +		} else {
> > > +			/*
> > > +			 * Make sure queues which were killed as part of a
> > > +			 * wedge are cleaned up properly. Clean up any
> > > +			 * dangling scheduler tasks and pending exec queue
> > > +			 * deregistration.
> > > +			 */
> > > +			xe_sched_submission_start(&q->guc->sched);
> > > +			if (exec_queue_pending_disable(q))
> > > +				__guc_exec_queue_destroy(guc, q);
> > >  		}
> > >  	}
> > >  	mutex_unlock(&guc->submission_state.lock);
> > > --
> > > 2.34.1
> > >
>
^ permalink raw reply [flat|nested] 15+ messages in thread
Thread overview: 15+ messages
2025-10-20 21:45 [PATCH 0/7] Fix a couple of wedge corner-case memory leaks Stuart Summers
2025-10-20 21:45 ` [PATCH 1/7] drm/xe: Add additional trace points for LRCs Stuart Summers
2025-10-20 21:45 ` [PATCH 2/7] drm/xe: Add a trace point for VM close Stuart Summers
2025-10-20 21:45 ` [PATCH 3/7] drm/xe: Add the BO pointer info to the BO trace Stuart Summers
2025-10-20 21:45 ` [PATCH 4/7] drm/xe: Add new exec queue trace points Stuart Summers
2025-10-20 21:45 ` [PATCH 5/7] drm/xe: Correct migration VM teardown order Stuart Summers
2025-10-22 20:30 ` Matthew Brost
2025-10-23 17:18 ` Summers, Stuart
2025-10-20 21:45 ` [PATCH 6/7] drm/xe: Clean up GuC software state after a wedge Stuart Summers
2025-10-22 21:15 ` Matthew Brost
2025-10-23 17:43 ` Summers, Stuart
2025-10-23 18:26 ` Matthew Brost
2025-10-20 21:45 ` [PATCH 7/7] drm/xe/doc: Add GuC submission kernel-doc Stuart Summers
2025-10-20 22:05 ` Matthew Brost
2025-10-20 22:07 ` Summers, Stuart