* [PATCH 4/6] drm/xe: Add new exec queue trace points
2025-10-14 18:09 [PATCH 0/6] Fix a couple of wedge corner-case memory leaks Stuart Summers
@ 2025-10-14 18:09 ` Stuart Summers
0 siblings, 0 replies; 19+ messages in thread
From: Stuart Summers @ 2025-10-14 18:09 UTC (permalink / raw)
Cc: intel-xe, matthew.brost, Stuart Summers
Add separate exec queue and GuC exec queue trace points
to show which part of the stack is executing.
This is helpful because several of the GuC-specific
paths rely on responses from the GuC, which are useful
to view separately in case the GuC stops responding
in the middle of an operation that would otherwise
expect a response.
Also add the exec queue pointer to the trace events
for easier tracking. Contexts (guc_ids) can get re-used,
so the pointer makes grepping a little easier in this
type of debug.
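As an illustration of what the pointer buys, two uses of a recycled
guc_id become easy to tell apart in the event log. The lines below
are hypothetical, with made-up pointer values, but follow the
TP_printk format added in this patch:
  xe_exec_queue_destroy: dev=0000:03:00.0, 0xffff888103a40000, 1:0x1, gt=0, width=1, guc_id=2, guc_state=0x0, flags=0x0
  xe_exec_queue_create: dev=0000:03:00.0, 0xffff888105b80000, 1:0x1, gt=0, width=1, guc_id=2, guc_state=0x0, flags=0x0
Grepping for one pointer isolates a single queue's lifetime even
though both queues report guc_id=2.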
Signed-off-by: Stuart Summers <stuart.summers@intel.com>
---
drivers/gpu/drm/xe/xe_exec_queue.c | 4 ++++
drivers/gpu/drm/xe/xe_guc_submit.c | 11 +++++++----
drivers/gpu/drm/xe/xe_trace.h | 22 ++++++++++++++++++++--
3 files changed, 31 insertions(+), 6 deletions(-)
diff --git a/drivers/gpu/drm/xe/xe_exec_queue.c b/drivers/gpu/drm/xe/xe_exec_queue.c
index 90cbc95f8e2e..a2ef381cf6d9 100644
--- a/drivers/gpu/drm/xe/xe_exec_queue.c
+++ b/drivers/gpu/drm/xe/xe_exec_queue.c
@@ -377,6 +377,8 @@ void xe_exec_queue_destroy(struct kref *ref)
struct xe_exec_queue *q = container_of(ref, struct xe_exec_queue, refcount);
struct xe_exec_queue *eq, *next;
+ trace_xe_exec_queue_destroy(q);
+
if (xe_exec_queue_uses_pxp(q))
xe_pxp_exec_queue_remove(gt_to_xe(q->gt)->pxp, q);
@@ -953,6 +955,8 @@ void xe_exec_queue_kill(struct xe_exec_queue *q)
{
struct xe_exec_queue *eq = q, *next;
+ trace_xe_exec_queue_kill(q);
+
list_for_each_entry_safe(eq, next, &eq->multi_gt_list,
multi_gt_link) {
q->ops->kill(eq);
diff --git a/drivers/gpu/drm/xe/xe_guc_submit.c b/drivers/gpu/drm/xe/xe_guc_submit.c
index e9aa0625ce60..5ec1e4a83d68 100644
--- a/drivers/gpu/drm/xe/xe_guc_submit.c
+++ b/drivers/gpu/drm/xe/xe_guc_submit.c
@@ -1003,9 +1003,12 @@ void xe_guc_submit_wedge(struct xe_guc *guc)
}
mutex_lock(&guc->submission_state.lock);
- xa_for_each(&guc->submission_state.exec_queue_lookup, index, q)
- if (xe_exec_queue_get_unless_zero(q))
+ xa_for_each(&guc->submission_state.exec_queue_lookup, index, q) {
+ if (xe_exec_queue_get_unless_zero(q)) {
set_exec_queue_wedged(q);
+ trace_xe_exec_queue_wedge(q);
+ }
+ }
mutex_unlock(&guc->submission_state.lock);
}
@@ -1455,7 +1458,7 @@ static void __guc_exec_queue_destroy_async(struct work_struct *w)
struct xe_guc *guc = exec_queue_to_guc(q);
xe_pm_runtime_get(guc_to_xe(guc));
- trace_xe_exec_queue_destroy(q);
+ trace_xe_guc_exec_queue_destroy(q);
if (xe_exec_queue_is_lr(q))
cancel_work_sync(&ge->lr_tdr);
@@ -1716,7 +1719,7 @@ static int guc_exec_queue_init(struct xe_exec_queue *q)
static void guc_exec_queue_kill(struct xe_exec_queue *q)
{
- trace_xe_exec_queue_kill(q);
+ trace_xe_guc_exec_queue_kill(q);
set_exec_queue_killed(q);
__suspend_fence_signal(q);
xe_guc_exec_queue_trigger_cleanup(q);
diff --git a/drivers/gpu/drm/xe/xe_trace.h b/drivers/gpu/drm/xe/xe_trace.h
index 314f42fcbcbd..a5dd0c48d894 100644
--- a/drivers/gpu/drm/xe/xe_trace.h
+++ b/drivers/gpu/drm/xe/xe_trace.h
@@ -71,6 +71,7 @@ DECLARE_EVENT_CLASS(xe_exec_queue,
TP_STRUCT__entry(
__string(dev, __dev_name_eq(q))
+ __field(struct xe_exec_queue *, q)
__field(enum xe_engine_class, class)
__field(u32, logical_mask)
__field(u8, gt_id)
@@ -82,6 +83,7 @@ DECLARE_EVENT_CLASS(xe_exec_queue,
TP_fast_assign(
__assign_str(dev);
+ __entry->q = q;
__entry->class = q->class;
__entry->logical_mask = q->logical_mask;
__entry->gt_id = q->gt->info.id;
@@ -91,8 +93,9 @@ DECLARE_EVENT_CLASS(xe_exec_queue,
__entry->flags = q->flags;
),
- TP_printk("dev=%s, %d:0x%x, gt=%d, width=%d, guc_id=%d, guc_state=0x%x, flags=0x%x",
- __get_str(dev), __entry->class, __entry->logical_mask,
+ TP_printk("dev=%s, %p, %d:0x%x, gt=%d, width=%d, guc_id=%d, guc_state=0x%x, flags=0x%x",
+ __get_str(dev), __entry->q,
+ __entry->class, __entry->logical_mask,
__entry->gt_id, __entry->width, __entry->guc_id,
__entry->guc_state, __entry->flags)
);
@@ -147,11 +150,21 @@ DEFINE_EVENT(xe_exec_queue, xe_exec_queue_close,
TP_ARGS(q)
);
+DEFINE_EVENT(xe_exec_queue, xe_exec_queue_wedge,
+ TP_PROTO(struct xe_exec_queue *q),
+ TP_ARGS(q)
+);
+
DEFINE_EVENT(xe_exec_queue, xe_exec_queue_kill,
TP_PROTO(struct xe_exec_queue *q),
TP_ARGS(q)
);
+DEFINE_EVENT(xe_exec_queue, xe_guc_exec_queue_kill,
+ TP_PROTO(struct xe_exec_queue *q),
+ TP_ARGS(q)
+);
+
DEFINE_EVENT(xe_exec_queue, xe_exec_queue_cleanup_entity,
TP_PROTO(struct xe_exec_queue *q),
TP_ARGS(q)
@@ -162,6 +175,11 @@ DEFINE_EVENT(xe_exec_queue, xe_exec_queue_destroy,
TP_ARGS(q)
);
+DEFINE_EVENT(xe_exec_queue, xe_guc_exec_queue_destroy,
+ TP_PROTO(struct xe_exec_queue *q),
+ TP_ARGS(q)
+);
+
DEFINE_EVENT(xe_exec_queue, xe_exec_queue_reset,
TP_PROTO(struct xe_exec_queue *q),
TP_ARGS(q)
--
2.34.1
^ permalink raw reply related [flat|nested] 19+ messages in thread
* [PATCH 0/6] Fix a couple of wedge corner-case memory leaks
@ 2025-10-27 18:04 Stuart Summers
2025-10-27 18:04 ` [PATCH 1/6] drm/xe: Add additional trace points for LRCs Stuart Summers
` (7 more replies)
0 siblings, 8 replies; 19+ messages in thread
From: Stuart Summers @ 2025-10-27 18:04 UTC (permalink / raw)
Cc: intel-xe, matthew.brost, niranjana.vishwanathapura, zhanjun.dong,
shuicheng.lin, Stuart Summers
Most of the patches in this series just add debug hints
to help track these leaks down. I split them up in case
we want to pick and choose which ones to include in the
tree; I found them useful.
The main patch of interest is the last one in the series.
It fixes some corner cases when the driver becomes wedged
either in the middle of communication with the DRM
scheduler or because the GuC has become unresponsive. In
both of these cases there is a chance we leak memory held
by exec queue members like the LRC and the LRC BO.
This series depends on [1].
v2: Address feedback from Matt:
- Let the DRM scheduler handle pausing/unpausing
- Still do the wait after scheduling disable/deregister
as with the previous patch, but skip the intermediate
software-based schedule disable using the "banned"
flag and instead jump straight to the deregister
handling, which fully resets the queue state.
Note that for this case I am seeing a hardware failure
after submitting to GuC but before receiving the
response from GuC. So even if we wedge in this case
(monitoring the hardware state change), the queue
itself is not wedged because of the active GuC
submission (CT is not stalled at that point).
v3: Add back in the xe pause checks and instead just kickstart
message handling in the guc_submit_fini() routine before
doing the async wait there.
v4: Handle the CT communication loss during wedge asynchronously
Also combine those last two patches into one to handle
wedge cleanup generally.
v5: Add a new patch with a little documentation on the GuC
submission handling stages.
Move the scheduler kickstart and destruction call on the
dangling queues into the wedged_fini() callback. These
now only run for queues in an error state - wedge was
called, but the queues weren't fully cleaned up, as shown
by the missing exec_queue reference at the time of wedging.
Also fix the migration teardown ordering reference mistake
pointed out by Matt in the previous series rev.
v6: Implement and test against [1] with the changes Matt suggested.
[1]: https://patchwork.freedesktop.org/series/155315/
Stuart Summers (6):
drm/xe: Add additional trace points for LRCs
drm/xe: Add a trace point for VM close
drm/xe: Add the BO pointer info to the BO trace
drm/xe: Add new exec queue trace points
drm/xe: Correct migration VM teardown order
drm/xe: Clean up GuC software state after a wedge
drivers/gpu/drm/xe/xe_exec_queue.c | 4 +++
drivers/gpu/drm/xe/xe_guc_submit.c | 17 +++++++++---
drivers/gpu/drm/xe/xe_lrc.c | 4 +++
drivers/gpu/drm/xe/xe_lrc.h | 3 +++
drivers/gpu/drm/xe/xe_migrate.c | 7 ++---
drivers/gpu/drm/xe/xe_trace.h | 22 ++++++++++++++--
drivers/gpu/drm/xe/xe_trace_bo.h | 12 +++++++--
drivers/gpu/drm/xe/xe_trace_lrc.h | 42 +++++++++++++++++++++++++++++-
drivers/gpu/drm/xe/xe_vm.c | 2 ++
9 files changed, 101 insertions(+), 12 deletions(-)
--
2.34.1
^ permalink raw reply [flat|nested] 19+ messages in thread
* [PATCH 1/6] drm/xe: Add additional trace points for LRCs
2025-10-27 18:04 [PATCH 0/6] Fix a couple of wedge corner-case memory leaks Stuart Summers
@ 2025-10-27 18:04 ` Stuart Summers
2025-10-28 19:46 ` Matt Atwood
2025-10-27 18:04 ` [PATCH 2/6] drm/xe: Add a trace point for VM close Stuart Summers
` (6 subsequent siblings)
7 siblings, 1 reply; 19+ messages in thread
From: Stuart Summers @ 2025-10-27 18:04 UTC (permalink / raw)
Cc: intel-xe, matthew.brost, niranjana.vishwanathapura, zhanjun.dong,
shuicheng.lin, Stuart Summers
Add trace points to indicate when an LRC has been
created and destroyed, and when a reference is
taken (get) or dropped (put).
Signed-off-by: Stuart Summers <stuart.summers@intel.com>
---
drivers/gpu/drm/xe/xe_lrc.c | 4 +++
drivers/gpu/drm/xe/xe_lrc.h | 3 +++
drivers/gpu/drm/xe/xe_trace_lrc.h | 42 ++++++++++++++++++++++++++++++-
3 files changed, 48 insertions(+), 1 deletion(-)
diff --git a/drivers/gpu/drm/xe/xe_lrc.c b/drivers/gpu/drm/xe/xe_lrc.c
index b5083c99dd50..42d1c861fe18 100644
--- a/drivers/gpu/drm/xe/xe_lrc.c
+++ b/drivers/gpu/drm/xe/xe_lrc.c
@@ -1575,6 +1575,8 @@ struct xe_lrc *xe_lrc_create(struct xe_hw_engine *hwe, struct xe_vm *vm,
return ERR_PTR(err);
}
+ trace_xe_lrc_create(lrc);
+
return lrc;
}
@@ -1589,6 +1591,8 @@ void xe_lrc_destroy(struct kref *ref)
{
struct xe_lrc *lrc = container_of(ref, struct xe_lrc, refcount);
+ trace_xe_lrc_destroy(lrc);
+
xe_lrc_finish(lrc);
kfree(lrc);
}
diff --git a/drivers/gpu/drm/xe/xe_lrc.h b/drivers/gpu/drm/xe/xe_lrc.h
index 2fb628da5c43..fd67810f9812 100644
--- a/drivers/gpu/drm/xe/xe_lrc.h
+++ b/drivers/gpu/drm/xe/xe_lrc.h
@@ -8,6 +8,7 @@
#include <linux/types.h>
#include "xe_lrc_types.h"
+#include "xe_trace_lrc.h"
struct drm_printer;
struct xe_bb;
@@ -61,6 +62,7 @@ void xe_lrc_destroy(struct kref *ref);
static inline struct xe_lrc *xe_lrc_get(struct xe_lrc *lrc)
{
kref_get(&lrc->refcount);
+ trace_xe_lrc_get(lrc);
return lrc;
}
@@ -73,6 +75,7 @@ static inline struct xe_lrc *xe_lrc_get(struct xe_lrc *lrc)
*/
static inline void xe_lrc_put(struct xe_lrc *lrc)
{
+ trace_xe_lrc_put(lrc);
kref_put(&lrc->refcount, xe_lrc_destroy);
}
diff --git a/drivers/gpu/drm/xe/xe_trace_lrc.h b/drivers/gpu/drm/xe/xe_trace_lrc.h
index d525cbee1e34..e8daa5d323e7 100644
--- a/drivers/gpu/drm/xe/xe_trace_lrc.h
+++ b/drivers/gpu/drm/xe/xe_trace_lrc.h
@@ -13,7 +13,6 @@
#include <linux/types.h>
#include "xe_gt_types.h"
-#include "xe_lrc.h"
#include "xe_lrc_types.h"
#define __dev_name_lrc(lrc) dev_name(gt_to_xe((lrc)->fence_ctx.gt)->drm.dev)
@@ -42,6 +41,47 @@ TRACE_EVENT(xe_lrc_update_timestamp,
__get_str(device_id))
);
+DECLARE_EVENT_CLASS(xe_lrc,
+ TP_PROTO(struct xe_lrc *lrc),
+ TP_ARGS(lrc),
+
+ TP_STRUCT__entry(
+ __field(struct xe_lrc *, lrc)
+ __string(name, lrc->fence_ctx.name)
+ __string(device_id, __dev_name_lrc(lrc))
+ ),
+
+ TP_fast_assign(
+ __entry->lrc = lrc;
+ __assign_str(name);
+ __assign_str(device_id);
+ ),
+
+ TP_printk("lrc=:%p lrc->name=%s device_id:%s",
+ __entry->lrc, __get_str(name),
+ __get_str(device_id))
+);
+
+DEFINE_EVENT(xe_lrc, xe_lrc_create,
+ TP_PROTO(struct xe_lrc *lrc),
+ TP_ARGS(lrc)
+);
+
+DEFINE_EVENT(xe_lrc, xe_lrc_destroy,
+ TP_PROTO(struct xe_lrc *lrc),
+ TP_ARGS(lrc)
+);
+
+DEFINE_EVENT(xe_lrc, xe_lrc_get,
+ TP_PROTO(struct xe_lrc *lrc),
+ TP_ARGS(lrc)
+);
+
+DEFINE_EVENT(xe_lrc, xe_lrc_put,
+ TP_PROTO(struct xe_lrc *lrc),
+ TP_ARGS(lrc)
+);
+
#endif
/* This part must be outside protection */
--
2.34.1
^ permalink raw reply related [flat|nested] 19+ messages in thread
* [PATCH 2/6] drm/xe: Add a trace point for VM close
2025-10-27 18:04 [PATCH 0/6] Fix a couple of wedge corner-case memory leaks Stuart Summers
2025-10-27 18:04 ` [PATCH 1/6] drm/xe: Add additional trace points for LRCs Stuart Summers
@ 2025-10-27 18:04 ` Stuart Summers
2025-10-28 20:03 ` Matt Atwood
2025-10-27 18:04 ` [PATCH 3/6] drm/xe: Add the BO pointer info to the BO trace Stuart Summers
` (5 subsequent siblings)
7 siblings, 1 reply; 19+ messages in thread
From: Stuart Summers @ 2025-10-27 18:04 UTC (permalink / raw)
Cc: intel-xe, matthew.brost, niranjana.vishwanathapura, zhanjun.dong,
shuicheng.lin, Stuart Summers
All better tracking of error cases when calling through
xe_vm_close_and_put().
Signed-off-by: Stuart Summers <stuart.summers@intel.com>
---
drivers/gpu/drm/xe/xe_trace_bo.h | 6 ++++++
drivers/gpu/drm/xe/xe_vm.c | 2 ++
2 files changed, 8 insertions(+)
diff --git a/drivers/gpu/drm/xe/xe_trace_bo.h b/drivers/gpu/drm/xe/xe_trace_bo.h
index 86323cf3be2c..238311cfb816 100644
--- a/drivers/gpu/drm/xe/xe_trace_bo.h
+++ b/drivers/gpu/drm/xe/xe_trace_bo.h
@@ -14,6 +14,7 @@
#include "xe_bo.h"
#include "xe_bo_types.h"
+#include "xe_exec_queue_types.h"
#include "xe_vm.h"
#define __dev_name_bo(bo) dev_name(xe_bo_device(bo)->drm.dev)
@@ -223,6 +224,11 @@ DEFINE_EVENT(xe_vm, xe_vm_free,
TP_ARGS(vm)
);
+DEFINE_EVENT(xe_vm, xe_vm_close,
+ TP_PROTO(struct xe_vm *vm),
+ TP_ARGS(vm)
+);
+
DEFINE_EVENT(xe_vm, xe_vm_cpu_bind,
TP_PROTO(struct xe_vm *vm),
TP_ARGS(vm)
diff --git a/drivers/gpu/drm/xe/xe_vm.c b/drivers/gpu/drm/xe/xe_vm.c
index 00f3520dec38..e383982bad8a 100644
--- a/drivers/gpu/drm/xe/xe_vm.c
+++ b/drivers/gpu/drm/xe/xe_vm.c
@@ -1672,6 +1672,8 @@ static void xe_vm_close(struct xe_vm *vm)
bool bound;
int idx;
+ trace_xe_vm_close(vm);
+
bound = drm_dev_enter(&xe->drm, &idx);
down_write(&vm->lock);
--
2.34.1
^ permalink raw reply related [flat|nested] 19+ messages in thread
* [PATCH 3/6] drm/xe: Add the BO pointer info to the BO trace
2025-10-27 18:04 [PATCH 0/6] Fix a couple of wedge corner-case memory leaks Stuart Summers
2025-10-27 18:04 ` [PATCH 1/6] drm/xe: Add additional trace points for LRCs Stuart Summers
2025-10-27 18:04 ` [PATCH 2/6] drm/xe: Add a trace point for VM close Stuart Summers
@ 2025-10-27 18:04 ` Stuart Summers
2025-10-28 20:11 ` Matt Atwood
2025-10-27 18:04 ` [PATCH 4/6] drm/xe: Add new exec queue trace points Stuart Summers
` (4 subsequent siblings)
7 siblings, 1 reply; 19+ messages in thread
From: Stuart Summers @ 2025-10-27 18:04 UTC (permalink / raw)
Cc: intel-xe, matthew.brost, niranjana.vishwanathapura, zhanjun.dong,
shuicheng.lin, Stuart Summers
Just a little extra detail to make BOs easier to track
through the trace event log.
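For example, with the pointer added, a BO event line reads like the
following (hypothetical values, shown for one of the xe_bo class
events, following the TP_printk format in the diff):
  xe_bo_cpu_fault: dev=0000:03:00.0, 0xffff888102c00000, size=8192, flags=0x02, vm=0xffff888101f00000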
Signed-off-by: Stuart Summers <stuart.summers@intel.com>
---
drivers/gpu/drm/xe/xe_trace_bo.h | 6 ++++--
1 file changed, 4 insertions(+), 2 deletions(-)
diff --git a/drivers/gpu/drm/xe/xe_trace_bo.h b/drivers/gpu/drm/xe/xe_trace_bo.h
index 238311cfb816..0d795733e0f0 100644
--- a/drivers/gpu/drm/xe/xe_trace_bo.h
+++ b/drivers/gpu/drm/xe/xe_trace_bo.h
@@ -27,6 +27,7 @@ DECLARE_EVENT_CLASS(xe_bo,
TP_STRUCT__entry(
__string(dev, __dev_name_bo(bo))
+ __field(struct xe_bo *, bo)
__field(size_t, size)
__field(u32, flags)
__field(struct xe_vm *, vm)
@@ -34,13 +35,14 @@ DECLARE_EVENT_CLASS(xe_bo,
TP_fast_assign(
__assign_str(dev);
+ __entry->bo = bo;
__entry->size = xe_bo_size(bo);
__entry->flags = bo->flags;
__entry->vm = bo->vm;
),
- TP_printk("dev=%s, size=%zu, flags=0x%02x, vm=%p",
- __get_str(dev), __entry->size,
+ TP_printk("dev=%s, %p, size=%zu, flags=0x%02x, vm=%p",
+ __get_str(dev), __entry->bo, __entry->size,
__entry->flags, __entry->vm)
);
--
2.34.1
^ permalink raw reply related [flat|nested] 19+ messages in thread
* [PATCH 4/6] drm/xe: Add new exec queue trace points
2025-10-27 18:04 [PATCH 0/6] Fix a couple of wedge corner-case memory leaks Stuart Summers
` (2 preceding siblings ...)
2025-10-27 18:04 ` [PATCH 3/6] drm/xe: Add the BO pointer info to the BO trace Stuart Summers
@ 2025-10-27 18:04 ` Stuart Summers
2025-10-28 20:29 ` Matt Atwood
2025-10-27 18:04 ` [PATCH 5/6] drm/xe: Correct migration VM teardown order Stuart Summers
` (3 subsequent siblings)
7 siblings, 1 reply; 19+ messages in thread
From: Stuart Summers @ 2025-10-27 18:04 UTC (permalink / raw)
Cc: intel-xe, matthew.brost, niranjana.vishwanathapura, zhanjun.dong,
shuicheng.lin, Stuart Summers
Add separate exec queue and GuC exec queue trace points
to show which part of the stack is executing.
This is helpful because several of the GuC-specific
paths rely on responses from the GuC, which are useful
to view separately in case the GuC stops responding
in the middle of an operation that would otherwise
expect a response.
Also add the exec queue pointer to the trace events
for easier tracking. Contexts (guc_ids) can get re-used,
so the pointer makes grepping a little easier in this
type of debug.
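To sketch the layering this creates (the .kill wiring to the GuC
backend is assumed from the rename in xe_guc_submit.c below; it is
not spelled out in this diff):
  xe_exec_queue_kill(q)                  /* common layer */
    trace_xe_exec_queue_kill(q);         /* always fires */
    q->ops->kill(q);                     /* dispatch to the backend */
      guc_exec_queue_kill(q)             /* GuC backend */
        trace_xe_guc_exec_queue_kill(q);
Seeing the first event without the second means the kill never
reached the GuC backend.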
Signed-off-by: Stuart Summers <stuart.summers@intel.com>
---
drivers/gpu/drm/xe/xe_exec_queue.c | 4 ++++
drivers/gpu/drm/xe/xe_guc_submit.c | 11 +++++++----
drivers/gpu/drm/xe/xe_trace.h | 22 ++++++++++++++++++++--
3 files changed, 31 insertions(+), 6 deletions(-)
diff --git a/drivers/gpu/drm/xe/xe_exec_queue.c b/drivers/gpu/drm/xe/xe_exec_queue.c
index 90cbc95f8e2e..a2ef381cf6d9 100644
--- a/drivers/gpu/drm/xe/xe_exec_queue.c
+++ b/drivers/gpu/drm/xe/xe_exec_queue.c
@@ -377,6 +377,8 @@ void xe_exec_queue_destroy(struct kref *ref)
struct xe_exec_queue *q = container_of(ref, struct xe_exec_queue, refcount);
struct xe_exec_queue *eq, *next;
+ trace_xe_exec_queue_destroy(q);
+
if (xe_exec_queue_uses_pxp(q))
xe_pxp_exec_queue_remove(gt_to_xe(q->gt)->pxp, q);
@@ -953,6 +955,8 @@ void xe_exec_queue_kill(struct xe_exec_queue *q)
{
struct xe_exec_queue *eq = q, *next;
+ trace_xe_exec_queue_kill(q);
+
list_for_each_entry_safe(eq, next, &eq->multi_gt_list,
multi_gt_link) {
q->ops->kill(eq);
diff --git a/drivers/gpu/drm/xe/xe_guc_submit.c b/drivers/gpu/drm/xe/xe_guc_submit.c
index d4ffdb71ef3d..3f672355b3bb 100644
--- a/drivers/gpu/drm/xe/xe_guc_submit.c
+++ b/drivers/gpu/drm/xe/xe_guc_submit.c
@@ -1004,9 +1004,12 @@ void xe_guc_submit_wedge(struct xe_guc *guc)
}
mutex_lock(&guc->submission_state.lock);
- xa_for_each(&guc->submission_state.exec_queue_lookup, index, q)
- if (xe_exec_queue_get_unless_zero(q))
+ xa_for_each(&guc->submission_state.exec_queue_lookup, index, q) {
+ if (xe_exec_queue_get_unless_zero(q)) {
set_exec_queue_wedged(q);
+ trace_xe_exec_queue_wedge(q);
+ }
+ }
mutex_unlock(&guc->submission_state.lock);
}
@@ -1456,7 +1459,7 @@ static void __guc_exec_queue_destroy_async(struct work_struct *w)
struct xe_guc *guc = exec_queue_to_guc(q);
xe_pm_runtime_get(guc_to_xe(guc));
- trace_xe_exec_queue_destroy(q);
+ trace_xe_guc_exec_queue_destroy(q);
if (xe_exec_queue_is_lr(q))
cancel_work_sync(&ge->lr_tdr);
@@ -1727,7 +1730,7 @@ static int guc_exec_queue_init(struct xe_exec_queue *q)
static void guc_exec_queue_kill(struct xe_exec_queue *q)
{
- trace_xe_exec_queue_kill(q);
+ trace_xe_guc_exec_queue_kill(q);
set_exec_queue_killed(q);
__suspend_fence_signal(q);
xe_guc_exec_queue_trigger_cleanup(q);
diff --git a/drivers/gpu/drm/xe/xe_trace.h b/drivers/gpu/drm/xe/xe_trace.h
index 314f42fcbcbd..a5dd0c48d894 100644
--- a/drivers/gpu/drm/xe/xe_trace.h
+++ b/drivers/gpu/drm/xe/xe_trace.h
@@ -71,6 +71,7 @@ DECLARE_EVENT_CLASS(xe_exec_queue,
TP_STRUCT__entry(
__string(dev, __dev_name_eq(q))
+ __field(struct xe_exec_queue *, q)
__field(enum xe_engine_class, class)
__field(u32, logical_mask)
__field(u8, gt_id)
@@ -82,6 +83,7 @@ DECLARE_EVENT_CLASS(xe_exec_queue,
TP_fast_assign(
__assign_str(dev);
+ __entry->q = q;
__entry->class = q->class;
__entry->logical_mask = q->logical_mask;
__entry->gt_id = q->gt->info.id;
@@ -91,8 +93,9 @@ DECLARE_EVENT_CLASS(xe_exec_queue,
__entry->flags = q->flags;
),
- TP_printk("dev=%s, %d:0x%x, gt=%d, width=%d, guc_id=%d, guc_state=0x%x, flags=0x%x",
- __get_str(dev), __entry->class, __entry->logical_mask,
+ TP_printk("dev=%s, %p, %d:0x%x, gt=%d, width=%d, guc_id=%d, guc_state=0x%x, flags=0x%x",
+ __get_str(dev), __entry->q,
+ __entry->class, __entry->logical_mask,
__entry->gt_id, __entry->width, __entry->guc_id,
__entry->guc_state, __entry->flags)
);
@@ -147,11 +150,21 @@ DEFINE_EVENT(xe_exec_queue, xe_exec_queue_close,
TP_ARGS(q)
);
+DEFINE_EVENT(xe_exec_queue, xe_exec_queue_wedge,
+ TP_PROTO(struct xe_exec_queue *q),
+ TP_ARGS(q)
+);
+
DEFINE_EVENT(xe_exec_queue, xe_exec_queue_kill,
TP_PROTO(struct xe_exec_queue *q),
TP_ARGS(q)
);
+DEFINE_EVENT(xe_exec_queue, xe_guc_exec_queue_kill,
+ TP_PROTO(struct xe_exec_queue *q),
+ TP_ARGS(q)
+);
+
DEFINE_EVENT(xe_exec_queue, xe_exec_queue_cleanup_entity,
TP_PROTO(struct xe_exec_queue *q),
TP_ARGS(q)
@@ -162,6 +175,11 @@ DEFINE_EVENT(xe_exec_queue, xe_exec_queue_destroy,
TP_ARGS(q)
);
+DEFINE_EVENT(xe_exec_queue, xe_guc_exec_queue_destroy,
+ TP_PROTO(struct xe_exec_queue *q),
+ TP_ARGS(q)
+);
+
DEFINE_EVENT(xe_exec_queue, xe_exec_queue_reset,
TP_PROTO(struct xe_exec_queue *q),
TP_ARGS(q)
--
2.34.1
^ permalink raw reply related [flat|nested] 19+ messages in thread
* [PATCH 5/6] drm/xe: Correct migration VM teardown order
2025-10-27 18:04 [PATCH 0/6] Fix a couple of wedge corner-case memory leaks Stuart Summers
` (3 preceding siblings ...)
2025-10-27 18:04 ` [PATCH 4/6] drm/xe: Add new exec queue trace points Stuart Summers
@ 2025-10-27 18:04 ` Stuart Summers
2025-10-27 19:46 ` Matthew Brost
2025-10-27 18:04 ` [PATCH 6/6] drm/xe: Clean up GuC software state after a wedge Stuart Summers
` (2 subsequent siblings)
7 siblings, 1 reply; 19+ messages in thread
From: Stuart Summers @ 2025-10-27 18:04 UTC (permalink / raw)
Cc: intel-xe, matthew.brost, niranjana.vishwanathapura, zhanjun.dong,
shuicheng.lin, Stuart Summers
Adjust the sequence of the migration teardown to match what
is happening in the init() function.
v2: Take a reference to the migrate queue before put (Matt)
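The reason for caching the VM pointer, sketched with comments
(inferred from the diff below rather than stated in it):
  struct xe_vm *vm = m->q->vm; /* cache while m->q is still valid */
  ...
  xe_exec_queue_put(m->q);     /* may free m->q */
  xe_vm_close_and_put(vm);     /* safe: uses the cached pointer */
Releasing the queue before the VM also mirrors the init() order,
where the queue is created against an already-created VM.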
Signed-off-by: Stuart Summers <stuart.summers@intel.com>
---
drivers/gpu/drm/xe/xe_migrate.c | 7 ++++---
1 file changed, 4 insertions(+), 3 deletions(-)
diff --git a/drivers/gpu/drm/xe/xe_migrate.c b/drivers/gpu/drm/xe/xe_migrate.c
index 56a5804726e9..9d60bc2b57e4 100644
--- a/drivers/gpu/drm/xe/xe_migrate.c
+++ b/drivers/gpu/drm/xe/xe_migrate.c
@@ -98,17 +98,18 @@ struct xe_migrate {
static void xe_migrate_fini(void *arg)
{
struct xe_migrate *m = arg;
+ struct xe_vm *vm = m->q->vm;
- xe_vm_lock(m->q->vm, false);
+ xe_vm_lock(vm, false);
xe_bo_unpin(m->pt_bo);
- xe_vm_unlock(m->q->vm);
+ xe_vm_unlock(vm);
dma_fence_put(m->fence);
xe_bo_put(m->pt_bo);
drm_suballoc_manager_fini(&m->vm_update_sa);
mutex_destroy(&m->job_mutex);
- xe_vm_close_and_put(m->q->vm);
xe_exec_queue_put(m->q);
+ xe_vm_close_and_put(vm);
}
static u64 xe_migrate_vm_addr(u64 slot, u32 level)
--
2.34.1
^ permalink raw reply related [flat|nested] 19+ messages in thread
* [PATCH 6/6] drm/xe: Clean up GuC software state after a wedge
2025-10-27 18:04 [PATCH 0/6] Fix a couple of wedge corner-case memory leaks Stuart Summers
` (4 preceding siblings ...)
2025-10-27 18:04 ` [PATCH 5/6] drm/xe: Correct migration VM teardown order Stuart Summers
@ 2025-10-27 18:04 ` Stuart Summers
2025-10-27 18:16 ` ✗ CI.checkpatch: warning for Fix a couple of wedge corner-case memory leaks (rev6) Patchwork
2025-10-27 18:17 ` ✗ CI.KUnit: failure " Patchwork
7 siblings, 0 replies; 19+ messages in thread
From: Stuart Summers @ 2025-10-27 18:04 UTC (permalink / raw)
Cc: intel-xe, matthew.brost, niranjana.vishwanathapura, zhanjun.dong,
shuicheng.lin, Stuart Summers
When the driver is wedged during a hardware failure, there
is a chance the queue kill coming from those events can
race with either the scheduler teardown or the queue
deregistration with GuC. Basically the following two
scenarios can occur (from event trace):
Scheduler start missing:
xe_exec_queue_create
xe_exec_queue_kill
xe_guc_exec_queue_kill
xe_exec_queue_destroy
GuC CT response missing:
xe_exec_queue_create
xe_exec_queue_register
xe_exec_queue_scheduling_enable
xe_exec_queue_scheduling_done
xe_exec_queue_kill
xe_guc_exec_queue_kill
xe_exec_queue_close
xe_exec_queue_destroy
xe_exec_queue_cleanup_entity
xe_exec_queue_scheduling_disable
The above traces also depend on inclusion of [1].
In the first scenario, the queue is created, but killed
prior to completing the message cleanup. In the second,
we go through a full registration before killing. The
CT communication happens in that last call to
xe_exec_queue_scheduling_disable.
In both cases we would then expect a call to
xe_guc_exec_queue_destroy if the aforementioned
scheduler/GuC CT communication had happened. That call is
missing here, and with it any LRC/BO cleanup for the exec
queues in question.
Once the scheduler rework in [2] is available, simply ensure
all queues are either marked as wedged or cleaned up
explicitly by destroying any remaining registered queues.
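In code terms, the wedge path then makes a per-queue decision along
these lines (a condensed sketch of the hunk below, with comments
added for this explanation):
	xa_for_each(&guc->submission_state.exec_queue_lookup, index, q) {
		if (xe_exec_queue_get_unless_zero(q)) {
			/* live queue: mark it wedged; cleaned up later
			 * in guc_submit_wedged_fini() */
			set_exec_queue_wedged(q);
			trace_xe_exec_queue_wedge(q);
		} else if (exec_queue_registered(q)) {
			/* refcount already zero but still registered:
			 * the destroy that normally follows the (lost)
			 * CT response never ran, so run it here */
			__guc_exec_queue_destroy(guc, q);
		}
	}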
Without this change, if we inject wedges in the above scenarios
we can expect the following when the DRM memory tracking is
enabled (see CONFIG_DRM_DEBUG_MM):
[ 129.600285] [drm:drm_mm_takedown] *ERROR* node [00647000 + 00008000]: inserted at
drm_mm_insert_node_in_range+0x2ec/0x4b0
__xe_ggtt_insert_bo_at+0x10f/0x360 [xe]
__xe_bo_create_locked+0x184/0x520 [xe]
xe_bo_create_pin_map_at_aligned+0x3b/0x180 [xe]
xe_bo_create_pin_map+0x13/0x20 [xe]
xe_lrc_create+0x139/0x18e0 [xe]
xe_exec_queue_create+0x22f/0x3e0 [xe]
xe_exec_queue_create_ioctl+0x4e9/0xbf0 [xe]
drm_ioctl_kernel+0x9f/0xf0
drm_ioctl+0x20f/0x440
xe_drm_ioctl+0x121/0x150 [xe]
__x64_sys_ioctl+0x8c/0xe0
do_syscall_64+0x4c/0x1d0
entry_SYSCALL_64_after_hwframe+0x76/0x7e
[ 129.601966] [drm:drm_mm_takedown] *ERROR* node [0064f000 + 00008000]: inserted at
drm_mm_insert_node_in_range+0x2ec/0x4b0
__xe_ggtt_insert_bo_at+0x10f/0x360 [xe]
__xe_bo_create_locked+0x184/0x520 [xe]
xe_bo_create_pin_map_at_aligned+0x3b/0x180 [xe]
xe_bo_create_pin_map+0x13/0x20 [xe]
xe_lrc_create+0x139/0x18e0 [xe]
xe_exec_queue_create+0x22f/0x3e0 [xe]
xe_exec_queue_create_bind+0x7f/0xd0 [xe]
xe_vm_create+0x4aa/0x8b0 [xe]
xe_vm_create_ioctl+0x17b/0x420 [xe]
drm_ioctl_kernel+0x9f/0xf0
drm_ioctl+0x20f/0x440
xe_drm_ioctl+0x121/0x150 [xe]
__x64_sys_ioctl+0x8c/0xe0
do_syscall_64+0x4c/0x1d0
entry_SYSCALL_64_after_hwframe+0x76/0x7e
v2: Pulled in [2] as suggested by Matt and used his recommendation
of destroying registered queues at the time of wedging.
Signed-off-by: Stuart Summers <stuart.summers@intel.com>
[1] https://patchwork.freedesktop.org/patch/680852/?series=155352&rev=4
[2] https://patchwork.freedesktop.org/series/155315/
---
drivers/gpu/drm/xe/xe_guc_submit.c | 6 ++++++
1 file changed, 6 insertions(+)
diff --git a/drivers/gpu/drm/xe/xe_guc_submit.c b/drivers/gpu/drm/xe/xe_guc_submit.c
index 3f672355b3bb..dc26aa4b5f47 100644
--- a/drivers/gpu/drm/xe/xe_guc_submit.c
+++ b/drivers/gpu/drm/xe/xe_guc_submit.c
@@ -296,6 +296,8 @@ static void guc_submit_wedged_fini(void *arg)
mutex_lock(&guc->submission_state.lock);
xa_for_each(&guc->submission_state.exec_queue_lookup, index, q) {
+ xe_gt_assert(guc_to_gt(guc),
+ !drm_sched_is_stopped(&q->guc->sched.base));
if (exec_queue_wedged(q)) {
mutex_unlock(&guc->submission_state.lock);
xe_exec_queue_put(q);
@@ -972,6 +974,8 @@ static void xe_guc_exec_queue_trigger_cleanup(struct xe_exec_queue *q)
xe_sched_tdr_queue_imm(&q->guc->sched);
}
+static void __guc_exec_queue_destroy(struct xe_guc *guc, struct xe_exec_queue *q);
+
/**
* xe_guc_submit_wedge() - Wedge GuC submission
* @guc: the GuC object
@@ -1008,6 +1012,8 @@ void xe_guc_submit_wedge(struct xe_guc *guc)
if (xe_exec_queue_get_unless_zero(q)) {
set_exec_queue_wedged(q);
trace_xe_exec_queue_wedge(q);
+ } else if (exec_queue_registered(q)) {
+ __guc_exec_queue_destroy(guc, q);
}
}
mutex_unlock(&guc->submission_state.lock);
--
2.34.1
^ permalink raw reply related [flat|nested] 19+ messages in thread
* ✗ CI.checkpatch: warning for Fix a couple of wedge corner-case memory leaks (rev6)
2025-10-27 18:04 [PATCH 0/6] Fix a couple of wedge corner-case memory leaks Stuart Summers
` (5 preceding siblings ...)
2025-10-27 18:04 ` [PATCH 6/6] drm/xe: Clean up GuC software state after a wedge Stuart Summers
@ 2025-10-27 18:16 ` Patchwork
2025-10-27 18:17 ` ✗ CI.KUnit: failure " Patchwork
7 siblings, 0 replies; 19+ messages in thread
From: Patchwork @ 2025-10-27 18:16 UTC (permalink / raw)
To: Stuart Summers; +Cc: intel-xe
== Series Details ==
Series: Fix a couple of wedge corner-case memory leaks (rev6)
URL : https://patchwork.freedesktop.org/series/155352/
State : warning
== Summary ==
+ KERNEL=/kernel
+ git clone https://gitlab.freedesktop.org/drm/maintainer-tools mt
Cloning into 'mt'...
warning: redirecting to https://gitlab.freedesktop.org/drm/maintainer-tools.git/
+ git -C mt rev-list -n1 origin/master
f867e605613af1770f90c4b0afd4a8f06424d1f0
+ cd /kernel
+ git config --global --add safe.directory /kernel
+ git log -n1
commit c0b9b8ebd244f28b48397d561698ee61161b2d59
Author: Stuart Summers <stuart.summers@intel.com>
Date: Mon Oct 27 18:04:12 2025 +0000
drm/xe: Clean up GuC software state after a wedge
When the driver is wedged during a hardware failure, there
is a chance the queue kill coming from those events can
race with either the scheduler teardown or the queue
deregistration with GuC. Basically the following two
scenarios can occur (from event trace):
Scheduler start missing:
xe_exec_queue_create
xe_exec_queue_kill
xe_guc_exec_queue_kill
xe_exec_queue_destroy
GuC CT response missing:
xe_exec_queue_create
xe_exec_queue_register
xe_exec_queue_scheduling_enable
xe_exec_queue_scheduling_done
xe_exec_queue_kill
xe_guc_exec_queue_kill
xe_exec_queue_close
xe_exec_queue_destroy
xe_exec_queue_cleanup_entity
xe_exec_queue_scheduling_disable
The above traces also depend on inclusion of [1].
In the first scenario, the queue is created, but killed
prior to completing the message cleanup. In the second,
we go through a full registration before killing. The
CT communication happens in that last call to
xe_exec_queue_scheduling_disable.
In both cases we would then expect a call to
xe_guc_exec_queue_destroy if the aforementioned
scheduler/GuC CT communication had happened. That call is
missing here, and with it any LRC/BO cleanup for the exec
queues in question.
Once the scheduler rework in [2] is available, simply ensure
all queues are either marked as wedged or cleaned up
explicitly by destroying any remaining registered queues.
Without this change, if we inject wedges in the above scenarios
we can expect the following when the DRM memory tracking is
enabled (see CONFIG_DRM_DEBUG_MM):
[ 129.600285] [drm:drm_mm_takedown] *ERROR* node [00647000 + 00008000]: inserted at
drm_mm_insert_node_in_range+0x2ec/0x4b0
__xe_ggtt_insert_bo_at+0x10f/0x360 [xe]
__xe_bo_create_locked+0x184/0x520 [xe]
xe_bo_create_pin_map_at_aligned+0x3b/0x180 [xe]
xe_bo_create_pin_map+0x13/0x20 [xe]
xe_lrc_create+0x139/0x18e0 [xe]
xe_exec_queue_create+0x22f/0x3e0 [xe]
xe_exec_queue_create_ioctl+0x4e9/0xbf0 [xe]
drm_ioctl_kernel+0x9f/0xf0
drm_ioctl+0x20f/0x440
xe_drm_ioctl+0x121/0x150 [xe]
__x64_sys_ioctl+0x8c/0xe0
do_syscall_64+0x4c/0x1d0
entry_SYSCALL_64_after_hwframe+0x76/0x7e
[ 129.601966] [drm:drm_mm_takedown] *ERROR* node [0064f000 + 00008000]: inserted at
drm_mm_insert_node_in_range+0x2ec/0x4b0
__xe_ggtt_insert_bo_at+0x10f/0x360 [xe]
__xe_bo_create_locked+0x184/0x520 [xe]
xe_bo_create_pin_map_at_aligned+0x3b/0x180 [xe]
xe_bo_create_pin_map+0x13/0x20 [xe]
xe_lrc_create+0x139/0x18e0 [xe]
xe_exec_queue_create+0x22f/0x3e0 [xe]
xe_exec_queue_create_bind+0x7f/0xd0 [xe]
xe_vm_create+0x4aa/0x8b0 [xe]
xe_vm_create_ioctl+0x17b/0x420 [xe]
drm_ioctl_kernel+0x9f/0xf0
drm_ioctl+0x20f/0x440
xe_drm_ioctl+0x121/0x150 [xe]
__x64_sys_ioctl+0x8c/0xe0
do_syscall_64+0x4c/0x1d0
entry_SYSCALL_64_after_hwframe+0x76/0x7e
v2: Pulled in [2] as suggested by Matt and used his recommendation
of destroying registered queues at the time of wedging.
Signed-off-by: Stuart Summers <stuart.summers@intel.com>
[1] https://patchwork.freedesktop.org/patch/680852/?series=155352&rev=4
[2] https://patchwork.freedesktop.org/series/155315/
+ /mt/dim checkpatch a8f3783a2ccf78b04e48cf235f03629b80ffa3e0 drm-intel
70d334d19aff drm/xe: Add additional trace points for LRCs
-:78: CHECK:PARENTHESIS_ALIGNMENT: Alignment should match open parenthesis
#78: FILE: drivers/gpu/drm/xe/xe_trace_lrc.h:45:
+DECLARE_EVENT_CLASS(xe_lrc,
+ TP_PROTO(struct xe_lrc *lrc),
-:81: CHECK:OPEN_ENDED_LINE: Lines should not end with a '('
#81: FILE: drivers/gpu/drm/xe/xe_trace_lrc.h:48:
+ TP_STRUCT__entry(
-:87: CHECK:OPEN_ENDED_LINE: Lines should not end with a '('
#87: FILE: drivers/gpu/drm/xe/xe_trace_lrc.h:54:
+ TP_fast_assign(
total: 0 errors, 0 warnings, 3 checks, 91 lines checked
a71482bd2861 drm/xe: Add a trace point for VM close
53265f817012 drm/xe: Add the BO pointer info to the BO trace
1c2eb0f97314 drm/xe: Add new exec queue trace points
b98f6625fd40 drm/xe: Correct migration VM teardown order
c0b9b8ebd244 drm/xe: Clean up GuC software state after a wedge
^ permalink raw reply [flat|nested] 19+ messages in thread
* ✗ CI.KUnit: failure for Fix a couple of wedge corner-case memory leaks (rev6)
2025-10-27 18:04 [PATCH 0/6] Fix a couple of wedge corner-case memory leaks Stuart Summers
` (6 preceding siblings ...)
2025-10-27 18:16 ` ✗ CI.checkpatch: warning for Fix a couple of wedge corner-case memory leaks (rev6) Patchwork
@ 2025-10-27 18:17 ` Patchwork
2025-10-27 18:18 ` Summers, Stuart
7 siblings, 1 reply; 19+ messages in thread
From: Patchwork @ 2025-10-27 18:17 UTC (permalink / raw)
To: Stuart Summers; +Cc: intel-xe
== Series Details ==
Series: Fix a couple of wedge corner-case memory leaks (rev6)
URL : https://patchwork.freedesktop.org/series/155352/
State : failure
== Summary ==
+ trap cleanup EXIT
+ /kernel/tools/testing/kunit/kunit.py run --kunitconfig /kernel/drivers/gpu/drm/xe/.kunitconfig
ERROR:root:In file included from ../include/linux/bitfield.h:10,
from ../drivers/gpu/drm/xe/xe_guc_submit.c:8:
../drivers/gpu/drm/xe/xe_guc_submit.c: In function ‘guc_submit_wedged_fini’:
../drivers/gpu/drm/xe/xe_guc_submit.c:300:31: error: implicit declaration of function ‘drm_sched_is_stopped’; did you mean ‘drm_sched_stop’? [-Werror=implicit-function-declaration]
300 | !drm_sched_is_stopped(&q->guc->sched.base));
| ^~~~~~~~~~~~~~~~~~~~
../include/linux/build_bug.h:30:63: note: in definition of macro ‘BUILD_BUG_ON_INVALID’
30 | #define BUILD_BUG_ON_INVALID(e) ((void)(sizeof((__force long)(e))))
| ^
../drivers/gpu/drm/xe/xe_assert.h:112:9: note: in expansion of macro ‘__xe_assert_msg’
112 | __xe_assert_msg(__xe, condition, \
| ^~~~~~~~~~~~~~~
../drivers/gpu/drm/xe/xe_assert.h:148:9: note: in expansion of macro ‘xe_assert_msg’
148 | xe_assert_msg(tile_to_xe(__tile), condition, "tile: %u VRAM %s\n" msg, \
| ^~~~~~~~~~~~~
../drivers/gpu/drm/xe/xe_assert.h:172:9: note: in expansion of macro ‘xe_tile_assert_msg’
172 | xe_tile_assert_msg(gt_to_tile(__gt), condition, "GT: %u type %d\n" msg, \
| ^~~~~~~~~~~~~~~~~~
../drivers/gpu/drm/xe/xe_assert.h:169:37: note: in expansion of macro ‘xe_gt_assert_msg’
169 | #define xe_gt_assert(gt, condition) xe_gt_assert_msg((gt), condition, "")
| ^~~~~~~~~~~~~~~~
../drivers/gpu/drm/xe/xe_guc_submit.c:299:17: note: in expansion of macro ‘xe_gt_assert’
299 | xe_gt_assert(guc_to_gt(guc),
| ^~~~~~~~~~~~
cc1: some warnings being treated as errors
make[7]: *** [../scripts/Makefile.build:287: drivers/gpu/drm/xe/xe_guc_submit.o] Error 1
make[7]: *** Waiting for unfinished jobs....
make[6]: *** [../scripts/Makefile.build:556: drivers/gpu/drm/xe] Error 2
make[5]: *** [../scripts/Makefile.build:556: drivers/gpu/drm] Error 2
make[4]: *** [../scripts/Makefile.build:556: drivers/gpu] Error 2
make[3]: *** [../scripts/Makefile.build:556: drivers] Error 2
make[2]: *** [/kernel/Makefile:2010: .] Error 2
make[1]: *** [/kernel/Makefile:248: __sub-make] Error 2
make: *** [Makefile:248: __sub-make] Error 2
[18:16:47] Configuring KUnit Kernel ...
Generating .config ...
Populating config with:
$ make ARCH=um O=.kunit olddefconfig
[18:16:51] Building KUnit Kernel ...
Populating config with:
$ make ARCH=um O=.kunit olddefconfig
Building with:
$ make all compile_commands.json scripts_gdb ARCH=um O=.kunit --jobs=48
+ cleanup
++ stat -c %u:%g /kernel
+ chown -R 1003:1003 /kernel
^ permalink raw reply [flat|nested] 19+ messages in thread
* Re: CI.KUnit: failure for Fix a couple of wedge corner-case memory leaks (rev6)
2025-10-27 18:17 ` ✗ CI.KUnit: failure " Patchwork
@ 2025-10-27 18:18 ` Summers, Stuart
2025-10-27 20:10 ` Summers, Stuart
0 siblings, 1 reply; 19+ messages in thread
From: Summers, Stuart @ 2025-10-27 18:18 UTC (permalink / raw)
To: intel-xe@lists.freedesktop.org
On Mon, 2025-10-27 at 18:17 +0000, Patchwork wrote:
> == Series Details ==
>
> Series: Fix a couple of wedge corner-case memory leaks (rev6)
> URL : https://patchwork.freedesktop.org/series/155352/
> State : failure
>
> == Summary ==
>
> + trap cleanup EXIT
> + /kernel/tools/testing/kunit/kunit.py run --kunitconfig
> /kernel/drivers/gpu/drm/xe/.kunitconfig
> ERROR:root:In file included from ../include/linux/bitfield.h:10,
> from ../drivers/gpu/drm/xe/xe_guc_submit.c:8:
> ../drivers/gpu/drm/xe/xe_guc_submit.c: In function
> ‘guc_submit_wedged_fini’:
> ../drivers/gpu/drm/xe/xe_guc_submit.c:300:31: error: implicit
> declaration of function ‘drm_sched_is_stopped’; did you mean
> ‘drm_sched_stop’? [-Werror=implicit-function-declaration]
Hm.. I didn't see this when testing locally. I'll double check here.
-Stuart
> 300 | !drm_sched_is_stopped(&q->guc->sched.base));
> | ^~~~~~~~~~~~~~~~~~~~
> ../include/linux/build_bug.h:30:63: note: in definition of macro ‘BUILD_BUG_ON_INVALID’
> 30 | #define BUILD_BUG_ON_INVALID(e) ((void)(sizeof((__force long)(e))))
> | ^
> ../drivers/gpu/drm/xe/xe_assert.h:112:9: note: in expansion of macro ‘__xe_assert_msg’
> 112 | __xe_assert_msg(__xe, condition, \
> | ^~~~~~~~~~~~~~~
> ../drivers/gpu/drm/xe/xe_assert.h:148:9: note: in expansion of macro ‘xe_assert_msg’
> 148 | xe_assert_msg(tile_to_xe(__tile), condition, "tile: %u VRAM %s\n" msg, \
> | ^~~~~~~~~~~~~
> ../drivers/gpu/drm/xe/xe_assert.h:172:9: note: in expansion of macro ‘xe_tile_assert_msg’
> 172 | xe_tile_assert_msg(gt_to_tile(__gt), condition, "GT: %u type %d\n" msg, \
> | ^~~~~~~~~~~~~~~~~~
> ../drivers/gpu/drm/xe/xe_assert.h:169:37: note: in expansion of macro ‘xe_gt_assert_msg’
> 169 | #define xe_gt_assert(gt, condition) xe_gt_assert_msg((gt), condition, "")
> | ^~~~~~~~~~~~~~~~
> ../drivers/gpu/drm/xe/xe_guc_submit.c:299:17: note: in expansion of macro ‘xe_gt_assert’
> 299 | xe_gt_assert(guc_to_gt(guc),
> | ^~~~~~~~~~~~
> cc1: some warnings being treated as errors
> make[7]: *** [../scripts/Makefile.build:287:
> drivers/gpu/drm/xe/xe_guc_submit.o] Error 1
> make[7]: *** Waiting for unfinished jobs....
> make[6]: *** [../scripts/Makefile.build:556: drivers/gpu/drm/xe]
> Error 2
> make[5]: *** [../scripts/Makefile.build:556: drivers/gpu/drm] Error 2
> make[4]: *** [../scripts/Makefile.build:556: drivers/gpu] Error 2
> make[3]: *** [../scripts/Makefile.build:556: drivers] Error 2
> make[2]: *** [/kernel/Makefile:2010: .] Error 2
> make[1]: *** [/kernel/Makefile:248: __sub-make] Error 2
> make: *** [Makefile:248: __sub-make] Error 2
>
> [18:16:47] Configuring KUnit Kernel ...
> Generating .config ...
> Populating config with:
> $ make ARCH=um O=.kunit olddefconfig
> [18:16:51] Building KUnit Kernel ...
> Populating config with:
> $ make ARCH=um O=.kunit olddefconfig
> Building with:
> $ make all compile_commands.json scripts_gdb ARCH=um O=.kunit --
> jobs=48
> + cleanup
> ++ stat -c %u:%g /kernel
> + chown -R 1003:1003 /kernel
>
>
^ permalink raw reply [flat|nested] 19+ messages in thread
* Re: [PATCH 5/6] drm/xe: Correct migration VM teardown order
2025-10-27 18:04 ` [PATCH 5/6] drm/xe: Correct migration VM teardown order Stuart Summers
@ 2025-10-27 19:46 ` Matthew Brost
0 siblings, 0 replies; 19+ messages in thread
From: Matthew Brost @ 2025-10-27 19:46 UTC (permalink / raw)
To: Stuart Summers
Cc: intel-xe, niranjana.vishwanathapura, zhanjun.dong, shuicheng.lin
On Mon, Oct 27, 2025 at 06:04:11PM +0000, Stuart Summers wrote:
> Adjust the sequence of the migration teardown to match what
> is happening in the init() function.
>
> v2: Take a reference to the migrate queue before put (Matt)
>
> Signed-off-by: Stuart Summers <stuart.summers@intel.com>
Reviewed-by: Matthew Brost <matthew.brost@intel.com>
> ---
> drivers/gpu/drm/xe/xe_migrate.c | 7 ++++---
> 1 file changed, 4 insertions(+), 3 deletions(-)
>
> diff --git a/drivers/gpu/drm/xe/xe_migrate.c b/drivers/gpu/drm/xe/xe_migrate.c
> index 56a5804726e9..9d60bc2b57e4 100644
> --- a/drivers/gpu/drm/xe/xe_migrate.c
> +++ b/drivers/gpu/drm/xe/xe_migrate.c
> @@ -98,17 +98,18 @@ struct xe_migrate {
> static void xe_migrate_fini(void *arg)
> {
> struct xe_migrate *m = arg;
> + struct xe_vm *vm = m->q->vm;
>
> - xe_vm_lock(m->q->vm, false);
> + xe_vm_lock(vm, false);
> xe_bo_unpin(m->pt_bo);
> - xe_vm_unlock(m->q->vm);
> + xe_vm_unlock(vm);
>
> dma_fence_put(m->fence);
> xe_bo_put(m->pt_bo);
> drm_suballoc_manager_fini(&m->vm_update_sa);
> mutex_destroy(&m->job_mutex);
> - xe_vm_close_and_put(m->q->vm);
> xe_exec_queue_put(m->q);
> + xe_vm_close_and_put(vm);
> }
>
> static u64 xe_migrate_vm_addr(u64 slot, u32 level)
> --
> 2.34.1
>
^ permalink raw reply [flat|nested] 19+ messages in thread
* Re: CI.KUnit: failure for Fix a couple of wedge corner-case memory leaks (rev6)
2025-10-27 18:18 ` Summers, Stuart
@ 2025-10-27 20:10 ` Summers, Stuart
0 siblings, 0 replies; 19+ messages in thread
From: Summers, Stuart @ 2025-10-27 20:10 UTC (permalink / raw)
To: intel-xe@lists.freedesktop.org
On Mon, 2025-10-27 at 18:18 +0000, Summers, Stuart wrote:
> On Mon, 2025-10-27 at 18:17 +0000, Patchwork wrote:
> > == Series Details ==
> >
> > Series: Fix a couple of wedge corner-case memory leaks (rev6)
> > URL : https://patchwork.freedesktop.org/series/155352/
> > State : failure
> >
> > == Summary ==
> >
> > + trap cleanup EXIT
> > + /kernel/tools/testing/kunit/kunit.py run --kunitconfig
> > /kernel/drivers/gpu/drm/xe/.kunitconfig
> > ERROR:root:In file included from ../include/linux/bitfield.h:10,
> > from ../drivers/gpu/drm/xe/xe_guc_submit.c:8:
> > ../drivers/gpu/drm/xe/xe_guc_submit.c: In function
> > ‘guc_submit_wedged_fini’:
> > ../drivers/gpu/drm/xe/xe_guc_submit.c:300:31: error: implicit
> > declaration of function ‘drm_sched_is_stopped’; did you mean
> > ‘drm_sched_stop’? [-Werror=implicit-function-declaration]
>
> Hm.. I didn't see this when testing locally. I'll double check here.
Ah right, this comes with
https://patchwork.freedesktop.org/patch/681606/?series=155315&rev=3
which isn't available yet. I'm working on reviewing that series as
well. I still think it's worth getting feedback on this series in
parallel, but let me know if there are any concerns there.
Thanks,
Stuart
>
> -Stuart
>
> > 300 | !drm_sched_is_stopped(&q->guc->sched.base));
> > | ^~~~~~~~~~~~~~~~~~~~
> > ../include/linux/build_bug.h:30:63: note: in definition of macro ‘BUILD_BUG_ON_INVALID’
> > 30 | #define BUILD_BUG_ON_INVALID(e) ((void)(sizeof((__force long)(e))))
> > | ^
> > ../drivers/gpu/drm/xe/xe_assert.h:112:9: note: in expansion of macro ‘__xe_assert_msg’
> > 112 | __xe_assert_msg(__xe, condition, \
> > | ^~~~~~~~~~~~~~~
> > ../drivers/gpu/drm/xe/xe_assert.h:148:9: note: in expansion of macro ‘xe_assert_msg’
> > 148 | xe_assert_msg(tile_to_xe(__tile), condition, "tile: %u VRAM %s\n" msg, \
> > | ^~~~~~~~~~~~~
> > ../drivers/gpu/drm/xe/xe_assert.h:172:9: note: in expansion of macro ‘xe_tile_assert_msg’
> > 172 | xe_tile_assert_msg(gt_to_tile(__gt), condition, "GT: %u type %d\n" msg, \
> > | ^~~~~~~~~~~~~~~~~~
> > ../drivers/gpu/drm/xe/xe_assert.h:169:37: note: in expansion of macro ‘xe_gt_assert_msg’
> > 169 | #define xe_gt_assert(gt, condition) xe_gt_assert_msg((gt), condition, "")
> > | ^~~~~~~~~~~~~~~~
> > ../drivers/gpu/drm/xe/xe_guc_submit.c:299:17: note: in expansion of macro ‘xe_gt_assert’
> > 299 | xe_gt_assert(guc_to_gt(guc),
> > | ^~~~~~~~~~~~
> > cc1: some warnings being treated as errors
> > make[7]: *** [../scripts/Makefile.build:287:
> > drivers/gpu/drm/xe/xe_guc_submit.o] Error 1
> > make[7]: *** Waiting for unfinished jobs....
> > make[6]: *** [../scripts/Makefile.build:556: drivers/gpu/drm/xe]
> > Error 2
> > make[5]: *** [../scripts/Makefile.build:556: drivers/gpu/drm] Error
> > 2
> > make[4]: *** [../scripts/Makefile.build:556: drivers/gpu] Error 2
> > make[3]: *** [../scripts/Makefile.build:556: drivers] Error 2
> > make[2]: *** [/kernel/Makefile:2010: .] Error 2
> > make[1]: *** [/kernel/Makefile:248: __sub-make] Error 2
> > make: *** [Makefile:248: __sub-make] Error 2
> >
> > [18:16:47] Configuring KUnit Kernel ...
> > Generating .config ...
> > Populating config with:
> > $ make ARCH=um O=.kunit olddefconfig
> > [18:16:51] Building KUnit Kernel ...
> > Populating config with:
> > $ make ARCH=um O=.kunit olddefconfig
> > Building with:
> > $ make all compile_commands.json scripts_gdb ARCH=um O=.kunit --
> > jobs=48
> > + cleanup
> > ++ stat -c %u:%g /kernel
> > + chown -R 1003:1003 /kernel
> >
> >
>
^ permalink raw reply [flat|nested] 19+ messages in thread
* Re: [PATCH 1/6] drm/xe: Add additional trace points for LRCs
2025-10-27 18:04 ` [PATCH 1/6] drm/xe: Add additional trace points for LRCs Stuart Summers
@ 2025-10-28 19:46 ` Matt Atwood
2025-10-28 20:11 ` Summers, Stuart
0 siblings, 1 reply; 19+ messages in thread
From: Matt Atwood @ 2025-10-28 19:46 UTC (permalink / raw)
To: Stuart Summers
Cc: intel-xe, matthew.brost, niranjana.vishwanathapura, zhanjun.dong,
shuicheng.lin
On Mon, Oct 27, 2025 at 06:04:07PM +0000, Stuart Summers wrote:
> Add trace points to indicate when an LRC has been
> created and destroyed, and when a reference is
> taken (get) or dropped (put).
>
> Signed-off-by: Stuart Summers <stuart.summers@intel.com>
> ---
> drivers/gpu/drm/xe/xe_lrc.c | 4 +++
> drivers/gpu/drm/xe/xe_lrc.h | 3 +++
> drivers/gpu/drm/xe/xe_trace_lrc.h | 42 ++++++++++++++++++++++++++++++-
> 3 files changed, 48 insertions(+), 1 deletion(-)
>
> diff --git a/drivers/gpu/drm/xe/xe_lrc.c b/drivers/gpu/drm/xe/xe_lrc.c
> index b5083c99dd50..42d1c861fe18 100644
> --- a/drivers/gpu/drm/xe/xe_lrc.c
> +++ b/drivers/gpu/drm/xe/xe_lrc.c
> @@ -1575,6 +1575,8 @@ struct xe_lrc *xe_lrc_create(struct xe_hw_engine *hwe, struct xe_vm *vm,
> return ERR_PTR(err);
> }
>
> + trace_xe_lrc_create(lrc);
> +
> return lrc;
> }
>
> @@ -1589,6 +1591,8 @@ void xe_lrc_destroy(struct kref *ref)
> {
> struct xe_lrc *lrc = container_of(ref, struct xe_lrc, refcount);
>
> + trace_xe_lrc_destroy(lrc);
> +
> xe_lrc_finish(lrc);
> kfree(lrc);
> }
> diff --git a/drivers/gpu/drm/xe/xe_lrc.h b/drivers/gpu/drm/xe/xe_lrc.h
> index 2fb628da5c43..fd67810f9812 100644
> --- a/drivers/gpu/drm/xe/xe_lrc.h
> +++ b/drivers/gpu/drm/xe/xe_lrc.h
> @@ -8,6 +8,7 @@
> #include <linux/types.h>
>
> #include "xe_lrc_types.h"
> +#include "xe_trace_lrc.h"
>
> struct drm_printer;
> struct xe_bb;
> @@ -61,6 +62,7 @@ void xe_lrc_destroy(struct kref *ref);
> static inline struct xe_lrc *xe_lrc_get(struct xe_lrc *lrc)
> {
> kref_get(&lrc->refcount);
> + trace_xe_lrc_get(lrc);
> return lrc;
> }
>
> @@ -73,6 +75,7 @@ static inline struct xe_lrc *xe_lrc_get(struct xe_lrc *lrc)
> */
> static inline void xe_lrc_put(struct xe_lrc *lrc)
> {
> + trace_xe_lrc_put(lrc);
> kref_put(&lrc->refcount, xe_lrc_destroy);
> }
>
> diff --git a/drivers/gpu/drm/xe/xe_trace_lrc.h b/drivers/gpu/drm/xe/xe_trace_lrc.h
> index d525cbee1e34..e8daa5d323e7 100644
> --- a/drivers/gpu/drm/xe/xe_trace_lrc.h
> +++ b/drivers/gpu/drm/xe/xe_trace_lrc.h
> @@ -13,7 +13,6 @@
> #include <linux/types.h>
>
> #include "xe_gt_types.h"
> -#include "xe_lrc.h"
> #include "xe_lrc_types.h"
>
> #define __dev_name_lrc(lrc) dev_name(gt_to_xe((lrc)->fence_ctx.gt)->drm.dev)
> @@ -42,6 +41,47 @@ TRACE_EVENT(xe_lrc_update_timestamp,
> __get_str(device_id))
> );
>
> +DECLARE_EVENT_CLASS(xe_lrc,
> + TP_PROTO(struct xe_lrc *lrc),
> + TP_ARGS(lrc),
> +
> + TP_STRUCT__entry(
This looks wrong style-wise, but it is done this way in every
instance where we do anything like this. Maybe Matt Brost can chime
in on why the style is what it is; if a change is suggested, it can
be done in a follow-up patch.
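For reference, the alignment checkpatch asks for (the
CHECK:PARENTHESIS_ALIGNMENT notes earlier in this thread) would look
roughly like this - a sketch, not a change proposed by this series:
	DECLARE_EVENT_CLASS(xe_lrc,
			    TP_PROTO(struct xe_lrc *lrc),
			    TP_ARGS(lrc),
			    ...
	);
The xe trace headers instead indent the TP_* arguments by a fixed
amount under the macro name, so every new event class inherits the
same checkpatch note.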
Reviewed-by: Matt Atwood <matthew.s.atwood@intel.com>
> + __field(struct xe_lrc *, lrc)
> + __string(name, lrc->fence_ctx.name)
> + __string(device_id, __dev_name_lrc(lrc))
> + ),
> +
> + TP_fast_assign(
> + __entry->lrc = lrc;
> + __assign_str(name);
> + __assign_str(device_id);
> + ),
> +
> + TP_printk("lrc=:%p lrc->name=%s device_id:%s",
> + __entry->lrc, __get_str(name),
> + __get_str(device_id))
> +);
> +
> +DEFINE_EVENT(xe_lrc, xe_lrc_create,
> + TP_PROTO(struct xe_lrc *lrc),
> + TP_ARGS(lrc)
> +);
> +
> +DEFINE_EVENT(xe_lrc, xe_lrc_destroy,
> + TP_PROTO(struct xe_lrc *lrc),
> + TP_ARGS(lrc)
> +);
> +
> +DEFINE_EVENT(xe_lrc, xe_lrc_get,
> + TP_PROTO(struct xe_lrc *lrc),
> + TP_ARGS(lrc)
> +);
> +
> +DEFINE_EVENT(xe_lrc, xe_lrc_put,
> + TP_PROTO(struct xe_lrc *lrc),
> + TP_ARGS(lrc)
> +);
> +
> #endif
>
> /* This part must be outside protection */
> --
> 2.34.1
>
^ permalink raw reply [flat|nested] 19+ messages in thread
* Re: [PATCH 2/6] drm/xe: Add a trace point for VM close
2025-10-27 18:04 ` [PATCH 2/6] drm/xe: Add a trace point for VM close Stuart Summers
@ 2025-10-28 20:03 ` Matt Atwood
2025-10-28 20:10 ` Summers, Stuart
0 siblings, 1 reply; 19+ messages in thread
From: Matt Atwood @ 2025-10-28 20:03 UTC (permalink / raw)
To: Stuart Summers
Cc: intel-xe, matthew.brost, niranjana.vishwanathapura, zhanjun.dong,
shuicheng.lin
On Mon, Oct 27, 2025 at 06:04:08PM +0000, Stuart Summers wrote:
> All better tracking of error cases when calling through
s/all/add
> xe_vm_close_and_put().
>
> Signed-off-by: Stuart Summers <stuart.summers@intel.com>
> ---
> drivers/gpu/drm/xe/xe_trace_bo.h | 6 ++++++
> drivers/gpu/drm/xe/xe_vm.c | 2 ++
> 2 files changed, 8 insertions(+)
>
> diff --git a/drivers/gpu/drm/xe/xe_trace_bo.h b/drivers/gpu/drm/xe/xe_trace_bo.h
> index 86323cf3be2c..238311cfb816 100644
> --- a/drivers/gpu/drm/xe/xe_trace_bo.h
> +++ b/drivers/gpu/drm/xe/xe_trace_bo.h
> @@ -14,6 +14,7 @@
>
> #include "xe_bo.h"
> #include "xe_bo_types.h"
> +#include "xe_exec_queue_types.h"
This doesn't appear to be required for this patch and should instead
be included in the patch titled "Add new exec queue trace points".
With its removal.
Reviewed-by: Matt Atwood <matthew.s.atwood@intel.com>
> #include "xe_vm.h"
>
> #define __dev_name_bo(bo) dev_name(xe_bo_device(bo)->drm.dev)
> @@ -223,6 +224,11 @@ DEFINE_EVENT(xe_vm, xe_vm_free,
> TP_ARGS(vm)
> );
>
> +DEFINE_EVENT(xe_vm, xe_vm_close,
> + TP_PROTO(struct xe_vm *vm),
> + TP_ARGS(vm)
> +);
> +
> DEFINE_EVENT(xe_vm, xe_vm_cpu_bind,
> TP_PROTO(struct xe_vm *vm),
> TP_ARGS(vm)
> diff --git a/drivers/gpu/drm/xe/xe_vm.c b/drivers/gpu/drm/xe/xe_vm.c
> index 00f3520dec38..e383982bad8a 100644
> --- a/drivers/gpu/drm/xe/xe_vm.c
> +++ b/drivers/gpu/drm/xe/xe_vm.c
> @@ -1672,6 +1672,8 @@ static void xe_vm_close(struct xe_vm *vm)
> bool bound;
> int idx;
>
> + trace_xe_vm_close(vm);
> +
> bound = drm_dev_enter(&xe->drm, &idx);
>
> down_write(&vm->lock);
> --
> 2.34.1
>
^ permalink raw reply [flat|nested] 19+ messages in thread
* Re: [PATCH 2/6] drm/xe: Add a trace point for VM close
2025-10-28 20:03 ` Matt Atwood
@ 2025-10-28 20:10 ` Summers, Stuart
0 siblings, 0 replies; 19+ messages in thread
From: Summers, Stuart @ 2025-10-28 20:10 UTC (permalink / raw)
To: Atwood, Matthew S
Cc: intel-xe@lists.freedesktop.org, Brost, Matthew, Lin, Shuicheng,
Vishwanathapura, Niranjana, Dong, Zhanjun
On Tue, 2025-10-28 at 13:03 -0700, Matt Atwood wrote:
> On Mon, Oct 27, 2025 at 06:04:08PM +0000, Stuart Summers wrote:
> > All better tracking of error cases when calling through
> s/all/add
Ok
> > xe_vm_close_and_put().
> >
> > Signed-off-by: Stuart Summers <stuart.summers@intel.com>
> > ---
> > drivers/gpu/drm/xe/xe_trace_bo.h | 6 ++++++
> > drivers/gpu/drm/xe/xe_vm.c | 2 ++
> > 2 files changed, 8 insertions(+)
> >
> > diff --git a/drivers/gpu/drm/xe/xe_trace_bo.h
> > b/drivers/gpu/drm/xe/xe_trace_bo.h
> > index 86323cf3be2c..238311cfb816 100644
> > --- a/drivers/gpu/drm/xe/xe_trace_bo.h
> > +++ b/drivers/gpu/drm/xe/xe_trace_bo.h
> > @@ -14,6 +14,7 @@
> >
> > #include "xe_bo.h"
> > #include "xe_bo_types.h"
> > +#include "xe_exec_queue_types.h"
> This doesn't appear to be required for this patch and should instead
> be included in the patch titled "Add new exec queue trace points".
Hm.. yeah this must have been left over from some debug I was doing.
Sorry about that! I'll make sure to clean these things up in the next
post.
Thanks,
Stuart
>
> With its removal.
>
> Reviewed-by: Matt Atwood <matthew.s.atwood@intel.com>
> > #include "xe_vm.h"
> >
> > #define __dev_name_bo(bo) dev_name(xe_bo_device(bo)->drm.dev)
> > @@ -223,6 +224,11 @@ DEFINE_EVENT(xe_vm, xe_vm_free,
> > TP_ARGS(vm)
> > );
> >
> > +DEFINE_EVENT(xe_vm, xe_vm_close,
> > + TP_PROTO(struct xe_vm *vm),
> > + TP_ARGS(vm)
> > +);
> > +
> > DEFINE_EVENT(xe_vm, xe_vm_cpu_bind,
> > TP_PROTO(struct xe_vm *vm),
> > TP_ARGS(vm)
> > diff --git a/drivers/gpu/drm/xe/xe_vm.c
> > b/drivers/gpu/drm/xe/xe_vm.c
> > index 00f3520dec38..e383982bad8a 100644
> > --- a/drivers/gpu/drm/xe/xe_vm.c
> > +++ b/drivers/gpu/drm/xe/xe_vm.c
> > @@ -1672,6 +1672,8 @@ static void xe_vm_close(struct xe_vm *vm)
> > bool bound;
> > int idx;
> >
> > + trace_xe_vm_close(vm);
> > +
> > bound = drm_dev_enter(&xe->drm, &idx);
> >
> > down_write(&vm->lock);
> > --
> > 2.34.1
> >
^ permalink raw reply [flat|nested] 19+ messages in thread
* Re: [PATCH 1/6] drm/xe: Add additional trace points for LRCs
2025-10-28 19:46 ` Matt Atwood
@ 2025-10-28 20:11 ` Summers, Stuart
0 siblings, 0 replies; 19+ messages in thread
From: Summers, Stuart @ 2025-10-28 20:11 UTC (permalink / raw)
To: Atwood, Matthew S
Cc: intel-xe@lists.freedesktop.org, Brost, Matthew, Lin, Shuicheng,
Vishwanathapura, Niranjana, Dong, Zhanjun
On Tue, 2025-10-28 at 12:46 -0700, Matt Atwood wrote:
> On Mon, Oct 27, 2025 at 06:04:07PM +0000, Stuart Summers wrote:
> > Add trace points to indicate when an LRC has been
> > created and destroyed, and when a reference is
> > taken (get) or dropped (put).
> >
> > Signed-off-by: Stuart Summers <stuart.summers@intel.com>
> > ---
> > drivers/gpu/drm/xe/xe_lrc.c | 4 +++
> > drivers/gpu/drm/xe/xe_lrc.h | 3 +++
> > drivers/gpu/drm/xe/xe_trace_lrc.h | 42
> > ++++++++++++++++++++++++++++++-
> > 3 files changed, 48 insertions(+), 1 deletion(-)
> >
> > diff --git a/drivers/gpu/drm/xe/xe_lrc.c
> > b/drivers/gpu/drm/xe/xe_lrc.c
> > index b5083c99dd50..42d1c861fe18 100644
> > --- a/drivers/gpu/drm/xe/xe_lrc.c
> > +++ b/drivers/gpu/drm/xe/xe_lrc.c
> > @@ -1575,6 +1575,8 @@ struct xe_lrc *xe_lrc_create(struct
> > xe_hw_engine *hwe, struct xe_vm *vm,
> > return ERR_PTR(err);
> > }
> >
> > + trace_xe_lrc_create(lrc);
> > +
> > return lrc;
> > }
> >
> > @@ -1589,6 +1591,8 @@ void xe_lrc_destroy(struct kref *ref)
> > {
> > struct xe_lrc *lrc = container_of(ref, struct xe_lrc,
> > refcount);
> >
> > + trace_xe_lrc_destroy(lrc);
> > +
> > xe_lrc_finish(lrc);
> > kfree(lrc);
> > }
> > diff --git a/drivers/gpu/drm/xe/xe_lrc.h
> > b/drivers/gpu/drm/xe/xe_lrc.h
> > index 2fb628da5c43..fd67810f9812 100644
> > --- a/drivers/gpu/drm/xe/xe_lrc.h
> > +++ b/drivers/gpu/drm/xe/xe_lrc.h
> > @@ -8,6 +8,7 @@
> > #include <linux/types.h>
> >
> > #include "xe_lrc_types.h"
> > +#include "xe_trace_lrc.h"
> >
> > struct drm_printer;
> > struct xe_bb;
> > @@ -61,6 +62,7 @@ void xe_lrc_destroy(struct kref *ref);
> > static inline struct xe_lrc *xe_lrc_get(struct xe_lrc *lrc)
> > {
> > kref_get(&lrc->refcount);
> > + trace_xe_lrc_get(lrc);
> > return lrc;
> > }
> >
> > @@ -73,6 +75,7 @@ static inline struct xe_lrc *xe_lrc_get(struct
> > xe_lrc *lrc)
> > */
> > static inline void xe_lrc_put(struct xe_lrc *lrc)
> > {
> > + trace_xe_lrc_put(lrc);
> > kref_put(&lrc->refcount, xe_lrc_destroy);
> > }
> >
> > diff --git a/drivers/gpu/drm/xe/xe_trace_lrc.h
> > b/drivers/gpu/drm/xe/xe_trace_lrc.h
> > index d525cbee1e34..e8daa5d323e7 100644
> > --- a/drivers/gpu/drm/xe/xe_trace_lrc.h
> > +++ b/drivers/gpu/drm/xe/xe_trace_lrc.h
> > @@ -13,7 +13,6 @@
> > #include <linux/types.h>
> >
> > #include "xe_gt_types.h"
> > -#include "xe_lrc.h"
> > #include "xe_lrc_types.h"
> >
> > #define __dev_name_lrc(lrc) dev_name(gt_to_xe((lrc)-
> > >fence_ctx.gt)->drm.dev)
> > @@ -42,6 +41,47 @@ TRACE_EVENT(xe_lrc_update_timestamp,
> > __get_str(device_id))
> > );
> >
> > +DECLARE_EVENT_CLASS(xe_lrc,
> > + TP_PROTO(struct xe_lrc *lrc),
> > + TP_ARGS(lrc),
> > +
> > + TP_STRUCT__entry(
> This looks wrong style-wise, but it is done this way in every
> instance where we do anything like this. Maybe Matt Brost can chime
> in on why the style is what it is; if a change is suggested, it can
> be done in a follow-up patch.
Yeah, I had taken this from one of the existing tracefs entries. I agree
it doesn't look right. I'd like to pull that out separately, though, if
possible.
>
> Reviewed-by: Matt Atwood <matthew.s.atwood@intel.com>
Thanks for the review!
-Stuart
> > + __field(struct xe_lrc *, lrc)
> > + __string(name, lrc->fence_ctx.name)
> > + __string(device_id, __dev_name_lrc(lrc))
> > + ),
> > +
> > + TP_fast_assign(
> > + __entry->lrc = lrc;
> > + __assign_str(name);
> > + __assign_str(device_id);
> > + ),
> > +
> > + TP_printk("lrc=:%p lrc->name=%s device_id:%s",
> > + __entry->lrc, __get_str(name),
> > + __get_str(device_id))
> > +);
> > +
> > +DEFINE_EVENT(xe_lrc, xe_lrc_create,
> > + TP_PROTO(struct xe_lrc *lrc),
> > + TP_ARGS(lrc)
> > +);
> > +
> > +DEFINE_EVENT(xe_lrc, xe_lrc_destroy,
> > + TP_PROTO(struct xe_lrc *lrc),
> > + TP_ARGS(lrc)
> > +);
> > +
> > +DEFINE_EVENT(xe_lrc, xe_lrc_get,
> > + TP_PROTO(struct xe_lrc *lrc),
> > + TP_ARGS(lrc)
> > +);
> > +
> > +DEFINE_EVENT(xe_lrc, xe_lrc_put,
> > + TP_PROTO(struct xe_lrc *lrc),
> > + TP_ARGS(lrc)
> > +);
> > +
> > #endif
> >
> > /* This part must be outside protection */
> > --
> > 2.34.1
> >
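For readers chasing refcount imbalances with these events, the pattern
above reduces to the following sketch (reconstructed from the hunks in
this patch, with the kerneldoc and surrounding code omitted):

	static inline struct xe_lrc *xe_lrc_get(struct xe_lrc *lrc)
	{
		kref_get(&lrc->refcount);
		trace_xe_lrc_get(lrc);	/* fires with the reference already held */
		return lrc;
	}

	static inline void xe_lrc_put(struct xe_lrc *lrc)
	{
		trace_xe_lrc_put(lrc);	/* fires while lrc is still valid */
		kref_put(&lrc->refcount, xe_lrc_destroy);
	}

An unmatched xe_lrc_get/xe_lrc_put count for a given lrc pointer between
its xe_lrc_create and xe_lrc_destroy events points at a leaked or
over-dropped reference.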
* Re: [PATCH 3/6] drm/xe: Add the BO pointer info to the BO trace
2025-10-27 18:04 ` [PATCH 3/6] drm/xe: Add the BO pointer info to the BO trace Stuart Summers
@ 2025-10-28 20:11 ` Matt Atwood
0 siblings, 0 replies; 19+ messages in thread
From: Matt Atwood @ 2025-10-28 20:11 UTC (permalink / raw)
To: Stuart Summers
Cc: intel-xe, matthew.brost, niranjana.vishwanathapura, zhanjun.dong,
shuicheng.lin
On Mon, Oct 27, 2025 at 06:04:09PM +0000, Stuart Summers wrote:
> Just a little extra detail to make BOs easier to track
> through the trace event log.
>
Reviewed-by: Matt Atwood <matthew.s.atwood@intel.com>
> Signed-off-by: Stuart Summers <stuart.summers@intel.com>
> ---
> drivers/gpu/drm/xe/xe_trace_bo.h | 6 ++++--
> 1 file changed, 4 insertions(+), 2 deletions(-)
>
> diff --git a/drivers/gpu/drm/xe/xe_trace_bo.h b/drivers/gpu/drm/xe/xe_trace_bo.h
> index 238311cfb816..0d795733e0f0 100644
> --- a/drivers/gpu/drm/xe/xe_trace_bo.h
> +++ b/drivers/gpu/drm/xe/xe_trace_bo.h
> @@ -27,6 +27,7 @@ DECLARE_EVENT_CLASS(xe_bo,
>
> TP_STRUCT__entry(
> __string(dev, __dev_name_bo(bo))
> + __field(struct xe_bo *, bo)
> __field(size_t, size)
> __field(u32, flags)
> __field(struct xe_vm *, vm)
> @@ -34,13 +35,14 @@ DECLARE_EVENT_CLASS(xe_bo,
>
> TP_fast_assign(
> __assign_str(dev);
> + __entry->bo = bo;
> __entry->size = xe_bo_size(bo);
> __entry->flags = bo->flags;
> __entry->vm = bo->vm;
> ),
>
> - TP_printk("dev=%s, size=%zu, flags=0x%02x, vm=%p",
> - __get_str(dev), __entry->size,
> + TP_printk("dev=%s, %p, size=%zu, flags=0x%02x, vm=%p",
> + __get_str(dev), __entry->bo, __entry->size,
> __entry->flags, __entry->vm)
> );
>
> --
> 2.34.1
>
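Because the new pointer field lives in the shared xe_bo event class,
every event defined from that class picks it up automatically. A
hypothetical new event (xe_bo_example is an illustration of the class
reuse only, not part of the patch) would need nothing beyond:

	DEFINE_EVENT(xe_bo, xe_bo_example,
		     TP_PROTO(struct xe_bo *bo),
		     TP_ARGS(bo)
	);

One caveat when consuming the output: TP_printk() strings are formatted
at read time by the kernel's vsnprintf(), so a bare %p is printed as a
per-boot hashed value rather than a raw address. The hash is stable
within a boot, which is all that pointer-based correlation of BO events
requires.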
* Re: [PATCH 4/6] drm/xe: Add new exec queue trace points
2025-10-27 18:04 ` [PATCH 4/6] drm/xe: Add new exec queue trace points Stuart Summers
@ 2025-10-28 20:29 ` Matt Atwood
0 siblings, 0 replies; 19+ messages in thread
From: Matt Atwood @ 2025-10-28 20:29 UTC (permalink / raw)
To: Stuart Summers
Cc: intel-xe, matthew.brost, niranjana.vishwanathapura, zhanjun.dong,
shuicheng.lin
On Mon, Oct 27, 2025 at 06:04:10PM +0000, Stuart Summers wrote:
> Add exec queue and GuC exec queue trace points
> to separate out which part of the stack is executing.
>
> This is helpful because several of the GuC-specific
> paths rely on responses from the GuC; tracing these
> separately is useful when the GuC stops responding in
> the middle of an operation that would otherwise expect
> a response.
>
> Also add the exec queue pointer information to
> the trace events for easier tracking. Contexts (guc_ids)
> can get re-used, so this just makes grepping a little
> easier in this type of debugging.
>
Reviewed-by: Matt Atwood <matthew.s.atwood@intel.com>
> Signed-off-by: Stuart Summers <stuart.summers@intel.com>
> ---
> drivers/gpu/drm/xe/xe_exec_queue.c | 4 ++++
> drivers/gpu/drm/xe/xe_guc_submit.c | 11 +++++++----
> drivers/gpu/drm/xe/xe_trace.h | 22 ++++++++++++++++++++--
> 3 files changed, 31 insertions(+), 6 deletions(-)
>
> diff --git a/drivers/gpu/drm/xe/xe_exec_queue.c b/drivers/gpu/drm/xe/xe_exec_queue.c
> index 90cbc95f8e2e..a2ef381cf6d9 100644
> --- a/drivers/gpu/drm/xe/xe_exec_queue.c
> +++ b/drivers/gpu/drm/xe/xe_exec_queue.c
> @@ -377,6 +377,8 @@ void xe_exec_queue_destroy(struct kref *ref)
> struct xe_exec_queue *q = container_of(ref, struct xe_exec_queue, refcount);
> struct xe_exec_queue *eq, *next;
>
> + trace_xe_exec_queue_destroy(q);
> +
> if (xe_exec_queue_uses_pxp(q))
> xe_pxp_exec_queue_remove(gt_to_xe(q->gt)->pxp, q);
>
> @@ -953,6 +955,8 @@ void xe_exec_queue_kill(struct xe_exec_queue *q)
> {
> struct xe_exec_queue *eq = q, *next;
>
> + trace_xe_exec_queue_kill(q);
> +
> list_for_each_entry_safe(eq, next, &eq->multi_gt_list,
> multi_gt_link) {
> q->ops->kill(eq);
> diff --git a/drivers/gpu/drm/xe/xe_guc_submit.c b/drivers/gpu/drm/xe/xe_guc_submit.c
> index d4ffdb71ef3d..3f672355b3bb 100644
> --- a/drivers/gpu/drm/xe/xe_guc_submit.c
> +++ b/drivers/gpu/drm/xe/xe_guc_submit.c
> @@ -1004,9 +1004,12 @@ void xe_guc_submit_wedge(struct xe_guc *guc)
> }
>
> mutex_lock(&guc->submission_state.lock);
> - xa_for_each(&guc->submission_state.exec_queue_lookup, index, q)
> - if (xe_exec_queue_get_unless_zero(q))
> + xa_for_each(&guc->submission_state.exec_queue_lookup, index, q) {
> + if (xe_exec_queue_get_unless_zero(q)) {
> set_exec_queue_wedged(q);
> + trace_xe_exec_queue_wedge(q);
> + }
> + }
> mutex_unlock(&guc->submission_state.lock);
> }
>
> @@ -1456,7 +1459,7 @@ static void __guc_exec_queue_destroy_async(struct work_struct *w)
> struct xe_guc *guc = exec_queue_to_guc(q);
>
> xe_pm_runtime_get(guc_to_xe(guc));
> - trace_xe_exec_queue_destroy(q);
> + trace_xe_guc_exec_queue_destroy(q);
>
> if (xe_exec_queue_is_lr(q))
> cancel_work_sync(&ge->lr_tdr);
> @@ -1727,7 +1730,7 @@ static int guc_exec_queue_init(struct xe_exec_queue *q)
>
> static void guc_exec_queue_kill(struct xe_exec_queue *q)
> {
> - trace_xe_exec_queue_kill(q);
> + trace_xe_guc_exec_queue_kill(q);
> set_exec_queue_killed(q);
> __suspend_fence_signal(q);
> xe_guc_exec_queue_trigger_cleanup(q);
> diff --git a/drivers/gpu/drm/xe/xe_trace.h b/drivers/gpu/drm/xe/xe_trace.h
> index 314f42fcbcbd..a5dd0c48d894 100644
> --- a/drivers/gpu/drm/xe/xe_trace.h
> +++ b/drivers/gpu/drm/xe/xe_trace.h
> @@ -71,6 +71,7 @@ DECLARE_EVENT_CLASS(xe_exec_queue,
>
> TP_STRUCT__entry(
> __string(dev, __dev_name_eq(q))
> + __field(struct xe_exec_queue *, q)
> __field(enum xe_engine_class, class)
> __field(u32, logical_mask)
> __field(u8, gt_id)
> @@ -82,6 +83,7 @@ DECLARE_EVENT_CLASS(xe_exec_queue,
>
> TP_fast_assign(
> __assign_str(dev);
> + __entry->q = q;
> __entry->class = q->class;
> __entry->logical_mask = q->logical_mask;
> __entry->gt_id = q->gt->info.id;
> @@ -91,8 +93,9 @@ DECLARE_EVENT_CLASS(xe_exec_queue,
> __entry->flags = q->flags;
> ),
>
> - TP_printk("dev=%s, %d:0x%x, gt=%d, width=%d, guc_id=%d, guc_state=0x%x, flags=0x%x",
> - __get_str(dev), __entry->class, __entry->logical_mask,
> + TP_printk("dev=%s, %p, %d:0x%x, gt=%d, width=%d, guc_id=%d, guc_state=0x%x, flags=0x%x",
> + __get_str(dev), __entry->q,
> + __entry->class, __entry->logical_mask,
> __entry->gt_id, __entry->width, __entry->guc_id,
> __entry->guc_state, __entry->flags)
> );
> @@ -147,11 +150,21 @@ DEFINE_EVENT(xe_exec_queue, xe_exec_queue_close,
> TP_ARGS(q)
> );
>
> +DEFINE_EVENT(xe_exec_queue, xe_exec_queue_wedge,
> + TP_PROTO(struct xe_exec_queue *q),
> + TP_ARGS(q)
> +);
> +
> DEFINE_EVENT(xe_exec_queue, xe_exec_queue_kill,
> TP_PROTO(struct xe_exec_queue *q),
> TP_ARGS(q)
> );
>
> +DEFINE_EVENT(xe_exec_queue, xe_guc_exec_queue_kill,
> + TP_PROTO(struct xe_exec_queue *q),
> + TP_ARGS(q)
> +);
> +
> DEFINE_EVENT(xe_exec_queue, xe_exec_queue_cleanup_entity,
> TP_PROTO(struct xe_exec_queue *q),
> TP_ARGS(q)
> @@ -162,6 +175,11 @@ DEFINE_EVENT(xe_exec_queue, xe_exec_queue_destroy,
> TP_ARGS(q)
> );
>
> +DEFINE_EVENT(xe_exec_queue, xe_guc_exec_queue_destroy,
> + TP_PROTO(struct xe_exec_queue *q),
> + TP_ARGS(q)
> +);
> +
> DEFINE_EVENT(xe_exec_queue, xe_exec_queue_reset,
> TP_PROTO(struct xe_exec_queue *q),
> TP_ARGS(q)
> --
> 2.34.1
>
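The split between the xe_exec_queue_* and xe_guc_exec_queue_* events
mirrors the call flow visible in the hunks above (a simplified sketch;
the elided code is unchanged by the patch):

	void xe_exec_queue_kill(struct xe_exec_queue *q)
	{
		trace_xe_exec_queue_kill(q);		/* frontend: kill requested */
		...
		q->ops->kill(eq);			/* dispatch to the backend */
	}

	static void guc_exec_queue_kill(struct xe_exec_queue *q)
	{
		trace_xe_guc_exec_queue_kill(q);	/* backend: GuC path entered */
		...
	}

A frontend event with no matching xe_guc_* counterpart in the trace
localizes a stall to the GuC interaction rather than to the core
submission code, which is exactly the debug scenario the commit message
describes.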
end of thread

Thread overview: 19+ messages
2025-10-27 18:04 [PATCH 0/6] Fix a couple of wedge corner-case memory leaks Stuart Summers
2025-10-27 18:04 ` [PATCH 1/6] drm/xe: Add additional trace points for LRCs Stuart Summers
2025-10-28 19:46 ` Matt Atwood
2025-10-28 20:11 ` Summers, Stuart
2025-10-27 18:04 ` [PATCH 2/6] drm/xe: Add a trace point for VM close Stuart Summers
2025-10-28 20:03 ` Matt Atwood
2025-10-28 20:10 ` Summers, Stuart
2025-10-27 18:04 ` [PATCH 3/6] drm/xe: Add the BO pointer info to the BO trace Stuart Summers
2025-10-28 20:11 ` Matt Atwood
2025-10-27 18:04 ` [PATCH 4/6] drm/xe: Add new exec queue trace points Stuart Summers
2025-10-28 20:29 ` Matt Atwood
2025-10-27 18:04 ` [PATCH 5/6] drm/xe: Correct migration VM teardown order Stuart Summers
2025-10-27 19:46 ` Matthew Brost
2025-10-27 18:04 ` [PATCH 6/6] drm/xe: Clean up GuC software state after a wedge Stuart Summers
2025-10-27 18:16 ` ✗ CI.checkpatch: warning for Fix a couple of wedge corner-case memory leaks (rev6) Patchwork
2025-10-27 18:17 ` ✗ CI.KUnit: failure " Patchwork
2025-10-27 18:18 ` Summers, Stuart
2025-10-27 20:10 ` Summers, Stuart
-- strict thread matches above, loose matches on Subject: below --
2025-10-14 18:09 [PATCH 0/6] Fix a couple of wedge corner-case memory leaks Stuart Summers
2025-10-14 18:09 ` [PATCH 4/6] drm/xe: Add new exec queue trace points Stuart Summers