From: Stuart Summers <stuart.summers@intel.com>
Cc: intel-xe@lists.freedesktop.org, matthew.brost@intel.com,
niranjana.vishwanathapura@intel.com, zhanjun.dong@intel.com,
shuicheng.lin@intel.com,
Stuart Summers <stuart.summers@intel.com>
Subject: [PATCH 6/7] drm/xe: Clean up GuC software state after a wedge
Date: Mon, 20 Oct 2025 21:45:28 +0000 [thread overview]
Message-ID: <20251020214529.354365-7-stuart.summers@intel.com> (raw)
In-Reply-To: <20251020214529.354365-1-stuart.summers@intel.com>
When the driver is wedged following a hardware failure, the
queue kill triggered by those events can race with either the
scheduler teardown or the queue deregistration with GuC.
Concretely, the following two scenarios can occur (from the
event trace):
Scheduler start missing:
xe_exec_queue_create
xe_exec_queue_kill
xe_guc_exec_queue_kill
xe_exec_queue_destroy
GuC CT response missing:
xe_exec_queue_create
xe_exec_queue_register
xe_exec_queue_scheduling_enable
xe_exec_queue_scheduling_done
xe_exec_queue_kill
xe_guc_exec_queue_kill
xe_exec_queue_close
xe_exec_queue_destroy
xe_exec_queue_cleanup_entity
xe_exec_queue_scheduling_disable
The above traces also depend on the inclusion of [1].
In the first scenario, the queue is created but killed before
the message cleanup completes. In the second, we go through a
full registration before the kill; the CT communication happens
in that final call to xe_exec_queue_scheduling_disable.
In both cases we would expect a subsequent call to
xe_guc_exec_queue_destroy once the scheduler/GuC CT
communication completed, but that call never arrives, so the
LRC/BO cleanup for the exec queues in question is skipped.
Since this sequence appears specific to the wedge case
described above, add a targeted scheduler start and GuC
deregistration handler to the wedged_fini() routine.
Without this change, injecting wedges in the above scenarios
produces the following errors when DRM memory tracking is
enabled (see CONFIG_DRM_DEBUG_MM):
[ 129.600285] [drm:drm_mm_takedown] *ERROR* node [00647000 + 00008000]: inserted at
drm_mm_insert_node_in_range+0x2ec/0x4b0
__xe_ggtt_insert_bo_at+0x10f/0x360 [xe]
__xe_bo_create_locked+0x184/0x520 [xe]
xe_bo_create_pin_map_at_aligned+0x3b/0x180 [xe]
xe_bo_create_pin_map+0x13/0x20 [xe]
xe_lrc_create+0x139/0x18e0 [xe]
xe_exec_queue_create+0x22f/0x3e0 [xe]
xe_exec_queue_create_ioctl+0x4e9/0xbf0 [xe]
drm_ioctl_kernel+0x9f/0xf0
drm_ioctl+0x20f/0x440
xe_drm_ioctl+0x121/0x150 [xe]
__x64_sys_ioctl+0x8c/0xe0
do_syscall_64+0x4c/0x1d0
entry_SYSCALL_64_after_hwframe+0x76/0x7e
[ 129.601966] [drm:drm_mm_takedown] *ERROR* node [0064f000 + 00008000]: inserted at
drm_mm_insert_node_in_range+0x2ec/0x4b0
__xe_ggtt_insert_bo_at+0x10f/0x360 [xe]
__xe_bo_create_locked+0x184/0x520 [xe]
xe_bo_create_pin_map_at_aligned+0x3b/0x180 [xe]
xe_bo_create_pin_map+0x13/0x20 [xe]
xe_lrc_create+0x139/0x18e0 [xe]
xe_exec_queue_create+0x22f/0x3e0 [xe]
xe_exec_queue_create_bind+0x7f/0xd0 [xe]
xe_vm_create+0x4aa/0x8b0 [xe]
xe_vm_create_ioctl+0x17b/0x420 [xe]
drm_ioctl_kernel+0x9f/0xf0
drm_ioctl+0x20f/0x440
xe_drm_ioctl+0x121/0x150 [xe]
__x64_sys_ioctl+0x8c/0xe0
do_syscall_64+0x4c/0x1d0
entry_SYSCALL_64_after_hwframe+0x76/0x7e
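For reference, the allocator leak reports quoted above come from the
DRM range-manager debug option mentioned in the text; a minimal config
fragment (exact dependencies vary by tree):

```
CONFIG_DRM_DEBUG_MM=y
```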
[1] https://patchwork.freedesktop.org/patch/680852/?series=155352&rev=4
Signed-off-by: Stuart Summers <stuart.summers@intel.com>
---
drivers/gpu/drm/xe/xe_guc_submit.c | 12 ++++++++++++
1 file changed, 12 insertions(+)
diff --git a/drivers/gpu/drm/xe/xe_guc_submit.c b/drivers/gpu/drm/xe/xe_guc_submit.c
index 5ec1e4a83d68..a11ae4e70809 100644
--- a/drivers/gpu/drm/xe/xe_guc_submit.c
+++ b/drivers/gpu/drm/xe/xe_guc_submit.c
@@ -287,6 +287,8 @@ static void guc_submit_fini(struct drm_device *drm, void *arg)
xa_destroy(&guc->submission_state.exec_queue_lookup);
}
+static void __guc_exec_queue_destroy(struct xe_guc *guc, struct xe_exec_queue *q);
+
static void guc_submit_wedged_fini(void *arg)
{
struct xe_guc *guc = arg;
@@ -299,6 +301,16 @@ static void guc_submit_wedged_fini(void *arg)
mutex_unlock(&guc->submission_state.lock);
xe_exec_queue_put(q);
mutex_lock(&guc->submission_state.lock);
+ } else {
+ /*
+ * Make sure queues which were killed as part of a
+ * wedge are cleaned up properly. Clean up any
+ * dangling scheduler tasks and pending exec queue
+ * deregistration.
+ */
+ xe_sched_submission_start(&q->guc->sched);
+ if (exec_queue_pending_disable(q))
+ __guc_exec_queue_destroy(guc, q);
}
}
mutex_unlock(&guc->submission_state.lock);
--
2.34.1
Thread overview: 15+ messages
2025-10-20 21:45 [PATCH 0/7] Fix a couple of wedge corner-case memory leaks Stuart Summers
2025-10-20 21:45 ` [PATCH 1/7] drm/xe: Add additional trace points for LRCs Stuart Summers
2025-10-20 21:45 ` [PATCH 2/7] drm/xe: Add a trace point for VM close Stuart Summers
2025-10-20 21:45 ` [PATCH 3/7] drm/xe: Add the BO pointer info to the BO trace Stuart Summers
2025-10-20 21:45 ` [PATCH 4/7] drm/xe: Add new exec queue trace points Stuart Summers
2025-10-20 21:45 ` [PATCH 5/7] drm/xe: Correct migration VM teardown order Stuart Summers
2025-10-22 20:30 ` Matthew Brost
2025-10-23 17:18 ` Summers, Stuart
2025-10-20 21:45 ` Stuart Summers [this message]
2025-10-22 21:15 ` [PATCH 6/7] drm/xe: Clean up GuC software state after a wedge Matthew Brost
2025-10-23 17:43 ` Summers, Stuart
2025-10-23 18:26 ` Matthew Brost
2025-10-20 21:45 ` [PATCH 7/7] drm/xe/doc: Add GuC submission kernel-doc Stuart Summers
2025-10-20 22:05 ` Matthew Brost
2025-10-20 22:07 ` Summers, Stuart