* [PATCH v5 1/6] drm/xe: Always kill exec queues in xe_guc_submit_pause_abort
From: Zhanjun Dong @ 2026-01-28 23:16 UTC (permalink / raw)
To: intel-xe; +Cc: Matthew Brost, stable, Zhanjun Dong, Stuart Summers
From: Matthew Brost <matthew.brost@intel.com>
xe_guc_submit_pause_abort is intended to be called after something
disastrous occurs (e.g., a failed VF migration, device wedging, or
driver unload) and should immediately trigger teardown of the remaining
submission state. Accordingly, kill any remaining exec queues in this
function.
Fixes: 7c4b7e34c83b ("drm/xe/vf: Abort VF post migration recovery on failure")
Cc: stable@vger.kernel.org
Signed-off-by: Zhanjun Dong <zhanjun.dong@intel.com>
Signed-off-by: Matthew Brost <matthew.brost@intel.com>
Reviewed-by: Stuart Summers <stuart.summers@intel.com>
---
drivers/gpu/drm/xe/xe_guc_submit.c | 3 +--
1 file changed, 1 insertion(+), 2 deletions(-)
diff --git a/drivers/gpu/drm/xe/xe_guc_submit.c b/drivers/gpu/drm/xe/xe_guc_submit.c
index 456f549c16f6..d61bd0094e0b 100644
--- a/drivers/gpu/drm/xe/xe_guc_submit.c
+++ b/drivers/gpu/drm/xe/xe_guc_submit.c
@@ -2774,8 +2774,7 @@ void xe_guc_submit_pause_abort(struct xe_guc *guc)
continue;
xe_sched_submission_start(sched);
- if (exec_queue_killed_or_banned_or_wedged(q))
- xe_guc_exec_queue_trigger_cleanup(q);
+ guc_exec_queue_kill(q);
}
mutex_unlock(&guc->submission_state.lock);
}
--
2.34.1
* [PATCH v5 2/6] drm/xe: Forcefully tear down exec queues in GuC submit fini
From: Zhanjun Dong @ 2026-01-28 23:16 UTC (permalink / raw)
To: intel-xe; +Cc: Matthew Brost, stable, Zhanjun Dong
From: Matthew Brost <matthew.brost@intel.com>
In GuC submit fini, forcefully tear down any remaining exec queues by
disabling CTs, stopping the scheduler (which cleans up lost G2H),
killing all remaining queues, and resuming scheduling so that
outstanding cleanup actions can complete and signal any remaining
fences.
guc_submit_fini requires access to the device hardware, so register it
as a device-managed (devm) action rather than a DRM-managed one, which
guarantees the correct cleanup ordering.
v3:
- Add page fault fix
v2:
- Fix VF failure (CI)
Fixes: dd08ebf6c352 ("drm/xe: Introduce a new DRM driver for Intel GPUs")
Cc: stable@vger.kernel.org
Signed-off-by: Zhanjun Dong <zhanjun.dong@intel.com>
Signed-off-by: Matthew Brost <matthew.brost@intel.com>
---
drivers/gpu/drm/xe/xe_guc_submit.c | 31 +++++++++++++++++++++---------
1 file changed, 22 insertions(+), 9 deletions(-)
diff --git a/drivers/gpu/drm/xe/xe_guc_submit.c b/drivers/gpu/drm/xe/xe_guc_submit.c
index d61bd0094e0b..92ea32423838 100644
--- a/drivers/gpu/drm/xe/xe_guc_submit.c
+++ b/drivers/gpu/drm/xe/xe_guc_submit.c
@@ -239,13 +239,21 @@ static bool exec_queue_killed_or_banned_or_wedged(struct xe_exec_queue *q)
EXEC_QUEUE_STATE_BANNED));
}
-static void guc_submit_fini(struct drm_device *drm, void *arg)
+static int __xe_guc_submit_reset_prepare(struct xe_guc *guc);
+
+static void guc_submit_fini(void *arg)
{
struct xe_guc *guc = arg;
struct xe_device *xe = guc_to_xe(guc);
struct xe_gt *gt = guc_to_gt(guc);
int ret;
+ /* Forcefully kill any remaining exec queues */
+ xe_guc_ct_stop(&guc->ct);
+ __xe_guc_submit_reset_prepare(guc);
+ xe_guc_submit_stop(guc);
+ xe_guc_submit_pause_abort(guc);
+
ret = wait_event_timeout(guc->submission_state.fini_wq,
xa_empty(&guc->submission_state.exec_queue_lookup),
HZ * 5);
@@ -326,7 +334,7 @@ int xe_guc_submit_init(struct xe_guc *guc, unsigned int num_ids)
guc->submission_state.initialized = true;
- return drmm_add_action_or_reset(&xe->drm, guc_submit_fini, guc);
+ return devm_add_action_or_reset(xe->drm.dev, guc_submit_fini, guc);
}
/*
@@ -2354,16 +2362,10 @@ static void guc_exec_queue_stop(struct xe_guc *guc, struct xe_exec_queue *q)
}
}
-int xe_guc_submit_reset_prepare(struct xe_guc *guc)
+static int __xe_guc_submit_reset_prepare(struct xe_guc *guc)
{
int ret;
- if (xe_gt_WARN_ON(guc_to_gt(guc), vf_recovery(guc)))
- return 0;
-
- if (!guc->submission_state.initialized)
- return 0;
-
/*
* Using an atomic here rather than submission_state.lock as this
* function can be called while holding the CT lock (engine reset
@@ -2378,6 +2380,17 @@ int xe_guc_submit_reset_prepare(struct xe_guc *guc)
return ret;
}
+int xe_guc_submit_reset_prepare(struct xe_guc *guc)
+{
+ if (xe_gt_WARN_ON(guc_to_gt(guc), vf_recovery(guc)))
+ return 0;
+
+ if (!guc->submission_state.initialized)
+ return 0;
+
+ return __xe_guc_submit_reset_prepare(guc);
+}
+
void xe_guc_submit_reset_wait(struct xe_guc *guc)
{
wait_event(guc->ct.wq, xe_device_wedged(guc_to_xe(guc)) ||
--
2.34.1
* [PATCH v5 3/6] drm/xe: Trigger queue cleanup if not in wedged mode 2
From: Zhanjun Dong @ 2026-01-28 23:16 UTC (permalink / raw)
To: intel-xe; +Cc: Zhanjun Dong, stable, Matthew Brost
Wedging a device is only meant to leave queues running in wedged mode 2.
In all other modes, queues should initiate cleanup and signal all
remaining fences. Fix xe_guc_submit_wedge to properly clean up queues
when the wedged mode != 2.
Fixes: 7dbe8af13c18 ("drm/xe: Wedge the entire device")
Cc: stable@vger.kernel.org
Signed-off-by: Matthew Brost <matthew.brost@intel.com>
Signed-off-by: Zhanjun Dong <zhanjun.dong@intel.com>
---
drivers/gpu/drm/xe/xe_guc_submit.c | 32 ++++++++++++++++++------------
1 file changed, 19 insertions(+), 13 deletions(-)
diff --git a/drivers/gpu/drm/xe/xe_guc_submit.c b/drivers/gpu/drm/xe/xe_guc_submit.c
index 92ea32423838..612ded5878fd 100644
--- a/drivers/gpu/drm/xe/xe_guc_submit.c
+++ b/drivers/gpu/drm/xe/xe_guc_submit.c
@@ -1326,6 +1326,7 @@ static void disable_scheduling_deregister(struct xe_guc *guc,
*/
void xe_guc_submit_wedge(struct xe_guc *guc)
{
+ struct xe_device *xe = guc_to_xe(guc);
struct xe_gt *gt = guc_to_gt(guc);
struct xe_exec_queue *q;
unsigned long index;
@@ -1340,20 +1341,25 @@ void xe_guc_submit_wedge(struct xe_guc *guc)
if (!guc->submission_state.initialized)
return;
- err = devm_add_action_or_reset(guc_to_xe(guc)->drm.dev,
- guc_submit_wedged_fini, guc);
- if (err) {
- xe_gt_err(gt, "Failed to register clean-up in wedged.mode=%s; "
- "Although device is wedged.\n",
- xe_wedged_mode_to_string(XE_WEDGED_MODE_UPON_ANY_HANG_NO_RESET));
- return;
- }
+ if (xe->wedged.mode == 2) {
+ err = devm_add_action_or_reset(guc_to_xe(guc)->drm.dev,
+ guc_submit_wedged_fini, guc);
+ if (err) {
+ xe_gt_err(gt, "Failed to register clean-up on wedged.mode=2; "
+ "Although device is wedged.\n");
+ return;
+ }
- mutex_lock(&guc->submission_state.lock);
- xa_for_each(&guc->submission_state.exec_queue_lookup, index, q)
- if (xe_exec_queue_get_unless_zero(q))
- set_exec_queue_wedged(q);
- mutex_unlock(&guc->submission_state.lock);
+ mutex_lock(&guc->submission_state.lock);
+ xa_for_each(&guc->submission_state.exec_queue_lookup, index, q)
+ if (xe_exec_queue_get_unless_zero(q))
+ set_exec_queue_wedged(q);
+ mutex_unlock(&guc->submission_state.lock);
+ } else {
+ /* Forcefully kill any remaining exec queues, signal fences */
+ xe_guc_submit_stop(guc);
+ xe_guc_submit_pause_abort(guc);
+ }
}
static bool guc_submit_hint_wedged(struct xe_guc *guc)
--
2.34.1
* [PATCH v5 5/6] drm/xe/guc: Ensure CT state transitions via STOP before DISABLED
From: Zhanjun Dong @ 2026-01-28 23:16 UTC (permalink / raw)
To: intel-xe; +Cc: Zhanjun Dong, stable, Matthew Brost
The GuC CT state transition requires moving to the STOP state before
entering the DISABLED state. Update the driver teardown sequence to make
the proper state machine transitions.
Fixes: ee4b32220a6b ("drm/xe/guc: Add devm release action to safely tear down CT")
Cc: stable@vger.kernel.org
Signed-off-by: Zhanjun Dong <zhanjun.dong@intel.com>
Reviewed-by: Matthew Brost <matthew.brost@intel.com>
---
drivers/gpu/drm/xe/xe_guc_ct.c | 1 +
1 file changed, 1 insertion(+)
diff --git a/drivers/gpu/drm/xe/xe_guc_ct.c b/drivers/gpu/drm/xe/xe_guc_ct.c
index dfbf76037b04..6a658f085e0f 100644
--- a/drivers/gpu/drm/xe/xe_guc_ct.c
+++ b/drivers/gpu/drm/xe/xe_guc_ct.c
@@ -345,6 +345,7 @@ static void guc_action_disable_ct(void *arg)
{
struct xe_guc_ct *ct = arg;
+ xe_guc_ct_stop(ct);
guc_ct_change_state(ct, XE_GUC_CT_STATE_DISABLED);
}
--
2.34.1