From: "Dong, Zhanjun" <zhanjun.dong@intel.com>
To: Matthew Brost <matthew.brost@intel.com>,
<intel-xe@lists.freedesktop.org>
Subject: Re: [PATCH v2 2/3] drm/xe: Forcefully tear down exec queues in GuC submit fini
Date: Thu, 8 Jan 2026 14:00:15 -0500
Message-ID: <5a99db81-ebbe-4dfe-a528-1063c4bcf1d1@intel.com>
In-Reply-To: <20251218214418.4037401-3-matthew.brost@intel.com>
On 2025-12-18 4:44 p.m., Matthew Brost wrote:
> In GuC submit fini, forcefully tear down any exec queues by disabling
> CTs, stopping the scheduler (which cleans up lost G2H), killing all
> remaining queues, and resuming scheduling to allow any remaining cleanup
> actions to complete and signal any remaining fences.
>
> v2:
> - Fix VF failure (CI)
>
> Fixes: dd08ebf6c352 ("drm/xe: Introduce a new DRM driver for Intel GPUs")
> Cc: stable@vger.kernel.org
> Signed-off-by: Zhanjun Dong <zhanjun.dong@intel.com>
> Signed-off-by: Matthew Brost <matthew.brost@intel.com>
>
> ---
>
> This fix will not apply outright to any stable kernel as it depends on
> functions which have been added to the KMD since the original commit. We
> will likely have to manually send out patches to stable for the kernels
> we'd like to fix.
> ---
> drivers/gpu/drm/xe/xe_guc_submit.c | 27 ++++++++++++++++++++-------
> 1 file changed, 20 insertions(+), 7 deletions(-)
>
> diff --git a/drivers/gpu/drm/xe/xe_guc_submit.c b/drivers/gpu/drm/xe/xe_guc_submit.c
> index 071cbfec2401..58ec94439df1 100644
> --- a/drivers/gpu/drm/xe/xe_guc_submit.c
> +++ b/drivers/gpu/drm/xe/xe_guc_submit.c
> @@ -289,6 +289,8 @@ static bool exec_queue_killed_or_banned_or_wedged(struct xe_exec_queue *q)
> EXEC_QUEUE_STATE_BANNED));
> }
>
> +static int __xe_guc_submit_reset_prepare(struct xe_guc *guc);
> +
> static void guc_submit_fini(struct drm_device *drm, void *arg)
> {
> struct xe_guc *guc = arg;
> @@ -296,6 +298,12 @@ static void guc_submit_fini(struct drm_device *drm, void *arg)
> struct xe_gt *gt = guc_to_gt(guc);
> int ret;
>
> + /* Forcefully kill any remaining exec queues */
> + xe_guc_ct_stop(&guc->ct);
> + __xe_guc_submit_reset_prepare(guc);
> + xe_guc_submit_stop(guc);
> + xe_guc_submit_pause_abort(guc);
> +
Tested this series over
265d13795b45 drm-tip: 2026y-01m-06d-08h-06m-43s UTC integration manifest
===(CI_DRM_17772) and (xe-4335) with (IGT_8685)===
and ran the test xe_fault_injection --r probe-fail-guc-xe_guc_mmio_send_recv
--debug
and got a few problems:
1. Assertion ct->g2h_outstanding == 0 triggered; the call stack shows:
[ 708.967261] xe_guc_ct_disable+0x17/0x80 [xe]
[ 709.043382] xe_guc_sanitize+0x31/0x50 [xe]
[ 709.119557] xe_uc_load_hw+0x187/0x2a0 [xe]
2. Page fault
[ 740.822070] BUG: unable to handle page fault for address: ffffc9000c80fc50
[ 740.828896] #PF: supervisor write access in kernel mode
[ 740.834063] #PF: error_code(0x0002) - not-present page
[ 740.839145] PGD 100000067 P4D 100000067 PUD 100ad4067 PMD 0
[ 740.844738] Oops: Oops: 0002 [#2] SMP NOPTI
[ 740.848880] CPU: 2 UID: 0 PID: 169 Comm: kworker/2:2 Tainted: G S M UD W 6.19.0-rc4+xu4335+ #3 PREEMPT(voluntary)
[ 740.859964] Tainted: [S]=CPU_OUT_OF_SPEC, [M]=MACHINE_CHECK, [U]=USER, [D]=DIE, [W]=WARN
[ 740.867952] Hardware name: Intel Corporation Meteor Lake Client Platform/MTL-P DDR5 SODIMM SBS RVP, BIOS MTLPFWI1.R00.4122.D21.2408281317 08/28/2024
[ 740.881081] Workqueue: xe-destroy-wq __guc_exec_queue_destroy_async [xe]
[ 740.887820] RIP: 0010:xe_ggtt_set_pte+0x53/0x350 [xe]
[ 740.892900] Code: e2 48 89 45 d0 31 c0 f7 c6 ff 0f 00 00 75 56 49 3b 5c 24 08 0f 83 a8 01 00 00 49 8b 84 24 b0 00 00 00 48 c1 eb 0c 48 8d 04 d8 <4c> 89 38 48 8b 45 d0 65 48 2b 05 e6 41 d1 e2 0f 85 e1 02 00 00 48
[ 740.911428] RSP: 0018:ffffc9000074b9f0 EFLAGS: 00010202
[ 740.916599] RAX: ffffc9000c80fc50 RBX: 0000000000001f8a RCX: 0000000000000000
[ 740.923653] RDX: 0000000000000000 RSI: 0000000001f8a000 RDI: ffff888132562628
[ 740.930705] RBP: ffffc9000074ba88 R08: 0000000000000000 R09: ffff888168188000
[ 740.937758] R10: 0000000000000000 R11: 0000000000000000 R12: ffff888132562628
[ 740.944807] R13: 0000000000000000 R14: ffff88816818a768 R15: 0000000000000000
[ 740.951861] FS: 0000000000000000(0000) GS:ffff8884ebbe0000(0000) knlGS:0000000000000000
[ 740.959850] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[ 740.965534] CR2: ffffc9000c80fc50 CR3: 0000000132923003 CR4: 0000000000f72ef0
[ 740.972585] PKRU: 55555554
[ 740.975268] Call Trace:
[ 740.977694] <TASK>
[ 740.979778] ? __mutex_lock+0xae/0x1080
[ 740.983583] xe_ggtt_clear+0xa1/0x260 [xe]
[ 740.987716] ? lock_release+0x1df/0x280
[ 740.991519] ? pm_runtime_get_conditional+0x66/0x150
[ 740.996436] ggtt_node_remove+0xb2/0x140 [xe]
[ 741.000829] xe_ggtt_node_remove+0x40/0xa0 [xe]
[ 741.005393] xe_ggtt_remove_bo+0x87/0x250 [xe]
[ 741.009874] ? _raw_write_unlock+0x22/0x50
[ 741.013927] ? drm_vma_offset_remove+0x65/0x80
[ 741.018324] xe_ttm_bo_destroy+0xd4/0x310 [xe]
[ 741.022800] ttm_bo_release+0x70/0x330 [ttm]
[ 741.027032] ? vunmap+0x4a/0x70
[ 741.030147] ? vunmap+0x4a/0x70
[ 741.033260] ttm_bo_fini+0x3c/0x70 [ttm]
[ 741.037145] xe_gem_object_free+0x1a/0x30 [xe]
[ 741.041618] drm_gem_object_free+0x1d/0x40
[ 741.045671] xe_bo_put+0x136/0x1c0 [xe]
[ 741.049548] xe_lrc_destroy+0x47/0x60 [xe]
[ 741.053691] xe_exec_queue_fini+0x85/0xd0 [xe]
[ 741.058172] __guc_exec_queue_destroy_async+0x7c/0x190 [xe]
[ 741.063770] process_one_work+0x22e/0x6b0
[ 741.067741] worker_thread+0x1a0/0x370
[ 741.071456] ? __pfx_worker_thread+0x10/0x10
[ 741.075683] kthread+0x11f/0x250
[ 741.078882] ? __pfx_kthread+0x10/0x10
[ 741.082594] ret_from_fork+0x337/0x390
[ 741.086315] ? __pfx_kthread+0x10/0x10
[ 741.090027] ret_from_fork_asm+0x1a/0x30
[ 741.093909] </TASK>
It sounds like calling xe_guc_submit_pause_abort here might cause trouble.
That's why I call it in guc_fini_hw instead, which makes the test pass.
Regards,
Zhanjun Dong
> ret = wait_event_timeout(guc->submission_state.fini_wq,
> xa_empty(&guc->submission_state.exec_queue_lookup),
> HZ * 5);
> @@ -2459,16 +2467,10 @@ static void guc_exec_queue_stop(struct xe_guc *guc, struct xe_exec_queue *q)
> }
> }
>
> -int xe_guc_submit_reset_prepare(struct xe_guc *guc)
> +static int __xe_guc_submit_reset_prepare(struct xe_guc *guc)
> {
> int ret;
>
> - if (xe_gt_WARN_ON(guc_to_gt(guc), vf_recovery(guc)))
> - return 0;
> -
> - if (!guc->submission_state.initialized)
> - return 0;
> -
> /*
> * Using an atomic here rather than submission_state.lock as this
> * function can be called while holding the CT lock (engine reset
> @@ -2483,6 +2485,17 @@ int xe_guc_submit_reset_prepare(struct xe_guc *guc)
> return ret;
> }
>
> +int xe_guc_submit_reset_prepare(struct xe_guc *guc)
> +{
> + if (xe_gt_WARN_ON(guc_to_gt(guc), vf_recovery(guc)))
> + return 0;
> +
> + if (!guc->submission_state.initialized)
> + return 0;
> +
> + return __xe_guc_submit_reset_prepare(guc);
> +}
> +
> void xe_guc_submit_reset_wait(struct xe_guc *guc)
> {
> wait_event(guc->ct.wq, xe_device_wedged(guc_to_xe(guc)) ||