From: "Dong, Zhanjun" <zhanjun.dong@intel.com>
To: Matthew Brost <matthew.brost@intel.com>
Cc: <intel-xe@lists.freedesktop.org>
Subject: Re: [PATCH v2 2/3] drm/xe: Forcefully tear down exec queues in GuC submit fini
Date: Wed, 14 Jan 2026 17:35:38 -0500
Message-ID: <ae2f2a0f-8ecf-406a-816b-5d62f50e1377@intel.com>
In-Reply-To: <aWAC4EyhqZZT5tbe@lstrano-desk.jf.intel.com>
On 2026-01-08 2:17 p.m., Matthew Brost wrote:
> On Thu, Jan 08, 2026 at 02:00:15PM -0500, Dong, Zhanjun wrote:
>>
>>
>> On 2025-12-18 4:44 p.m., Matthew Brost wrote:
>>> In GuC submit fini, forcefully tear down any exec queues by disabling
>>> CTs, stopping the scheduler (which cleans up lost G2H), killing all
>>> remaining queues, and resuming scheduling to allow any remaining cleanup
>>> actions to complete and signal any remaining fences.
>>>
>>> v2:
>>> - Fix VF failure (CI)
>>>
>>> Fixes: dd08ebf6c352 ("drm/xe: Introduce a new DRM driver for Intel GPUs")
>>> Cc: stable@vger.kernel.org
>>> Signed-off-by: Zhanjun Dong <zhanjun.dong@intel.com>
>>> Signed-off-by: Matthew Brost <matthew.brost@intel.com>
>>>
>>> ---
>>>
>>> This fix will not apply outright to any stable kernel as it depends on
>>> functions which have been added in the KMD since the original commit. We will
>>> likely have to manually send out patches to stable for the kernels which we'd
>>> like to fix.
>>> ---
>>> drivers/gpu/drm/xe/xe_guc_submit.c | 27 ++++++++++++++++++++-------
>>> 1 file changed, 20 insertions(+), 7 deletions(-)
>>>
>>> diff --git a/drivers/gpu/drm/xe/xe_guc_submit.c b/drivers/gpu/drm/xe/xe_guc_submit.c
>>> index 071cbfec2401..58ec94439df1 100644
>>> --- a/drivers/gpu/drm/xe/xe_guc_submit.c
>>> +++ b/drivers/gpu/drm/xe/xe_guc_submit.c
>>> @@ -289,6 +289,8 @@ static bool exec_queue_killed_or_banned_or_wedged(struct xe_exec_queue *q)
>>> EXEC_QUEUE_STATE_BANNED));
>>> }
>>> +static int __xe_guc_submit_reset_prepare(struct xe_guc *guc);
>>> +
>>> static void guc_submit_fini(struct drm_device *drm, void *arg)
>>> {
>>> struct xe_guc *guc = arg;
>>> @@ -296,6 +298,12 @@ static void guc_submit_fini(struct drm_device *drm, void *arg)
>>> struct xe_gt *gt = guc_to_gt(guc);
>>> int ret;
>>> + /* Forcefully kill any remaining exec queues */
>>> + xe_guc_ct_stop(&guc->ct);
>>> + __xe_guc_submit_reset_prepare(guc);
>>> + xe_guc_submit_stop(guc);
>>> + xe_guc_submit_pause_abort(guc);
>>> +
>>
>> Tested this series over
>> 265d13795b45 drm-tip: 2026y-01m-06d-08h-06m-43s UTC integration manifest
>> ===(CI_DRM_17772) and (xe-4335) with (IGT_8685)===
>>
>> and ran the test xe_fault_injection --r probe-fail-guc-xe_guc_mmio_send_recv
>> --debug
>> and hit a few problems:
>> 1. Assertion ct->g2h_outstanding == 0 triggered
>> call stack shows:
>> [ 708.967261] xe_guc_ct_disable+0x17/0x80 [xe]
>> [ 709.043382] xe_guc_sanitize+0x31/0x50 [xe]
>> [ 709.119557] xe_uc_load_hw+0x187/0x2a0 [xe]
>
> Above is a different problem. Just delete xe_guc_sanitize from
> xe_uc_load_hw, that call is nonsense left over from the i915 port.
>
> xe_guc_sanitize / xe_uc_sanitize everywhere probably needs a look if
> those calls make any bit of sense.
Agreed.
>
>>
>> 2. Page fault
>> [ 740.822070] BUG: unable to handle page fault for address:
>> ffffc9000c80fc50
>> [ 740.828896] #PF: supervisor write access in kernel mode
>> [ 740.834063] #PF: error_code(0x0002) - not-present page
>> [ 740.839145] PGD 100000067 P4D 100000067 PUD 100ad4067 PMD 0
>> [ 740.844738] Oops: Oops: 0002 [#2] SMP NOPTI
>> [ 740.848880] CPU: 2 UID: 0 PID: 169 Comm: kworker/2:2 Tainted: G S M UD W
>> 6.19.0-rc4+xu4335+ #3 PREEMPT(voluntary)
>> [ 740.859964] Tainted: [S]=CPU_OUT_OF_SPEC, [M]=MACHINE_CHECK, [U]=USER,
>> [D]=DIE, [W]=WARN
>> [ 740.867952] Hardware name: Intel Corporation Meteor Lake Client
>> Platform/MTL-P DDR5 SODIMM SBS RVP, BIOS MTLPFWI1.R00.4122.D21.2408281317
>> 08/28/2024
>> [ 740.881081] Workqueue: xe-destroy-wq __guc_exec_queue_destroy_async [xe]
>> [ 740.887820] RIP: 0010:xe_ggtt_set_pte+0x53/0x350 [xe]
>> [ 740.892900] Code: e2 48 89 45 d0 31 c0 f7 c6 ff 0f 00 00 75 56 49 3b 5c
>> 24 08 0f 83 a8 01 00 00 49 8b 84 24 b0 00 00 00 48 c1 eb 0c 48 8d 04 d8 <4c>
>> 89 38 48 8b 45 d0 65 48 2b 05 e6 41 d1 e2 0f 85 e1 02 00 00 48
>> [ 740.911428] RSP: 0018:ffffc9000074b9f0 EFLAGS: 00010202
>> [ 740.916599] RAX: ffffc9000c80fc50 RBX: 0000000000001f8a RCX:
>> 0000000000000000
>> [ 740.923653] RDX: 0000000000000000 RSI: 0000000001f8a000 RDI:
>> ffff888132562628
>> [ 740.930705] RBP: ffffc9000074ba88 R08: 0000000000000000 R09:
>> ffff888168188000
>> [ 740.937758] R10: 0000000000000000 R11: 0000000000000000 R12:
>> ffff888132562628
>> [ 740.944807] R13: 0000000000000000 R14: ffff88816818a768 R15:
>> 0000000000000000
>> [ 740.951861] FS: 0000000000000000(0000) GS:ffff8884ebbe0000(0000)
>> knlGS:0000000000000000
>> [ 740.959850] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
>> [ 740.965534] CR2: ffffc9000c80fc50 CR3: 0000000132923003 CR4:
>> 0000000000f72ef0
>> [ 740.972585] PKRU: 55555554
>> [ 740.975268] Call Trace:
>> [ 740.977694] <TASK>
>> [ 740.979778] ? __mutex_lock+0xae/0x1080
>> [ 740.983583] xe_ggtt_clear+0xa1/0x260 [xe]
>> [ 740.987716] ? lock_release+0x1df/0x280
>> [ 740.991519] ? pm_runtime_get_conditional+0x66/0x150
>> [ 740.996436] ggtt_node_remove+0xb2/0x140 [xe]
>> [ 741.000829] xe_ggtt_node_remove+0x40/0xa0 [xe]
>> [ 741.005393] xe_ggtt_remove_bo+0x87/0x250 [xe]
>> [ 741.009874] ? _raw_write_unlock+0x22/0x50
>> [ 741.013927] ? drm_vma_offset_remove+0x65/0x80
>> [ 741.018324] xe_ttm_bo_destroy+0xd4/0x310 [xe]
>> [ 741.022800] ttm_bo_release+0x70/0x330 [ttm]
>> [ 741.027032] ? vunmap+0x4a/0x70
>> [ 741.030147] ? vunmap+0x4a/0x70
>> [ 741.033260] ttm_bo_fini+0x3c/0x70 [ttm]
>> [ 741.037145] xe_gem_object_free+0x1a/0x30 [xe]
>> [ 741.041618] drm_gem_object_free+0x1d/0x40
>> [ 741.045671] xe_bo_put+0x136/0x1c0 [xe]
>> [ 741.049548] xe_lrc_destroy+0x47/0x60 [xe]
>> [ 741.053691] xe_exec_queue_fini+0x85/0xd0 [xe]
>> [ 741.058172] __guc_exec_queue_destroy_async+0x7c/0x190 [xe]
>> [ 741.063770] process_one_work+0x22e/0x6b0
>> [ 741.067741] worker_thread+0x1a0/0x370
>> [ 741.071456] ? __pfx_worker_thread+0x10/0x10
>> [ 741.075683] kthread+0x11f/0x250
>> [ 741.078882] ? __pfx_kthread+0x10/0x10
>> [ 741.082594] ret_from_fork+0x337/0x390
>> [ 741.086315] ? __pfx_kthread+0x10/0x10
>> [ 741.090027] ret_from_fork_asm+0x1a/0x30
>> [ 741.093909] </TASK>
>>
>> It looks like calling xe_guc_submit_pause_abort here might cause trouble. That's
>> why I call it in guc_fini_hw, which makes the test pass.
>>
>
> Thanks for the info. guc_fini_hw definitely isn't the right place
> though, as that is registered before xe_guc_submit_init is called.
>
> If I'm understanding the trace correctly - guc_submit_fini should be on
> the devm exit handler.
>
> Want to give my two suggestions a try? Also feel free to run with these
> patches / take over if you have bandwidth. It is unlikely I'll have bandwidth
> to pick these back up for at least a week or so.
With more debug prints at the beginning (^) and end ($) of
guc_fini_hw/mmio_fini/guc_submit_fini:
[ 183.000171] ZD guc_fini_hw ^
[ 183.000187] xe 0000:00:02.0: [drm:guc_ct_change_state [xe]] Tile0:
GT1: GuC CT communication channel disabled
[ 183.003374] ZD guc_fini_hw $
[ 183.116889] ZD __xe_exec_queue_fini q:ffff88816a92d000 flag:0
lrc.bo:ffff88816baa8800
[ 183.129725] xe 0000:00:02.0: [drm:guc_ct_change_state [xe]] Tile0:
GT0: GuC CT communication channel stopped
[ 183.130487] xe 0000:00:02.0: [drm:guc_ct_change_state [xe]] Tile0:
GT0: GuC CT communication channel disabled
[ 183.131138] ZD guc_fini_hw ^
[ 183.131146] xe 0000:00:02.0: [drm:guc_ct_change_state [xe]] Tile0:
GT0: GuC CT communication channel disabled
[ 183.134163] ZD guc_fini_hw $
[ 183.235099] xe 0000:00:02.0: [drm:intel_pps_vdd_off_sync_unlocked
[xe]] [ENCODER:505:DDI A/PHY A] PPS 0 turning VDD off
[ 183.238289] xe 0000:00:02.0: [drm:intel_pps_vdd_off_sync_unlocked
[xe]] [ENCODER:505:DDI A/PHY A] PPS 0 PP_STATUS: 0x00000000 PP_CONTROL:
0x00000060
[ 183.238415] xe 0000:00:02.0: [drm:intel_power_well_disable [xe]]
disabling AUX_A
[ 183.238621] xe 0000:00:02.0: [drm:wait_panel_power_cycle [xe]]
[ENCODER:505:DDI A/PHY A] PPS 0 wait for panel power cycle (500 ms
remaining)
[ 183.747985] xe 0000:00:02.0: [drm:wait_panel_status [xe]]
[ENCODER:505:DDI A/PHY A] PPS 0 mask: 0xb800000f value: 0x00000000
PP_STATUS: 0x00000000 PP_CONTROL: 0x00000060
[ 183.758418] xe 0000:00:02.0: [drm:wait_panel_status [xe]] Wait complete
[ 183.774541] ZD mmio_fini ^
[ 183.774551] ZD mmio_fini $
[ 183.777314] xe 0000:00:02.0: [drm:drm_pagemap_shrinker_fini
[drm_gpusvm_helper]] Destroying dpagemap shrinker.
[ 183.789419] ZD guc_submit_fini ^
[ 183.792669] xe 0000:00:02.0: [drm:guc_ct_change_state [xe]] Tile0:
GT1: GuC CT communication channel stopped
[ 183.793409] ZD xe_guc_submit_pause_abort q:ffff88811d5fd000 flag:10
[ 183.799955] ZD __xe_exec_queue_fini q:ffff88811d5fd600 flag:10
lrc.bo:ffff888168fa6800
[ 183.807866] ZD guc_submit_fini start drain_workqueue
[ 183.807920] ZD __xe_exec_queue_fini q:ffff88811d5fd000 flag:90
lrc.bo:ffff888168fa5000
[ 183.820685] ZD xe_ggtt_remove_bo bo:ffff888168fa6800
ggtt:ffff88812c695628
[ 183.827536] ZD xe_ggtt_remove_bo bo:ffff888168fa5000
ggtt:ffff88812c695628
[ 183.834390] ZD xe_ggtt_clear ggtt:ffff88812c695628 start:33239040
gsm:ffffc9000c800000 gsm.:ffffc9000c80fd98
[ 183.844343] BUG: unable to handle page fault for address:
ffffc9000c80fd98
[ 183.851153] #PF: supervisor write access in kernel mode
[ 183.856324] #PF: error_code(0x0002) - not-present page
[ 183.861406] PGD 100000067 P4D 100000067 PUD 100ac9067 PMD 0
[ 183.867001] Oops: Oops: 0002 [#1] SMP NOPTI
[ 183.871143] CPU: 7 UID: 0 PID: 298 Comm: kworker/7:2 Tainted: G S M U
W 6.19.0-rc5+xu4373+ #13 PREEMPT(voluntary)
[ 183.882305] Tainted: [S]=CPU_OUT_OF_SPEC, [M]=MACHINE_CHECK,
[U]=USER, [W]=WARN
[ 183.889524] Hardware name: Intel Corporation Meteor Lake Client
Platform/MTL-P DDR5 SODIMM SBS RVP, BIOS
MTLPFWI1.R00.4122.D21.2408281317 08/28/2024
[ 183.902650] Workqueue: xe-destroy-wq __guc_exec_queue_destroy_async [xe]
[ 183.909399] RIP: 0010:xe_ggtt_set_pte+0x5b/0x360 [xe]
[ 183.914482] Code: c6 ff 0f 00 00 75 5e 49 8b 44 24 10 49 03 44 24 08
48 39 c3 0f 83 b0 01 00 00 49 8b 84 24 b8 00 00 00 48 c1 eb 0c 48 8d 04
d8 <4c> 89 38 48 8b 45 d0 65 48 2b 05 1e 41 d1 e2 0f 85 e9 02 00 00 48
[ 183.933007] RSP: 0018:ffffc90001ce79c8 EFLAGS: 00010202
[ 183.938179] RAX: ffffc9000c80fd98 RBX: 0000000000001fb3 RCX:
0000000000000000
[ 183.945234] RDX: 0000000000000000 RSI: 0000000001fb3000 RDI:
ffff88812c695628
[ 183.952285] RBP: ffffc90001ce7a60 R08: 0000000000000000 R09:
0000000000000000
[ 183.959338] R10: 0000000000000000 R11: 0000000000000000 R12:
ffff88812c695628
[ 183.966388] R13: ffff8881329ea768 R14: ffff8881329ea768 R15:
0000000000000000
[ 183.973438] FS: 0000000000000000(0000) GS:ffff8884ebe60000(0000)
knlGS:0000000000000000
[ 183.981431] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[ 183.987110] CR2: ffffc9000c80fd98 CR3: 000000010b9c5006 CR4:
0000000000f72ef0
[ 183.994159] PKRU: 55555554
[ 183.996847] Call Trace:
[ 183.999267] <TASK>
[ 184.001356] ? vprintk_default+0x1d/0x30
[ 184.005244] ? vprintk+0x18/0x50
[ 184.008446] ? _printk+0x57/0x80
[ 184.011648] xe_ggtt_clear+0x104/0x2a0 [xe]
[ 184.015878] ? mark_held_locks+0x4d/0x90
[ 184.019767] ggtt_node_remove+0xb2/0x140 [xe]
[ 184.024164] xe_ggtt_node_remove+0x40/0xa0 [xe]
[ 184.028728] xe_ggtt_remove_bo+0xa4/0x2e0 [xe]
[ 184.033210] ? _raw_write_unlock+0x22/0x50
[ 184.037271] ? drm_vma_offset_remove+0x65/0x80
[ 184.041672] xe_ttm_bo_destroy+0xae/0x2d0 [xe]
[ 184.046150] ttm_bo_release+0x70/0x330 [ttm]
[ 184.050382] ? vunmap+0x4a/0x70
[ 184.053494] ? vunmap+0x4a/0x70
[ 184.056609] ttm_bo_fini+0x3c/0x70 [ttm]
[ 184.060491] xe_gem_object_free+0x1a/0x30 [xe]
[ 184.064966] drm_gem_object_free+0x1d/0x40
[ 184.069018] xe_bo_put+0x123/0x180 [xe]
[ 184.072898] xe_lrc_destroy+0x47/0x60 [xe]
[ 184.077041] __xe_exec_queue_fini+0x93/0xd0 [xe]
[ 184.081693] xe_exec_queue_fini+0x2b/0x60 [xe]
[ 184.086171] __guc_exec_queue_destroy_async+0x6c/0x170 [xe]
[ 184.091769] process_one_work+0x22e/0x6b0
[ 184.095737] worker_thread+0x1a0/0x370
[ 184.099448] ? __pfx_worker_thread+0x10/0x10
[ 184.103676] kthread+0x11f/0x250
[ 184.106877] ? __pfx_kthread+0x10/0x10
[ 184.110586] ret_from_fork+0x337/0x390
[ 184.114301] ? __pfx_kthread+0x10/0x10
[ 184.118011] ret_from_fork_asm+0x1a/0x30
[ 184.121900] </TASK>
So the root cause of the page fault should be:
1. mmio_fini does the pci_iounmap.
2. writeq in xe_ggtt_set_pte writes to an otherwise valid address (ffffc9000c80fd98).
3. Since that address was already unmapped in step 1, the page fault is triggered.
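As a cross-check of step 2, the faulting address can be recomputed from the values in the debug print. This is only a user-space sketch: the helper names are hypothetical, and the 4 KiB page shift and 8-byte PTE size are assumptions read off the oops (the shift by 12 and the index scaled by 8 before the 64-bit store), not taken from the xe source.

```c
#include <assert.h>
#include <stdint.h>

/* Assumptions read off the oops, not taken from the xe source:
 * 4 KiB GGTT pages (shift by 12) and 8-byte PTEs (index scaled by 8). */
#define TOY_GGTT_PAGE_SHIFT 12
#define TOY_GGTT_PTE_BYTES  8

/* Hypothetical helper: virtual address of the PTE slot for a GGTT offset. */
static uint64_t toy_ggtt_pte_vaddr(uint64_t gsm_base, uint64_t ggtt_offset)
{
	return gsm_base + (ggtt_offset >> TOY_GGTT_PAGE_SHIFT) * TOY_GGTT_PTE_BYTES;
}

/* Cross-check against the values printed by the debug patch in the log. */
static void toy_check_logged_fault(void)
{
	const uint64_t gsm   = 0xffffc9000c800000ULL; /* "gsm:" in the log */
	const uint64_t start = 33239040ULL;           /* "start:", i.e. 0x1fb3000 */

	/* Matches CR2 / RAX in the second oops. */
	assert(toy_ggtt_pte_vaddr(gsm, start) == 0xffffc9000c80fd98ULL);
}
```

Calling toy_check_logged_fault() passes, so the address is a correctly computed PTE slot inside the GSM mapping; it faults only because the mapping is already gone.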
The execution order of the fini actions is:
guc_fini_hw (for each GuC)
mmio_fini
guc_submit_fini
Meanwhile, it is the destroy worker that performs the BO release, and that
causes the problem: the worker is out of sync with the managed actions.
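The ordering dependency can be modelled in a few lines of user-space C. This is a toy sketch, not driver code: every toy_* name is hypothetical, the "worker" is run synchronously where the real code drains the workqueue, and the second ordering only illustrates the dependency; it is not a proposal for where the fix should go.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Toy model state; all names are hypothetical, not xe symbols. */
struct toy_dev {
	bool mmio_mapped; /* cleared by the mmio fini step (pci_iounmap) */
	int faults;       /* PTE writes attempted after the unmap */
};

typedef void (*toy_fini_fn)(struct toy_dev *);

/* Stands in for the destroy worker: releasing the BO clears GGTT PTEs. */
static void toy_destroy_work(struct toy_dev *d)
{
	if (!d->mmio_mapped)
		d->faults++; /* this is where the real kernel page-faults */
}

static void toy_guc_fini_hw(struct toy_dev *d) { (void)d; }
static void toy_mmio_fini(struct toy_dev *d) { d->mmio_mapped = false; }

/* Kills the remaining queues, then waits for the destroy work to run. */
static void toy_guc_submit_fini(struct toy_dev *d) { toy_destroy_work(d); }

/* Run the fini steps in the given order and return the fault count. */
static int toy_run(const toy_fini_fn *steps, size_t n)
{
	struct toy_dev d = { .mmio_mapped = true, .faults = 0 };

	assert(steps != NULL);
	for (size_t i = 0; i < n; i++)
		steps[i](&d);
	return d.faults;
}

/* Order seen in the log: the unmap happens before the queues are destroyed. */
static int toy_observed_order(void)
{
	const toy_fini_fn steps[] = {
		toy_guc_fini_hw, toy_mmio_fini, toy_guc_submit_fini,
	};
	return toy_run(steps, sizeof(steps) / sizeof(steps[0]));
}

/* Same steps, with queue destruction finished while MMIO is still mapped. */
static int toy_submit_before_unmap(void)
{
	const toy_fini_fn steps[] = {
		toy_guc_fini_hw, toy_guc_submit_fini, toy_mmio_fini,
	};
	return toy_run(steps, sizeof(steps) / sizeof(steps[0]));
}
```

In this model toy_observed_order() counts one write after the unmap, while toy_submit_before_unmap() counts none, which is the same dependency the log shows between mmio_fini and the BO release.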
Regards,
Zhanjun Dong
>
> Matt
>
>> Regards,
>> Zhanjun Dong
>>
>>> ret = wait_event_timeout(guc->submission_state.fini_wq,
>>> xa_empty(&guc->submission_state.exec_queue_lookup),
>>> HZ * 5);
>>> @@ -2459,16 +2467,10 @@ static void guc_exec_queue_stop(struct xe_guc *guc, struct xe_exec_queue *q)
>>> }
>>> }
>>> -int xe_guc_submit_reset_prepare(struct xe_guc *guc)
>>> +static int __xe_guc_submit_reset_prepare(struct xe_guc *guc)
>>> {
>>> int ret;
>>> - if (xe_gt_WARN_ON(guc_to_gt(guc), vf_recovery(guc)))
>>> - return 0;
>>> -
>>> - if (!guc->submission_state.initialized)
>>> - return 0;
>>> -
>>> /*
>>> * Using an atomic here rather than submission_state.lock as this
>>> * function can be called while holding the CT lock (engine reset
>>> @@ -2483,6 +2485,17 @@ int xe_guc_submit_reset_prepare(struct xe_guc *guc)
>>> return ret;
>>> }
>>> +int xe_guc_submit_reset_prepare(struct xe_guc *guc)
>>> +{
>>> + if (xe_gt_WARN_ON(guc_to_gt(guc), vf_recovery(guc)))
>>> + return 0;
>>> +
>>> + if (!guc->submission_state.initialized)
>>> + return 0;
>>> +
>>> + return __xe_guc_submit_reset_prepare(guc);
>>> +}
>>> +
>>> void xe_guc_submit_reset_wait(struct xe_guc *guc)
>>> {
>>> wait_event(guc->ct.wq, xe_device_wedged(guc_to_xe(guc)) ||
>>
Thread overview: 17+ messages
2025-12-18 21:44 [PATCH v2 0/3] Attempt to fixup reset, wedge, unload corner cases Matthew Brost
2025-12-18 21:44 ` [PATCH v2 1/3] drm/xe: Always kill exec queues in xe_guc_submit_pause_abort Matthew Brost
2025-12-18 23:36 ` Summers, Stuart
2025-12-18 21:44 ` [PATCH v2 2/3] drm/xe: Forcefully tear down exec queues in GuC submit fini Matthew Brost
2025-12-18 23:36 ` Summers, Stuart
2025-12-19 1:15 ` Matthew Brost
2026-01-08 19:00 ` Dong, Zhanjun
2026-01-08 19:17 ` Matthew Brost
2026-01-14 22:35 ` Dong, Zhanjun [this message]
2026-02-06 5:50 ` Matthew Brost
2026-02-06 20:29 ` Dong, Zhanjun
2025-12-18 21:44 ` [PATCH v2 3/3] drm/xe: Trigger queue cleanup if not in wedged mode 2 Matthew Brost
2025-12-18 23:45 ` Summers, Stuart
2025-12-19 1:10 ` Matthew Brost
2025-12-18 23:08 ` ✓ CI.KUnit: success for Attempt to fixup reset, wedge, unload corner cases Patchwork
2025-12-18 23:44 ` ✓ Xe.CI.BAT: " Patchwork
2025-12-20 1:22 ` ✗ Xe.CI.Full: failure " Patchwork