Intel-XE Archive on lore.kernel.org
From: "Dong, Zhanjun" <zhanjun.dong@intel.com>
To: Matthew Brost <matthew.brost@intel.com>
Cc: <intel-xe@lists.freedesktop.org>
Subject: Re: [PATCH v2 2/3] drm/xe: Forcefully tear down exec queues in GuC submit fini
Date: Wed, 14 Jan 2026 17:35:38 -0500	[thread overview]
Message-ID: <ae2f2a0f-8ecf-406a-816b-5d62f50e1377@intel.com> (raw)
In-Reply-To: <aWAC4EyhqZZT5tbe@lstrano-desk.jf.intel.com>



On 2026-01-08 2:17 p.m., Matthew Brost wrote:
> On Thu, Jan 08, 2026 at 02:00:15PM -0500, Dong, Zhanjun wrote:
>>
>>
>> On 2025-12-18 4:44 p.m., Matthew Brost wrote:
>>> In GuC submit fini, forcefully tear down any exec queues by disabling
>>> CTs, stopping the scheduler (which cleans up lost G2H), killing all
>>> remaining queues, and resuming scheduling to allow any remaining cleanup
>>> actions to complete and signal any remaining fences.
>>>
>>> v2:
>>>    - Fix VF failure (CI)
>>>
>>> Fixes: dd08ebf6c352 ("drm/xe: Introduce a new DRM driver for Intel GPUs")
>>> Cc: stable@vger.kernel.org
>>> Signed-off-by: Zhanjun Dong <zhanjun.dong@intel.com>
>>> Signed-off-by: Matthew Brost <matthew.brost@intel.com>
>>>
>>> ---
>>>
>>> This fix will not apply outright to any stable kernel as it depends on
>>> functions which have been added to the KMD since the original commit. We
>>> will likely have to manually send out patches to stable for the kernels
>>> we'd like to fix.
>>> ---
>>>    drivers/gpu/drm/xe/xe_guc_submit.c | 27 ++++++++++++++++++++-------
>>>    1 file changed, 20 insertions(+), 7 deletions(-)
>>>
>>> diff --git a/drivers/gpu/drm/xe/xe_guc_submit.c b/drivers/gpu/drm/xe/xe_guc_submit.c
>>> index 071cbfec2401..58ec94439df1 100644
>>> --- a/drivers/gpu/drm/xe/xe_guc_submit.c
>>> +++ b/drivers/gpu/drm/xe/xe_guc_submit.c
>>> @@ -289,6 +289,8 @@ static bool exec_queue_killed_or_banned_or_wedged(struct xe_exec_queue *q)
>>>    		 EXEC_QUEUE_STATE_BANNED));
>>>    }
>>> +static int __xe_guc_submit_reset_prepare(struct xe_guc *guc);
>>> +
>>>    static void guc_submit_fini(struct drm_device *drm, void *arg)
>>>    {
>>>    	struct xe_guc *guc = arg;
>>> @@ -296,6 +298,12 @@ static void guc_submit_fini(struct drm_device *drm, void *arg)
>>>    	struct xe_gt *gt = guc_to_gt(guc);
>>>    	int ret;
>>> +	/* Forcefully kill any remaining exec queues */
>>> +	xe_guc_ct_stop(&guc->ct);
>>> +	__xe_guc_submit_reset_prepare(guc);
>>> +	xe_guc_submit_stop(guc);
>>> +	xe_guc_submit_pause_abort(guc);
>>> +
>>
>> Tested this series over
>> 265d13795b45 drm-tip: 2026y-01m-06d-08h-06m-43s UTC integration manifest
>> ===(CI_DRM_17772) and (xe-4335) with (IGT_8685)===
>>
>> and ran the test xe_fault_injection --r probe-fail-guc-xe_guc_mmio_send_recv
>> --debug
>> and hit a few problems:
>> 1. Assertion ct->g2h_outstanding == 0 triggered;
>> the call stack shows:
>> [  708.967261]  xe_guc_ct_disable+0x17/0x80 [xe]
>> [  709.043382]  xe_guc_sanitize+0x31/0x50 [xe]
>> [  709.119557]  xe_uc_load_hw+0x187/0x2a0 [xe]
> 
> Above is a different problem. Just delete xe_guc_sanitize from
> xe_uc_load_hw, that call is nonsense left over from the i915 port.
> 
> xe_guc_sanitize / xe_uc_sanitize everywhere probably needs a look to see
> if those calls make any sense at all.
Agreed.
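For reference, the deletion might look roughly like this (a sketch; I have
not checked the exact call site or argument in xe_uc.c):

--- a/drivers/gpu/drm/xe/xe_uc.c
+++ b/drivers/gpu/drm/xe/xe_uc.c
@@ xe_uc_load_hw @@
-	xe_guc_sanitize(&uc->guc);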
> 
>>
>> 2. Page fault
>> [  740.822070] BUG: unable to handle page fault for address:
>> ffffc9000c80fc50
>> [  740.828896] #PF: supervisor write access in kernel mode
>> [  740.834063] #PF: error_code(0x0002) - not-present page
>> [  740.839145] PGD 100000067 P4D 100000067 PUD 100ad4067 PMD 0
>> [  740.844738] Oops: Oops: 0002 [#2] SMP NOPTI
>> [  740.848880] CPU: 2 UID: 0 PID: 169 Comm: kworker/2:2 Tainted: G S M UD W
>> 6.19.0-rc4+xu4335+ #3 PREEMPT(voluntary)
>> [  740.859964] Tainted: [S]=CPU_OUT_OF_SPEC, [M]=MACHINE_CHECK, [U]=USER,
>> [D]=DIE, [W]=WARN
>> [  740.867952] Hardware name: Intel Corporation Meteor Lake Client
>> Platform/MTL-P DDR5 SODIMM SBS RVP, BIOS MTLPFWI1.R00.4122.D21.2408281317
>> 08/28/2024
>> [  740.881081] Workqueue: xe-destroy-wq __guc_exec_queue_destroy_async [xe]
>> [  740.887820] RIP: 0010:xe_ggtt_set_pte+0x53/0x350 [xe]
>> [  740.892900] Code: e2 48 89 45 d0 31 c0 f7 c6 ff 0f 00 00 75 56 49 3b 5c
>> 24 08 0f 83 a8 01 00 00 49 8b 84 24 b0 00 00 00 48 c1 eb 0c 48 8d 04 d8 <4c>
>> 89 38 48 8b 45 d0 65 48 2b 05 e6 41 d1 e2 0f 85 e1 02 00 00 48
>> [  740.911428] RSP: 0018:ffffc9000074b9f0 EFLAGS: 00010202
>> [  740.916599] RAX: ffffc9000c80fc50 RBX: 0000000000001f8a RCX:
>> 0000000000000000
>> [  740.923653] RDX: 0000000000000000 RSI: 0000000001f8a000 RDI:
>> ffff888132562628
>> [  740.930705] RBP: ffffc9000074ba88 R08: 0000000000000000 R09:
>> ffff888168188000
>> [  740.937758] R10: 0000000000000000 R11: 0000000000000000 R12:
>> ffff888132562628
>> [  740.944807] R13: 0000000000000000 R14: ffff88816818a768 R15:
>> 0000000000000000
>> [  740.951861] FS:  0000000000000000(0000) GS:ffff8884ebbe0000(0000)
>> knlGS:0000000000000000
>> [  740.959850] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
>> [  740.965534] CR2: ffffc9000c80fc50 CR3: 0000000132923003 CR4:
>> 0000000000f72ef0
>> [  740.972585] PKRU: 55555554
>> [  740.975268] Call Trace:
>> [  740.977694]  <TASK>
>> [  740.979778]  ? __mutex_lock+0xae/0x1080
>> [  740.983583]  xe_ggtt_clear+0xa1/0x260 [xe]
>> [  740.987716]  ? lock_release+0x1df/0x280
>> [  740.991519]  ? pm_runtime_get_conditional+0x66/0x150
>> [  740.996436]  ggtt_node_remove+0xb2/0x140 [xe]
>> [  741.000829]  xe_ggtt_node_remove+0x40/0xa0 [xe]
>> [  741.005393]  xe_ggtt_remove_bo+0x87/0x250 [xe]
>> [  741.009874]  ? _raw_write_unlock+0x22/0x50
>> [  741.013927]  ? drm_vma_offset_remove+0x65/0x80
>> [  741.018324]  xe_ttm_bo_destroy+0xd4/0x310 [xe]
>> [  741.022800]  ttm_bo_release+0x70/0x330 [ttm]
>> [  741.027032]  ? vunmap+0x4a/0x70
>> [  741.030147]  ? vunmap+0x4a/0x70
>> [  741.033260]  ttm_bo_fini+0x3c/0x70 [ttm]
>> [  741.037145]  xe_gem_object_free+0x1a/0x30 [xe]
>> [  741.041618]  drm_gem_object_free+0x1d/0x40
>> [  741.045671]  xe_bo_put+0x136/0x1c0 [xe]
>> [  741.049548]  xe_lrc_destroy+0x47/0x60 [xe]
>> [  741.053691]  xe_exec_queue_fini+0x85/0xd0 [xe]
>> [  741.058172]  __guc_exec_queue_destroy_async+0x7c/0x190 [xe]
>> [  741.063770]  process_one_work+0x22e/0x6b0
>> [  741.067741]  worker_thread+0x1a0/0x370
>> [  741.071456]  ? __pfx_worker_thread+0x10/0x10
>> [  741.075683]  kthread+0x11f/0x250
>> [  741.078882]  ? __pfx_kthread+0x10/0x10
>> [  741.082594]  ret_from_fork+0x337/0x390
>> [  741.086315]  ? __pfx_kthread+0x10/0x10
>> [  741.090027]  ret_from_fork_asm+0x1a/0x30
>> [  741.093909]  </TASK>
>>
>> It sounds like calling xe_guc_submit_pause_abort here might cause trouble.
>> That's why I call it in guc_fini_hw, which makes the test pass.
>>
> 
> Thanks for the info. guc_fini_hw definitely isn't the right place,
> though, as that is registered before xe_guc_submit_init is called.
> 
> If I'm understanding the trace correctly, guc_submit_fini should be on
> the devm exit handler.
> 
> Want to give my two suggestions a try? Also feel free to run with these
> patches / take over if you have the bandwidth. It is unlikely I'll have
> bandwidth to pick these back up for at least a week or so.
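If I read xe_guc_submit_init correctly, guc_submit_fini is registered as a
DRM-managed action, roughly (sketch from memory; the exact call site may
differ):

	/* tail of xe_guc_submit_init(): */
	return drmm_add_action_or_reset(&xe->drm, guc_submit_fini, guc);

which matches its (struct drm_device *, void *) signature and means it only
runs at DRM device release, after the devm-managed mmio_fini seen in the
trace below.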

With more debug prints at the beginning (^) and end ($) of
guc_fini_hw/mmio_fini/guc_submit_fini:
[  183.000171] ZD guc_fini_hw ^
[  183.000187] xe 0000:00:02.0: [drm:guc_ct_change_state [xe]] Tile0: 
GT1: GuC CT communication channel disabled
[  183.003374] ZD guc_fini_hw $
[  183.116889] ZD __xe_exec_queue_fini q:ffff88816a92d000 flag:0 
lrc.bo:ffff88816baa8800
[  183.129725] xe 0000:00:02.0: [drm:guc_ct_change_state [xe]] Tile0: 
GT0: GuC CT communication channel stopped
[  183.130487] xe 0000:00:02.0: [drm:guc_ct_change_state [xe]] Tile0: 
GT0: GuC CT communication channel disabled
[  183.131138] ZD guc_fini_hw ^
[  183.131146] xe 0000:00:02.0: [drm:guc_ct_change_state [xe]] Tile0: 
GT0: GuC CT communication channel disabled
[  183.134163] ZD guc_fini_hw $
[  183.235099] xe 0000:00:02.0: [drm:intel_pps_vdd_off_sync_unlocked 
[xe]] [ENCODER:505:DDI A/PHY A] PPS 0 turning VDD off
[  183.238289] xe 0000:00:02.0: [drm:intel_pps_vdd_off_sync_unlocked 
[xe]] [ENCODER:505:DDI A/PHY A] PPS 0 PP_STATUS: 0x00000000 PP_CONTROL: 
0x00000060
[  183.238415] xe 0000:00:02.0: [drm:intel_power_well_disable [xe]] 
disabling AUX_A
[  183.238621] xe 0000:00:02.0: [drm:wait_panel_power_cycle [xe]] 
[ENCODER:505:DDI A/PHY A] PPS 0 wait for panel power cycle (500 ms 
remaining)
[  183.747985] xe 0000:00:02.0: [drm:wait_panel_status [xe]] 
[ENCODER:505:DDI A/PHY A] PPS 0 mask: 0xb800000f value: 0x00000000 
PP_STATUS: 0x00000000 PP_CONTROL: 0x00000060
[  183.758418] xe 0000:00:02.0: [drm:wait_panel_status [xe]] Wait complete
[  183.774541] ZD mmio_fini ^
[  183.774551] ZD mmio_fini $
[  183.777314] xe 0000:00:02.0: [drm:drm_pagemap_shrinker_fini 
[drm_gpusvm_helper]] Destroying dpagemap shrinker.
[  183.789419] ZD guc_submit_fini ^
[  183.792669] xe 0000:00:02.0: [drm:guc_ct_change_state [xe]] Tile0: 
GT1: GuC CT communication channel stopped
[  183.793409] ZD xe_guc_submit_pause_abort q:ffff88811d5fd000 flag:10
[  183.799955] ZD __xe_exec_queue_fini q:ffff88811d5fd600 flag:10 
lrc.bo:ffff888168fa6800
[  183.807866] ZD guc_submit_fini start drain_workqueue
[  183.807920] ZD __xe_exec_queue_fini q:ffff88811d5fd000 flag:90 
lrc.bo:ffff888168fa5000
[  183.820685] ZD xe_ggtt_remove_bo bo:ffff888168fa6800 
ggtt:ffff88812c695628
[  183.827536] ZD xe_ggtt_remove_bo bo:ffff888168fa5000 
ggtt:ffff88812c695628
[  183.834390] ZD xe_ggtt_clear ggtt:ffff88812c695628 start:33239040 
gsm:ffffc9000c800000 gsm.:ffffc9000c80fd98
[  183.844343] BUG: unable to handle page fault for address: 
ffffc9000c80fd98
[  183.851153] #PF: supervisor write access in kernel mode
[  183.856324] #PF: error_code(0x0002) - not-present page
[  183.861406] PGD 100000067 P4D 100000067 PUD 100ac9067 PMD 0
[  183.867001] Oops: Oops: 0002 [#1] SMP NOPTI
[  183.871143] CPU: 7 UID: 0 PID: 298 Comm: kworker/7:2 Tainted: G S M U 
  W           6.19.0-rc5+xu4373+ #13 PREEMPT(voluntary)
[  183.882305] Tainted: [S]=CPU_OUT_OF_SPEC, [M]=MACHINE_CHECK, 
[U]=USER, [W]=WARN
[  183.889524] Hardware name: Intel Corporation Meteor Lake Client 
Platform/MTL-P DDR5 SODIMM SBS RVP, BIOS 
MTLPFWI1.R00.4122.D21.2408281317 08/28/2024
[  183.902650] Workqueue: xe-destroy-wq __guc_exec_queue_destroy_async [xe]
[  183.909399] RIP: 0010:xe_ggtt_set_pte+0x5b/0x360 [xe]
[  183.914482] Code: c6 ff 0f 00 00 75 5e 49 8b 44 24 10 49 03 44 24 08 
48 39 c3 0f 83 b0 01 00 00 49 8b 84 24 b8 00 00 00 48 c1 eb 0c 48 8d 04 
d8 <4c> 89 38 48 8b 45 d0 65 48 2b 05 1e 41 d1 e2 0f 85 e9 02 00 00 48
[  183.933007] RSP: 0018:ffffc90001ce79c8 EFLAGS: 00010202
[  183.938179] RAX: ffffc9000c80fd98 RBX: 0000000000001fb3 RCX: 
0000000000000000
[  183.945234] RDX: 0000000000000000 RSI: 0000000001fb3000 RDI: 
ffff88812c695628
[  183.952285] RBP: ffffc90001ce7a60 R08: 0000000000000000 R09: 
0000000000000000
[  183.959338] R10: 0000000000000000 R11: 0000000000000000 R12: 
ffff88812c695628
[  183.966388] R13: ffff8881329ea768 R14: ffff8881329ea768 R15: 
0000000000000000
[  183.973438] FS:  0000000000000000(0000) GS:ffff8884ebe60000(0000) 
knlGS:0000000000000000
[  183.981431] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[  183.987110] CR2: ffffc9000c80fd98 CR3: 000000010b9c5006 CR4: 
0000000000f72ef0
[  183.994159] PKRU: 55555554
[  183.996847] Call Trace:
[  183.999267]  <TASK>
[  184.001356]  ? vprintk_default+0x1d/0x30
[  184.005244]  ? vprintk+0x18/0x50
[  184.008446]  ? _printk+0x57/0x80
[  184.011648]  xe_ggtt_clear+0x104/0x2a0 [xe]
[  184.015878]  ? mark_held_locks+0x4d/0x90
[  184.019767]  ggtt_node_remove+0xb2/0x140 [xe]
[  184.024164]  xe_ggtt_node_remove+0x40/0xa0 [xe]
[  184.028728]  xe_ggtt_remove_bo+0xa4/0x2e0 [xe]
[  184.033210]  ? _raw_write_unlock+0x22/0x50
[  184.037271]  ? drm_vma_offset_remove+0x65/0x80
[  184.041672]  xe_ttm_bo_destroy+0xae/0x2d0 [xe]
[  184.046150]  ttm_bo_release+0x70/0x330 [ttm]
[  184.050382]  ? vunmap+0x4a/0x70
[  184.053494]  ? vunmap+0x4a/0x70
[  184.056609]  ttm_bo_fini+0x3c/0x70 [ttm]
[  184.060491]  xe_gem_object_free+0x1a/0x30 [xe]
[  184.064966]  drm_gem_object_free+0x1d/0x40
[  184.069018]  xe_bo_put+0x123/0x180 [xe]
[  184.072898]  xe_lrc_destroy+0x47/0x60 [xe]
[  184.077041]  __xe_exec_queue_fini+0x93/0xd0 [xe]
[  184.081693]  xe_exec_queue_fini+0x2b/0x60 [xe]
[  184.086171]  __guc_exec_queue_destroy_async+0x6c/0x170 [xe]
[  184.091769]  process_one_work+0x22e/0x6b0
[  184.095737]  worker_thread+0x1a0/0x370
[  184.099448]  ? __pfx_worker_thread+0x10/0x10
[  184.103676]  kthread+0x11f/0x250
[  184.106877]  ? __pfx_kthread+0x10/0x10
[  184.110586]  ret_from_fork+0x337/0x390
[  184.114301]  ? __pfx_kthread+0x10/0x10
[  184.118011]  ret_from_fork_asm+0x1a/0x30
[  184.121900]  </TASK>

So the root cause of the page fault should be:
1. mmio_fini does pci_iounmap.
2. The writeq in xe_ggtt_set_pte accesses what looks like a valid address
(ffffc9000c80fd98).
3. Since that mapping was already torn down in step 1, the page fault is
triggered (see the conceptual sketch below).
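A conceptual sketch of the failure (illustration only, not driver code;
pdev, gsm, pte and ggtt_offset are made-up names):

	void __iomem *gsm = pci_iomap(pdev, 0, 0); /* GSM mapped at probe */
	...
	pci_iounmap(pdev, gsm);          /* step 1: mmio_fini() */
	writeq(pte, gsm + ggtt_offset);  /* steps 2/3: xe_ggtt_set_pte()
					  * writes through the stale
					  * mapping -> not-present #PF */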

The execution order of the fini callbacks is:
guc_fini_hw (for each GuC)
mmio_fini
guc_submit_fini

Meanwhile, it is the destroy worker that performs the BO release, and that
is what causes the problem: the worker runs out of sync with the managed
fini actions (see the sketch below).
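One way to keep them in sync might be to drain that workqueue before the
mapping goes away (a sketch only; assumes xe->destroy_wq is the
"xe-destroy-wq" seen in the trace):

	/* Before pci_iounmap() in mmio_fini(): make sure no pending
	 * __guc_exec_queue_destroy_async() work can still write GGTT
	 * PTEs through the about-to-vanish mapping. */
	drain_workqueue(xe->destroy_wq);

Alternatively, the fini ordering itself could be fixed so that
guc_submit_fini (and the queue destruction it kicks off) completes before
mmio_fini runs.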

Regards,
Zhanjun Dong


> 
> Matt
> 
>> Regards,
>> Zhanjun Dong
>>
>>>    	ret = wait_event_timeout(guc->submission_state.fini_wq,
>>>    				 xa_empty(&guc->submission_state.exec_queue_lookup),
>>>    				 HZ * 5);
>>> @@ -2459,16 +2467,10 @@ static void guc_exec_queue_stop(struct xe_guc *guc, struct xe_exec_queue *q)
>>>    	}
>>>    }
>>> -int xe_guc_submit_reset_prepare(struct xe_guc *guc)
>>> +static int __xe_guc_submit_reset_prepare(struct xe_guc *guc)
>>>    {
>>>    	int ret;
>>> -	if (xe_gt_WARN_ON(guc_to_gt(guc), vf_recovery(guc)))
>>> -		return 0;
>>> -
>>> -	if (!guc->submission_state.initialized)
>>> -		return 0;
>>> -
>>>    	/*
>>>    	 * Using an atomic here rather than submission_state.lock as this
>>>    	 * function can be called while holding the CT lock (engine reset
>>> @@ -2483,6 +2485,17 @@ int xe_guc_submit_reset_prepare(struct xe_guc *guc)
>>>    	return ret;
>>>    }
>>> +int xe_guc_submit_reset_prepare(struct xe_guc *guc)
>>> +{
>>> +	if (xe_gt_WARN_ON(guc_to_gt(guc), vf_recovery(guc)))
>>> +		return 0;
>>> +
>>> +	if (!guc->submission_state.initialized)
>>> +		return 0;
>>> +
>>> +	return __xe_guc_submit_reset_prepare(guc);
>>> +}
>>> +
>>>    void xe_guc_submit_reset_wait(struct xe_guc *guc)
>>>    {
>>>    	wait_event(guc->ct.wq, xe_device_wedged(guc_to_xe(guc)) ||
>>



Thread overview: 17+ messages
2025-12-18 21:44 [PATCH v2 0/3] Attempt to fixup reset, wedge, unload corner cases Matthew Brost
2025-12-18 21:44 ` [PATCH v2 1/3] drm/xe: Always kill exec queues in xe_guc_submit_pause_abort Matthew Brost
2025-12-18 23:36   ` Summers, Stuart
2025-12-18 21:44 ` [PATCH v2 2/3] drm/xe: Forcefully tear down exec queues in GuC submit fini Matthew Brost
2025-12-18 23:36   ` Summers, Stuart
2025-12-19  1:15     ` Matthew Brost
2026-01-08 19:00   ` Dong, Zhanjun
2026-01-08 19:17     ` Matthew Brost
2026-01-14 22:35       ` Dong, Zhanjun [this message]
2026-02-06  5:50         ` Matthew Brost
2026-02-06 20:29           ` Dong, Zhanjun
2025-12-18 21:44 ` [PATCH v2 3/3] drm/xe: Trigger queue cleanup if not in wedged mode 2 Matthew Brost
2025-12-18 23:45   ` Summers, Stuart
2025-12-19  1:10     ` Matthew Brost
2025-12-18 23:08 ` ✓ CI.KUnit: success for Attempt to fixup reset, wedge, unload corner cases Patchwork
2025-12-18 23:44 ` ✓ Xe.CI.BAT: " Patchwork
2025-12-20  1:22 ` ✗ Xe.CI.Full: failure " Patchwork
