Re: [PATCH v2 2/3] drm/xe: Forcefully tear down exec queues in GuC submit fini

From: "Dong, Zhanjun" <zhanjun.dong@intel.com>
To: Matthew Brost <matthew.brost@intel.com>,
	<intel-xe@lists.freedesktop.org>
Subject: Re: [PATCH v2 2/3] drm/xe: Forcefully tear down exec queues in GuC submit fini
Date: Thu, 8 Jan 2026 14:00:15 -0500
Message-ID: <5a99db81-ebbe-4dfe-a528-1063c4bcf1d1@intel.com>
In-Reply-To: <20251218214418.4037401-3-matthew.brost@intel.com>



On 2025-12-18 4:44 p.m., Matthew Brost wrote:
> In GuC submit fini, forcefully tear down any exec queues by disabling
> CTs, stopping the scheduler (which cleans up lost G2H), killing all
> remaining queues, and resuming scheduling to allow any remaining cleanup
> actions to complete and signal any remaining fences.
> 
> v2:
>   - Fix VF failure (CI)
> 
> Fixes: dd08ebf6c352 ("drm/xe: Introduce a new DRM driver for Intel GPUs")
> Cc: stable@vger.kernel.org
> Signed-off-by: Zhanjun Dong <zhanjun.dong@intel.com>
> Signed-off-by: Matthew Brost <matthew.brost@intel.com>
> 
> ---
> 
> This fix will not apply outright to any stable kernel as it depends on
> functions which have been added in the KMD since the original commit. Likely
> will have to manually send out patches to stable for kernels which we'd
> like to fix.
> ---
>   drivers/gpu/drm/xe/xe_guc_submit.c | 27 ++++++++++++++++++++-------
>   1 file changed, 20 insertions(+), 7 deletions(-)
> 
> diff --git a/drivers/gpu/drm/xe/xe_guc_submit.c b/drivers/gpu/drm/xe/xe_guc_submit.c
> index 071cbfec2401..58ec94439df1 100644
> --- a/drivers/gpu/drm/xe/xe_guc_submit.c
> +++ b/drivers/gpu/drm/xe/xe_guc_submit.c
> @@ -289,6 +289,8 @@ static bool exec_queue_killed_or_banned_or_wedged(struct xe_exec_queue *q)
>   		 EXEC_QUEUE_STATE_BANNED));
>   }
>   
> +static int __xe_guc_submit_reset_prepare(struct xe_guc *guc);
> +
>   static void guc_submit_fini(struct drm_device *drm, void *arg)
>   {
>   	struct xe_guc *guc = arg;
> @@ -296,6 +298,12 @@ static void guc_submit_fini(struct drm_device *drm, void *arg)
>   	struct xe_gt *gt = guc_to_gt(guc);
>   	int ret;
>   
> +	/* Forcefully kill any remaining exec queues */
> +	xe_guc_ct_stop(&guc->ct);
> +	__xe_guc_submit_reset_prepare(guc);
> +	xe_guc_submit_stop(guc);
> +	xe_guc_submit_pause_abort(guc);
> +

I tested this series on top of
265d13795b45 drm-tip: 2026y-01m-06d-08h-06m-43s UTC integration manifest
===(CI_DRM_17772) and (xe-4335) with (IGT_8685)===

and ran the test
xe_fault_injection --r probe-fail-guc-xe_guc_mmio_send_recv --debug
which hit a few problems:
1. The assertion ct->g2h_outstanding == 0 triggered.
The call stack shows:
[  708.967261]  xe_guc_ct_disable+0x17/0x80 [xe]
[  709.043382]  xe_guc_sanitize+0x31/0x50 [xe]
[  709.119557]  xe_uc_load_hw+0x187/0x2a0 [xe]

2. Page fault
[  740.822070] BUG: unable to handle page fault for address: ffffc9000c80fc50
[  740.828896] #PF: supervisor write access in kernel mode
[  740.834063] #PF: error_code(0x0002) - not-present page
[  740.839145] PGD 100000067 P4D 100000067 PUD 100ad4067 PMD 0
[  740.844738] Oops: Oops: 0002 [#2] SMP NOPTI
[  740.848880] CPU: 2 UID: 0 PID: 169 Comm: kworker/2:2 Tainted: G S M UD W           6.19.0-rc4+xu4335+ #3 PREEMPT(voluntary)
[  740.859964] Tainted: [S]=CPU_OUT_OF_SPEC, [M]=MACHINE_CHECK, [U]=USER, [D]=DIE, [W]=WARN
[  740.867952] Hardware name: Intel Corporation Meteor Lake Client Platform/MTL-P DDR5 SODIMM SBS RVP, BIOS MTLPFWI1.R00.4122.D21.2408281317 08/28/2024
[  740.881081] Workqueue: xe-destroy-wq __guc_exec_queue_destroy_async [xe]
[  740.887820] RIP: 0010:xe_ggtt_set_pte+0x53/0x350 [xe]
[  740.892900] Code: e2 48 89 45 d0 31 c0 f7 c6 ff 0f 00 00 75 56 49 3b 5c 24 08 0f 83 a8 01 00 00 49 8b 84 24 b0 00 00 00 48 c1 eb 0c 48 8d 04 d8 <4c> 89 38 48 8b 45 d0 65 48 2b 05 e6 41 d1 e2 0f 85 e1 02 00 00 48
[  740.911428] RSP: 0018:ffffc9000074b9f0 EFLAGS: 00010202
[  740.916599] RAX: ffffc9000c80fc50 RBX: 0000000000001f8a RCX: 0000000000000000
[  740.923653] RDX: 0000000000000000 RSI: 0000000001f8a000 RDI: ffff888132562628
[  740.930705] RBP: ffffc9000074ba88 R08: 0000000000000000 R09: ffff888168188000
[  740.937758] R10: 0000000000000000 R11: 0000000000000000 R12: ffff888132562628
[  740.944807] R13: 0000000000000000 R14: ffff88816818a768 R15: 0000000000000000
[  740.951861] FS:  0000000000000000(0000) GS:ffff8884ebbe0000(0000) knlGS:0000000000000000
[  740.959850] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[  740.965534] CR2: ffffc9000c80fc50 CR3: 0000000132923003 CR4: 0000000000f72ef0
[  740.972585] PKRU: 55555554
[  740.975268] Call Trace:
[  740.977694]  <TASK>
[  740.979778]  ? __mutex_lock+0xae/0x1080
[  740.983583]  xe_ggtt_clear+0xa1/0x260 [xe]
[  740.987716]  ? lock_release+0x1df/0x280
[  740.991519]  ? pm_runtime_get_conditional+0x66/0x150
[  740.996436]  ggtt_node_remove+0xb2/0x140 [xe]
[  741.000829]  xe_ggtt_node_remove+0x40/0xa0 [xe]
[  741.005393]  xe_ggtt_remove_bo+0x87/0x250 [xe]
[  741.009874]  ? _raw_write_unlock+0x22/0x50
[  741.013927]  ? drm_vma_offset_remove+0x65/0x80
[  741.018324]  xe_ttm_bo_destroy+0xd4/0x310 [xe]
[  741.022800]  ttm_bo_release+0x70/0x330 [ttm]
[  741.027032]  ? vunmap+0x4a/0x70
[  741.030147]  ? vunmap+0x4a/0x70
[  741.033260]  ttm_bo_fini+0x3c/0x70 [ttm]
[  741.037145]  xe_gem_object_free+0x1a/0x30 [xe]
[  741.041618]  drm_gem_object_free+0x1d/0x40
[  741.045671]  xe_bo_put+0x136/0x1c0 [xe]
[  741.049548]  xe_lrc_destroy+0x47/0x60 [xe]
[  741.053691]  xe_exec_queue_fini+0x85/0xd0 [xe]
[  741.058172]  __guc_exec_queue_destroy_async+0x7c/0x190 [xe]
[  741.063770]  process_one_work+0x22e/0x6b0
[  741.067741]  worker_thread+0x1a0/0x370
[  741.071456]  ? __pfx_worker_thread+0x10/0x10
[  741.075683]  kthread+0x11f/0x250
[  741.078882]  ? __pfx_kthread+0x10/0x10
[  741.082594]  ret_from_fork+0x337/0x390
[  741.086315]  ? __pfx_kthread+0x10/0x10
[  741.090027]  ret_from_fork_asm+0x1a/0x30
[  741.093909]  </TASK>
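
As a cross-check of the register dump above (a sketch only, not driver
code): the Code: bytes around the faulting instruction decode to
"shr $0xc,%rbx", "lea (%rax,%rbx,8),%rax" and "mov %r15,(%rax)", so the
faulting write appears to target the slot at base + (offset >> 12) * 8.
The helper names below are hypothetical, and the 4 KiB page size and
8-byte PTE size are assumptions read off that disassembly, not taken
from the xe source.

```c
#include <stdint.h>

/* Assumptions from the decoded Code: bytes, not from the xe source:
 * "shr $0xc" implies 4 KiB pages, "lea (..,%rbx,8)" implies 8-byte PTEs.
 */
#define GGTT_PAGE_SHIFT 12
#define GGTT_PTE_BYTES  8

/* Hypothetical helper: PTE index for a page-aligned GGTT offset. */
static uint64_t pte_index(uint64_t ggtt_offset)
{
	return ggtt_offset >> GGTT_PAGE_SHIFT;
}

/* Hypothetical helper: CPU address of the PTE slot that gets written. */
static uint64_t pte_slot(uint64_t map_base, uint64_t ggtt_offset)
{
	return map_base + pte_index(ggtt_offset) * GGTT_PTE_BYTES;
}

/* Hypothetical helper: work backwards from the faulting address (CR2)
 * and the PTE index (RBX) to the base of the PTE mapping.
 */
static uint64_t map_base_from_fault(uint64_t fault_addr, uint64_t index)
{
	return fault_addr - index * GGTT_PTE_BYTES;
}
```

With the values from the dump, pte_index(RSI = 0x1f8a000) is 0x1f8a,
which matches RBX, and map_base_from_fault(CR2 = 0xffffc9000c80fc50,
0x1f8a) is 0xffffc9000c800000. So the write went to PTE 0x1f8a of a
mapping that is not present at that point (PMD 0). The dump alone does
not show when or why that mapping went away.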

It looks like calling xe_guc_submit_pause_abort() here might cause
trouble. That is why I call it in guc_fini_hw instead, which makes the
test pass.

Regards,
Zhanjun Dong

>   	ret = wait_event_timeout(guc->submission_state.fini_wq,
>   				 xa_empty(&guc->submission_state.exec_queue_lookup),
>   				 HZ * 5);
> @@ -2459,16 +2467,10 @@ static void guc_exec_queue_stop(struct xe_guc *guc, struct xe_exec_queue *q)
>   	}
>   }
>   
> -int xe_guc_submit_reset_prepare(struct xe_guc *guc)
> +static int __xe_guc_submit_reset_prepare(struct xe_guc *guc)
>   {
>   	int ret;
>   
> -	if (xe_gt_WARN_ON(guc_to_gt(guc), vf_recovery(guc)))
> -		return 0;
> -
> -	if (!guc->submission_state.initialized)
> -		return 0;
> -
>   	/*
>   	 * Using an atomic here rather than submission_state.lock as this
>   	 * function can be called while holding the CT lock (engine reset
> @@ -2483,6 +2485,17 @@ int xe_guc_submit_reset_prepare(struct xe_guc *guc)
>   	return ret;
>   }
>   
> +int xe_guc_submit_reset_prepare(struct xe_guc *guc)
> +{
> +	if (xe_gt_WARN_ON(guc_to_gt(guc), vf_recovery(guc)))
> +		return 0;
> +
> +	if (!guc->submission_state.initialized)
> +		return 0;
> +
> +	return __xe_guc_submit_reset_prepare(guc);
> +}
> +
>   void xe_guc_submit_reset_wait(struct xe_guc *guc)
>   {
>   	wait_event(guc->ct.wq, xe_device_wedged(guc_to_xe(guc)) ||

Thread overview: 17+ messages
2025-12-18 21:44 [PATCH v2 0/3] Attempt to fixup reset, wedge, unload corner cases Matthew Brost
2025-12-18 21:44 ` [PATCH v2 1/3] drm/xe: Always kill exec queues in xe_guc_submit_pause_abort Matthew Brost
2025-12-18 23:36   ` Summers, Stuart
2025-12-18 21:44 ` [PATCH v2 2/3] drm/xe: Forcefully tear down exec queues in GuC submit fini Matthew Brost
2025-12-18 23:36   ` Summers, Stuart
2025-12-19  1:15     ` Matthew Brost
2026-01-08 19:00   ` Dong, Zhanjun [this message]
2026-01-08 19:17     ` Matthew Brost
2026-01-14 22:35       ` Dong, Zhanjun
2026-02-06  5:50         ` Matthew Brost
2026-02-06 20:29           ` Dong, Zhanjun
2025-12-18 21:44 ` [PATCH v2 3/3] drm/xe: Trigger queue cleanup if not in wedged mode 2 Matthew Brost
2025-12-18 23:45   ` Summers, Stuart
2025-12-19  1:10     ` Matthew Brost
2025-12-18 23:08 ` ✓ CI.KUnit: success for Attempt to fixup reset, wedge, unload corner cases Patchwork
2025-12-18 23:44 ` ✓ Xe.CI.BAT: " Patchwork
2025-12-20  1:22 ` ✗ Xe.CI.Full: failure " Patchwork
