From: Matthew Brost <matthew.brost@intel.com>
To: "Lin, Shuicheng" <shuicheng.lin@intel.com>
Cc: "intel-xe@lists.freedesktop.org" <intel-xe@lists.freedesktop.org>,
"Wajdeczko, Michal" <Michal.Wajdeczko@intel.com>,
"De Marchi, Lucas" <lucas.demarchi@intel.com>,
"Lis, Tomasz" <tomasz.lis@intel.com>
Subject: Re: [PATCH] drm/xe/guc: Check CT enable state before deregistering exec queue
Date: Tue, 7 Oct 2025 08:09:51 -0700 [thread overview]
Message-ID: <aOUtP+15UGWK1YaB@lstrano-desk.jf.intel.com> (raw)
In-Reply-To: <DM4PR11MB5456A079CF5E4276AE13580EEAE0A@DM4PR11MB5456.namprd11.prod.outlook.com>
On Tue, Oct 07, 2025 at 08:59:25AM -0600, Lin, Shuicheng wrote:
> Hi all,
> Could you please help me review it? Thanks in advance.
>
> Best Regards
> Shuicheng
>
> On Sat, Oct 4, 2025 10:31 AM Shuicheng Lin wrote:
> > In normal operation, a registered exec queue is disabled and deregistered
> > through the GuC, and freed only after the GuC confirms completion. However,
> > if the driver is forced to unbind while the exec queue is still running, the user
> > may call exec_destroy() after the GuC has already been stopped and CT
> > communication disabled.
> >
> > In this case, the driver cannot receive a response from the GuC, preventing
> > proper cleanup of exec queue resources. Fix this by directly releasing the
> > resources when CT is disabled.
> >
> > Here is the failure dmesg log:
> > "
> > [ 468.089581] ---[ end trace 0000000000000000 ]--- [ 468.089608] pci
> > 0000:03:00.0: [drm] *ERROR* GT0: GUC ID manager unclean (1/65535)
> > [ 468.090558] pci 0000:03:00.0: [drm] GT0: total 65535
> > [ 468.090562] pci 0000:03:00.0: [drm] GT0: used 1
> > [ 468.090564] pci 0000:03:00.0: [drm] GT0: range 1..1 (1)
> > [ 468.092716] ------------[ cut here ]------------ [ 468.092719] WARNING:
> > CPU: 14 PID: 4775 at drivers/gpu/drm/xe/xe_ttm_vram_mgr.c:298
> > ttm_vram_mgr_fini+0xf8/0x130 [xe] "
> >
> > Signed-off-by: Shuicheng Lin <shuicheng.lin@intel.com>
> > ---
> > drivers/gpu/drm/xe/xe_guc_submit.c | 2 +-
> > 1 file changed, 1 insertion(+), 1 deletion(-)
> >
> > diff --git a/drivers/gpu/drm/xe/xe_guc_submit.c
> > b/drivers/gpu/drm/xe/xe_guc_submit.c
> > index 53024eb5670b..9d33f63d972b 100644
> > --- a/drivers/gpu/drm/xe/xe_guc_submit.c
> > +++ b/drivers/gpu/drm/xe/xe_guc_submit.c
> > @@ -1489,7 +1489,7 @@ static void
> > __guc_exec_queue_process_msg_cleanup(struct xe_sched_msg *msg)
> > xe_gt_assert(guc_to_gt(guc), !(q->flags &
> > EXEC_QUEUE_FLAG_PERMANENT));
> > trace_xe_exec_queue_cleanup_entity(q);
> >
> > - if (exec_queue_registered(q))
> > + if (exec_queue_registered(q) && xe_guc_ct_enabled(&guc->ct))
Checking the CT here is not the correct approach — this is a state that
can change at any time. For example, the CT may go down and come back
during VF migration. If you race with that state change, the GuC will
hold references to the queue, and things will quickly go sideways.
I think what you need here is a driver unbinding check. Do we wedge the
device during unbind? If so, you could do something like:
if (exec_queue_registered(q) && !xe_device_wedged(xe))
Show more lines If we don’t wedge the device, we’ll need some
xe_device-level function to indicate that the driver is being unbound.
Matt
> > disable_scheduling_deregister(guc, q);
> > else
> > __guc_exec_queue_destroy(guc, q);
> > --
> > 2.49.0
>
next prev parent reply other threads:[~2025-10-07 15:10 UTC|newest]
Thread overview: 18+ messages / expand[flat|nested] mbox.gz Atom feed top
2025-10-04 17:30 [PATCH] drm/xe/guc: Check CT enable state before deregistering exec queue Shuicheng Lin
2025-10-04 17:52 ` ✓ CI.KUnit: success for " Patchwork
2025-10-04 18:27 ` ✓ Xe.CI.BAT: " Patchwork
2025-10-04 19:43 ` ✓ Xe.CI.Full: " Patchwork
2025-10-07 14:59 ` [PATCH] " Lin, Shuicheng
2025-10-07 15:09 ` Matthew Brost [this message]
2025-10-07 17:59 ` Lin, Shuicheng
2025-10-07 18:37 ` Matthew Brost
2025-10-08 17:49 ` Lin, Shuicheng
2025-10-10 17:25 ` [PATCH v2] drm/xe/guc: Check GuC running " Shuicheng Lin
2025-10-11 15:13 ` Matthew Brost
2025-10-11 21:35 ` Lin, Shuicheng
2025-10-13 2:06 ` Matthew Brost
2025-10-14 8:58 ` Matthew Auld
2025-10-14 15:15 ` Lin, Shuicheng
2025-10-10 17:36 ` ✓ CI.KUnit: success for drm/xe/guc: Check CT enable state before deregistering exec queue (rev2) Patchwork
2025-10-10 18:28 ` ✓ Xe.CI.BAT: " Patchwork
2025-10-11 0:11 ` ✓ Xe.CI.Full: " Patchwork
Reply instructions:
You may reply publicly to this message via plain-text email
using any one of the following methods:
* Save the following mbox file, import it into your mail client,
and reply-to-all from there: mbox
Avoid top-posting and favor interleaved quoting:
https://en.wikipedia.org/wiki/Posting_style#Interleaved_style
* Reply using the --to, --cc, and --in-reply-to
switches of git-send-email(1):
git send-email \
--in-reply-to=aOUtP+15UGWK1YaB@lstrano-desk.jf.intel.com \
--to=matthew.brost@intel.com \
--cc=Michal.Wajdeczko@intel.com \
--cc=intel-xe@lists.freedesktop.org \
--cc=lucas.demarchi@intel.com \
--cc=shuicheng.lin@intel.com \
--cc=tomasz.lis@intel.com \
/path/to/YOUR_REPLY
https://kernel.org/pub/software/scm/git/docs/git-send-email.html
* If your mail client supports setting the In-Reply-To header
via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line
before the message body.
This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox