From: Matthew Brost <matthew.brost@intel.com>
To: "Lin, Shuicheng" <shuicheng.lin@intel.com>
Cc: "intel-xe@lists.freedesktop.org" <intel-xe@lists.freedesktop.org>,
"Wajdeczko, Michal" <Michal.Wajdeczko@intel.com>,
"De Marchi, Lucas" <lucas.demarchi@intel.com>,
"Lis, Tomasz" <tomasz.lis@intel.com>
Subject: Re: [PATCH] drm/xe/guc: Check CT enable state before deregistering exec queue
Date: Tue, 7 Oct 2025 11:37:10 -0700 [thread overview]
Message-ID: <aOVd1vor49g67mHt@lstrano-desk.jf.intel.com> (raw)
In-Reply-To: <DM4PR11MB5456FBB99875139BBC207BA9EAE0A@DM4PR11MB5456.namprd11.prod.outlook.com>
On Tue, Oct 07, 2025 at 11:59:30AM -0600, Lin, Shuicheng wrote:
> On Tue, Oct 7, 2025 8:10 AM Matthew Brost wrote:
> > On Tue, Oct 07, 2025 at 08:59:25AM -0600, Lin, Shuicheng wrote:
> > > Hi all,
> > > Could you please help me review it? Thanks in advance.
> > >
> > > Best Regards
> > > Shuicheng
> > >
> > > On Sat, Oct 4, 2025 10:31 AM Shuicheng Lin wrote:
> > > > In normal operation, a registered exec queue is disabled and
> > > > deregistered through the GuC, and freed only after the GuC confirms
> > > > completion. However, if the driver is forced to unbind while the
> > > > exec queue is still running, the user may call exec_destroy() after
> > > > the GuC has already been stopped and CT communication disabled.
> > > >
> > > > In this case, the driver cannot receive a response from the GuC,
> > > > preventing proper cleanup of exec queue resources. Fix this by
> > > > directly releasing the resources when CT is disabled.
> > > >
> > > > Here is the failure dmesg log:
> > > > "
> > > > [ 468.089581] ---[ end trace 0000000000000000 ]--- [ 468.089608]
> > > > pci
> > > > 0000:03:00.0: [drm] *ERROR* GT0: GUC ID manager unclean (1/65535)
> > > > [ 468.090558] pci 0000:03:00.0: [drm] GT0: total 65535
> > > > [ 468.090562] pci 0000:03:00.0: [drm] GT0: used 1
> > > > [ 468.090564] pci 0000:03:00.0: [drm] GT0: range 1..1 (1)
> > > > [ 468.092716] ------------[ cut here ]------------ [ 468.092719] WARNING:
> > > > CPU: 14 PID: 4775 at drivers/gpu/drm/xe/xe_ttm_vram_mgr.c:298
> > > > ttm_vram_mgr_fini+0xf8/0x130 [xe] "
> > > >
> > > > Signed-off-by: Shuicheng Lin <shuicheng.lin@intel.com>
> > > > ---
> > > > drivers/gpu/drm/xe/xe_guc_submit.c | 2 +-
> > > > 1 file changed, 1 insertion(+), 1 deletion(-)
> > > >
> > > > diff --git a/drivers/gpu/drm/xe/xe_guc_submit.c
> > > > b/drivers/gpu/drm/xe/xe_guc_submit.c
> > > > index 53024eb5670b..9d33f63d972b 100644
> > > > --- a/drivers/gpu/drm/xe/xe_guc_submit.c
> > > > +++ b/drivers/gpu/drm/xe/xe_guc_submit.c
> > > > @@ -1489,7 +1489,7 @@ static void
> > > > __guc_exec_queue_process_msg_cleanup(struct xe_sched_msg *msg)
> > > > xe_gt_assert(guc_to_gt(guc), !(q->flags &
> > > > EXEC_QUEUE_FLAG_PERMANENT));
> > > > trace_xe_exec_queue_cleanup_entity(q);
> > > >
> > > > - if (exec_queue_registered(q))
> > > > + if (exec_queue_registered(q) && xe_guc_ct_enabled(&guc->ct))
> >
> > Checking the CT here is not the correct approach — this is a state that can
> > change at any time. For example, the CT may go down and come back during
> > VF migration. If you race with that state change, the GuC will hold references
> > to the queue, and things will quickly go sideways.
> >
> > I think what you need here is a driver unbinding check. Do we wedge the
> > device during unbind? If so, you could do something like:
> >
> > if (exec_queue_registered(q) && !xe_device_wedged(xe))
> >
> > Show more lines If we don’t wedge the device, we’ll need some xe_device-
> > level function to indicate that the driver is being unbound.
>
> wedged flag won't be set during unbind process. How about "gt->uc.guc.submission_state.enabled" set in xe_gt_sanitize()? I guess no since it is like xe_guc_ct_enabled().
xe_guc_ct_enabled state can change in quite a few places, I think
uc.guc.submission_state.enabled is bit more restrictive.
> How about xe_uc_fw_is_running(), or we should not rely on the guc state, as it may be changed later?
I think either of those would likely work. It shouldn't be possible to
enter a scheduler work queue in normal operation where either of those
is clear (i.e., I believe this is only possible if the driver is
unloading / unbinding or we have wedged the device).
> You prefer to add a new variable like "xe->unbound", and check the bind state here, right?
> Thanks.
This might be better (?), a bit of a bikeshed and I'm personally fine
with either. Other might have a stonger preference than myself though.
Matt
>
> Shuicheng
>
> >
> > Matt
> >
> > > > disable_scheduling_deregister(guc, q);
> > > > else
> > > > __guc_exec_queue_destroy(guc, q);
> > > > --
> > > > 2.49.0
> > >
next prev parent reply other threads:[~2025-10-07 18:37 UTC|newest]
Thread overview: 18+ messages / expand[flat|nested] mbox.gz Atom feed top
2025-10-04 17:30 [PATCH] drm/xe/guc: Check CT enable state before deregistering exec queue Shuicheng Lin
2025-10-04 17:52 ` ✓ CI.KUnit: success for " Patchwork
2025-10-04 18:27 ` ✓ Xe.CI.BAT: " Patchwork
2025-10-04 19:43 ` ✓ Xe.CI.Full: " Patchwork
2025-10-07 14:59 ` [PATCH] " Lin, Shuicheng
2025-10-07 15:09 ` Matthew Brost
2025-10-07 17:59 ` Lin, Shuicheng
2025-10-07 18:37 ` Matthew Brost [this message]
2025-10-08 17:49 ` Lin, Shuicheng
2025-10-10 17:25 ` [PATCH v2] drm/xe/guc: Check GuC running " Shuicheng Lin
2025-10-11 15:13 ` Matthew Brost
2025-10-11 21:35 ` Lin, Shuicheng
2025-10-13 2:06 ` Matthew Brost
2025-10-14 8:58 ` Matthew Auld
2025-10-14 15:15 ` Lin, Shuicheng
2025-10-10 17:36 ` ✓ CI.KUnit: success for drm/xe/guc: Check CT enable state before deregistering exec queue (rev2) Patchwork
2025-10-10 18:28 ` ✓ Xe.CI.BAT: " Patchwork
2025-10-11 0:11 ` ✓ Xe.CI.Full: " Patchwork
Reply instructions:
You may reply publicly to this message via plain-text email
using any one of the following methods:
* Save the following mbox file, import it into your mail client,
and reply-to-all from there: mbox
Avoid top-posting and favor interleaved quoting:
https://en.wikipedia.org/wiki/Posting_style#Interleaved_style
* Reply using the --to, --cc, and --in-reply-to
switches of git-send-email(1):
git send-email \
--in-reply-to=aOVd1vor49g67mHt@lstrano-desk.jf.intel.com \
--to=matthew.brost@intel.com \
--cc=Michal.Wajdeczko@intel.com \
--cc=intel-xe@lists.freedesktop.org \
--cc=lucas.demarchi@intel.com \
--cc=shuicheng.lin@intel.com \
--cc=tomasz.lis@intel.com \
/path/to/YOUR_REPLY
https://kernel.org/pub/software/scm/git/docs/git-send-email.html
* If your mail client supports setting the In-Reply-To header
via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line
before the message body.
This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox