Intel-XE Archive on lore.kernel.org
 help / color / mirror / Atom feed
From: Matthew Brost <matthew.brost@intel.com>
To: "Lin, Shuicheng" <shuicheng.lin@intel.com>
Cc: "intel-xe@lists.freedesktop.org" <intel-xe@lists.freedesktop.org>,
	"Wajdeczko, Michal" <Michal.Wajdeczko@intel.com>,
	"De Marchi, Lucas" <lucas.demarchi@intel.com>,
	"Lis, Tomasz" <tomasz.lis@intel.com>
Subject: Re: [PATCH] drm/xe/guc: Check CT enable state before deregistering exec queue
Date: Tue, 7 Oct 2025 11:37:10 -0700	[thread overview]
Message-ID: <aOVd1vor49g67mHt@lstrano-desk.jf.intel.com> (raw)
In-Reply-To: <DM4PR11MB5456FBB99875139BBC207BA9EAE0A@DM4PR11MB5456.namprd11.prod.outlook.com>

On Tue, Oct 07, 2025 at 11:59:30AM -0600, Lin, Shuicheng wrote:
> On Tue, Oct 7, 2025 8:10 AM Matthew Brost wrote:
> > On Tue, Oct 07, 2025 at 08:59:25AM -0600, Lin, Shuicheng wrote:
> > > Hi all,
> > > Could you please help me review it? Thanks in advance.
> > >
> > > Best Regards
> > > Shuicheng
> > >
> > > On Sat, Oct 4, 2025 10:31 AM Shuicheng Lin wrote:
> > > > In normal operation, a registered exec queue is disabled and
> > > > deregistered through the GuC, and freed only after the GuC confirms
> > > > completion. However, if the driver is forced to unbind while the
> > > > exec queue is still running, the user may call exec_destroy() after
> > > > the GuC has already been stopped and CT communication disabled.
> > > >
> > > > In this case, the driver cannot receive a response from the GuC,
> > > > preventing proper cleanup of exec queue resources. Fix this by
> > > > directly releasing the resources when CT is disabled.
> > > >
> > > > Here is the failure dmesg log:
> > > > "
> > > > [  468.089581] ---[ end trace 0000000000000000 ]--- [  468.089608]
> > > > pci
> > > > 0000:03:00.0: [drm] *ERROR* GT0: GUC ID manager unclean (1/65535)
> > > > [  468.090558] pci 0000:03:00.0: [drm] GT0:     total 65535
> > > > [  468.090562] pci 0000:03:00.0: [drm] GT0:     used 1
> > > > [  468.090564] pci 0000:03:00.0: [drm] GT0:     range 1..1 (1)
> > > > [  468.092716] ------------[ cut here ]------------ [  468.092719] WARNING:
> > > > CPU: 14 PID: 4775 at drivers/gpu/drm/xe/xe_ttm_vram_mgr.c:298
> > > > ttm_vram_mgr_fini+0xf8/0x130 [xe] "
> > > >
> > > > Signed-off-by: Shuicheng Lin <shuicheng.lin@intel.com>
> > > > ---
> > > >  drivers/gpu/drm/xe/xe_guc_submit.c | 2 +-
> > > >  1 file changed, 1 insertion(+), 1 deletion(-)
> > > >
> > > > diff --git a/drivers/gpu/drm/xe/xe_guc_submit.c
> > > > b/drivers/gpu/drm/xe/xe_guc_submit.c
> > > > index 53024eb5670b..9d33f63d972b 100644
> > > > --- a/drivers/gpu/drm/xe/xe_guc_submit.c
> > > > +++ b/drivers/gpu/drm/xe/xe_guc_submit.c
> > > > @@ -1489,7 +1489,7 @@ static void
> > > > __guc_exec_queue_process_msg_cleanup(struct xe_sched_msg *msg)
> > > >  	xe_gt_assert(guc_to_gt(guc), !(q->flags &
> > > > EXEC_QUEUE_FLAG_PERMANENT));
> > > >  	trace_xe_exec_queue_cleanup_entity(q);
> > > >
> > > > -	if (exec_queue_registered(q))
> > > > +	if (exec_queue_registered(q) && xe_guc_ct_enabled(&guc->ct))
> > 
> > Checking the CT here is not the correct approach — this is a state that can
> > change at any time. For example, the CT may go down and come back during
> > VF migration. If you race with that state change, the GuC will hold references
> > to the queue, and things will quickly go sideways.
> > 
> > I think what you need here is a driver unbinding check. Do we wedge the
> > device during unbind? If so, you could do something like:
> > 
> > if (exec_queue_registered(q) && !xe_device_wedged(xe))
> > 
> > Show more lines If we don’t wedge the device, we’ll need some xe_device-
> > level function to indicate that the driver is being unbound.
> 
> wedged flag won't be set during unbind process.  How about "gt->uc.guc.submission_state.enabled" set in xe_gt_sanitize()? I guess no since it is like xe_guc_ct_enabled().

xe_guc_ct_enabled state can change in quite a few places, I think
uc.guc.submission_state.enabled is bit more restrictive.

> How about xe_uc_fw_is_running(), or we should not rely on the guc state, as it may be changed later?

I think either of those would likely work. It shouldn't be possible to
enter a scheduler work queue in normal operation where either of those
is clear (i.e., I believe this is only possible if the driver is
unloading / unbinding or we have wedged the device).

> You prefer to add a new variable like "xe->unbound", and check the bind state here, right?
> Thanks. 

This might be better (?), a bit of a bikeshed and I'm personally fine
with either. Other might have a stonger preference than myself though.

Matt

> 
> Shuicheng
> 
> > 
> > Matt
> > 
> > > >  		disable_scheduling_deregister(guc, q);
> > > >  	else
> > > >  		__guc_exec_queue_destroy(guc, q);
> > > > --
> > > > 2.49.0
> > >

  reply	other threads:[~2025-10-07 18:37 UTC|newest]

Thread overview: 18+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2025-10-04 17:30 [PATCH] drm/xe/guc: Check CT enable state before deregistering exec queue Shuicheng Lin
2025-10-04 17:52 ` ✓ CI.KUnit: success for " Patchwork
2025-10-04 18:27 ` ✓ Xe.CI.BAT: " Patchwork
2025-10-04 19:43 ` ✓ Xe.CI.Full: " Patchwork
2025-10-07 14:59 ` [PATCH] " Lin, Shuicheng
2025-10-07 15:09   ` Matthew Brost
2025-10-07 17:59     ` Lin, Shuicheng
2025-10-07 18:37       ` Matthew Brost [this message]
2025-10-08 17:49         ` Lin, Shuicheng
2025-10-10 17:25 ` [PATCH v2] drm/xe/guc: Check GuC running " Shuicheng Lin
2025-10-11 15:13   ` Matthew Brost
2025-10-11 21:35     ` Lin, Shuicheng
2025-10-13  2:06       ` Matthew Brost
2025-10-14  8:58   ` Matthew Auld
2025-10-14 15:15     ` Lin, Shuicheng
2025-10-10 17:36 ` ✓ CI.KUnit: success for drm/xe/guc: Check CT enable state before deregistering exec queue (rev2) Patchwork
2025-10-10 18:28 ` ✓ Xe.CI.BAT: " Patchwork
2025-10-11  0:11 ` ✓ Xe.CI.Full: " Patchwork

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=aOVd1vor49g67mHt@lstrano-desk.jf.intel.com \
    --to=matthew.brost@intel.com \
    --cc=Michal.Wajdeczko@intel.com \
    --cc=intel-xe@lists.freedesktop.org \
    --cc=lucas.demarchi@intel.com \
    --cc=shuicheng.lin@intel.com \
    --cc=tomasz.lis@intel.com \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line before the message body.
This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox