Re: [PATCH v2] drm/xe/guc: Check GuC running state before deregistering exec queue

From: Matthew Brost <matthew.brost@intel.com>
To: "Lin, Shuicheng" <shuicheng.lin@intel.com>
Cc: "intel-xe@lists.freedesktop.org" <intel-xe@lists.freedesktop.org>,
	"De Marchi, Lucas" <lucas.demarchi@intel.com>,
	"Auld, Matthew" <matthew.auld@intel.com>,
	"Wajdeczko, Michal" <Michal.Wajdeczko@intel.com>
Subject: Re: [PATCH v2] drm/xe/guc: Check GuC running state before deregistering exec queue
Date: Sun, 12 Oct 2025 19:06:26 -0700
Message-ID: <aOxeoq+ZC5BDnFSd@lstrano-desk.jf.intel.com>
In-Reply-To: <DM4PR11MB5456FF896510A5EABEB9A365EAECA@DM4PR11MB5456.namprd11.prod.outlook.com>

On Sat, Oct 11, 2025 at 03:35:34PM -0600, Lin, Shuicheng wrote:
> On Sat, Oct 11, 2025 8:13 AM Matthew Brost wrote:
> > On Fri, Oct 10, 2025 at 05:25:29PM +0000, Shuicheng Lin wrote:
> > > In normal operation, a registered exec queue is disabled and
> > > deregistered through the GuC, and freed only after the GuC confirms
> > > completion. However, if the driver is forced to unbind while the exec
> > > queue is still running, the user may call exec_destroy() after the GuC
> > > has already been stopped and CT communication disabled.
> > >
> > > In this case, the driver cannot receive a response from the GuC,
> > > preventing proper cleanup of exec queue resources. Fix this by
> > > directly releasing the resources when GuC is not running.
> > >
> > > Here is the failure dmesg log:
> > > "
> > > [  468.089581] ---[ end trace 0000000000000000 ]---
> > > [  468.089608] pci 0000:03:00.0: [drm] *ERROR* GT0: GUC ID manager unclean (1/65535)
> > > [  468.090558] pci 0000:03:00.0: [drm] GT0:     total 65535
> > > [  468.090562] pci 0000:03:00.0: [drm] GT0:     used 1
> > > [  468.090564] pci 0000:03:00.0: [drm] GT0:     range 1..1 (1)
> > > [  468.092716] ------------[ cut here ]------------
> > > [  468.092719] WARNING: CPU: 14 PID: 4775 at drivers/gpu/drm/xe/xe_ttm_vram_mgr.c:298 ttm_vram_mgr_fini+0xf8/0x130 [xe]
> > > "
> > 
> > Does a public bug exist for this? If so, we need a Closes: tag + link in the commit message.
> > 
> > Also, I believe this warrants a Fixes tag - I can add one when merging this for you.
> > 
> 
> No. It was found during internal validation. I will share the bug number with you offline.
> 
> For the Fixes tag: the logic was implemented in the initial version of xe, and the function was renamed later.
> So this patch cannot be applied directly to the initial code, which makes me unsure about the Fixes tag.
> I will leave it to you. Thanks in advance.

Just so you know - the flow is to always apply a Fixes tag, even if the
patch may not backport cleanly. We hope the stable maintainers can figure
it out; if not, it is on us to provide patches for the stable kernels,
which the maintainers of those kernels can then apply.

Matt 

> 
> Shuicheng
> 
> > I'll wait on an answer to my first question before merging, but this LGTM.
> > Reviewed-by: Matthew Brost <matthew.brost@intel.com>
> > 
> > >
> > > v2: use xe_uc_fw_is_running() instead of xe_guc_ct_enabled(),
> > >     as CT may go down and come back during VF migration.
> > >
> > > Cc: Matthew Brost <matthew.brost@intel.com>
> > > Signed-off-by: Shuicheng Lin <shuicheng.lin@intel.com>
> > > ---
> > >  drivers/gpu/drm/xe/xe_guc_submit.c | 13 ++++++++++++-
> > >  1 file changed, 12 insertions(+), 1 deletion(-)
> > >
> > > diff --git a/drivers/gpu/drm/xe/xe_guc_submit.c b/drivers/gpu/drm/xe/xe_guc_submit.c
> > > index e9aa0625ce60..0ef67d3523a7 100644
> > > --- a/drivers/gpu/drm/xe/xe_guc_submit.c
> > > +++ b/drivers/gpu/drm/xe/xe_guc_submit.c
> > > @@ -44,6 +44,7 @@
> > >  #include "xe_ring_ops_types.h"
> > >  #include "xe_sched_job.h"
> > >  #include "xe_trace.h"
> > > +#include "xe_uc_fw.h"
> > >  #include "xe_vm.h"
> > >
> > >  static struct xe_guc *
> > > @@ -1501,7 +1502,17 @@ static void __guc_exec_queue_process_msg_cleanup(struct xe_sched_msg *msg)
> > >  	xe_gt_assert(guc_to_gt(guc), !(q->flags & EXEC_QUEUE_FLAG_PERMANENT));
> > >  	trace_xe_exec_queue_cleanup_entity(q);
> > >
> > > -	if (exec_queue_registered(q))
> > > +	/*
> > > +	 * Expected state transitions for cleanup:
> > > +	 * - If the exec queue is registered and GuC firmware is running, we must first
> > > +	 *   disable scheduling and deregister the queue to ensure proper teardown and
> > > +	 *   resource release in the GuC, then destroy the exec queue on driver side.
> > > +	 * - If the GuC is already stopped (e.g., during driver unload or GPU reset),
> > > +	 *   we cannot expect a response for the deregister request. In this case,
> > > +	 *   it is safe to directly destroy the exec queue on driver side, as the GuC
> > > +	 *   will not process further requests and all resources must be cleaned up locally.
> > > +	 */
> > > +	if (exec_queue_registered(q) && xe_uc_fw_is_running(&guc->fw))
> > >  		disable_scheduling_deregister(guc, q);
> > >  	else
> > >  		__guc_exec_queue_destroy(guc, q);
> > > --
> > > 2.49.0
> > >
