Intel-XE Archive on lore.kernel.org
From: Matthew Brost <matthew.brost@intel.com>
To: Raag Jadav <raag.jadav@intel.com>
Cc: "Dong, Zhanjun" <zhanjun.dong@intel.com>,
	<intel-xe@lists.freedesktop.org>
Subject: Re: [PATCH v3 03/10] drm/xe/guc_submit: Support cancelling submission
Date: Sun, 15 Mar 2026 19:31:10 -0700	[thread overview]
Message-ID: <abdrboTkN82XHFTw@lstrano-desk.jf.intel.com> (raw)
In-Reply-To: <abaC4Y80fFU18eCx@black.igk.intel.com>

On Sun, Mar 15, 2026 at 10:58:57AM +0100, Raag Jadav wrote:
> On Fri, Mar 13, 2026 at 11:37:03AM -0400, Dong, Zhanjun wrote:
> > On 2026-03-08 9:55 a.m., Raag Jadav wrote:
> > > In preparation for use cases which require cancelling submission before
> > > PCIe FLR, introduce the xe_guc_submit_cancel() helper. This cancels and
> > > frees any in-flight jobs on the scheduler.
> > 
> > Could you add more info on why new cancel functions are needed rather
> > than calling the existing xe_sched_submission_stop?
> > From the commit message it looks very similar to stop, which also does
> > the stop + free actions.
> 

Let me start by saying these GuC interfaces for global control are badly
named and undocumented, which is entirely my fault. We should clean
these up.

Let me explain what we currently have in place:

- xe_guc_submit_stop — Stops all scheduling on all queues, cleans up any
  lost expected G2H, and triggers queue teardown on any queues with jobs
  that have started but not completed.

- xe_guc_submit_start — Starts scheduling on all queues and resubmits
  any jobs on queues that were not torn down in xe_guc_submit_stop.

- xe_guc_submit_pause — Stops scheduling on all queues.

- xe_guc_submit_pause_abort — Starts scheduling on all queues and
  initiates teardown on all queues.

- xe_guc_submit_unpause — Resumes scheduling on all queues.

The use cases are:

- GT reset: xe_guc_submit_stop / xe_guc_submit_start

- Runtime PM d3cold: xe_guc_submit_stop / xe_guc_submit_start

- Runtime PM non-d3cold: xe_guc_submit_pause / xe_guc_submit_unpause

- Wedging: xe_guc_submit_stop / xe_guc_submit_pause_abort (added in [1])

- Driver unload: xe_guc_submit_stop / xe_guc_submit_pause_abort (added
  in [1])

I think for FLR the combination you want is xe_guc_submit_stop /
xe_guc_submit_pause_abort, i.e., tear down all queues (and if the device
is already wedged, we've already done this [1], but doing it again is
fine). However, this will also tear down the kernel queues that are
required for the driver to become functional again.

So now that I think about it, I probably gave bad advice regarding
xe_exec_queue_reinit after [1]. In xe_migrate_reinit, just drop the ref
to m->q and create a new queue instead. I assume we can allocate memory
during FLR; if not, this answer changes (e.g., we'd also need a hook
into the GuC backend to reinit the queue's flags after all of its jobs
have drained).

> IIUC submission_stop() doesn't free any jobs, it just stops the scheduler

Initiating queue teardown will signal all fences, and thus free the
jobs. So submission_stop can do this depending on queue/job state;
xe_guc_submit_pause_abort will do this for all queues. Note we also
merged a patch last week changing xe_guc_submit_pause_abort behavior
[1], which will affect your series if the device is wedged.

[1] https://patchwork.freedesktop.org/series/162978/

> and cancels the wq used to run jobs. But this leaves the jobs on the
> scheduler's pending list behind if they're not on the wq yet, which
> results in a timeout. So perhaps I used the wrong terminology; I'll
> update this.
> 
> Also, I know it's a bit hacky to directly bork the scheduler's pending list
> so this can definitely use some standardization. Open to suggestions.
> 
> Raag
> 
> > > Signed-off-by: Raag Jadav <raag.jadav@intel.com>
> > > ---
> > > v3: Cancel in-flight jobs before FLR
> > > ---
> > >   drivers/gpu/drm/xe/xe_gpu_scheduler.c | 11 +++++++++++
> > >   drivers/gpu/drm/xe/xe_gpu_scheduler.h |  1 +
> > >   drivers/gpu/drm/xe/xe_guc_submit.c    | 24 ++++++++++++++++++++++++
> > >   drivers/gpu/drm/xe/xe_guc_submit.h    |  1 +
> > >   4 files changed, 37 insertions(+)
> > > 
> > > diff --git a/drivers/gpu/drm/xe/xe_gpu_scheduler.c b/drivers/gpu/drm/xe/xe_gpu_scheduler.c
> > > index 9c8004d5dd91..c012dbe84540 100644
> > > --- a/drivers/gpu/drm/xe/xe_gpu_scheduler.c
> > > +++ b/drivers/gpu/drm/xe/xe_gpu_scheduler.c
> > > @@ -90,6 +90,17 @@ void xe_sched_fini(struct xe_gpu_scheduler *sched)
> > >   	drm_sched_fini(&sched->base);
> > >   }
> > > +void xe_sched_submission_cancel(struct xe_gpu_scheduler *sched)
> > > +{
> > > +	struct drm_gpu_scheduler *base = &sched->base;
> > > +	struct drm_sched_job *job, *tmp;
> > > +
> > > +	list_for_each_entry_safe_reverse(job, tmp, &base->pending_list, list) {
> > > +		list_del(&job->list);
> > > +		base->ops->free_job(job);
> > > +	}

Never do this. Use the queue teardown flows, which signal the fences and
therefore free the jobs. I can see how you arrived at this, but I
suggest rebasing on [1], as I believe it includes some pieces that were
previously missing to make FLR work.

Matt

> > > +}
> > > +
> > >   void xe_sched_submission_start(struct xe_gpu_scheduler *sched)
> > >   {
> > >   	drm_sched_wqueue_start(&sched->base);
> > > diff --git a/drivers/gpu/drm/xe/xe_gpu_scheduler.h b/drivers/gpu/drm/xe/xe_gpu_scheduler.h
> > > index 664c2db56af3..ba7892db8428 100644
> > > --- a/drivers/gpu/drm/xe/xe_gpu_scheduler.h
> > > +++ b/drivers/gpu/drm/xe/xe_gpu_scheduler.h
> > > @@ -19,6 +19,7 @@ int xe_sched_init(struct xe_gpu_scheduler *sched,
> > >   		  struct device *dev);
> > >   void xe_sched_fini(struct xe_gpu_scheduler *sched);
> > > +void xe_sched_submission_cancel(struct xe_gpu_scheduler *sched);
> > >   void xe_sched_submission_start(struct xe_gpu_scheduler *sched);
> > >   void xe_sched_submission_stop(struct xe_gpu_scheduler *sched);
> > > diff --git a/drivers/gpu/drm/xe/xe_guc_submit.c b/drivers/gpu/drm/xe/xe_guc_submit.c
> > > index de716c1fb18e..cba544cc185c 100644
> > > --- a/drivers/gpu/drm/xe/xe_guc_submit.c
> > > +++ b/drivers/gpu/drm/xe/xe_guc_submit.c
> > > @@ -2399,6 +2399,30 @@ void xe_guc_submit_stop(struct xe_guc *guc)
> > >   }
> > > +/**
> > > + * xe_guc_submit_cancel - Cancel all runs of submission tasks on given GuC.
> > > + * @guc: the &xe_guc struct instance whose scheduler is to be cancelled
> > > + */
> > > +void xe_guc_submit_cancel(struct xe_guc *guc)
> > > +{
> > > +	struct xe_exec_queue *q;
> > > +	unsigned long index;
> > > +
> > > +	mutex_lock(&guc->submission_state.lock);
> > > +
> > > +	xa_for_each(&guc->submission_state.exec_queue_lookup, index, q) {
> > > +		struct xe_gpu_scheduler *sched = &q->guc->sched;
> > > +
> > > +		/* Prevent redundant attempts to cancel parallel queues */
> > > +		if (q->guc->id != index)
> > > +			continue;
> > > +
> > > +		xe_sched_submission_cancel(sched);
> > > +	}
> > > +
> > > +	mutex_unlock(&guc->submission_state.lock);
> > > +}
> > > +
> > >   static void guc_exec_queue_revert_pending_state_change(struct xe_guc *guc,
> > >   						       struct xe_exec_queue *q)
> > >   {
> > > diff --git a/drivers/gpu/drm/xe/xe_guc_submit.h b/drivers/gpu/drm/xe/xe_guc_submit.h
> > > index b3839a90c142..f361a6d32fd3 100644
> > > --- a/drivers/gpu/drm/xe/xe_guc_submit.h
> > > +++ b/drivers/gpu/drm/xe/xe_guc_submit.h
> > > @@ -16,6 +16,7 @@ int xe_guc_submit_init(struct xe_guc *guc, unsigned int num_ids);
> > >   int xe_guc_submit_enable(struct xe_guc *guc);
> > >   void xe_guc_submit_disable(struct xe_guc *guc);
> > > +void xe_guc_submit_cancel(struct xe_guc *guc);
> > >   int xe_guc_submit_reset_prepare(struct xe_guc *guc);
> > >   void xe_guc_submit_reset_wait(struct xe_guc *guc);
> > >   void xe_guc_submit_stop(struct xe_guc *guc);
> > 

Thread overview: 18+ messages
2026-03-08 13:55 [PATCH v3 00/10] Introduce Xe PCIe FLR Raag Jadav
2026-03-08 13:55 ` [PATCH v3 01/10] drm/xe/uc_fw: Allow re-initializing firmware Raag Jadav
2026-03-08 13:55 ` [PATCH v3 02/10] drm/xe/hw_fence: Synchronize fence irq before destroying the job Raag Jadav
2026-03-08 13:55 ` [PATCH v3 03/10] drm/xe/guc_submit: Support cancelling submission Raag Jadav
2026-03-13 15:37   ` Dong, Zhanjun
2026-03-15  9:58     ` Raag Jadav
2026-03-16  2:31       ` Matthew Brost [this message]
2026-03-08 13:55 ` [PATCH v3 04/10] drm/xe/gt: Introduce FLR helpers Raag Jadav
2026-03-08 13:55 ` [PATCH v3 05/10] drm/xe/irq: Introduce xe_irq_disable() Raag Jadav
2026-03-08 13:55 ` [PATCH v3 06/10] drm/xe: Introduce xe_device_assert_lmem_ready() Raag Jadav
2026-03-08 13:55 ` [PATCH v3 07/10] drm/xe/bo_evict: Introduce xe_bo_restore_map() Raag Jadav
2026-03-08 13:55 ` [PATCH v3 08/10] drm/xe/exec_queue: Introduce xe_exec_queue_reinit() Raag Jadav
2026-03-08 13:55 ` [PATCH v3 09/10] drm/xe/migrate: Introduce xe_migrate_reinit() Raag Jadav
2026-03-08 13:55 ` [PATCH v3 10/10] drm/xe/pci: Introduce PCIe FLR Raag Jadav
2026-03-08 14:08 ` ✗ CI.checkpatch: warning for Introduce Xe PCIe FLR (rev3) Patchwork
2026-03-08 14:09 ` ✓ CI.KUnit: success " Patchwork
2026-03-08 14:50 ` ✓ Xe.CI.BAT: " Patchwork
2026-03-08 15:49 ` ✓ Xe.CI.FULL: " Patchwork
