Re: [PATCH v3 03/25] drm/xe: Decouple exec queue idle check from LRC

From: "Summers, Stuart" <stuart.summers@intel.com>
To: "Brost, Matthew" <matthew.brost@intel.com>
Cc: "intel-xe@lists.freedesktop.org" <intel-xe@lists.freedesktop.org>,
	"Ghimiray, Himal Prasad" <himal.prasad.ghimiray@intel.com>,
	"Yadav, Arvind" <arvind.yadav@intel.com>,
	"thomas.hellstrom@linux.intel.com"
	<thomas.hellstrom@linux.intel.com>,
	"Dugast, Francois" <francois.dugast@intel.com>
Subject: Re: [PATCH v3 03/25] drm/xe: Decouple exec queue idle check from LRC
Date: Tue, 3 Mar 2026 22:54:38 +0000
Message-ID: <1b4ec7807684fbe0f4cc6d7a83e5ae2e0bc9f4e2.camel@intel.com>
In-Reply-To: <aadj2K8I7Bom1vXJ@lstrano-desk.jf.intel.com>

On Tue, 2026-03-03 at 14:42 -0800, Matthew Brost wrote:
> On Tue, Mar 03, 2026 at 02:26:56PM -0700, Summers, Stuart wrote:
> > On Mon, 2026-03-02 at 13:02 -0800, Matthew Brost wrote:
> > > > On Mon, Mar 02, 2026 at 01:50:11PM -0700, Summers, Stuart
> > > > wrote:
> > > > > > On Fri, 2026-02-27 at 17:34 -0800, Matthew Brost wrote:
> > > > > > > > We already maintain a job count for each exec queue, so
> > > > > > > > simplify the idle check to rely on the job count rather
> > > > > > > > than the LRC state. This decouples exec queues from
> > > > > > > > LRC-based backends and avoids unnecessarily coupling idle
> > > > > > > > detection to backend-specific implementation details.
> > > > > > > > 
> > > > > > > > Signed-off-by: Matthew Brost <matthew.brost@intel.com>
> > > > > > > > ---
> > > > > > > >  drivers/gpu/drm/xe/xe_exec_queue.c | 15 +--------------
> > > > > > > >  1 file changed, 1 insertion(+), 14 deletions(-)
> > > > > > > > 
> > > > > > > > diff --git a/drivers/gpu/drm/xe/xe_exec_queue.c b/drivers/gpu/drm/xe/xe_exec_queue.c
> > > > > > > > index 2d0e73a6a6ee..b3f700a9d425 100644
> > > > > > > > --- a/drivers/gpu/drm/xe/xe_exec_queue.c
> > > > > > > > +++ b/drivers/gpu/drm/xe/xe_exec_queue.c
> > > > > > > > @@ -1382,20 +1382,7 @@ bool xe_exec_queue_is_lr(struct xe_exec_queue *q)
> > > > > > > >   */
> > > > > > > >  bool xe_exec_queue_is_idle(struct xe_exec_queue *q)
> > > > > > > >  {
> > > > > > > > -       if (xe_exec_queue_is_parallel(q)) {
> > > > > > > > -               int i;
> > > > > > > > -
> > > > > > > > -               for (i = 0; i < q->width; ++i) {
> > > > > > > > -                       if (xe_lrc_seqno(q->lrc[i]) !=
> > > > > > > > -                           q->lrc[i]->fence_ctx.next_seqno - 1)
> > > > > > > > -                               return false;
> > > > > > > > -               }
> > > > > > > > -
> > > > > > > > -               return true;
> > > > > > > > -       }
> > > > > > > > -
> > > > > > > > -       return xe_lrc_seqno(q->lrc[0]) ==
> > > > > > > > -               q->lrc[0]->fence_ctx.next_seqno - 1;
> > > > > > > > +       return !atomic_read(&q->job_cnt);
> > > > > > 
> > > > > > Still looking through the series, so might be handled
> > > > > > elsewhere, but just looking at this patch alone, I'm a little
> > > > > > worried this will cause unexpected issues in the exec queue
> > > > > > cleanup. This function currently ensures that the job is idle
> > > > > > from the hardware level. The change you
> > > > 
> > > > The current check is actually incorrect if, for example, a
> > > > queue is reset and the LRC head != tail. However, I believe the
> > > > only places we use xe_exec_queue_is_idle are cases where a queue
> > > > hasn’t been reset, so it happens to work in practice. It’s also
> > > > just an advisory check, so nothing bad happens if it incorrectly
> > > > reports “not idle”.
> > 
> > So reset case aside (which not taking into consideration anything
> > you
> > said below :) I'd consider a bug here), it does give a false sense
> > of
> > things being actually idle on the hardware IMO that might be
> > extended
> > out to other areas without realizing in the future. I agree that
> > the
> > current use cases match what you said.
> > 
> 
> Yes, so I would say this patch is actually improving things and
> opening
> up this function to other possible use cases.

Agreed..
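
To make the job-count approach concrete, here is a minimal userspace
sketch of an idle check driven purely by a per-queue counter. It uses C11
atomics rather than the kernel's atomic_t, and struct demo_queue and all
demo_* helpers are hypothetical names for illustration only. The patch
confirms just the atomic_read(&q->job_cnt) check, and the thread mentions
the decrement in xe_sched_job_destroy(); the increment at job creation is
an assumption of this sketch.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>

/* Hypothetical stand-in for the exec queue; only the job count is modeled. */
struct demo_queue {
	atomic_int job_cnt;	/* jobs created on this queue and not yet destroyed */
};

static void demo_queue_init(struct demo_queue *q)
{
	atomic_init(&q->job_cnt, 0);
}

/* Assumed: the count goes up when a job is created on the queue. */
static void demo_job_create(struct demo_queue *q)
{
	atomic_fetch_add(&q->job_cnt, 1);
}

/* The count goes down when the last reference to a job is dropped. */
static void demo_job_destroy(struct demo_queue *q)
{
	int old = atomic_fetch_sub(&q->job_cnt, 1);

	assert(old > 0);	/* every destroy must pair with an earlier create */
}

/* Same shape as the patched check: idle means no outstanding jobs. */
static bool demo_queue_is_idle(struct demo_queue *q)
{
	return !atomic_load(&q->job_cnt);
}
```

Nothing in this sketch touches LRC or seqno state, which is the point of
the decoupling: the check works the same for any backend that creates and
destroys jobs.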

> 
> > > > 
> > > > > > make here moves that to a software level check. And this is
> > > > > > getting decremented and checked before we tear down the exec
> > > > > > queue. So presumably, GuC and the command streamer could still
> > > > > > be doing something here and we're falsely telling other parts
> > > > > > of the driver that rely on the engine to really be idle to
> > > > > > trust us here.
> > > > > > 
> > > > 
> > > > See above for part of the explanation, but the other part
> > > > involves
> > > > reference counting and fence signaling. A job can only have its
> > > > last
> > > > reference dropped when its fence is signaled.
> > > > 
> > > > A fence can only signal under the following conditions:
> > > > 
> > > > - Its seqno is incremented via ring instructions (which
> > > >   corresponds to the LRC head == tail if it’s the last job on
> > > >   the queue).
> > 
> > Right, so technically I guess we could have a hardware hang after
> > the
> > sequence number was written since that isn't the last instruction
> > there, but seems very unlikely. And if we did hit that case, the
> > reset
> > handler would cover that.
> > 
> > Maybe this should be obvious... but just so I'm not missing
> > something
> > here..
> > 
> > So I think the signaling here we're talking about is via the
> > MI_USER_INT in:
> > xe_hw_engine_handle_irq -> xe_hw_fence_rq_run
> 
> This is where fences are signaled or if we time them out in
> guc_exec_queue_timedout_job via xe_sched_job_set_error.

Ah right..
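
As a rough model of the two signaling paths discussed in this thread (the
seqno written by ring instructions, or a software timeout that is only
allowed once the queue is off the hardware), the sketch below may help.
All demo_* names are hypothetical, the wraparound-safe seqno comparison is
an assumption of this sketch rather than something taken from the driver,
and -ETIMEDOUT is just a placeholder error value.

```c
#include <assert.h>
#include <errno.h>
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical fence: signaled once, optionally carrying an error. */
struct demo_fence {
	uint32_t seqno;
	bool signaled;
	int error;
};

/* Hypothetical view of the queue state that matters for signaling. */
struct demo_hw_queue {
	uint32_t hw_seqno;	/* last seqno written by ring instructions */
	bool sched_disabled;	/* scheduling-disable request was acknowledged */
};

/* Path 1: the hardware has written a seqno at or past the fence's seqno. */
static bool demo_fence_try_signal_hw(struct demo_fence *f,
				     const struct demo_hw_queue *q)
{
	/* Assumed wraparound-safe "hw_seqno >= seqno" comparison. */
	if (!f->signaled && (int32_t)(q->hw_seqno - f->seqno) >= 0)
		f->signaled = true;

	return f->signaled;
}

/* Path 2: timeout; refuse to signal while the queue may still be running. */
static bool demo_fence_try_signal_timeout(struct demo_fence *f,
					  const struct demo_hw_queue *q,
					  int err)
{
	assert(err < 0);	/* callers pass a negative errno value */

	if (f->signaled)
		return true;
	if (!q->sched_disabled)
		return false;

	f->error = err;
	f->signaled = true;
	return true;
}
```

In this model a fence can never signal while the hardware might still be
executing its job, which is why a job count that only drops after
signaling does not report idle early.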

> 
> > 
> > And that dependency you're talking about is here (xe_exec, although
> > I
> > know there are a few in xe_migrate, xe_pt, etc)?
> >         /* Wait behind rebinds */
> >         if (!xe_vm_in_lr_mode(vm)) {
> >                 err = xe_sched_job_add_deps(job,
> >                                             xe_vm_resv(vm),
> >                                             DMA_RESV_USAGE_KERNEL);
> >                 if (err)
> >                         goto err_put_job;
> >         }
> > 
> > What is the expectation for LR jobs?
> > 
> 
> This is completely unrelated but in dma-fence mode
> (!xe_vm_in_lr_mode)
> we can't fault the device so we issue rebinds in the current exec
> IOCTL for anything that moved since the last exec IOCTL - this
> ordering
> exec IOCTL submission behind moving memory back into place +
> rebinding
> it.
> 
> LR mode we either:
>  - Rebind in preempt rebind worker
>  - Let the device take a page fault and rebind
> 
> Because of this we don't even take the dma-resv lock for LR VMs in
> the
> exec IOCTL.

Yeah ok makes sense and I appreciate the explanation :)
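
For reference, the mode split described above can be summarized in a small
sketch. Everything here is hypothetical (struct demo_vm, the can_fault
flag, the enum names); in particular, choosing between the preempt rebind
worker and page faults based on a single flag is an assumption made for
illustration, not something stated in the thread.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical VM descriptor; not the driver's struct xe_vm. */
struct demo_vm {
	bool lr_mode;	/* long-running mode, i.e. not dma-fence mode */
	bool can_fault;	/* assumed flag: the device may take page faults */
};

enum demo_rebind_path {
	DEMO_REBIND_IN_EXEC,		/* rebinds issued from the exec IOCTL */
	DEMO_REBIND_IN_PREEMPT_WORKER,	/* rebinds done by the preempt rebind worker */
	DEMO_REBIND_ON_PAGE_FAULT,	/* rebinds done when the device faults */
};

/* Where rebinds happen for a given VM in this simplified model. */
static enum demo_rebind_path demo_vm_rebind_path(const struct demo_vm *vm)
{
	if (!vm->lr_mode) {
		/* dma-fence mode cannot fault the device. */
		assert(!vm->can_fault);
		return DEMO_REBIND_IN_EXEC;
	}

	return vm->can_fault ? DEMO_REBIND_ON_PAGE_FAULT
			     : DEMO_REBIND_IN_PREEMPT_WORKER;
}

/* Only dma-fence mode orders the exec job behind the VM's dma-resv. */
static bool demo_exec_waits_on_resv(const struct demo_vm *vm)
{
	return !vm->lr_mode;
}
```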

Anyway I think with that I agree with the direction here.

Reviewed-by: Stuart Summers <stuart.summers@intel.com>

Thanks,
Stuart

> 
> Matt 
> 
> > Thanks,
> > Stuart
> > 
> > > > - We time out jobs on the queue and signal their fences in
> > > >   software. We only signal fences in software once the queue has
> > > >   been kicked off the hardware (i.e., scheduling-disable H2G
> > > >   triggers a G2H response).
> > > > 
> > > > > > For reference, I'm looking at xe_sched_job_destroy() where we
> > > > > > do the decrement and then the exec queue put.
> > > > > > 
> > > > > > So my question is, how are we guaranteeing that hardware is
> > > > > > indeed idle after this change? Are we moving the sequence
> > > > > > number check somewhere else?
> > > > > > 
> > > > 
> > > > I think above explains this.
> > > > 
> > > > Matt
> > > > 
> > > > > > Thanks,
> > > > > > Stuart
> > > > > > 
> > > > > > > >  }
> > > > > > > >  
> > > > > > > >  /**
> > > > > > 
> >

Thread overview: 63+ messages
2026-02-28  1:34 [PATCH v3 00/25] CPU binds and ULLS on migration queue Matthew Brost
2026-02-28  1:34 ` [PATCH v3 01/25] drm/xe: Drop struct xe_migrate_pt_update argument from populate/clear vfuns Matthew Brost
2026-03-05 14:17   ` Francois Dugast
2026-02-28  1:34 ` [PATCH v3 02/25] drm/xe: Add xe_migrate_update_pgtables_cpu_execute helper Matthew Brost
2026-03-05 14:39   ` Francois Dugast
2026-02-28  1:34 ` [PATCH v3 03/25] drm/xe: Decouple exec queue idle check from LRC Matthew Brost
2026-03-02 20:50   ` Summers, Stuart
2026-03-02 21:02     ` Matthew Brost
2026-03-03 21:26       ` Summers, Stuart
2026-03-03 22:42         ` Matthew Brost
2026-03-03 22:54           ` Summers, Stuart [this message]
2026-02-28  1:34 ` [PATCH v3 04/25] drm/xe: Add job count to GuC exec queue snapshot Matthew Brost
2026-03-02 20:50   ` Summers, Stuart
2026-02-28  1:34 ` [PATCH v3 05/25] drm/xe: Update xe_bo_put_deferred arguments to include writeback flag Matthew Brost
2026-04-01 12:20   ` Francois Dugast
2026-04-01 22:39     ` Matthew Brost
2026-02-28  1:34 ` [PATCH v3 06/25] drm/xe: Add XE_BO_FLAG_PUT_VM_ASYNC Matthew Brost
2026-04-01 12:22   ` Francois Dugast
2026-04-01 22:38     ` Matthew Brost
2026-02-28  1:34 ` [PATCH v3 07/25] drm/xe: Update scheduler job layer to support PT jobs Matthew Brost
2026-03-03 22:50   ` Summers, Stuart
2026-03-03 23:00     ` Matthew Brost
2026-02-28  1:34 ` [PATCH v3 08/25] drm/xe: Add helpers to access PT ops Matthew Brost
2026-04-07 15:22   ` Francois Dugast
2026-02-28  1:34 ` [PATCH v3 09/25] drm/xe: Add struct xe_pt_job_ops Matthew Brost
2026-03-03 23:26   ` Summers, Stuart
2026-03-03 23:28     ` Matthew Brost
2026-02-28  1:34 ` [PATCH v3 10/25] drm/xe: Update GuC submission backend to run PT jobs Matthew Brost
2026-03-03 23:28   ` Summers, Stuart
2026-03-04  0:26     ` Matthew Brost
2026-03-04 20:43       ` Summers, Stuart
2026-03-04 21:53         ` Matthew Brost
2026-03-05 20:24           ` Summers, Stuart
2026-02-28  1:34 ` [PATCH v3 11/25] drm/xe: Store level in struct xe_vm_pgtable_update Matthew Brost
2026-03-03 23:44   ` Summers, Stuart
2026-02-28  1:34 ` [PATCH v3 12/25] drm/xe: Don't use migrate exec queue for page fault binds Matthew Brost
2026-02-28  1:34 ` [PATCH v3 13/25] drm/xe: Enable CPU binds for jobs Matthew Brost
2026-02-28  1:34 ` [PATCH v3 14/25] drm/xe: Remove unused arguments from xe_migrate_pt_update_ops Matthew Brost
2026-02-28  1:34 ` [PATCH v3 15/25] drm/xe: Make bind queues operate cross-tile Matthew Brost
2026-02-28  1:34 ` [PATCH v3 16/25] drm/xe: Add CPU bind layer Matthew Brost
2026-02-28  1:34 ` [PATCH v3 17/25] drm/xe: Add device flag to enable PT mirroring across tiles Matthew Brost
2026-02-28  1:34 ` [PATCH v3 18/25] drm/xe: Add xe_hw_engine_write_ring_tail Matthew Brost
2026-02-28  1:34 ` [PATCH v3 19/25] drm/xe: Add ULLS support to LRC Matthew Brost
2026-03-05 20:21   ` Francois Dugast
2026-02-28  1:34 ` [PATCH v3 20/25] drm/xe: Add ULLS migration job support to migration layer Matthew Brost
2026-03-05 23:34   ` Summers, Stuart
2026-03-09 23:11     ` Matthew Brost
2026-02-28  1:34 ` [PATCH v3 21/25] drm/xe: Add MI_SEMAPHORE_WAIT instruction defs Matthew Brost
2026-02-28  1:34 ` [PATCH v3 22/25] drm/xe: Add ULLS migration job support to ring ops Matthew Brost
2026-02-28  1:34 ` [PATCH v3 23/25] drm/xe: Add ULLS migration job support to GuC submission Matthew Brost
2026-02-28  1:35 ` [PATCH v3 24/25] drm/xe: Enter ULLS for migration jobs upon page fault or SVM prefetch Matthew Brost
2026-02-28  1:35 ` [PATCH v3 25/25] drm/xe: Add modparam to enable / disable ULLS on migrate queue Matthew Brost
2026-03-05 22:59   ` Summers, Stuart
2026-04-01 22:44     ` Matthew Brost
2026-02-28  1:43 ` ✗ CI.checkpatch: warning for CPU binds and ULLS on migration queue (rev3) Patchwork
2026-02-28  1:44 ` ✓ CI.KUnit: success " Patchwork
2026-02-28  2:32 ` ✓ Xe.CI.BAT: " Patchwork
2026-02-28 13:59 ` ✗ Xe.CI.FULL: failure " Patchwork
2026-03-02 17:54   ` Summers, Stuart
2026-03-02 18:13     ` Matthew Brost
2026-03-05 22:56 ` [PATCH v3 00/25] CPU binds and ULLS on migration queue Summers, Stuart
2026-03-10 22:17   ` Matthew Brost
2026-03-20 15:31 ` Thomas Hellström
