From: "Summers, Stuart" <stuart.summers@intel.com>
To: "Brost, Matthew" <matthew.brost@intel.com>
Cc: "intel-xe@lists.freedesktop.org" <intel-xe@lists.freedesktop.org>,
"Ghimiray, Himal Prasad" <himal.prasad.ghimiray@intel.com>,
"Yadav, Arvind" <arvind.yadav@intel.com>,
"thomas.hellstrom@linux.intel.com"
<thomas.hellstrom@linux.intel.com>,
"Dugast, Francois" <francois.dugast@intel.com>
Subject: Re: [PATCH v3 03/25] drm/xe: Decouple exec queue idle check from LRC
Date: Tue, 3 Mar 2026 21:26:56 +0000
Message-ID: <fd39339c6fc8461bb32fa01454754c58bb88f835.camel@intel.com>
In-Reply-To: <aaX65nBqdWvXGmZF@lstrano-desk.jf.intel.com>
On Mon, 2026-03-02 at 13:02 -0800, Matthew Brost wrote:
> On Mon, Mar 02, 2026 at 01:50:11PM -0700, Summers, Stuart wrote:
> > On Fri, 2026-02-27 at 17:34 -0800, Matthew Brost wrote:
> > > We already maintain a job count for each exec queue, so simplify the
> > > idle check to rely on the job count rather than the LRC state. This
> > > decouples exec queues from LRC-based backends and avoids
> > > unnecessarily coupling idle detection to backend-specific
> > > implementation details.
> > >
> > > Signed-off-by: Matthew Brost <matthew.brost@intel.com>
> > > ---
> > >  drivers/gpu/drm/xe/xe_exec_queue.c | 15 +--------------
> > >  1 file changed, 1 insertion(+), 14 deletions(-)
> > >
> > > diff --git a/drivers/gpu/drm/xe/xe_exec_queue.c b/drivers/gpu/drm/xe/xe_exec_queue.c
> > > index 2d0e73a6a6ee..b3f700a9d425 100644
> > > --- a/drivers/gpu/drm/xe/xe_exec_queue.c
> > > +++ b/drivers/gpu/drm/xe/xe_exec_queue.c
> > > @@ -1382,20 +1382,7 @@ bool xe_exec_queue_is_lr(struct xe_exec_queue *q)
> > >   */
> > >  bool xe_exec_queue_is_idle(struct xe_exec_queue *q)
> > >  {
> > > -	if (xe_exec_queue_is_parallel(q)) {
> > > -		int i;
> > > -
> > > -		for (i = 0; i < q->width; ++i) {
> > > -			if (xe_lrc_seqno(q->lrc[i]) !=
> > > -			    q->lrc[i]->fence_ctx.next_seqno - 1)
> > > -				return false;
> > > -		}
> > > -
> > > -		return true;
> > > -	}
> > > -
> > > -	return xe_lrc_seqno(q->lrc[0]) ==
> > > -		q->lrc[0]->fence_ctx.next_seqno - 1;
> > > +	return !atomic_read(&q->job_cnt);
> >
> > Still looking through the series, so might be handled elsewhere, but
> > just looking at this patch alone, I'm a little worried this will cause
> > unexpected issues in the exec queue cleanup. This function currently
> > ensures that the job is idle from the hardware level. The change you
>
> The current check is actually incorrect if, for example, a queue is
> reset and the LRC head != tail. However, I believe the only places we
> use xe_exec_queue_is_idle are cases where a queue hasn't been reset, so
> it happens to work in practice. It's also just an advisory check, so
> nothing bad happens if it incorrectly reports "not idle".
So, reset case aside (which, not taking into account anything you said
below :), I'd consider a bug here), it does give a false sense of the
hardware actually being idle IMO, and that might get extended out to
other areas in the future without anyone realizing. I agree that the
current use cases match what you said.
> >
> > make here moves that to a software level check. And this is getting
> > decremented and checked before we tear down the exec queue. So
> > presumably, GuC and the command streamer could still be doing something
> > here and we're falsely telling other parts of the driver that rely on
> > the engine to really be idle to trust us here.
> >
>
> See above for part of the explanation, but the other part involves
> reference counting and fence signaling. A job can only have its last
> reference dropped when its fence is signaled.
>
> A fence can only signal under the following conditions:
>
> - Its seqno is incremented via ring instructions (which corresponds to
>   the LRC head == tail if it's the last job on the queue).
Right, so technically I guess we could have a hardware hang after the
sequence number was written, since that isn't the last instruction
there, but that seems very unlikely. And if we did hit that case, the
reset handler would cover it.

Maybe this should be obvious... but just so I'm not missing something
here:

So I think the signaling we're talking about here is via the
MI_USER_INTERRUPT in:

xe_hw_engine_handle_irq -> xe_hw_fence_irq_run

And the dependency you're talking about is here (xe_exec, although I
know there are a few in xe_migrate, xe_pt, etc.)?
	/* Wait behind rebinds */
	if (!xe_vm_in_lr_mode(vm)) {
		err = xe_sched_job_add_deps(job,
					    xe_vm_resv(vm),
					    DMA_RESV_USAGE_KERNEL);
		if (err)
			goto err_put_job;
	}
What is the expectation for LR jobs?
Thanks,
Stuart
> - We time out jobs on the queue and signal their fences in software. We
>   only signal fences in software once the queue has been kicked off the
>   hardware (i.e., a scheduling-disable H2G triggers a G2H response).
>
> > For reference, I'm looking at xe_sched_job_destroy() where we do the
> > decrement and then the exec queue put.
> >
> > So my question is, how are we guaranteeing that hardware is indeed
> > idle after this change? Are we moving the sequence number check
> > somewhere else?
> >
>
> I think above explains this.
>
> Matt
>
> > Thanks,
> > Stuart
> >
> > > >
> > >  }
> > >
> > >  /**
> >
Thread overview: 63+ messages in thread
2026-02-28 1:34 [PATCH v3 00/25] CPU binds and ULLS on migration queue Matthew Brost
2026-02-28 1:34 ` [PATCH v3 01/25] drm/xe: Drop struct xe_migrate_pt_update argument from populate/clear vfuns Matthew Brost
2026-03-05 14:17 ` Francois Dugast
2026-02-28 1:34 ` [PATCH v3 02/25] drm/xe: Add xe_migrate_update_pgtables_cpu_execute helper Matthew Brost
2026-03-05 14:39 ` Francois Dugast
2026-02-28 1:34 ` [PATCH v3 03/25] drm/xe: Decouple exec queue idle check from LRC Matthew Brost
2026-03-02 20:50 ` Summers, Stuart
2026-03-02 21:02 ` Matthew Brost
2026-03-03 21:26 ` Summers, Stuart [this message]
2026-03-03 22:42 ` Matthew Brost
2026-03-03 22:54 ` Summers, Stuart
2026-02-28 1:34 ` [PATCH v3 04/25] drm/xe: Add job count to GuC exec queue snapshot Matthew Brost
2026-03-02 20:50 ` Summers, Stuart
2026-02-28 1:34 ` [PATCH v3 05/25] drm/xe: Update xe_bo_put_deferred arguments to include writeback flag Matthew Brost
2026-04-01 12:20 ` Francois Dugast
2026-04-01 22:39 ` Matthew Brost
2026-02-28 1:34 ` [PATCH v3 06/25] drm/xe: Add XE_BO_FLAG_PUT_VM_ASYNC Matthew Brost
2026-04-01 12:22 ` Francois Dugast
2026-04-01 22:38 ` Matthew Brost
2026-02-28 1:34 ` [PATCH v3 07/25] drm/xe: Update scheduler job layer to support PT jobs Matthew Brost
2026-03-03 22:50 ` Summers, Stuart
2026-03-03 23:00 ` Matthew Brost
2026-02-28 1:34 ` [PATCH v3 08/25] drm/xe: Add helpers to access PT ops Matthew Brost
2026-04-07 15:22 ` Francois Dugast
2026-02-28 1:34 ` [PATCH v3 09/25] drm/xe: Add struct xe_pt_job_ops Matthew Brost
2026-03-03 23:26 ` Summers, Stuart
2026-03-03 23:28 ` Matthew Brost
2026-02-28 1:34 ` [PATCH v3 10/25] drm/xe: Update GuC submission backend to run PT jobs Matthew Brost
2026-03-03 23:28 ` Summers, Stuart
2026-03-04 0:26 ` Matthew Brost
2026-03-04 20:43 ` Summers, Stuart
2026-03-04 21:53 ` Matthew Brost
2026-03-05 20:24 ` Summers, Stuart
2026-02-28 1:34 ` [PATCH v3 11/25] drm/xe: Store level in struct xe_vm_pgtable_update Matthew Brost
2026-03-03 23:44 ` Summers, Stuart
2026-02-28 1:34 ` [PATCH v3 12/25] drm/xe: Don't use migrate exec queue for page fault binds Matthew Brost
2026-02-28 1:34 ` [PATCH v3 13/25] drm/xe: Enable CPU binds for jobs Matthew Brost
2026-02-28 1:34 ` [PATCH v3 14/25] drm/xe: Remove unused arguments from xe_migrate_pt_update_ops Matthew Brost
2026-02-28 1:34 ` [PATCH v3 15/25] drm/xe: Make bind queues operate cross-tile Matthew Brost
2026-02-28 1:34 ` [PATCH v3 16/25] drm/xe: Add CPU bind layer Matthew Brost
2026-02-28 1:34 ` [PATCH v3 17/25] drm/xe: Add device flag to enable PT mirroring across tiles Matthew Brost
2026-02-28 1:34 ` [PATCH v3 18/25] drm/xe: Add xe_hw_engine_write_ring_tail Matthew Brost
2026-02-28 1:34 ` [PATCH v3 19/25] drm/xe: Add ULLS support to LRC Matthew Brost
2026-03-05 20:21 ` Francois Dugast
2026-02-28 1:34 ` [PATCH v3 20/25] drm/xe: Add ULLS migration job support to migration layer Matthew Brost
2026-03-05 23:34 ` Summers, Stuart
2026-03-09 23:11 ` Matthew Brost
2026-02-28 1:34 ` [PATCH v3 21/25] drm/xe: Add MI_SEMAPHORE_WAIT instruction defs Matthew Brost
2026-02-28 1:34 ` [PATCH v3 22/25] drm/xe: Add ULLS migration job support to ring ops Matthew Brost
2026-02-28 1:34 ` [PATCH v3 23/25] drm/xe: Add ULLS migration job support to GuC submission Matthew Brost
2026-02-28 1:35 ` [PATCH v3 24/25] drm/xe: Enter ULLS for migration jobs upon page fault or SVM prefetch Matthew Brost
2026-02-28 1:35 ` [PATCH v3 25/25] drm/xe: Add modparam to enable / disable ULLS on migrate queue Matthew Brost
2026-03-05 22:59 ` Summers, Stuart
2026-04-01 22:44 ` Matthew Brost
2026-02-28 1:43 ` ✗ CI.checkpatch: warning for CPU binds and ULLS on migration queue (rev3) Patchwork
2026-02-28 1:44 ` ✓ CI.KUnit: success " Patchwork
2026-02-28 2:32 ` ✓ Xe.CI.BAT: " Patchwork
2026-02-28 13:59 ` ✗ Xe.CI.FULL: failure " Patchwork
2026-03-02 17:54 ` Summers, Stuart
2026-03-02 18:13 ` Matthew Brost
2026-03-05 22:56 ` [PATCH v3 00/25] CPU binds and ULLS on migration queue Summers, Stuart
2026-03-10 22:17 ` Matthew Brost
2026-03-20 15:31 ` Thomas Hellström