From: "Summers, Stuart" <stuart.summers@intel.com>
To: "Brost, Matthew" <matthew.brost@intel.com>
Cc: "intel-xe@lists.freedesktop.org" <intel-xe@lists.freedesktop.org>,
"Santa, Carlos" <carlos.santa@intel.com>,
"thomas.hellstrom@linux.intel.com"
<thomas.hellstrom@linux.intel.com>
Subject: Re: [PATCH 1/1] drm/xe: Avoid serializing unbind jobs on prior TLB invalidations
Date: Tue, 21 Oct 2025 20:43:07 +0000
Message-ID: <7e2afab209224282ab3a211972364bb5ec1fce44.camel@intel.com>
In-Reply-To: <aPfuxg77e8nwtoHI@lstrano-desk.jf.intel.com>
On Tue, 2025-10-21 at 13:36 -0700, Matthew Brost wrote:
> On Tue, Oct 21, 2025 at 11:55:56AM -0600, Summers, Stuart wrote:
> > On Fri, 2025-10-17 at 09:52 -0700, Matthew Brost wrote:
> > > When a burst of unbind jobs is issued, a dependency chain can form
> > > between the TLB invalidation of a previous unbind job and the
> > > current one. This leads to undesirable serialization, causing
> > > current jobs to wait unnecessarily for prior TLB invalidations,
> > > execute on the GPU when not needed, and significantly slow down the
> > > unbind burst - resulting in up to a 4x slowdown.
> > >
> > > To break this chain, mask the last bind queue dependency if the
> > > last
> > > fence's DMA context matches the TLB invalidation context. This
> > > allows
> > > full pipelining of unbinds and TLB invalidations while preserving
> > > correct dma-fence signaling semantics.
> > >
> > > Link: https://gitlab.freedesktop.org/drm/xe/kernel/-/issues/6047
> > > Signed-off-by: Matthew Brost <matthew.brost@intel.com>
> > > ---
> > > drivers/gpu/drm/xe/xe_exec.c | 3 +-
> > > drivers/gpu/drm/xe/xe_exec_queue.c | 18 +++++++++--
> > > drivers/gpu/drm/xe/xe_exec_queue.h | 3 +-
> > > drivers/gpu/drm/xe/xe_pt.c | 15 +++++++--
> > > drivers/gpu/drm/xe/xe_sched_job.c | 44 ++++++++++++++++++++++++++-
> > > drivers/gpu/drm/xe/xe_sched_job.h | 7 ++++-
> > > drivers/gpu/drm/xe/xe_tlb_inval_job.c | 14 +++++++++
> > > drivers/gpu/drm/xe/xe_tlb_inval_job.h | 2 ++
> > > 8 files changed, 98 insertions(+), 8 deletions(-)
> > >
> > > diff --git a/drivers/gpu/drm/xe/xe_exec.c b/drivers/gpu/drm/xe/xe_exec.c
> > > index 0dc27476832b..6034cfc8be06 100644
> > > --- a/drivers/gpu/drm/xe/xe_exec.c
> > > +++ b/drivers/gpu/drm/xe/xe_exec.c
> > > @@ -294,7 +294,8 @@ int xe_exec_ioctl(struct drm_device *dev, void *data, struct drm_file *file)
> > >  		goto err_put_job;
> > >  
> > >  	if (!xe_vm_in_lr_mode(vm)) {
> > > -		err = xe_sched_job_last_fence_add_dep(job, vm);
> > > +		err = xe_sched_job_last_fence_add_dep(job, vm, NO_MASK_DEP,
> > > +						      NO_MASK_DEP);
> > >  		if (err)
> > >  			goto err_put_job;
> > >
> > > diff --git a/drivers/gpu/drm/xe/xe_exec_queue.c b/drivers/gpu/drm/xe/xe_exec_queue.c
> > > index 90cbc95f8e2e..d6f69d9bccba 100644
> > > --- a/drivers/gpu/drm/xe/xe_exec_queue.c
> > > +++ b/drivers/gpu/drm/xe/xe_exec_queue.c
> > > @@ -25,6 +25,7 @@
> > > #include "xe_migrate.h"
> > > #include "xe_pm.h"
> > > #include "xe_ring_ops_types.h"
> > > +#include "xe_sched_job.h"
> > > #include "xe_trace.h"
> > > #include "xe_vm.h"
> > > #include "xe_pxp.h"
> > > @@ -1106,11 +1107,17 @@ void xe_exec_queue_last_fence_set(struct xe_exec_queue *q, struct xe_vm *vm,
> > >   * xe_exec_queue_last_fence_test_dep - Test last fence dependency of queue
> > >   * @q: The exec queue
> > >   * @vm: The VM the engine does a bind or exec for
> > > + * @mask_ctx0: Mask dma-fence context0
> > > + * @mask_ctx1: Mask dma-fence context1
> > > + *
> > > + * Test last fence dependency of queue, skipping masked dma fence contexts.
> > >   *
> > >   * Returns:
> > > - * -ETIME if there exists an unsignalled last fence dependency, zero otherwise.
> > > + * -ETIME if there exists an unsignalled and unmasked last fence dependency,
> > > + * zero otherwise.
> > >   */
> > > -int xe_exec_queue_last_fence_test_dep(struct xe_exec_queue *q, struct xe_vm *vm)
> > > +int xe_exec_queue_last_fence_test_dep(struct xe_exec_queue *q, struct xe_vm *vm,
> > > +				      u64 mask_ctx0, u64 mask_ctx1)
> > >  {
> > >  	struct dma_fence *fence;
> > >  	int err = 0;
> > > @@ -1119,6 +1126,13 @@ int xe_exec_queue_last_fence_test_dep(struct xe_exec_queue *q, struct xe_vm *vm)
> > >  	if (fence) {
> > >  		err = test_bit(DMA_FENCE_FLAG_SIGNALED_BIT, &fence->flags) ?
> > >  			0 : -ETIME;
> > > +
> > > +		if (err == -ETIME) {
> > > +			if (xe_sched_job_mask_dependency(fence, mask_ctx0,
> > > +							 mask_ctx1))
> > > +				err = 0;
> > > +		}
> > > +
> > >  		dma_fence_put(fence);
> > >  	}
> > >
> > > diff --git a/drivers/gpu/drm/xe/xe_exec_queue.h b/drivers/gpu/drm/xe/xe_exec_queue.h
> > > index a4dfbe858bda..99a35b22a46c 100644
> > > --- a/drivers/gpu/drm/xe/xe_exec_queue.h
> > > +++ b/drivers/gpu/drm/xe/xe_exec_queue.h
> > > @@ -85,7 +85,8 @@ struct dma_fence *xe_exec_queue_last_fence_get_for_resume(struct xe_exec_queue *
> > >  void xe_exec_queue_last_fence_set(struct xe_exec_queue *e, struct xe_vm *vm,
> > >  				  struct dma_fence *fence);
> > >  int xe_exec_queue_last_fence_test_dep(struct xe_exec_queue *q,
> > > -				      struct xe_vm *vm);
> > > +				      struct xe_vm *vm, u64 mask_ctx0,
> > > +				      u64 mask_ctx1);
> > >  void xe_exec_queue_update_run_ticks(struct xe_exec_queue *q);
> > >  
> > >  int xe_exec_queue_contexts_hwsp_rebase(struct xe_exec_queue *q, void *scratch);
> > > diff --git a/drivers/gpu/drm/xe/xe_pt.c b/drivers/gpu/drm/xe/xe_pt.c
> > > index d22fd1ccc0ba..bba9ae559f57 100644
> > > --- a/drivers/gpu/drm/xe/xe_pt.c
> > > +++ b/drivers/gpu/drm/xe/xe_pt.c
> > > @@ -1341,10 +1341,21 @@ static int xe_pt_vm_dependencies(struct xe_sched_job *job,
> > >  	}
> > >  
> > >  	if (!(pt_update_ops->q->flags & EXEC_QUEUE_FLAG_KERNEL)) {
> > > +		u64 mask_ctx0 = NO_MASK_DEP, mask_ctx1 = NO_MASK_DEP;
> > > +
> > > +		if (ijob)
> > > +			mask_ctx0 = xe_tlb_inval_job_fence_context(ijob);
> > > +		if (mjob)
> > > +			mask_ctx1 = xe_tlb_inval_job_fence_context(mjob);
> >
> > Can we rename these ictx and mctx for consistency?
> >
>
> Yes.
>
> > Also, do we really need both of these here? Shouldn't we always
> > have
> > the primary GT inval (ictx) and so just need to check the one? My
>
> As the code is written, we'd only need to check the primary GT, but
> Matt R eventually wants the driver to be able to boot without the
> primary GT. There is a bit of work still to be done there, but I
> didn't want to make it worse in this patch.
Yeah, makes sense - I was just thinking of how to make this a little
simpler. I don't really like that we're looking at both contexts here
when we really only need one of them. But we're also doing this in
other parts of the driver (like the primary/media GT TLB invals this is
based on), so maybe it's no problem. It would just be good to at least
note this; otherwise, at a glance, it isn't super clear why we need two
contexts here.
I think at least having those name changes (ictx and mctx) would help
here.
>
> > understanding is the reason being that we might be adding either
> > one of
> > these as the last fence so we need both checks. But in that case
> > would
> > it be better to check against all dependencies or even just the
> > last
> > two? Wouldn't that also help if multiple apps are trying to free at
> > once here so we have interleaved unbind dependencies?
> >
>
> Depending on how the bind is set up, we may check further dependencies
> in dma-resv - see all the other checks in this function. This covers
> only the last-queue-dependency case, which is at least sufficient to
> help with the ChromeOS case, where this is triggered by a burst of
> user unbinds.
>
> We might still have an issue with a burst of SVM unbinds, where this
> can serialize; fixing that would likely need some DRM scheduler
> changes fairly similar to this patch. I can think about that in a
> follow-up in a later series. I should probably also switch SVM unbinds
> over to draining the entire garbage collector list and issuing a
> single unbind job.
Makes sense to me, but I'm also fine having that in a follow-up patch.
>
> > > +
> > >  		if (job)
> > > -			err = xe_sched_job_last_fence_add_dep(job, vm);
> > > +			err = xe_sched_job_last_fence_add_dep(job, vm,
> > > +							      mask_ctx0,
> > > +							      mask_ctx1);
> > >  		else
> > > -			err = xe_exec_queue_last_fence_test_dep(pt_update_ops->q, vm);
> > > +			err = xe_exec_queue_last_fence_test_dep(pt_update_ops->q,
> > > +								vm, mask_ctx0,
> > > +								mask_ctx1);
> > >  	}
> > >  
> > >  	for (i = 0; job && !err && i < vops->num_syncs; i++)
> > > diff --git a/drivers/gpu/drm/xe/xe_sched_job.c b/drivers/gpu/drm/xe/xe_sched_job.c
> > > index d21bf8f26964..7cbdd87904c6 100644
> > > --- a/drivers/gpu/drm/xe/xe_sched_job.c
> > > +++ b/drivers/gpu/drm/xe/xe_sched_job.c
> > > @@ -6,6 +6,7 @@
> > >  #include "xe_sched_job.h"
> > >  
> > >  #include <uapi/drm/xe_drm.h>
> > > +#include <linux/dma-fence-array.h>
> > >  #include <linux/dma-fence-chain.h>
> > >  #include <linux/slab.h>
> > >  
> > > @@ -295,19 +296,60 @@ void xe_sched_job_push(struct xe_sched_job *job)
> > >  	xe_sched_job_put(job);
> > >  }
> > >
> > > +/**
> > > + * xe_sched_job_mask_dependency() - Determine if a dma-fence dependency can be masked
> > > + * @fence: The dma-fence to check
> > > + * @mask_ctx0: First context to compare against the fence's context
> > > + * @mask_ctx1: Second context to compare against the fence's context
> > > + *
> > > + * This function checks whether the context of the given dma-fence matches
> > > + * either of the provided mask contexts. If a match is found, the dependency
> > > + * represented by the fence can be skipped. If the fence is a dma-fence-array,
> > > + * its individual fences are unwound and checked.
> > > + *
> > > + * Return: true if the fence can be masked (i.e., skipped), false otherwise.
> > > + */
> > > +bool xe_sched_job_mask_dependency(struct dma_fence *fence, u64 mask_ctx0,
> > > +				  u64 mask_ctx1)
> > > +{
> > > +	if (dma_fence_is_array(fence)) {
> > > +		struct dma_fence *__fence;
> > > +		int index;
> > > +
> > > +		dma_fence_array_for_each(__fence, index, fence)
> > > +			if (__fence->context == mask_ctx0 ||
> > > +			    __fence->context == mask_ctx1)
> > > +				return true;
> > > +	} else if (fence->context == mask_ctx0 ||
> > > +		   fence->context == mask_ctx1) {
> > > +		return true;
> > > +	}
> > > +
> > > +	return false;
> > > +}
> > > +
> > >  /**
> > >   * xe_sched_job_last_fence_add_dep - Add last fence dependency to job
> > >   * @job: job to add the last fence dependency to
> > >   * @vm: virtual memory job belongs to
> > > + * @mask_ctx0: Mask dma-fence context0
> > > + * @mask_ctx1: Mask dma-fence context1
> > > + *
> > > + * Add last fence dependency to job, skipping masked dma fence contexts.
> > >   *
> > >   * Returns:
> > >   * 0 on success, or an error on failing to expand the array.
> > >   */
> > > -int xe_sched_job_last_fence_add_dep(struct xe_sched_job *job, struct xe_vm *vm)
> > > +int xe_sched_job_last_fence_add_dep(struct xe_sched_job *job, struct xe_vm *vm,
> > > +				    u64 mask_ctx0, u64 mask_ctx1)
> > >  {
> > >  	struct dma_fence *fence;
> > >  
> > >  	fence = xe_exec_queue_last_fence_get(job->q, vm);
> > > +	if (xe_sched_job_mask_dependency(fence, mask_ctx0, mask_ctx1)) {
> > > +		dma_fence_put(fence);
> > > +		return 0;
> > > +	}
> > >  
> > >  	return drm_sched_job_add_dependency(&job->drm, fence);
> > >  }
> > > diff --git a/drivers/gpu/drm/xe/xe_sched_job.h b/drivers/gpu/drm/xe/xe_sched_job.h
> > > index 3dc72c5c1f13..81d8e848e605 100644
> > > --- a/drivers/gpu/drm/xe/xe_sched_job.h
> > > +++ b/drivers/gpu/drm/xe/xe_sched_job.h
> > > @@ -58,7 +58,8 @@ bool xe_sched_job_completed(struct xe_sched_job *job);
> > >  void xe_sched_job_arm(struct xe_sched_job *job);
> > >  void xe_sched_job_push(struct xe_sched_job *job);
> > >  
> > > -int xe_sched_job_last_fence_add_dep(struct xe_sched_job *job, struct xe_vm *vm);
> > > +int xe_sched_job_last_fence_add_dep(struct xe_sched_job *job, struct xe_vm *vm,
> > > +				    u64 mask_ctx0, u64 mask_ctx1);
> > >  void xe_sched_job_init_user_fence(struct xe_sched_job *job,
> > >  				  struct xe_sync_entry *sync);
> > >
> > > @@ -93,4 +94,8 @@ void xe_sched_job_snapshot_print(struct xe_sched_job_snapshot *snapshot, struct
> > >  int xe_sched_job_add_deps(struct xe_sched_job *job, struct dma_resv *resv,
> > >  			  enum dma_resv_usage usage);
> > >  
> > > +#define NO_MASK_DEP (~0x0ull)
> > > +bool xe_sched_job_mask_dependency(struct dma_fence *fence, u64 mask_ctx0,
> > > +				  u64 mask_ctx1);
> > > +
> > > #endif
> > > diff --git a/drivers/gpu/drm/xe/xe_tlb_inval_job.c b/drivers/gpu/drm/xe/xe_tlb_inval_job.c
> > > index 492def04a559..f2fe7f9fbb22 100644
> > > --- a/drivers/gpu/drm/xe/xe_tlb_inval_job.c
> > > +++ b/drivers/gpu/drm/xe/xe_tlb_inval_job.c
> > > @@ -32,6 +32,8 @@ struct xe_tlb_inval_job {
> > >  	u64 start;
> > >  	/** @end: End address to invalidate */
> > >  	u64 end;
> > > +	/** @fence_context: Fence context for job */
> > > +	u64 fence_context;
> > >  	/** @asid: Address space ID to invalidate */
> > >  	u32 asid;
> > >  	/** @fence_armed: Fence has been armed */
> > > @@ -101,6 +103,7 @@ xe_tlb_inval_job_create(struct xe_exec_queue *q, struct xe_tlb_inval *tlb_inval,
> > >  	job->asid = asid;
> > >  	job->fence_armed = false;
> > >  	job->dep.ops = &dep_job_ops;
> >
> > This means the "finished" context per the entity definition right?
> > Can
> > you either add a note here or change the job->fence_context name to
> > reflect that? Or otherwise why is this adding the +1 here?
> >
>
> The scheduled context is entity->fence_context and the finished
> context is entity->fence_context + 1 - this is in the DRM scheduler
> documentation. I can add a comment to this for now, and roll a better
> fix - a DRM scheduler helper to fish out the finished context - into
> this series [1]. DRM scheduler changes move slowly, so the latter may
> take a while, and I didn't want to block a fix on that.
Can you add some quick documentation there to that effect? It's just
nice not to have to go back and forth to the entity documentation,
which right now is only implied.
Also thanks for the link to that other series, I'll check that out too.
Thanks,
Stuart
>
> Matt
>
> [1] https://patchwork.freedesktop.org/series/155314/
>
> > Thanks,
> > Stuart
> >
> > > + job->fence_context = entity->fence_context + 1;
> > > kref_init(&job->refcount);
> > > xe_exec_queue_get(q); /* Pairs with put in
> > > xe_tlb_inval_job_destroy */
> > >
> > > @@ -266,3 +269,14 @@ void xe_tlb_inval_job_put(struct xe_tlb_inval_job *job)
> > >  	if (!IS_ERR_OR_NULL(job))
> > >  		kref_put(&job->refcount, xe_tlb_inval_job_destroy);
> > >  }
> > > +
> > > +/**
> > > + * xe_tlb_inval_job_fence_context() - TLB invalidation job fence context
> > > + * @job: TLB invalidation job object
> > > + *
> > > + * Return: TLB invalidation job fence context
> > > + */
> > > +u64 xe_tlb_inval_job_fence_context(struct xe_tlb_inval_job *job)
> > > +{
> > > +	return job->fence_context;
> > > +}
> > > diff --git a/drivers/gpu/drm/xe/xe_tlb_inval_job.h b/drivers/gpu/drm/xe/xe_tlb_inval_job.h
> > > index e63edcb26b50..2576165c2228 100644
> > > --- a/drivers/gpu/drm/xe/xe_tlb_inval_job.h
> > > +++ b/drivers/gpu/drm/xe/xe_tlb_inval_job.h
> > > @@ -30,4 +30,6 @@ void xe_tlb_inval_job_get(struct xe_tlb_inval_job *job);
> > >  
> > >  void xe_tlb_inval_job_put(struct xe_tlb_inval_job *job);
> > >  
> > > +u64 xe_tlb_inval_job_fence_context(struct xe_tlb_inval_job *job);
> > > +
> > >  #endif
> >
Thread overview: 15+ messages
2025-10-17 16:52 [PATCH 0/1] Fix serialization on burst of unbinds Matthew Brost
2025-10-17 16:52 ` [PATCH 1/1] drm/xe: Avoid serializing unbind jobs on prior TLB invalidations Matthew Brost
2025-10-21 17:55 ` Summers, Stuart
2025-10-21 20:36 ` Matthew Brost
2025-10-21 20:43 ` Summers, Stuart [this message]
2025-10-21 20:50 ` Matthew Brost
2025-10-22 8:00 ` Tvrtko Ursulin
2025-10-22 15:10 ` Matthew Brost
2025-10-23 12:46 ` Tvrtko Ursulin
2025-10-23 18:55 ` Matthew Brost
2025-10-23 19:27 ` Matthew Brost
2025-10-23 12:28 ` Thomas Hellström
2025-10-17 18:36 ` ✓ CI.KUnit: success for Fix serialization on burst of unbinds Patchwork
2025-10-17 19:16 ` ✓ Xe.CI.BAT: " Patchwork
2025-10-18 18:20 ` ✗ Xe.CI.Full: failure " Patchwork