Intel-XE Archive on lore.kernel.org
From: Matthew Brost <matthew.brost@intel.com>
To: Tvrtko Ursulin <tursulin@ursulin.net>
Cc: <intel-xe@lists.freedesktop.org>, <carlos.santa@intel.com>,
	<thomas.hellstrom@linux.intel.com>,
	Philipp Stanner <phasta@kernel.org>
Subject: Re: [PATCH 1/1] drm/xe: Avoid serializing unbind jobs on prior TLB invalidations
Date: Wed, 22 Oct 2025 08:10:28 -0700	[thread overview]
Message-ID: <aPjz5BHrNbUBUk8L@lstrano-desk.jf.intel.com> (raw)
In-Reply-To: <f7f83835-adaa-426a-94a5-21ad3bb240ff@ursulin.net>

On Wed, Oct 22, 2025 at 09:00:47AM +0100, Tvrtko Ursulin wrote:
> 
> On 17/10/2025 17:52, Matthew Brost wrote:
> > When a burst of unbind jobs is issued, a dependency chain can form
> > between the TLB invalidation of a previous unbind job and the current
> > one. This leads to undesirable serialization: current jobs wait
> > unnecessarily for prior TLB invalidations and execute on the GPU when
> > not needed, significantly slowing the unbind burst (up to a 4×
> > slowdown).
> > 
> > To break this chain, mask the last bind queue dependency if the last
> > fence's DMA context matches the TLB invalidation context. This allows
> > full pipelining of unbinds and TLB invalidations while preserving
> > correct dma-fence signaling semantics.
> > 
> > Link: https://gitlab.freedesktop.org/drm/xe/kernel/-/issues/6047
> > Signed-off-by: Matthew Brost <matthew.brost@intel.com>
> > ---
> >   drivers/gpu/drm/xe/xe_exec.c          |  3 +-
> >   drivers/gpu/drm/xe/xe_exec_queue.c    | 18 +++++++++--
> >   drivers/gpu/drm/xe/xe_exec_queue.h    |  3 +-
> >   drivers/gpu/drm/xe/xe_pt.c            | 15 +++++++--
> >   drivers/gpu/drm/xe/xe_sched_job.c     | 44 ++++++++++++++++++++++++++-
> >   drivers/gpu/drm/xe/xe_sched_job.h     |  7 ++++-
> >   drivers/gpu/drm/xe/xe_tlb_inval_job.c | 14 +++++++++
> >   drivers/gpu/drm/xe/xe_tlb_inval_job.h |  2 ++
> >   8 files changed, 98 insertions(+), 8 deletions(-)
> > 
> > diff --git a/drivers/gpu/drm/xe/xe_exec.c b/drivers/gpu/drm/xe/xe_exec.c
> > index 0dc27476832b..6034cfc8be06 100644
> > --- a/drivers/gpu/drm/xe/xe_exec.c
> > +++ b/drivers/gpu/drm/xe/xe_exec.c
> > @@ -294,7 +294,8 @@ int xe_exec_ioctl(struct drm_device *dev, void *data, struct drm_file *file)
> >   		goto err_put_job;
> >   	if (!xe_vm_in_lr_mode(vm)) {
> > -		err = xe_sched_job_last_fence_add_dep(job, vm);
> > +		err = xe_sched_job_last_fence_add_dep(job, vm, NO_MASK_DEP,
> > +						      NO_MASK_DEP);
> >   		if (err)
> >   			goto err_put_job;
> > diff --git a/drivers/gpu/drm/xe/xe_exec_queue.c b/drivers/gpu/drm/xe/xe_exec_queue.c
> > index 90cbc95f8e2e..d6f69d9bccba 100644
> > --- a/drivers/gpu/drm/xe/xe_exec_queue.c
> > +++ b/drivers/gpu/drm/xe/xe_exec_queue.c
> > @@ -25,6 +25,7 @@
> >   #include "xe_migrate.h"
> >   #include "xe_pm.h"
> >   #include "xe_ring_ops_types.h"
> > +#include "xe_sched_job.h"
> >   #include "xe_trace.h"
> >   #include "xe_vm.h"
> >   #include "xe_pxp.h"
> > @@ -1106,11 +1107,17 @@ void xe_exec_queue_last_fence_set(struct xe_exec_queue *q, struct xe_vm *vm,
> >    * xe_exec_queue_last_fence_test_dep - Test last fence dependency of queue
> >    * @q: The exec queue
> >    * @vm: The VM the engine does a bind or exec for
> > + * @mask_ctx0: Mask dma-fence context0
> > + * @mask_ctx1: Mask dma-fence context1
> > + *
> > + * Test last fence dependency of queue, skipping masked dma fence contexts.
> >    *
> >    * Returns:
> > - * -ETIME if there exists an unsignalled last fence dependency, zero otherwise.
> > + * -ETIME if there exists an unsignalled and unmasked last fence dependency,
> > + * zero otherwise.
> >    */
> > -int xe_exec_queue_last_fence_test_dep(struct xe_exec_queue *q, struct xe_vm *vm)
> > +int xe_exec_queue_last_fence_test_dep(struct xe_exec_queue *q, struct xe_vm *vm,
> > +				      u64 mask_ctx0, u64 mask_ctx1)
> >   {
> >   	struct dma_fence *fence;
> >   	int err = 0;
> > @@ -1119,6 +1126,13 @@ int xe_exec_queue_last_fence_test_dep(struct xe_exec_queue *q, struct xe_vm *vm)
> >   	if (fence) {
> >   		err = test_bit(DMA_FENCE_FLAG_SIGNALED_BIT, &fence->flags) ?
> >   			0 : -ETIME;
> > +
> > +		if (err == -ETIME) {
> > +			if (xe_sched_job_mask_dependency(fence, mask_ctx0,
> > +							 mask_ctx1))
> > +				err = 0;
> > +		}
> > +
> >   		dma_fence_put(fence);
> >   	}
> > diff --git a/drivers/gpu/drm/xe/xe_exec_queue.h b/drivers/gpu/drm/xe/xe_exec_queue.h
> > index a4dfbe858bda..99a35b22a46c 100644
> > --- a/drivers/gpu/drm/xe/xe_exec_queue.h
> > +++ b/drivers/gpu/drm/xe/xe_exec_queue.h
> > @@ -85,7 +85,8 @@ struct dma_fence *xe_exec_queue_last_fence_get_for_resume(struct xe_exec_queue *
> >   void xe_exec_queue_last_fence_set(struct xe_exec_queue *e, struct xe_vm *vm,
> >   				  struct dma_fence *fence);
> >   int xe_exec_queue_last_fence_test_dep(struct xe_exec_queue *q,
> > -				      struct xe_vm *vm);
> > +				      struct xe_vm *vm, u64 mask_ctx0,
> > +				      u64 mask_ctx1);
> >   void xe_exec_queue_update_run_ticks(struct xe_exec_queue *q);
> >   int xe_exec_queue_contexts_hwsp_rebase(struct xe_exec_queue *q, void *scratch);
> > diff --git a/drivers/gpu/drm/xe/xe_pt.c b/drivers/gpu/drm/xe/xe_pt.c
> > index d22fd1ccc0ba..bba9ae559f57 100644
> > --- a/drivers/gpu/drm/xe/xe_pt.c
> > +++ b/drivers/gpu/drm/xe/xe_pt.c
> > @@ -1341,10 +1341,21 @@ static int xe_pt_vm_dependencies(struct xe_sched_job *job,
> >   	}
> >   	if (!(pt_update_ops->q->flags & EXEC_QUEUE_FLAG_KERNEL)) {
> > +		u64 mask_ctx0 = NO_MASK_DEP, mask_ctx1 = NO_MASK_DEP;
> > +
> > +		if (ijob)
> > +			mask_ctx0 = xe_tlb_inval_job_fence_context(ijob);
> > +		if (mjob)
> > +			mask_ctx1 = xe_tlb_inval_job_fence_context(mjob);
> > +
> >   		if (job)
> > -			err = xe_sched_job_last_fence_add_dep(job, vm);
> > +			err = xe_sched_job_last_fence_add_dep(job, vm,
> > +							      mask_ctx0,
> > +							      mask_ctx1);
> >   		else
> > -			err = xe_exec_queue_last_fence_test_dep(pt_update_ops->q, vm);
> > +			err = xe_exec_queue_last_fence_test_dep(pt_update_ops->q,
> > +								vm, mask_ctx0,
> > +								mask_ctx1);
> >   	}
> >   	for (i = 0; job && !err && i < vops->num_syncs; i++)
> > diff --git a/drivers/gpu/drm/xe/xe_sched_job.c b/drivers/gpu/drm/xe/xe_sched_job.c
> > index d21bf8f26964..7cbdd87904c6 100644
> > --- a/drivers/gpu/drm/xe/xe_sched_job.c
> > +++ b/drivers/gpu/drm/xe/xe_sched_job.c
> > @@ -6,6 +6,7 @@
> >   #include "xe_sched_job.h"
> >   #include <uapi/drm/xe_drm.h>
> > +#include <linux/dma-fence-array.h>
> >   #include <linux/dma-fence-chain.h>
> >   #include <linux/slab.h>
> > @@ -295,19 +296,60 @@ void xe_sched_job_push(struct xe_sched_job *job)
> >   	xe_sched_job_put(job);
> >   }
> > +/**
> > + * xe_sched_job_mask_dependency() - Determine if a dma-fence dependency can be masked
> > + * @fence: The dma-fence to check
> > + * @mask_ctx0: First context to compare against the fence's context
> > + * @mask_ctx1: Second context to compare against the fence's context
> > + *
> > + * This function checks whether the context of the given dma-fence matches
> > + * either of the provided mask contexts. If a match is found, the dependency
> > + * represented by the fence can be skipped. If the fence is a dma-fence-array,
> > + * its individual fences are unwound and checked.
> > + *
> > + * Return: true if the fence can be masked (i.e., skipped), false otherwise.
> > + */
> > +bool xe_sched_job_mask_dependency(struct dma_fence *fence, u64 mask_ctx0,
> > +				  u64 mask_ctx1)
> > +{
> > +	if (dma_fence_is_array(fence)) {
> > +		struct dma_fence *__fence;
> > +		int index;
> > +
> > +		dma_fence_array_for_each(__fence, index, fence)
> > +			if (__fence->context == mask_ctx0 ||
> > +			    __fence->context == mask_ctx1)
> > +				return true;
> > +	} else if (fence->context == mask_ctx0 ||
> > +		   fence->context == mask_ctx1) {
> > +		return true;
> > +	}
> > +
> > +	return false;
> > +}
> > +
> >   /**
> >    * xe_sched_job_last_fence_add_dep - Add last fence dependency to job
> >    * @job:job to add the last fence dependency to
> >    * @vm: virtual memory job belongs to
> > + * @mask_ctx0: Mask dma-fence context0
> > + * @mask_ctx1: Mask dma-fence context1
> > + *
> > + * Add last fence dependency to job, skipping masked dma fence contexts.
> >    *
> >    * Returns:
> >    * 0 on success, or an error on failing to expand the array.
> >    */
> > -int xe_sched_job_last_fence_add_dep(struct xe_sched_job *job, struct xe_vm *vm)
> > +int xe_sched_job_last_fence_add_dep(struct xe_sched_job *job, struct xe_vm *vm,
> > +				    u64 mask_ctx0, u64 mask_ctx1)
> >   {
> >   	struct dma_fence *fence;
> >   	fence = xe_exec_queue_last_fence_get(job->q, vm);
> > +	if (xe_sched_job_mask_dependency(fence, mask_ctx0, mask_ctx1)) {
> > +		dma_fence_put(fence);
> > +		return 0;
> > +	}
> >   	return drm_sched_job_add_dependency(&job->drm, fence);
> >   }
> > diff --git a/drivers/gpu/drm/xe/xe_sched_job.h b/drivers/gpu/drm/xe/xe_sched_job.h
> > index 3dc72c5c1f13..81d8e848e605 100644
> > --- a/drivers/gpu/drm/xe/xe_sched_job.h
> > +++ b/drivers/gpu/drm/xe/xe_sched_job.h
> > @@ -58,7 +58,8 @@ bool xe_sched_job_completed(struct xe_sched_job *job);
> >   void xe_sched_job_arm(struct xe_sched_job *job);
> >   void xe_sched_job_push(struct xe_sched_job *job);
> > -int xe_sched_job_last_fence_add_dep(struct xe_sched_job *job, struct xe_vm *vm);
> > +int xe_sched_job_last_fence_add_dep(struct xe_sched_job *job, struct xe_vm *vm,
> > +				    u64 mask_ctx0, u64 mask_ctx1);
> >   void xe_sched_job_init_user_fence(struct xe_sched_job *job,
> >   				  struct xe_sync_entry *sync);
> > @@ -93,4 +94,8 @@ void xe_sched_job_snapshot_print(struct xe_sched_job_snapshot *snapshot, struct
> >   int xe_sched_job_add_deps(struct xe_sched_job *job, struct dma_resv *resv,
> >   			  enum dma_resv_usage usage);
> > +#define NO_MASK_DEP	(~0x0ull)
> > +bool xe_sched_job_mask_dependency(struct dma_fence *fence, u64 mask_ctx0,
> > +				  u64 mask_ctx1);
> > +
> >   #endif
> > diff --git a/drivers/gpu/drm/xe/xe_tlb_inval_job.c b/drivers/gpu/drm/xe/xe_tlb_inval_job.c
> > index 492def04a559..f2fe7f9fbb22 100644
> > --- a/drivers/gpu/drm/xe/xe_tlb_inval_job.c
> > +++ b/drivers/gpu/drm/xe/xe_tlb_inval_job.c
> > @@ -32,6 +32,8 @@ struct xe_tlb_inval_job {
> >   	u64 start;
> >   	/** @end: End address to invalidate */
> >   	u64 end;
> > +	/** @fence_context: Fence context for job */
> > +	u64 fence_context;
> >   	/** @asid: Address space ID to invalidate */
> >   	u32 asid;
> >   	/** @fence_armed: Fence has been armed */
> > @@ -101,6 +103,7 @@ xe_tlb_inval_job_create(struct xe_exec_queue *q, struct xe_tlb_inval *tlb_inval,
> >   	job->asid = asid;
> >   	job->fence_armed = false;
> >   	job->dep.ops = &dep_job_ops;
> > +	job->fence_context = entity->fence_context + 1;
> 
> As a side note, hardcoding an assumption about how the scheduler
> allocates fence contexts is not great, given recent efforts to make
> drivers know less about scheduler internals.
> 

Yes, we should probably have a helper here — maybe
drm_sched_job_finished_context?

I was planning to roll this change into [1], but that series hasn’t
gained much traction, and fixing this is a fairly high-priority issue
for customers.

This is documented in the DRM scheduler kernel docs:
entity->fence_context + 1 is the job's finished context.
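
For illustration, a helper along those lines could be as small as the
sketch below; the name and placement are hypothetical, it just wraps
the documented entity->fence_context + 1 convention:

/*
 * Hypothetical helper, not in mainline: per the DRM scheduler docs,
 * entity->fence_context is the context of a job's scheduled fence and
 * entity->fence_context + 1 is the context of its finished fence.
 */
static inline u64
drm_sched_entity_finished_fence_context(struct drm_sched_entity *entity)
{
	return entity->fence_context + 1;
}

xe_tlb_inval_job_create() could then call this instead of open-coding
the + 1.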

[1] https://patchwork.freedesktop.org/series/155314/

> But what I really wanted to ask is, having only glanced at the patch
> briefly: could the xe performance problem here also be solved by
> unwrapping the container fences at the DRM scheduler dependency-tracking
> level?
> 

This is primarily about preventing TLB fences, which originate from a
different dma-fence context than the bind queue but are still ordered
on the queue, from becoming dependencies. The bind runs in two passes:
the first pass detects dependencies. If none are found, we complete
the bind immediately on the CPU; if dependencies are present, we defer
the bind to a GPU job.
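
Condensed from the xe_pt_vm_dependencies() hunk above, the detection
pass with this patch looks roughly like this (the comment is mine,
tying it to the flow just described):

u64 mask_ctx0 = NO_MASK_DEP, mask_ctx1 = NO_MASK_DEP;

if (ijob)
	mask_ctx0 = xe_tlb_inval_job_fence_context(ijob);
if (mjob)
	mask_ctx1 = xe_tlb_inval_job_fence_context(mjob);

/*
 * No job yet means we are in the detection pass: 0 here means no
 * unsignaled, unmasked dependency exists and the bind can complete
 * immediately on the CPU; -ETIME means it must be deferred to a
 * GPU job.
 */
err = xe_exec_queue_last_fence_test_dep(pt_update_ops->q, vm,
					mask_ctx0, mask_ctx1);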

> I am asking because amdgpu recently posted a patch to unwrap fences in
> their code for potentially similar performance reasons. If xe now wants
> something similar, or even the same, it is an interesting question
> where to do it.
> 
> Also, I have a patch (not sure if I have posted it yet) which unwraps
> in drm_sched_job_add_dependency() and converts the dependency xarray to
> an unwrapped dma-fence-array. The initial idea there was to allow the
> scheduler worker to be woken up only once, after all deps have
> signaled, but now that two drivers seem to be unwrapping fences, maybe
> there is a case to be made for doing it in the core.
> 

I don't think this is the same problem as the one above, but it's an
interesting idea in general. CC me if you post this one.
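
If it helps compare notes, unwrapping at the dependency-add level could
look something like the sketch below; the helper name is made up, and
note that drm_sched_job_add_dependency() consumes a fence reference,
hence the per-leaf get and the final put:

#include <linux/dma-fence-unwrap.h>

/* Hypothetical helper: add each leaf fence of a (possible) container
 * fence as a separate scheduler dependency.
 */
static int sched_job_add_dep_unwrapped(struct drm_sched_job *job,
				       struct dma_fence *fence)
{
	struct dma_fence_unwrap iter;
	struct dma_fence *f;
	int err = 0;

	dma_fence_unwrap_for_each(f, &iter, fence) {
		/* add_dependency consumes a reference per leaf */
		err = drm_sched_job_add_dependency(job, dma_fence_get(f));
		if (err)
			break;
	}

	dma_fence_put(fence);	/* drop the caller's container reference */
	return err;
}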

Matt

> Regards,
> 
> Tvrtko
> 
> >   	kref_init(&job->refcount);
> >   	xe_exec_queue_get(q);	/* Pairs with put in xe_tlb_inval_job_destroy */
> > @@ -266,3 +269,14 @@ void xe_tlb_inval_job_put(struct xe_tlb_inval_job *job)
> >   	if (!IS_ERR_OR_NULL(job))
> >   		kref_put(&job->refcount, xe_tlb_inval_job_destroy);
> >   }
> > +
> > +/**
> > + * xe_tlb_inval_job_fence_context() - TLB invalidation job fence context
> > + * @job: TLB invalidation job object
> > + *
> > + * Return: TLB invalidation job fence context
> > + */
> > +u64 xe_tlb_inval_job_fence_context(struct xe_tlb_inval_job *job)
> > +{
> > +	return job->fence_context;
> > +}
> > diff --git a/drivers/gpu/drm/xe/xe_tlb_inval_job.h b/drivers/gpu/drm/xe/xe_tlb_inval_job.h
> > index e63edcb26b50..2576165c2228 100644
> > --- a/drivers/gpu/drm/xe/xe_tlb_inval_job.h
> > +++ b/drivers/gpu/drm/xe/xe_tlb_inval_job.h
> > @@ -30,4 +30,6 @@ void xe_tlb_inval_job_get(struct xe_tlb_inval_job *job);
> >   void xe_tlb_inval_job_put(struct xe_tlb_inval_job *job);
> > +u64 xe_tlb_inval_job_fence_context(struct xe_tlb_inval_job *job);
> > +
> >   #endif
> 

Thread overview: 15+ messages
2025-10-17 16:52 [PATCH 0/1] Fix serialization on burst of unbinds Matthew Brost
2025-10-17 16:52 ` [PATCH 1/1] drm/xe: Avoid serializing unbind jobs on prior TLB invalidations Matthew Brost
2025-10-21 17:55   ` Summers, Stuart
2025-10-21 20:36     ` Matthew Brost
2025-10-21 20:43       ` Summers, Stuart
2025-10-21 20:50         ` Matthew Brost
2025-10-22  8:00   ` Tvrtko Ursulin
2025-10-22 15:10     ` Matthew Brost [this message]
2025-10-23 12:46       ` Tvrtko Ursulin
2025-10-23 18:55         ` Matthew Brost
2025-10-23 19:27           ` Matthew Brost
2025-10-23 12:28   ` Thomas Hellström
2025-10-17 18:36 ` ✓ CI.KUnit: success for Fix serialization on burst of unbinds Patchwork
2025-10-17 19:16 ` ✓ Xe.CI.BAT: " Patchwork
2025-10-18 18:20 ` ✗ Xe.CI.Full: failure " Patchwork
