From: "Thomas Hellström" <thomas.hellstrom@linux.intel.com>
To: Matthew Brost <matthew.brost@intel.com>, intel-xe@lists.freedesktop.org
Cc: carlos.santa@intel.com
Subject: Re: [PATCH 1/1] drm/xe: Avoid serializing unbind jobs on prior TLB invalidations
Date: Thu, 23 Oct 2025 14:28:05 +0200
Message-ID: <46098af5c1d4d98f0e498b2c6968c7a04caacefe.camel@linux.intel.com>
In-Reply-To: <20251017165217.493595-2-matthew.brost@intel.com>
Hi Matt,
On Fri, 2025-10-17 at 09:52 -0700, Matthew Brost wrote:
> When a burst of unbind jobs is issued, a dependency chain can form
> between the TLB invalidation of a previous unbind job and the current
> one. This leads to undesirable serialization, causing current jobs to
> wait unnecessarily for prior TLB invalidations, execute on the GPU
> when
> not needed, and significantly slow down the unbind burst—resulting in
> up
> to a 4× slowdown.
>
> To break this chain, mask the last bind queue dependency if the last
> fence's DMA context matches the TLB invalidation context. This allows
> full pipelining of unbinds and TLB invalidations while preserving
> correct dma-fence signaling semantics.
>
> Link: https://gitlab.freedesktop.org/drm/xe/kernel/-/issues/6047
> Signed-off-by: Matthew Brost <matthew.brost@intel.com>
Some comments below.
Apart from that, it looks like this stems from us always combining the
exec_queue fence and the TLB fence into an out_fence, and then using
that as the exec_queue last fence.
But IMO the exec_queue last fence should always be a single gpu_job
fence (with the exception of when there are no bind jobs), so that we
store the gpu_job fence as the last fence, and *then* combine it with
any TLB fence as the out fence.
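Roughly the ordering I have in mind, in pseudocode (the helper names
here are made up, just to illustrate):

```
gpu_fence = run_bind_job(q)                  # the single gpu_job fence
exec_queue_last_fence_set(q, vm, gpu_fence)  # store only the job fence
out_fence = combine(gpu_fence, tlb_fence)    # array handed to the caller
```

The next unbind would then depend only on gpu_fence, so no masking of
TLB fence contexts would be needed to avoid the serialization.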
> ---
> drivers/gpu/drm/xe/xe_exec.c | 3 +-
> drivers/gpu/drm/xe/xe_exec_queue.c | 18 +++++++++--
> drivers/gpu/drm/xe/xe_exec_queue.h | 3 +-
> drivers/gpu/drm/xe/xe_pt.c | 15 +++++++--
> drivers/gpu/drm/xe/xe_sched_job.c | 44 ++++++++++++++++++++++++++-
> drivers/gpu/drm/xe/xe_sched_job.h | 7 ++++-
> drivers/gpu/drm/xe/xe_tlb_inval_job.c | 14 +++++++++
> drivers/gpu/drm/xe/xe_tlb_inval_job.h | 2 ++
> 8 files changed, 98 insertions(+), 8 deletions(-)
>
> diff --git a/drivers/gpu/drm/xe/xe_exec.c b/drivers/gpu/drm/xe/xe_exec.c
> index 0dc27476832b..6034cfc8be06 100644
> --- a/drivers/gpu/drm/xe/xe_exec.c
> +++ b/drivers/gpu/drm/xe/xe_exec.c
> @@ -294,7 +294,8 @@ int xe_exec_ioctl(struct drm_device *dev, void *data, struct drm_file *file)
> goto err_put_job;
>
> if (!xe_vm_in_lr_mode(vm)) {
> - err = xe_sched_job_last_fence_add_dep(job, vm);
> +		err = xe_sched_job_last_fence_add_dep(job, vm, NO_MASK_DEP,
> +						      NO_MASK_DEP);
> if (err)
> goto err_put_job;
>
> diff --git a/drivers/gpu/drm/xe/xe_exec_queue.c b/drivers/gpu/drm/xe/xe_exec_queue.c
> index 90cbc95f8e2e..d6f69d9bccba 100644
> --- a/drivers/gpu/drm/xe/xe_exec_queue.c
> +++ b/drivers/gpu/drm/xe/xe_exec_queue.c
> @@ -25,6 +25,7 @@
> #include "xe_migrate.h"
> #include "xe_pm.h"
> #include "xe_ring_ops_types.h"
> +#include "xe_sched_job.h"
> #include "xe_trace.h"
> #include "xe_vm.h"
> #include "xe_pxp.h"
> @@ -1106,11 +1107,17 @@ void xe_exec_queue_last_fence_set(struct xe_exec_queue *q, struct xe_vm *vm,
>   * xe_exec_queue_last_fence_test_dep - Test last fence dependency of queue
>   * @q: The exec queue
>   * @vm: The VM the engine does a bind or exec for
> + * @mask_ctx0: Mask dma-fence context0
> + * @mask_ctx1: Mask dma-fence context1
> + *
> + * Test last fence dependency of queue, skipping masked dma fence contexts.
>   *
>   * Returns:
> - * -ETIME if there exists an unsignalled last fence dependency, zero otherwise.
> + * -ETIME if there exists an unsignalled and unmasked last fence dependency,
> + * zero otherwise.
>   */
> -int xe_exec_queue_last_fence_test_dep(struct xe_exec_queue *q, struct xe_vm *vm)
> +int xe_exec_queue_last_fence_test_dep(struct xe_exec_queue *q, struct xe_vm *vm,
> +				      u64 mask_ctx0, u64 mask_ctx1)
>  {
>  	struct dma_fence *fence;
>  	int err = 0;
> @@ -1119,6 +1126,13 @@ int xe_exec_queue_last_fence_test_dep(struct xe_exec_queue *q, struct xe_vm *vm)
>  	if (fence) {
>  		err = test_bit(DMA_FENCE_FLAG_SIGNALED_BIT, &fence->flags) ?
>  			0 : -ETIME;
> +
> +		if (err == -ETIME) {
> +			if (xe_sched_job_mask_dependency(fence, mask_ctx0,
> +							 mask_ctx1))
> +				err = 0;
> +		}
> +
>  		dma_fence_put(fence);
>  	}
>
> diff --git a/drivers/gpu/drm/xe/xe_exec_queue.h b/drivers/gpu/drm/xe/xe_exec_queue.h
> index a4dfbe858bda..99a35b22a46c 100644
> --- a/drivers/gpu/drm/xe/xe_exec_queue.h
> +++ b/drivers/gpu/drm/xe/xe_exec_queue.h
> @@ -85,7 +85,8 @@ struct dma_fence *xe_exec_queue_last_fence_get_for_resume(struct xe_exec_queue *
>  void xe_exec_queue_last_fence_set(struct xe_exec_queue *e, struct xe_vm *vm,
>  				  struct dma_fence *fence);
>  int xe_exec_queue_last_fence_test_dep(struct xe_exec_queue *q,
> -				      struct xe_vm *vm);
> +				      struct xe_vm *vm, u64 mask_ctx0,
> +				      u64 mask_ctx1);
> void xe_exec_queue_update_run_ticks(struct xe_exec_queue *q);
>
> int xe_exec_queue_contexts_hwsp_rebase(struct xe_exec_queue *q, void
> *scratch);
> diff --git a/drivers/gpu/drm/xe/xe_pt.c b/drivers/gpu/drm/xe/xe_pt.c
> index d22fd1ccc0ba..bba9ae559f57 100644
> --- a/drivers/gpu/drm/xe/xe_pt.c
> +++ b/drivers/gpu/drm/xe/xe_pt.c
> @@ -1341,10 +1341,21 @@ static int xe_pt_vm_dependencies(struct xe_sched_job *job,
>  	}
> 
>  	if (!(pt_update_ops->q->flags & EXEC_QUEUE_FLAG_KERNEL)) {
> +		u64 mask_ctx0 = NO_MASK_DEP, mask_ctx1 = NO_MASK_DEP;
> +
> +		if (ijob)
> +			mask_ctx0 = xe_tlb_inval_job_fence_context(ijob);
> +		if (mjob)
> +			mask_ctx1 = xe_tlb_inval_job_fence_context(mjob);
> +
>  		if (job)
> -			err = xe_sched_job_last_fence_add_dep(job, vm);
> +			err = xe_sched_job_last_fence_add_dep(job, vm,
> +							      mask_ctx0,
> +							      mask_ctx1);
>  		else
> -			err = xe_exec_queue_last_fence_test_dep(pt_update_ops->q, vm);
> +			err = xe_exec_queue_last_fence_test_dep(pt_update_ops->q,
> +								vm, mask_ctx0,
> +								mask_ctx1);
>  	}
>
> for (i = 0; job && !err && i < vops->num_syncs; i++)
> diff --git a/drivers/gpu/drm/xe/xe_sched_job.c b/drivers/gpu/drm/xe/xe_sched_job.c
> index d21bf8f26964..7cbdd87904c6 100644
> --- a/drivers/gpu/drm/xe/xe_sched_job.c
> +++ b/drivers/gpu/drm/xe/xe_sched_job.c
> @@ -6,6 +6,7 @@
> #include "xe_sched_job.h"
>
> #include <uapi/drm/xe_drm.h>
> +#include <linux/dma-fence-array.h>
> #include <linux/dma-fence-chain.h>
> #include <linux/slab.h>
>
> @@ -295,19 +296,60 @@ void xe_sched_job_push(struct xe_sched_job *job)
>  	xe_sched_job_put(job);
>  }
> 
> +/**
> + * xe_sched_job_mask_dependency() - Determine if a dma-fence dependency can be masked
> + * @fence: The dma-fence to check
> + * @mask_ctx0: First context to compare against the fence's context
> + * @mask_ctx1: Second context to compare against the fence's context
> + *
> + * This function checks whether the context of the given dma-fence matches
> + * either of the provided mask contexts. If a match is found, the dependency
> + * represented by the fence can be skipped. If the fence is a dma-fence-array,
> + * its individual fences are unwound and checked.
> + *
> + * Return: true if the fence can be masked (i.e., skipped), false otherwise.
> + */
> +bool xe_sched_job_mask_dependency(struct dma_fence *fence, u64 mask_ctx0,
> +				  u64 mask_ctx1)
> +{
> +	if (dma_fence_is_array(fence)) {
> +		struct dma_fence *__fence;
> +		int index;
> +
> +		dma_fence_array_for_each(__fence, index, fence)
> +			if (__fence->context == mask_ctx0 ||
> +			    __fence->context == mask_ctx1)
> +				return true;
What if there are other fences in the array that don't match the
contexts? Don't we lose them?
On a different side-note, looking at the code it seems the last fence
could in theory be a dma_fence_array[tiles] of dma_fence_array[gts]. I
don't think we have such HW yet, but IIRC the dma_fence container rules
do not allow that kind of nesting.
Thanks,
Thomas
> +	} else if (fence->context == mask_ctx0 ||
> +		   fence->context == mask_ctx1) {
> +		return true;
> +	}
> +
> + return false;
> +}
> +
>  /**
>   * xe_sched_job_last_fence_add_dep - Add last fence dependency to job
>   * @job:job to add the last fence dependency to
>   * @vm: virtual memory job belongs to
> + * @mask_ctx0: Mask dma-fence context0
> + * @mask_ctx1: Mask dma-fence context1
> + *
> + * Add last fence dependency to job, skipping masked dma fence contexts.
>   *
>   * Returns:
>   * 0 on success, or an error on failing to expand the array.
>   */
> -int xe_sched_job_last_fence_add_dep(struct xe_sched_job *job, struct xe_vm *vm)
> +int xe_sched_job_last_fence_add_dep(struct xe_sched_job *job, struct xe_vm *vm,
> +				    u64 mask_ctx0, u64 mask_ctx1)
>  {
>  	struct dma_fence *fence;
> 
>  	fence = xe_exec_queue_last_fence_get(job->q, vm);
> +	if (xe_sched_job_mask_dependency(fence, mask_ctx0, mask_ctx1)) {
> +		dma_fence_put(fence);
> +		return 0;
> +	}
> 
>  	return drm_sched_job_add_dependency(&job->drm, fence);
>  }
> diff --git a/drivers/gpu/drm/xe/xe_sched_job.h b/drivers/gpu/drm/xe/xe_sched_job.h
> index 3dc72c5c1f13..81d8e848e605 100644
> --- a/drivers/gpu/drm/xe/xe_sched_job.h
> +++ b/drivers/gpu/drm/xe/xe_sched_job.h
> @@ -58,7 +58,8 @@ bool xe_sched_job_completed(struct xe_sched_job *job);
>  void xe_sched_job_arm(struct xe_sched_job *job);
>  void xe_sched_job_push(struct xe_sched_job *job);
> 
> -int xe_sched_job_last_fence_add_dep(struct xe_sched_job *job, struct xe_vm *vm);
> +int xe_sched_job_last_fence_add_dep(struct xe_sched_job *job, struct xe_vm *vm,
> +				    u64 mask_ctx0, u64 mask_ctx1);
>  void xe_sched_job_init_user_fence(struct xe_sched_job *job,
>  				  struct xe_sync_entry *sync);
> 
> @@ -93,4 +94,8 @@ void xe_sched_job_snapshot_print(struct xe_sched_job_snapshot *snapshot, struct
>  int xe_sched_job_add_deps(struct xe_sched_job *job, struct dma_resv *resv,
>  			  enum dma_resv_usage usage);
> 
> +#define NO_MASK_DEP (~0x0ull)
> +bool xe_sched_job_mask_dependency(struct dma_fence *fence, u64 mask_ctx0,
> +				  u64 mask_ctx1);
> +
>  #endif
> diff --git a/drivers/gpu/drm/xe/xe_tlb_inval_job.c b/drivers/gpu/drm/xe/xe_tlb_inval_job.c
> index 492def04a559..f2fe7f9fbb22 100644
> --- a/drivers/gpu/drm/xe/xe_tlb_inval_job.c
> +++ b/drivers/gpu/drm/xe/xe_tlb_inval_job.c
> @@ -32,6 +32,8 @@ struct xe_tlb_inval_job {
> u64 start;
> /** @end: End address to invalidate */
> u64 end;
> + /** @fence_context: Fence context for job */
> + u64 fence_context;
> /** @asid: Address space ID to invalidate */
> u32 asid;
> /** @fence_armed: Fence has been armed */
> @@ -101,6 +103,7 @@ xe_tlb_inval_job_create(struct xe_exec_queue *q, struct xe_tlb_inval *tlb_inval,
>  	job->asid = asid;
>  	job->fence_armed = false;
>  	job->dep.ops = &dep_job_ops;
> +	job->fence_context = entity->fence_context + 1;
>  	kref_init(&job->refcount);
>  	xe_exec_queue_get(q); /* Pairs with put in xe_tlb_inval_job_destroy */
> 
> @@ -266,3 +269,14 @@ void xe_tlb_inval_job_put(struct xe_tlb_inval_job *job)
>  	if (!IS_ERR_OR_NULL(job))
>  		kref_put(&job->refcount, xe_tlb_inval_job_destroy);
>  }
> +
> +/**
> + * xe_tlb_inval_job_fence_context() - TLB invalidation job fence context
> + * @job: TLB invalidation job object
> + *
> + * Return: TLB invalidation job fence context
> + */
> +u64 xe_tlb_inval_job_fence_context(struct xe_tlb_inval_job *job)
> +{
> +	return job->fence_context;
> +}
> diff --git a/drivers/gpu/drm/xe/xe_tlb_inval_job.h b/drivers/gpu/drm/xe/xe_tlb_inval_job.h
> index e63edcb26b50..2576165c2228 100644
> --- a/drivers/gpu/drm/xe/xe_tlb_inval_job.h
> +++ b/drivers/gpu/drm/xe/xe_tlb_inval_job.h
> @@ -30,4 +30,6 @@ void xe_tlb_inval_job_get(struct xe_tlb_inval_job *job);
>
> void xe_tlb_inval_job_put(struct xe_tlb_inval_job *job);
>
> +u64 xe_tlb_inval_job_fence_context(struct xe_tlb_inval_job *job);
> +
> #endif
Thread overview: 15+ messages
2025-10-17 16:52 [PATCH 0/1] Fix serialization on burst of unbinds Matthew Brost
2025-10-17 16:52 ` [PATCH 1/1] drm/xe: Avoid serializing unbind jobs on prior TLB invalidations Matthew Brost
2025-10-21 17:55 ` Summers, Stuart
2025-10-21 20:36 ` Matthew Brost
2025-10-21 20:43 ` Summers, Stuart
2025-10-21 20:50 ` Matthew Brost
2025-10-22 8:00 ` Tvrtko Ursulin
2025-10-22 15:10 ` Matthew Brost
2025-10-23 12:46 ` Tvrtko Ursulin
2025-10-23 18:55 ` Matthew Brost
2025-10-23 19:27 ` Matthew Brost
2025-10-23 12:28 ` Thomas Hellström [this message]
2025-10-17 18:36 ` ✓ CI.KUnit: success for Fix serialization on burst of unbinds Patchwork
2025-10-17 19:16 ` ✓ Xe.CI.BAT: " Patchwork
2025-10-18 18:20 ` ✗ Xe.CI.Full: failure " Patchwork