[PATCH 00/15] CPU binds and ULLS on migration queue

From: Matthew Brost @ 2025-06-05 15:32 UTC
  To: intel-xe; +Cc: francois.dugast, thomas.hellstrom, himal.prasad.ghimiray

We now have data, generated with [1], to back up the need for CPU binds
and ULLS on the migration queue.

On BMG, when the GPU is continuously processing faults, copy jobs run
approximately 40–65µs faster (depending on the test case) with ULLS than
with traditional GuC submission with SLPC enabled on the migration queue
(not upstream, but the last patch in the series can enable this).
Without SLPC enabled (upstream), ULLS is approximately 100–200µs faster.
Startup from a cold GPU shows an even larger speedup. Given the critical
nature of fault performance, ULLS appears to be a worthwhile feature.

ULLS will consume more power (not yet measured) due to a continuously
running batch on the paging engine. However, compute UMDs already do
this on engines exposed to users. Again, this seems like a worthwhile
tradeoff.

CPU binds are required for ULLS to function, as the migration queue
needs exclusive access to the paging hardware engine, so CPU binds are
included here. Beyond being a requirement for ULLS, CPU binds should
also reduce VM bind latency and decouple kernel binds from unrelated
copy/clear jobs; this is especially beneficial when faults are serviced
in parallel. In a parallel faulting test case, average bind time was
reduced by approximately 15µs. In the worst case without CPU binds, a
single fault's bind could be delayed by the 2M copy time (~140µs)
multiplied by (number of page fault threads - 1).
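
To make the worst case above concrete, the following C sketch computes
the extra latency a single fault's bind could see. The helper name is
hypothetical, and the model simply assumes one queued 2M copy (~140µs)
ahead of the bind for each of the other faulting threads.

```c
#include <assert.h>

/*
 * Hypothetical model: if binds share a queue with copies, a bind may
 * wait behind one 2M copy from each of the other fault threads.
 */
static unsigned int worst_case_bind_delay_us(unsigned int copy_us,
					     unsigned int fault_threads)
{
	if (fault_threads == 0)
		return 0;
	return copy_us * (fault_threads - 1);
}
```

For example, with four fault threads and a ~140µs 2M copy, the model
gives 420µs of added latency, which CPU binds avoid entirely.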

This series could be merged in phases: first CPU binds, then ULLS on the
migration execution queue.

The last couple of patches in the series add modparams for quick
performance / power experiments.

Matt

[1] https://patchwork.freedesktop.org/series/149811/

Matthew Brost (15):
  drm/xe: Drop struct xe_migrate_pt_update argument from populate /
    clear vfuns
  drm/xe: Add __xe_migrate_update_pgtables_cpu helper
  drm/xe: CPU binds for jobs
  drm/xe: Remove unused arguments from xe_migrate_pt_update_ops
  drm/xe: Don't use migrate exec queue for page fault binds
  drm/xe: Add xe_hw_engine_write_ring_tail
  drm/xe: Add ULLS support to LRC
  drm/xe: Add ULLS migration job support to migration layer
  drm/xe: Add MI_SEMAPHORE_WAIT instruction defs
  drm/xe: Add ULLS migration job support to ring ops
  drm/xe: Add ULLS migration job support to GuC submission
  drm/xe: Enable ULLS migration jobs when opening LR VM
  drm/xe: Set slpc freq to max on ULLS jobs
  drm/xe: Add modparam to enable / disable ULLS on migrate queue
  drm/xe: Add modparam to enable / disable high SLPC on migrate queue

 .../gpu/drm/xe/instructions/xe_mi_commands.h  |   6 +
 drivers/gpu/drm/xe/xe_bo.c                    |   7 +-
 drivers/gpu/drm/xe/xe_bo.h                    |   9 +-
 drivers/gpu/drm/xe/xe_bo_types.h              |   2 -
 drivers/gpu/drm/xe/xe_debugfs.c               |   3 +
 drivers/gpu/drm/xe/xe_device.c                |   3 +
 drivers/gpu/drm/xe/xe_device_types.h          |  10 +
 drivers/gpu/drm/xe/xe_drm_client.c            |   3 +-
 drivers/gpu/drm/xe/xe_guc_submit.c            |  86 +++-
 drivers/gpu/drm/xe/xe_hw_engine.c             |  10 +
 drivers/gpu/drm/xe/xe_hw_engine.h             |   1 +
 drivers/gpu/drm/xe/xe_lrc.c                   |  51 +-
 drivers/gpu/drm/xe/xe_lrc.h                   |   3 +
 drivers/gpu/drm/xe/xe_lrc_types.h             |   2 +
 drivers/gpu/drm/xe/xe_migrate.c               | 442 ++++++++----------
 drivers/gpu/drm/xe/xe_migrate.h               |  27 +-
 drivers/gpu/drm/xe/xe_module.c                |   7 +
 drivers/gpu/drm/xe/xe_module.h                |   2 +
 drivers/gpu/drm/xe/xe_pt.c                    | 221 ++++++---
 drivers/gpu/drm/xe/xe_pt.h                    |   5 +-
 drivers/gpu/drm/xe/xe_pt_types.h              |  29 +-
 drivers/gpu/drm/xe/xe_ring_ops.c              |  30 ++
 drivers/gpu/drm/xe/xe_sched_job.c             |  78 ++--
 drivers/gpu/drm/xe/xe_sched_job_types.h       |  37 +-
 drivers/gpu/drm/xe/xe_vm.c                    |  73 +--
 25 files changed, 730 insertions(+), 417 deletions(-)

-- 
2.34.1

* [PATCH 01/15] drm/xe: Drop struct xe_migrate_pt_update argument from populate / clear vfuns
From: Matthew Brost @ 2025-06-05 15:32 UTC
  To: intel-xe; +Cc: francois.dugast, thomas.hellstrom, himal.prasad.ghimiray

struct xe_migrate_pt_update will not be available in run_job, where CPU
binds will be implemented. The argument is unused in populate, so drop
it there; in clear, which only needs the VM, replace it with a VM
argument.
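
A minimal user-space sketch of the resulting callback shape follows. It
uses simplified, hypothetical stand-ins (struct vm, struct
pt_update_entry, populate_cb, clear_cb) rather than the real xe types:
populate needs only the update entry, while clear needs just the VM to
derive the empty PTE value, so neither depends on a per-submission
context such as struct xe_migrate_pt_update.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical, simplified stand-ins for the xe types. */
struct vm {
	uint64_t empty_pte;	/* value written when clearing */
};

struct pt_update_entry {
	const uint64_t *ptes;	/* PTE values to write when binding */
};

struct pt_update_ops {
	/* populate: needs only the update entry itself */
	void (*populate)(uint64_t *pt, uint32_t ofs, uint32_t num_qwords,
			 const struct pt_update_entry *update);
	/* clear: needs only the VM, to derive the empty PTE value */
	void (*clear)(const struct vm *vm, uint64_t *pt, uint32_t ofs,
		      uint32_t num_qwords,
		      const struct pt_update_entry *update);
};

static void populate_cb(uint64_t *pt, uint32_t ofs, uint32_t num_qwords,
			const struct pt_update_entry *update)
{
	for (uint32_t i = 0; i < num_qwords; i++)
		pt[ofs + i] = update->ptes[i];
}

static void clear_cb(const struct vm *vm, uint64_t *pt, uint32_t ofs,
		     uint32_t num_qwords,
		     const struct pt_update_entry *update)
{
	(void)update;	/* unused when clearing */
	for (uint32_t i = 0; i < num_qwords; i++)
		pt[ofs + i] = vm->empty_pte;
}

static const struct pt_update_ops example_ops = {
	.populate = populate_cb,
	.clear = clear_cb,
};
```

Because the callbacks only need state that can be stored up front (the
VM and the update entries), a later caller such as a job's run function
can invoke them without the original submission context.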

Signed-off-by: Matthew Brost <matthew.brost@intel.com>
---
 drivers/gpu/drm/xe/xe_migrate.c |  9 +++++----
 drivers/gpu/drm/xe/xe_migrate.h | 12 +++++-------
 drivers/gpu/drm/xe/xe_pt.c      | 12 +++++-------
 3 files changed, 15 insertions(+), 18 deletions(-)

diff --git a/drivers/gpu/drm/xe/xe_migrate.c b/drivers/gpu/drm/xe/xe_migrate.c
index 8f8e9fdfb2a8..d8b50305df87 100644
--- a/drivers/gpu/drm/xe/xe_migrate.c
+++ b/drivers/gpu/drm/xe/xe_migrate.c
@@ -1199,6 +1199,7 @@ static void write_pgtable(struct xe_tile *tile, struct xe_bb *bb, u64 ppgtt_ofs,
 			  struct xe_migrate_pt_update *pt_update)
 {
 	const struct xe_migrate_pt_update_ops *ops = pt_update->ops;
+	struct xe_vm *vm = pt_update->vops->vm;
 	u32 chunk;
 	u32 ofs = update->ofs, size = update->qwords;
 
@@ -1230,10 +1231,10 @@ static void write_pgtable(struct xe_tile *tile, struct xe_bb *bb, u64 ppgtt_ofs,
 		bb->cs[bb->len++] = lower_32_bits(addr);
 		bb->cs[bb->len++] = upper_32_bits(addr);
 		if (pt_op->bind)
-			ops->populate(pt_update, tile, NULL, bb->cs + bb->len,
+			ops->populate(tile, NULL, bb->cs + bb->len,
 				      ofs, chunk, update);
 		else
-			ops->clear(pt_update, tile, NULL, bb->cs + bb->len,
+			ops->clear(vm, tile, NULL, bb->cs + bb->len,
 				   ofs, chunk, update);
 
 		bb->len += chunk * 2;
@@ -1290,12 +1291,12 @@ xe_migrate_update_pgtables_cpu(struct xe_migrate *m,
 				&pt_op->entries[j];
 
 			if (pt_op->bind)
-				ops->populate(pt_update, m->tile,
+				ops->populate(m->tile,
 					      &update->pt_bo->vmap, NULL,
 					      update->ofs, update->qwords,
 					      update);
 			else
-				ops->clear(pt_update, m->tile,
+				ops->clear(vm, m->tile,
 					   &update->pt_bo->vmap, NULL,
 					   update->ofs, update->qwords, update);
 		}
diff --git a/drivers/gpu/drm/xe/xe_migrate.h b/drivers/gpu/drm/xe/xe_migrate.h
index fb9839c1bae0..b064455b604e 100644
--- a/drivers/gpu/drm/xe/xe_migrate.h
+++ b/drivers/gpu/drm/xe/xe_migrate.h
@@ -31,7 +31,6 @@ struct xe_vma;
 struct xe_migrate_pt_update_ops {
 	/**
 	 * @populate: Populate a command buffer or page-table with ptes.
-	 * @pt_update: Embeddable callback argument.
 	 * @tile: The tile for the current operation.
 	 * @map: struct iosys_map into the memory to be populated.
 	 * @pos: If @map is NULL, map into the memory to be populated.
@@ -43,13 +42,12 @@ struct xe_migrate_pt_update_ops {
 	 * page-table system to populate command buffers or shared
 	 * page-tables with PTEs.
 	 */
-	void (*populate)(struct xe_migrate_pt_update *pt_update,
-			 struct xe_tile *tile, struct iosys_map *map,
+	void (*populate)(struct xe_tile *tile, struct iosys_map *map,
 			 void *pos, u32 ofs, u32 num_qwords,
 			 const struct xe_vm_pgtable_update *update);
 	/**
 	 * @clear: Clear a command buffer or page-table with ptes.
-	 * @pt_update: Embeddable callback argument.
+	 * @vm: VM being updated
 	 * @tile: The tile for the current operation.
 	 * @map: struct iosys_map into the memory to be populated.
 	 * @pos: If @map is NULL, map into the memory to be populated.
@@ -61,9 +59,9 @@ struct xe_migrate_pt_update_ops {
 	 * page-table system to populate command buffers or shared
 	 * page-tables with PTEs.
 	 */
-	void (*clear)(struct xe_migrate_pt_update *pt_update,
-		      struct xe_tile *tile, struct iosys_map *map,
-		      void *pos, u32 ofs, u32 num_qwords,
+	void (*clear)(struct xe_vm *vm, struct xe_tile *tile,
+		      struct iosys_map *map, void *pos, u32 ofs,
+		      u32 num_qwords,
 		      const struct xe_vm_pgtable_update *update);
 
 	/**
diff --git a/drivers/gpu/drm/xe/xe_pt.c b/drivers/gpu/drm/xe/xe_pt.c
index f39d5cc9f411..db1c363a65d5 100644
--- a/drivers/gpu/drm/xe/xe_pt.c
+++ b/drivers/gpu/drm/xe/xe_pt.c
@@ -962,9 +962,8 @@ bool xe_pt_zap_ptes_range(struct xe_tile *tile, struct xe_vm *vm,
 }
 
 static void
-xe_vm_populate_pgtable(struct xe_migrate_pt_update *pt_update, struct xe_tile *tile,
-		       struct iosys_map *map, void *data,
-		       u32 qword_ofs, u32 num_qwords,
+xe_vm_populate_pgtable(struct xe_tile *tile, struct iosys_map *map,
+		       void *data, u32 qword_ofs, u32 num_qwords,
 		       const struct xe_vm_pgtable_update *update)
 {
 	struct xe_pt_entry *ptes = update->pt_entries;
@@ -1735,12 +1734,11 @@ static unsigned int xe_pt_stage_unbind(struct xe_tile *tile,
 }
 
 static void
-xe_migrate_clear_pgtable_callback(struct xe_migrate_pt_update *pt_update,
-				  struct xe_tile *tile, struct iosys_map *map,
-				  void *ptr, u32 qword_ofs, u32 num_qwords,
+xe_migrate_clear_pgtable_callback(struct xe_vm *vm, struct xe_tile *tile,
+				  struct iosys_map *map, void *ptr,
+				  u32 qword_ofs, u32 num_qwords,
 				  const struct xe_vm_pgtable_update *update)
 {
-	struct xe_vm *vm = pt_update->vops->vm;
 	u64 empty = __xe_pt_empty_pte(tile, vm, update->pt->level);
 	int i;
 
-- 
2.34.1


* [PATCH 02/15] drm/xe: Add __xe_migrate_update_pgtables_cpu helper
From: Matthew Brost @ 2025-06-05 15:32 UTC
  To: intel-xe; +Cc: francois.dugast, thomas.hellstrom, himal.prasad.ghimiray

This will help implement CPU binds, as the submission backend can call
this helper once a bind job's dependencies are resolved. While here,
also add some asserts to the helper.
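
The helper's control flow can be sketched in user-space C as below. The
types are hypothetical simplifications, a single stored PTE value stands
in for the populate callback, and atomic_thread_fence() is only a
CPU-ordering analogy for xe_device_wmb(), which orders the writes for
the device.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical, simplified stand-ins for the xe structures. */
struct pt_entry {
	uint64_t *pt_map;	/* CPU mapping of the page-table BO */
	uint32_t ofs;		/* first qword to write */
	uint32_t qwords;	/* number of qwords to write */
	uint64_t pte;		/* value to write when binding */
};

struct pt_op {
	bool bind;		/* bind (populate) or unbind (clear) */
	uint32_t num_entries;
	struct pt_entry *entries;
};

/* Walk every entry of every op and write the PTEs with the CPU. */
static void update_pgtables_cpu(struct pt_op *pt_op, int num_ops,
				uint64_t empty_pte)
{
	for (int j = 0; j < num_ops; ++j, ++pt_op) {
		for (uint32_t i = 0; i < pt_op->num_entries; i++) {
			struct pt_entry *e = &pt_op->entries[i];

			assert(e->pt_map);	/* mirrors the vmap assert */
			for (uint32_t q = 0; q < e->qwords; q++)
				e->pt_map[e->ofs + q] =
					pt_op->bind ? e->pte : empty_pte;
		}
	}
	/* Order the PTE writes before anything that publishes them. */
	atomic_thread_fence(memory_order_release);
}
```

Ops are applied in array order, so a later clear can overwrite PTEs
written by an earlier bind within the same batch.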

Signed-off-by: Matthew Brost <matthew.brost@intel.com>
---
 drivers/gpu/drm/xe/xe_migrate.c | 58 ++++++++++++++++++++-------------
 1 file changed, 35 insertions(+), 23 deletions(-)

diff --git a/drivers/gpu/drm/xe/xe_migrate.c b/drivers/gpu/drm/xe/xe_migrate.c
index d8b50305df87..9084f5cbc02d 100644
--- a/drivers/gpu/drm/xe/xe_migrate.c
+++ b/drivers/gpu/drm/xe/xe_migrate.c
@@ -1258,6 +1258,38 @@ struct migrate_test_params {
 	container_of(_priv, struct migrate_test_params, base)
 #endif
 
+static void
+__xe_migrate_update_pgtables_cpu(struct xe_vm *vm, struct xe_tile *tile,
+				 const struct xe_migrate_pt_update_ops *ops,
+				 struct xe_vm_pgtable_update_op *pt_op,
+				 int num_ops)
+{
+	u32 j, i;
+
+	for (j = 0; j < num_ops; ++j, ++pt_op) {
+		for (i = 0; i < pt_op->num_entries; i++) {
+			const struct xe_vm_pgtable_update *update =
+				&pt_op->entries[i];
+
+			xe_tile_assert(tile, update);
+			xe_tile_assert(tile, update->pt_bo);
+			xe_tile_assert(tile, !iosys_map_is_null(&update->pt_bo->vmap));
+
+			if (pt_op->bind)
+				ops->populate(tile, &update->pt_bo->vmap,
+					      NULL, update->ofs, update->qwords,
+					      update);
+			else
+				ops->clear(vm, tile, &update->pt_bo->vmap,
+					   NULL, update->ofs, update->qwords,
+					   update);
+		}
+	}
+
+	trace_xe_vm_cpu_bind(vm);
+	xe_device_wmb(vm->xe);
+}
+
 static struct dma_fence *
 xe_migrate_update_pgtables_cpu(struct xe_migrate *m,
 			       struct xe_migrate_pt_update *pt_update)
@@ -1270,7 +1302,6 @@ xe_migrate_update_pgtables_cpu(struct xe_migrate *m,
 	struct xe_vm_pgtable_update_ops *pt_update_ops =
 		&pt_update->vops->pt_update_ops[pt_update->tile_id];
 	int err;
-	u32 i, j;
 
 	if (XE_TEST_ONLY(test && test->force_gpu))
 		return ERR_PTR(-ETIME);
@@ -1282,28 +1313,9 @@ xe_migrate_update_pgtables_cpu(struct xe_migrate *m,
 			return ERR_PTR(err);
 	}
 
-	for (i = 0; i < pt_update_ops->num_ops; ++i) {
-		const struct xe_vm_pgtable_update_op *pt_op =
-			&pt_update_ops->ops[i];
-
-		for (j = 0; j < pt_op->num_entries; j++) {
-			const struct xe_vm_pgtable_update *update =
-				&pt_op->entries[j];
-
-			if (pt_op->bind)
-				ops->populate(m->tile,
-					      &update->pt_bo->vmap, NULL,
-					      update->ofs, update->qwords,
-					      update);
-			else
-				ops->clear(vm, m->tile,
-					   &update->pt_bo->vmap, NULL,
-					   update->ofs, update->qwords, update);
-		}
-	}
-
-	trace_xe_vm_cpu_bind(vm);
-	xe_device_wmb(vm->xe);
+	__xe_migrate_update_pgtables_cpu(vm, m->tile, ops,
+					 pt_update_ops->ops,
+					 pt_update_ops->num_ops);
 
 	return dma_fence_get_stub();
 }
-- 
2.34.1


* [PATCH 03/15] drm/xe: CPU binds for jobs
From: Matthew Brost @ 2025-06-05 15:32 UTC
  To: intel-xe; +Cc: francois.dugast, thomas.hellstrom, himal.prasad.ghimiray

There is no reason to use the GPU for binds. In run_job, use the CPU to
perform binds once the bind job's dependencies are resolved.

Benefits of CPU-based binds:
- Lower latency once dependencies are resolved, as there is no
  interaction with the GuC or a hardware context switch, both of which
  are relatively slow.
- Large arrays of binds do not risk running out of migration PTEs,
  avoiding -ENOBUFS being returned to userspace.
- Kernel binds are decoupled from the migration exec queue (which issues
  copies and clears), so they cannot get stuck behind unrelated jobs;
  this can be a problem with parallel GPU faults.
- Enables ULLS on the migration exec queue, as this queue then has
  exclusive access to the paging copy engine.

The basic idea of the implementation is to store the VM page table
update operations (struct xe_vm_pgtable_update_op *pt_op) and additional
arguments for the migrate layer’s CPU PTE update function in a job. The
submission backend can then call into the migrate layer using the CPU to
write the PTEs and free the stored resources for the PTE update.
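
A minimal single-threaded sketch of this flow follows, assuming
hypothetical names (struct pt_job_ops, pt_job_ops_get/put, run_job) and
a plain integer refcount in place of a kref: the job holds a reference
to the stored update ops, and once its dependencies resolve, the backend
applies the updates with the CPU and drops that reference.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdlib.h>

/* Hypothetical stand-in for the refcounted container of PT update ops. */
struct pt_job_ops {
	int refcount;
	int num_ops;
	int applied;	/* counts ops "written" by the CPU in this sketch */
};

static struct pt_job_ops *pt_job_ops_get(struct pt_job_ops *ops)
{
	ops->refcount++;
	return ops;
}

/* Returns true when the last reference was dropped and ops was freed. */
static bool pt_job_ops_put(struct pt_job_ops *ops)
{
	if (--ops->refcount == 0) {
		free(ops);
		return true;
	}
	return false;
}

struct job {
	bool is_pt_job;
	struct pt_job_ops *pt_job_ops;
};

/* Called once the job's dependencies have signaled. */
static void run_job(struct job *job)
{
	if (!job->is_pt_job) {
		/* Hardware jobs would be emitted to the ring and submitted. */
		return;
	}

	/* Write the PTEs directly with the CPU ... */
	job->pt_job_ops->applied += job->pt_job_ops->num_ops;
	/* ... then release the job's reference to the stored ops. */
	pt_job_ops_put(job->pt_job_ops);
	job->pt_job_ops = NULL;
}
```

In the patch itself, the corresponding calls are xe_pt_job_ops_get()
when the job is set up in __xe_migrate_update_pgtables() and
xe_pt_job_ops_put() in guc_exec_queue_run_job().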

PT job submission is implemented in the GuC backend for simplicity. A
follow-up could introduce a specific backend for PT jobs.

All code related to GPU-based binding has been removed.

Signed-off-by: Matthew Brost <matthew.brost@intel.com>
---
 drivers/gpu/drm/xe/xe_bo.c              |   7 +-
 drivers/gpu/drm/xe/xe_bo.h              |   9 +-
 drivers/gpu/drm/xe/xe_bo_types.h        |   2 -
 drivers/gpu/drm/xe/xe_drm_client.c      |   3 +-
 drivers/gpu/drm/xe/xe_guc_submit.c      |  36 +++-
 drivers/gpu/drm/xe/xe_migrate.c         | 251 +++---------------------
 drivers/gpu/drm/xe/xe_migrate.h         |   6 +
 drivers/gpu/drm/xe/xe_pt.c              | 188 ++++++++++++++----
 drivers/gpu/drm/xe/xe_pt.h              |   5 +-
 drivers/gpu/drm/xe/xe_pt_types.h        |  29 ++-
 drivers/gpu/drm/xe/xe_sched_job.c       |  78 +++++---
 drivers/gpu/drm/xe/xe_sched_job_types.h |  31 ++-
 drivers/gpu/drm/xe/xe_vm.c              |  46 ++---
 13 files changed, 341 insertions(+), 350 deletions(-)

diff --git a/drivers/gpu/drm/xe/xe_bo.c b/drivers/gpu/drm/xe/xe_bo.c
index 61d208c85281..7aa598b584d2 100644
--- a/drivers/gpu/drm/xe/xe_bo.c
+++ b/drivers/gpu/drm/xe/xe_bo.c
@@ -3033,8 +3033,13 @@ void xe_bo_put_commit(struct llist_head *deferred)
 	if (!freed)
 		return;
 
-	llist_for_each_entry_safe(bo, next, freed, freed)
+	llist_for_each_entry_safe(bo, next, freed, freed) {
+		bool put_vm = bo->flags & XE_BO_FLAG_PUT_VM_ASYNC;
+		struct xe_vm *vm = bo->vm;
 		drm_gem_object_free(&bo->ttm.base.refcount);
+		if (put_vm)
+			xe_vm_put(vm);
+	}
 }
 
 static void xe_bo_dev_work_func(struct work_struct *work)
diff --git a/drivers/gpu/drm/xe/xe_bo.h b/drivers/gpu/drm/xe/xe_bo.h
index 02ada1fb8a23..967b1fe92560 100644
--- a/drivers/gpu/drm/xe/xe_bo.h
+++ b/drivers/gpu/drm/xe/xe_bo.h
@@ -46,6 +46,7 @@
 #define XE_BO_FLAG_GGTT2		BIT(22)
 #define XE_BO_FLAG_GGTT3		BIT(23)
 #define XE_BO_FLAG_CPU_ADDR_MIRROR	BIT(24)
+#define XE_BO_FLAG_PUT_VM_ASYNC		BIT(25)
 
 /* this one is trigger internally only */
 #define XE_BO_FLAG_INTERNAL_TEST	BIT(30)
@@ -319,6 +320,7 @@ void __xe_bo_release_dummy(struct kref *kref);
  * @bo: The bo to put.
  * @deferred: List to which to add the buffer object if we cannot put, or
  * NULL if the function is to put unconditionally.
+ * @added: Set to true if the BO was added to @deferred
  *
  * Since the final freeing of an object includes both sleeping and (!)
  * memory allocation in the dma_resv individualization, it's not ok
@@ -338,7 +340,8 @@ void __xe_bo_release_dummy(struct kref *kref);
  * false otherwise.
  */
 static inline bool
-xe_bo_put_deferred(struct xe_bo *bo, struct llist_head *deferred)
+xe_bo_put_deferred(struct xe_bo *bo, struct llist_head *deferred,
+		   bool *added)
 {
 	if (!deferred) {
 		xe_bo_put(bo);
@@ -348,6 +351,7 @@ xe_bo_put_deferred(struct xe_bo *bo, struct llist_head *deferred)
 	if (!kref_put(&bo->ttm.base.refcount, __xe_bo_release_dummy))
 		return false;
 
+	*added = true;
 	return llist_add(&bo->freed, deferred);
 }
 
@@ -363,8 +367,9 @@ static inline void
 xe_bo_put_async(struct xe_bo *bo)
 {
 	struct xe_bo_dev *bo_device = &xe_bo_device(bo)->bo_device;
+	bool added = false;
 
-	if (xe_bo_put_deferred(bo, &bo_device->async_list))
+	if (xe_bo_put_deferred(bo, &bo_device->async_list, &added))
 		schedule_work(&bo_device->async_free);
 }
 
diff --git a/drivers/gpu/drm/xe/xe_bo_types.h b/drivers/gpu/drm/xe/xe_bo_types.h
index eb5e83c5f233..ecf42a04640a 100644
--- a/drivers/gpu/drm/xe/xe_bo_types.h
+++ b/drivers/gpu/drm/xe/xe_bo_types.h
@@ -70,8 +70,6 @@ struct xe_bo {
 
 	/** @freed: List node for delayed put. */
 	struct llist_node freed;
-	/** @update_index: Update index if PT BO */
-	int update_index;
 	/** @created: Whether the bo has passed initial creation */
 	bool created;
 
diff --git a/drivers/gpu/drm/xe/xe_drm_client.c b/drivers/gpu/drm/xe/xe_drm_client.c
index 31f688e953d7..6f5a91ef7491 100644
--- a/drivers/gpu/drm/xe/xe_drm_client.c
+++ b/drivers/gpu/drm/xe/xe_drm_client.c
@@ -200,6 +200,7 @@ static void show_meminfo(struct drm_printer *p, struct drm_file *file)
 	LLIST_HEAD(deferred);
 	unsigned int id;
 	u32 mem_type;
+	bool added = false;
 
 	client = xef->client;
 
@@ -246,7 +247,7 @@ static void show_meminfo(struct drm_printer *p, struct drm_file *file)
 			xe_assert(xef->xe, !list_empty(&bo->client_link));
 		}
 
-		xe_bo_put_deferred(bo, &deferred);
+		xe_bo_put_deferred(bo, &deferred, &added);
 	}
 	spin_unlock(&client->bos_lock);
 
diff --git a/drivers/gpu/drm/xe/xe_guc_submit.c b/drivers/gpu/drm/xe/xe_guc_submit.c
index 2b61d017eeca..551cd21a6465 100644
--- a/drivers/gpu/drm/xe/xe_guc_submit.c
+++ b/drivers/gpu/drm/xe/xe_guc_submit.c
@@ -19,6 +19,7 @@
 #include "abi/guc_klvs_abi.h"
 #include "regs/xe_lrc_layout.h"
 #include "xe_assert.h"
+#include "xe_bo.h"
 #include "xe_devcoredump.h"
 #include "xe_device.h"
 #include "xe_exec_queue.h"
@@ -38,8 +39,10 @@
 #include "xe_lrc.h"
 #include "xe_macros.h"
 #include "xe_map.h"
+#include "xe_migrate.h"
 #include "xe_mocs.h"
 #include "xe_pm.h"
+#include "xe_pt.h"
 #include "xe_ring_ops_types.h"
 #include "xe_sched_job.h"
 #include "xe_trace.h"
@@ -745,6 +748,20 @@ static void submit_exec_queue(struct xe_exec_queue *q)
 	}
 }
 
+static bool is_pt_job(struct xe_sched_job *job)
+{
+	return job->is_pt_job;
+}
+
+static void run_pt_job(struct xe_sched_job *job)
+{
+	__xe_migrate_update_pgtables_cpu(job->pt_update[0].vm,
+					 job->pt_update[0].tile,
+					 job->pt_update[0].ops,
+					 job->pt_update[0].pt_job_ops->ops,
+					 job->pt_update[0].pt_job_ops->current_op);
+}
+
 static struct dma_fence *
 guc_exec_queue_run_job(struct drm_sched_job *drm_job)
 {
@@ -760,14 +777,21 @@ guc_exec_queue_run_job(struct drm_sched_job *drm_job)
 	trace_xe_sched_job_run(job);
 
 	if (!exec_queue_killed_or_banned_or_wedged(q) && !xe_sched_job_is_error(job)) {
-		if (!exec_queue_registered(q))
-			register_exec_queue(q);
-		if (!lr)	/* LR jobs are emitted in the exec IOCTL */
-			q->ring_ops->emit_job(job);
-		submit_exec_queue(q);
+		if (is_pt_job(job)) {
+			run_pt_job(job);
+		} else {
+			if (!exec_queue_registered(q))
+				register_exec_queue(q);
+			if (!lr)	/* LR jobs are emitted in the exec IOCTL */
+				q->ring_ops->emit_job(job);
+			submit_exec_queue(q);
+		}
 	}
 
-	if (lr) {
+	if (is_pt_job(job)) {
+		xe_pt_job_ops_put(job->pt_update[0].pt_job_ops);
+		dma_fence_put(job->fence);	/* Drop ref from xe_sched_job_arm */
+	} else if (lr) {
 		xe_sched_job_set_error(job, -EOPNOTSUPP);
 		dma_fence_put(job->fence);	/* Drop ref from xe_sched_job_arm */
 	} else {
diff --git a/drivers/gpu/drm/xe/xe_migrate.c b/drivers/gpu/drm/xe/xe_migrate.c
index 9084f5cbc02d..e444f3fae97c 100644
--- a/drivers/gpu/drm/xe/xe_migrate.c
+++ b/drivers/gpu/drm/xe/xe_migrate.c
@@ -58,18 +58,12 @@ struct xe_migrate {
 	 * Protected by @job_mutex.
 	 */
 	struct dma_fence *fence;
-	/**
-	 * @vm_update_sa: For integrated, used to suballocate page-tables
-	 * out of the pt_bo.
-	 */
-	struct drm_suballoc_manager vm_update_sa;
 	/** @min_chunk_size: For dgfx, Minimum chunk size */
 	u64 min_chunk_size;
 };
 
 #define MAX_PREEMPTDISABLE_TRANSFER SZ_8M /* Around 1ms. */
 #define MAX_CCS_LIMITED_TRANSFER SZ_4M /* XE_PAGE_SIZE * (FIELD_MAX(XE2_CCS_SIZE_MASK) + 1) */
-#define NUM_KERNEL_PDE 15
 #define NUM_PT_SLOTS 32
 #define LEVEL0_PAGE_TABLE_ENCODE_SIZE SZ_2M
 #define MAX_NUM_PTE 512
@@ -107,7 +101,6 @@ static void xe_migrate_fini(void *arg)
 
 	dma_fence_put(m->fence);
 	xe_bo_put(m->pt_bo);
-	drm_suballoc_manager_fini(&m->vm_update_sa);
 	mutex_destroy(&m->job_mutex);
 	xe_vm_close_and_put(m->q->vm);
 	xe_exec_queue_put(m->q);
@@ -199,8 +192,6 @@ static int xe_migrate_prepare_vm(struct xe_tile *tile, struct xe_migrate *m,
 	BUILD_BUG_ON(NUM_PT_SLOTS > SZ_2M/XE_PAGE_SIZE);
 	/* Must be a multiple of 64K to support all platforms */
 	BUILD_BUG_ON(NUM_PT_SLOTS * XE_PAGE_SIZE % SZ_64K);
-	/* And one slot reserved for the 4KiB page table updates */
-	BUILD_BUG_ON(!(NUM_KERNEL_PDE & 1));
 
 	/* Need to be sure everything fits in the first PT, or create more */
 	xe_tile_assert(tile, m->batch_base_ofs + batch->size < SZ_2M);
@@ -333,8 +324,6 @@ static int xe_migrate_prepare_vm(struct xe_tile *tile, struct xe_migrate *m,
 	/*
 	 * Example layout created above, with root level = 3:
 	 * [PT0...PT7]: kernel PT's for copy/clear; 64 or 4KiB PTE's
-	 * [PT8]: Kernel PT for VM_BIND, 4 KiB PTE's
-	 * [PT9...PT26]: Userspace PT's for VM_BIND, 4 KiB PTE's
 	 * [PT27 = PDE 0] [PT28 = PDE 1] [PT29 = PDE 2] [PT30 & PT31 = 2M vram identity map]
 	 *
 	 * This makes the lowest part of the VM point to the pagetables.
@@ -342,19 +331,10 @@ static int xe_migrate_prepare_vm(struct xe_tile *tile, struct xe_migrate *m,
 	 * and flushes, other parts of the VM can be used either for copying and
 	 * clearing.
 	 *
-	 * For performance, the kernel reserves PDE's, so about 20 are left
-	 * for async VM updates.
-	 *
 	 * To make it easier to work, each scratch PT is put in slot (1 + PT #)
 	 * everywhere, this allows lockless updates to scratch pages by using
 	 * the different addresses in VM.
 	 */
-#define NUM_VMUSA_UNIT_PER_PAGE	32
-#define VM_SA_UPDATE_UNIT_SIZE		(XE_PAGE_SIZE / NUM_VMUSA_UNIT_PER_PAGE)
-#define NUM_VMUSA_WRITES_PER_UNIT	(VM_SA_UPDATE_UNIT_SIZE / sizeof(u64))
-	drm_suballoc_manager_init(&m->vm_update_sa,
-				  (size_t)(map_ofs / XE_PAGE_SIZE - NUM_KERNEL_PDE) *
-				  NUM_VMUSA_UNIT_PER_PAGE, 0);
 
 	m->pt_bo = bo;
 	return 0;
@@ -1193,56 +1173,6 @@ struct dma_fence *xe_migrate_clear(struct xe_migrate *m,
 	return fence;
 }
 
-static void write_pgtable(struct xe_tile *tile, struct xe_bb *bb, u64 ppgtt_ofs,
-			  const struct xe_vm_pgtable_update_op *pt_op,
-			  const struct xe_vm_pgtable_update *update,
-			  struct xe_migrate_pt_update *pt_update)
-{
-	const struct xe_migrate_pt_update_ops *ops = pt_update->ops;
-	struct xe_vm *vm = pt_update->vops->vm;
-	u32 chunk;
-	u32 ofs = update->ofs, size = update->qwords;
-
-	/*
-	 * If we have 512 entries (max), we would populate it ourselves,
-	 * and update the PDE above it to the new pointer.
-	 * The only time this can only happen if we have to update the top
-	 * PDE. This requires a BO that is almost vm->size big.
-	 *
-	 * This shouldn't be possible in practice.. might change when 16K
-	 * pages are used. Hence the assert.
-	 */
-	xe_tile_assert(tile, update->qwords < MAX_NUM_PTE);
-	if (!ppgtt_ofs)
-		ppgtt_ofs = xe_migrate_vram_ofs(tile_to_xe(tile),
-						xe_bo_addr(update->pt_bo, 0,
-							   XE_PAGE_SIZE), false);
-
-	do {
-		u64 addr = ppgtt_ofs + ofs * 8;
-
-		chunk = min(size, MAX_PTE_PER_SDI);
-
-		/* Ensure populatefn can do memset64 by aligning bb->cs */
-		if (!(bb->len & 1))
-			bb->cs[bb->len++] = MI_NOOP;
-
-		bb->cs[bb->len++] = MI_STORE_DATA_IMM | MI_SDI_NUM_QW(chunk);
-		bb->cs[bb->len++] = lower_32_bits(addr);
-		bb->cs[bb->len++] = upper_32_bits(addr);
-		if (pt_op->bind)
-			ops->populate(tile, NULL, bb->cs + bb->len,
-				      ofs, chunk, update);
-		else
-			ops->clear(vm, tile, NULL, bb->cs + bb->len,
-				   ofs, chunk, update);
-
-		bb->len += chunk * 2;
-		ofs += chunk;
-		size -= chunk;
-	} while (size);
-}
-
 struct xe_vm *xe_migrate_get_vm(struct xe_migrate *m)
 {
 	return xe_vm_get(m->q->vm);
@@ -1258,7 +1188,18 @@ struct migrate_test_params {
 	container_of(_priv, struct migrate_test_params, base)
 #endif
 
-static void
+/**
+ * __xe_migrate_update_pgtables_cpu() - Update a VM's PTEs via the CPU
+ * @vm: The VM being updated
+ * @tile: The tile being updated
+ * @ops: The migrate PT update ops
+ * @pt_op: The VM PT update ops
+ * @num_ops: The number of VM PT update ops
+ *
+ * Execute the VM PT update ops array which results in a VM's PTEs being updated
+ * via the CPU.
+ */
+void
 __xe_migrate_update_pgtables_cpu(struct xe_vm *vm, struct xe_tile *tile,
 				 const struct xe_migrate_pt_update_ops *ops,
 				 struct xe_vm_pgtable_update_op *pt_op,
@@ -1314,7 +1255,7 @@ xe_migrate_update_pgtables_cpu(struct xe_migrate *m,
 	}
 
 	__xe_migrate_update_pgtables_cpu(vm, m->tile, ops,
-					 pt_update_ops->ops,
+					 pt_update_ops->pt_job_ops->ops,
 					 pt_update_ops->num_ops);
 
 	return dma_fence_get_stub();
@@ -1327,161 +1268,19 @@ __xe_migrate_update_pgtables(struct xe_migrate *m,
 {
 	const struct xe_migrate_pt_update_ops *ops = pt_update->ops;
 	struct xe_tile *tile = m->tile;
-	struct xe_gt *gt = tile->primary_gt;
-	struct xe_device *xe = tile_to_xe(tile);
 	struct xe_sched_job *job;
 	struct dma_fence *fence;
-	struct drm_suballoc *sa_bo = NULL;
-	struct xe_bb *bb;
-	u32 i, j, batch_size = 0, ppgtt_ofs, update_idx, page_ofs = 0;
-	u32 num_updates = 0, current_update = 0;
-	u64 addr;
-	int err = 0;
 	bool is_migrate = pt_update_ops->q == m->q;
-	bool usm = is_migrate && xe->info.has_usm;
-
-	for (i = 0; i < pt_update_ops->num_ops; ++i) {
-		struct xe_vm_pgtable_update_op *pt_op = &pt_update_ops->ops[i];
-		struct xe_vm_pgtable_update *updates = pt_op->entries;
-
-		num_updates += pt_op->num_entries;
-		for (j = 0; j < pt_op->num_entries; ++j) {
-			u32 num_cmds = DIV_ROUND_UP(updates[j].qwords,
-						    MAX_PTE_PER_SDI);
-
-			/* align noop + MI_STORE_DATA_IMM cmd prefix */
-			batch_size += 4 * num_cmds + updates[j].qwords * 2;
-		}
-	}
-
-	/* fixed + PTE entries */
-	if (IS_DGFX(xe))
-		batch_size += 2;
-	else
-		batch_size += 6 * (num_updates / MAX_PTE_PER_SDI + 1) +
-			num_updates * 2;
-
-	bb = xe_bb_new(gt, batch_size, usm);
-	if (IS_ERR(bb))
-		return ERR_CAST(bb);
-
-	/* For sysmem PTE's, need to map them in our hole.. */
-	if (!IS_DGFX(xe)) {
-		u16 pat_index = xe->pat.idx[XE_CACHE_WB];
-		u32 ptes, ofs;
-
-		ppgtt_ofs = NUM_KERNEL_PDE - 1;
-		if (!is_migrate) {
-			u32 num_units = DIV_ROUND_UP(num_updates,
-						     NUM_VMUSA_WRITES_PER_UNIT);
-
-			if (num_units > m->vm_update_sa.size) {
-				err = -ENOBUFS;
-				goto err_bb;
-			}
-			sa_bo = drm_suballoc_new(&m->vm_update_sa, num_units,
-						 GFP_KERNEL, true, 0);
-			if (IS_ERR(sa_bo)) {
-				err = PTR_ERR(sa_bo);
-				goto err_bb;
-			}
-
-			ppgtt_ofs = NUM_KERNEL_PDE +
-				(drm_suballoc_soffset(sa_bo) /
-				 NUM_VMUSA_UNIT_PER_PAGE);
-			page_ofs = (drm_suballoc_soffset(sa_bo) %
-				    NUM_VMUSA_UNIT_PER_PAGE) *
-				VM_SA_UPDATE_UNIT_SIZE;
-		}
-
-		/* Map our PT's to gtt */
-		i = 0;
-		j = 0;
-		ptes = num_updates;
-		ofs = ppgtt_ofs * XE_PAGE_SIZE + page_ofs;
-		while (ptes) {
-			u32 chunk = min(MAX_PTE_PER_SDI, ptes);
-			u32 idx = 0;
-
-			bb->cs[bb->len++] = MI_STORE_DATA_IMM |
-				MI_SDI_NUM_QW(chunk);
-			bb->cs[bb->len++] = ofs;
-			bb->cs[bb->len++] = 0; /* upper_32_bits */
-
-			for (; i < pt_update_ops->num_ops; ++i) {
-				struct xe_vm_pgtable_update_op *pt_op =
-					&pt_update_ops->ops[i];
-				struct xe_vm_pgtable_update *updates = pt_op->entries;
-
-				for (; j < pt_op->num_entries; ++j, ++current_update, ++idx) {
-					struct xe_vm *vm = pt_update->vops->vm;
-					struct xe_bo *pt_bo = updates[j].pt_bo;
-
-					if (idx == chunk)
-						goto next_cmd;
-
-					xe_tile_assert(tile, pt_bo->size == SZ_4K);
-
-					/* Map a PT at most once */
-					if (pt_bo->update_index < 0)
-						pt_bo->update_index = current_update;
-
-					addr = vm->pt_ops->pte_encode_bo(pt_bo, 0,
-									 pat_index, 0);
-					bb->cs[bb->len++] = lower_32_bits(addr);
-					bb->cs[bb->len++] = upper_32_bits(addr);
-				}
-
-				j = 0;
-			}
-
-next_cmd:
-			ptes -= chunk;
-			ofs += chunk * sizeof(u64);
-		}
-
-		bb->cs[bb->len++] = MI_BATCH_BUFFER_END;
-		update_idx = bb->len;
-
-		addr = xe_migrate_vm_addr(ppgtt_ofs, 0) +
-			(page_ofs / sizeof(u64)) * XE_PAGE_SIZE;
-		for (i = 0; i < pt_update_ops->num_ops; ++i) {
-			struct xe_vm_pgtable_update_op *pt_op =
-				&pt_update_ops->ops[i];
-			struct xe_vm_pgtable_update *updates = pt_op->entries;
-
-			for (j = 0; j < pt_op->num_entries; ++j) {
-				struct xe_bo *pt_bo = updates[j].pt_bo;
-
-				write_pgtable(tile, bb, addr +
-					      pt_bo->update_index * XE_PAGE_SIZE,
-					      pt_op, &updates[j], pt_update);
-			}
-		}
-	} else {
-		/* phys pages, no preamble required */
-		bb->cs[bb->len++] = MI_BATCH_BUFFER_END;
-		update_idx = bb->len;
-
-		for (i = 0; i < pt_update_ops->num_ops; ++i) {
-			struct xe_vm_pgtable_update_op *pt_op =
-				&pt_update_ops->ops[i];
-			struct xe_vm_pgtable_update *updates = pt_op->entries;
-
-			for (j = 0; j < pt_op->num_entries; ++j)
-				write_pgtable(tile, bb, 0, pt_op, &updates[j],
-					      pt_update);
-		}
-	}
+	int err;
 
-	job = xe_bb_create_migration_job(pt_update_ops->q, bb,
-					 xe_migrate_batch_base(m, usm),
-					 update_idx);
+	job = xe_sched_job_create(pt_update_ops->q, NULL);
 	if (IS_ERR(job)) {
 		err = PTR_ERR(job);
-		goto err_sa;
+		goto err_out;
 	}
 
+	xe_tile_assert(tile, job->is_pt_job);
+
 	if (ops->pre_commit) {
 		pt_update->job = job;
 		err = ops->pre_commit(pt_update);
@@ -1491,6 +1290,12 @@ __xe_migrate_update_pgtables(struct xe_migrate *m,
 	if (is_migrate)
 		mutex_lock(&m->job_mutex);
 
+	job->pt_update[0].vm = pt_update->vops->vm;
+	job->pt_update[0].tile = tile;
+	job->pt_update[0].ops = ops;
+	job->pt_update[0].pt_job_ops =
+		xe_pt_job_ops_get(pt_update_ops->pt_job_ops);
+
 	xe_sched_job_arm(job);
 	fence = dma_fence_get(&job->drm.s_fence->finished);
 	xe_sched_job_push(job);
@@ -1498,17 +1303,11 @@ __xe_migrate_update_pgtables(struct xe_migrate *m,
 	if (is_migrate)
 		mutex_unlock(&m->job_mutex);
 
-	xe_bb_free(bb, fence);
-	drm_suballoc_free(sa_bo, fence);
-
 	return fence;
 
 err_job:
 	xe_sched_job_put(job);
-err_sa:
-	drm_suballoc_free(sa_bo, NULL);
-err_bb:
-	xe_bb_free(bb, NULL);
+err_out:
 	return ERR_PTR(err);
 }
 
diff --git a/drivers/gpu/drm/xe/xe_migrate.h b/drivers/gpu/drm/xe/xe_migrate.h
index b064455b604e..0986ffdd8d9a 100644
--- a/drivers/gpu/drm/xe/xe_migrate.h
+++ b/drivers/gpu/drm/xe/xe_migrate.h
@@ -22,6 +22,7 @@ struct xe_pt;
 struct xe_tile;
 struct xe_vm;
 struct xe_vm_pgtable_update;
+struct xe_vm_pgtable_update_op;
 struct xe_vma;
 
 /**
@@ -125,6 +126,11 @@ struct dma_fence *xe_migrate_clear(struct xe_migrate *m,
 
 struct xe_vm *xe_migrate_get_vm(struct xe_migrate *m);
 
+void __xe_migrate_update_pgtables_cpu(struct xe_vm *vm, struct xe_tile *tile,
+				      const struct xe_migrate_pt_update_ops *ops,
+				      struct xe_vm_pgtable_update_op *pt_op,
+				      int num_ops);
+
 struct dma_fence *
 xe_migrate_update_pgtables(struct xe_migrate *m,
 			   struct xe_migrate_pt_update *pt_update);
diff --git a/drivers/gpu/drm/xe/xe_pt.c b/drivers/gpu/drm/xe/xe_pt.c
index db1c363a65d5..1ad31f444b79 100644
--- a/drivers/gpu/drm/xe/xe_pt.c
+++ b/drivers/gpu/drm/xe/xe_pt.c
@@ -200,7 +200,9 @@ unsigned int xe_pt_shift(unsigned int level)
  * and finally frees @pt. TODO: Can we remove the @flags argument?
  */
 void xe_pt_destroy(struct xe_pt *pt, u32 flags, struct llist_head *deferred)
+
 {
+	bool added = false;
 	int i;
 
 	if (!pt)
@@ -208,7 +210,18 @@ void xe_pt_destroy(struct xe_pt *pt, u32 flags, struct llist_head *deferred)
 
 	XE_WARN_ON(!list_empty(&pt->bo->ttm.base.gpuva.list));
 	xe_bo_unpin(pt->bo);
-	xe_bo_put_deferred(pt->bo, deferred);
+	xe_bo_put_deferred(pt->bo, deferred, &added);
+	if (added) {
+		/*
+		 * We need the VM present until the BO is destroyed as it shares
+		 * a dma-resv and BO destroy is async. Reinit BO refcount so
+		 * xe_bo_put_async can be used when the PT job ops refcount goes
+		 * to zero.
+		 */
+		xe_vm_get(pt->bo->vm);
+		pt->bo->flags |= XE_BO_FLAG_PUT_VM_ASYNC;
+		kref_init(&pt->bo->ttm.base.refcount);
+	}
 
 	if (pt->level > 0 && pt->num_live) {
 		struct xe_pt_dir *pt_dir = as_xe_pt_dir(pt);
@@ -361,7 +374,7 @@ xe_pt_new_shared(struct xe_walk_update *wupd, struct xe_pt *parent,
 	entry->pt = parent;
 	entry->flags = 0;
 	entry->qwords = 0;
-	entry->pt_bo->update_index = -1;
+	entry->level = parent->level;
 
 	if (alloc_entries) {
 		entry->pt_entries = kmalloc_array(XE_PDES,
@@ -1739,7 +1752,7 @@ xe_migrate_clear_pgtable_callback(struct xe_vm *vm, struct xe_tile *tile,
 				  u32 qword_ofs, u32 num_qwords,
 				  const struct xe_vm_pgtable_update *update)
 {
-	u64 empty = __xe_pt_empty_pte(tile, vm, update->pt->level);
+	u64 empty = __xe_pt_empty_pte(tile, vm, update->level);
 	int i;
 
 	if (map && map->is_iomem)
@@ -1805,13 +1818,20 @@ xe_pt_commit_prepare_unbind(struct xe_vma *vma,
 	}
 }
 
+static struct xe_vm_pgtable_update_op *
+to_pt_op(struct xe_vm_pgtable_update_ops *pt_update_ops, u32 current_op)
+{
+	return &pt_update_ops->pt_job_ops->ops[current_op];
+}
+
 static void
 xe_pt_update_ops_rfence_interval(struct xe_vm_pgtable_update_ops *pt_update_ops,
 				 u64 start, u64 end)
 {
 	u64 last;
-	u32 current_op = pt_update_ops->current_op;
-	struct xe_vm_pgtable_update_op *pt_op = &pt_update_ops->ops[current_op];
+	u32 current_op = pt_update_ops->pt_job_ops->current_op;
+	struct xe_vm_pgtable_update_op *pt_op =
+		to_pt_op(pt_update_ops, current_op);
 	int i, level = 0;
 
 	for (i = 0; i < pt_op->num_entries; i++) {
@@ -1846,8 +1866,9 @@ static int bind_op_prepare(struct xe_vm *vm, struct xe_tile *tile,
 			   struct xe_vm_pgtable_update_ops *pt_update_ops,
 			   struct xe_vma *vma, bool invalidate_on_bind)
 {
-	u32 current_op = pt_update_ops->current_op;
-	struct xe_vm_pgtable_update_op *pt_op = &pt_update_ops->ops[current_op];
+	u32 current_op = pt_update_ops->pt_job_ops->current_op;
+	struct xe_vm_pgtable_update_op *pt_op =
+		to_pt_op(pt_update_ops, current_op);
 	int err;
 
 	xe_tile_assert(tile, !xe_vma_is_cpu_addr_mirror(vma));
@@ -1876,7 +1897,7 @@ static int bind_op_prepare(struct xe_vm *vm, struct xe_tile *tile,
 		xe_pt_update_ops_rfence_interval(pt_update_ops,
 						 xe_vma_start(vma),
 						 xe_vma_end(vma));
-		++pt_update_ops->current_op;
+		++pt_update_ops->pt_job_ops->current_op;
 		pt_update_ops->needs_userptr_lock |= xe_vma_is_userptr(vma);
 
 		/*
@@ -1913,8 +1934,9 @@ static int bind_range_prepare(struct xe_vm *vm, struct xe_tile *tile,
 			      struct xe_vm_pgtable_update_ops *pt_update_ops,
 			      struct xe_vma *vma, struct xe_svm_range *range)
 {
-	u32 current_op = pt_update_ops->current_op;
-	struct xe_vm_pgtable_update_op *pt_op = &pt_update_ops->ops[current_op];
+	u32 current_op = pt_update_ops->pt_job_ops->current_op;
+	struct xe_vm_pgtable_update_op *pt_op =
+		to_pt_op(pt_update_ops, current_op);
 	int err;
 
 	xe_tile_assert(tile, xe_vma_is_cpu_addr_mirror(vma));
@@ -1938,7 +1960,7 @@ static int bind_range_prepare(struct xe_vm *vm, struct xe_tile *tile,
 		xe_pt_update_ops_rfence_interval(pt_update_ops,
 						 range->base.itree.start,
 						 range->base.itree.last + 1);
-		++pt_update_ops->current_op;
+		++pt_update_ops->pt_job_ops->current_op;
 		pt_update_ops->needs_svm_lock = true;
 
 		pt_op->vma = vma;
@@ -1955,8 +1977,9 @@ static int unbind_op_prepare(struct xe_tile *tile,
 			     struct xe_vm_pgtable_update_ops *pt_update_ops,
 			     struct xe_vma *vma)
 {
-	u32 current_op = pt_update_ops->current_op;
-	struct xe_vm_pgtable_update_op *pt_op = &pt_update_ops->ops[current_op];
+	u32 current_op = pt_update_ops->pt_job_ops->current_op;
+	struct xe_vm_pgtable_update_op *pt_op =
+		to_pt_op(pt_update_ops, current_op);
 	int err;
 
 	if (!((vma->tile_present | vma->tile_staged) & BIT(tile->id)))
@@ -1984,7 +2007,7 @@ static int unbind_op_prepare(struct xe_tile *tile,
 				pt_op->num_entries, false);
 	xe_pt_update_ops_rfence_interval(pt_update_ops, xe_vma_start(vma),
 					 xe_vma_end(vma));
-	++pt_update_ops->current_op;
+	++pt_update_ops->pt_job_ops->current_op;
 	pt_update_ops->needs_userptr_lock |= xe_vma_is_userptr(vma);
 	pt_update_ops->needs_invalidation = true;
 
@@ -1998,8 +2021,9 @@ static int unbind_range_prepare(struct xe_vm *vm,
 				struct xe_vm_pgtable_update_ops *pt_update_ops,
 				struct xe_svm_range *range)
 {
-	u32 current_op = pt_update_ops->current_op;
-	struct xe_vm_pgtable_update_op *pt_op = &pt_update_ops->ops[current_op];
+	u32 current_op = pt_update_ops->pt_job_ops->current_op;
+	struct xe_vm_pgtable_update_op *pt_op =
+		to_pt_op(pt_update_ops, current_op);
 
 	if (!(range->tile_present & BIT(tile->id)))
 		return 0;
@@ -2019,7 +2043,7 @@ static int unbind_range_prepare(struct xe_vm *vm,
 				pt_op->num_entries, false);
 	xe_pt_update_ops_rfence_interval(pt_update_ops, range->base.itree.start,
 					 range->base.itree.last + 1);
-	++pt_update_ops->current_op;
+	++pt_update_ops->pt_job_ops->current_op;
 	pt_update_ops->needs_svm_lock = true;
 	pt_update_ops->needs_invalidation = true;
 
@@ -2122,7 +2146,6 @@ static int op_prepare(struct xe_vm *vm,
 static void
 xe_pt_update_ops_init(struct xe_vm_pgtable_update_ops *pt_update_ops)
 {
-	init_llist_head(&pt_update_ops->deferred);
 	pt_update_ops->start = ~0x0ull;
 	pt_update_ops->last = 0x0ull;
 }
@@ -2163,7 +2186,7 @@ int xe_pt_update_ops_prepare(struct xe_tile *tile, struct xe_vma_ops *vops)
 			return err;
 	}
 
-	xe_tile_assert(tile, pt_update_ops->current_op <=
+	xe_tile_assert(tile, pt_update_ops->pt_job_ops->current_op <=
 		       pt_update_ops->num_ops);
 
 #ifdef TEST_VM_OPS_ERROR
@@ -2396,7 +2419,7 @@ xe_pt_update_ops_run(struct xe_tile *tile, struct xe_vma_ops *vops)
 	lockdep_assert_held(&vm->lock);
 	xe_vm_assert_held(vm);
 
-	if (!pt_update_ops->current_op) {
+	if (!pt_update_ops->pt_job_ops->current_op) {
 		xe_tile_assert(tile, xe_vm_in_fault_mode(vm));
 
 		return dma_fence_get_stub();
@@ -2445,12 +2468,16 @@ xe_pt_update_ops_run(struct xe_tile *tile, struct xe_vma_ops *vops)
 		goto free_rfence;
 	}
 
-	/* Point of no return - VM killed if failure after this */
-	for (i = 0; i < pt_update_ops->current_op; ++i) {
-		struct xe_vm_pgtable_update_op *pt_op = &pt_update_ops->ops[i];
+	/*
+	 * Point of no return - VM killed if failure after this
+	 */
+	for (i = 0; i < pt_update_ops->pt_job_ops->current_op; ++i) {
+		struct xe_vm_pgtable_update_op *pt_op =
+			to_pt_op(pt_update_ops, i);
 
 		xe_pt_commit(pt_op->vma, pt_op->entries,
-			     pt_op->num_entries, &pt_update_ops->deferred);
+			     pt_op->num_entries,
+			     &pt_update_ops->pt_job_ops->deferred);
 		pt_op->vma = NULL;	/* skip in xe_pt_update_ops_abort */
 	}
 
@@ -2530,27 +2557,19 @@ xe_pt_update_ops_run(struct xe_tile *tile, struct xe_vma_ops *vops)
 ALLOW_ERROR_INJECTION(xe_pt_update_ops_run, ERRNO);
 
 /**
- * xe_pt_update_ops_fini() - Finish PT update operations
- * @tile: Tile of PT update operations
- * @vops: VMA operations
+ * xe_pt_update_ops_free() - Free PT update operations
+ * @pt_op: Array of PT update operations
+ * @num_ops: Number of PT update operations
  *
- * Finish PT update operations by committing to destroy page table memory
+ * Free the bind entries allocated for each PT update operation.
  */
-void xe_pt_update_ops_fini(struct xe_tile *tile, struct xe_vma_ops *vops)
+static void xe_pt_update_ops_free(struct xe_vm_pgtable_update_op *pt_op,
+				  u32 num_ops)
 {
-	struct xe_vm_pgtable_update_ops *pt_update_ops =
-		&vops->pt_update_ops[tile->id];
-	int i;
-
-	lockdep_assert_held(&vops->vm->lock);
-	xe_vm_assert_held(vops->vm);
-
-	for (i = 0; i < pt_update_ops->current_op; ++i) {
-		struct xe_vm_pgtable_update_op *pt_op = &pt_update_ops->ops[i];
+	u32 i;
 
+	for (i = 0; i < num_ops; ++i, ++pt_op)
 		xe_pt_free_bind(pt_op->entries, pt_op->num_entries);
-	}
-	xe_bo_put_commit(&vops->pt_update_ops[tile->id].deferred);
 }
 
 /**
@@ -2571,9 +2590,9 @@ void xe_pt_update_ops_abort(struct xe_tile *tile, struct xe_vma_ops *vops)
 
 	for (i = pt_update_ops->num_ops - 1; i >= 0; --i) {
 		struct xe_vm_pgtable_update_op *pt_op =
-			&pt_update_ops->ops[i];
+			to_pt_op(pt_update_ops, i);
 
-		if (!pt_op->vma || i >= pt_update_ops->current_op)
+		if (!pt_op->vma || i >= pt_update_ops->pt_job_ops->current_op)
 			continue;
 
 		if (pt_op->bind)
@@ -2584,6 +2603,89 @@ void xe_pt_update_ops_abort(struct xe_tile *tile, struct xe_vma_ops *vops)
 			xe_pt_abort_unbind(pt_op->vma, pt_op->entries,
 					   pt_op->num_entries);
 	}
+}
+
+/**
+ * xe_pt_job_ops_alloc() - Allocate PT job ops
+ * @num_ops: Number of VM PT update ops
+ *
+ * Allocate PT job ops and internal array of VM PT update ops.
+ *
+ * Return: Pointer to PT job ops or NULL
+ */
+struct xe_pt_job_ops *xe_pt_job_ops_alloc(u32 num_ops)
+{
+	struct xe_pt_job_ops *pt_job_ops;
+
+	pt_job_ops = kmalloc(sizeof(*pt_job_ops), GFP_KERNEL);
+	if (!pt_job_ops)
+		return NULL;
+
+	pt_job_ops->ops = kvmalloc_array(num_ops, sizeof(*pt_job_ops->ops),
+					 GFP_KERNEL);
+	if (!pt_job_ops->ops) {
+		kfree(pt_job_ops);
+		return NULL;
+	}
+
+	pt_job_ops->current_op = 0;
+	kref_init(&pt_job_ops->refcount);
+	init_llist_head(&pt_job_ops->deferred);
+
+	return pt_job_ops;
+}
+
+/**
+ * xe_pt_job_ops_get() - Get PT job ops
+ * @pt_job_ops: PT job ops to get
+ *
+ * Take a reference to PT job ops
+ *
+ * Return: Pointer to PT job ops or NULL
+ */
+struct xe_pt_job_ops *xe_pt_job_ops_get(struct xe_pt_job_ops *pt_job_ops)
+{
+	if (pt_job_ops)
+		kref_get(&pt_job_ops->refcount);
+
+	return pt_job_ops;
+}
+
+static void xe_pt_job_ops_destroy(struct kref *ref)
+{
+	struct xe_pt_job_ops *pt_job_ops =
+		container_of(ref, struct xe_pt_job_ops, refcount);
+	struct llist_node *freed;
+	struct xe_bo *bo, *next;
+
+	xe_pt_update_ops_free(pt_job_ops->ops,
+			      pt_job_ops->current_op);
+
+	freed = llist_del_all(&pt_job_ops->deferred);
+	if (freed) {
+		llist_for_each_entry_safe(bo, next, freed, freed)
+			/*
+			 * If called from run_job, we are in the dma-fencing
+			 * path and cannot take dma-resv locks so use an async
+			 * put.
+			 */
+			xe_bo_put_async(bo);
+	}
+
+	kvfree(pt_job_ops->ops);
+	kfree(pt_job_ops);
+}
+
+/**
+ * xe_pt_job_ops_put() - Put PT job ops
+ * @pt_job_ops: PT job ops to put
+ *
+ * Drop a reference to PT job ops
+ */
+void xe_pt_job_ops_put(struct xe_pt_job_ops *pt_job_ops)
+{
+	if (!pt_job_ops)
+		return;
 
-	xe_pt_update_ops_fini(tile, vops);
+	kref_put(&pt_job_ops->refcount, xe_pt_job_ops_destroy);
 }
diff --git a/drivers/gpu/drm/xe/xe_pt.h b/drivers/gpu/drm/xe/xe_pt.h
index 5ecf003d513c..c9904573db82 100644
--- a/drivers/gpu/drm/xe/xe_pt.h
+++ b/drivers/gpu/drm/xe/xe_pt.h
@@ -41,11 +41,14 @@ void xe_pt_clear(struct xe_device *xe, struct xe_pt *pt);
 int xe_pt_update_ops_prepare(struct xe_tile *tile, struct xe_vma_ops *vops);
 struct dma_fence *xe_pt_update_ops_run(struct xe_tile *tile,
 				       struct xe_vma_ops *vops);
-void xe_pt_update_ops_fini(struct xe_tile *tile, struct xe_vma_ops *vops);
 void xe_pt_update_ops_abort(struct xe_tile *tile, struct xe_vma_ops *vops);
 
 bool xe_pt_zap_ptes(struct xe_tile *tile, struct xe_vma *vma);
 bool xe_pt_zap_ptes_range(struct xe_tile *tile, struct xe_vm *vm,
 			  struct xe_svm_range *range);
 
+struct xe_pt_job_ops *xe_pt_job_ops_alloc(u32 num_ops);
+struct xe_pt_job_ops *xe_pt_job_ops_get(struct xe_pt_job_ops *pt_job_ops);
+void xe_pt_job_ops_put(struct xe_pt_job_ops *pt_job_ops);
+
 #endif
diff --git a/drivers/gpu/drm/xe/xe_pt_types.h b/drivers/gpu/drm/xe/xe_pt_types.h
index 69eab6f37cfe..33d0d20e0ac6 100644
--- a/drivers/gpu/drm/xe/xe_pt_types.h
+++ b/drivers/gpu/drm/xe/xe_pt_types.h
@@ -70,6 +70,9 @@ struct xe_vm_pgtable_update {
 	/** @pt_entries: Newly added pagetable entries */
 	struct xe_pt_entry *pt_entries;
 
+	/** @level: level of update */
+	unsigned int level;
+
 	/** @flags: Target flags */
 	u32 flags;
 };
@@ -88,12 +91,28 @@ struct xe_vm_pgtable_update_op {
 	bool rebind;
 };
 
-/** struct xe_vm_pgtable_update_ops: page table update operations */
-struct xe_vm_pgtable_update_ops {
-	/** @ops: operations */
-	struct xe_vm_pgtable_update_op *ops;
+/**
+ * struct xe_pt_job_ops: page table update operations dynamic allocation
+ *
+ * This is the part of struct xe_vma_ops and struct xe_vm_pgtable_update_ops
+ * which is dynamically allocated, as it must remain available until the bind
+ * job is complete.
+ */
+struct xe_pt_job_ops {
+	/** @current_op: current operations */
+	u32 current_op;
+	/** @refcount: ref count ops allocation */
+	struct kref refcount;
 	/** @deferred: deferred list to destroy PT entries */
 	struct llist_head deferred;
+	/** @ops: operations */
+	struct xe_vm_pgtable_update_op *ops;
+};
+
+/** struct xe_vm_pgtable_update_ops: page table update operations */
+struct xe_vm_pgtable_update_ops {
+	/** @pt_job_ops: dynamically allocated PT update operations */
+	struct xe_pt_job_ops *pt_job_ops;
 	/** @q: exec queue for PT operations */
 	struct xe_exec_queue *q;
 	/** @start: start address of ops */
@@ -102,8 +121,6 @@ struct xe_vm_pgtable_update_ops {
 	u64 last;
 	/** @num_ops: number of operations */
 	u32 num_ops;
-	/** @current_op: current operations */
-	u32 current_op;
 	/** @needs_svm_lock: Needs SVM lock */
 	bool needs_svm_lock;
 	/** @needs_userptr_lock: Needs userptr lock */
diff --git a/drivers/gpu/drm/xe/xe_sched_job.c b/drivers/gpu/drm/xe/xe_sched_job.c
index d21bf8f26964..09cdd14d9ef7 100644
--- a/drivers/gpu/drm/xe/xe_sched_job.c
+++ b/drivers/gpu/drm/xe/xe_sched_job.c
@@ -26,19 +26,22 @@ static struct kmem_cache *xe_sched_job_parallel_slab;
 
 int __init xe_sched_job_module_init(void)
 {
+	struct xe_sched_job *job;
+	size_t size;
+
+	size = struct_size(job, ptrs, 1);
 	xe_sched_job_slab =
-		kmem_cache_create("xe_sched_job",
-				  sizeof(struct xe_sched_job) +
-				  sizeof(struct xe_job_ptrs), 0,
+		kmem_cache_create("xe_sched_job", size, 0,
 				  SLAB_HWCACHE_ALIGN, NULL);
 	if (!xe_sched_job_slab)
 		return -ENOMEM;
 
+	size = max_t(size_t,
+		     struct_size(job, ptrs,
+				 XE_HW_ENGINE_MAX_INSTANCE),
+		     struct_size(job, pt_update, 1));
 	xe_sched_job_parallel_slab =
-		kmem_cache_create("xe_sched_job_parallel",
-				  sizeof(struct xe_sched_job) +
-				  sizeof(struct xe_job_ptrs) *
-				  XE_HW_ENGINE_MAX_INSTANCE, 0,
+		kmem_cache_create("xe_sched_job_parallel", size, 0,
 				  SLAB_HWCACHE_ALIGN, NULL);
 	if (!xe_sched_job_parallel_slab) {
 		kmem_cache_destroy(xe_sched_job_slab);
@@ -84,7 +87,7 @@ static void xe_sched_job_free_fences(struct xe_sched_job *job)
 {
 	int i;
 
-	for (i = 0; i < job->q->width; ++i) {
+	for (i = 0; !job->is_pt_job && i < job->q->width; ++i) {
 		struct xe_job_ptrs *ptrs = &job->ptrs[i];
 
 		if (ptrs->lrc_fence)
@@ -118,33 +121,44 @@ struct xe_sched_job *xe_sched_job_create(struct xe_exec_queue *q,
 	if (err)
 		goto err_free;
 
-	for (i = 0; i < q->width; ++i) {
-		struct dma_fence *fence = xe_lrc_alloc_seqno_fence();
-		struct dma_fence_chain *chain;
-
-		if (IS_ERR(fence)) {
-			err = PTR_ERR(fence);
-			goto err_sched_job;
-		}
-		job->ptrs[i].lrc_fence = fence;
-
-		if (i + 1 == q->width)
-			continue;
-
-		chain = dma_fence_chain_alloc();
-		if (!chain) {
+	if (!batch_addr) {
+		job->fence = dma_fence_allocate_private_stub(ktime_get());
+		if (!job->fence) {
 			err = -ENOMEM;
 			goto err_sched_job;
 		}
-		job->ptrs[i].chain_fence = chain;
+		job->is_pt_job = true;
+	} else {
+		for (i = 0; i < q->width; ++i) {
+			struct dma_fence *fence = xe_lrc_alloc_seqno_fence();
+			struct dma_fence_chain *chain;
+
+			if (IS_ERR(fence)) {
+				err = PTR_ERR(fence);
+				goto err_sched_job;
+			}
+			job->ptrs[i].lrc_fence = fence;
+
+			if (i + 1 == q->width)
+				continue;
+
+			chain = dma_fence_chain_alloc();
+			if (!chain) {
+				err = -ENOMEM;
+				goto err_sched_job;
+			}
+			job->ptrs[i].chain_fence = chain;
+		}
 	}
 
-	width = q->width;
-	if (is_migration)
-		width = 2;
+	if (batch_addr) {
+		width = q->width;
+		if (is_migration)
+			width = 2;
 
-	for (i = 0; i < width; ++i)
-		job->ptrs[i].batch_addr = batch_addr[i];
+		for (i = 0; i < width; ++i)
+			job->ptrs[i].batch_addr = batch_addr[i];
+	}
 
 	xe_pm_runtime_get_noresume(job_to_xe(job));
 	trace_xe_sched_job_create(job);
@@ -243,7 +257,7 @@ bool xe_sched_job_completed(struct xe_sched_job *job)
 void xe_sched_job_arm(struct xe_sched_job *job)
 {
 	struct xe_exec_queue *q = job->q;
-	struct dma_fence *fence, *prev;
+	struct dma_fence *fence = job->fence, *prev;
 	struct xe_vm *vm = q->vm;
 	u64 seqno = 0;
 	int i;
@@ -263,6 +277,9 @@ void xe_sched_job_arm(struct xe_sched_job *job)
 		job->ring_ops_flush_tlb = true;
 	}
 
+	if (job->is_pt_job)
+		goto arm;
+
 	/* Arm the pre-allocated fences */
 	for (i = 0; i < q->width; prev = fence, ++i) {
 		struct dma_fence_chain *chain;
@@ -283,6 +300,7 @@ void xe_sched_job_arm(struct xe_sched_job *job)
 		fence = &chain->base;
 	}
 
+arm:
 	job->fence = dma_fence_get(fence);	/* Pairs with put in scheduler */
 	drm_sched_job_arm(&job->drm);
 }
diff --git a/drivers/gpu/drm/xe/xe_sched_job_types.h b/drivers/gpu/drm/xe/xe_sched_job_types.h
index dbf260dded8d..79a459f2a0a8 100644
--- a/drivers/gpu/drm/xe/xe_sched_job_types.h
+++ b/drivers/gpu/drm/xe/xe_sched_job_types.h
@@ -10,10 +10,29 @@
 
 #include <drm/gpu_scheduler.h>
 
-struct xe_exec_queue;
 struct dma_fence;
 struct dma_fence_chain;
 
+struct xe_exec_queue;
+struct xe_migrate_pt_update_ops;
+struct xe_pt_job_ops;
+struct xe_tile;
+struct xe_vm;
+
+/**
+ * struct xe_pt_update_args - PT update arguments
+ */
+struct xe_pt_update_args {
+	/** @vm: VM */
+	struct xe_vm *vm;
+	/** @tile: Tile */
+	struct xe_tile *tile;
+	/** @ops: Migrate PT update ops */
+	const struct xe_migrate_pt_update_ops *ops;
+	/** @pt_job_ops: PT update ops */
+	struct xe_pt_job_ops *pt_job_ops;
+};
+
 /**
  * struct xe_job_ptrs - Per hw engine instance data
  */
@@ -58,8 +77,14 @@ struct xe_sched_job {
 	bool ring_ops_flush_tlb;
 	/** @ggtt: mapped in ggtt. */
 	bool ggtt;
-	/** @ptrs: per instance pointers. */
-	struct xe_job_ptrs ptrs[];
+	/** @is_pt_job: is a PT job */
+	bool is_pt_job;
+	union {
+		/** @ptrs: per instance pointers. */
+		DECLARE_FLEX_ARRAY(struct xe_job_ptrs, ptrs);
+		/** @pt_update: PT update arguments */
+		DECLARE_FLEX_ARRAY(struct xe_pt_update_args, pt_update);
+	};
 };
 
 struct xe_sched_job_snapshot {
diff --git a/drivers/gpu/drm/xe/xe_vm.c b/drivers/gpu/drm/xe/xe_vm.c
index 18f967ce1f1a..6fc01fdd7286 100644
--- a/drivers/gpu/drm/xe/xe_vm.c
+++ b/drivers/gpu/drm/xe/xe_vm.c
@@ -780,6 +780,19 @@ int xe_vm_userptr_check_repin(struct xe_vm *vm)
 		list_empty_careful(&vm->userptr.invalidated)) ? 0 : -EAGAIN;
 }
 
+static void xe_vma_ops_init(struct xe_vma_ops *vops, struct xe_vm *vm,
+			    struct xe_exec_queue *q,
+			    struct xe_sync_entry *syncs, u32 num_syncs)
+{
+	memset(vops, 0, sizeof(*vops));
+	INIT_LIST_HEAD(&vops->list);
+	vops->vm = vm;
+	vops->q = q;
+	vops->syncs = syncs;
+	vops->num_syncs = num_syncs;
+	vops->flags = 0;
+}
+
 static int xe_vma_ops_alloc(struct xe_vma_ops *vops, bool array_of_binds)
 {
 	int i;
@@ -788,11 +801,9 @@ static int xe_vma_ops_alloc(struct xe_vma_ops *vops, bool array_of_binds)
 		if (!vops->pt_update_ops[i].num_ops)
 			continue;
 
-		vops->pt_update_ops[i].ops =
-			kmalloc_array(vops->pt_update_ops[i].num_ops,
-				      sizeof(*vops->pt_update_ops[i].ops),
-				      GFP_KERNEL | __GFP_RETRY_MAYFAIL | __GFP_NOWARN);
-		if (!vops->pt_update_ops[i].ops)
+		vops->pt_update_ops[i].pt_job_ops =
+			xe_pt_job_ops_alloc(vops->pt_update_ops[i].num_ops);
+		if (!vops->pt_update_ops[i].pt_job_ops)
 			return array_of_binds ? -ENOBUFS : -ENOMEM;
 	}
 
@@ -828,7 +839,7 @@ static void xe_vma_ops_fini(struct xe_vma_ops *vops)
 	xe_vma_svm_prefetch_ops_fini(vops);
 
 	for (i = 0; i < XE_MAX_TILES_PER_DEVICE; ++i)
-		kfree(vops->pt_update_ops[i].ops);
+		xe_pt_job_ops_put(vops->pt_update_ops[i].pt_job_ops);
 }
 
 static void xe_vma_ops_incr_pt_update_ops(struct xe_vma_ops *vops, u8 tile_mask, int inc_val)
@@ -877,9 +888,6 @@ static int xe_vm_ops_add_rebind(struct xe_vma_ops *vops, struct xe_vma *vma,
 
 static struct dma_fence *ops_execute(struct xe_vm *vm,
 				     struct xe_vma_ops *vops);
-static void xe_vma_ops_init(struct xe_vma_ops *vops, struct xe_vm *vm,
-			    struct xe_exec_queue *q,
-			    struct xe_sync_entry *syncs, u32 num_syncs);
 
 int xe_vm_rebind(struct xe_vm *vm, bool rebind_worker)
 {
@@ -3163,13 +3171,6 @@ static struct dma_fence *ops_execute(struct xe_vm *vm,
 		fence = &cf->base;
 	}
 
-	for_each_tile(tile, vm->xe, id) {
-		if (!vops->pt_update_ops[id].num_ops)
-			continue;
-
-		xe_pt_update_ops_fini(tile, vops);
-	}
-
 	return fence;
 
 err_out:
@@ -3447,19 +3448,6 @@ static int vm_bind_ioctl_signal_fences(struct xe_vm *vm,
 	return err;
 }
 
-static void xe_vma_ops_init(struct xe_vma_ops *vops, struct xe_vm *vm,
-			    struct xe_exec_queue *q,
-			    struct xe_sync_entry *syncs, u32 num_syncs)
-{
-	memset(vops, 0, sizeof(*vops));
-	INIT_LIST_HEAD(&vops->list);
-	vops->vm = vm;
-	vops->q = q;
-	vops->syncs = syncs;
-	vops->num_syncs = num_syncs;
-	vops->flags = 0;
-}
-
 static int xe_vm_bind_ioctl_validate_bo(struct xe_device *xe, struct xe_bo *bo,
 					u64 addr, u64 range, u64 obj_offset,
 					u16 pat_index, u32 op, u32 bind_flags)
-- 
2.34.1
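The lifetime rule introduced by this patch (the VMA ops hold one reference to the PT job ops, each bind job takes another, and whichever side drops the last reference frees the op array and the deferred PT BOs) can be modelled in plain C. This is a hypothetical userspace sketch, not driver code: `job_ops`, `job_ops_alloc()`, `job_ops_get()` and `job_ops_put()` are illustrative names, a plain `int` stands in for `struct kref`, a singly linked list stands in for `llist`, and no locking or atomics are modelled.

```c
#include <assert.h>
#include <stdlib.h>

/* Hypothetical stand-in for a PT BO queued for deferred destruction. */
struct deferred_bo {
	struct deferred_bo *next;
};

/* Hypothetical userspace model of struct xe_pt_job_ops. */
struct job_ops {
	unsigned int current_op;
	int refcount;                 /* stands in for struct kref */
	struct deferred_bo *deferred; /* stands in for struct llist_head */
	int *ops;                     /* stands in for the op array */
};

/* Counters so the test can observe when things are released. */
static int bos_freed;
static int ops_freed;

static struct job_ops *job_ops_alloc(unsigned int num_ops)
{
	struct job_ops *j = malloc(sizeof(*j));

	if (!j)
		return NULL;
	j->ops = calloc(num_ops, sizeof(*j->ops));
	if (!j->ops) {
		free(j);
		return NULL;
	}
	j->current_op = 0;
	j->refcount = 1;	/* the allocator (VMA ops) owns the first ref */
	j->deferred = NULL;
	return j;
}

static struct job_ops *job_ops_get(struct job_ops *j)
{
	if (j)
		j->refcount++;	/* e.g. a bind job keeping the ops alive */
	return j;
}

/* Queue a BO that may only be freed once every job is done with the ops. */
static void job_ops_defer_bo(struct job_ops *j, struct deferred_bo *bo)
{
	bo->next = j->deferred;
	j->deferred = bo;
}

static void job_ops_destroy(struct job_ops *j)
{
	struct deferred_bo *bo, *next;

	for (bo = j->deferred; bo; bo = next) {
		next = bo->next;
		free(bo);
		bos_freed++;
	}
	free(j->ops);
	free(j);
	ops_freed++;
}

static void job_ops_put(struct job_ops *j)
{
	if (!j)
		return;
	if (--j->refcount == 0)
		job_ops_destroy(j);
}
```

In this model the order of the two puts does not matter: the ops and any deferred BOs survive until both the ioctl path and the job have dropped their references, which mirrors why the patch moves the deferred list out of the on-stack `xe_vm_pgtable_update_ops`.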



* [PATCH 04/15] drm/xe: Remove unused arguments from xe_migrate_pt_update_ops
@ 2025-06-05 15:32 Matthew Brost
From: Matthew Brost @ 2025-06-05 15:32 UTC
  To: intel-xe; +Cc: francois.dugast, thomas.hellstrom, himal.prasad.ghimiray

Both the populate and clear vfuncs take an unused void * argument; remove it.

Signed-off-by: Matthew Brost <matthew.brost@intel.com>
---
 drivers/gpu/drm/xe/xe_migrate.c |  4 ++--
 drivers/gpu/drm/xe/xe_migrate.h |  7 ++-----
 drivers/gpu/drm/xe/xe_pt.c      | 29 ++++++++++++++---------------
 3 files changed, 18 insertions(+), 22 deletions(-)

diff --git a/drivers/gpu/drm/xe/xe_migrate.c b/drivers/gpu/drm/xe/xe_migrate.c
index e444f3fae97c..9b17feb8bae6 100644
--- a/drivers/gpu/drm/xe/xe_migrate.c
+++ b/drivers/gpu/drm/xe/xe_migrate.c
@@ -1218,11 +1218,11 @@ __xe_migrate_update_pgtables_cpu(struct xe_vm *vm, struct xe_tile *tile,
 
 			if (pt_op->bind)
 				ops->populate(tile, &update->pt_bo->vmap,
-					      NULL, update->ofs, update->qwords,
+					      update->ofs, update->qwords,
 					      update);
 			else
 				ops->clear(vm, tile, &update->pt_bo->vmap,
-					   NULL, update->ofs, update->qwords,
+					   update->ofs, update->qwords,
 					   update);
 		}
 	}
diff --git a/drivers/gpu/drm/xe/xe_migrate.h b/drivers/gpu/drm/xe/xe_migrate.h
index 0986ffdd8d9a..b78bdd11d496 100644
--- a/drivers/gpu/drm/xe/xe_migrate.h
+++ b/drivers/gpu/drm/xe/xe_migrate.h
@@ -34,7 +34,6 @@ struct xe_migrate_pt_update_ops {
 	 * @populate: Populate a command buffer or page-table with ptes.
 	 * @tile: The tile for the current operation.
 	 * @map: struct iosys_map into the memory to be populated.
-	 * @pos: If @map is NULL, map into the memory to be populated.
 	 * @ofs: qword offset into @map, unused if @map is NULL.
 	 * @num_qwords: Number of qwords to write.
 	 * @update: Information about the PTEs to be inserted.
@@ -44,14 +43,13 @@ struct xe_migrate_pt_update_ops {
 	 * page-tables with PTEs.
 	 */
 	void (*populate)(struct xe_tile *tile, struct iosys_map *map,
-			 void *pos, u32 ofs, u32 num_qwords,
+			 u32 ofs, u32 num_qwords,
 			 const struct xe_vm_pgtable_update *update);
 	/**
 	 * @clear: Clear a command buffer or page-table with ptes.
 	 * @vm: VM being updated
 	 * @tile: The tile for the current operation.
 	 * @map: struct iosys_map into the memory to be populated.
-	 * @pos: If @map is NULL, map into the memory to be populated.
 	 * @ofs: qword offset into @map, unused if @map is NULL.
 	 * @num_qwords: Number of qwords to write.
 	 * @update: Information about the PTEs to be inserted.
@@ -61,8 +59,7 @@ struct xe_migrate_pt_update_ops {
 	 * page-tables with PTEs.
 	 */
 	void (*clear)(struct xe_vm *vm, struct xe_tile *tile,
-		      struct iosys_map *map, void *pos, u32 ofs,
-		      u32 num_qwords,
+		      struct iosys_map *map, u32 ofs, u32 num_qwords,
 		      const struct xe_vm_pgtable_update *update);
 
 	/**
diff --git a/drivers/gpu/drm/xe/xe_pt.c b/drivers/gpu/drm/xe/xe_pt.c
index 1ad31f444b79..5132e8eb7562 100644
--- a/drivers/gpu/drm/xe/xe_pt.c
+++ b/drivers/gpu/drm/xe/xe_pt.c
@@ -976,20 +976,18 @@ bool xe_pt_zap_ptes_range(struct xe_tile *tile, struct xe_vm *vm,
 
 static void
 xe_vm_populate_pgtable(struct xe_tile *tile, struct iosys_map *map,
-		       void *data, u32 qword_ofs, u32 num_qwords,
+		       u32 qword_ofs, u32 num_qwords,
 		       const struct xe_vm_pgtable_update *update)
 {
 	struct xe_pt_entry *ptes = update->pt_entries;
-	u64 *ptr = data;
 	u32 i;
 
-	for (i = 0; i < num_qwords; i++) {
-		if (map)
-			xe_map_wr(tile_to_xe(tile), map, (qword_ofs + i) *
-				  sizeof(u64), u64, ptes[i].pte);
-		else
-			ptr[i] = ptes[i].pte;
-	}
+	xe_assert(tile_to_xe(tile), map);
+	xe_assert(tile_to_xe(tile), !iosys_map_is_null(map));
+
+	for (i = 0; i < num_qwords; i++)
+		xe_map_wr(tile_to_xe(tile), map, (qword_ofs + i) *
+			  sizeof(u64), u64, ptes[i].pte);
 }
 
 static void xe_pt_cancel_bind(struct xe_vma *vma,
@@ -1748,22 +1746,23 @@ static unsigned int xe_pt_stage_unbind(struct xe_tile *tile,
 
 static void
 xe_migrate_clear_pgtable_callback(struct xe_vm *vm, struct xe_tile *tile,
-				  struct iosys_map *map, void *ptr,
-				  u32 qword_ofs, u32 num_qwords,
+				  struct iosys_map *map, u32 qword_ofs,
+				  u32 num_qwords,
 				  const struct xe_vm_pgtable_update *update)
 {
 	u64 empty = __xe_pt_empty_pte(tile, vm, update->level);
 	int i;
 
-	if (map && map->is_iomem)
+	xe_assert(vm->xe, map);
+	xe_assert(vm->xe, !iosys_map_is_null(map));
+
+	if (map->is_iomem)
 		for (i = 0; i < num_qwords; ++i)
 			xe_map_wr(tile_to_xe(tile), map, (qword_ofs + i) *
 				  sizeof(u64), u64, empty);
-	else if (map)
+	else
 		memset64(map->vaddr + qword_ofs * sizeof(u64), empty,
 			 num_qwords);
-	else
-		memset64(ptr, empty, num_qwords);
 }
 
 static void xe_pt_abort_unbind(struct xe_vma *vma,
-- 
2.34.1
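After this patch both callbacks always write through a mapping and address the page table in qwords: entry `qword_ofs + i` lives at byte offset `(qword_ofs + i) * sizeof(u64)`. The sketch below models that addressing in userspace. It is hypothetical: `fake_map` stands in for `struct iosys_map` (system-memory case only), `map_wr_u64()` stands in for `xe_map_wr()`, and the iomem vs. `memset64()` split in the clear callback is not modelled beyond asserting that a mapping exists.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical stand-in for struct iosys_map (system memory only). */
struct fake_map {
	void *vaddr;
};

/* Write one qword at a qword offset, like xe_map_wr(..., u64, val). */
static void map_wr_u64(struct fake_map *map, uint32_t qword_ofs, uint64_t val)
{
	memcpy((char *)map->vaddr + (size_t)qword_ofs * sizeof(uint64_t),
	       &val, sizeof(val));
}

/* Model of the populate callback: copy new PTEs into the page table. */
static void populate_qwords(struct fake_map *map, uint32_t qword_ofs,
			    uint32_t num_qwords, const uint64_t *ptes)
{
	assert(map && map->vaddr);	/* a NULL map is no longer accepted */
	for (uint32_t i = 0; i < num_qwords; i++)
		map_wr_u64(map, qword_ofs + i, ptes[i]);
}

/* Model of the clear callback: fill a range with the "empty" PTE value. */
static void clear_qwords(struct fake_map *map, uint32_t qword_ofs,
			 uint32_t num_qwords, uint64_t empty)
{
	assert(map && map->vaddr);
	for (uint32_t i = 0; i < num_qwords; i++)
		map_wr_u64(map, qword_ofs + i, empty);
}
```

Dropping the `void *` path means there is a single addressing scheme to reason about: every write lands in the PT BO's own mapping, which is what the CPU bind path in the previous patch relies on.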



* [PATCH 05/15] drm/xe: Don't use migrate exec queue for page fault binds
@ 2025-06-05 15:32 Matthew Brost
From: Matthew Brost @ 2025-06-05 15:32 UTC
  To: intel-xe; +Cc: francois.dugast, thomas.hellstrom, himal.prasad.ghimiray

Now that the CPU is always used for binds, even for binds issued as jobs,
CPU bind jobs can overtake GPU jobs on the same exec queue, resulting in
dma-fences signaling out of order. Use a dedicated exec queue for binds
issued from page faults to avoid these ordering issues and to avoid
blocking kernel binds on unrelated copies / clears.

Signed-off-by: Matthew Brost <matthew.brost@intel.com>
---
 drivers/gpu/drm/xe/xe_migrate.c | 27 +++++++++++++++++++++------
 drivers/gpu/drm/xe/xe_migrate.h |  2 +-
 drivers/gpu/drm/xe/xe_vm.c      | 17 +++++++++++------
 3 files changed, 33 insertions(+), 13 deletions(-)

diff --git a/drivers/gpu/drm/xe/xe_migrate.c b/drivers/gpu/drm/xe/xe_migrate.c
index 9b17feb8bae6..6b6dff9d4aaa 100644
--- a/drivers/gpu/drm/xe/xe_migrate.c
+++ b/drivers/gpu/drm/xe/xe_migrate.c
@@ -41,6 +41,8 @@
 struct xe_migrate {
 	/** @q: Default exec queue used for migration */
 	struct xe_exec_queue *q;
+	/** @bind_q: Default exec queue used for binds */
+	struct xe_exec_queue *bind_q;
 	/** @tile: Backpointer to the tile this struct xe_migrate belongs to. */
 	struct xe_tile *tile;
 	/** @job_mutex: Timeline mutex for @eng. */
@@ -79,16 +81,16 @@ struct xe_migrate {
 #define MAX_PTE_PER_SDI 0x1FE
 
 /**
- * xe_tile_migrate_exec_queue() - Get this tile's migrate exec queue.
+ * xe_tile_migrate_bind_exec_queue() - Get this tile's migrate bind exec queue.
  * @tile: The tile.
  *
- * Returns the default migrate exec queue of this tile.
+ * Returns the default migrate bind exec queue of this tile.
  *
- * Return: The default migrate exec queue
+ * Return: The default migrate bind exec queue
  */
-struct xe_exec_queue *xe_tile_migrate_exec_queue(struct xe_tile *tile)
+struct xe_exec_queue *xe_tile_migrate_bind_exec_queue(struct xe_tile *tile)
 {
-	return tile->migrate->q;
+	return tile->migrate->bind_q;
 }
 
 static void xe_migrate_fini(void *arg)
@@ -104,6 +106,8 @@ static void xe_migrate_fini(void *arg)
 	mutex_destroy(&m->job_mutex);
 	xe_vm_close_and_put(m->q->vm);
 	xe_exec_queue_put(m->q);
+	if (m->bind_q)
+		xe_exec_queue_put(m->bind_q);
 }
 
 static u64 xe_migrate_vm_addr(u64 slot, u32 level)
@@ -410,6 +414,15 @@ struct xe_migrate *xe_migrate_init(struct xe_tile *tile)
 		if (!hwe || !logical_mask)
 			return ERR_PTR(-EINVAL);
 
+		m->bind_q = xe_exec_queue_create(xe, vm, logical_mask, 1, hwe,
+						 EXEC_QUEUE_FLAG_KERNEL |
+						 EXEC_QUEUE_FLAG_PERMANENT |
+						 EXEC_QUEUE_FLAG_HIGH_PRIORITY, 0);
+		if (IS_ERR(m->bind_q)) {
+			xe_vm_close_and_put(vm);
+			return ERR_CAST(m->bind_q);
+		}
+
 		/*
 		 * XXX: Currently only reserving 1 (likely slow) BCS instance on
 		 * PVC, may want to revisit if performance is needed.
@@ -425,6 +438,8 @@ struct xe_migrate *xe_migrate_init(struct xe_tile *tile)
 						  EXEC_QUEUE_FLAG_PERMANENT, 0);
 	}
 	if (IS_ERR(m->q)) {
+		if (m->bind_q)
+			xe_exec_queue_put(m->bind_q);
 		xe_vm_close_and_put(vm);
 		return ERR_CAST(m->q);
 	}
@@ -1270,7 +1285,7 @@ __xe_migrate_update_pgtables(struct xe_migrate *m,
 	struct xe_tile *tile = m->tile;
 	struct xe_sched_job *job;
 	struct dma_fence *fence;
-	bool is_migrate = pt_update_ops->q == m->q;
+	bool is_migrate = pt_update_ops->q == m->bind_q;
 	int err;
 
 	job = xe_sched_job_create(pt_update_ops->q, NULL);
diff --git a/drivers/gpu/drm/xe/xe_migrate.h b/drivers/gpu/drm/xe/xe_migrate.h
index b78bdd11d496..3131875341c9 100644
--- a/drivers/gpu/drm/xe/xe_migrate.h
+++ b/drivers/gpu/drm/xe/xe_migrate.h
@@ -134,5 +134,5 @@ xe_migrate_update_pgtables(struct xe_migrate *m,
 
 void xe_migrate_wait(struct xe_migrate *m);
 
-struct xe_exec_queue *xe_tile_migrate_exec_queue(struct xe_tile *tile);
+struct xe_exec_queue *xe_tile_migrate_bind_exec_queue(struct xe_tile *tile);
 #endif
diff --git a/drivers/gpu/drm/xe/xe_vm.c b/drivers/gpu/drm/xe/xe_vm.c
index 6fc01fdd7286..0285938a4bb2 100644
--- a/drivers/gpu/drm/xe/xe_vm.c
+++ b/drivers/gpu/drm/xe/xe_vm.c
@@ -895,7 +895,9 @@ int xe_vm_rebind(struct xe_vm *vm, bool rebind_worker)
 	struct xe_vma *vma, *next;
 	struct xe_vma_ops vops;
 	struct xe_vma_op *op, *next_op;
-	int err, i;
+	struct xe_tile *tile;
+	u8 id;
+	int err;
 
 	lockdep_assert_held(&vm->lock);
 	if ((xe_vm_in_lr_mode(vm) && !rebind_worker) ||
@@ -903,8 +905,11 @@ int xe_vm_rebind(struct xe_vm *vm, bool rebind_worker)
 		return 0;
 
 	xe_vma_ops_init(&vops, vm, NULL, NULL, 0);
-	for (i = 0; i < XE_MAX_TILES_PER_DEVICE; ++i)
-		vops.pt_update_ops[i].wait_vm_bookkeep = true;
+	for_each_tile(tile, vm->xe, id) {
+		vops.pt_update_ops[id].wait_vm_bookkeep = true;
+		vops.pt_update_ops[id].q =
+			xe_tile_migrate_bind_exec_queue(tile);
+	}
 
 	xe_vm_assert_held(vm);
 	list_for_each_entry(vma, &vm->rebind_list, combined_links.rebind) {
@@ -961,7 +966,7 @@ struct dma_fence *xe_vma_rebind(struct xe_vm *vm, struct xe_vma *vma, u8 tile_ma
 	for_each_tile(tile, vm->xe, id) {
 		vops.pt_update_ops[id].wait_vm_bookkeep = true;
 		vops.pt_update_ops[tile->id].q =
-			xe_tile_migrate_exec_queue(tile);
+			xe_tile_migrate_bind_exec_queue(tile);
 	}
 
 	err = xe_vm_ops_add_rebind(&vops, vma, tile_mask);
@@ -1051,7 +1056,7 @@ struct dma_fence *xe_vm_range_rebind(struct xe_vm *vm,
 	for_each_tile(tile, vm->xe, id) {
 		vops.pt_update_ops[id].wait_vm_bookkeep = true;
 		vops.pt_update_ops[tile->id].q =
-			xe_tile_migrate_exec_queue(tile);
+			xe_tile_migrate_bind_exec_queue(tile);
 	}
 
 	err = xe_vm_ops_add_range_rebind(&vops, vma, range, tile_mask);
@@ -1134,7 +1139,7 @@ struct dma_fence *xe_vm_range_unbind(struct xe_vm *vm,
 	for_each_tile(tile, vm->xe, id) {
 		vops.pt_update_ops[id].wait_vm_bookkeep = true;
 		vops.pt_update_ops[tile->id].q =
-			xe_tile_migrate_exec_queue(tile);
+			xe_tile_migrate_bind_exec_queue(tile);
 	}
 
 	err = xe_vm_ops_add_range_unbind(&vops, range);
-- 
2.34.1


* [PATCH 06/15] drm/xe: Add xe_hw_engine_write_ring_tail
  2025-06-05 15:32 [PATCH 00/15] CPU binds and ULLS on migration queue Matthew Brost
                   ` (4 preceding siblings ...)
  2025-06-05 15:32 ` [PATCH 05/15] drm/xe: Don't use migrate exec queue for page fault binds Matthew Brost
@ 2025-06-05 15:32 ` Matthew Brost
  2025-06-05 15:32 ` [PATCH 07/15] drm/xe: Add ULLS support to LRC Matthew Brost
                   ` (11 subsequent siblings)
  17 siblings, 0 replies; 21+ messages in thread
From: Matthew Brost @ 2025-06-05 15:32 UTC (permalink / raw)
  To: intel-xe; +Cc: francois.dugast, thomas.hellstrom, himal.prasad.ghimiray

ULLS for migration jobs needs to set the hardware engine ring tail
directly; add a function to support this.

Signed-off-by: Matthew Brost <matthew.brost@intel.com>
---
 drivers/gpu/drm/xe/xe_hw_engine.c | 10 ++++++++++
 1 file changed, 10 insertions(+)

diff --git a/drivers/gpu/drm/xe/xe_hw_engine.c b/drivers/gpu/drm/xe/xe_hw_engine.c
index 3439c8522d01..5ab024743504 100644
--- a/drivers/gpu/drm/xe/xe_hw_engine.c
+++ b/drivers/gpu/drm/xe/xe_hw_engine.c
@@ -304,6 +304,16 @@ void xe_hw_engine_mmio_write32(struct xe_hw_engine *hwe,
 	xe_mmio_write32(&hwe->gt->mmio, reg, val);
 }
 
+/**
+ * xe_hw_engine_write_ring_tail() - Write ring tail
+ * @hwe: engine
+ * @val: desired 32-bit value to write
+ */
+void xe_hw_engine_write_ring_tail(struct xe_hw_engine *hwe, u32 val)
+{
+	xe_hw_engine_mmio_write32(hwe, RING_TAIL(0), val);
+}
+
 /**
  * xe_hw_engine_mmio_read32() - Read engine register
  * @hwe: engine
-- 
2.34.1


* [PATCH 07/15] drm/xe: Add ULLS support to LRC
  2025-06-05 15:32 [PATCH 00/15] CPU binds and ULLS on migration queue Matthew Brost
                   ` (5 preceding siblings ...)
  2025-06-05 15:32 ` [PATCH 06/15] drm/xe: Add xe_hw_engine_write_ring_tail Matthew Brost
@ 2025-06-05 15:32 ` Matthew Brost
  2025-06-05 15:32 ` [PATCH 08/15] drm/xe: Add ULLS migration job support to migration layer Matthew Brost
                   ` (10 subsequent siblings)
  17 siblings, 0 replies; 21+ messages in thread
From: Matthew Brost @ 2025-06-05 15:32 UTC (permalink / raw)
  To: intel-xe; +Cc: francois.dugast, thomas.hellstrom, himal.prasad.ghimiray

Define the memory layout for ULLS semaphores stored in LRC memory. Add
support functions to return a semaphore's GGTT address and to set a
semaphore based on a job's seqno.
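A quick userspace check of the PPHWSP layout, as a sketch. It reuses the offsets and the 64-entry count from this patch and assumes LRC_PPHWSP_SIZE is 4 KiB (it is defined elsewhere in xe_lrc.c, not in this diff). `ulls_semaphore_offset()` mirrors the `semaphore_offset()` macro; the other helpers are hypothetical. The last helper shows one plausible reason for the power-of-two requirement: when the u32 seqno wraps, consecutive seqnos still map to consecutive slots, so jobs close together in seqno never share a slot.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

#define LRC_PPHWSP_SIZE			4096u	/* assumption: SZ_4K */
#define LRC_ENGINE_ID_PPHWSP_OFFSET	1984u
#define LRC_ULLS_PPHWSP_OFFSET		2048u
#define ULLS_SEMAPHORE_COUNT		64u	/* must be a power of two */

/* Byte offset of the semaphore slot used by @seqno (like semaphore_offset()). */
static uint32_t ulls_semaphore_offset(uint32_t seqno)
{
	return (uint32_t)sizeof(uint32_t) * (seqno % ULLS_SEMAPHORE_COUNT);
}

/* True if the whole semaphore array fits in the PPHWSP after its base. */
static bool ulls_layout_fits(void)
{
	uint32_t last = ulls_semaphore_offset(ULLS_SEMAPHORE_COUNT - 1);

	return last + sizeof(uint32_t) <=
	       LRC_PPHWSP_SIZE - LRC_ULLS_PPHWSP_OFFSET;
}

/* The engine ID dword must end before the ULLS area starts. */
static bool engine_id_below_ulls(void)
{
	return LRC_ENGINE_ID_PPHWSP_OFFSET + sizeof(uint32_t) <=
	       LRC_ULLS_PPHWSP_OFFSET;
}

/* Does the slot index stay consecutive when a u32 seqno wraps to 0? */
static bool slots_consecutive_across_wrap(uint32_t count)
{
	uint32_t seqno = UINT32_MAX;
	uint32_t before = seqno % count;

	seqno++;			/* wraps to 0 */
	return (before + 1) % count == seqno % count;
}
```

The 64 semaphores occupy bytes 2048 to 2303 of the PPHWSP, so the old engine ID offset (2096) would have landed inside the ULLS area, which is presumably why it moves to 1984.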

Signed-off-by: Matthew Brost <matthew.brost@intel.com>
---
 drivers/gpu/drm/xe/xe_hw_engine.h |  1 +
 drivers/gpu/drm/xe/xe_lrc.c       | 51 ++++++++++++++++++++++++++++++-
 drivers/gpu/drm/xe/xe_lrc.h       |  3 ++
 drivers/gpu/drm/xe/xe_lrc_types.h |  2 ++
 4 files changed, 56 insertions(+), 1 deletion(-)

diff --git a/drivers/gpu/drm/xe/xe_hw_engine.h b/drivers/gpu/drm/xe/xe_hw_engine.h
index 6b5f9fa2a594..b93c3eabca06 100644
--- a/drivers/gpu/drm/xe/xe_hw_engine.h
+++ b/drivers/gpu/drm/xe/xe_hw_engine.h
@@ -78,5 +78,6 @@ enum xe_force_wake_domains xe_hw_engine_to_fw_domain(struct xe_hw_engine *hwe);
 
 void xe_hw_engine_mmio_write32(struct xe_hw_engine *hwe, struct xe_reg reg, u32 val);
 u32 xe_hw_engine_mmio_read32(struct xe_hw_engine *hwe, struct xe_reg reg);
+void xe_hw_engine_write_ring_tail(struct xe_hw_engine *hwe, u32 val);
 
 #endif
diff --git a/drivers/gpu/drm/xe/xe_lrc.c b/drivers/gpu/drm/xe/xe_lrc.c
index 63d74e27f54c..75344b89fe4a 100644
--- a/drivers/gpu/drm/xe/xe_lrc.c
+++ b/drivers/gpu/drm/xe/xe_lrc.c
@@ -654,8 +654,9 @@ u32 xe_lrc_pphwsp_offset(struct xe_lrc *lrc)
 #define LRC_SEQNO_PPHWSP_OFFSET 512
 #define LRC_START_SEQNO_PPHWSP_OFFSET (LRC_SEQNO_PPHWSP_OFFSET + 8)
 #define LRC_CTX_JOB_TIMESTAMP_OFFSET (LRC_START_SEQNO_PPHWSP_OFFSET + 8)
+#define LRC_ENGINE_ID_PPHWSP_OFFSET 1984
 #define LRC_PARALLEL_PPHWSP_OFFSET 2048
-#define LRC_ENGINE_ID_PPHWSP_OFFSET 2096
+#define LRC_ULLS_PPHWSP_OFFSET 2048	/* Mutual exclusive with parallel */
 
 u32 xe_lrc_regs_offset(struct xe_lrc *lrc)
 {
@@ -704,6 +705,12 @@ static inline u32 __xe_lrc_engine_id_offset(struct xe_lrc *lrc)
 	return xe_lrc_pphwsp_offset(lrc) + LRC_ENGINE_ID_PPHWSP_OFFSET;
 }
 
+static inline u32 __xe_lrc_ulls_offset(struct xe_lrc *lrc)
+{
+	/* The ULLS semaphores are stored in the driver-defined portion of PPHWSP */
+	return xe_lrc_pphwsp_offset(lrc) + LRC_ULLS_PPHWSP_OFFSET;
+}
+
 static u32 __xe_lrc_ctx_timestamp_offset(struct xe_lrc *lrc)
 {
 	return __xe_lrc_regs_offset(lrc) + CTX_TIMESTAMP * sizeof(u32);
@@ -743,6 +750,7 @@ DECL_MAP_ADDR_HELPERS(ctx_job_timestamp)
 DECL_MAP_ADDR_HELPERS(ctx_timestamp)
 DECL_MAP_ADDR_HELPERS(ctx_timestamp_udw)
 DECL_MAP_ADDR_HELPERS(parallel)
+DECL_MAP_ADDR_HELPERS(ulls)
 DECL_MAP_ADDR_HELPERS(indirect_ring)
 DECL_MAP_ADDR_HELPERS(engine_id)
 
@@ -1361,6 +1369,47 @@ static u32 xe_lrc_engine_id(struct xe_lrc *lrc)
 	return xe_map_read32(xe, &map);
 }
 
+#define semaphore_offset(seqno) \
+	(sizeof(u32) * ((seqno) % LRC_MIGRATION_ULLS_SEMAPORE_COUNT))
+
+/**
+ * xe_lrc_ulls_semaphore_ggtt_addr() - ULLS semaphore GGTT address
+ * @lrc: Pointer to the lrc.
+ * @seqno: seqno of current job.
+ *
+ * Calculate ULLS semaphore GGTT address based on input seqno
+ *
+ * Returns: ULLS semaphore GGTT address
+ */
+u32 xe_lrc_ulls_semaphore_ggtt_addr(struct xe_lrc *lrc, u32 seqno)
+{
+	xe_assert(lrc_to_xe(lrc), semaphore_offset(seqno) <
+		  LRC_PPHWSP_SIZE - LRC_ULLS_PPHWSP_OFFSET);
+
+	return __xe_lrc_ulls_ggtt_addr(lrc) + semaphore_offset(seqno);
+}
+
+/**
+ * xe_lrc_set_ulls_semaphore() - Set ULLS semaphore
+ * @lrc: Pointer to the lrc.
+ * @seqno: seqno of current job.
+ *
+ * Set ULLS semaphore based on input seqno
+ */
+void xe_lrc_set_ulls_semaphore(struct xe_lrc *lrc, u32 seqno)
+{
+	struct xe_device *xe = lrc_to_xe(lrc);
+	struct iosys_map map = __xe_lrc_ulls_map(lrc);
+
+	xe_assert(xe, semaphore_offset(seqno) <
+		  LRC_PPHWSP_SIZE - LRC_ULLS_PPHWSP_OFFSET);
+
+	xe_device_wmb(xe);	/* Ensure everything before in code is ordered */
+
+	iosys_map_incr(&map, semaphore_offset(seqno));
+	xe_map_write32(xe, &map, 1);
+}
+
 static int instr_dw(u32 cmd_header)
 {
 	/* GFXPIPE "SINGLE_DW" opcodes are a single dword */
diff --git a/drivers/gpu/drm/xe/xe_lrc.h b/drivers/gpu/drm/xe/xe_lrc.h
index eb6e8de8c939..1ded2de34eb2 100644
--- a/drivers/gpu/drm/xe/xe_lrc.h
+++ b/drivers/gpu/drm/xe/xe_lrc.h
@@ -89,6 +89,9 @@ u32 xe_lrc_indirect_ring_ggtt_addr(struct xe_lrc *lrc);
 u32 xe_lrc_ggtt_addr(struct xe_lrc *lrc);
 u32 *xe_lrc_regs(struct xe_lrc *lrc);
 
+u32 xe_lrc_ulls_semaphore_ggtt_addr(struct xe_lrc *lrc, u32 seqno);
+void xe_lrc_set_ulls_semaphore(struct xe_lrc *lrc, u32 seqno);
+
 u32 xe_lrc_read_ctx_reg(struct xe_lrc *lrc, int reg_nr);
 void xe_lrc_write_ctx_reg(struct xe_lrc *lrc, int reg_nr, u32 val);
 
diff --git a/drivers/gpu/drm/xe/xe_lrc_types.h b/drivers/gpu/drm/xe/xe_lrc_types.h
index ae24cf6f8dd9..96a0f545ba60 100644
--- a/drivers/gpu/drm/xe/xe_lrc_types.h
+++ b/drivers/gpu/drm/xe/xe_lrc_types.h
@@ -12,6 +12,8 @@
 
 struct xe_bo;
 
+#define LRC_MIGRATION_ULLS_SEMAPORE_COUNT	64	/* Must be pow2 */
+
 /**
  * struct xe_lrc - Logical ring context (LRC) and submission ring object
  */
-- 
2.34.1


* [PATCH 08/15] drm/xe: Add ULLS migration job support to migration layer
  2025-06-05 15:32 [PATCH 00/15] CPU binds and ULLS on migration queue Matthew Brost
                   ` (6 preceding siblings ...)
  2025-06-05 15:32 ` [PATCH 07/15] drm/xe: Add ULLS support to LRC Matthew Brost
@ 2025-06-05 15:32 ` Matthew Brost
  2025-06-05 15:32 ` [PATCH 09/15] drm/xe: Add MI_SEMAPHORE_WAIT instruction defs Matthew Brost
                   ` (9 subsequent siblings)
  17 siblings, 0 replies; 21+ messages in thread
From: Matthew Brost @ 2025-06-05 15:32 UTC (permalink / raw)
  To: intel-xe; +Cc: francois.dugast, thomas.hellstrom, himal.prasad.ghimiray

Add functions to enter / exit ULLS mode for migration jobs when LR VMs
are opened / closed. ULLS mode is only supported on DGFX and USM
platforms, where a hardware engine is reserved for migration jobs. When
in ULLS mode, set several flags on migration jobs so the submission
backend / ring ops can properly submit in ULLS mode. Upon ULLS mode
exit, send a job to trigger the current ULLS semaphore so the ring can be
taken off the hardware.
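The enter / exit bookkeeping can be modeled as a simple reference count. This is a single-threaded sketch (job_mutex is omitted) with a hypothetical `struct ulls_state`; force wake is reduced to a boolean and the exit job to a counter.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical stand-in for the ULLS state in struct xe_migrate. */
struct ulls_state {
	unsigned int lr_vm_count;
	bool first_submit;	/* next migration job starts ULLS */
	bool forcewake_held;
	unsigned int exit_jobs;	/* "last" jobs sent to release the ring */
};

static void lr_vm_get(struct ulls_state *s)
{
	if (!s->lr_vm_count++) {
		s->forcewake_held = true;	/* xe_force_wake_get() */
		s->first_submit = true;
	}
}

/* Checked for every migration job; only the first one after enter is "first". */
static bool job_is_ulls_first(struct ulls_state *s)
{
	if (s->lr_vm_count && s->first_submit) {
		s->first_submit = false;
		return true;
	}
	return false;
}

static void lr_vm_put(struct ulls_state *s)
{
	assert(s->lr_vm_count);
	if (!--s->lr_vm_count) {
		/* Only release the ring if a ULLS job actually ran. */
		if (!s->first_submit)
			s->exit_jobs++;
		s->forcewake_held = false;	/* xe_force_wake_put() */
	}
}
```

An exit job is only sent if at least one ULLS job was submitted; otherwise the engine never entered the semaphore loop and there is nothing to release.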

Signed-off-by: Matthew Brost <matthew.brost@intel.com>
---
 drivers/gpu/drm/xe/xe_migrate.c         | 111 ++++++++++++++++++++++++
 drivers/gpu/drm/xe/xe_migrate.h         |   4 +
 drivers/gpu/drm/xe/xe_sched_job_types.h |   6 ++
 3 files changed, 121 insertions(+)

diff --git a/drivers/gpu/drm/xe/xe_migrate.c b/drivers/gpu/drm/xe/xe_migrate.c
index 6b6dff9d4aaa..80344d4f6f10 100644
--- a/drivers/gpu/drm/xe/xe_migrate.c
+++ b/drivers/gpu/drm/xe/xe_migrate.c
@@ -22,6 +22,7 @@
 #include "xe_bb.h"
 #include "xe_bo.h"
 #include "xe_exec_queue.h"
+#include "xe_force_wake.h"
 #include "xe_ggtt.h"
 #include "xe_gt.h"
 #include "xe_hw_engine.h"
@@ -62,6 +63,13 @@ struct xe_migrate {
 	struct dma_fence *fence;
 	/** @min_chunk_size: For dgfx, Minimum chunk size */
 	u64 min_chunk_size;
+	/** @ulls: ULLS support */
+	struct {
+		/** @ulls.lr_vm_count: count of LR VMs open */
+		u32 lr_vm_count;
+		/** @ulls.first_submit: first submit of ULLS */
+		u8 first_submit : 1;
+	} ulls;
 };
 
 #define MAX_PREEMPTDISABLE_TRANSFER SZ_8M /* Around 1ms. */
@@ -734,6 +742,95 @@ static u32 xe_migrate_ccs_copy(struct xe_migrate *m,
 	return flush_flags;
 }
 
+/**
+ * xe_migrate_lr_vm_get() - Open an LR VM and possibly enter ULLS mode
+ * @m: The migration context.
+ *
+ * If DGFX and device supports USM, enter ULLS mode by increasing LR VM count
+ */
+void xe_migrate_lr_vm_get(struct xe_migrate *m)
+{
+	struct xe_device *xe = tile_to_xe(m->tile);
+
+	if (!IS_DGFX(xe) || !xe->info.has_usm)
+		return;
+
+	mutex_lock(&m->job_mutex);
+	if (!m->ulls.lr_vm_count++) {
+		unsigned int fw_ref;
+
+		drm_dbg(&xe->drm, "Migrate ULLS mode enter");
+		fw_ref = xe_force_wake_get(gt_to_fw(m->q->hwe->gt),
+					   m->q->hwe->domain);
+
+		XE_WARN_ON(!fw_ref);
+		m->ulls.first_submit = true;
+	}
+	mutex_unlock(&m->job_mutex);
+}
+
+/**
+ * xe_migrate_lr_vm_put() - Close an LR VM and possibly exit ULLS mode
+ * @m: The migration context.
+ *
+ * If DGFX and device supports USM, decrease LR VM count; exit ULLS mode if the
+ * count reaches zero by submitting a job to trigger the last ULLS semaphore.
+ */
+void xe_migrate_lr_vm_put(struct xe_migrate *m)
+{
+	struct xe_device *xe = tile_to_xe(m->tile);
+
+	if (!IS_DGFX(xe) || !xe->info.has_usm)
+		return;
+
+	mutex_lock(&m->job_mutex);
+	xe_assert(xe, m->ulls.lr_vm_count);
+	if (!--m->ulls.lr_vm_count && !m->ulls.first_submit) {
+		struct xe_sched_job *job;
+		struct dma_fence *fence;
+		u64 batch_addr[2] = { 0, 0 };
+
+		job = xe_sched_job_create(m->q, batch_addr);
+		if (WARN_ON_ONCE(IS_ERR(job)))
+			goto unlock;	/* Not fatal */
+
+		xe_sched_job_arm(job);
+		job->is_ulls = true;
+		job->is_ulls_last = true;
+		fence = dma_fence_get(&job->drm.s_fence->finished);
+		xe_sched_job_push(job);
+
+		/* Serialize force wake put */
+		dma_fence_wait(fence, false);
+		dma_fence_put(fence);
+	}
+unlock:
+	if (!m->ulls.lr_vm_count) {
+		drm_dbg(&xe->drm, "Migrate ULLS mode exit");
+		xe_force_wake_put(gt_to_fw(m->q->hwe->gt), m->q->hwe->domain);
+	}
+	mutex_unlock(&m->job_mutex);
+}
+
+static inline bool xe_migrate_is_ulls(struct xe_migrate *m)
+{
+	lockdep_assert_held(&m->job_mutex);
+
+	return !!m->ulls.lr_vm_count;
+}
+
+static inline bool xe_migrate_is_ulls_first(struct xe_migrate *m)
+{
+	lockdep_assert_held(&m->job_mutex);
+
+	if (xe_migrate_is_ulls(m) && m->ulls.first_submit) {
+		m->ulls.first_submit = false;
+		return true;
+	}
+
+	return false;
+}
+
 /**
  * xe_migrate_copy() - Copy content of TTM resources.
  * @m: The migration context.
@@ -904,6 +1001,10 @@ struct dma_fence *xe_migrate_copy(struct xe_migrate *m,
 
 		mutex_lock(&m->job_mutex);
 		xe_sched_job_arm(job);
+		if (xe_migrate_is_ulls(m))
+			job->is_ulls = true;
+		if (xe_migrate_is_ulls_first(m))
+			job->is_ulls_first = true;
 		dma_fence_put(fence);
 		fence = dma_fence_get(&job->drm.s_fence->finished);
 		xe_sched_job_push(job);
@@ -923,6 +1024,7 @@ struct dma_fence *xe_migrate_copy(struct xe_migrate *m,
 		xe_bb_free(bb, NULL);
 
 err_sync:
+
 		/* Sync partial copy if any. FIXME: under job_mutex? */
 		if (fence) {
 			dma_fence_wait(fence, false);
@@ -1156,6 +1258,10 @@ struct dma_fence *xe_migrate_clear(struct xe_migrate *m,
 
 		mutex_lock(&m->job_mutex);
 		xe_sched_job_arm(job);
+		if (xe_migrate_is_ulls(m))
+			job->is_ulls = true;
+		if (xe_migrate_is_ulls_first(m))
+			job->is_ulls_first = true;
 		dma_fence_put(fence);
 		fence = dma_fence_get(&job->drm.s_fence->finished);
 		xe_sched_job_push(job);
@@ -1173,6 +1279,7 @@ struct dma_fence *xe_migrate_clear(struct xe_migrate *m,
 err:
 		xe_bb_free(bb, NULL);
 err_sync:
+
 		/* Sync partial copies if any. FIXME: job_mutex? */
 		if (fence) {
 			dma_fence_wait(fence, false);
@@ -1499,6 +1606,10 @@ static struct dma_fence *xe_migrate_vram(struct xe_migrate *m,
 	mutex_lock(&m->job_mutex);
 	xe_sched_job_arm(job);
 	fence = dma_fence_get(&job->drm.s_fence->finished);
+	if (xe_migrate_is_ulls(m))
+		job->is_ulls = true;
+	if (xe_migrate_is_ulls_first(m))
+		job->is_ulls_first = true;
 	xe_sched_job_push(job);
 
 	dma_fence_put(m->fence);
diff --git a/drivers/gpu/drm/xe/xe_migrate.h b/drivers/gpu/drm/xe/xe_migrate.h
index 3131875341c9..3af024284722 100644
--- a/drivers/gpu/drm/xe/xe_migrate.h
+++ b/drivers/gpu/drm/xe/xe_migrate.h
@@ -135,4 +135,8 @@ xe_migrate_update_pgtables(struct xe_migrate *m,
 void xe_migrate_wait(struct xe_migrate *m);
 
 struct xe_exec_queue *xe_tile_migrate_bind_exec_queue(struct xe_tile *tile);
+
+void xe_migrate_lr_vm_get(struct xe_migrate *m);
+void xe_migrate_lr_vm_put(struct xe_migrate *m);
+
 #endif
diff --git a/drivers/gpu/drm/xe/xe_sched_job_types.h b/drivers/gpu/drm/xe/xe_sched_job_types.h
index 79a459f2a0a8..9beeafb636ba 100644
--- a/drivers/gpu/drm/xe/xe_sched_job_types.h
+++ b/drivers/gpu/drm/xe/xe_sched_job_types.h
@@ -79,6 +79,12 @@ struct xe_sched_job {
 	bool ggtt;
 	/** @is_pt_job: is a PT job */
 	bool is_pt_job;
+	/** @is_ulls: is ULLS job */
+	bool is_ulls;
+	/** @is_ulls_first: is first ULLS job */
+	bool is_ulls_first;
+	/** @is_ulls_last: is last ULLS job */
+	bool is_ulls_last;
 	union {
 		/** @ptrs: per instance pointers. */
 		DECLARE_FLEX_ARRAY(struct xe_job_ptrs, ptrs);
-- 
2.34.1


* [PATCH 09/15] drm/xe: Add MI_SEMAPHORE_WAIT instruction defs
  2025-06-05 15:32 [PATCH 00/15] CPU binds and ULLS on migration queue Matthew Brost
                   ` (7 preceding siblings ...)
  2025-06-05 15:32 ` [PATCH 08/15] drm/xe: Add ULLS migration job support to migration layer Matthew Brost
@ 2025-06-05 15:32 ` Matthew Brost
  2025-06-05 15:32 ` [PATCH 10/15] drm/xe: Add ULLS migration job support to ring ops Matthew Brost
                   ` (8 subsequent siblings)
  17 siblings, 0 replies; 21+ messages in thread
From: Matthew Brost @ 2025-06-05 15:32 UTC (permalink / raw)
  To: intel-xe; +Cc: francois.dugast, thomas.hellstrom, himal.prasad.ghimiray

Add MI_SEMAPHORE_WAIT instruction defs, which are needed for kernel ULLS
migration jobs.
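For reference, a sketch of how these definitions combine into the 4-dword packet that the ring ops emit (header, semaphore data, address low, address high). The macros are local re-derivations with hypothetical names, assuming the xe MI header layout (opcode in bits 28:23, length field equal to total dwords minus two) and a compare-operation field in bits 14:12, below the poll bit.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Local re-derivations (assumed layout, not the driver's headers). */
#define MI_OPCODE_SHIFT		23
#define MI_INSTR(op)		((uint32_t)(op) << MI_OPCODE_SHIFT)	/* MI type 0 in bits 31:29 */
#define INSTR_NUM_DW(n)		((uint32_t)(n) - 2)

#define SEM_WAIT		(MI_INSTR(0x1c) | INSTR_NUM_DW(4))
#define SEM_GLOBAL_GTT		(1u << 22)
#define SEM_POLL		(1u << 15)
#define SEM_COMPARE_SHIFT	12	/* compare op in bits 14:12 */
#define SEM_SAD_EQ_SDD		(4u << SEM_COMPARE_SHIFT)

/* Emit "poll until *ggtt_addr == value"; return the next dword index. */
static size_t emit_semaphore_wait(uint32_t *dw, size_t i, uint32_t ggtt_addr,
				  uint32_t value)
{
	dw[i++] = SEM_WAIT | SEM_GLOBAL_GTT | SEM_POLL | SEM_SAD_EQ_SDD;
	dw[i++] = value;	/* semaphore data dword (SDD) */
	dw[i++] = ggtt_addr;	/* address, low 32 bits */
	dw[i++] = 0;		/* address, high 32 bits */
	return i;
}
```

Under these assumptions the header dword comes out as 0x0E40C002.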

Signed-off-by: Matthew Brost <matthew.brost@intel.com>
---
 drivers/gpu/drm/xe/instructions/xe_mi_commands.h | 6 ++++++
 1 file changed, 6 insertions(+)

diff --git a/drivers/gpu/drm/xe/instructions/xe_mi_commands.h b/drivers/gpu/drm/xe/instructions/xe_mi_commands.h
index e3f5e8bb3ebc..7a871b186401 100644
--- a/drivers/gpu/drm/xe/instructions/xe_mi_commands.h
+++ b/drivers/gpu/drm/xe/instructions/xe_mi_commands.h
@@ -34,6 +34,12 @@
 #define MI_FORCE_WAKEUP			__MI_INSTR(0x1D)
 #define MI_MATH(n)			(__MI_INSTR(0x1A) | XE_INSTR_NUM_DW((n) + 1))
 
+#define MI_SEMAPHORE_WAIT		(__MI_INSTR(0x1c) | XE_INSTR_NUM_DW(4))
+#define   MI_SEMAPHORE_GLOBAL_GTT	REG_BIT(22)
+#define   MI_SEMAPHORE_POLL		REG_BIT(15)
+#define   MI_SEMAPHORE_COMPARE		GENMASK(14, 12)
+#define   MI_SEMAPHORE_SAD_EQ_SDD	REG_FIELD_PREP(MI_SEMAPHORE_COMPARE, 4)
+
 #define MI_STORE_DATA_IMM		__MI_INSTR(0x20)
 #define   MI_SDI_GGTT			REG_BIT(22)
 #define   MI_SDI_LEN_DW			GENMASK(9, 0)
-- 
2.34.1


* [PATCH 10/15] drm/xe: Add ULLS migration job support to ring ops
  2025-06-05 15:32 [PATCH 00/15] CPU binds and ULLS on migration queue Matthew Brost
                   ` (8 preceding siblings ...)
  2025-06-05 15:32 ` [PATCH 09/15] drm/xe: Add MI_SEMAPHORE_WAIT instruction defs Matthew Brost
@ 2025-06-05 15:32 ` Matthew Brost
  2025-06-05 15:32 ` [PATCH 11/15] drm/xe: Add ULLS migration job support to GuC submission Matthew Brost
                   ` (7 subsequent siblings)
  17 siblings, 0 replies; 21+ messages in thread
From: Matthew Brost @ 2025-06-05 15:32 UTC (permalink / raw)
  To: intel-xe; +Cc: francois.dugast, thomas.hellstrom, himal.prasad.ghimiray

Add a preamble and postamble for ULLS migration jobs. The preamble
clears the current semaphore for reuse. The postamble waits on the next
semaphore, which is set upon the next job submission. The last ULLS
migration job skips BB submission and the postamble (it clears the
current semaphore, writes the seqno, and exits ULLS).
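The handshake between submission and the engine can be simulated in userspace. This sketch models one engine and the 64 semaphore slots; `struct ulls_sim`, `submit()` and `engine_step()` are hypothetical, the first ULLS job is assumed to be started by the GuC, and the final exit job is not modeled.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

#define SEM_COUNT 64u	/* mirrors LRC_MIGRATION_ULLS_SEMAPORE_COUNT */

/* Hypothetical single-engine model of the ULLS handshake. */
struct ulls_sim {
	uint32_t sem[SEM_COUNT];
	uint32_t running;	/* seqno of the job the engine is executing */
	uint32_t completed;	/* jobs whose seqno write has landed */
	bool waiting;		/* engine is parked in the postamble */
};

/* Submission side: setting the job's own slot releases the previous job. */
static void submit(struct ulls_sim *s, uint32_t seqno)
{
	s->sem[seqno % SEM_COUNT] = 1;
}

/* Engine side: keep running jobs until the postamble semaphore is unset. */
static void engine_step(struct ulls_sim *s)
{
	for (;;) {
		if (s->waiting) {
			uint32_t next = s->running + 1;

			if (s->sem[next % SEM_COUNT] != 1)
				return;		/* still polling */
			s->running = next;
			s->waiting = false;
		}
		s->sem[s->running % SEM_COUNT] = 0;	/* preamble: clear own slot */
		s->completed++;				/* BB + seqno write */
		s->waiting = true;			/* postamble: wait on next slot */
	}
}
```

In this series the ring tail is written over MMIO before the semaphore is set, so by the time the engine leaves its poll, the next job's commands are already in the ring.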

Signed-off-by: Matthew Brost <matthew.brost@intel.com>
---
 drivers/gpu/drm/xe/xe_ring_ops.c | 30 ++++++++++++++++++++++++++++++
 1 file changed, 30 insertions(+)

diff --git a/drivers/gpu/drm/xe/xe_ring_ops.c b/drivers/gpu/drm/xe/xe_ring_ops.c
index bc1689db4cd7..8f63c97b749d 100644
--- a/drivers/gpu/drm/xe/xe_ring_ops.c
+++ b/drivers/gpu/drm/xe/xe_ring_ops.c
@@ -394,11 +394,37 @@ static void __emit_job_gen12_render_compute(struct xe_sched_job *job,
 	xe_lrc_write_ring(lrc, dw, i * sizeof(*dw));
 }
 
+static int emit_ulls_preamble(struct xe_lrc *lrc, u32 *dw, int i, u32 seqno)
+{
+	u32 addr = xe_lrc_ulls_semaphore_ggtt_addr(lrc, seqno);
+
+	return emit_store_imm_ggtt(addr, 0, dw, i);
+}
+
+static int emit_ulls_postamble(struct xe_lrc *lrc, u32 *dw, int i, u32 seqno)
+{
+	dw[i++] = MI_SEMAPHORE_WAIT |
+		MI_SEMAPHORE_GLOBAL_GTT |
+		MI_SEMAPHORE_POLL |
+		MI_SEMAPHORE_SAD_EQ_SDD;
+	dw[i++] = 1;
+	dw[i++] = xe_lrc_ulls_semaphore_ggtt_addr(lrc, seqno + 1);
+	dw[i++] = 0;
+
+	return i;
+}
+
 static void emit_migration_job_gen12(struct xe_sched_job *job,
 				     struct xe_lrc *lrc, u32 seqno)
 {
 	u32 dw[MAX_JOB_SIZE_DW], i = 0;
 
+	if (job->is_ulls) {
+		i = emit_ulls_preamble(lrc, dw, i, seqno);
+		if (job->is_ulls_last)
+			goto seqno_write;
+	}
+
 	i = emit_copy_timestamp(lrc, dw, i);
 
 	i = emit_store_imm_ggtt(xe_lrc_start_seqno_ggtt_addr(lrc),
@@ -417,6 +443,7 @@ static void emit_migration_job_gen12(struct xe_sched_job *job,
 
 	i = emit_bb_start(job->ptrs[1].batch_addr, BIT(8), dw, i);
 
+seqno_write:
 	dw[i++] = MI_FLUSH_DW | MI_INVALIDATE_TLB | job->migrate_flush_flags |
 		MI_FLUSH_DW_OP_STOREDW | MI_FLUSH_IMM_DW;
 	dw[i++] = xe_lrc_seqno_ggtt_addr(lrc) | MI_FLUSH_DW_USE_GTT;
@@ -425,6 +452,9 @@ static void emit_migration_job_gen12(struct xe_sched_job *job,
 
 	i = emit_user_interrupt(dw, i);
 
+	if (job->is_ulls && !job->is_ulls_last)
+		i = emit_ulls_postamble(lrc, dw, i, seqno);
+
 	xe_gt_assert(job->q->gt, i <= MAX_JOB_SIZE_DW);
 
 	xe_lrc_write_ring(lrc, dw, i * sizeof(*dw));
-- 
2.34.1


* [PATCH 11/15] drm/xe: Add ULLS migration job support to GuC submission
  2025-06-05 15:32 [PATCH 00/15] CPU binds and ULLS on migration queue Matthew Brost
                   ` (9 preceding siblings ...)
  2025-06-05 15:32 ` [PATCH 10/15] drm/xe: Add ULLS migration job support to ring ops Matthew Brost
@ 2025-06-05 15:32 ` Matthew Brost
  2025-06-05 15:32 ` [PATCH 12/15] drm/xe: Enable ULLS migration jobs when opening LR VM Matthew Brost
                   ` (6 subsequent siblings)
  17 siblings, 0 replies; 21+ messages in thread
From: Matthew Brost @ 2025-06-05 15:32 UTC (permalink / raw)
  To: intel-xe; +Cc: francois.dugast, thomas.hellstrom, himal.prasad.ghimiray

Add ULLS migration job support to GuC submission backend.

Changes required:
- On the migration queue, reduce max jobs to the number of ULLS
  semaphores minus one
- Directly set the hardware engine tail via an MMIO write for ULLS jobs,
  except for the first ULLS job
- Set the ULLS semaphore for the current job, releasing the previous job
- Suppress the submit H2G for ULLS jobs, except for the first ULLS job
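The per-job decision in submit_exec_queue() can be summarized as a small pure function. This is a sketch that ignores suspend handling and the parallel path; `struct submit_plan`, `plan_submit()` and `max_jobs()` are hypothetical. The cap of COUNT - 1 presumably keeps the slots of all in-flight jobs, plus the slot the engine is polling, distinct modulo the semaphore count.

```c
#include <assert.h>
#include <stdbool.h>

#define ULLS_SEMAPHORE_COUNT 64	/* mirrors LRC_MIGRATION_ULLS_SEMAPORE_COUNT */

/* Hypothetical summary of what submit_exec_queue() does for one job. */
struct submit_plan {
	bool mmio_tail_write;	/* xe_hw_engine_write_ring_tail() */
	bool set_semaphore;	/* xe_lrc_set_ulls_semaphore() */
	bool send_h2g;		/* xe_guc_ct_send() */
};

static struct submit_plan plan_submit(bool is_ulls, bool is_ulls_first,
				      bool queue_enabled)
{
	struct submit_plan p = { false, false, false };

	if (is_ulls) {
		p.mmio_tail_write = !is_ulls_first;
		p.set_semaphore = true;
	}
	/* Enabling needs an H2G; otherwise only non-ULLS or first ULLS jobs send. */
	p.send_h2g = !queue_enabled || !is_ulls || is_ulls_first;
	return p;
}

/* Scheduler job limit: ring capacity, capped on the migration queue. */
static int max_jobs(int ring_capacity, bool is_migration_queue)
{
	if (is_migration_queue) {
		assert(ULLS_SEMAPHORE_COUNT - 1 < ring_capacity);
		return ULLS_SEMAPHORE_COUNT - 1;
	}
	return ring_capacity;
}
```

Steady-state ULLS jobs therefore cost one MMIO write and one memory write, with no GuC round trip.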

Signed-off-by: Matthew Brost <matthew.brost@intel.com>
---
 drivers/gpu/drm/xe/xe_guc_submit.c | 26 +++++++++++++++++++++-----
 1 file changed, 21 insertions(+), 5 deletions(-)

diff --git a/drivers/gpu/drm/xe/xe_guc_submit.c b/drivers/gpu/drm/xe/xe_guc_submit.c
index 551cd21a6465..f67dfdb69637 100644
--- a/drivers/gpu/drm/xe/xe_guc_submit.c
+++ b/drivers/gpu/drm/xe/xe_guc_submit.c
@@ -697,7 +697,7 @@ static void wq_item_append(struct xe_exec_queue *q)
 }
 
 #define RESUME_PENDING	~0x0ull
-static void submit_exec_queue(struct xe_exec_queue *q)
+static void submit_exec_queue(struct xe_sched_job *job, struct xe_exec_queue *q)
 {
 	struct xe_guc *guc = exec_queue_to_guc(q);
 	struct xe_lrc *lrc = q->lrc[0];
@@ -717,6 +717,13 @@ static void submit_exec_queue(struct xe_exec_queue *q)
 	if (exec_queue_suspended(q) && !xe_exec_queue_is_parallel(q))
 		return;
 
+	if (job->is_ulls) {
+		if (!job->is_ulls_first)
+			xe_hw_engine_write_ring_tail(q->hwe, lrc->ring.tail);
+
+		xe_lrc_set_ulls_semaphore(lrc, xe_sched_job_lrc_seqno(job));
+	}
+
 	if (!exec_queue_enabled(q) && !exec_queue_suspended(q)) {
 		action[len++] = XE_GUC_ACTION_SCHED_CONTEXT_MODE_SET;
 		action[len++] = q->guc->id;
@@ -730,13 +737,14 @@ static void submit_exec_queue(struct xe_exec_queue *q)
 		set_exec_queue_pending_enable(q);
 		set_exec_queue_enabled(q);
 		trace_xe_exec_queue_scheduling_enable(q);
-	} else {
+	} else if (!job->is_ulls || job->is_ulls_first) {
 		action[len++] = XE_GUC_ACTION_SCHED_CONTEXT;
 		action[len++] = q->guc->id;
 		trace_xe_exec_queue_submit(q);
 	}
 
-	xe_guc_ct_send(&guc->ct, action, len, g2h_len, num_g2h);
+	if (!job->is_ulls || job->is_ulls_first || num_g2h)
+		xe_guc_ct_send(&guc->ct, action, len, g2h_len, num_g2h);
 
 	if (extra_submit) {
 		len = 0;
@@ -784,7 +792,7 @@ guc_exec_queue_run_job(struct drm_sched_job *drm_job)
 				register_exec_queue(q);
 			if (!lr)	/* LR jobs are emitted in the exec IOCTL */
 				q->ring_ops->emit_job(job);
-			submit_exec_queue(q);
+			submit_exec_queue(job, q);
 		}
 	}
 
@@ -1497,6 +1505,7 @@ static int guc_exec_queue_init(struct xe_exec_queue *q)
 	struct xe_guc_exec_queue *ge;
 	long timeout;
 	int err, i;
+	int max_jobs = (q->lrc[0]->ring.size / MAX_JOB_SIZE_BYTES);
 
 	xe_gt_assert(guc_to_gt(guc), xe_device_uc_enabled(guc_to_xe(guc)));
 
@@ -1511,10 +1520,17 @@ static int guc_exec_queue_init(struct xe_exec_queue *q)
 	for (i = 0; i < MAX_STATIC_MSG_TYPE; ++i)
 		INIT_LIST_HEAD(&ge->static_msgs[i].link);
 
+	if (q->vm && q->vm->flags & XE_VM_FLAG_MIGRATION) {
+		xe_assert(guc_to_xe(guc),
+			  LRC_MIGRATION_ULLS_SEMAPORE_COUNT - 1 < max_jobs);
+
+		max_jobs = LRC_MIGRATION_ULLS_SEMAPORE_COUNT - 1;
+	}
+
 	timeout = (q->vm && xe_vm_in_lr_mode(q->vm)) ? MAX_SCHEDULE_TIMEOUT :
 		  msecs_to_jiffies(q->sched_props.job_timeout_ms);
 	err = xe_sched_init(&ge->sched, &drm_sched_ops, &xe_sched_ops,
-			    NULL, q->lrc[0]->ring.size / MAX_JOB_SIZE_BYTES, 64,
+			    NULL, max_jobs, 64,
 			    timeout, guc_to_gt(guc)->ordered_wq, NULL,
 			    q->name, gt_to_xe(q->gt)->drm.dev);
 	if (err)
-- 
2.34.1


* [PATCH 12/15] drm/xe: Enable ULLS migration jobs when opening LR VM
  2025-06-05 15:32 [PATCH 00/15] CPU binds and ULLS on migration queue Matthew Brost
                   ` (10 preceding siblings ...)
  2025-06-05 15:32 ` [PATCH 11/15] drm/xe: Add ULLS migration job support to GuC submission Matthew Brost
@ 2025-06-05 15:32 ` Matthew Brost
  2025-06-05 15:32 ` [PATCH 13/15] drm/xe: Set slpc freq to max on ULLS jobs Matthew Brost
                   ` (5 subsequent siblings)
  17 siblings, 0 replies; 21+ messages in thread
From: Matthew Brost @ 2025-06-05 15:32 UTC (permalink / raw)
  To: intel-xe; +Cc: francois.dugast, thomas.hellstrom, himal.prasad.ghimiray

Call xe_migrate_lr_vm_get upon opening an LR VM and
xe_migrate_lr_vm_put upon closing it to enter / exit ULLS on the
migration exec queue.

Signed-off-by: Matthew Brost <matthew.brost@intel.com>
---
 drivers/gpu/drm/xe/xe_vm.c | 10 ++++++++++
 1 file changed, 10 insertions(+)

diff --git a/drivers/gpu/drm/xe/xe_vm.c b/drivers/gpu/drm/xe/xe_vm.c
index 0285938a4bb2..8373e776b504 100644
--- a/drivers/gpu/drm/xe/xe_vm.c
+++ b/drivers/gpu/drm/xe/xe_vm.c
@@ -1807,6 +1807,11 @@ struct xe_vm *xe_vm_create(struct xe_device *xe, u32 flags)
 	if (number_tiles > 1)
 		vm->composite_fence_ctx = dma_fence_context_alloc(1);
 
+	if (flags & XE_VM_FLAG_LR_MODE) {
+		for_each_tile(tile, xe, id)
+			xe_migrate_lr_vm_get(tile->migrate);
+	}
+
 	trace_xe_vm_create(vm);
 
 	return vm;
@@ -1964,6 +1969,11 @@ void xe_vm_close_and_put(struct xe_vm *vm)
 
 	up_write(&vm->lock);
 
+	if (vm->flags & XE_VM_FLAG_LR_MODE) {
+		for_each_tile(tile, xe, id)
+			xe_migrate_lr_vm_put(tile->migrate);
+	}
+
 	down_write(&xe->usm.lock);
 	if (vm->usm.asid) {
 		void *lookup;
-- 
2.34.1


* [PATCH 13/15] drm/xe: Set slpc freq to max on ULLS jobs
  2025-06-05 15:32 [PATCH 00/15] CPU binds and ULLS on migration queue Matthew Brost
                   ` (11 preceding siblings ...)
  2025-06-05 15:32 ` [PATCH 12/15] drm/xe: Enable ULLS migration jobs when opening LR VM Matthew Brost
@ 2025-06-05 15:32 ` Matthew Brost
  2025-06-05 15:32 ` [PATCH 14/15] drm/xe: Add modparam to enable / disable ULLS on migrate queue Matthew Brost
                   ` (4 subsequent siblings)
  17 siblings, 0 replies; 21+ messages in thread
From: Matthew Brost @ 2025-06-05 15:32 UTC (permalink / raw)
  To: intel-xe; +Cc: francois.dugast, thomas.hellstrom, himal.prasad.ghimiray

ULLS jobs should run as fast as possible; as such, increase the SLPC
frequency on the ULLS queue to the maximum when the first ULLS job is
submitted, and drop the request again with the last ULLS job.

Signed-off-by: Matthew Brost <matthew.brost@intel.com>
---
 drivers/gpu/drm/xe/xe_guc_submit.c | 24 +++++++++++++++++++++++-
 1 file changed, 23 insertions(+), 1 deletion(-)

diff --git a/drivers/gpu/drm/xe/xe_guc_submit.c b/drivers/gpu/drm/xe/xe_guc_submit.c
index f67dfdb69637..41ac0b41b09b 100644
--- a/drivers/gpu/drm/xe/xe_guc_submit.c
+++ b/drivers/gpu/drm/xe/xe_guc_submit.c
@@ -451,6 +451,23 @@ static void init_policies(struct xe_guc *guc, struct xe_exec_queue *q)
 		       __guc_exec_queue_policy_action_size(&policy), 0, 0);
 }
 
+static void set_slpc_freq(struct xe_guc *guc, struct xe_exec_queue *q,
+			  bool high_freq)
+{
+	struct exec_queue_policy policy;
+	u32 slpc_exec_queue_freq_req = 0;
+
+	if (high_freq)
+		slpc_exec_queue_freq_req |= SLPC_CTX_FREQ_REQ_IS_COMPUTE;
+
+	__guc_exec_queue_policy_start_klv(&policy, q->guc->id);
+	__guc_exec_queue_policy_add_slpc_exec_queue_freq_req(&policy,
+							     slpc_exec_queue_freq_req);
+
+	xe_guc_ct_send(&guc->ct, (u32 *)&policy.h2g,
+		       __guc_exec_queue_policy_action_size(&policy), 0, 0);
+}
+
 static void set_min_preemption_timeout(struct xe_guc *guc, struct xe_exec_queue *q)
 {
 	struct exec_queue_policy policy;
@@ -718,9 +735,14 @@ static void submit_exec_queue(struct xe_sched_job *job, struct xe_exec_queue *q)
 		return;
 
 	if (job->is_ulls) {
-		if (!job->is_ulls_first)
+		if (job->is_ulls_first)
+			set_slpc_freq(guc, q, true);
+		else
 			xe_hw_engine_write_ring_tail(q->hwe, lrc->ring.tail);
 
+		if (job->is_ulls_last)
+			set_slpc_freq(guc, q, false);
+
 		xe_lrc_set_ulls_semaphore(lrc, xe_sched_job_lrc_seqno(job));
 	}
 
-- 
2.34.1


* [PATCH 14/15] drm/xe: Add modparam to enable / disable ULLS on migrate queue
  2025-06-05 15:32 [PATCH 00/15] CPU binds and ULLS on migration queue Matthew Brost
                   ` (12 preceding siblings ...)
  2025-06-05 15:32 ` [PATCH 13/15] drm/xe: Set slpc freq to max on ULLS jobs Matthew Brost
@ 2025-06-05 15:32 ` Matthew Brost
  2025-06-05 15:32 ` [PATCH 15/15] drm/xe: Add modparam to enable / disable high SLPC " Matthew Brost
                   ` (3 subsequent siblings)
  17 siblings, 0 replies; 21+ messages in thread
From: Matthew Brost @ 2025-06-05 15:32 UTC (permalink / raw)
  To: intel-xe; +Cc: francois.dugast, thomas.hellstrom, himal.prasad.ghimiray

Having a modparam to enable/disable ULLS on the migrate queue will help
with quick experiments.
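
The following is a minimal sketch of how the new knob gates ULLS. It is based on the behavior visible in the diff below: xe_device_create() copies the modparam into xe->info.ulls_enable once, and xe_migrate_lr_vm_get()/xe_migrate_lr_vm_put() return early unless the device is discrete, has USM, and has ULLS enabled. The model_* names are hypothetical. The parameter is registered with 0444 permissions, so it can only be set at load time (for example, `modprobe xe ulls_enable=0`).

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical subset of struct xe_modparam. */
struct model_modparam {
	bool ulls_enable;
};

/* Hypothetical subset of the xe->info fields consulted by the migrate code. */
struct model_device_info {
	bool is_dgfx;
	bool has_usm;
	bool ulls_enable;
};

/* Like xe_device_create(): copy the modparam into per-device info once. */
void model_device_create(struct model_device_info *info,
			 const struct model_modparam *mp)
{
	info->ulls_enable = mp->ulls_enable;
}

/*
 * Inverse of the early-return test in xe_migrate_lr_vm_get()/_put():
 * ULLS bookkeeping only happens on discrete parts with USM and the knob on.
 */
bool model_ulls_allowed(const struct model_device_info *info)
{
	return info->is_dgfx && info->has_usm && info->ulls_enable;
}
```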

Signed-off-by: Matthew Brost <matthew.brost@intel.com>
---
 drivers/gpu/drm/xe/xe_debugfs.c      | 1 +
 drivers/gpu/drm/xe/xe_device.c       | 1 +
 drivers/gpu/drm/xe/xe_device_types.h | 5 +++++
 drivers/gpu/drm/xe/xe_migrate.c      | 4 ++--
 drivers/gpu/drm/xe/xe_module.c       | 4 ++++
 drivers/gpu/drm/xe/xe_module.h       | 1 +
 6 files changed, 14 insertions(+), 2 deletions(-)

diff --git a/drivers/gpu/drm/xe/xe_debugfs.c b/drivers/gpu/drm/xe/xe_debugfs.c
index d83cd6ed3fa8..d027bda5652f 100644
--- a/drivers/gpu/drm/xe/xe_debugfs.c
+++ b/drivers/gpu/drm/xe/xe_debugfs.c
@@ -59,6 +59,7 @@ static int info(struct seq_file *m, void *data)
 	drm_printf(&p, "tile_count %d\n", xe->info.tile_count);
 	drm_printf(&p, "vm_max_level %d\n", xe->info.vm_max_level);
 	drm_printf(&p, "force_execlist %s\n", str_yes_no(xe->info.force_execlist));
+	drm_printf(&p, "ulls_enable %s\n", str_yes_no(xe->info.ulls_enable));
 	drm_printf(&p, "has_flat_ccs %s\n", str_yes_no(xe->info.has_flat_ccs));
 	drm_printf(&p, "has_usm %s\n", str_yes_no(xe->info.has_usm));
 	drm_printf(&p, "skip_guc_pc %s\n", str_yes_no(xe->info.skip_guc_pc));
diff --git a/drivers/gpu/drm/xe/xe_device.c b/drivers/gpu/drm/xe/xe_device.c
index 660b0c5126dc..8cc88410dfca 100644
--- a/drivers/gpu/drm/xe/xe_device.c
+++ b/drivers/gpu/drm/xe/xe_device.c
@@ -442,6 +442,7 @@ struct xe_device *xe_device_create(struct pci_dev *pdev,
 	xe->info.devid = pdev->device;
 	xe->info.revid = pdev->revision;
 	xe->info.force_execlist = xe_modparam.force_execlist;
+	xe->info.ulls_enable = xe_modparam.ulls_enable;
 	xe->atomic_svm_timeslice_ms = 5;
 
 	err = xe_irq_init(xe);
diff --git a/drivers/gpu/drm/xe/xe_device_types.h b/drivers/gpu/drm/xe/xe_device_types.h
index ac27389ccb8b..214a90696615 100644
--- a/drivers/gpu/drm/xe/xe_device_types.h
+++ b/drivers/gpu/drm/xe/xe_device_types.h
@@ -355,6 +355,11 @@ struct xe_device {
 		u8 skip_mtcfg:1;
 		/** @info.skip_pcode: skip access to PCODE uC */
 		u8 skip_pcode:1;
+		/**
+		 * @info.ulls_enable: Enable ULLS on migration queue in LR VM
+		 * open
+		 */
+		u8 ulls_enable:1;
 	} info;
 
 	/** @survivability: survivability information for device */
diff --git a/drivers/gpu/drm/xe/xe_migrate.c b/drivers/gpu/drm/xe/xe_migrate.c
index 80344d4f6f10..ebe472af2e7a 100644
--- a/drivers/gpu/drm/xe/xe_migrate.c
+++ b/drivers/gpu/drm/xe/xe_migrate.c
@@ -752,7 +752,7 @@ void xe_migrate_lr_vm_get(struct xe_migrate *m)
 {
 	struct xe_device *xe = tile_to_xe(m->tile);
 
-	if (!IS_DGFX(xe) || !xe->info.has_usm)
+	if (!IS_DGFX(xe) || !xe->info.has_usm || !xe->info.ulls_enable)
 		return;
 
 	mutex_lock(&m->job_mutex);
@@ -780,7 +780,7 @@ void xe_migrate_lr_vm_put(struct xe_migrate *m)
 {
 	struct xe_device *xe = tile_to_xe(m->tile);
 
-	if (!IS_DGFX(xe) || !xe->info.has_usm)
+	if (!IS_DGFX(xe) || !xe->info.has_usm || !xe->info.ulls_enable)
 		return;
 
 	mutex_lock(&m->job_mutex);
diff --git a/drivers/gpu/drm/xe/xe_module.c b/drivers/gpu/drm/xe/xe_module.c
index 1c4dfafbcd0b..353bd5a8f02b 100644
--- a/drivers/gpu/drm/xe/xe_module.c
+++ b/drivers/gpu/drm/xe/xe_module.c
@@ -20,6 +20,7 @@
 
 struct xe_modparam xe_modparam = {
 	.probe_display = true,
+	.ulls_enable = true,
 	.guc_log_level = 3,
 	.force_probe = CONFIG_DRM_XE_FORCE_PROBE,
 #ifdef CONFIG_PCI_IOV
@@ -39,6 +40,9 @@ MODULE_PARM_DESC(force_execlist, "Force Execlist submission");
 module_param_named(probe_display, xe_modparam.probe_display, bool, 0444);
 MODULE_PARM_DESC(probe_display, "Probe display HW, otherwise it's left untouched (default: true)");
 
+module_param_named(ulls_enable, xe_modparam.ulls_enable, bool, 0444);
+MODULE_PARM_DESC(ulls_enable, "Enable ULLS on migration queue if LR VM open (default: true)");
+
 module_param_named(vram_bar_size, xe_modparam.force_vram_bar_size, int, 0600);
 MODULE_PARM_DESC(vram_bar_size, "Set the vram bar size (in MiB) - <0=disable-resize, 0=max-needed-size[default], >0=force-size");
 
diff --git a/drivers/gpu/drm/xe/xe_module.h b/drivers/gpu/drm/xe/xe_module.h
index 5a3bfea8b7b4..8ea69a5b2141 100644
--- a/drivers/gpu/drm/xe/xe_module.h
+++ b/drivers/gpu/drm/xe/xe_module.h
@@ -12,6 +12,7 @@
 struct xe_modparam {
 	bool force_execlist;
 	bool probe_display;
+	bool ulls_enable;
 	u32 force_vram_bar_size;
 	int guc_log_level;
 	char *guc_firmware_path;
-- 
2.34.1


* [PATCH 15/15] drm/xe: Add modparam to enable / disable high SLPC on migrate queue
  2025-06-05 15:32 [PATCH 00/15] CPU binds and ULLS on migration queue Matthew Brost
                   ` (13 preceding siblings ...)
  2025-06-05 15:32 ` [PATCH 14/15] drm/xe: Add modparam to enable / disable ULLS on migrate queue Matthew Brost
@ 2025-06-05 15:32 ` Matthew Brost
  2025-06-05 22:30 ` ✓ CI.Patch_applied: success for CPU binds and ULLS on migration queue Patchwork
                   ` (2 subsequent siblings)
  17 siblings, 0 replies; 21+ messages in thread
From: Matthew Brost @ 2025-06-05 15:32 UTC (permalink / raw)
  To: intel-xe; +Cc: francois.dugast, thomas.hellstrom, himal.prasad.ghimiray

Having a modparam to enable/disable high SLPC frequency requests on the
migrate queue will help with quick experiments.
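
The condition added to init_policies() relies on C operator precedence. `&` binds tighter than `&&`, which in turn binds tighter than `||`, so `q->flags & EXEC_QUEUE_FLAG_LOW_LATENCY || (...)` groups as intended. The check keys off EXEC_QUEUE_FLAG_HIGH_PRIORITY rather than an explicit migration-queue flag. The following small sketch shows the resulting decision. The flag values and the helper name are hypothetical; the real EXEC_QUEUE_FLAG_* bits differ.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical bit values; the real EXEC_QUEUE_FLAG_* constants differ. */
#define MODEL_FLAG_HIGH_PRIORITY (1u << 0)
#define MODEL_FLAG_LOW_LATENCY   (1u << 1)

/* Hypothetical stand-in for SLPC_CTX_FREQ_REQ_IS_COMPUTE. */
#define MODEL_FREQ_REQ_IS_COMPUTE 0x1u

/*
 * Same shape as the patched check: & binds tighter than &&, and && binds
 * tighter than ||, so no extra parentheses are needed around the & tests.
 */
unsigned int model_initial_freq_req(unsigned int flags, bool high_slpc_mq)
{
	unsigned int req = 0;

	if (flags & MODEL_FLAG_LOW_LATENCY ||
	    (high_slpc_mq && flags & MODEL_FLAG_HIGH_PRIORITY))
		req |= MODEL_FREQ_REQ_IS_COMPUTE;

	return req;
}
```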

Signed-off-by: Matthew Brost <matthew.brost@intel.com>
---
 drivers/gpu/drm/xe/xe_debugfs.c      | 2 ++
 drivers/gpu/drm/xe/xe_device.c       | 2 ++
 drivers/gpu/drm/xe/xe_device_types.h | 5 +++++
 drivers/gpu/drm/xe/xe_guc_submit.c   | 4 +++-
 drivers/gpu/drm/xe/xe_module.c       | 3 +++
 drivers/gpu/drm/xe/xe_module.h       | 1 +
 6 files changed, 16 insertions(+), 1 deletion(-)

diff --git a/drivers/gpu/drm/xe/xe_debugfs.c b/drivers/gpu/drm/xe/xe_debugfs.c
index d027bda5652f..369e39093010 100644
--- a/drivers/gpu/drm/xe/xe_debugfs.c
+++ b/drivers/gpu/drm/xe/xe_debugfs.c
@@ -60,6 +60,8 @@ static int info(struct seq_file *m, void *data)
 	drm_printf(&p, "vm_max_level %d\n", xe->info.vm_max_level);
 	drm_printf(&p, "force_execlist %s\n", str_yes_no(xe->info.force_execlist));
 	drm_printf(&p, "ulls_enable %s\n", str_yes_no(xe->info.ulls_enable));
+	drm_printf(&p, "high_slpc_migration_queue %s\n",
+		   str_yes_no(xe->info.high_slpc_migration_queue));
 	drm_printf(&p, "has_flat_ccs %s\n", str_yes_no(xe->info.has_flat_ccs));
 	drm_printf(&p, "has_usm %s\n", str_yes_no(xe->info.has_usm));
 	drm_printf(&p, "skip_guc_pc %s\n", str_yes_no(xe->info.skip_guc_pc));
diff --git a/drivers/gpu/drm/xe/xe_device.c b/drivers/gpu/drm/xe/xe_device.c
index 8cc88410dfca..13da2da441b7 100644
--- a/drivers/gpu/drm/xe/xe_device.c
+++ b/drivers/gpu/drm/xe/xe_device.c
@@ -443,6 +443,8 @@ struct xe_device *xe_device_create(struct pci_dev *pdev,
 	xe->info.revid = pdev->revision;
 	xe->info.force_execlist = xe_modparam.force_execlist;
 	xe->info.ulls_enable = xe_modparam.ulls_enable;
+	xe->info.high_slpc_migration_queue =
+		xe_modparam.high_slpc_migration_queue;
 	xe->atomic_svm_timeslice_ms = 5;
 
 	err = xe_irq_init(xe);
diff --git a/drivers/gpu/drm/xe/xe_device_types.h b/drivers/gpu/drm/xe/xe_device_types.h
index 214a90696615..2cdcb11ae2e1 100644
--- a/drivers/gpu/drm/xe/xe_device_types.h
+++ b/drivers/gpu/drm/xe/xe_device_types.h
@@ -360,6 +360,11 @@ struct xe_device {
 		 * open
 		 */
 		u8 ulls_enable:1;
+		/**
+		 * @info.high_slpc_migration_queue: High SLPC on migration
+		 * queue
+		 */
+		u8 high_slpc_migration_queue:1;
 	} info;
 
 	/** @survivability: survivability information for device */
diff --git a/drivers/gpu/drm/xe/xe_guc_submit.c b/drivers/gpu/drm/xe/xe_guc_submit.c
index 41ac0b41b09b..479f3ad0fd88 100644
--- a/drivers/gpu/drm/xe/xe_guc_submit.c
+++ b/drivers/gpu/drm/xe/xe_guc_submit.c
@@ -437,7 +437,9 @@ static void init_policies(struct xe_guc *guc, struct xe_exec_queue *q)
 
 	xe_gt_assert(guc_to_gt(guc), exec_queue_registered(q));
 
-	if (q->flags & EXEC_QUEUE_FLAG_LOW_LATENCY)
+	if (q->flags & EXEC_QUEUE_FLAG_LOW_LATENCY ||
+	    (guc_to_xe(guc)->info.high_slpc_migration_queue &&
+	     q->flags & EXEC_QUEUE_FLAG_HIGH_PRIORITY))
 		slpc_exec_queue_freq_req |= SLPC_CTX_FREQ_REQ_IS_COMPUTE;
 
 	__guc_exec_queue_policy_start_klv(&policy, q->guc->id);
diff --git a/drivers/gpu/drm/xe/xe_module.c b/drivers/gpu/drm/xe/xe_module.c
index 353bd5a8f02b..1c0c45f50de9 100644
--- a/drivers/gpu/drm/xe/xe_module.c
+++ b/drivers/gpu/drm/xe/xe_module.c
@@ -43,6 +43,9 @@ MODULE_PARM_DESC(probe_display, "Probe display HW, otherwise it's left untouched
 module_param_named(ulls_enable, xe_modparam.ulls_enable, bool, 0444);
 MODULE_PARM_DESC(ulls_enable, "Enable ULLS on migration queue if LR VM open (default: true)");
 
+module_param_named(high_slpc_migration_queue, xe_modparam.high_slpc_migration_queue, bool, 0444);
+MODULE_PARM_DESC(high_slpc_migration_queue, "High SLPC on migration queue (default: false)");
+
 module_param_named(vram_bar_size, xe_modparam.force_vram_bar_size, int, 0600);
 MODULE_PARM_DESC(vram_bar_size, "Set the vram bar size (in MiB) - <0=disable-resize, 0=max-needed-size[default], >0=force-size");
 
diff --git a/drivers/gpu/drm/xe/xe_module.h b/drivers/gpu/drm/xe/xe_module.h
index 8ea69a5b2141..d0afc085f443 100644
--- a/drivers/gpu/drm/xe/xe_module.h
+++ b/drivers/gpu/drm/xe/xe_module.h
@@ -13,6 +13,7 @@ struct xe_modparam {
 	bool force_execlist;
 	bool probe_display;
 	bool ulls_enable;
+	bool high_slpc_migration_queue;
 	u32 force_vram_bar_size;
 	int guc_log_level;
 	char *guc_firmware_path;
-- 
2.34.1


* Re: [PATCH 03/15] drm/xe: CPU binds for jobs
  2025-06-05 15:32 ` [PATCH 03/15] drm/xe: CPU binds for jobs Matthew Brost
@ 2025-06-05 15:44   ` Thomas Hellström
  2025-06-05 16:13     ` Matthew Brost
  0 siblings, 1 reply; 21+ messages in thread
From: Thomas Hellström @ 2025-06-05 15:44 UTC (permalink / raw)
  To: Matthew Brost, intel-xe; +Cc: francois.dugast, himal.prasad.ghimiray

Hi, Matt,

An early comment:

Previous concerns have also included:

1) If clearing and binding happen on the same exec_queue, isn't GPU
binding actually likely to be faster, since it can be queued without
waiting for additional dependencies? Do we have any timings from
start-of-clear to support or debunk this argument?

2) Are page tables in unmappable VRAM something we'd want to support at
some point?

Thanks,
Thomas
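
The following is a minimal user-space sketch of the job-ops lifetime described in the patch quoted below, assuming plain reference counting. The bind job takes its own reference on the PT job ops when it is set up. run_job has the CPU apply every stored op. The container is freed when the last reference is dropped. The model_* names are hypothetical, and xe_pt_job_ops_get()/xe_pt_job_ops_put() are not shown in this excerpt. The real code also defers freeing page-table BOs until that final put.

```c
#include <assert.h>
#include <stdlib.h>

/* Hypothetical stand-in for one struct xe_vm_pgtable_update_op. */
struct model_pt_op {
	int num_ptes;
};

/* Hypothetical stand-in for the refcounted PT job ops container. */
struct model_pt_job_ops {
	int refcount;
	int current_op;		/* number of populated ops */
	struct model_pt_op *ops;
};

static int model_ptes_written;
static int model_frees;

struct model_pt_job_ops *model_pt_job_ops_alloc(int num_ops)
{
	struct model_pt_job_ops *p = calloc(1, sizeof(*p));

	if (!p)
		return NULL;
	p->ops = calloc((size_t)num_ops, sizeof(*p->ops));
	if (!p->ops) {
		free(p);
		return NULL;
	}
	p->refcount = 1;
	p->current_op = num_ops;
	return p;
}

struct model_pt_job_ops *model_pt_job_ops_get(struct model_pt_job_ops *p)
{
	p->refcount++;
	return p;
}

/* Whoever drops the last reference frees the ops array and container. */
void model_pt_job_ops_put(struct model_pt_job_ops *p)
{
	if (--p->refcount == 0) {
		free(p->ops);
		free(p);
		model_frees++;
	}
}

/* Stand-in for run_pt_job(): the CPU "writes" the PTEs of every op. */
void model_run_pt_job(const struct model_pt_job_ops *p)
{
	for (int i = 0; i < p->current_op; i++)
		model_ptes_written += p->ops[i].num_ptes;
}
```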


On Thu, 2025-06-05 at 08:32 -0700, Matthew Brost wrote:
> There is no reason to use the GPU for binds. In run_job, use the CPU
> to perform binds once the bind job's dependencies are resolved.
> 
> Benefits of CPU-based binds:
> - Lower latency once dependencies are resolved, as there is no
>   interaction with the GuC or a hardware context switch, both of which
>   are relatively slow.
> - Large arrays of binds do not risk running out of migration PTEs,
>   avoiding -ENOBUFS being returned to userspace.
> - Kernel binds are decoupled from the migration exec queue (which
>   issues copies and clears), so they cannot get stuck behind unrelated
>   jobs, a real risk when GPU faults are serviced in parallel.
> - Enables ULLS on the migration exec queue, as this queue then has
>   exclusive access to the paging copy engine.
> 
> The basic idea of the implementation is to store the VM page table
> update operations (struct xe_vm_pgtable_update_op *pt_op) and the
> additional arguments for the migrate layer's CPU PTE update function
> in a job. The submission backend can then call into the migrate layer
> to write the PTEs with the CPU and free the resources stored for the
> PTE update.
> 
> PT job submission is implemented in the GuC backend for simplicity. A
> follow-up could introduce a specific backend for PT jobs.
> 
> All code related to GPU-based binding has been removed.
> 
> Signed-off-by: Matthew Brost <matthew.brost@intel.com>
> ---
>  drivers/gpu/drm/xe/xe_bo.c              |   7 +-
>  drivers/gpu/drm/xe/xe_bo.h              |   9 +-
>  drivers/gpu/drm/xe/xe_bo_types.h        |   2 -
>  drivers/gpu/drm/xe/xe_drm_client.c      |   3 +-
>  drivers/gpu/drm/xe/xe_guc_submit.c      |  36 +++-
>  drivers/gpu/drm/xe/xe_migrate.c         | 251 +++---------------------
>  drivers/gpu/drm/xe/xe_migrate.h         |   6 +
>  drivers/gpu/drm/xe/xe_pt.c              | 188 ++++++++++++++----
>  drivers/gpu/drm/xe/xe_pt.h              |   5 +-
>  drivers/gpu/drm/xe/xe_pt_types.h        |  29 ++-
>  drivers/gpu/drm/xe/xe_sched_job.c       |  78 +++++---
>  drivers/gpu/drm/xe/xe_sched_job_types.h |  31 ++-
>  drivers/gpu/drm/xe/xe_vm.c              |  46 ++---
>  13 files changed, 341 insertions(+), 350 deletions(-)
> 
> diff --git a/drivers/gpu/drm/xe/xe_bo.c b/drivers/gpu/drm/xe/xe_bo.c
> index 61d208c85281..7aa598b584d2 100644
> --- a/drivers/gpu/drm/xe/xe_bo.c
> +++ b/drivers/gpu/drm/xe/xe_bo.c
> @@ -3033,8 +3033,13 @@ void xe_bo_put_commit(struct llist_head *deferred)
>  	if (!freed)
>  		return;
>  
> -	llist_for_each_entry_safe(bo, next, freed, freed)
> +	llist_for_each_entry_safe(bo, next, freed, freed) {
> +		struct xe_vm *vm = bo->vm;
> +
>  		drm_gem_object_free(&bo->ttm.base.refcount);
> +		if (bo->flags & XE_BO_FLAG_PUT_VM_ASYNC)
> +			xe_vm_put(vm);
> +	}
>  }
>  
>  static void xe_bo_dev_work_func(struct work_struct *work)
> diff --git a/drivers/gpu/drm/xe/xe_bo.h b/drivers/gpu/drm/xe/xe_bo.h
> index 02ada1fb8a23..967b1fe92560 100644
> --- a/drivers/gpu/drm/xe/xe_bo.h
> +++ b/drivers/gpu/drm/xe/xe_bo.h
> @@ -46,6 +46,7 @@
>  #define XE_BO_FLAG_GGTT2		BIT(22)
>  #define XE_BO_FLAG_GGTT3		BIT(23)
>  #define XE_BO_FLAG_CPU_ADDR_MIRROR	BIT(24)
> +#define XE_BO_FLAG_PUT_VM_ASYNC		BIT(25)
>  
>  /* this one is trigger internally only */
>  #define XE_BO_FLAG_INTERNAL_TEST	BIT(30)
> @@ -319,6 +320,7 @@ void __xe_bo_release_dummy(struct kref *kref);
>   * @bo: The bo to put.
>   * @deferred: List to which to add the buffer object if we cannot put, or
>   * NULL if the function is to put unconditionally.
> + * @added: BO was added to deferred list
>   *
>   * Since the final freeing of an object includes both sleeping and (!)
>   * memory allocation in the dma_resv individualization, it's not ok
> @@ -338,7 +340,8 @@ void __xe_bo_release_dummy(struct kref *kref);
>   * false otherwise.
>   */
>  static inline bool
> -xe_bo_put_deferred(struct xe_bo *bo, struct llist_head *deferred)
> +xe_bo_put_deferred(struct xe_bo *bo, struct llist_head *deferred,
> +		   bool *added)
>  {
>  	if (!deferred) {
>  		xe_bo_put(bo);
> @@ -348,6 +351,7 @@ xe_bo_put_deferred(struct xe_bo *bo, struct llist_head *deferred)
>  	if (!kref_put(&bo->ttm.base.refcount, __xe_bo_release_dummy))
>  		return false;
>  
> +	*added = true;
>  	return llist_add(&bo->freed, deferred);
>  }
>  
> @@ -363,8 +367,9 @@ static inline void
>  xe_bo_put_async(struct xe_bo *bo)
>  {
>  	struct xe_bo_dev *bo_device = &xe_bo_device(bo)->bo_device;
> +	bool added = false;
>  
> -	if (xe_bo_put_deferred(bo, &bo_device->async_list))
> +	if (xe_bo_put_deferred(bo, &bo_device->async_list, &added))
>  		schedule_work(&bo_device->async_free);
>  }
>  
> diff --git a/drivers/gpu/drm/xe/xe_bo_types.h b/drivers/gpu/drm/xe/xe_bo_types.h
> index eb5e83c5f233..ecf42a04640a 100644
> --- a/drivers/gpu/drm/xe/xe_bo_types.h
> +++ b/drivers/gpu/drm/xe/xe_bo_types.h
> @@ -70,8 +70,6 @@ struct xe_bo {
>  
>  	/** @freed: List node for delayed put. */
>  	struct llist_node freed;
> -	/** @update_index: Update index if PT BO */
> -	int update_index;
>  	/** @created: Whether the bo has passed initial creation */
>  	bool created;
>  
> diff --git a/drivers/gpu/drm/xe/xe_drm_client.c b/drivers/gpu/drm/xe/xe_drm_client.c
> index 31f688e953d7..6f5a91ef7491 100644
> --- a/drivers/gpu/drm/xe/xe_drm_client.c
> +++ b/drivers/gpu/drm/xe/xe_drm_client.c
> @@ -200,6 +200,7 @@ static void show_meminfo(struct drm_printer *p, struct drm_file *file)
>  	LLIST_HEAD(deferred);
>  	unsigned int id;
>  	u32 mem_type;
> +	bool added = false;
>  
>  	client = xef->client;
>  
> @@ -246,7 +247,7 @@ static void show_meminfo(struct drm_printer *p, struct drm_file *file)
>  			xe_assert(xef->xe, !list_empty(&bo->client_link));
>  		}
>  
> -		xe_bo_put_deferred(bo, &deferred);
> +		xe_bo_put_deferred(bo, &deferred, &added);
>  	}
>  	spin_unlock(&client->bos_lock);
>  
> diff --git a/drivers/gpu/drm/xe/xe_guc_submit.c b/drivers/gpu/drm/xe/xe_guc_submit.c
> index 2b61d017eeca..551cd21a6465 100644
> --- a/drivers/gpu/drm/xe/xe_guc_submit.c
> +++ b/drivers/gpu/drm/xe/xe_guc_submit.c
> @@ -19,6 +19,7 @@
>  #include "abi/guc_klvs_abi.h"
>  #include "regs/xe_lrc_layout.h"
>  #include "xe_assert.h"
> +#include "xe_bo.h"
>  #include "xe_devcoredump.h"
>  #include "xe_device.h"
>  #include "xe_exec_queue.h"
> @@ -38,8 +39,10 @@
>  #include "xe_lrc.h"
>  #include "xe_macros.h"
>  #include "xe_map.h"
> +#include "xe_migrate.h"
>  #include "xe_mocs.h"
>  #include "xe_pm.h"
> +#include "xe_pt.h"
>  #include "xe_ring_ops_types.h"
>  #include "xe_sched_job.h"
>  #include "xe_trace.h"
> @@ -745,6 +748,20 @@ static void submit_exec_queue(struct xe_exec_queue *q)
>  	}
>  }
>  
> +static bool is_pt_job(struct xe_sched_job *job)
> +{
> +	return job->is_pt_job;
> +}
> +
> +static void run_pt_job(struct xe_sched_job *job)
> +{
> +	__xe_migrate_update_pgtables_cpu(job->pt_update[0].vm,
> +					 job->pt_update[0].tile,
> +					 job->pt_update[0].ops,
> +					 job->pt_update[0].pt_job_ops->ops,
> +					 job->pt_update[0].pt_job_ops->current_op);
> +}
> +
>  static struct dma_fence *
>  guc_exec_queue_run_job(struct drm_sched_job *drm_job)
>  {
> @@ -760,14 +777,21 @@ guc_exec_queue_run_job(struct drm_sched_job *drm_job)
>  	trace_xe_sched_job_run(job);
>  
>  	if (!exec_queue_killed_or_banned_or_wedged(q) && !xe_sched_job_is_error(job)) {
> -		if (!exec_queue_registered(q))
> -			register_exec_queue(q);
> -		if (!lr)	/* LR jobs are emitted in the exec IOCTL */
> -			q->ring_ops->emit_job(job);
> -		submit_exec_queue(q);
> +		if (is_pt_job(job)) {
> +			run_pt_job(job);
> +		} else {
> +			if (!exec_queue_registered(q))
> +				register_exec_queue(q);
> +			if (!lr)	/* LR jobs are emitted in the exec IOCTL */
> +				q->ring_ops->emit_job(job);
> +			submit_exec_queue(q);
> +		}
>  	}
>  
> -	if (lr) {
> +	if (is_pt_job(job)) {
> +		xe_pt_job_ops_put(job->pt_update[0].pt_job_ops);
> +		dma_fence_put(job->fence);	/* Drop ref from xe_sched_job_arm */
> +	} else if (lr) {
>  		xe_sched_job_set_error(job, -EOPNOTSUPP);
>  		dma_fence_put(job->fence);	/* Drop ref from xe_sched_job_arm */
>  	} else {
> diff --git a/drivers/gpu/drm/xe/xe_migrate.c b/drivers/gpu/drm/xe/xe_migrate.c
> index 9084f5cbc02d..e444f3fae97c 100644
> --- a/drivers/gpu/drm/xe/xe_migrate.c
> +++ b/drivers/gpu/drm/xe/xe_migrate.c
> @@ -58,18 +58,12 @@ struct xe_migrate {
>  	 * Protected by @job_mutex.
>  	 */
>  	struct dma_fence *fence;
> -	/**
> -	 * @vm_update_sa: For integrated, used to suballocate page-tables
> -	 * out of the pt_bo.
> -	 */
> -	struct drm_suballoc_manager vm_update_sa;
>  	/** @min_chunk_size: For dgfx, Minimum chunk size */
>  	u64 min_chunk_size;
>  };
>  
>  #define MAX_PREEMPTDISABLE_TRANSFER SZ_8M /* Around 1ms. */
>  #define MAX_CCS_LIMITED_TRANSFER SZ_4M /* XE_PAGE_SIZE * (FIELD_MAX(XE2_CCS_SIZE_MASK) + 1) */
> -#define NUM_KERNEL_PDE 15
>  #define NUM_PT_SLOTS 32
>  #define LEVEL0_PAGE_TABLE_ENCODE_SIZE SZ_2M
>  #define MAX_NUM_PTE 512
> @@ -107,7 +101,6 @@ static void xe_migrate_fini(void *arg)
>  
>  	dma_fence_put(m->fence);
>  	xe_bo_put(m->pt_bo);
> -	drm_suballoc_manager_fini(&m->vm_update_sa);
>  	mutex_destroy(&m->job_mutex);
>  	xe_vm_close_and_put(m->q->vm);
>  	xe_exec_queue_put(m->q);
> @@ -199,8 +192,6 @@ static int xe_migrate_prepare_vm(struct xe_tile *tile, struct xe_migrate *m,
>  	BUILD_BUG_ON(NUM_PT_SLOTS > SZ_2M/XE_PAGE_SIZE);
>  	/* Must be a multiple of 64K to support all platforms */
>  	BUILD_BUG_ON(NUM_PT_SLOTS * XE_PAGE_SIZE % SZ_64K);
> -	/* And one slot reserved for the 4KiB page table updates */
> -	BUILD_BUG_ON(!(NUM_KERNEL_PDE & 1));
>  
>  	/* Need to be sure everything fits in the first PT, or create more */
>  	xe_tile_assert(tile, m->batch_base_ofs + batch->size < SZ_2M);
> @@ -333,8 +324,6 @@ static int xe_migrate_prepare_vm(struct xe_tile *tile, struct xe_migrate *m,
>  	/*
>  	 * Example layout created above, with root level = 3:
>  	 * [PT0...PT7]: kernel PT's for copy/clear; 64 or 4KiB PTE's
> -	 * [PT8]: Kernel PT for VM_BIND, 4 KiB PTE's
> -	 * [PT9...PT26]: Userspace PT's for VM_BIND, 4 KiB PTE's
>  	 * [PT27 = PDE 0] [PT28 = PDE 1] [PT29 = PDE 2] [PT30 & PT31 = 2M vram identity map]
>  	 *
>  	 * This makes the lowest part of the VM point to the pagetables.
> @@ -342,19 +331,10 @@ static int xe_migrate_prepare_vm(struct xe_tile *tile, struct xe_migrate *m,
>  	 * and flushes, other parts of the VM can be used either for copying and
>  	 * clearing.
>  	 *
> -	 * For performance, the kernel reserves PDE's, so about 20 are left
> -	 * for async VM updates.
> -	 *
>  	 * To make it easier to work, each scratch PT is put in slot (1 + PT #)
>  	 * everywhere, this allows lockless updates to scratch pages by using
>  	 * the different addresses in VM.
>  	 */
> -#define NUM_VMUSA_UNIT_PER_PAGE	32
> -#define VM_SA_UPDATE_UNIT_SIZE		(XE_PAGE_SIZE / NUM_VMUSA_UNIT_PER_PAGE)
> -#define NUM_VMUSA_WRITES_PER_UNIT	(VM_SA_UPDATE_UNIT_SIZE / sizeof(u64))
> -	drm_suballoc_manager_init(&m->vm_update_sa,
> -				  (size_t)(map_ofs / XE_PAGE_SIZE - NUM_KERNEL_PDE) *
> -				  NUM_VMUSA_UNIT_PER_PAGE, 0);
>  
>  	m->pt_bo = bo;
>  	return 0;
> @@ -1193,56 +1173,6 @@ struct dma_fence *xe_migrate_clear(struct xe_migrate *m,
>  	return fence;
>  }
>  
> -static void write_pgtable(struct xe_tile *tile, struct xe_bb *bb, u64 ppgtt_ofs,
> -			  const struct xe_vm_pgtable_update_op *pt_op,
> -			  const struct xe_vm_pgtable_update *update,
> -			  struct xe_migrate_pt_update *pt_update)
> -{
> -	const struct xe_migrate_pt_update_ops *ops = pt_update->ops;
> -	struct xe_vm *vm = pt_update->vops->vm;
> -	u32 chunk;
> -	u32 ofs = update->ofs, size = update->qwords;
> -
> -	/*
> -	 * If we have 512 entries (max), we would populate it ourselves,
> -	 * and update the PDE above it to the new pointer.
> -	 * The only time this can only happen if we have to update the top
> -	 * PDE. This requires a BO that is almost vm->size big.
> -	 *
> -	 * This shouldn't be possible in practice.. might change when 16K
> -	 * pages are used. Hence the assert.
> -	 */
> -	xe_tile_assert(tile, update->qwords < MAX_NUM_PTE);
> -	if (!ppgtt_ofs)
> -		ppgtt_ofs = xe_migrate_vram_ofs(tile_to_xe(tile),
> -						xe_bo_addr(update->pt_bo, 0,
> -							  XE_PAGE_SIZE), false);
> -
> -	do {
> -		u64 addr = ppgtt_ofs + ofs * 8;
> -
> -		chunk = min(size, MAX_PTE_PER_SDI);
> -
> -		/* Ensure populatefn can do memset64 by aligning bb->cs */
> -		if (!(bb->len & 1))
> -			bb->cs[bb->len++] = MI_NOOP;
> -
> -		bb->cs[bb->len++] = MI_STORE_DATA_IMM | MI_SDI_NUM_QW(chunk);
> -		bb->cs[bb->len++] = lower_32_bits(addr);
> -		bb->cs[bb->len++] = upper_32_bits(addr);
> -		if (pt_op->bind)
> -			ops->populate(tile, NULL, bb->cs + bb->len,
> -				      ofs, chunk, update);
> -		else
> -			ops->clear(vm, tile, NULL, bb->cs + bb->len,
> -				   ofs, chunk, update);
> -
> -		bb->len += chunk * 2;
> -		ofs += chunk;
> -		size -= chunk;
> -	} while (size);
> -}
> -
>  struct xe_vm *xe_migrate_get_vm(struct xe_migrate *m)
>  {
>  	return xe_vm_get(m->q->vm);
> @@ -1258,7 +1188,18 @@ struct migrate_test_params {
>  	container_of(_priv, struct migrate_test_params, base)
>  #endif
>  
> -static void
> +/**
> + * __xe_migrate_update_pgtables_cpu() - Update a VM's PTEs via the CPU
> + * @vm: The VM being updated
> + * @tile: The tile being updated
> + * @ops: The migrate PT update ops
> + * @pt_op: The VM PT update ops
> + * @num_ops: The number of VM PT update ops
> + *
> + * Execute the VM PT update ops array which results in a VM's PTEs being updated
> + * via the CPU.
> + */
> +void
>  __xe_migrate_update_pgtables_cpu(struct xe_vm *vm, struct xe_tile *tile,
>  				 const struct xe_migrate_pt_update_ops *ops,
>  				 struct xe_vm_pgtable_update_op *pt_op,
> @@ -1314,7 +1255,7 @@ xe_migrate_update_pgtables_cpu(struct xe_migrate *m,
>  	}
>  
>  	__xe_migrate_update_pgtables_cpu(vm, m->tile, ops,
> -					 pt_update_ops->ops,
> +					 pt_update_ops->pt_job_ops->ops,
>  					 pt_update_ops->num_ops);
>  
>  	return dma_fence_get_stub();
> @@ -1327,161 +1268,19 @@ __xe_migrate_update_pgtables(struct xe_migrate *m,
>  {
>  	const struct xe_migrate_pt_update_ops *ops = pt_update->ops;
>  	struct xe_tile *tile = m->tile;
> -	struct xe_gt *gt = tile->primary_gt;
> -	struct xe_device *xe = tile_to_xe(tile);
>  	struct xe_sched_job *job;
>  	struct dma_fence *fence;
> -	struct drm_suballoc *sa_bo = NULL;
> -	struct xe_bb *bb;
> -	u32 i, j, batch_size = 0, ppgtt_ofs, update_idx, page_ofs = 0;
> -	u32 num_updates = 0, current_update = 0;
> -	u64 addr;
> -	int err = 0;
>  	bool is_migrate = pt_update_ops->q == m->q;
> -	bool usm = is_migrate && xe->info.has_usm;
> -
> -	for (i = 0; i < pt_update_ops->num_ops; ++i) {
> -		struct xe_vm_pgtable_update_op *pt_op = &pt_update_ops->ops[i];
> -		struct xe_vm_pgtable_update *updates = pt_op->entries;
> -
> -		num_updates += pt_op->num_entries;
> -		for (j = 0; j < pt_op->num_entries; ++j) {
> -			u32 num_cmds = DIV_ROUND_UP(updates[j].qwords,
> -						   MAX_PTE_PER_SDI);
> -
> -			/* align noop + MI_STORE_DATA_IMM cmd prefix */
> -			batch_size += 4 * num_cmds + updates[j].qwords * 2;
> -		}
> -	}
> -
> -	/* fixed + PTE entries */
> -	if (IS_DGFX(xe))
> -		batch_size += 2;
> -	else
> -		batch_size += 6 * (num_updates / MAX_PTE_PER_SDI + 1) +
> -			num_updates * 2;
> -
> -	bb = xe_bb_new(gt, batch_size, usm);
> -	if (IS_ERR(bb))
> -		return ERR_CAST(bb);
> -
> -	/* For sysmem PTE's, need to map them in our hole.. */
> -	if (!IS_DGFX(xe)) {
> -		u16 pat_index = xe->pat.idx[XE_CACHE_WB];
> -		u32 ptes, ofs;
> -
> -		ppgtt_ofs = NUM_KERNEL_PDE - 1;
> -		if (!is_migrate) {
> -			u32 num_units = DIV_ROUND_UP(num_updates,
> -						    NUM_VMUSA_WRITES_PER_UNIT);
> -
> -			if (num_units > m->vm_update_sa.size) {
> -				err = -ENOBUFS;
> -				goto err_bb;
> -			}
> -			sa_bo = drm_suballoc_new(&m->vm_update_sa, num_units,
> -						 GFP_KERNEL, true, 0);
> -			if (IS_ERR(sa_bo)) {
> -				err = PTR_ERR(sa_bo);
> -				goto err_bb;
> -			}
> -
> -			ppgtt_ofs = NUM_KERNEL_PDE +
> -				(drm_suballoc_soffset(sa_bo) /
> -				 NUM_VMUSA_UNIT_PER_PAGE);
> -			page_ofs = (drm_suballoc_soffset(sa_bo) %
> -				    NUM_VMUSA_UNIT_PER_PAGE) *
> -				VM_SA_UPDATE_UNIT_SIZE;
> -		}
> -
> -		/* Map our PT's to gtt */
> -		i = 0;
> -		j = 0;
> -		ptes = num_updates;
> -		ofs = ppgtt_ofs * XE_PAGE_SIZE + page_ofs;
> -		while (ptes) {
> -			u32 chunk = min(MAX_PTE_PER_SDI, ptes);
> -			u32 idx = 0;
> -
> -			bb->cs[bb->len++] = MI_STORE_DATA_IMM |
> -				MI_SDI_NUM_QW(chunk);
> -			bb->cs[bb->len++] = ofs;
> -			bb->cs[bb->len++] = 0; /* upper_32_bits */
> -
> -			for (; i < pt_update_ops->num_ops; ++i) {
> -				struct xe_vm_pgtable_update_op *pt_op =
> -					&pt_update_ops->ops[i];
> -				struct xe_vm_pgtable_update *updates = pt_op->entries;
> -
> -				for (; j < pt_op->num_entries; ++j, ++current_update, ++idx) {
> -					struct xe_vm *vm = pt_update->vops->vm;
> -					struct xe_bo *pt_bo = updates[j].pt_bo;
> -
> -					if (idx == chunk)
> -						goto next_cmd;
> -
> -					xe_tile_assert(tile, pt_bo->size == SZ_4K);
> -
> -					/* Map a PT at most once */
> -					if (pt_bo->update_index < 0)
> -						pt_bo->update_index = current_update;
> -
> -					addr = vm->pt_ops->pte_encode_bo(pt_bo, 0,
> -									 pat_index, 0);
> -					bb->cs[bb->len++] = lower_32_bits(addr);
> -					bb->cs[bb->len++] = upper_32_bits(addr);
> -				}
> -
> -				j = 0;
> -			}
> -
> -next_cmd:
> -			ptes -= chunk;
> -			ofs += chunk * sizeof(u64);
> -		}
> -
> -		bb->cs[bb->len++] = MI_BATCH_BUFFER_END;
> -		update_idx = bb->len;
> -
> -		addr = xe_migrate_vm_addr(ppgtt_ofs, 0) +
> -			(page_ofs / sizeof(u64)) * XE_PAGE_SIZE;
> -		for (i = 0; i < pt_update_ops->num_ops; ++i) {
> -			struct xe_vm_pgtable_update_op *pt_op =
> -				&pt_update_ops->ops[i];
> -			struct xe_vm_pgtable_update *updates = pt_op->entries;
> -
> -			for (j = 0; j < pt_op->num_entries; ++j) {
> -				struct xe_bo *pt_bo = updates[j].pt_bo;
> -
> -				write_pgtable(tile, bb, addr +
> -					      pt_bo->update_index * XE_PAGE_SIZE,
> -					      pt_op, &updates[j], pt_update);
> -			}
> -		}
> -	} else {
> -		/* phys pages, no preamble required */
> -		bb->cs[bb->len++] = MI_BATCH_BUFFER_END;
> -		update_idx = bb->len;
> -
> -		for (i = 0; i < pt_update_ops->num_ops; ++i) {
> -			struct xe_vm_pgtable_update_op *pt_op =
> -				&pt_update_ops->ops[i];
> -			struct xe_vm_pgtable_update *updates = pt_op->entries;
> -
> -			for (j = 0; j < pt_op->num_entries; ++j)
> -				write_pgtable(tile, bb, 0, pt_op, &updates[j],
> -					      pt_update);
> -		}
> -	}
> +	int err;
>  
> -	job = xe_bb_create_migration_job(pt_update_ops->q, bb,
> -					 xe_migrate_batch_base(m, usm),
> -					 update_idx);
> +	job = xe_sched_job_create(pt_update_ops->q, NULL);
>  	if (IS_ERR(job)) {
>  		err = PTR_ERR(job);
> -		goto err_sa;
> +		goto err_out;
>  	}
>  
> +	xe_tile_assert(tile, job->is_pt_job);
> +
>  	if (ops->pre_commit) {
>  		pt_update->job = job;
>  		err = ops->pre_commit(pt_update);
> @@ -1491,6 +1290,12 @@ __xe_migrate_update_pgtables(struct xe_migrate *m,
>  	if (is_migrate)
>  		mutex_lock(&m->job_mutex);
>  
> +	job->pt_update[0].vm = pt_update->vops->vm;
> +	job->pt_update[0].tile = tile;
> +	job->pt_update[0].ops = ops;
> +	job->pt_update[0].pt_job_ops =
> +		xe_pt_job_ops_get(pt_update_ops->pt_job_ops);
> +
>  	xe_sched_job_arm(job);
>  	fence = dma_fence_get(&job->drm.s_fence->finished);
>  	xe_sched_job_push(job);
> @@ -1498,17 +1303,11 @@ __xe_migrate_update_pgtables(struct xe_migrate *m,
>  	if (is_migrate)
>  		mutex_unlock(&m->job_mutex);
>  
> -	xe_bb_free(bb, fence);
> -	drm_suballoc_free(sa_bo, fence);
> -
>  	return fence;
>  
>  err_job:
>  	xe_sched_job_put(job);
> -err_sa:
> -	drm_suballoc_free(sa_bo, NULL);
> -err_bb:
> -	xe_bb_free(bb, NULL);
> +err_out:
>  	return ERR_PTR(err);
>  }
>  
> diff --git a/drivers/gpu/drm/xe/xe_migrate.h b/drivers/gpu/drm/xe/xe_migrate.h
> index b064455b604e..0986ffdd8d9a 100644
> --- a/drivers/gpu/drm/xe/xe_migrate.h
> +++ b/drivers/gpu/drm/xe/xe_migrate.h
> @@ -22,6 +22,7 @@ struct xe_pt;
>  struct xe_tile;
>  struct xe_vm;
>  struct xe_vm_pgtable_update;
> +struct xe_vm_pgtable_update_op;
>  struct xe_vma;
>  
>  /**
> @@ -125,6 +126,11 @@ struct dma_fence *xe_migrate_clear(struct xe_migrate *m,
>  
>  struct xe_vm *xe_migrate_get_vm(struct xe_migrate *m);
>  
> +void __xe_migrate_update_pgtables_cpu(struct xe_vm *vm, struct xe_tile *tile,
> +				      const struct xe_migrate_pt_update_ops *ops,
> +				      struct xe_vm_pgtable_update_op *pt_op,
> +				      int num_ops);
> +
>  struct dma_fence *
>  xe_migrate_update_pgtables(struct xe_migrate *m,
>  			   struct xe_migrate_pt_update *pt_update);
> diff --git a/drivers/gpu/drm/xe/xe_pt.c b/drivers/gpu/drm/xe/xe_pt.c
> index db1c363a65d5..1ad31f444b79 100644
> --- a/drivers/gpu/drm/xe/xe_pt.c
> +++ b/drivers/gpu/drm/xe/xe_pt.c
> @@ -200,7 +200,9 @@ unsigned int xe_pt_shift(unsigned int level)
>   * and finally frees @pt. TODO: Can we remove the @flags argument?
>   */
>  void xe_pt_destroy(struct xe_pt *pt, u32 flags, struct llist_head
> *deferred)
> +
>  {
> +	bool added = false;
>  	int i;
>  
>  	if (!pt)
> @@ -208,7 +210,18 @@ void xe_pt_destroy(struct xe_pt *pt, u32 flags,
> struct llist_head *deferred)
>  
>  	XE_WARN_ON(!list_empty(&pt->bo->ttm.base.gpuva.list));
>  	xe_bo_unpin(pt->bo);
> -	xe_bo_put_deferred(pt->bo, deferred);
> +	xe_bo_put_deferred(pt->bo, deferred, &added);
> +	if (added) {
> +		/*
> +		 * We need the VM present until the BO is destroyed
> as it shares
> +		 * a dma-resv and BO destroy is async. Reinit BO
> refcount so
> +		 * xe_bo_put_async can be used when the PT job ops
> refcount goes
> +		 * to zero.
> +		 */
> +		xe_vm_get(pt->bo->vm);
> +		pt->bo->flags |= XE_BO_FLAG_PUT_VM_ASYNC;
> +		kref_init(&pt->bo->ttm.base.refcount);
> +	}
>  
>  	if (pt->level > 0 && pt->num_live) {
>  		struct xe_pt_dir *pt_dir = as_xe_pt_dir(pt);
> @@ -361,7 +374,7 @@ xe_pt_new_shared(struct xe_walk_update *wupd,
> struct xe_pt *parent,
>  	entry->pt = parent;
>  	entry->flags = 0;
>  	entry->qwords = 0;
> -	entry->pt_bo->update_index = -1;
> +	entry->level = parent->level;
>  
>  	if (alloc_entries) {
>  		entry->pt_entries = kmalloc_array(XE_PDES,
> @@ -1739,7 +1752,7 @@ xe_migrate_clear_pgtable_callback(struct xe_vm
> *vm, struct xe_tile *tile,
>  				  u32 qword_ofs, u32 num_qwords,
>  				  const struct xe_vm_pgtable_update
> *update)
>  {
> -	u64 empty = __xe_pt_empty_pte(tile, vm, update->pt->level);
> +	u64 empty = __xe_pt_empty_pte(tile, vm, update->level);
>  	int i;
>  
>  	if (map && map->is_iomem)
> @@ -1805,13 +1818,20 @@ xe_pt_commit_prepare_unbind(struct xe_vma
> *vma,
>  	}
>  }
>  
> +static struct xe_vm_pgtable_update_op *
> +to_pt_op(struct xe_vm_pgtable_update_ops *pt_update_ops, u32
> current_op)
> +{
> +	return &pt_update_ops->pt_job_ops->ops[current_op];
> +}
> +
>  static void
>  xe_pt_update_ops_rfence_interval(struct xe_vm_pgtable_update_ops
> *pt_update_ops,
>  				 u64 start, u64 end)
>  {
>  	u64 last;
> -	u32 current_op = pt_update_ops->current_op;
> -	struct xe_vm_pgtable_update_op *pt_op = &pt_update_ops-
> >ops[current_op];
> +	u32 current_op = pt_update_ops->pt_job_ops->current_op;
> +	struct xe_vm_pgtable_update_op *pt_op =
> +		to_pt_op(pt_update_ops, current_op);
>  	int i, level = 0;
>  
>  	for (i = 0; i < pt_op->num_entries; i++) {
> @@ -1846,8 +1866,9 @@ static int bind_op_prepare(struct xe_vm *vm,
> struct xe_tile *tile,
>  			   struct xe_vm_pgtable_update_ops
> *pt_update_ops,
>  			   struct xe_vma *vma, bool
> invalidate_on_bind)
>  {
> -	u32 current_op = pt_update_ops->current_op;
> -	struct xe_vm_pgtable_update_op *pt_op = &pt_update_ops-
> >ops[current_op];
> +	u32 current_op = pt_update_ops->pt_job_ops->current_op;
> +	struct xe_vm_pgtable_update_op *pt_op =
> +		to_pt_op(pt_update_ops, current_op);
>  	int err;
>  
>  	xe_tile_assert(tile, !xe_vma_is_cpu_addr_mirror(vma));
> @@ -1876,7 +1897,7 @@ static int bind_op_prepare(struct xe_vm *vm,
> struct xe_tile *tile,
>  		xe_pt_update_ops_rfence_interval(pt_update_ops,
>  						 xe_vma_start(vma),
>  						 xe_vma_end(vma));
> -		++pt_update_ops->current_op;
> +		++pt_update_ops->pt_job_ops->current_op;
>  		pt_update_ops->needs_userptr_lock |=
> xe_vma_is_userptr(vma);
>  
>  		/*
> @@ -1913,8 +1934,9 @@ static int bind_range_prepare(struct xe_vm *vm,
> struct xe_tile *tile,
>  			      struct xe_vm_pgtable_update_ops
> *pt_update_ops,
>  			      struct xe_vma *vma, struct
> xe_svm_range *range)
>  {
> -	u32 current_op = pt_update_ops->current_op;
> -	struct xe_vm_pgtable_update_op *pt_op = &pt_update_ops-
> >ops[current_op];
> +	u32 current_op = pt_update_ops->pt_job_ops->current_op;
> +	struct xe_vm_pgtable_update_op *pt_op =
> +		to_pt_op(pt_update_ops, current_op);
>  	int err;
>  
>  	xe_tile_assert(tile, xe_vma_is_cpu_addr_mirror(vma));
> @@ -1938,7 +1960,7 @@ static int bind_range_prepare(struct xe_vm *vm,
> struct xe_tile *tile,
>  		xe_pt_update_ops_rfence_interval(pt_update_ops,
>  						 range-
> >base.itree.start,
>  						 range-
> >base.itree.last + 1);
> -		++pt_update_ops->current_op;
> +		++pt_update_ops->pt_job_ops->current_op;
>  		pt_update_ops->needs_svm_lock = true;
>  
>  		pt_op->vma = vma;
> @@ -1955,8 +1977,9 @@ static int unbind_op_prepare(struct xe_tile
> *tile,
>  			     struct xe_vm_pgtable_update_ops
> *pt_update_ops,
>  			     struct xe_vma *vma)
>  {
> -	u32 current_op = pt_update_ops->current_op;
> -	struct xe_vm_pgtable_update_op *pt_op = &pt_update_ops-
> >ops[current_op];
> +	u32 current_op = pt_update_ops->pt_job_ops->current_op;
> +	struct xe_vm_pgtable_update_op *pt_op =
> +		to_pt_op(pt_update_ops, current_op);
>  	int err;
>  
>  	if (!((vma->tile_present | vma->tile_staged) & BIT(tile-
> >id)))
> @@ -1984,7 +2007,7 @@ static int unbind_op_prepare(struct xe_tile
> *tile,
>  				pt_op->num_entries, false);
>  	xe_pt_update_ops_rfence_interval(pt_update_ops,
> xe_vma_start(vma),
>  					 xe_vma_end(vma));
> -	++pt_update_ops->current_op;
> +	++pt_update_ops->pt_job_ops->current_op;
>  	pt_update_ops->needs_userptr_lock |= xe_vma_is_userptr(vma);
>  	pt_update_ops->needs_invalidation = true;
>  
> @@ -1998,8 +2021,9 @@ static int unbind_range_prepare(struct xe_vm
> *vm,
>  				struct xe_vm_pgtable_update_ops
> *pt_update_ops,
>  				struct xe_svm_range *range)
>  {
> -	u32 current_op = pt_update_ops->current_op;
> -	struct xe_vm_pgtable_update_op *pt_op = &pt_update_ops-
> >ops[current_op];
> +	u32 current_op = pt_update_ops->pt_job_ops->current_op;
> +	struct xe_vm_pgtable_update_op *pt_op =
> +		to_pt_op(pt_update_ops, current_op);
>  
>  	if (!(range->tile_present & BIT(tile->id)))
>  		return 0;
> @@ -2019,7 +2043,7 @@ static int unbind_range_prepare(struct xe_vm
> *vm,
>  				pt_op->num_entries, false);
>  	xe_pt_update_ops_rfence_interval(pt_update_ops, range-
> >base.itree.start,
>  					 range->base.itree.last +
> 1);
> -	++pt_update_ops->current_op;
> +	++pt_update_ops->pt_job_ops->current_op;
>  	pt_update_ops->needs_svm_lock = true;
>  	pt_update_ops->needs_invalidation = true;
>  
> @@ -2122,7 +2146,6 @@ static int op_prepare(struct xe_vm *vm,
>  static void
>  xe_pt_update_ops_init(struct xe_vm_pgtable_update_ops
> *pt_update_ops)
>  {
> -	init_llist_head(&pt_update_ops->deferred);
>  	pt_update_ops->start = ~0x0ull;
>  	pt_update_ops->last = 0x0ull;
>  }
> @@ -2163,7 +2186,7 @@ int xe_pt_update_ops_prepare(struct xe_tile
> *tile, struct xe_vma_ops *vops)
>  			return err;
>  	}
>  
> -	xe_tile_assert(tile, pt_update_ops->current_op <=
> +	xe_tile_assert(tile, pt_update_ops->pt_job_ops->current_op
> <=
>  		       pt_update_ops->num_ops);
>  
>  #ifdef TEST_VM_OPS_ERROR
> @@ -2396,7 +2419,7 @@ xe_pt_update_ops_run(struct xe_tile *tile,
> struct xe_vma_ops *vops)
>  	lockdep_assert_held(&vm->lock);
>  	xe_vm_assert_held(vm);
>  
> -	if (!pt_update_ops->current_op) {
> +	if (!pt_update_ops->pt_job_ops->current_op) {
>  		xe_tile_assert(tile, xe_vm_in_fault_mode(vm));
>  
>  		return dma_fence_get_stub();
> @@ -2445,12 +2468,16 @@ xe_pt_update_ops_run(struct xe_tile *tile,
> struct xe_vma_ops *vops)
>  		goto free_rfence;
>  	}
>  
> -	/* Point of no return - VM killed if failure after this */
> -	for (i = 0; i < pt_update_ops->current_op; ++i) {
> -		struct xe_vm_pgtable_update_op *pt_op =
> &pt_update_ops->ops[i];
> +	/*
> +	 * Point of no return - VM killed if failure after this
> +	 */
> +	for (i = 0; i < pt_update_ops->pt_job_ops->current_op; ++i)
> {
> +		struct xe_vm_pgtable_update_op *pt_op =
> +			to_pt_op(pt_update_ops, i);
>  
>  		xe_pt_commit(pt_op->vma, pt_op->entries,
> -			     pt_op->num_entries, &pt_update_ops-
> >deferred);
> +			     pt_op->num_entries,
> +			     &pt_update_ops->pt_job_ops->deferred);
>  		pt_op->vma = NULL;	/* skip in
> xe_pt_update_ops_abort */
>  	}
>  
> @@ -2530,27 +2557,19 @@ xe_pt_update_ops_run(struct xe_tile *tile,
> struct xe_vma_ops *vops)
>  ALLOW_ERROR_INJECTION(xe_pt_update_ops_run, ERRNO);
>  
>  /**
> - * xe_pt_update_ops_fini() - Finish PT update operations
> - * @tile: Tile of PT update operations
> - * @vops: VMA operations
> + * xe_pt_update_ops_free() - Free PT update operations
> + * @pt_op: Array of PT update operations
> + * @num_ops: Number of PT update operations
>   *
> - * Finish PT update operations by committing to destroy page table
> memory
> + * Free PT update operations
>   */
> -void xe_pt_update_ops_fini(struct xe_tile *tile, struct xe_vma_ops
> *vops)
> +static void xe_pt_update_ops_free(struct xe_vm_pgtable_update_op
> *pt_op,
> +				  u32 num_ops)
>  {
> -	struct xe_vm_pgtable_update_ops *pt_update_ops =
> -		&vops->pt_update_ops[tile->id];
> -	int i;
> -
> -	lockdep_assert_held(&vops->vm->lock);
> -	xe_vm_assert_held(vops->vm);
> -
> -	for (i = 0; i < pt_update_ops->current_op; ++i) {
> -		struct xe_vm_pgtable_update_op *pt_op =
> &pt_update_ops->ops[i];
> +	u32 i;
>  
> +	for (i = 0; i < num_ops; ++i, ++pt_op)
>  		xe_pt_free_bind(pt_op->entries, pt_op->num_entries);
> -	}
> -	xe_bo_put_commit(&vops->pt_update_ops[tile->id].deferred);
>  }
>  
>  /**
> @@ -2571,9 +2590,9 @@ void xe_pt_update_ops_abort(struct xe_tile
> *tile, struct xe_vma_ops *vops)
>  
>  	for (i = pt_update_ops->num_ops - 1; i >= 0; --i) {
>  		struct xe_vm_pgtable_update_op *pt_op =
> -			&pt_update_ops->ops[i];
> +			to_pt_op(pt_update_ops, i);
>  
> -		if (!pt_op->vma || i >= pt_update_ops->current_op)
> +		if (!pt_op->vma || i >= pt_update_ops->pt_job_ops-
> >current_op)
>  			continue;
>  
>  		if (pt_op->bind)
> @@ -2584,6 +2603,89 @@ void xe_pt_update_ops_abort(struct xe_tile
> *tile, struct xe_vma_ops *vops)
>  			xe_pt_abort_unbind(pt_op->vma, pt_op-
> >entries,
>  					   pt_op->num_entries);
>  	}
> +}
> +
> +/**
> + * xe_pt_job_ops_alloc() - Allocate PT job ops
> + * @num_ops: Number of VM PT update ops
> + *
> + * Allocate PT job ops and internal array of VM PT update ops.
> + *
> + * Return: Pointer to PT job ops or NULL
> + */
> +struct xe_pt_job_ops *xe_pt_job_ops_alloc(u32 num_ops)
> +{
> +	struct xe_pt_job_ops *pt_job_ops;
> +
> +	pt_job_ops = kmalloc(sizeof(*pt_job_ops), GFP_KERNEL);
> +	if (!pt_job_ops)
> +		return NULL;
> +
> +	pt_job_ops->ops = kvmalloc_array(num_ops,
> sizeof(*pt_job_ops->ops),
> +					 GFP_KERNEL);
> +	if (!pt_job_ops->ops) {
> +		kvfree(pt_job_ops);
> +		return NULL;
> +	}
> +
> +	pt_job_ops->current_op = 0;
> +	kref_init(&pt_job_ops->refcount);
> +	init_llist_head(&pt_job_ops->deferred);
> +
> +	return pt_job_ops;
> +}
> +
> +/**
> + * xe_pt_job_ops_get() - Get PT job ops
> + * @pt_job_ops: PT job ops to get
> + *
> + * Take a reference to PT job ops
> + *
> + * Return: Pointer to PT job ops or NULL
> + */
> +struct xe_pt_job_ops *xe_pt_job_ops_get(struct xe_pt_job_ops
> *pt_job_ops)
> +{
> +	if (pt_job_ops)
> +		kref_get(&pt_job_ops->refcount);
> +
> +	return pt_job_ops;
> +}
> +
> +static void xe_pt_job_ops_destroy(struct kref *ref)
> +{
> +	struct xe_pt_job_ops *pt_job_ops =
> +		container_of(ref, struct xe_pt_job_ops, refcount);
> +	struct llist_node *freed;
> +	struct xe_bo *bo, *next;
> +
> +	xe_pt_update_ops_free(pt_job_ops->ops,
> +			      pt_job_ops->current_op);
> +
> +	freed = llist_del_all(&pt_job_ops->deferred);
> +	if (freed) {
> +		llist_for_each_entry_safe(bo, next, freed, freed)
> +			/*
> +			 * If called from run_job, we are in the
> dma-fencing
> +			 * path and cannot take dma-resv locks so
> use an async
> +			 * put.
> +			 */
> +			xe_bo_put_async(bo);
> +	}
> +
> +	kvfree(pt_job_ops->ops);
> +	kfree(pt_job_ops);
> +}
> +
> +/**
> + * xe_pt_job_ops_put() - Put PT job ops
> + * @pt_job_ops: PT job ops to put
> + *
> + * Drop a reference to PT job ops
> + */
> +void xe_pt_job_ops_put(struct xe_pt_job_ops *pt_job_ops)
> +{
> +	if (!pt_job_ops)
> +		return;
>  
> -	xe_pt_update_ops_fini(tile, vops);
> +	kref_put(&pt_job_ops->refcount, xe_pt_job_ops_destroy);
>  }
> diff --git a/drivers/gpu/drm/xe/xe_pt.h b/drivers/gpu/drm/xe/xe_pt.h
> index 5ecf003d513c..c9904573db82 100644
> --- a/drivers/gpu/drm/xe/xe_pt.h
> +++ b/drivers/gpu/drm/xe/xe_pt.h
> @@ -41,11 +41,14 @@ void xe_pt_clear(struct xe_device *xe, struct
> xe_pt *pt);
>  int xe_pt_update_ops_prepare(struct xe_tile *tile, struct xe_vma_ops
> *vops);
>  struct dma_fence *xe_pt_update_ops_run(struct xe_tile *tile,
>  				       struct xe_vma_ops *vops);
> -void xe_pt_update_ops_fini(struct xe_tile *tile, struct xe_vma_ops
> *vops);
>  void xe_pt_update_ops_abort(struct xe_tile *tile, struct xe_vma_ops
> *vops);
>  
>  bool xe_pt_zap_ptes(struct xe_tile *tile, struct xe_vma *vma);
>  bool xe_pt_zap_ptes_range(struct xe_tile *tile, struct xe_vm *vm,
>  			  struct xe_svm_range *range);
>  
> +struct xe_pt_job_ops *xe_pt_job_ops_alloc(u32 num_ops);
> +struct xe_pt_job_ops *xe_pt_job_ops_get(struct xe_pt_job_ops
> *pt_job_ops);
> +void xe_pt_job_ops_put(struct xe_pt_job_ops *pt_job_ops);
> +
>  #endif
> diff --git a/drivers/gpu/drm/xe/xe_pt_types.h
> b/drivers/gpu/drm/xe/xe_pt_types.h
> index 69eab6f37cfe..33d0d20e0ac6 100644
> --- a/drivers/gpu/drm/xe/xe_pt_types.h
> +++ b/drivers/gpu/drm/xe/xe_pt_types.h
> @@ -70,6 +70,9 @@ struct xe_vm_pgtable_update {
>  	/** @pt_entries: Newly added pagetable entries */
>  	struct xe_pt_entry *pt_entries;
>  
> +	/** @level: level of update */
> +	unsigned int level;
> +
>  	/** @flags: Target flags */
>  	u32 flags;
>  };
> @@ -88,12 +91,28 @@ struct xe_vm_pgtable_update_op {
>  	bool rebind;
>  };
>  
> -/** struct xe_vm_pgtable_update_ops: page table update operations */
> -struct xe_vm_pgtable_update_ops {
> -	/** @ops: operations */
> -	struct xe_vm_pgtable_update_op *ops;
> +/**
> + * struct xe_pt_job_ops: page table update operations dynamic
> allocation
> + *
> + * This is the part of struct xe_vma_ops and struct
> xe_vm_pgtable_update_ops
> + * which is dynamic allocated as it must be available until the bind
> job is
> + * complete.
> + */
> +struct xe_pt_job_ops {
> +	/** @current_op: current operations */
> +	u32 current_op;
> +	/** @refcount: ref count ops allocation */
> +	struct kref refcount;
>  	/** @deferred: deferred list to destroy PT entries */
>  	struct llist_head deferred;
> +	/** @ops: operations */
> +	struct xe_vm_pgtable_update_op *ops;
> +};
> +
> +/** struct xe_vm_pgtable_update_ops: page table update operations */
> +struct xe_vm_pgtable_update_ops {
> +	/** @pt_job_ops: PT update operations dynamic allocation*/
> +	struct xe_pt_job_ops *pt_job_ops;
>  	/** @q: exec queue for PT operations */
>  	struct xe_exec_queue *q;
>  	/** @start: start address of ops */
> @@ -102,8 +121,6 @@ struct xe_vm_pgtable_update_ops {
>  	u64 last;
>  	/** @num_ops: number of operations */
>  	u32 num_ops;
> -	/** @current_op: current operations */
> -	u32 current_op;
>  	/** @needs_svm_lock: Needs SVM lock */
>  	bool needs_svm_lock;
>  	/** @needs_userptr_lock: Needs userptr lock */
> diff --git a/drivers/gpu/drm/xe/xe_sched_job.c
> b/drivers/gpu/drm/xe/xe_sched_job.c
> index d21bf8f26964..09cdd14d9ef7 100644
> --- a/drivers/gpu/drm/xe/xe_sched_job.c
> +++ b/drivers/gpu/drm/xe/xe_sched_job.c
> @@ -26,19 +26,22 @@ static struct kmem_cache
> *xe_sched_job_parallel_slab;
>  
>  int __init xe_sched_job_module_init(void)
>  {
> +	struct xe_sched_job *job;
> +	size_t size;
> +
> +	size = struct_size(job, ptrs, 1);
>  	xe_sched_job_slab =
> -		kmem_cache_create("xe_sched_job",
> -				  sizeof(struct xe_sched_job) +
> -				  sizeof(struct xe_job_ptrs), 0,
> +		kmem_cache_create("xe_sched_job", size, 0,
>  				  SLAB_HWCACHE_ALIGN, NULL);
>  	if (!xe_sched_job_slab)
>  		return -ENOMEM;
>  
> +	size = max_t(size_t,
> +		     struct_size(job, ptrs,
> +				 XE_HW_ENGINE_MAX_INSTANCE),
> +		     struct_size(job, pt_update, 1));
>  	xe_sched_job_parallel_slab =
> -		kmem_cache_create("xe_sched_job_parallel",
> -				  sizeof(struct xe_sched_job) +
> -				  sizeof(struct xe_job_ptrs) *
> -				  XE_HW_ENGINE_MAX_INSTANCE, 0,
> +		kmem_cache_create("xe_sched_job_parallel", size, 0,
>  				  SLAB_HWCACHE_ALIGN, NULL);
>  	if (!xe_sched_job_parallel_slab) {
>  		kmem_cache_destroy(xe_sched_job_slab);
> @@ -84,7 +87,7 @@ static void xe_sched_job_free_fences(struct
> xe_sched_job *job)
>  {
>  	int i;
>  
> -	for (i = 0; i < job->q->width; ++i) {
> +	for (i = 0; !job->is_pt_job && i < job->q->width; ++i) {
>  		struct xe_job_ptrs *ptrs = &job->ptrs[i];
>  
>  		if (ptrs->lrc_fence)
> @@ -118,33 +121,44 @@ struct xe_sched_job *xe_sched_job_create(struct
> xe_exec_queue *q,
>  	if (err)
>  		goto err_free;
>  
> -	for (i = 0; i < q->width; ++i) {
> -		struct dma_fence *fence =
> xe_lrc_alloc_seqno_fence();
> -		struct dma_fence_chain *chain;
> -
> -		if (IS_ERR(fence)) {
> -			err = PTR_ERR(fence);
> -			goto err_sched_job;
> -		}
> -		job->ptrs[i].lrc_fence = fence;
> -
> -		if (i + 1 == q->width)
> -			continue;
> -
> -		chain = dma_fence_chain_alloc();
> -		if (!chain) {
> +	if (!batch_addr) {
> +		job->fence =
> dma_fence_allocate_private_stub(ktime_get());
> +		if (!job->fence) {
>  			err = -ENOMEM;
>  			goto err_sched_job;
>  		}
> -		job->ptrs[i].chain_fence = chain;
> +		job->is_pt_job = true;
> +	} else {
> +		for (i = 0; i < q->width; ++i) {
> +			struct dma_fence *fence =
> xe_lrc_alloc_seqno_fence();
> +			struct dma_fence_chain *chain;
> +
> +			if (IS_ERR(fence)) {
> +				err = PTR_ERR(fence);
> +				goto err_sched_job;
> +			}
> +			job->ptrs[i].lrc_fence = fence;
> +
> +			if (i + 1 == q->width)
> +				continue;
> +
> +			chain = dma_fence_chain_alloc();
> +			if (!chain) {
> +				err = -ENOMEM;
> +				goto err_sched_job;
> +			}
> +			job->ptrs[i].chain_fence = chain;
> +		}
>  	}
>  
> -	width = q->width;
> -	if (is_migration)
> -		width = 2;
> +	if (batch_addr) {
> +		width = q->width;
> +		if (is_migration)
> +			width = 2;
>  
> -	for (i = 0; i < width; ++i)
> -		job->ptrs[i].batch_addr = batch_addr[i];
> +		for (i = 0; i < width; ++i)
> +			job->ptrs[i].batch_addr = batch_addr[i];
> +	}
>  
>  	xe_pm_runtime_get_noresume(job_to_xe(job));
>  	trace_xe_sched_job_create(job);
> @@ -243,7 +257,7 @@ bool xe_sched_job_completed(struct xe_sched_job
> *job)
>  void xe_sched_job_arm(struct xe_sched_job *job)
>  {
>  	struct xe_exec_queue *q = job->q;
> -	struct dma_fence *fence, *prev;
> +	struct dma_fence *fence = job->fence, *prev;
>  	struct xe_vm *vm = q->vm;
>  	u64 seqno = 0;
>  	int i;
> @@ -263,6 +277,9 @@ void xe_sched_job_arm(struct xe_sched_job *job)
>  		job->ring_ops_flush_tlb = true;
>  	}
>  
> +	if (job->is_pt_job)
> +		goto arm;
> +
>  	/* Arm the pre-allocated fences */
>  	for (i = 0; i < q->width; prev = fence, ++i) {
>  		struct dma_fence_chain *chain;
> @@ -283,6 +300,7 @@ void xe_sched_job_arm(struct xe_sched_job *job)
>  		fence = &chain->base;
>  	}
>  
> +arm:
>  	job->fence = dma_fence_get(fence);	/* Pairs with put in
> scheduler */
>  	drm_sched_job_arm(&job->drm);
>  }
> diff --git a/drivers/gpu/drm/xe/xe_sched_job_types.h
> b/drivers/gpu/drm/xe/xe_sched_job_types.h
> index dbf260dded8d..79a459f2a0a8 100644
> --- a/drivers/gpu/drm/xe/xe_sched_job_types.h
> +++ b/drivers/gpu/drm/xe/xe_sched_job_types.h
> @@ -10,10 +10,29 @@
>  
>  #include <drm/gpu_scheduler.h>
>  
> -struct xe_exec_queue;
>  struct dma_fence;
>  struct dma_fence_chain;
>  
> +struct xe_exec_queue;
> +struct xe_migrate_pt_update_ops;
> +struct xe_pt_job_ops;
> +struct xe_tile;
> +struct xe_vm;
> +
> +/**
> + * struct xe_pt_update_args - PT update arguments
> + */
> +struct xe_pt_update_args {
> +	/** @vm: VM */
> +	struct xe_vm *vm;
> +	/** @tile: Tile */
> +	struct xe_tile *tile;
> +	/** @ops: Migrate PT update ops */
> +	const struct xe_migrate_pt_update_ops *ops;
> +	/** @pt_job_ops: PT update ops */
> +	struct xe_pt_job_ops *pt_job_ops;
> +};
> +
>  /**
>   * struct xe_job_ptrs - Per hw engine instance data
>   */
> @@ -58,8 +77,14 @@ struct xe_sched_job {
>  	bool ring_ops_flush_tlb;
>  	/** @ggtt: mapped in ggtt. */
>  	bool ggtt;
> -	/** @ptrs: per instance pointers. */
> -	struct xe_job_ptrs ptrs[];
> +	/** @is_pt_job: is a PT job */
> +	bool is_pt_job;
> +	union {
> +		/** @ptrs: per instance pointers. */
> +		DECLARE_FLEX_ARRAY(struct xe_job_ptrs, ptrs);
> +		/** @pt_update: PT update arguments */
> +		DECLARE_FLEX_ARRAY(struct xe_pt_update_args,
> pt_update);
> +	};
>  };
>  
>  struct xe_sched_job_snapshot {
> diff --git a/drivers/gpu/drm/xe/xe_vm.c b/drivers/gpu/drm/xe/xe_vm.c
> index 18f967ce1f1a..6fc01fdd7286 100644
> --- a/drivers/gpu/drm/xe/xe_vm.c
> +++ b/drivers/gpu/drm/xe/xe_vm.c
> @@ -780,6 +780,19 @@ int xe_vm_userptr_check_repin(struct xe_vm *vm)
>  		list_empty_careful(&vm->userptr.invalidated)) ? 0 :
> -EAGAIN;
>  }
>  
> +static void xe_vma_ops_init(struct xe_vma_ops *vops, struct xe_vm
> *vm,
> +			    struct xe_exec_queue *q,
> +			    struct xe_sync_entry *syncs, u32
> num_syncs)
> +{
> +	memset(vops, 0, sizeof(*vops));
> +	INIT_LIST_HEAD(&vops->list);
> +	vops->vm = vm;
> +	vops->q = q;
> +	vops->syncs = syncs;
> +	vops->num_syncs = num_syncs;
> +	vops->flags = 0;
> +}
> +
>  static int xe_vma_ops_alloc(struct xe_vma_ops *vops, bool
> array_of_binds)
>  {
>  	int i;
> @@ -788,11 +801,9 @@ static int xe_vma_ops_alloc(struct xe_vma_ops
> *vops, bool array_of_binds)
>  		if (!vops->pt_update_ops[i].num_ops)
>  			continue;
>  
> -		vops->pt_update_ops[i].ops =
> -			kmalloc_array(vops-
> >pt_update_ops[i].num_ops,
> -				      sizeof(*vops-
> >pt_update_ops[i].ops),
> -				      GFP_KERNEL |
> __GFP_RETRY_MAYFAIL | __GFP_NOWARN);
> -		if (!vops->pt_update_ops[i].ops)
> +		vops->pt_update_ops[i].pt_job_ops =
> +			xe_pt_job_ops_alloc(vops-
> >pt_update_ops[i].num_ops);
> +		if (!vops->pt_update_ops[i].pt_job_ops)
>  			return array_of_binds ? -ENOBUFS : -ENOMEM;
>  	}
>  
> @@ -828,7 +839,7 @@ static void xe_vma_ops_fini(struct xe_vma_ops
> *vops)
>  	xe_vma_svm_prefetch_ops_fini(vops);
>  
>  	for (i = 0; i < XE_MAX_TILES_PER_DEVICE; ++i)
> -		kfree(vops->pt_update_ops[i].ops);
> +		xe_pt_job_ops_put(vops-
> >pt_update_ops[i].pt_job_ops);
>  }
>  
>  static void xe_vma_ops_incr_pt_update_ops(struct xe_vma_ops *vops,
> u8 tile_mask, int inc_val)
> @@ -877,9 +888,6 @@ static int xe_vm_ops_add_rebind(struct xe_vma_ops
> *vops, struct xe_vma *vma,
>  
>  static struct dma_fence *ops_execute(struct xe_vm *vm,
>  				     struct xe_vma_ops *vops);
> -static void xe_vma_ops_init(struct xe_vma_ops *vops, struct xe_vm
> *vm,
> -			    struct xe_exec_queue *q,
> -			    struct xe_sync_entry *syncs, u32
> num_syncs);
>  
>  int xe_vm_rebind(struct xe_vm *vm, bool rebind_worker)
>  {
> @@ -3163,13 +3171,6 @@ static struct dma_fence *ops_execute(struct
> xe_vm *vm,
>  		fence = &cf->base;
>  	}
>  
> -	for_each_tile(tile, vm->xe, id) {
> -		if (!vops->pt_update_ops[id].num_ops)
> -			continue;
> -
> -		xe_pt_update_ops_fini(tile, vops);
> -	}
> -
>  	return fence;
>  
>  err_out:
> @@ -3447,19 +3448,6 @@ static int vm_bind_ioctl_signal_fences(struct
> xe_vm *vm,
>  	return err;
>  }
>  
> -static void xe_vma_ops_init(struct xe_vma_ops *vops, struct xe_vm
> *vm,
> -			    struct xe_exec_queue *q,
> -			    struct xe_sync_entry *syncs, u32
> num_syncs)
> -{
> -	memset(vops, 0, sizeof(*vops));
> -	INIT_LIST_HEAD(&vops->list);
> -	vops->vm = vm;
> -	vops->q = q;
> -	vops->syncs = syncs;
> -	vops->num_syncs = num_syncs;
> -	vops->flags = 0;
> -}
> -
>  static int xe_vm_bind_ioctl_validate_bo(struct xe_device *xe, struct
> xe_bo *bo,
>  					u64 addr, u64 range, u64
> obj_offset,
>  					u16 pat_index, u32 op, u32
> bind_flags)
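The xe_pt.c changes quoted above introduce a lifetime rule: a refcounted xe_pt_job_ops is shared by the VMA ops and the bind job, and page-table BOs queued for deferred freeing are only released on the last put. The userspace C below is a minimal model of that rule, not the kernel code. C11 atomics stand in for struct kref, and a singly linked list stands in for the llist. The names (pt_job_ops_alloc(), pt_job_ops_defer(), and so on) are hypothetical mirrors of the patch, not kernel APIs.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdlib.h>

/* Hypothetical stand-in for a page-table BO queued for deferred freeing. */
struct deferred_node {
	struct deferred_node *next;
};

/*
 * Userspace model of struct xe_pt_job_ops: a refcounted allocation that
 * holds the ops array and a deferred-free list. Illustrative only.
 */
struct pt_job_ops {
	atomic_int refcount;
	unsigned int current_op;
	struct deferred_node *deferred;
	int *ops;		/* stand-in for struct xe_vm_pgtable_update_op[] */
};

/* Demo counters so a caller can observe when teardown happens. */
static int destroyed;
static int nodes_freed;

static struct pt_job_ops *pt_job_ops_alloc(unsigned int num_ops)
{
	struct pt_job_ops *p = malloc(sizeof(*p));

	if (!p)
		return NULL;
	p->ops = calloc(num_ops, sizeof(*p->ops));
	if (!p->ops) {
		free(p);
		return NULL;
	}
	p->current_op = 0;
	p->deferred = NULL;
	atomic_init(&p->refcount, 1);	/* initial ref, owned by the VMA ops */
	return p;
}

static struct pt_job_ops *pt_job_ops_get(struct pt_job_ops *p)
{
	if (p)
		atomic_fetch_add(&p->refcount, 1);
	return p;
}

/* Runs on the last put: only now is it safe to free deferred PT BOs. */
static void pt_job_ops_destroy(struct pt_job_ops *p)
{
	struct deferred_node *n, *next;

	for (n = p->deferred; n; n = next) {
		next = n->next;
		free(n);
		nodes_freed++;
	}
	free(p->ops);
	free(p);
	destroyed++;
}

static void pt_job_ops_put(struct pt_job_ops *p)
{
	int old;

	if (!p)
		return;
	old = atomic_fetch_sub(&p->refcount, 1);	/* returns the old value */
	assert(old > 0);	/* catches a put without a matching get */
	if (old == 1)		/* we dropped the final reference */
		pt_job_ops_destroy(p);
}

/* Queue a node for freeing at teardown (cf. the deferred list in xe_pt_commit()). */
static int pt_job_ops_defer(struct pt_job_ops *p)
{
	struct deferred_node *n = malloc(sizeof(*n));

	if (!n)
		return -1;
	n->next = p->deferred;
	p->deferred = n;
	return 0;
}
```

Usage follows the patch. The VMA ops allocate the object (refcount 1), and the bind job takes a second reference before it is armed. xe_vma_ops_fini() drops the first reference, and the second is dropped after run_pt_job(). Whichever put comes last frees the ops array and drains the deferred list. The sketch frees nodes directly. Per the patch comment, the kernel instead uses xe_bo_put_async(), because the final put can happen in run_job, inside the dma-fencing path where dma-resv locks cannot be taken.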



* Re: [PATCH 03/15] drm/xe: CPU binds for jobs
  2025-06-05 15:44   ` Thomas Hellström
@ 2025-06-05 16:13     ` Matthew Brost
From: Matthew Brost @ 2025-06-05 16:13 UTC
  To: Thomas Hellström; +Cc: intel-xe, francois.dugast, himal.prasad.ghimiray

On Thu, Jun 05, 2025 at 05:44:07PM +0200, Thomas Hellström wrote:
> Hi, Matt,
> 
> An early comment:
> 
> Previous concerns have also included:
> 
> 1) If clearing and binding happen on the same exec_queue, GPU binding
> is actually likely to be faster, right, since it can be queued without
> waiting for additional dependencies? Do we have any timings from
> start-of-clear to support or debunk this argument?
> 

The cases where the clear/move and the bind are on the same queue and pipelined are:
- non-SVM pagefaults
- rebinds in exec IOCTL or preempt resume work

The case where we are using different queues:
- User binds

The cases where we use the same queue but would likely still have a GuC
/ HW context switch:
- SVM pagefaults, as we need to wait on the copy job in the migration
  queue, so the bind is not pipelined
- SVM prefetch, same as above

The common case is clearly user binds. SVM pagefaults and prefetch are
likely more common than non-SVM pagefaults or exec IOCTL rebinds. Let me
see if I can measure the difference between CPU and GPU binds for the
cases where the GPU might be faster, and I'll get back to you.

> 2) Are page tables in unmappable VRAM something we'd want to support
> at some point?

Do we? That would be an entire rewrite of our binding code, as we always
use the CPU to populate PTEs that are not part of the current page table
structure. Likewise, zapping PTEs is always done via the CPU. This would
be a significantly larger change than anything proposed here, and IMO it
is really out of scope, as this change is minor compared to supporting
unmappable VRAM PTEs.
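The xe_migrate_clear_pgtable_callback() hunk quoted earlier in this thread shows why CPU binds assume CPU-mappable page tables: the empty PTE value is written through a CPU mapping (`map`). The userspace function below is a minimal analogue of that write and is not the driver code. The 512-entry page-table size, the `cpu_fill_ptes` name, and returning -1 when there is no mapping are assumptions made for illustration. The iomem path (the `map->is_iomem` check in the hunk) is not modelled.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Assumption: 4 KiB page-table pages holding 512 64-bit entries. */
#define PT_ENTRIES 512u

/*
 * Hypothetical analogue of a CPU PTE-clear callback: write @num_qwords
 * copies of @empty, starting at entry @qword_ofs, through a CPU-visible
 * pointer to the page-table page. Returns 0 on success, or -1 if there
 * is no CPU mapping or the range does not fit in the page.
 */
static int cpu_fill_ptes(uint64_t *pt_map, uint32_t qword_ofs,
			 uint32_t num_qwords, uint64_t empty)
{
	uint32_t i;

	/* No CPU mapping (e.g. unmappable VRAM): a CPU bind cannot proceed. */
	if (!pt_map)
		return -1;
	/* Written to avoid overflow in qword_ofs + num_qwords. */
	if (qword_ofs > PT_ENTRIES || num_qwords > PT_ENTRIES - qword_ofs)
		return -1;

	for (i = 0; i < num_qwords; i++)
		pt_map[qword_ofs + i] = empty;
	return 0;
}
```

If the page-table pages had no CPU mapping, every call to this path would fail. Binding would then need a GPU-side (or other) way to write the entries, which is the rewrite described above.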

Matt

> 
> Thanks,
> Thomas
> 
> 
> On Thu, 2025-06-05 at 08:32 -0700, Matthew Brost wrote:
> > No reason to use the GPU for binds. In run_job, use the CPU to
> > perform binds once the bind job's dependencies are resolved.
> > 
> > Benefits of CPU-based binds:
> > - Lower latency once dependencies are resolved, as there is no
> >   interaction with the GuC or a hardware context switch both of which
> >   are relatively slow.
> > - Large arrays of binds do not risk running out of migration PTEs,
> >   avoiding -ENOBUFS being returned to userspace.
> > - Kernel binds are decoupled from the migration exec queue (which
> > issues
> >   copies and clears), so they cannot get stuck behind unrelated
> >   jobs—this can be a problem with parallel GPU faults.
> > - Enables ULLS on the migration exec queue, as this queue has
> >   exclusive access to the paging copy engine.
> > 
> > The basic idea of the implementation is to store the VM page table
> > update operations (struct xe_vm_pgtable_update_op *pt_op) and
> > additional arguments for the migrate layer’s CPU PTE update function
> > in a job. The submission backend can then call into the migrate layer
> > to write the PTEs with the CPU and free the stored resources for the
> > PTE update.
> > 
> > PT job submission is implemented in the GuC backend for simplicity. A
> > follow-up could introduce a specific backend for PT jobs.
> > 
> > All code related to GPU-based binding has been removed.
> > 
> > Signed-off-by: Matthew Brost <matthew.brost@intel.com>
> > ---
> >  drivers/gpu/drm/xe/xe_bo.c              |   7 +-
> >  drivers/gpu/drm/xe/xe_bo.h              |   9 +-
> >  drivers/gpu/drm/xe/xe_bo_types.h        |   2 -
> >  drivers/gpu/drm/xe/xe_drm_client.c      |   3 +-
> >  drivers/gpu/drm/xe/xe_guc_submit.c      |  36 +++-
> >  drivers/gpu/drm/xe/xe_migrate.c         | 251 +++-------------------------
> >  drivers/gpu/drm/xe/xe_migrate.h         |   6 +
> >  drivers/gpu/drm/xe/xe_pt.c              | 188 ++++++++++++++----
> >  drivers/gpu/drm/xe/xe_pt.h              |   5 +-
> >  drivers/gpu/drm/xe/xe_pt_types.h        |  29 ++-
> >  drivers/gpu/drm/xe/xe_sched_job.c       |  78 +++++---
> >  drivers/gpu/drm/xe/xe_sched_job_types.h |  31 ++-
> >  drivers/gpu/drm/xe/xe_vm.c              |  46 ++---
> >  13 files changed, 341 insertions(+), 350 deletions(-)
> > 
> > diff --git a/drivers/gpu/drm/xe/xe_bo.c b/drivers/gpu/drm/xe/xe_bo.c
> > index 61d208c85281..7aa598b584d2 100644
> > --- a/drivers/gpu/drm/xe/xe_bo.c
> > +++ b/drivers/gpu/drm/xe/xe_bo.c
> > @@ -3033,8 +3033,13 @@ void xe_bo_put_commit(struct llist_head
> > *deferred)
> >  	if (!freed)
> >  		return;
> >  
> > -	llist_for_each_entry_safe(bo, next, freed, freed)
> > +	llist_for_each_entry_safe(bo, next, freed, freed) {
> > +		struct xe_vm *vm = bo->vm;
> > +
> >  		drm_gem_object_free(&bo->ttm.base.refcount);
> > +		if (bo->flags & XE_BO_FLAG_PUT_VM_ASYNC)
> > +			xe_vm_put(vm);
> > +	}
> >  }
> >  
> >  static void xe_bo_dev_work_func(struct work_struct *work)
> > diff --git a/drivers/gpu/drm/xe/xe_bo.h b/drivers/gpu/drm/xe/xe_bo.h
> > index 02ada1fb8a23..967b1fe92560 100644
> > --- a/drivers/gpu/drm/xe/xe_bo.h
> > +++ b/drivers/gpu/drm/xe/xe_bo.h
> > @@ -46,6 +46,7 @@
> >  #define XE_BO_FLAG_GGTT2		BIT(22)
> >  #define XE_BO_FLAG_GGTT3		BIT(23)
> >  #define XE_BO_FLAG_CPU_ADDR_MIRROR	BIT(24)
> > +#define XE_BO_FLAG_PUT_VM_ASYNC		BIT(25)
> >  
> >  /* this one is trigger internally only */
> >  #define XE_BO_FLAG_INTERNAL_TEST	BIT(30)
> > @@ -319,6 +320,7 @@ void __xe_bo_release_dummy(struct kref *kref);
> >   * @bo: The bo to put.
> >   * @deferred: List to which to add the buffer object if we cannot
> > put, or
> >   * NULL if the function is to put unconditionally.
> > + * @added: BO was added to deferred list
> >   *
> >   * Since the final freeing of an object includes both sleeping and
> > (!)
> >   * memory allocation in the dma_resv individualization, it's not ok
> > @@ -338,7 +340,8 @@ void __xe_bo_release_dummy(struct kref *kref);
> >   * false otherwise.
> >   */
> >  static inline bool
> > -xe_bo_put_deferred(struct xe_bo *bo, struct llist_head *deferred)
> > +xe_bo_put_deferred(struct xe_bo *bo, struct llist_head *deferred,
> > +		   bool *added)
> >  {
> >  	if (!deferred) {
> >  		xe_bo_put(bo);
> > @@ -348,6 +351,7 @@ xe_bo_put_deferred(struct xe_bo *bo, struct
> > llist_head *deferred)
> >  	if (!kref_put(&bo->ttm.base.refcount,
> > __xe_bo_release_dummy))
> >  		return false;
> >  
> > +	*added = true;
> >  	return llist_add(&bo->freed, deferred);
> >  }
> >  
> > @@ -363,8 +367,9 @@ static inline void
> >  xe_bo_put_async(struct xe_bo *bo)
> >  {
> >  	struct xe_bo_dev *bo_device = &xe_bo_device(bo)->bo_device;
> > +	bool added = false;
> >  
> > -	if (xe_bo_put_deferred(bo, &bo_device->async_list))
> > +	if (xe_bo_put_deferred(bo, &bo_device->async_list, &added))
> >  		schedule_work(&bo_device->async_free);
> >  }
> >  
> > diff --git a/drivers/gpu/drm/xe/xe_bo_types.h
> > b/drivers/gpu/drm/xe/xe_bo_types.h
> > index eb5e83c5f233..ecf42a04640a 100644
> > --- a/drivers/gpu/drm/xe/xe_bo_types.h
> > +++ b/drivers/gpu/drm/xe/xe_bo_types.h
> > @@ -70,8 +70,6 @@ struct xe_bo {
> >  
> >  	/** @freed: List node for delayed put. */
> >  	struct llist_node freed;
> > -	/** @update_index: Update index if PT BO */
> > -	int update_index;
> >  	/** @created: Whether the bo has passed initial creation */
> >  	bool created;
> >  
> > diff --git a/drivers/gpu/drm/xe/xe_drm_client.c
> > b/drivers/gpu/drm/xe/xe_drm_client.c
> > index 31f688e953d7..6f5a91ef7491 100644
> > --- a/drivers/gpu/drm/xe/xe_drm_client.c
> > +++ b/drivers/gpu/drm/xe/xe_drm_client.c
> > @@ -200,6 +200,7 @@ static void show_meminfo(struct drm_printer *p,
> > struct drm_file *file)
> >  	LLIST_HEAD(deferred);
> >  	unsigned int id;
> >  	u32 mem_type;
> > +	bool added = false;
> >  
> >  	client = xef->client;
> >  
> > @@ -246,7 +247,7 @@ static void show_meminfo(struct drm_printer *p,
> > struct drm_file *file)
> >  			xe_assert(xef->xe, !list_empty(&bo-
> > >client_link));
> >  		}
> >  
> > -		xe_bo_put_deferred(bo, &deferred);
> > +		xe_bo_put_deferred(bo, &deferred, &added);
> >  	}
> >  	spin_unlock(&client->bos_lock);
> >  
> > diff --git a/drivers/gpu/drm/xe/xe_guc_submit.c
> > b/drivers/gpu/drm/xe/xe_guc_submit.c
> > index 2b61d017eeca..551cd21a6465 100644
> > --- a/drivers/gpu/drm/xe/xe_guc_submit.c
> > +++ b/drivers/gpu/drm/xe/xe_guc_submit.c
> > @@ -19,6 +19,7 @@
> >  #include "abi/guc_klvs_abi.h"
> >  #include "regs/xe_lrc_layout.h"
> >  #include "xe_assert.h"
> > +#include "xe_bo.h"
> >  #include "xe_devcoredump.h"
> >  #include "xe_device.h"
> >  #include "xe_exec_queue.h"
> > @@ -38,8 +39,10 @@
> >  #include "xe_lrc.h"
> >  #include "xe_macros.h"
> >  #include "xe_map.h"
> > +#include "xe_migrate.h"
> >  #include "xe_mocs.h"
> >  #include "xe_pm.h"
> > +#include "xe_pt.h"
> >  #include "xe_ring_ops_types.h"
> >  #include "xe_sched_job.h"
> >  #include "xe_trace.h"
> > @@ -745,6 +748,20 @@ static void submit_exec_queue(struct xe_exec_queue *q)
> >  	}
> >  }
> >  
> > +static bool is_pt_job(struct xe_sched_job *job)
> > +{
> > +	return job->is_pt_job;
> > +}
> > +
> > +static void run_pt_job(struct xe_sched_job *job)
> > +{
> > +	__xe_migrate_update_pgtables_cpu(job->pt_update[0].vm,
> > +					 job->pt_update[0].tile,
> > +					 job->pt_update[0].ops,
> > +					 job->pt_update[0].pt_job_ops->ops,
> > +					 job->pt_update[0].pt_job_ops->current_op);
> > +}
> > +
> >  static struct dma_fence *
> >  guc_exec_queue_run_job(struct drm_sched_job *drm_job)
> >  {
> > @@ -760,14 +777,21 @@ guc_exec_queue_run_job(struct drm_sched_job *drm_job)
> >  	trace_xe_sched_job_run(job);
> >  
> >  	if (!exec_queue_killed_or_banned_or_wedged(q) && !xe_sched_job_is_error(job)) {
> > -		if (!exec_queue_registered(q))
> > -			register_exec_queue(q);
> > -		if (!lr)	/* LR jobs are emitted in the exec IOCTL */
> > -			q->ring_ops->emit_job(job);
> > -		submit_exec_queue(q);
> > +		if (is_pt_job(job)) {
> > +			run_pt_job(job);
> > +		} else {
> > +			if (!exec_queue_registered(q))
> > +				register_exec_queue(q);
> > +			if (!lr)	/* LR jobs are emitted in the exec IOCTL */
> > +				q->ring_ops->emit_job(job);
> > +			submit_exec_queue(q);
> > +		}
> >  	}
> >  
> > -	if (lr) {
> > +	if (is_pt_job(job)) {
> > +		xe_pt_job_ops_put(job->pt_update[0].pt_job_ops);
> > +		dma_fence_put(job->fence);	/* Drop ref from xe_sched_job_arm */
> > +	} else if (lr) {
> >  		xe_sched_job_set_error(job, -EOPNOTSUPP);
> >  		dma_fence_put(job->fence);	/* Drop ref from xe_sched_job_arm */
> >  	} else {
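With this change a PT job never reaches the ring. Its page-table writes are done by the CPU inside run_job, and the PT job ops reference taken when the job was built is dropped right there. The single-threaded sketch below shows only that dispatch shape. struct job, struct pt_update, run_pt_job() and run_job() are hypothetical simplifications. The real code calls __xe_migrate_update_pgtables_cpu() and xe_pt_job_ops_put() and manages fences, none of which is modelled here.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical CPU page-table update: write @value into @count entries. */
struct pt_update {
	uint64_t *ptes;		/* page-table memory the CPU writes */
	unsigned int ofs;	/* first entry to write */
	unsigned int count;	/* number of entries to write */
	uint64_t value;		/* entry value to store */
};

/* Hypothetical, simplified job: either a CPU PT update or a GPU job. */
struct job {
	bool is_pt_job;
	struct pt_update pt;	/* only meaningful when is_pt_job */
	int *ops_refcount;	/* stands in for the pt_job_ops kref */
	unsigned int submitted;	/* GPU path: number of ring submissions */
};

/* CPU path: write the entries directly, no batch buffer involved. */
static void run_pt_job(struct job *job)
{
	for (unsigned int i = 0; i < job->pt.count; i++)
		job->pt.ptes[job->pt.ofs + i] = job->pt.value;
}

/* Same shape as the patched run_job: branch before touching the ring. */
static void run_job(struct job *job)
{
	if (job->is_pt_job) {
		run_pt_job(job);
		--*job->ops_refcount;	/* like xe_pt_job_ops_put() */
	} else {
		job->submitted++;	/* like emit_job() + submit_exec_queue() */
	}
}
```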
> > diff --git a/drivers/gpu/drm/xe/xe_migrate.c b/drivers/gpu/drm/xe/xe_migrate.c
> > index 9084f5cbc02d..e444f3fae97c 100644
> > --- a/drivers/gpu/drm/xe/xe_migrate.c
> > +++ b/drivers/gpu/drm/xe/xe_migrate.c
> > @@ -58,18 +58,12 @@ struct xe_migrate {
> >  	 * Protected by @job_mutex.
> >  	 */
> >  	struct dma_fence *fence;
> > -	/**
> > -	 * @vm_update_sa: For integrated, used to suballocate page-tables
> > -	 * out of the pt_bo.
> > -	 */
> > -	struct drm_suballoc_manager vm_update_sa;
> >  	/** @min_chunk_size: For dgfx, Minimum chunk size */
> >  	u64 min_chunk_size;
> >  };
> >  
> >  #define MAX_PREEMPTDISABLE_TRANSFER SZ_8M /* Around 1ms. */
> >  #define MAX_CCS_LIMITED_TRANSFER SZ_4M /* XE_PAGE_SIZE * (FIELD_MAX(XE2_CCS_SIZE_MASK) + 1) */
> > -#define NUM_KERNEL_PDE 15
> >  #define NUM_PT_SLOTS 32
> >  #define LEVEL0_PAGE_TABLE_ENCODE_SIZE SZ_2M
> >  #define MAX_NUM_PTE 512
> > @@ -107,7 +101,6 @@ static void xe_migrate_fini(void *arg)
> >  
> >  	dma_fence_put(m->fence);
> >  	xe_bo_put(m->pt_bo);
> > -	drm_suballoc_manager_fini(&m->vm_update_sa);
> >  	mutex_destroy(&m->job_mutex);
> >  	xe_vm_close_and_put(m->q->vm);
> >  	xe_exec_queue_put(m->q);
> > @@ -199,8 +192,6 @@ static int xe_migrate_prepare_vm(struct xe_tile *tile, struct xe_migrate *m,
> >  	BUILD_BUG_ON(NUM_PT_SLOTS > SZ_2M/XE_PAGE_SIZE);
> >  	/* Must be a multiple of 64K to support all platforms */
> >  	BUILD_BUG_ON(NUM_PT_SLOTS * XE_PAGE_SIZE % SZ_64K);
> > -	/* And one slot reserved for the 4KiB page table updates */
> > -	BUILD_BUG_ON(!(NUM_KERNEL_PDE & 1));
> >  
> >  	/* Need to be sure everything fits in the first PT, or create more */
> >  	xe_tile_assert(tile, m->batch_base_ofs + batch->size < SZ_2M);
> > @@ -333,8 +324,6 @@ static int xe_migrate_prepare_vm(struct xe_tile *tile, struct xe_migrate *m,
> >  	/*
> >  	 * Example layout created above, with root level = 3:
> >  	 * [PT0...PT7]: kernel PT's for copy/clear; 64 or 4KiB PTE's
> > -	 * [PT8]: Kernel PT for VM_BIND, 4 KiB PTE's
> > -	 * [PT9...PT26]: Userspace PT's for VM_BIND, 4 KiB PTE's
> >  	 * [PT27 = PDE 0] [PT28 = PDE 1] [PT29 = PDE 2] [PT30 & PT31 = 2M vram identity map]
> >  	 *
> >  	 * This makes the lowest part of the VM point to the pagetables.
> > @@ -342,19 +331,10 @@ static int xe_migrate_prepare_vm(struct xe_tile *tile, struct xe_migrate *m,
> >  	 * and flushes, other parts of the VM can be used either for copying and
> >  	 * clearing.
> >  	 *
> > -	 * For performance, the kernel reserves PDE's, so about 20 are left
> > -	 * for async VM updates.
> > -	 *
> >  	 * To make it easier to work, each scratch PT is put in slot (1 + PT #)
> >  	 * everywhere, this allows lockless updates to scratch pages by using
> >  	 * the different addresses in VM.
> >  	 */
> > -#define NUM_VMUSA_UNIT_PER_PAGE	32
> > -#define VM_SA_UPDATE_UNIT_SIZE		(XE_PAGE_SIZE / NUM_VMUSA_UNIT_PER_PAGE)
> > -#define NUM_VMUSA_WRITES_PER_UNIT	(VM_SA_UPDATE_UNIT_SIZE / sizeof(u64))
> > -	drm_suballoc_manager_init(&m->vm_update_sa,
> > -				  (size_t)(map_ofs / XE_PAGE_SIZE - NUM_KERNEL_PDE) *
> > -				  NUM_VMUSA_UNIT_PER_PAGE, 0);
> >  
> >  	m->pt_bo = bo;
> >  	return 0;
> > @@ -1193,56 +1173,6 @@ struct dma_fence *xe_migrate_clear(struct xe_migrate *m,
> >  	return fence;
> >  }
> >  
> > -static void write_pgtable(struct xe_tile *tile, struct xe_bb *bb, u64 ppgtt_ofs,
> > -			  const struct xe_vm_pgtable_update_op *pt_op,
> > -			  const struct xe_vm_pgtable_update *update,
> > -			  struct xe_migrate_pt_update *pt_update)
> > -{
> > -	const struct xe_migrate_pt_update_ops *ops = pt_update->ops;
> > -	struct xe_vm *vm = pt_update->vops->vm;
> > -	u32 chunk;
> > -	u32 ofs = update->ofs, size = update->qwords;
> > -
> > -	/*
> > -	 * If we have 512 entries (max), we would populate it ourselves,
> > -	 * and update the PDE above it to the new pointer.
> > -	 * The only time this can only happen if we have to update the top
> > -	 * PDE. This requires a BO that is almost vm->size big.
> > -	 *
> > -	 * This shouldn't be possible in practice.. might change when 16K
> > -	 * pages are used. Hence the assert.
> > -	 */
> > -	xe_tile_assert(tile, update->qwords < MAX_NUM_PTE);
> > -	if (!ppgtt_ofs)
> > -		ppgtt_ofs = xe_migrate_vram_ofs(tile_to_xe(tile),
> > -						xe_bo_addr(update->pt_bo, 0,
> > -							  XE_PAGE_SIZE), false);
> > -
> > -	do {
> > -		u64 addr = ppgtt_ofs + ofs * 8;
> > -
> > -		chunk = min(size, MAX_PTE_PER_SDI);
> > -
> > -		/* Ensure populatefn can do memset64 by aligning bb->cs */
> > -		if (!(bb->len & 1))
> > -			bb->cs[bb->len++] = MI_NOOP;
> > -
> > -		bb->cs[bb->len++] = MI_STORE_DATA_IMM | MI_SDI_NUM_QW(chunk);
> > -		bb->cs[bb->len++] = lower_32_bits(addr);
> > -		bb->cs[bb->len++] = upper_32_bits(addr);
> > -		if (pt_op->bind)
> > -			ops->populate(tile, NULL, bb->cs + bb->len,
> > -				      ofs, chunk, update);
> > -		else
> > -			ops->clear(vm, tile, NULL, bb->cs + bb->len,
> > -				   ofs, chunk, update);
> > -
> > -		bb->len += chunk * 2;
> > -		ofs += chunk;
> > -		size -= chunk;
> > -	} while (size);
> > -}
> > -
> >  struct xe_vm *xe_migrate_get_vm(struct xe_migrate *m)
> >  {
> >  	return xe_vm_get(m->q->vm);
> > @@ -1258,7 +1188,18 @@ struct migrate_test_params {
> >  	container_of(_priv, struct migrate_test_params, base)
> >  #endif
> >  
> > -static void
> > +/**
> > + * __xe_migrate_update_pgtables_cpu() - Update a VM's PTEs via the CPU
> > + * @vm: The VM being updated
> > + * @tile: The tile being updated
> > + * @ops: The migrate PT update ops
> > + * @pt_ops: The VM PT update ops
> > + * @num_ops: The number of The VM PT update ops
> > + *
> > + * Execute the VM PT update ops array which results in a VM's PTEs being updated
> > + * via the CPU.
> > + */
> > +void
> >  __xe_migrate_update_pgtables_cpu(struct xe_vm *vm, struct xe_tile *tile,
> >  				 const struct xe_migrate_pt_update_ops *ops,
> >  				 struct xe_vm_pgtable_update_op *pt_op,
> > @@ -1314,7 +1255,7 @@ xe_migrate_update_pgtables_cpu(struct xe_migrate *m,
> >  	}
> >  
> >  	__xe_migrate_update_pgtables_cpu(vm, m->tile, ops,
> > -					 pt_update_ops->ops,
> > +					 pt_update_ops->pt_job_ops->ops,
> >  					 pt_update_ops->num_ops);
> >  
> >  	return dma_fence_get_stub();
> > @@ -1327,161 +1268,19 @@ __xe_migrate_update_pgtables(struct xe_migrate *m,
> >  {
> >  	const struct xe_migrate_pt_update_ops *ops = pt_update->ops;
> >  	struct xe_tile *tile = m->tile;
> > -	struct xe_gt *gt = tile->primary_gt;
> > -	struct xe_device *xe = tile_to_xe(tile);
> >  	struct xe_sched_job *job;
> >  	struct dma_fence *fence;
> > -	struct drm_suballoc *sa_bo = NULL;
> > -	struct xe_bb *bb;
> > -	u32 i, j, batch_size = 0, ppgtt_ofs, update_idx, page_ofs = 0;
> > -	u32 num_updates = 0, current_update = 0;
> > -	u64 addr;
> > -	int err = 0;
> >  	bool is_migrate = pt_update_ops->q == m->q;
> > -	bool usm = is_migrate && xe->info.has_usm;
> > -
> > -	for (i = 0; i < pt_update_ops->num_ops; ++i) {
> > -		struct xe_vm_pgtable_update_op *pt_op = &pt_update_ops->ops[i];
> > -		struct xe_vm_pgtable_update *updates = pt_op->entries;
> > -
> > -		num_updates += pt_op->num_entries;
> > -		for (j = 0; j < pt_op->num_entries; ++j) {
> > -			u32 num_cmds = DIV_ROUND_UP(updates[j].qwords,
> > -						   MAX_PTE_PER_SDI);
> > -
> > -			/* align noop + MI_STORE_DATA_IMM cmd prefix */
> > -			batch_size += 4 * num_cmds + updates[j].qwords * 2;
> > -		}
> > -	}
> > -
> > -	/* fixed + PTE entries */
> > -	if (IS_DGFX(xe))
> > -		batch_size += 2;
> > -	else
> > -		batch_size += 6 * (num_updates / MAX_PTE_PER_SDI + 1) +
> > -			num_updates * 2;
> > -
> > -	bb = xe_bb_new(gt, batch_size, usm);
> > -	if (IS_ERR(bb))
> > -		return ERR_CAST(bb);
> > -
> > -	/* For sysmem PTE's, need to map them in our hole.. */
> > -	if (!IS_DGFX(xe)) {
> > -		u16 pat_index = xe->pat.idx[XE_CACHE_WB];
> > -		u32 ptes, ofs;
> > -
> > -		ppgtt_ofs = NUM_KERNEL_PDE - 1;
> > -		if (!is_migrate) {
> > -			u32 num_units = DIV_ROUND_UP(num_updates,
> > -						    NUM_VMUSA_WRITES_PER_UNIT);
> > -
> > -			if (num_units > m->vm_update_sa.size) {
> > -				err = -ENOBUFS;
> > -				goto err_bb;
> > -			}
> > -			sa_bo = drm_suballoc_new(&m->vm_update_sa, num_units,
> > -						 GFP_KERNEL, true, 0);
> > -			if (IS_ERR(sa_bo)) {
> > -				err = PTR_ERR(sa_bo);
> > -				goto err_bb;
> > -			}
> > -
> > -			ppgtt_ofs = NUM_KERNEL_PDE +
> > -				(drm_suballoc_soffset(sa_bo) /
> > -				 NUM_VMUSA_UNIT_PER_PAGE);
> > -			page_ofs = (drm_suballoc_soffset(sa_bo) %
> > -				    NUM_VMUSA_UNIT_PER_PAGE) *
> > -				VM_SA_UPDATE_UNIT_SIZE;
> > -		}
> > -
> > -		/* Map our PT's to gtt */
> > -		i = 0;
> > -		j = 0;
> > -		ptes = num_updates;
> > -		ofs = ppgtt_ofs * XE_PAGE_SIZE + page_ofs;
> > -		while (ptes) {
> > -			u32 chunk = min(MAX_PTE_PER_SDI, ptes);
> > -			u32 idx = 0;
> > -
> > -			bb->cs[bb->len++] = MI_STORE_DATA_IMM |
> > -				MI_SDI_NUM_QW(chunk);
> > -			bb->cs[bb->len++] = ofs;
> > -			bb->cs[bb->len++] = 0; /* upper_32_bits */
> > -
> > -			for (; i < pt_update_ops->num_ops; ++i) {
> > -				struct xe_vm_pgtable_update_op *pt_op =
> > -					&pt_update_ops->ops[i];
> > -				struct xe_vm_pgtable_update *updates = pt_op->entries;
> > -
> > -				for (; j < pt_op->num_entries; ++j, ++current_update, ++idx) {
> > -					struct xe_vm *vm = pt_update->vops->vm;
> > -					struct xe_bo *pt_bo = updates[j].pt_bo;
> > -
> > -					if (idx == chunk)
> > -						goto next_cmd;
> > -
> > -					xe_tile_assert(tile, pt_bo->size == SZ_4K);
> > -
> > -					/* Map a PT at most once */
> > -					if (pt_bo->update_index < 0)
> > -						pt_bo->update_index = current_update;
> > -
> > -					addr = vm->pt_ops->pte_encode_bo(pt_bo, 0,
> > -									 pat_index, 0);
> > -					bb->cs[bb->len++] = lower_32_bits(addr);
> > -					bb->cs[bb->len++] = upper_32_bits(addr);
> > -				}
> > -
> > -				j = 0;
> > -			}
> > -
> > -next_cmd:
> > -			ptes -= chunk;
> > -			ofs += chunk * sizeof(u64);
> > -		}
> > -
> > -		bb->cs[bb->len++] = MI_BATCH_BUFFER_END;
> > -		update_idx = bb->len;
> > -
> > -		addr = xe_migrate_vm_addr(ppgtt_ofs, 0) +
> > -			(page_ofs / sizeof(u64)) * XE_PAGE_SIZE;
> > -		for (i = 0; i < pt_update_ops->num_ops; ++i) {
> > -			struct xe_vm_pgtable_update_op *pt_op =
> > -				&pt_update_ops->ops[i];
> > -			struct xe_vm_pgtable_update *updates = pt_op->entries;
> > -
> > -			for (j = 0; j < pt_op->num_entries; ++j) {
> > -				struct xe_bo *pt_bo = updates[j].pt_bo;
> > -
> > -				write_pgtable(tile, bb, addr +
> > -					      pt_bo->update_index * XE_PAGE_SIZE,
> > -					      pt_op, &updates[j], pt_update);
> > -			}
> > -		}
> > -	} else {
> > -		/* phys pages, no preamble required */
> > -		bb->cs[bb->len++] = MI_BATCH_BUFFER_END;
> > -		update_idx = bb->len;
> > -
> > -		for (i = 0; i < pt_update_ops->num_ops; ++i) {
> > -			struct xe_vm_pgtable_update_op *pt_op =
> > -				&pt_update_ops->ops[i];
> > -			struct xe_vm_pgtable_update *updates = pt_op->entries;
> > -
> > -			for (j = 0; j < pt_op->num_entries; ++j)
> > -				write_pgtable(tile, bb, 0, pt_op, &updates[j],
> > -					      pt_update);
> > -		}
> > -	}
> > +	int err;
> >  
> > -	job = xe_bb_create_migration_job(pt_update_ops->q, bb,
> > -					 xe_migrate_batch_base(m, usm),
> > -					 update_idx);
> > +	job = xe_sched_job_create(pt_update_ops->q, NULL);
> >  	if (IS_ERR(job)) {
> >  		err = PTR_ERR(job);
> > -		goto err_sa;
> > +		goto err_out;
> >  	}
> >  
> > +	xe_tile_assert(tile, job->is_pt_job);
> > +
> >  	if (ops->pre_commit) {
> >  		pt_update->job = job;
> >  		err = ops->pre_commit(pt_update);
> > @@ -1491,6 +1290,12 @@ __xe_migrate_update_pgtables(struct xe_migrate *m,
> >  	if (is_migrate)
> >  		mutex_lock(&m->job_mutex);
> >  
> > +	job->pt_update[0].vm = pt_update->vops->vm;
> > +	job->pt_update[0].tile = tile;
> > +	job->pt_update[0].ops = ops;
> > +	job->pt_update[0].pt_job_ops =
> > +		xe_pt_job_ops_get(pt_update_ops->pt_job_ops);
> > +
> >  	xe_sched_job_arm(job);
> >  	fence = dma_fence_get(&job->drm.s_fence->finished);
> >  	xe_sched_job_push(job);
> > @@ -1498,17 +1303,11 @@ __xe_migrate_update_pgtables(struct xe_migrate *m,
> >  	if (is_migrate)
> >  		mutex_unlock(&m->job_mutex);
> >  
> > -	xe_bb_free(bb, fence);
> > -	drm_suballoc_free(sa_bo, fence);
> > -
> >  	return fence;
> >  
> >  err_job:
> >  	xe_sched_job_put(job);
> > -err_sa:
> > -	drm_suballoc_free(sa_bo, NULL);
> > -err_bb:
> > -	xe_bb_free(bb, NULL);
> > +err_out:
> >  	return ERR_PTR(err);
> >  }
> >  
> > diff --git a/drivers/gpu/drm/xe/xe_migrate.h b/drivers/gpu/drm/xe/xe_migrate.h
> > index b064455b604e..0986ffdd8d9a 100644
> > --- a/drivers/gpu/drm/xe/xe_migrate.h
> > +++ b/drivers/gpu/drm/xe/xe_migrate.h
> > @@ -22,6 +22,7 @@ struct xe_pt;
> >  struct xe_tile;
> >  struct xe_vm;
> >  struct xe_vm_pgtable_update;
> > +struct xe_vm_pgtable_update_op;
> >  struct xe_vma;
> >  
> >  /**
> > @@ -125,6 +126,11 @@ struct dma_fence *xe_migrate_clear(struct xe_migrate *m,
> >  
> >  struct xe_vm *xe_migrate_get_vm(struct xe_migrate *m);
> >  
> > +void __xe_migrate_update_pgtables_cpu(struct xe_vm *vm, struct xe_tile *tile,
> > +				      const struct xe_migrate_pt_update_ops *ops,
> > +				      struct xe_vm_pgtable_update_op *pt_op,
> > +				      int num_ops);
> > +
> >  struct dma_fence *
> >  xe_migrate_update_pgtables(struct xe_migrate *m,
> >  			   struct xe_migrate_pt_update *pt_update);
> > diff --git a/drivers/gpu/drm/xe/xe_pt.c b/drivers/gpu/drm/xe/xe_pt.c
> > index db1c363a65d5..1ad31f444b79 100644
> > --- a/drivers/gpu/drm/xe/xe_pt.c
> > +++ b/drivers/gpu/drm/xe/xe_pt.c
> > @@ -200,7 +200,9 @@ unsigned int xe_pt_shift(unsigned int level)
> >   * and finally frees @pt. TODO: Can we remove the @flags argument?
> >   */
> >  void xe_pt_destroy(struct xe_pt *pt, u32 flags, struct llist_head *deferred)
> > +
> >  {
> > +	bool added = false;
> >  	int i;
> >  
> >  	if (!pt)
> > @@ -208,7 +210,18 @@ void xe_pt_destroy(struct xe_pt *pt, u32 flags, struct llist_head *deferred)
> >  
> >  	XE_WARN_ON(!list_empty(&pt->bo->ttm.base.gpuva.list));
> >  	xe_bo_unpin(pt->bo);
> > -	xe_bo_put_deferred(pt->bo, deferred);
> > +	xe_bo_put_deferred(pt->bo, deferred, &added);
> > +	if (added) {
> > +		/*
> > +		 * We need the VM present until the BO is destroyed as it shares
> > +		 * a dma-resv and BO destroy is async. Reinit BO refcount so
> > +		 * xe_bo_put_async can be used when the PT job ops refcount goes
> > +		 * to zero.
> > +		 */
> > +		xe_vm_get(pt->bo->vm);
> > +		pt->bo->flags |= XE_BO_FLAG_PUT_VM_ASYNC;
> > +		kref_init(&pt->bo->ttm.base.refcount);
> > +	}
> >  
> >  	if (pt->level > 0 && pt->num_live) {
> >  		struct xe_pt_dir *pt_dir = as_xe_pt_dir(pt);
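The `added` branch in xe_pt_destroy() is the subtle part. Once the PT BO's reference count has reached zero and the BO sits on the deferred list, the patch takes a VM reference, flags the BO, and re-arms the count with kref_init(). One more put, issued later through xe_bo_put_async() when the PT job ops are destroyed, then performs the real release. Below is a minimal single-threaded sketch of that "queue on zero, then re-arm" pattern. struct pt_bo, put_deferred(), pt_destroy() and put_async() are hypothetical names, a plain int replaces struct kref, and the VM reference and flag are omitted.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical stand-in for a PT BO: a refcount plus bookkeeping flags. */
struct pt_bo {
	int refcount;
	bool queued;	/* placed on the deferred list */
	bool freed;	/* actually released */
};

/* Drop a reference; report via *added whether it hit zero and was queued. */
static void put_deferred(struct pt_bo *bo, bool *added)
{
	if (--bo->refcount == 0) {
		bo->queued = true;
		*added = true;
	}
}

/* Destroy path: if the BO was queued, re-arm the count for one later put. */
static void pt_destroy(struct pt_bo *bo)
{
	bool added = false;

	put_deferred(bo, &added);
	if (added)
		bo->refcount = 1;	/* like kref_init() */
}

/* Later, from the PT job ops release path: the final put frees the BO. */
static void put_async(struct pt_bo *bo)
{
	if (--bo->refcount == 0)
		bo->freed = true;
}
```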
> > @@ -361,7 +374,7 @@ xe_pt_new_shared(struct xe_walk_update *wupd, struct xe_pt *parent,
> >  	entry->pt = parent;
> >  	entry->flags = 0;
> >  	entry->qwords = 0;
> > -	entry->pt_bo->update_index = -1;
> > +	entry->level = parent->level;
> >  
> >  	if (alloc_entries) {
> >  		entry->pt_entries = kmalloc_array(XE_PDES,
> > @@ -1739,7 +1752,7 @@ xe_migrate_clear_pgtable_callback(struct xe_vm *vm, struct xe_tile *tile,
> >  				  u32 qword_ofs, u32 num_qwords,
> >  				  const struct xe_vm_pgtable_update *update)
> >  {
> > -	u64 empty = __xe_pt_empty_pte(tile, vm, update->pt->level);
> > +	u64 empty = __xe_pt_empty_pte(tile, vm, update->level);
> >  	int i;
> >  
> >  	if (map && map->is_iomem)
> > @@ -1805,13 +1818,20 @@ xe_pt_commit_prepare_unbind(struct xe_vma *vma,
> >  	}
> >  }
> >  
> > +static struct xe_vm_pgtable_update_op *
> > +to_pt_op(struct xe_vm_pgtable_update_ops *pt_update_ops, u32 current_op)
> > +{
> > +	return &pt_update_ops->pt_job_ops->ops[current_op];
> > +}
> > +
> >  static void
> >  xe_pt_update_ops_rfence_interval(struct xe_vm_pgtable_update_ops *pt_update_ops,
> >  				 u64 start, u64 end)
> >  {
> >  	u64 last;
> > -	u32 current_op = pt_update_ops->current_op;
> > -	struct xe_vm_pgtable_update_op *pt_op = &pt_update_ops->ops[current_op];
> > +	u32 current_op = pt_update_ops->pt_job_ops->current_op;
> > +	struct xe_vm_pgtable_update_op *pt_op =
> > +		to_pt_op(pt_update_ops, current_op);
> >  	int i, level = 0;
> >  
> >  	for (i = 0; i < pt_op->num_entries; i++) {
> > @@ -1846,8 +1866,9 @@ static int bind_op_prepare(struct xe_vm *vm, struct xe_tile *tile,
> >  			   struct xe_vm_pgtable_update_ops *pt_update_ops,
> >  			   struct xe_vma *vma, bool invalidate_on_bind)
> >  {
> > -	u32 current_op = pt_update_ops->current_op;
> > -	struct xe_vm_pgtable_update_op *pt_op = &pt_update_ops->ops[current_op];
> > +	u32 current_op = pt_update_ops->pt_job_ops->current_op;
> > +	struct xe_vm_pgtable_update_op *pt_op =
> > +		to_pt_op(pt_update_ops, current_op);
> >  	int err;
> >  
> >  	xe_tile_assert(tile, !xe_vma_is_cpu_addr_mirror(vma));
> > @@ -1876,7 +1897,7 @@ static int bind_op_prepare(struct xe_vm *vm, struct xe_tile *tile,
> >  		xe_pt_update_ops_rfence_interval(pt_update_ops,
> >  						 xe_vma_start(vma),
> >  						 xe_vma_end(vma));
> > -		++pt_update_ops->current_op;
> > +		++pt_update_ops->pt_job_ops->current_op;
> >  		pt_update_ops->needs_userptr_lock |= xe_vma_is_userptr(vma);
> >  
> >  		/*
> > @@ -1913,8 +1934,9 @@ static int bind_range_prepare(struct xe_vm *vm, struct xe_tile *tile,
> >  			      struct xe_vm_pgtable_update_ops *pt_update_ops,
> >  			      struct xe_vma *vma, struct xe_svm_range *range)
> >  {
> > -	u32 current_op = pt_update_ops->current_op;
> > -	struct xe_vm_pgtable_update_op *pt_op = &pt_update_ops->ops[current_op];
> > +	u32 current_op = pt_update_ops->pt_job_ops->current_op;
> > +	struct xe_vm_pgtable_update_op *pt_op =
> > +		to_pt_op(pt_update_ops, current_op);
> >  	int err;
> >  
> >  	xe_tile_assert(tile, xe_vma_is_cpu_addr_mirror(vma));
> > @@ -1938,7 +1960,7 @@ static int bind_range_prepare(struct xe_vm *vm, struct xe_tile *tile,
> >  		xe_pt_update_ops_rfence_interval(pt_update_ops,
> >  						 range->base.itree.start,
> >  						 range->base.itree.last + 1);
> > -		++pt_update_ops->current_op;
> > +		++pt_update_ops->pt_job_ops->current_op;
> >  		pt_update_ops->needs_svm_lock = true;
> >  
> >  		pt_op->vma = vma;
> > @@ -1955,8 +1977,9 @@ static int unbind_op_prepare(struct xe_tile *tile,
> >  			     struct xe_vm_pgtable_update_ops *pt_update_ops,
> >  			     struct xe_vma *vma)
> >  {
> > -	u32 current_op = pt_update_ops->current_op;
> > -	struct xe_vm_pgtable_update_op *pt_op = &pt_update_ops->ops[current_op];
> > +	u32 current_op = pt_update_ops->pt_job_ops->current_op;
> > +	struct xe_vm_pgtable_update_op *pt_op =
> > +		to_pt_op(pt_update_ops, current_op);
> >  	int err;
> >  
> >  	if (!((vma->tile_present | vma->tile_staged) & BIT(tile->id)))
> > @@ -1984,7 +2007,7 @@ static int unbind_op_prepare(struct xe_tile *tile,
> >  				pt_op->num_entries, false);
> >  	xe_pt_update_ops_rfence_interval(pt_update_ops, xe_vma_start(vma),
> >  					 xe_vma_end(vma));
> > -	++pt_update_ops->current_op;
> > +	++pt_update_ops->pt_job_ops->current_op;
> >  	pt_update_ops->needs_userptr_lock |= xe_vma_is_userptr(vma);
> >  	pt_update_ops->needs_invalidation = true;
> >  
> > @@ -1998,8 +2021,9 @@ static int unbind_range_prepare(struct xe_vm *vm,
> >  				struct xe_vm_pgtable_update_ops *pt_update_ops,
> >  				struct xe_svm_range *range)
> >  {
> > -	u32 current_op = pt_update_ops->current_op;
> > -	struct xe_vm_pgtable_update_op *pt_op = &pt_update_ops->ops[current_op];
> > +	u32 current_op = pt_update_ops->pt_job_ops->current_op;
> > +	struct xe_vm_pgtable_update_op *pt_op =
> > +		to_pt_op(pt_update_ops, current_op);
> >  
> >  	if (!(range->tile_present & BIT(tile->id)))
> >  		return 0;
> > @@ -2019,7 +2043,7 @@ static int unbind_range_prepare(struct xe_vm *vm,
> >  				pt_op->num_entries, false);
> >  	xe_pt_update_ops_rfence_interval(pt_update_ops, range->base.itree.start,
> >  					 range->base.itree.last + 1);
> > -	++pt_update_ops->current_op;
> > +	++pt_update_ops->pt_job_ops->current_op;
> >  	pt_update_ops->needs_svm_lock = true;
> >  	pt_update_ops->needs_invalidation = true;
> >  
> > @@ -2122,7 +2146,6 @@ static int op_prepare(struct xe_vm *vm,
> >  static void
> >  xe_pt_update_ops_init(struct xe_vm_pgtable_update_ops *pt_update_ops)
> >  {
> > -	init_llist_head(&pt_update_ops->deferred);
> >  	pt_update_ops->start = ~0x0ull;
> >  	pt_update_ops->last = 0x0ull;
> >  }
> > @@ -2163,7 +2186,7 @@ int xe_pt_update_ops_prepare(struct xe_tile *tile, struct xe_vma_ops *vops)
> >  			return err;
> >  	}
> >  
> > -	xe_tile_assert(tile, pt_update_ops->current_op <=
> > +	xe_tile_assert(tile, pt_update_ops->pt_job_ops->current_op <=
> >  		       pt_update_ops->num_ops);
> >  
> >  #ifdef TEST_VM_OPS_ERROR
> > @@ -2396,7 +2419,7 @@ xe_pt_update_ops_run(struct xe_tile *tile, struct xe_vma_ops *vops)
> >  	lockdep_assert_held(&vm->lock);
> >  	xe_vm_assert_held(vm);
> >  
> > -	if (!pt_update_ops->current_op) {
> > +	if (!pt_update_ops->pt_job_ops->current_op) {
> >  		xe_tile_assert(tile, xe_vm_in_fault_mode(vm));
> >  
> >  		return dma_fence_get_stub();
> > @@ -2445,12 +2468,16 @@ xe_pt_update_ops_run(struct xe_tile *tile, struct xe_vma_ops *vops)
> >  		goto free_rfence;
> >  	}
> >  
> > -	/* Point of no return - VM killed if failure after this */
> > -	for (i = 0; i < pt_update_ops->current_op; ++i) {
> > -		struct xe_vm_pgtable_update_op *pt_op = &pt_update_ops->ops[i];
> > +	/*
> > +	 * Point of no return - VM killed if failure after this
> > +	 */
> > +	for (i = 0; i < pt_update_ops->pt_job_ops->current_op; ++i) {
> > +		struct xe_vm_pgtable_update_op *pt_op =
> > +			to_pt_op(pt_update_ops, i);
> >  
> >  		xe_pt_commit(pt_op->vma, pt_op->entries,
> > -			     pt_op->num_entries, &pt_update_ops->deferred);
> > +			     pt_op->num_entries,
> > +			     &pt_update_ops->pt_job_ops->deferred);
> >  		pt_op->vma = NULL;	/* skip in xe_pt_update_ops_abort */
> >  	}
> >  
> > @@ -2530,27 +2557,19 @@ xe_pt_update_ops_run(struct xe_tile *tile, struct xe_vma_ops *vops)
> >  ALLOW_ERROR_INJECTION(xe_pt_update_ops_run, ERRNO);
> >  
> >  /**
> > - * xe_pt_update_ops_fini() - Finish PT update operations
> > - * @tile: Tile of PT update operations
> > - * @vops: VMA operations
> > + * xe_pt_update_ops_free() - Free PT update operations
> > + * @pt_op: Array of PT update operations
> > + * @num_ops: Number of PT update operations
> >   *
> > - * Finish PT update operations by committing to destroy page table memory
> > + * Free PT update operations
> >   */
> > -void xe_pt_update_ops_fini(struct xe_tile *tile, struct xe_vma_ops *vops)
> > +static void xe_pt_update_ops_free(struct xe_vm_pgtable_update_op *pt_op,
> > +				  u32 num_ops)
> >  {
> > -	struct xe_vm_pgtable_update_ops *pt_update_ops =
> > -		&vops->pt_update_ops[tile->id];
> > -	int i;
> > -
> > -	lockdep_assert_held(&vops->vm->lock);
> > -	xe_vm_assert_held(vops->vm);
> > -
> > -	for (i = 0; i < pt_update_ops->current_op; ++i) {
> > -		struct xe_vm_pgtable_update_op *pt_op = &pt_update_ops->ops[i];
> > +	u32 i;
> >  
> > +	for (i = 0; i < num_ops; ++i, ++pt_op)
> >  		xe_pt_free_bind(pt_op->entries, pt_op->num_entries);
> > -	}
> > -	xe_bo_put_commit(&vops->pt_update_ops[tile->id].deferred);
> >  }
> >  
> >  /**
> > @@ -2571,9 +2590,9 @@ void xe_pt_update_ops_abort(struct xe_tile *tile, struct xe_vma_ops *vops)
> >  
> >  	for (i = pt_update_ops->num_ops - 1; i >= 0; --i) {
> >  		struct xe_vm_pgtable_update_op *pt_op =
> > -			&pt_update_ops->ops[i];
> > +			to_pt_op(pt_update_ops, i);
> >  
> > -		if (!pt_op->vma || i >= pt_update_ops->current_op)
> > +		if (!pt_op->vma || i >= pt_update_ops->pt_job_ops->current_op)
> >  			continue;
> >  
> >  		if (pt_op->bind)
> > @@ -2584,6 +2603,89 @@ void xe_pt_update_ops_abort(struct xe_tile *tile, struct xe_vma_ops *vops)
> >  			xe_pt_abort_unbind(pt_op->vma, pt_op->entries,
> >  					   pt_op->num_entries);
> >  	}
> > +}
> > +
> > +/**
> > + * xe_pt_job_ops_alloc() - Allocate PT job ops
> > + * @num_ops: Number of VM PT update ops
> > + *
> > + * Allocate PT job ops and internal array of VM PT update ops.
> > + *
> > + * Return: Pointer to PT job ops or NULL
> > + */
> > +struct xe_pt_job_ops *xe_pt_job_ops_alloc(u32 num_ops)
> > +{
> > +	struct xe_pt_job_ops *pt_job_ops;
> > +
> > +	pt_job_ops = kmalloc(sizeof(*pt_job_ops), GFP_KERNEL);
> > +	if (!pt_job_ops)
> > +		return NULL;
> > +
> > +	pt_job_ops->ops = kvmalloc_array(num_ops, sizeof(*pt_job_ops->ops),
> > +					 GFP_KERNEL);
> > +	if (!pt_job_ops->ops) {
> > +		kvfree(pt_job_ops);
> > +		return NULL;
> > +	}
> > +
> > +	pt_job_ops->current_op = 0;
> > +	kref_init(&pt_job_ops->refcount);
> > +	init_llist_head(&pt_job_ops->deferred);
> > +
> > +	return pt_job_ops;
> > +}
> > +
> > +/**
> > + * xe_pt_job_ops_get() - Get PT job ops
> > + * @pt_job_ops: PT job ops to get
> > + *
> > + * Take a reference to PT job ops
> > + *
> > + * Return: Pointer to PT job ops or NULL
> > + */
> > +struct xe_pt_job_ops *xe_pt_job_ops_get(struct xe_pt_job_ops *pt_job_ops)
> > +{
> > +	if (pt_job_ops)
> > +		kref_get(&pt_job_ops->refcount);
> > +
> > +	return pt_job_ops;
> > +}
> > +
> > +static void xe_pt_job_ops_destroy(struct kref *ref)
> > +{
> > +	struct xe_pt_job_ops *pt_job_ops =
> > +		container_of(ref, struct xe_pt_job_ops, refcount);
> > +	struct llist_node *freed;
> > +	struct xe_bo *bo, *next;
> > +
> > +	xe_pt_update_ops_free(pt_job_ops->ops,
> > +			      pt_job_ops->current_op);
> > +
> > +	freed = llist_del_all(&pt_job_ops->deferred);
> > +	if (freed) {
> > +		llist_for_each_entry_safe(bo, next, freed, freed)
> > +			/*
> > +			 * If called from run_job, we are in the dma-fencing
> > +			 * path and cannot take dma-resv locks so use an async
> > +			 * put.
> > +			 */
> > +			xe_bo_put_async(bo);
> > +	}
> > +
> > +	kvfree(pt_job_ops->ops);
> > +	kfree(pt_job_ops);
> > +}
> > +
> > +/**
> > + * xe_pt_job_ops_put() - Put PT job ops
> > + * @pt_job_ops: PT job ops to put
> > + *
> > + * Drop a reference to PT job ops
> > + */
> > +void xe_pt_job_ops_put(struct xe_pt_job_ops *pt_job_ops)
> > +{
> > +	if (!pt_job_ops)
> > +		return;
> >  
> > -	xe_pt_update_ops_fini(tile, vops);
> > +	kref_put(&pt_job_ops->refcount, xe_pt_job_ops_destroy);
> >  }
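xe_pt_job_ops follows the usual kref pattern. xe_pt_job_ops_alloc() starts the count at one for the allocating side, __xe_migrate_update_pgtables() takes a second reference for the job, and whichever xe_pt_job_ops_put() comes last runs xe_pt_job_ops_destroy(). Because that last put can happen in run_job, on the dma-fence signalling path, destroy hands the deferred BOs to an async put instead of taking dma-resv locks. Below is a userspace sketch of the same lifetime rules. It assumes C11 atomics in place of struct kref; all names are hypothetical, and the "async put" is just a counter.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdlib.h>

/* Hypothetical node for a BO waiting on the deferred list. */
struct deferred_bo { struct deferred_bo *next; };

/* Hypothetical userspace analogue of struct xe_pt_job_ops. */
struct job_ops {
	atomic_int refcount;		/* plays the role of struct kref */
	struct deferred_bo *deferred;	/* BOs to release on the last put */
	unsigned int *released;		/* counts "async puts" for the demo */
};

static struct job_ops *job_ops_alloc(unsigned int *released)
{
	struct job_ops *o = malloc(sizeof(*o));

	if (!o)
		return NULL;
	atomic_init(&o->refcount, 1);	/* like kref_init() */
	o->deferred = NULL;
	o->released = released;
	return o;
}

static struct job_ops *job_ops_get(struct job_ops *o)
{
	if (o)
		atomic_fetch_add(&o->refcount, 1);	/* like kref_get() */
	return o;
}

/* Last reference gone: hand every deferred BO to an async put, then free. */
static void job_ops_destroy(struct job_ops *o)
{
	struct deferred_bo *bo, *next;

	for (bo = o->deferred; bo; bo = next) {
		next = bo->next;
		(*o->released)++;	/* stands in for xe_bo_put_async(bo) */
	}
	free(o);
}

static void job_ops_put(struct job_ops *o)
{
	/* atomic_fetch_sub() returns the old value: 1 means we hit zero. */
	if (o && atomic_fetch_sub(&o->refcount, 1) == 1)
		job_ops_destroy(o);
}
```

Because each holder drops only its own reference, it does not matter whether the bind path or the job releases the ops first; the deferred BOs are released exactly once, by whichever put is last.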
> > diff --git a/drivers/gpu/drm/xe/xe_pt.h b/drivers/gpu/drm/xe/xe_pt.h
> > index 5ecf003d513c..c9904573db82 100644
> > --- a/drivers/gpu/drm/xe/xe_pt.h
> > +++ b/drivers/gpu/drm/xe/xe_pt.h
> > @@ -41,11 +41,14 @@ void xe_pt_clear(struct xe_device *xe, struct xe_pt *pt);
> >  int xe_pt_update_ops_prepare(struct xe_tile *tile, struct xe_vma_ops *vops);
> >  struct dma_fence *xe_pt_update_ops_run(struct xe_tile *tile,
> >  				       struct xe_vma_ops *vops);
> > -void xe_pt_update_ops_fini(struct xe_tile *tile, struct xe_vma_ops *vops);
> >  void xe_pt_update_ops_abort(struct xe_tile *tile, struct xe_vma_ops *vops);
> >  
> >  bool xe_pt_zap_ptes(struct xe_tile *tile, struct xe_vma *vma);
> >  bool xe_pt_zap_ptes_range(struct xe_tile *tile, struct xe_vm *vm,
> >  			  struct xe_svm_range *range);
> >  
> > +struct xe_pt_job_ops *xe_pt_job_ops_alloc(u32 num_ops);
> > +struct xe_pt_job_ops *xe_pt_job_ops_get(struct xe_pt_job_ops *pt_job_ops);
> > +void xe_pt_job_ops_put(struct xe_pt_job_ops *pt_job_ops);
> > +
> >  #endif
> > diff --git a/drivers/gpu/drm/xe/xe_pt_types.h b/drivers/gpu/drm/xe/xe_pt_types.h
> > index 69eab6f37cfe..33d0d20e0ac6 100644
> > --- a/drivers/gpu/drm/xe/xe_pt_types.h
> > +++ b/drivers/gpu/drm/xe/xe_pt_types.h
> > @@ -70,6 +70,9 @@ struct xe_vm_pgtable_update {
> >  	/** @pt_entries: Newly added pagetable entries */
> >  	struct xe_pt_entry *pt_entries;
> >  
> > +	/** @level: level of update */
> > +	unsigned int level;
> > +
> >  	/** @flags: Target flags */
> >  	u32 flags;
> >  };
> > @@ -88,12 +91,28 @@ struct xe_vm_pgtable_update_op {
> >  	bool rebind;
> >  };
> >  
> > -/** struct xe_vm_pgtable_update_ops: page table update operations */
> > -struct xe_vm_pgtable_update_ops {
> > -	/** @ops: operations */
> > -	struct xe_vm_pgtable_update_op *ops;
> > +/**
> > + * struct xe_pt_job_ops: page table update operations dynamic allocation
> > + *
> > + * This is the part of struct xe_vma_ops and struct xe_vm_pgtable_update_ops
> > + * which is dynamic allocated as it must be available until the bind job is
> > + * complete.
> > + */
> > +struct xe_pt_job_ops {
> > +	/** @current_op: current operations */
> > +	u32 current_op;
> > +	/** @refcount: ref count ops allocation */
> > +	struct kref refcount;
> >  	/** @deferred: deferred list to destroy PT entries */
> >  	struct llist_head deferred;
> > +	/** @ops: operations */
> > +	struct xe_vm_pgtable_update_op *ops;
> > +};
> > +
> > +/** struct xe_vm_pgtable_update_ops: page table update operations */
> > +struct xe_vm_pgtable_update_ops {
> > +	/** @pt_job_ops: PT update operations dynamic allocation*/
> > +	struct xe_pt_job_ops *pt_job_ops;
> >  	/** @q: exec queue for PT operations */
> >  	struct xe_exec_queue *q;
> >  	/** @start: start address of ops */
> > @@ -102,8 +121,6 @@ struct xe_vm_pgtable_update_ops {
> >  	u64 last;
> >  	/** @num_ops: number of operations */
> >  	u32 num_ops;
> > -	/** @current_op: current operations */
> > -	u32 current_op;
> >  	/** @needs_svm_lock: Needs SVM lock */
> >  	bool needs_svm_lock;
> >  	/** @needs_userptr_lock: Needs userptr lock */
> > diff --git a/drivers/gpu/drm/xe/xe_sched_job.c b/drivers/gpu/drm/xe/xe_sched_job.c
> > index d21bf8f26964..09cdd14d9ef7 100644
> > --- a/drivers/gpu/drm/xe/xe_sched_job.c
> > +++ b/drivers/gpu/drm/xe/xe_sched_job.c
> > @@ -26,19 +26,22 @@ static struct kmem_cache *xe_sched_job_parallel_slab;
> >  
> >  int __init xe_sched_job_module_init(void)
> >  {
> > +	struct xe_sched_job *job;
> > +	size_t size;
> > +
> > +	size = struct_size(job, ptrs, 1);
> >  	xe_sched_job_slab =
> > -		kmem_cache_create("xe_sched_job",
> > -				  sizeof(struct xe_sched_job) +
> > -				  sizeof(struct xe_job_ptrs), 0,
> > +		kmem_cache_create("xe_sched_job", size, 0,
> >  				  SLAB_HWCACHE_ALIGN, NULL);
> >  	if (!xe_sched_job_slab)
> >  		return -ENOMEM;
> >  
> > +	size = max_t(size_t,
> > +		     struct_size(job, ptrs,
> > +				 XE_HW_ENGINE_MAX_INSTANCE),
> > +		     struct_size(job, pt_update, 1));
> >  	xe_sched_job_parallel_slab =
> > -		kmem_cache_create("xe_sched_job_parallel",
> > -				  sizeof(struct xe_sched_job) +
> > -				  sizeof(struct xe_job_ptrs) *
> > -				  XE_HW_ENGINE_MAX_INSTANCE, 0,
> > +		kmem_cache_create("xe_sched_job_parallel", size, 0,
> >  				  SLAB_HWCACHE_ALIGN, NULL);
> >  	if (!xe_sched_job_parallel_slab) {
> >  		kmem_cache_destroy(xe_sched_job_slab);
> > @@ -84,7 +87,7 @@ static void xe_sched_job_free_fences(struct xe_sched_job *job)
> >  {
> >  	int i;
> >  
> > -	for (i = 0; i < job->q->width; ++i) {
> > +	for (i = 0; !job->is_pt_job && i < job->q->width; ++i) {
> >  		struct xe_job_ptrs *ptrs = &job->ptrs[i];
> >  
> >  		if (ptrs->lrc_fence)
> > @@ -118,33 +121,44 @@ struct xe_sched_job *xe_sched_job_create(struct xe_exec_queue *q,
> >  	if (err)
> >  		goto err_free;
> >  
> > -	for (i = 0; i < q->width; ++i) {
> > -		struct dma_fence *fence = xe_lrc_alloc_seqno_fence();
> > -		struct dma_fence_chain *chain;
> > -
> > -		if (IS_ERR(fence)) {
> > -			err = PTR_ERR(fence);
> > -			goto err_sched_job;
> > -		}
> > -		job->ptrs[i].lrc_fence = fence;
> > -
> > -		if (i + 1 == q->width)
> > -			continue;
> > -
> > -		chain = dma_fence_chain_alloc();
> > -		if (!chain) {
> > +	if (!batch_addr) {
> > +		job->fence = dma_fence_allocate_private_stub(ktime_get());
> > +		if (!job->fence) {
> >  			err = -ENOMEM;
> >  			goto err_sched_job;
> >  		}
> > -		job->ptrs[i].chain_fence = chain;
> > +		job->is_pt_job = true;
> > +	} else {
> > +		for (i = 0; i < q->width; ++i) {
> > +			struct dma_fence *fence = xe_lrc_alloc_seqno_fence();
> > +			struct dma_fence_chain *chain;
> > +
> > +			if (IS_ERR(fence)) {
> > +				err = PTR_ERR(fence);
> > +				goto err_sched_job;
> > +			}
> > +			job->ptrs[i].lrc_fence = fence;
> > +
> > +			if (i + 1 == q->width)
> > +				continue;
> > +
> > +			chain = dma_fence_chain_alloc();
> > +			if (!chain) {
> > +				err = -ENOMEM;
> > +				goto err_sched_job;
> > +			}
> > +			job->ptrs[i].chain_fence = chain;
> > +		}
> >  	}
> >  
> > -	width = q->width;
> > -	if (is_migration)
> > -		width = 2;
> > +	if (batch_addr) {
> > +		width = q->width;
> > +		if (is_migration)
> > +			width = 2;
> >  
> > -	for (i = 0; i < width; ++i)
> > -		job->ptrs[i].batch_addr = batch_addr[i];
> > +		for (i = 0; i < width; ++i)
> > +			job->ptrs[i].batch_addr = batch_addr[i];
> > +	}
> >  
> >  	xe_pm_runtime_get_noresume(job_to_xe(job));
> >  	trace_xe_sched_job_create(job);
> > @@ -243,7 +257,7 @@ bool xe_sched_job_completed(struct xe_sched_job *job)
> >  void xe_sched_job_arm(struct xe_sched_job *job)
> >  {
> >  	struct xe_exec_queue *q = job->q;
> > -	struct dma_fence *fence, *prev;
> > +	struct dma_fence *fence = job->fence, *prev;
> >  	struct xe_vm *vm = q->vm;
> >  	u64 seqno = 0;
> >  	int i;
> > @@ -263,6 +277,9 @@ void xe_sched_job_arm(struct xe_sched_job *job)
> >  		job->ring_ops_flush_tlb = true;
> >  	}
> >  
> > +	if (job->is_pt_job)
> > +		goto arm;
> > +
> >  	/* Arm the pre-allocated fences */
> >  	for (i = 0; i < q->width; prev = fence, ++i) {
> >  		struct dma_fence_chain *chain;
> > @@ -283,6 +300,7 @@ void xe_sched_job_arm(struct xe_sched_job *job)
> >  		fence = &chain->base;
> >  	}
> >  
> > +arm:
> >  	job->fence = dma_fence_get(fence);	/* Pairs with put in scheduler */
> >  	drm_sched_job_arm(&job->drm);
> >  }
> > diff --git a/drivers/gpu/drm/xe/xe_sched_job_types.h b/drivers/gpu/drm/xe/xe_sched_job_types.h
> > index dbf260dded8d..79a459f2a0a8 100644
> > --- a/drivers/gpu/drm/xe/xe_sched_job_types.h
> > +++ b/drivers/gpu/drm/xe/xe_sched_job_types.h
> > @@ -10,10 +10,29 @@
> >  
> >  #include <drm/gpu_scheduler.h>
> >  
> > -struct xe_exec_queue;
> >  struct dma_fence;
> >  struct dma_fence_chain;
> >  
> > +struct xe_exec_queue;
> > +struct xe_migrate_pt_update_ops;
> > +struct xe_pt_job_ops;
> > +struct xe_tile;
> > +struct xe_vm;
> > +
> > +/**
> > + * struct xe_pt_update_args - PT update arguments
> > + */
> > +struct xe_pt_update_args {
> > +	/** @vm: VM */
> > +	struct xe_vm *vm;
> > +	/** @tile: Tile */
> > +	struct xe_tile *tile;
> > +	/** @ops: Migrate PT update ops */
> > +	const struct xe_migrate_pt_update_ops *ops;
> > +	/** @pt_job_ops: PT update ops */
> > +	struct xe_pt_job_ops *pt_job_ops;
> > +};
> > +
> >  /**
> >   * struct xe_job_ptrs - Per hw engine instance data
> >   */
> > @@ -58,8 +77,14 @@ struct xe_sched_job {
> >  	bool ring_ops_flush_tlb;
> >  	/** @ggtt: mapped in ggtt. */
> >  	bool ggtt;
> > -	/** @ptrs: per instance pointers. */
> > -	struct xe_job_ptrs ptrs[];
> > +	/** @is_pt_job: is a PT job */
> > +	bool is_pt_job;
> > +	union {
> > +		/** @ptrs: per instance pointers. */
> > +		DECLARE_FLEX_ARRAY(struct xe_job_ptrs, ptrs);
> > +		/** @pt_update: PT update arguments */
> > +		DECLARE_FLEX_ARRAY(struct xe_pt_update_args, pt_update);
> > +	};
> >  };
> >  
> >  struct xe_sched_job_snapshot {
> > diff --git a/drivers/gpu/drm/xe/xe_vm.c b/drivers/gpu/drm/xe/xe_vm.c
> > index 18f967ce1f1a..6fc01fdd7286 100644
> > --- a/drivers/gpu/drm/xe/xe_vm.c
> > +++ b/drivers/gpu/drm/xe/xe_vm.c
> > @@ -780,6 +780,19 @@ int xe_vm_userptr_check_repin(struct xe_vm *vm)
> >  		list_empty_careful(&vm->userptr.invalidated)) ? 0 : -EAGAIN;
> >  }
> >  
> > +static void xe_vma_ops_init(struct xe_vma_ops *vops, struct xe_vm *vm,
> > +			    struct xe_exec_queue *q,
> > +			    struct xe_sync_entry *syncs, u32 num_syncs)
> > +{
> > +	memset(vops, 0, sizeof(*vops));
> > +	INIT_LIST_HEAD(&vops->list);
> > +	vops->vm = vm;
> > +	vops->q = q;
> > +	vops->syncs = syncs;
> > +	vops->num_syncs = num_syncs;
> > +	vops->flags = 0;
> > +}
> > +
> >  static int xe_vma_ops_alloc(struct xe_vma_ops *vops, bool array_of_binds)
> >  {
> >  	int i;
> > @@ -788,11 +801,9 @@ static int xe_vma_ops_alloc(struct xe_vma_ops *vops, bool array_of_binds)
> >  		if (!vops->pt_update_ops[i].num_ops)
> >  			continue;
> >  
> > -		vops->pt_update_ops[i].ops =
> > -			kmalloc_array(vops->pt_update_ops[i].num_ops,
> > -				      sizeof(*vops->pt_update_ops[i].ops),
> > -				      GFP_KERNEL | __GFP_RETRY_MAYFAIL | __GFP_NOWARN);
> > -		if (!vops->pt_update_ops[i].ops)
> > +		vops->pt_update_ops[i].pt_job_ops =
> > +			xe_pt_job_ops_alloc(vops->pt_update_ops[i].num_ops);
> > +		if (!vops->pt_update_ops[i].pt_job_ops)
> >  			return array_of_binds ? -ENOBUFS : -ENOMEM;
> >  	}
> >  
> > @@ -828,7 +839,7 @@ static void xe_vma_ops_fini(struct xe_vma_ops *vops)
> >  	xe_vma_svm_prefetch_ops_fini(vops);
> >  
> >  	for (i = 0; i < XE_MAX_TILES_PER_DEVICE; ++i)
> > -		kfree(vops->pt_update_ops[i].ops);
> > +		xe_pt_job_ops_put(vops->pt_update_ops[i].pt_job_ops);
> >  }
> >  
> >  static void xe_vma_ops_incr_pt_update_ops(struct xe_vma_ops *vops, u8 tile_mask, int inc_val)
> > @@ -877,9 +888,6 @@ static int xe_vm_ops_add_rebind(struct xe_vma_ops *vops, struct xe_vma *vma,
> >  
> >  static struct dma_fence *ops_execute(struct xe_vm *vm,
> >  				     struct xe_vma_ops *vops);
> > -static void xe_vma_ops_init(struct xe_vma_ops *vops, struct xe_vm *vm,
> > -			    struct xe_exec_queue *q,
> > -			    struct xe_sync_entry *syncs, u32 num_syncs);
> >  
> >  int xe_vm_rebind(struct xe_vm *vm, bool rebind_worker)
> >  {
> > @@ -3163,13 +3171,6 @@ static struct dma_fence *ops_execute(struct xe_vm *vm,
> >  		fence = &cf->base;
> >  	}
> >  
> > -	for_each_tile(tile, vm->xe, id) {
> > -		if (!vops->pt_update_ops[id].num_ops)
> > -			continue;
> > -
> > -		xe_pt_update_ops_fini(tile, vops);
> > -	}
> > -
> >  	return fence;
> >  
> >  err_out:
> > @@ -3447,19 +3448,6 @@ static int vm_bind_ioctl_signal_fences(struct xe_vm *vm,
> >  	return err;
> >  }
> >  
> > -static void xe_vma_ops_init(struct xe_vma_ops *vops, struct xe_vm *vm,
> > -			    struct xe_exec_queue *q,
> > -			    struct xe_sync_entry *syncs, u32 num_syncs)
> > -{
> > -	memset(vops, 0, sizeof(*vops));
> > -	INIT_LIST_HEAD(&vops->list);
> > -	vops->vm = vm;
> > -	vops->q = q;
> > -	vops->syncs = syncs;
> > -	vops->num_syncs = num_syncs;
> > -	vops->flags = 0;
> > -}
> > -
> >  static int xe_vm_bind_ioctl_validate_bo(struct xe_device *xe, struct xe_bo *bo,
> >  					u64 addr, u64 range, u64 obj_offset,
> >  					u16 pat_index, u32 op, u32 bind_flags)
> 
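The new slab sizing in xe_sched_job_module_init() can be modelled in userspace as below. This is a minimal sketch, not the Xe code: job_hdr, job_ptrs, pt_update_args and MAX_INSTANCE are hypothetical stand-ins for struct xe_sched_job, struct xe_job_ptrs, struct xe_pt_update_args and XE_HW_ENGINE_MAX_INSTANCE, and flex_size() is a hand-written substitute for the kernel's struct_size(), which adds the header size to count * element size and saturates at SIZE_MAX on overflow. Because ptrs[] and pt_update[] overlay the same trailing storage in the union, the parallel cache has to be sized for whichever layout is larger.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical stand-in for the fixed part of struct xe_sched_job. */
struct job_hdr {
	void *q;
	void *fence;
	uint64_t flags;
};

/* Hypothetical stand-in for struct xe_job_ptrs (per engine instance). */
struct job_ptrs {
	void *lrc_fence;
	void *chain_fence;
	uint64_t batch_addr;
};

/* Hypothetical stand-in for struct xe_pt_update_args. */
struct pt_update_args {
	void *vm;
	void *tile;
	const void *ops;
	void *pt_job_ops;
};

/* Assumption: placeholder value, not the real XE_HW_ENGINE_MAX_INSTANCE. */
#define MAX_INSTANCE 16

/* Header plus n trailing elements, saturating to SIZE_MAX on overflow
 * (the behaviour struct_size() provides in the kernel). */
size_t flex_size(size_t hdr, size_t elem, size_t n)
{
	if (n && elem > (SIZE_MAX - hdr) / n)
		return SIZE_MAX;
	return hdr + elem * n;
}

size_t max_size(size_t a, size_t b)
{
	return a > b ? a : b;
}

/* Cache for ordinary single-width jobs: one ptrs[] entry. */
size_t single_slab_size(void)
{
	return flex_size(sizeof(struct job_hdr), sizeof(struct job_ptrs), 1);
}

/* Parallel cache: sized to hold either MAX_INSTANCE ptrs[] entries or
 * one pt_update[] entry, since both arrays share the trailing storage. */
size_t parallel_slab_size(void)
{
	return max_size(flex_size(sizeof(struct job_hdr),
				  sizeof(struct job_ptrs), MAX_INSTANCE),
			flex_size(sizeof(struct job_hdr),
				  sizeof(struct pt_update_args), 1));
}
```

A caller would hand parallel_slab_size() to its allocator as the object size, in the same way the patch passes size to kmem_cache_create().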

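The lifetime rule stated in the new struct xe_pt_job_ops kernel-doc (the ops array must stay valid until the bind job completes) is a standard refcount pattern. The sketch below models it in userspace with C11 atomics in place of struct kref; pt_op, pt_job_ops and the pt_job_ops_* helpers are hypothetical names, and it assumes the job side takes its own reference, which the quoted hunks do not show. The put helper accepts NULL because xe_vma_ops_fini() now calls xe_pt_job_ops_put() for every tile, while xe_vma_ops_alloc() only allocates for tiles with a non-zero num_ops and xe_vma_ops_init() zeroes the rest.

```c
#include <stdatomic.h>
#include <stdint.h>
#include <stdlib.h>

/* Hypothetical stand-in for struct xe_vm_pgtable_update_op. */
struct pt_op {
	uint64_t start;
	uint64_t last;
};

/* Mirrors the shape of struct xe_pt_job_ops: the ops array plus a
 * reference count, so the array can outlive the code that built it. */
struct pt_job_ops {
	atomic_uint refcount;
	uint32_t current_op;
	struct pt_op *ops;
};

/* Assumes num_ops > 0, as in xe_vma_ops_alloc(), which skips empty tiles. */
struct pt_job_ops *pt_job_ops_alloc(uint32_t num_ops)
{
	struct pt_job_ops *p = calloc(1, sizeof(*p));

	if (!p)
		return NULL;
	p->ops = calloc(num_ops, sizeof(*p->ops));
	if (!p->ops) {
		free(p);
		return NULL;
	}
	atomic_init(&p->refcount, 1);	/* reference owned by the allocator */
	return p;
}

/* Take an extra reference, e.g. for a job that reads ops later. */
struct pt_job_ops *pt_job_ops_get(struct pt_job_ops *p)
{
	if (p)
		atomic_fetch_add(&p->refcount, 1);
	return p;
}

/* NULL-tolerant, like kfree(). Returns 1 when the last reference is
 * dropped and the memory is freed, 0 otherwise (kref_put() convention). */
int pt_job_ops_put(struct pt_job_ops *p)
{
	if (!p)
		return 0;
	if (atomic_fetch_sub(&p->refcount, 1) != 1)
		return 0;
	free(p->ops);
	free(p);
	return 1;
}
```

With this shape the fini path can drop its reference unconditionally, and the array is only freed once the last holder (in this model, the job) puts it.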

* ✓ CI.Patch_applied: success for CPU binds and ULLS on migration queue
  2025-06-05 15:32 [PATCH 00/15] CPU binds and ULLS on migration queue Matthew Brost
                   ` (14 preceding siblings ...)
  2025-06-05 15:32 ` [PATCH 15/15] drm/xe: Add modparam to enable / disable high SLPC " Matthew Brost
@ 2025-06-05 22:30 ` Patchwork
  2025-06-05 22:31 ` ✓ CI.checkpatch: " Patchwork
  2025-06-05 22:31 ` ✗ CI.KUnit: failure " Patchwork
  17 siblings, 0 replies; 21+ messages in thread
From: Patchwork @ 2025-06-05 22:30 UTC (permalink / raw)
  To: Matthew Brost; +Cc: intel-xe

== Series Details ==

Series: CPU binds and ULLS on migration queue
URL   : https://patchwork.freedesktop.org/series/149888/
State : success

== Summary ==

=== Applying kernel patches on branch 'drm-tip' with base: ===
Base commit: 832a676ed689 drm-tip: 2025y-06m-05d-18h-32m-30s UTC integration manifest
=== git am output follows ===
Applying: drm/xe: Drop struct xe_migrate_pt_update argument from populate / clear vfuns
Applying: drm/xe: Add __xe_migrate_update_pgtables_cpu helper
Applying: drm/xe: CPU binds for jobs
Applying: drm/xe: Remove unused arguments from xe_migrate_pt_update_ops
Applying: drm/xe: Don't use migrate exec queue for page fault binds
Applying: drm/xe: Add xe_hw_engine_write_ring_tail
Applying: drm/xe: Add ULLS support to LRC
Applying: drm/xe: Add ULLS migration job support to migration layer
Applying: drm/xe: Add MI_SEMAPHORE_WAIT instruction defs
Applying: drm/xe: Add ULLS migration job support to ring ops
Applying: drm/xe: Add ULLS migration job support to GuC submission
Applying: drm/xe: Enable ULLS migration jobs when opening LR VM
Applying: drm/xe: Set slpc freq to max on ULLS jobs
Applying: drm/xe: Add modparam to enable / disable ULLS on migrate queue
Applying: drm/xe: Add modparam to enable / disable high SLPC on migrate queue




* ✓ CI.checkpatch: success for CPU binds and ULLS on migration queue
  2025-06-05 15:32 [PATCH 00/15] CPU binds and ULLS on migration queue Matthew Brost
                   ` (15 preceding siblings ...)
  2025-06-05 22:30 ` ✓ CI.Patch_applied: success for CPU binds and ULLS on migration queue Patchwork
@ 2025-06-05 22:31 ` Patchwork
  2025-06-05 22:31 ` ✗ CI.KUnit: failure " Patchwork
  17 siblings, 0 replies; 21+ messages in thread
From: Patchwork @ 2025-06-05 22:31 UTC (permalink / raw)
  To: Matthew Brost; +Cc: intel-xe

== Series Details ==

Series: CPU binds and ULLS on migration queue
URL   : https://patchwork.freedesktop.org/series/149888/
State : success

== Summary ==

+ KERNEL=/kernel
+ git clone https://gitlab.freedesktop.org/drm/maintainer-tools mt
Cloning into 'mt'...
warning: redirecting to https://gitlab.freedesktop.org/drm/maintainer-tools.git/
+ git -C mt rev-list -n1 origin/master
202708c00696422fd217223bb679a353a5936e23
+ cd /kernel
+ git config --global --add safe.directory /kernel
+ git log -n1
commit 279b28877fabf6dc0d532a4a59512f14637941e8
Author: Matthew Brost <matthew.brost@intel.com>
Date:   Thu Jun 5 08:32:23 2025 -0700

    drm/xe: Add modparam to enable / disable high SLPC on migrate queue
    
    Having modparam to enable / disable high SLPC on migrate queue will help
    with quick experiments.
    
    Signed-off-by: Matthew Brost <matthew.brost@intel.com>
+ /mt/dim checkpatch 832a676ed68901ea8c404d8602381cd8ca413e88 drm-intel
f4c2f0320be4 drm/xe: Drop struct xe_migrate_pt_update argument from populate / clear vfuns
d813f13debcc drm/xe: Add __xe_migrate_update_pgtables_cpu helper
61e8e5136893 drm/xe: CPU binds for jobs
6f6d95dd398f drm/xe: Remove unused arguments from xe_migrate_pt_update_ops
e310cfc8cc2a drm/xe: Don't use migrate exec queue for page fault binds
0390a9be07e2 drm/xe: Add xe_hw_engine_write_ring_tail
031e47d4be17 drm/xe: Add ULLS support to LRC
150fab7c33a2 drm/xe: Add ULLS migration job support to migration layer
833c20d8b5da drm/xe: Add MI_SEMAPHORE_WAIT instruction defs
35429c87199f drm/xe: Add ULLS migration job support to ring ops
dbc5ead91ac7 drm/xe: Add ULLS migration job support to GuC submission
d09e11ff799f drm/xe: Enable ULLS migration jobs when opening LR VM
fcf6b019327a drm/xe: Set slpc freq to max on ULLS jobs
4125a1114dc2 drm/xe: Add modparam to enable / disable ULLS on migrate queue
279b28877fab drm/xe: Add modparam to enable / disable high SLPC on migrate queue




* ✗ CI.KUnit: failure for CPU binds and ULLS on migration queue
  2025-06-05 15:32 [PATCH 00/15] CPU binds and ULLS on migration queue Matthew Brost
                   ` (16 preceding siblings ...)
  2025-06-05 22:31 ` ✓ CI.checkpatch: " Patchwork
@ 2025-06-05 22:31 ` Patchwork
  17 siblings, 0 replies; 21+ messages in thread
From: Patchwork @ 2025-06-05 22:31 UTC (permalink / raw)
  To: Matthew Brost; +Cc: intel-xe

== Series Details ==

Series: CPU binds and ULLS on migration queue
URL   : https://patchwork.freedesktop.org/series/149888/
State : failure

== Summary ==

+ trap cleanup EXIT
+ /kernel/tools/testing/kunit/kunit.py run --kunitconfig /kernel/drivers/gpu/drm/xe/.kunitconfig
ERROR:root:In file included from ../include/drm/ttm/ttm_resource.h:31,
                 from ../include/drm/ttm/ttm_device.h:30,
                 from ../drivers/gpu/drm/xe/xe_device_types.h:14,
                 from ../drivers/gpu/drm/xe/xe_gt_types.h:9,
                 from ../drivers/gpu/drm/xe/xe_assert.h:13,
                 from ../drivers/gpu/drm/xe/xe_migrate.c:21:
../drivers/gpu/drm/xe/tests/xe_migrate.c: In function ‘xe_migrate_sanity_test’:
../drivers/gpu/drm/xe/tests/xe_migrate.c:242:50: error: ‘NUM_KERNEL_PDE’ undeclared (first use in this function)
  242 |         xe_map_wr(xe, &bo->vmap, XE_PAGE_SIZE * (NUM_KERNEL_PDE - 1), u64,
      |                                                  ^~~~~~~~~~~~~~
../include/linux/iosys-map.h:372:27: note: in definition of macro ‘__iosys_map_wr_io’
  372 |         u8: writeb(val__, vaddr_iomem__),                                       \
      |                           ^~~~~~~~~~~~~
../drivers/gpu/drm/xe/xe_map.h:78:9: note: in expansion of macro ‘iosys_map_wr’
   78 |         iosys_map_wr(map__, offset__, type__, val__);                   \
      |         ^~~~~~~~~~~~
../drivers/gpu/drm/xe/tests/xe_migrate.c:242:9: note: in expansion of macro ‘xe_map_wr’
  242 |         xe_map_wr(xe, &bo->vmap, XE_PAGE_SIZE * (NUM_KERNEL_PDE - 1), u64,
      |         ^~~~~~~~~
../drivers/gpu/drm/xe/tests/xe_migrate.c:242:50: note: each undeclared identifier is reported only once for each function it appears in
  242 |         xe_map_wr(xe, &bo->vmap, XE_PAGE_SIZE * (NUM_KERNEL_PDE - 1), u64,
      |                                                  ^~~~~~~~~~~~~~
../include/linux/iosys-map.h:372:27: note: in definition of macro ‘__iosys_map_wr_io’
  372 |         u8: writeb(val__, vaddr_iomem__),                                       \
      |                           ^~~~~~~~~~~~~
../drivers/gpu/drm/xe/xe_map.h:78:9: note: in expansion of macro ‘iosys_map_wr’
   78 |         iosys_map_wr(map__, offset__, type__, val__);                   \
      |         ^~~~~~~~~~~~
../drivers/gpu/drm/xe/tests/xe_migrate.c:242:9: note: in expansion of macro ‘xe_map_wr’
  242 |         xe_map_wr(xe, &bo->vmap, XE_PAGE_SIZE * (NUM_KERNEL_PDE - 1), u64,
      |         ^~~~~~~~~
make[7]: *** [../scripts/Makefile.build:203: drivers/gpu/drm/xe/xe_migrate.o] Error 1
make[7]: *** Waiting for unfinished jobs....
make[6]: *** [../scripts/Makefile.build:461: drivers/gpu/drm/xe] Error 2
make[5]: *** [../scripts/Makefile.build:461: drivers/gpu/drm] Error 2
make[4]: *** [../scripts/Makefile.build:461: drivers/gpu] Error 2
make[3]: *** [../scripts/Makefile.build:461: drivers] Error 2
make[2]: *** [/kernel/Makefile:2003: .] Error 2
make[1]: *** [/kernel/Makefile:248: __sub-make] Error 2
make: *** [Makefile:248: __sub-make] Error 2

[22:31:11] Configuring KUnit Kernel ...
Generating .config ...
Populating config with:
$ make ARCH=um O=.kunit olddefconfig
[22:31:16] Building KUnit Kernel ...
Populating config with:
$ make ARCH=um O=.kunit olddefconfig
Building with:
$ make all compile_commands.json scripts_gdb ARCH=um O=.kunit --jobs=48
+ cleanup
++ stat -c %u:%g /kernel
+ chown -R 1003:1003 /kernel





Thread overview: 21+ messages
2025-06-05 15:32 [PATCH 00/15] CPU binds and ULLS on migration queue Matthew Brost
2025-06-05 15:32 ` [PATCH 01/15] drm/xe: Drop struct xe_migrate_pt_update argument from populate / clear vfuns Matthew Brost
2025-06-05 15:32 ` [PATCH 02/15] drm/xe: Add __xe_migrate_update_pgtables_cpu helper Matthew Brost
2025-06-05 15:32 ` [PATCH 03/15] drm/xe: CPU binds for jobs Matthew Brost
2025-06-05 15:44   ` Thomas Hellström
2025-06-05 16:13     ` Matthew Brost
2025-06-05 15:32 ` [PATCH 04/15] drm/xe: Remove unused arguments from xe_migrate_pt_update_ops Matthew Brost
2025-06-05 15:32 ` [PATCH 05/15] drm/xe: Don't use migrate exec queue for page fault binds Matthew Brost
2025-06-05 15:32 ` [PATCH 06/15] drm/xe: Add xe_hw_engine_write_ring_tail Matthew Brost
2025-06-05 15:32 ` [PATCH 07/15] drm/xe: Add ULLS support to LRC Matthew Brost
2025-06-05 15:32 ` [PATCH 08/15] drm/xe: Add ULLS migration job support to migration layer Matthew Brost
2025-06-05 15:32 ` [PATCH 09/15] drm/xe: Add MI_SEMAPHORE_WAIT instruction defs Matthew Brost
2025-06-05 15:32 ` [PATCH 10/15] drm/xe: Add ULLS migration job support to ring ops Matthew Brost
2025-06-05 15:32 ` [PATCH 11/15] drm/xe: Add ULLS migration job support to GuC submission Matthew Brost
2025-06-05 15:32 ` [PATCH 12/15] drm/xe: Enable ULLS migration jobs when opening LR VM Matthew Brost
2025-06-05 15:32 ` [PATCH 13/15] drm/xe: Set slpc freq to max on ULLS jobs Matthew Brost
2025-06-05 15:32 ` [PATCH 14/15] drm/xe: Add modparam to enable / disable ULLS on migrate queue Matthew Brost
2025-06-05 15:32 ` [PATCH 15/15] drm/xe: Add modparam to enable / disable high SLPC " Matthew Brost
2025-06-05 22:30 ` ✓ CI.Patch_applied: success for CPU binds and ULLS on migration queue Patchwork
2025-06-05 22:31 ` ✓ CI.checkpatch: " Patchwork
2025-06-05 22:31 ` ✗ CI.KUnit: failure " Patchwork
