Intel-XE Archive on lore.kernel.org
* [v2 00/31] Basic system allocator support in xe driver
@ 2024-04-09 20:17 Oak Zeng
From: Oak Zeng @ 2024-04-09 20:17 UTC (permalink / raw)
  To: intel-xe
  Cc: himal.prasad.ghimiray, krishnaiah.bommu, matthew.brost,
	Thomas.Hellstrom, brian.welty

This is v2 of the basic system allocator support in the xe KMD driver.
v1 is here: https://lore.kernel.org/dri-devel/20240117221223.18540-1-oak.zeng@intel.com/

Significant design changes were made since v1, based on drm community
review feedback:

1) Introduce a vm_bind uAPI for the system allocator. With this uAPI,
the user can optionally bind a CPU virtual address range A..B to a GPU
virtual address range C..D. Right now we force A..B == C..D since we
don't have a valid use case where A..B != C..D, but the interface is
built so it can be extended easily if a valid use case comes up. See
patches 3 to 8 for this work.
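As a rough illustration of the A..B == C..D restriction above, here is a
minimal userspace-style sketch. The struct and field names are made up
for illustration only and do not reflect the actual uAPI in xe_drm.h:

```c
#include <stdint.h>
#include <stdbool.h>

/* Illustrative only: loosely mirrors the idea of a vm_bind request
 * carrying both a CPU and a GPU virtual address range. */
struct sa_bind_req {
	uint64_t cpu_addr;	/* start of CPU VA range A..B */
	uint64_t gpu_addr;	/* start of GPU VA range C..D */
	uint64_t range;		/* length of both ranges */
};

/* v2 restriction: the CPU and GPU ranges must coincide (A..B == C..D).
 * Keeping both fields in the request leaves room to relax this later
 * without changing the interface. */
static bool sa_bind_req_valid(const struct sa_bind_req *req)
{
	return req->range != 0 && req->cpu_addr == req->gpu_addr;
}
```

A request with mismatched CPU/GPU ranges would simply be rejected today,
while a future extension could accept it.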

2) Unify the system allocator and userptr code. The system allocator
and userptr now share the same code for GPU page table programming, mmu
interval notifier and vma invalidation, page fault handling, and lock
design.

This work is built on top of Matt Brost's huge vm_bind refactor series.
The first patch is a squash of Matt's 30-patch series, included for
reference.

This work is still at an early stage. It is sent out so we can get some
early eyes on it. We are open to any comments and suggestions.

The work planned next in our bucket is:

*Virtual address range based memory attributes and hints: We plan to
expose a uAPI for the user to set memory attributes, such as preferred
location or migration granularity, on a virtual address range. This is
important for tuning SVM performance.

*GPU vram eviction: One key design choice of this series is that the
SVM layer allocates GPU memory directly from the drm buddy allocator,
instead of from the xe vram manager. There is no BO (buffer object)
concept in this implementation. The key benefit of this approach is
that we can easily migrate memory at page granularity. It also means
SVM bypasses TTM's memory eviction logic, but we want SVM memory and
BO driver memory to be able to mutually evict each other. We have some
proof-of-concept work reworking the TTM resource manager for this
purpose, see
https://lore.kernel.org/dri-devel/20231102043306.2931989-1-oak.zeng@intel.com/
We will continue work on that series, then implement SVM's eviction
function based on the concept of a drm LRU list shared between SVM and
the TTM/BO driver.
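To illustrate the intended mutual-eviction model, here is a toy sketch
of a shared LRU. All names are hypothetical; the real design is the
TTM resource manager rework linked above:

```c
#include <stddef.h>
#include <stdint.h>

/* Toy model of a shared LRU: SVM allocations and TTM BOs sit on one
 * list, so eviction pressure from either side can reclaim the other's
 * memory. Purely illustrative, not the actual drm LRU code. */
enum owner { OWNER_SVM, OWNER_BO };

struct lru_entry {
	enum owner owner;
	uint64_t size;
	struct lru_entry *next;	/* singly linked; head = least recent */
};

/* Evict from the head (least recently used) until 'need' bytes are
 * freed, regardless of which driver owns each entry. Returns the new
 * head; the caller reclaims the popped entries. */
static struct lru_entry *lru_evict(struct lru_entry *head, uint64_t need,
				   uint64_t *freed)
{
	*freed = 0;
	while (head && *freed < need) {
		*freed += head->size;
		head = head->next;
	}
	return head;
}
```

The point is only that eviction order is decided by recency on the
shared list, not by whether SVM or the BO driver owns the memory.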

* Try 1 vma with N PT_states for the system allocator and userptr: one
gigantic vma to hold the address space's initial default constant
state, and N PT_states to hold the mutable page table state. Also try
registering only one mmu interval notifier for the whole address range.

* Multiple GPU device support

Matthew Brost (7):
  drm/xe: Refactor vm_bind
  drm/xe: Invalidate userptr VMA on page pin fault
  drm/xe: Drop unused arguments from vm_bind_ioctl_ops_parse
  drm/xe: Fix op->tile_mask for fault mode
  drm/xe/uapi: Add DRM_XE_VM_BIND_FLAG_SYSTEM_ALLOCATOR flag
  drm/xe: Create userptr if page fault occurs on system_allocator VMA
  drm/xe: Add faulted userptr VMA garbage collector

Oak Zeng (24):
  drm/xe/svm: Add SVM document
  drm/xe: Introduce helper to populate userptr
  drm/xe: Introduce a helper to free sg table
  drm/xe: Use hmm_range_fault to populate user pages
  drm/xe/svm: Remap and provide memmap backing for GPU vram
  drm/xe/svm: Introduce DRM_XE_SVM kernel config
  drm/xe: Introduce helper to get tile from memory region
  drm/xe: Introduce a helper to get dpa from pfn
  drm/xe/svm: Get xe memory region from page
  drm/xe: Get xe_vma from xe_userptr
  drm/xe/svm: Build userptr sg table for device pages
  drm/xe/svm: Determine a vma is backed by device memory
  drm/xe: add xe lock document
  drm/xe/svm: Introduce svm migration function
  drm/xe/svm: implement functions to allocate and free device memory
  drm/xe/svm: Trace buddy block allocation and free
  drm/xe/svm: Create and destroy xe svm
  drm/xe/svm: Add vm to xe_svm process
  drm/xe: Make function lookup_vma public
  drm/xe/svm: Handle CPU page fault
  drm/xe/svm: Introduce helper to migrate vma to vram
  drm/xe/svm: trace svm migration
  drm/xe/svm: Add a helper to determine a vma is fault userptr
  drm/xe/svm: Migration from sram to vram for system allocator

 Documentation/gpu/xe/index.rst              |    2 +
 Documentation/gpu/xe/xe_lock.rst            |    8 +
 Documentation/gpu/xe/xe_svm.rst             |    8 +
 drivers/gpu/drm/xe/Kconfig                  |   22 +
 drivers/gpu/drm/xe/Makefile                 |    6 +
 drivers/gpu/drm/xe/tests/xe_migrate.c       |   86 -
 drivers/gpu/drm/xe/xe_bo.c                  |    7 +-
 drivers/gpu/drm/xe/xe_bo.h                  |    4 +-
 drivers/gpu/drm/xe/xe_device.c              |   35 +
 drivers/gpu/drm/xe/xe_device.h              |   10 +
 drivers/gpu/drm/xe/xe_device_types.h        |   24 +
 drivers/gpu/drm/xe/xe_exec.c                |   41 +-
 drivers/gpu/drm/xe/xe_exec_queue.c          |  120 +-
 drivers/gpu/drm/xe/xe_exec_queue_types.h    |   20 +-
 drivers/gpu/drm/xe/xe_gt_pagefault.c        |   52 +-
 drivers/gpu/drm/xe/xe_gt_tlb_invalidation.c |   59 +-
 drivers/gpu/drm/xe/xe_gt_tlb_invalidation.h |    3 +
 drivers/gpu/drm/xe/xe_guc_submit.c          |   22 +-
 drivers/gpu/drm/xe/xe_hmm.c                 |  329 ++++
 drivers/gpu/drm/xe/xe_hmm.h                 |   18 +
 drivers/gpu/drm/xe/xe_lock_doc.h            |  113 ++
 drivers/gpu/drm/xe/xe_migrate.c             |  602 ++++---
 drivers/gpu/drm/xe/xe_migrate.h             |   53 +-
 drivers/gpu/drm/xe/xe_mmio.c                |    6 +
 drivers/gpu/drm/xe/xe_pci.c                 |    1 +
 drivers/gpu/drm/xe/xe_pt.c                  | 1301 +++++++++-----
 drivers/gpu/drm/xe/xe_pt.h                  |   15 +-
 drivers/gpu/drm/xe/xe_pt_exec_queue.c       |  180 ++
 drivers/gpu/drm/xe/xe_pt_exec_queue.h       |   14 +
 drivers/gpu/drm/xe/xe_pt_types.h            |   53 +
 drivers/gpu/drm/xe/xe_sched_job.c           |   68 +-
 drivers/gpu/drm/xe/xe_sched_job_types.h     |   31 +-
 drivers/gpu/drm/xe/xe_svm.c                 |  122 ++
 drivers/gpu/drm/xe/xe_svm.h                 |   88 +
 drivers/gpu/drm/xe/xe_svm_devmem.c          |  231 +++
 drivers/gpu/drm/xe/xe_svm_doc.h             |  121 ++
 drivers/gpu/drm/xe/xe_svm_migrate.c         |  340 ++++
 drivers/gpu/drm/xe/xe_sync.c                |   15 +
 drivers/gpu/drm/xe/xe_sync.h                |    1 +
 drivers/gpu/drm/xe/xe_tile.c                |    7 +
 drivers/gpu/drm/xe/xe_trace.h               |   69 +-
 drivers/gpu/drm/xe/xe_uc_fw.c               |    1 +
 drivers/gpu/drm/xe/xe_vm.c                  | 1768 ++++++++++---------
 drivers/gpu/drm/xe/xe_vm.h                  |   40 +-
 drivers/gpu/drm/xe/xe_vm_types.h            |  229 ++-
 include/drm/xe_pciids.h                     |   16 +
 include/uapi/drm/xe_drm.h                   |   15 +-
 47 files changed, 4432 insertions(+), 1944 deletions(-)
 create mode 100644 Documentation/gpu/xe/xe_lock.rst
 create mode 100644 Documentation/gpu/xe/xe_svm.rst
 create mode 100644 drivers/gpu/drm/xe/xe_hmm.c
 create mode 100644 drivers/gpu/drm/xe/xe_hmm.h
 create mode 100644 drivers/gpu/drm/xe/xe_lock_doc.h
 create mode 100644 drivers/gpu/drm/xe/xe_pt_exec_queue.c
 create mode 100644 drivers/gpu/drm/xe/xe_pt_exec_queue.h
 create mode 100644 drivers/gpu/drm/xe/xe_svm.c
 create mode 100644 drivers/gpu/drm/xe/xe_svm.h
 create mode 100644 drivers/gpu/drm/xe/xe_svm_devmem.c
 create mode 100644 drivers/gpu/drm/xe/xe_svm_doc.h
 create mode 100644 drivers/gpu/drm/xe/xe_svm_migrate.c

-- 
2.26.3



* [v2 01/31] drm/xe: Refactor vm_bind
From: Oak Zeng @ 2024-04-09 20:17 UTC (permalink / raw)
  To: intel-xe
  Cc: himal.prasad.ghimiray, krishnaiah.bommu, matthew.brost,
	Thomas.Hellstrom, brian.welty

From: Matthew Brost <matthew.brost@intel.com>

drm/xe: Lock all gpuva ops during VM bind IOCTL

Lock all gpuva ops and validate all BOs in a single step during the VM
bind IOCTL. This helps with the transition to making all gpuva ops in a
VM bind IOCTL a single atomic job.

Signed-off-by: Matthew Brost <matthew.brost@intel.com>

drm/xe: Add ops_execute function which returns a fence

Add ops_execute function which returns a fence. This will be helpful to
initiate all binds (VM bind IOCTL, rebinds in exec IOCTL, rebinds in
preempt rebind worker, and rebinds in pagefaults) via a gpuva ops list.
Returning a fence is needed in various paths.

Signed-off-by: Matthew Brost <matthew.brost@intel.com>

drm/xe: Move migrate to prefetch to op_lock function

Migrates need to be done under drm exec to make lockdep happy; move
the migrate done for prefetches under the op_lock function.

Signed-off-by: Matthew Brost <matthew.brost@intel.com>

drm/xe: Add struct xe_vma_ops abstraction

Having a structure which encapsulates a list of VMA operations will help
enable 1 job for the entire list.

Signed-off-by: Matthew Brost <matthew.brost@intel.com>

drm/xe: Update xe_vm_rebind to use dummy VMA operations

All bind interfaces are transitioning to use VMA ops, update
xe_vm_rebind to use VMA ops.

Signed-off-by: Matthew Brost <matthew.brost@intel.com>

drm/xe: Simplify VM bind IOCTL error handling and cleanup

Clean up everything in VM bind IOCTL in 1 path for both errors and
non-errors. Also move VM bind IOCTL cleanup from ops (also used by
non-IOCTL binds) to the VM bind IOCTL.

Signed-off-by: Matthew Brost <matthew.brost@intel.com>

drm/xe: Update pagefaults to use dummy VMA operations

All bind interfaces are transitioning to use VMA ops, update
pagefaults to use VMA ops.

Signed-off-by: Matthew Brost <matthew.brost@intel.com>

drm/xe: s/xe_tile_migrate_engine/xe_tile_migrate_exec_queue

xe_engine is now xe_exec_queue; adjust this function's name to reflect
this.

Signed-off-by: Matthew Brost <matthew.brost@intel.com>

drm/xe: Add some members to xe_vma_ops

This will help with moving to single jobs for many bind operations.

Signed-off-by: Matthew Brost <matthew.brost@intel.com>

drm/xe: Add vm_bind_ioctl_ops_install_fences helper

Simplify VM bind code by signaling out-fences / destroying VMAs in a
single location. This will help with the transition to a single job for
many bind ops.

Signed-off-by: Matthew Brost <matthew.brost@intel.com>

drm/xe: Move setting last fence to vm_bind_ioctl_ops_install_fences

This moves setting of the last fence to a single location.

Signed-off-by: Matthew Brost <matthew.brost@intel.com>

drm/xe: Move ufence check to op_lock

Rather than checking for an unsignaled ufence at unbind time, check for
this during the op_lock function. This will help with the transition to
1 job per VM bind IOCTL.

Signed-off-by: Matthew Brost <matthew.brost@intel.com>

drm/xe: Move ufence add to vm_bind_ioctl_ops_install_fences

Rather than adding a ufence to a VMA in the bind function, add the
ufence to all VMAs in the IOCTL that require binds in
vm_bind_ioctl_ops_install_fences. This will help with the transition to
1 job per VM bind IOCTL.

Signed-off-by: Matthew Brost <matthew.brost@intel.com>

drm/xe: Add xe_gt_tlb_invalidation_range and convert PT layer to use this

xe_gt_tlb_invalidation_range accepts a start and end address rather than
a VMA. This will enable multiple VMAs to be invalidated in a single
invalidation. Update the PT layer to use this new function.

Signed-off-by: Matthew Brost <matthew.brost@intel.com>
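The idea of folding several VMAs into one range-based invalidation can
be sketched as follows. This is a hypothetical helper, not the actual
xe_gt_tlb_invalidation_range() code:

```c
#include <stdint.h>

/* Accumulated invalidation span; start == end means empty. */
struct inval_range {
	uint64_t start;
	uint64_t end;
};

/* Fold one more VMA's [start, end) span into the accumulator, so that
 * several VMAs can later be flushed with a single range-based TLB
 * invalidation instead of one invalidation per VMA. */
static void inval_range_extend(struct inval_range *r,
			       uint64_t vma_start, uint64_t vma_end)
{
	if (r->start == r->end) {	/* empty accumulator */
		r->start = vma_start;
		r->end = vma_end;
		return;
	}
	if (vma_start < r->start)
		r->start = vma_start;
	if (vma_end > r->end)
		r->end = vma_end;
}
```

The trade-off is that the merged range may cover gaps between VMAs, so
a single invalidation can flush more than strictly necessary in
exchange for fewer invalidation requests.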

drm/xe: Add xe_vm_pgtable_update_op to xe_vma_ops

This will help with the conversion to 1 job per VM bind IOCTL. Only the
allocation is implemented in this patch.

Signed-off-by: Matthew Brost <matthew.brost@intel.com>

drm/xe: Use ordered WQ for TLB invalidation fences

TLB invalidation fences need to be ordered within an exec queue; if an
unordered WQ is used, TLB invalidation fences could be reordered. Use
an ordered WQ to fix this.

Signed-off-by: Matthew Brost <matthew.brost@intel.com>

drm/xe: Delete PT update selftest

IGTs (e.g. xe_vm) can provide the exact same coverage as the PT update
selftest. The PT update selftest depends on internal functions which
can change, so maintaining this test is costly while providing no extra
coverage. Delete this test.

Signed-off-by: Matthew Brost <matthew.brost@intel.com>

drm/xe: Convert multiple bind ops into single job

This aligns with the uAPI, in which an array of binds or a single bind
that results in multiple GPUVA ops is considered a single atomic
operation.

Signed-off-by: Matthew Brost <matthew.brost@intel.com>

drm/xe: Remove old functions defs in xe_pt.h

__xe_pt_bind_vma and __xe_pt_unbind_vma are unused, remove these.

Signed-off-by: Matthew Brost <matthew.brost@intel.com>

drm/xe: Update PT layer with better error handling

Update PT layer so if a memory allocation for a PTE fails the error can
be propagated to the user without requiring to be killed.

Signed-off-by: Matthew Brost <matthew.brost@intel.com>

drm/xe: Update xe_vm_rebind to return int

Now that rebinds are installed in the kernel dma-resv slot the fence
returned from xe_vm_rebind is unused aside from error checking. Update
xe_vm_rebind to return int.

Signed-off-by: Matthew Brost <matthew.brost@intel.com>
Signed-off-by: Thomas Hellström <thomas.hellstrom@linux.intel.com>

drm/xe: Move vma rebinding to the drm_exec locking loop

Rebinding might allocate page-table bos, causing evictions.
To support blocking locking during these evictions,
perform the rebinding in the drm_exec locking loop.

Signed-off-by: Matthew Brost <matthew.brost@intel.com>
Signed-off-by: Thomas Hellström <thomas.hellstrom@linux.intel.com>

drm/xe: Update VM trace events

The trace events have changed with the move to a single job per VM bind
IOCTL; update the trace events to align with the old behavior as much
as possible.

Signed-off-by: Matthew Brost <matthew.brost@intel.com>

drm/xe: Update clear / populate arguments

This will help implement CPU binds in run_job() as 'struct
xe_migrate_pt_update' is not available at the time of run_job().

Signed-off-by: Matthew Brost <matthew.brost@intel.com>

drm/xe: Add __xe_migrate_update_pgtables_cpu helper

This will help implement CPU binds, as the submission backend can call
this helper when a bind job's dependencies are resolved.

Signed-off-by: Matthew Brost <matthew.brost@intel.com>

drm/xe: CPU binds for jobs

There is no reason to use the GPU for binds. In run_job(), use the CPU
to do binds once the bind job's dependencies are resolved.

Signed-off-by: Matthew Brost <matthew.brost@intel.com>

drm/xe: Don't use migrate exec queue for page fault binds

Now that the CPU is always used for binds, even in jobs, CPU bind jobs
can pass GPU jobs in the same exec queue, resulting in dma-fences
signaling out of order. Use a dedicated exec queue for binds issued
from page faults to avoid ordering issues.

Signed-off-by: Matthew Brost <matthew.brost@intel.com>

drm/xe: Add VM bind IOCTL error injection

Add VM bind IOCTL error injection, which steals the MSB of the bind
flags field; if that bit is set, errors are injected at various points
in the VM bind IOCTL. Intended to validate error paths. Enabled by
CONFIG_DRM_XE_DEBUG.

Signed-off-by: Matthew Brost <matthew.brost@intel.com>
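A minimal sketch of the flag-stealing scheme described above, with
made-up names (the real logic lives in the patch itself, keyed off
vm_inject_error_position under TEST_VM_OPS_ERROR):

```c
#include <stdint.h>
#include <stdbool.h>

/* Illustrative: the MSB of the 32-bit bind-flags field is repurposed
 * as an "inject error" request when debug support is compiled in. */
#define BIND_FLAG_INJECT_ERROR	(1u << 31)

struct vm_debug {
	uint8_t inject_error_position;	/* which checkpoint should fail */
};

/* Called at numbered checkpoints inside the bind path; returns true
 * when this checkpoint should simulate a failure, letting each error
 * path be exercised one at a time. */
static bool vm_should_inject_error(const struct vm_debug *dbg,
				   uint32_t bind_flags, uint8_t checkpoint)
{
	if (!(bind_flags & BIND_FLAG_INJECT_ERROR))
		return false;
	return dbg->inject_error_position == checkpoint;
}
```

Stealing a reserved flag bit keeps the uAPI unchanged for normal users
while giving tests a way to target a specific failure point.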

drm/xe/guc: Assert timed-out jobs are not from a VM exec queue

With CPU binds, jobs cannot time out; assert this is not happening.

Signed-off-by: Matthew Brost <matthew.brost@intel.com>

drm/xe: Add PT exec queues

Add PT exec queues, which are used to implement VM bind / unbind
operations. PT exec queues use a different DRM scheduler backend
(compared to the GuC / execlist submission backends) which uses the CPU
to update page tables once all dependencies for a job are resolved.

Signed-off-by: Matthew Brost <matthew.brost@intel.com>

drm/xe: Add PVC support

Add PVC PCIe IDs and GuC firmware.

Signed-off-by: Matthew Brost <matthew.brost@intel.com>
---
 drivers/gpu/drm/xe/Makefile                 |    1 +
 drivers/gpu/drm/xe/tests/xe_migrate.c       |   86 --
 drivers/gpu/drm/xe/xe_bo.c                  |    7 +-
 drivers/gpu/drm/xe/xe_bo.h                  |    4 +-
 drivers/gpu/drm/xe/xe_device.c              |   35 +
 drivers/gpu/drm/xe/xe_device.h              |    2 +
 drivers/gpu/drm/xe/xe_device_types.h        |   16 +
 drivers/gpu/drm/xe/xe_exec.c                |   41 +-
 drivers/gpu/drm/xe/xe_exec_queue.c          |  120 +-
 drivers/gpu/drm/xe/xe_exec_queue_types.h    |   20 +-
 drivers/gpu/drm/xe/xe_gt_pagefault.c        |   10 +-
 drivers/gpu/drm/xe/xe_gt_tlb_invalidation.c |   59 +-
 drivers/gpu/drm/xe/xe_gt_tlb_invalidation.h |    3 +
 drivers/gpu/drm/xe/xe_guc_submit.c          |   22 +-
 drivers/gpu/drm/xe/xe_migrate.c             |  385 ++----
 drivers/gpu/drm/xe/xe_migrate.h             |   46 +-
 drivers/gpu/drm/xe/xe_pci.c                 |    1 +
 drivers/gpu/drm/xe/xe_pt.c                  | 1242 ++++++++++++-------
 drivers/gpu/drm/xe/xe_pt.h                  |   15 +-
 drivers/gpu/drm/xe/xe_pt_exec_queue.c       |  180 +++
 drivers/gpu/drm/xe/xe_pt_exec_queue.h       |   14 +
 drivers/gpu/drm/xe/xe_pt_types.h            |   53 +
 drivers/gpu/drm/xe/xe_sched_job.c           |   68 +-
 drivers/gpu/drm/xe/xe_sched_job_types.h     |   31 +-
 drivers/gpu/drm/xe/xe_sync.c                |   15 +
 drivers/gpu/drm/xe/xe_sync.h                |    1 +
 drivers/gpu/drm/xe/xe_trace.h               |   21 +-
 drivers/gpu/drm/xe/xe_uc_fw.c               |    1 +
 drivers/gpu/drm/xe/xe_vm.c                  | 1124 ++++++++---------
 drivers/gpu/drm/xe/xe_vm.h                  |    9 +-
 drivers/gpu/drm/xe/xe_vm_types.h            |  198 +--
 include/drm/xe_pciids.h                     |   16 +
 32 files changed, 2142 insertions(+), 1704 deletions(-)
 create mode 100644 drivers/gpu/drm/xe/xe_pt_exec_queue.c
 create mode 100644 drivers/gpu/drm/xe/xe_pt_exec_queue.h

diff --git a/drivers/gpu/drm/xe/Makefile b/drivers/gpu/drm/xe/Makefile
index 3c3e67885559..bf43a3690e13 100644
--- a/drivers/gpu/drm/xe/Makefile
+++ b/drivers/gpu/drm/xe/Makefile
@@ -118,6 +118,7 @@ xe-y += xe_bb.o \
 	xe_pm.o \
 	xe_preempt_fence.o \
 	xe_pt.o \
+	xe_pt_exec_queue.o \
 	xe_pt_walk.o \
 	xe_query.o \
 	xe_range_fence.o \
diff --git a/drivers/gpu/drm/xe/tests/xe_migrate.c b/drivers/gpu/drm/xe/tests/xe_migrate.c
index ce531498f57f..de2c1b7ec371 100644
--- a/drivers/gpu/drm/xe/tests/xe_migrate.c
+++ b/drivers/gpu/drm/xe/tests/xe_migrate.c
@@ -62,36 +62,6 @@ static int run_sanity_job(struct xe_migrate *m, struct xe_device *xe,
 	return 0;
 }
 
-static void
-sanity_populate_cb(struct xe_migrate_pt_update *pt_update,
-		   struct xe_tile *tile, struct iosys_map *map, void *dst,
-		   u32 qword_ofs, u32 num_qwords,
-		   const struct xe_vm_pgtable_update *update)
-{
-	struct migrate_test_params *p =
-		to_migrate_test_params(xe_cur_kunit_priv(XE_TEST_LIVE_MIGRATE));
-	int i;
-	u64 *ptr = dst;
-	u64 value;
-
-	for (i = 0; i < num_qwords; i++) {
-		value = (qword_ofs + i - update->ofs) * 0x1111111111111111ULL;
-		if (map)
-			xe_map_wr(tile_to_xe(tile), map, (qword_ofs + i) *
-				  sizeof(u64), u64, value);
-		else
-			ptr[i] = value;
-	}
-
-	kunit_info(xe_cur_kunit(), "Used %s.\n", map ? "CPU" : "GPU");
-	if (p->force_gpu && map)
-		KUNIT_FAIL(xe_cur_kunit(), "GPU pagetable update used CPU.\n");
-}
-
-static const struct xe_migrate_pt_update_ops sanity_ops = {
-	.populate = sanity_populate_cb,
-};
-
 #define check(_retval, _expected, str, _test)				\
 	do { if ((_retval) != (_expected)) {				\
 			KUNIT_FAIL(_test, "Sanity check failed: " str	\
@@ -209,57 +179,6 @@ static void test_copy_vram(struct xe_migrate *m, struct xe_bo *bo,
 	test_copy(m, bo, test, region);
 }
 
-static void test_pt_update(struct xe_migrate *m, struct xe_bo *pt,
-			   struct kunit *test, bool force_gpu)
-{
-	struct xe_device *xe = tile_to_xe(m->tile);
-	struct dma_fence *fence;
-	u64 retval, expected;
-	ktime_t then, now;
-	int i;
-
-	struct xe_vm_pgtable_update update = {
-		.ofs = 1,
-		.qwords = 0x10,
-		.pt_bo = pt,
-	};
-	struct xe_migrate_pt_update pt_update = {
-		.ops = &sanity_ops,
-	};
-	struct migrate_test_params p = {
-		.base.id = XE_TEST_LIVE_MIGRATE,
-		.force_gpu = force_gpu,
-	};
-
-	test->priv = &p;
-	/* Test xe_migrate_update_pgtables() updates the pagetable as expected */
-	expected = 0xf0f0f0f0f0f0f0f0ULL;
-	xe_map_memset(xe, &pt->vmap, 0, (u8)expected, pt->size);
-
-	then = ktime_get();
-	fence = xe_migrate_update_pgtables(m, m->q->vm, NULL, m->q, &update, 1,
-					   NULL, 0, &pt_update);
-	now = ktime_get();
-	if (sanity_fence_failed(xe, fence, "Migration pagetable update", test))
-		return;
-
-	kunit_info(test, "Updating without syncing took %llu us,\n",
-		   (unsigned long long)ktime_to_us(ktime_sub(now, then)));
-
-	dma_fence_put(fence);
-	retval = xe_map_rd(xe, &pt->vmap, 0, u64);
-	check(retval, expected, "PTE[0] must stay untouched", test);
-
-	for (i = 0; i < update.qwords; i++) {
-		retval = xe_map_rd(xe, &pt->vmap, (update.ofs + i) * 8, u64);
-		check(retval, i * 0x1111111111111111ULL, "PTE update", test);
-	}
-
-	retval = xe_map_rd(xe, &pt->vmap, 8 * (update.ofs + update.qwords),
-			   u64);
-	check(retval, expected, "PTE[0x11] must stay untouched", test);
-}
-
 static void xe_migrate_sanity_test(struct xe_migrate *m, struct kunit *test)
 {
 	struct xe_tile *tile = m->tile;
@@ -398,11 +317,6 @@ static void xe_migrate_sanity_test(struct xe_migrate *m, struct kunit *test)
 		test_copy_vram(m, big, test);
 	}
 
-	kunit_info(test, "Testing page table update using CPU if GPU idle.\n");
-	test_pt_update(m, pt, test, false);
-	kunit_info(test, "Testing page table update using GPU\n");
-	test_pt_update(m, pt, test, true);
-
 out:
 	xe_bb_free(bb, NULL);
 free_tiny:
diff --git a/drivers/gpu/drm/xe/xe_bo.c b/drivers/gpu/drm/xe/xe_bo.c
index b89ac6db68a1..7a90d269d4dd 100644
--- a/drivers/gpu/drm/xe/xe_bo.c
+++ b/drivers/gpu/drm/xe/xe_bo.c
@@ -2265,16 +2265,16 @@ void __xe_bo_release_dummy(struct kref *kref)
 
 /**
  * xe_bo_put_commit() - Put bos whose put was deferred by xe_bo_put_deferred().
+ * @xe: Xe device
  * @deferred: The lockless list used for the call to xe_bo_put_deferred().
  *
  * Puts all bos whose put was deferred by xe_bo_put_deferred().
  * The @deferred list can be either an onstack local list or a global
  * shared list used by a workqueue.
  */
-void xe_bo_put_commit(struct llist_head *deferred)
+void xe_bo_put_commit(struct xe_device *xe, struct llist_head *deferred)
 {
 	struct llist_node *freed;
-	struct xe_bo *bo, *next;
 
 	if (!deferred)
 		return;
@@ -2283,8 +2283,7 @@ void xe_bo_put_commit(struct llist_head *deferred)
 	if (!freed)
 		return;
 
-	llist_for_each_entry_safe(bo, next, freed, freed)
-		drm_gem_object_free(&bo->ttm.base.refcount);
+	xe_device_put_deferred(xe, freed);
 }
 
 /**
diff --git a/drivers/gpu/drm/xe/xe_bo.h b/drivers/gpu/drm/xe/xe_bo.h
index c59ad15961ce..10b2b14b4c0d 100644
--- a/drivers/gpu/drm/xe/xe_bo.h
+++ b/drivers/gpu/drm/xe/xe_bo.h
@@ -10,7 +10,6 @@
 
 #include "xe_bo_types.h"
 #include "xe_macros.h"
-#include "xe_vm_types.h"
 #include "xe_vm.h"
 
 /**
@@ -309,10 +308,11 @@ xe_bo_put_deferred(struct xe_bo *bo, struct llist_head *deferred)
 	if (!kref_put(&bo->ttm.base.refcount, __xe_bo_release_dummy))
 		return false;
 
+	xe_vm_get(bo->vm);
 	return llist_add(&bo->freed, deferred);
 }
 
-void xe_bo_put_commit(struct llist_head *deferred);
+void xe_bo_put_commit(struct xe_device *xe, struct llist_head *deferred);
 
 struct sg_table *xe_bo_sg(struct xe_bo *bo);
 
diff --git a/drivers/gpu/drm/xe/xe_device.c b/drivers/gpu/drm/xe/xe_device.c
index 919ad88f0495..80628bdcfd48 100644
--- a/drivers/gpu/drm/xe/xe_device.c
+++ b/drivers/gpu/drm/xe/xe_device.c
@@ -226,6 +226,9 @@ static void xe_device_destroy(struct drm_device *dev, void *dummy)
 {
 	struct xe_device *xe = to_xe_device(dev);
 
+	flush_work(&xe->mem.deferred_work);
+	xe_assert(xe, !llist_del_all(&xe->mem.deferred));
+
 	if (xe->ordered_wq)
 		destroy_workqueue(xe->ordered_wq);
 
@@ -235,6 +238,35 @@ static void xe_device_destroy(struct drm_device *dev, void *dummy)
 	ttm_device_fini(&xe->ttm);
 }
 
+void xe_device_put_deferred(struct xe_device *xe, struct llist_node *deferred)
+{
+	struct xe_bo *bo, *next;
+
+	llist_for_each_entry_safe(bo, next, deferred, freed) {
+		init_llist_node(&bo->freed);
+		llist_add(&bo->freed, &xe->mem.deferred);
+	}
+	queue_work(system_wq, &xe->mem.deferred_work);
+}
+
+static void deferred_work(struct work_struct *w)
+{
+	struct xe_device *xe = container_of(w, struct xe_device,
+					    mem.deferred_work);
+	struct llist_node *freed = llist_del_all(&xe->mem.deferred);
+	struct xe_bo *bo, *next;
+
+	if (!freed)
+		return;
+
+	llist_for_each_entry_safe(bo, next, freed, freed) {
+		struct xe_vm *vm = bo->vm;
+
+		drm_gem_object_free(&bo->ttm.base.refcount);
+		xe_vm_put(vm);
+	}
+}
+
 struct xe_device *xe_device_create(struct pci_dev *pdev,
 				   const struct pci_device_id *ent)
 {
@@ -299,6 +331,9 @@ struct xe_device *xe_device_create(struct pci_dev *pdev,
 		goto err;
 	}
 
+	init_llist_head(&xe->mem.deferred);
+	INIT_WORK(&xe->mem.deferred_work, deferred_work);
+
 	err = xe_display_create(xe);
 	if (WARN_ON(err))
 		goto err;
diff --git a/drivers/gpu/drm/xe/xe_device.h b/drivers/gpu/drm/xe/xe_device.h
index 14be34d9f543..74eb9833d4d8 100644
--- a/drivers/gpu/drm/xe/xe_device.h
+++ b/drivers/gpu/drm/xe/xe_device.h
@@ -176,4 +176,6 @@ void xe_device_snapshot_print(struct xe_device *xe, struct drm_printer *p);
 u64 xe_device_canonicalize_addr(struct xe_device *xe, u64 address);
 u64 xe_device_uncanonicalize_addr(struct xe_device *xe, u64 address);
 
+void xe_device_put_deferred(struct xe_device *xe, struct llist_node *deferred);
+
 #endif
diff --git a/drivers/gpu/drm/xe/xe_device_types.h b/drivers/gpu/drm/xe/xe_device_types.h
index 9785eef2e5a4..e73b9a086718 100644
--- a/drivers/gpu/drm/xe/xe_device_types.h
+++ b/drivers/gpu/drm/xe/xe_device_types.h
@@ -22,6 +22,10 @@
 #include "xe_sriov_types.h"
 #include "xe_step_types.h"
 
+#if IS_ENABLED(CONFIG_DRM_XE_DEBUG)
+#define TEST_VM_OPS_ERROR
+#endif
+
 #if IS_ENABLED(CONFIG_DRM_XE_DISPLAY)
 #include "soc/intel_pch.h"
 #include "intel_display_core.h"
@@ -315,6 +319,10 @@ struct xe_device {
 		struct xe_mem_region vram;
 		/** @mem.sys_mgr: system TTM manager */
 		struct ttm_resource_manager sys_mgr;
+		/** @mem.deferred: deferred list to destroy PT entries */
+		struct llist_head deferred;
+		/** @mem.deferred_work: worker to destroy PT entries */
+		struct work_struct deferred_work;
 	} mem;
 
 	/** @sriov: device level virtualization data */
@@ -455,6 +463,14 @@ struct xe_device {
 	/** @needs_flr_on_fini: requests function-reset on fini */
 	bool needs_flr_on_fini;
 
+#ifdef TEST_VM_OPS_ERROR
+	/**
+	 * @vm_inject_error_position: inject errors at different places in VM
+	 * bind IOCTL based on this value
+	 */
+	u8 vm_inject_error_position;
+#endif
+
 	/* private: */
 
 #if IS_ENABLED(CONFIG_DRM_XE_DISPLAY)
diff --git a/drivers/gpu/drm/xe/xe_exec.c b/drivers/gpu/drm/xe/xe_exec.c
index 952496c6260d..64dc412f84a6 100644
--- a/drivers/gpu/drm/xe/xe_exec.c
+++ b/drivers/gpu/drm/xe/xe_exec.c
@@ -135,6 +135,10 @@ static int xe_exec_fn(struct drm_gpuvm_exec *vm_exec)
 			return ret;
 	}
 
+	ret = xe_vm_rebind(vm, false);
+	if (ret)
+		return ret;
+
 	return 0;
 }
 
@@ -152,7 +156,6 @@ int xe_exec_ioctl(struct drm_device *dev, void *data, struct drm_file *file)
 	struct drm_exec *exec = &vm_exec.exec;
 	u32 i, num_syncs = 0, num_ufence = 0;
 	struct xe_sched_job *job;
-	struct dma_fence *rebind_fence;
 	struct xe_vm *vm;
 	bool write_locked, skip_retry = false;
 	ktime_t end = 0;
@@ -167,7 +170,7 @@ int xe_exec_ioctl(struct drm_device *dev, void *data, struct drm_file *file)
 	if (XE_IOCTL_DBG(xe, !q))
 		return -ENOENT;
 
-	if (XE_IOCTL_DBG(xe, q->flags & EXEC_QUEUE_FLAG_VM))
+	if (XE_IOCTL_DBG(xe, q->flags & EXEC_QUEUE_FLAG_PT))
 		return -EINVAL;
 
 	if (XE_IOCTL_DBG(xe, args->num_batch_buffer &&
@@ -285,39 +288,7 @@ int xe_exec_ioctl(struct drm_device *dev, void *data, struct drm_file *file)
 		goto err_exec;
 	}
 
-	/*
-	 * Rebind any invalidated userptr or evicted BOs in the VM, non-compute
-	 * VM mode only.
-	 */
-	rebind_fence = xe_vm_rebind(vm, false);
-	if (IS_ERR(rebind_fence)) {
-		err = PTR_ERR(rebind_fence);
-		goto err_put_job;
-	}
-
-	/*
-	 * We store the rebind_fence in the VM so subsequent execs don't get
-	 * scheduled before the rebinds of userptrs / evicted BOs is complete.
-	 */
-	if (rebind_fence) {
-		dma_fence_put(vm->rebind_fence);
-		vm->rebind_fence = rebind_fence;
-	}
-	if (vm->rebind_fence) {
-		if (test_bit(DMA_FENCE_FLAG_SIGNALED_BIT,
-			     &vm->rebind_fence->flags)) {
-			dma_fence_put(vm->rebind_fence);
-			vm->rebind_fence = NULL;
-		} else {
-			dma_fence_get(vm->rebind_fence);
-			err = drm_sched_job_add_dependency(&job->drm,
-							   vm->rebind_fence);
-			if (err)
-				goto err_put_job;
-		}
-	}
-
-	/* Wait behind munmap style rebinds */
+	/* Wait for rebinds */
 	if (!xe_vm_in_lr_mode(vm)) {
 		err = drm_sched_job_add_resv_dependencies(&job->drm,
 							  xe_vm_resv(vm),
diff --git a/drivers/gpu/drm/xe/xe_exec_queue.c b/drivers/gpu/drm/xe/xe_exec_queue.c
index 6a83bc57826a..149b6ffcda6e 100644
--- a/drivers/gpu/drm/xe/xe_exec_queue.c
+++ b/drivers/gpu/drm/xe/xe_exec_queue.c
@@ -19,6 +19,7 @@
 #include "xe_macros.h"
 #include "xe_migrate.h"
 #include "xe_pm.h"
+#include "xe_pt_exec_queue.h"
 #include "xe_ring_ops_types.h"
 #include "xe_trace.h"
 #include "xe_vm.h"
@@ -43,6 +44,8 @@ static struct xe_exec_queue *__xe_exec_queue_alloc(struct xe_device *xe,
 	struct xe_gt *gt = hwe->gt;
 	int err;
 
+	xe_assert(xe, !(flags & EXEC_QUEUE_FLAG_PT));
+
 	/* only kernel queues can be permanent */
 	XE_WARN_ON((flags & EXEC_QUEUE_FLAG_PERMANENT) && !(flags & EXEC_QUEUE_FLAG_KERNEL));
 
@@ -53,6 +56,7 @@ static struct xe_exec_queue *__xe_exec_queue_alloc(struct xe_device *xe,
 	kref_init(&q->refcount);
 	q->flags = flags;
 	q->hwe = hwe;
+	q->xe = xe;
 	q->gt = gt;
 	q->class = hwe->class;
 	q->width = width;
@@ -61,7 +65,6 @@ static struct xe_exec_queue *__xe_exec_queue_alloc(struct xe_device *xe,
 	q->ring_ops = gt->ring_ops[hwe->class];
 	q->ops = gt->exec_queue_ops;
 	INIT_LIST_HEAD(&q->compute.link);
-	INIT_LIST_HEAD(&q->multi_gt_link);
 
 	q->sched_props.timeslice_us = hwe->eclass->sched_props.timeslice_us;
 	q->sched_props.preempt_timeout_us =
@@ -106,7 +109,7 @@ static void __xe_exec_queue_free(struct xe_exec_queue *q)
 
 static int __xe_exec_queue_init(struct xe_exec_queue *q)
 {
-	struct xe_device *xe = gt_to_xe(q->gt);
+	struct xe_device *xe = q->xe;
 	int i, err;
 
 	for (i = 0; i < q->width; ++i) {
@@ -127,7 +130,7 @@ static int __xe_exec_queue_init(struct xe_exec_queue *q)
 	 * can perform GuC CT actions when needed. Caller is expected to have
 	 * already grabbed the rpm ref outside any sensitive locks.
 	 */
-	if (!(q->flags & EXEC_QUEUE_FLAG_PERMANENT) && (q->flags & EXEC_QUEUE_FLAG_VM || !q->vm))
+	if (!(q->flags & EXEC_QUEUE_FLAG_PERMANENT) && !q->vm)
 		drm_WARN_ON(&xe->drm, !xe_device_mem_access_get_if_ongoing(xe));
 
 	return 0;
@@ -198,15 +201,8 @@ struct xe_exec_queue *xe_exec_queue_create_class(struct xe_device *xe, struct xe
 void xe_exec_queue_destroy(struct kref *ref)
 {
 	struct xe_exec_queue *q = container_of(ref, struct xe_exec_queue, refcount);
-	struct xe_exec_queue *eq, *next;
 
 	xe_exec_queue_last_fence_put_unlocked(q);
-	if (!(q->flags & EXEC_QUEUE_FLAG_BIND_ENGINE_CHILD)) {
-		list_for_each_entry_safe(eq, next, &q->multi_gt_list,
-					 multi_gt_link)
-			xe_exec_queue_put(eq);
-	}
-
 	q->ops->fini(q);
 }
 
@@ -216,7 +212,7 @@ void xe_exec_queue_fini(struct xe_exec_queue *q)
 
 	for (i = 0; i < q->width; ++i)
 		xe_lrc_finish(q->lrc + i);
-	if (!(q->flags & EXEC_QUEUE_FLAG_PERMANENT) && (q->flags & EXEC_QUEUE_FLAG_VM || !q->vm))
+	if (q->gt && !(q->flags & EXEC_QUEUE_FLAG_PERMANENT) && !q->vm)
 		xe_device_mem_access_put(gt_to_xe(q->gt));
 	__xe_exec_queue_free(q);
 }
@@ -454,35 +450,6 @@ find_hw_engine(struct xe_device *xe,
 			       eci.engine_instance, true);
 }
 
-static u32 bind_exec_queue_logical_mask(struct xe_device *xe, struct xe_gt *gt,
-					struct drm_xe_engine_class_instance *eci,
-					u16 width, u16 num_placements)
-{
-	struct xe_hw_engine *hwe;
-	enum xe_hw_engine_id id;
-	u32 logical_mask = 0;
-
-	if (XE_IOCTL_DBG(xe, width != 1))
-		return 0;
-	if (XE_IOCTL_DBG(xe, num_placements != 1))
-		return 0;
-	if (XE_IOCTL_DBG(xe, eci[0].engine_instance != 0))
-		return 0;
-
-	eci[0].engine_class = DRM_XE_ENGINE_CLASS_COPY;
-
-	for_each_hw_engine(hwe, gt, id) {
-		if (xe_hw_engine_is_reserved(hwe))
-			continue;
-
-		if (hwe->class ==
-		    user_to_xe_engine_class[DRM_XE_ENGINE_CLASS_COPY])
-			logical_mask |= BIT(hwe->logical_instance);
-	}
-
-	return logical_mask;
-}
-
 static u32 calc_validate_logical_mask(struct xe_device *xe, struct xe_gt *gt,
 				      struct drm_xe_engine_class_instance *eci,
 				      u16 width, u16 num_placements)
@@ -544,7 +511,7 @@ int xe_exec_queue_create_ioctl(struct drm_device *dev, void *data,
 	struct drm_xe_engine_class_instance __user *user_eci =
 		u64_to_user_ptr(args->instances);
 	struct xe_hw_engine *hwe;
-	struct xe_vm *vm, *migrate_vm;
+	struct xe_vm *vm;
 	struct xe_gt *gt;
 	struct xe_exec_queue *q = NULL;
 	u32 logical_mask;
@@ -570,48 +537,15 @@ int xe_exec_queue_create_ioctl(struct drm_device *dev, void *data,
 		return -EINVAL;
 
 	if (eci[0].engine_class == DRM_XE_ENGINE_CLASS_VM_BIND) {
-		for_each_gt(gt, xe, id) {
-			struct xe_exec_queue *new;
-			u32 flags;
-
-			if (xe_gt_is_media_type(gt))
-				continue;
-
-			eci[0].gt_id = gt->info.id;
-			logical_mask = bind_exec_queue_logical_mask(xe, gt, eci,
-								    args->width,
-								    args->num_placements);
-			if (XE_IOCTL_DBG(xe, !logical_mask))
-				return -EINVAL;
+		if (XE_IOCTL_DBG(xe, args->extensions))
+			return -EINVAL;
 
-			hwe = find_hw_engine(xe, eci[0]);
-			if (XE_IOCTL_DBG(xe, !hwe))
-				return -EINVAL;
-
-			/* The migration vm doesn't hold rpm ref */
-			xe_device_mem_access_get(xe);
-
-			flags = EXEC_QUEUE_FLAG_VM | (id ? EXEC_QUEUE_FLAG_BIND_ENGINE_CHILD : 0);
-
-			migrate_vm = xe_migrate_get_vm(gt_to_tile(gt)->migrate);
-			new = xe_exec_queue_create(xe, migrate_vm, logical_mask,
-						   args->width, hwe, flags,
-						   args->extensions);
-
-			xe_device_mem_access_put(xe); /* now held by engine */
-
-			xe_vm_put(migrate_vm);
-			if (IS_ERR(new)) {
-				err = PTR_ERR(new);
-				if (q)
-					goto put_exec_queue;
-				return err;
-			}
-			if (id == 0)
-				q = new;
-			else
-				list_add_tail(&new->multi_gt_list,
-					      &q->multi_gt_link);
+		xe_device_mem_access_get(xe);
+		q = xe_pt_exec_queue_create(xe);
+		xe_device_mem_access_put(xe); /* now held by exec queue */
+		if (IS_ERR(q)) {
+			err = PTR_ERR(q);
+			return err;
 		}
 	} else {
 		gt = xe_device_get_gt(xe, eci[0].gt_id);
@@ -714,8 +648,7 @@ int xe_exec_queue_get_property_ioctl(struct drm_device *dev, void *data,
  */
 bool xe_exec_queue_is_lr(struct xe_exec_queue *q)
 {
-	return q->vm && xe_vm_in_lr_mode(q->vm) &&
-		!(q->flags & EXEC_QUEUE_FLAG_VM);
+	return q->vm && xe_vm_in_lr_mode(q->vm);
 }
 
 static s32 xe_exec_queue_num_job_inflight(struct xe_exec_queue *q)
@@ -753,6 +686,12 @@ bool xe_exec_queue_ring_full(struct xe_exec_queue *q)
  */
 bool xe_exec_queue_is_idle(struct xe_exec_queue *q)
 {
+	if (q->flags & EXEC_QUEUE_FLAG_PT) {
+		struct dma_fence *fence = q->last_fence ?: dma_fence_get_stub();
+
+		return test_bit(DMA_FENCE_FLAG_SIGNALED_BIT, &fence->flags);
+	}
+
 	if (xe_exec_queue_is_parallel(q)) {
 		int i;
 
@@ -771,16 +710,9 @@ bool xe_exec_queue_is_idle(struct xe_exec_queue *q)
 
 void xe_exec_queue_kill(struct xe_exec_queue *q)
 {
-	struct xe_exec_queue *eq = q, *next;
-
-	list_for_each_entry_safe(eq, next, &eq->multi_gt_list,
-				 multi_gt_link) {
-		q->ops->kill(eq);
-		xe_vm_remove_compute_exec_queue(q->vm, eq);
-	}
-
 	q->ops->kill(q);
-	xe_vm_remove_compute_exec_queue(q->vm, q);
+	if (q->vm)
+		xe_vm_remove_compute_exec_queue(q->vm, q);
 }
 
 int xe_exec_queue_destroy_ioctl(struct drm_device *dev, void *data,
@@ -812,7 +744,7 @@ int xe_exec_queue_destroy_ioctl(struct drm_device *dev, void *data,
 static void xe_exec_queue_last_fence_lockdep_assert(struct xe_exec_queue *q,
 						    struct xe_vm *vm)
 {
-	if (q->flags & EXEC_QUEUE_FLAG_VM)
+	if (q->flags & EXEC_QUEUE_FLAG_PT)
 		lockdep_assert_held(&vm->lock);
 	else
 		xe_vm_assert_held(vm);
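The simplified xe_exec_queue_destroy() above is a textbook kref release callback: the final put invokes the release function, which recovers the outer object via container_of(). A minimal userspace sketch of that pattern (stand-in types, not the driver's; the kernel's kref is atomic, this one is not):

```c
#include <assert.h>
#include <stddef.h>

/* Userspace stand-ins for the kernel's kref machinery, illustrating the
 * container_of() release pattern used by xe_exec_queue_destroy(). */
struct kref { int refcount; };

#define container_of(ptr, type, member) \
	((type *)((char *)(ptr) - offsetof(type, member)))

struct fake_exec_queue {
	int id;
	struct kref refcount;
};

static int released;

/* Release callback: recover the outer struct from the embedded kref.
 * The assert assumes the demo object below uses id == 42. */
static void fake_destroy(struct kref *ref)
{
	struct fake_exec_queue *q =
		container_of(ref, struct fake_exec_queue, refcount);

	assert(q->id == 42);	/* proves we got the outer struct back */
	released = 1;
}

/* Non-atomic stand-in; the kernel's kref_put() decrements atomically. */
static void kref_put(struct kref *ref, void (*release)(struct kref *))
{
	if (--ref->refcount == 0)
		release(ref);
}
```

With the multi_gt list gone, the destroy path is exactly this: the last reference dropped runs the release, nothing else to walk.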
diff --git a/drivers/gpu/drm/xe/xe_exec_queue_types.h b/drivers/gpu/drm/xe/xe_exec_queue_types.h
index 62b3d9d1d7cd..3a2dcaed561f 100644
--- a/drivers/gpu/drm/xe/xe_exec_queue_types.h
+++ b/drivers/gpu/drm/xe/xe_exec_queue_types.h
@@ -19,6 +19,7 @@ struct xe_execlist_exec_queue;
 struct xe_gt;
 struct xe_guc_exec_queue;
 struct xe_hw_engine;
+struct xe_pt_exec_queue;
 struct xe_vm;
 
 enum xe_exec_queue_priority {
@@ -38,6 +39,8 @@ enum xe_exec_queue_priority {
  * a kernel object.
  */
 struct xe_exec_queue {
+	/** @xe: Xe device */
+	struct xe_device *xe;
 	/** @gt: graphics tile this exec queue can submit to */
 	struct xe_gt *gt;
 	/**
@@ -78,12 +81,10 @@ struct xe_exec_queue {
 #define EXEC_QUEUE_FLAG_PERMANENT		BIT(2)
 /* queue keeps running pending jobs after destroy ioctl */
 #define EXEC_QUEUE_FLAG_PERSISTENT		BIT(3)
-/* for VM jobs. Caller needs to hold rpm ref when creating queue with this flag */
-#define EXEC_QUEUE_FLAG_VM			BIT(4)
-/* child of VM queue for multi-tile VM jobs */
-#define EXEC_QUEUE_FLAG_BIND_ENGINE_CHILD	BIT(5)
+/* for PT jobs. Caller needs to hold rpm ref when creating queue with this flag */
+#define EXEC_QUEUE_FLAG_PT			BIT(4)
 /* kernel exec_queue only, set priority to highest level */
-#define EXEC_QUEUE_FLAG_HIGH_PRIORITY		BIT(6)
+#define EXEC_QUEUE_FLAG_HIGH_PRIORITY		BIT(5)
 
 	/**
 	 * @flags: flags for this exec queue, should statically setup aside from ban
@@ -91,18 +92,13 @@ struct xe_exec_queue {
 	 */
 	unsigned long flags;
 
-	union {
-		/** @multi_gt_list: list head for VM bind engines if multi-GT */
-		struct list_head multi_gt_list;
-		/** @multi_gt_link: link for VM bind engines if multi-GT */
-		struct list_head multi_gt_link;
-	};
-
 	union {
 		/** @execlist: execlist backend specific state for exec queue */
 		struct xe_execlist_exec_queue *execlist;
 		/** @guc: GuC backend specific state for exec queue */
 		struct xe_guc_exec_queue *guc;
+		/** @pt: PT backend specific state for exec queue */
+		struct xe_pt_exec_queue *pt;
 	};
 
 	/**
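With EXEC_QUEUE_FLAG_VM and EXEC_QUEUE_FLAG_BIND_ENGINE_CHILD dropped, the remaining flag bits are repacked: the new EXEC_QUEUE_FLAG_PT takes BIT(4) and HIGH_PRIORITY moves down to BIT(5). A tiny sanity sketch of the new layout (values mirror the header above; the BIT() macro is a local stand-in):

```c
#include <assert.h>

/* Repacked exec-queue flags after this patch (mirrors the header above). */
#define BIT(n) (1UL << (n))
#define EXEC_QUEUE_FLAG_PERMANENT	BIT(2)
#define EXEC_QUEUE_FLAG_PERSISTENT	BIT(3)
#define EXEC_QUEUE_FLAG_PT		BIT(4)	/* reuses the old _VM bit */
#define EXEC_QUEUE_FLAG_HIGH_PRIORITY	BIT(5)	/* was BIT(6) */

/* Four distinct bits set iff no two flags collide. */
static int flags_disjoint(void)
{
	unsigned long all = EXEC_QUEUE_FLAG_PERMANENT |
			    EXEC_QUEUE_FLAG_PERSISTENT |
			    EXEC_QUEUE_FLAG_PT |
			    EXEC_QUEUE_FLAG_HIGH_PRIORITY;

	return __builtin_popcountl(all) == 4;
}
```

Note that EXEC_QUEUE_FLAG_PT reuses the bit the old VM flag occupied, so any out-of-tree user still testing EXEC_QUEUE_FLAG_VM would silently match PT queues.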
diff --git a/drivers/gpu/drm/xe/xe_gt_pagefault.c b/drivers/gpu/drm/xe/xe_gt_pagefault.c
index 73c535193a98..e4f5a80a46fc 100644
--- a/drivers/gpu/drm/xe/xe_gt_pagefault.c
+++ b/drivers/gpu/drm/xe/xe_gt_pagefault.c
@@ -19,7 +19,6 @@
 #include "xe_guc.h"
 #include "xe_guc_ct.h"
 #include "xe_migrate.h"
-#include "xe_pt.h"
 #include "xe_trace.h"
 #include "xe_vm.h"
 
@@ -209,8 +208,13 @@ static int handle_pagefault(struct xe_gt *gt, struct pagefault *pf)
 
 	/* Bind VMA only to the GT that has faulted */
 	trace_xe_vma_pf_bind(vma);
-	fence = __xe_pt_bind_vma(tile, vma, xe_tile_migrate_engine(tile), NULL, 0,
-				 vma->tile_present & BIT(tile->id));
+	ret = xe_vm_populate_dummy_rebind(vm, vma, BIT(tile->id));
+	if (ret)
+		goto unlock_dma_resv;
+	vm->dummy_ops.vops.pt_update_ops[tile->id].q =
+		xe_tile_migrate_bind_exec_queue(tile);
+	fence = xe_vm_ops_execute(vm, &vm->dummy_ops.vops);
+	xe_vma_ops_free(&vm->dummy_ops.vops);
 	if (IS_ERR(fence)) {
 		ret = PTR_ERR(fence);
 		goto unlock_dma_resv;
diff --git a/drivers/gpu/drm/xe/xe_gt_tlb_invalidation.c b/drivers/gpu/drm/xe/xe_gt_tlb_invalidation.c
index a3c4ffba679d..ac2bf86de39a 100644
--- a/drivers/gpu/drm/xe/xe_gt_tlb_invalidation.c
+++ b/drivers/gpu/drm/xe/xe_gt_tlb_invalidation.c
@@ -264,11 +264,15 @@ int xe_gt_tlb_invalidation_ggtt(struct xe_gt *gt)
 }
 
 /**
- * xe_gt_tlb_invalidation_vma - Issue a TLB invalidation on this GT for a VMA
+ * xe_gt_tlb_invalidation_range - Issue a TLB invalidation on this GT for an
+ * address range
+ *
  * @gt: graphics tile
  * @fence: invalidation fence which will be signal on TLB invalidation
  * completion, can be NULL
- * @vma: VMA to invalidate
+ * @start: start address (inclusive)
+ * @end: end address (exclusive)
+ * @asid: address space ID
  *
  * Issue a range based TLB invalidation if supported, if not fallback to a full
  * TLB invalidation. Completion of TLB is asynchronous and caller can either use
@@ -278,17 +282,15 @@ int xe_gt_tlb_invalidation_ggtt(struct xe_gt *gt)
  * Return: Seqno which can be passed to xe_gt_tlb_invalidation_wait on success,
  * negative error code on error.
  */
-int xe_gt_tlb_invalidation_vma(struct xe_gt *gt,
-			       struct xe_gt_tlb_invalidation_fence *fence,
-			       struct xe_vma *vma)
+int xe_gt_tlb_invalidation_range(struct xe_gt *gt,
+				 struct xe_gt_tlb_invalidation_fence *fence,
+				 u64 start, u64 end, u32 asid)
 {
 	struct xe_device *xe = gt_to_xe(gt);
 #define MAX_TLB_INVALIDATION_LEN	7
 	u32 action[MAX_TLB_INVALIDATION_LEN];
 	int len = 0;
 
-	xe_gt_assert(gt, vma);
-
 	/* Execlists not supported */
 	if (gt_to_xe(gt)->info.force_execlist) {
 		if (fence)
@@ -302,8 +304,8 @@ int xe_gt_tlb_invalidation_vma(struct xe_gt *gt,
 	if (!xe->info.has_range_tlb_invalidation) {
 		action[len++] = MAKE_INVAL_OP(XE_GUC_TLB_INVAL_FULL);
 	} else {
-		u64 start = xe_vma_start(vma);
-		u64 length = xe_vma_size(vma);
+		u64 orig_start = start;
+		u64 length = end - start;
 		u64 align, end;
 
 		if (length < SZ_4K)
@@ -316,12 +318,12 @@ int xe_gt_tlb_invalidation_vma(struct xe_gt *gt,
 		 * address mask covering the required range.
 		 */
 		align = roundup_pow_of_two(length);
-		start = ALIGN_DOWN(xe_vma_start(vma), align);
-		end = ALIGN(xe_vma_end(vma), align);
+		start = ALIGN_DOWN(start, align);
+		end = ALIGN(end, align);
 		length = align;
 		while (start + length < end) {
 			length <<= 1;
-			start = ALIGN_DOWN(xe_vma_start(vma), length);
+			start = ALIGN_DOWN(orig_start, length);
 		}
 
 		/*
@@ -330,16 +332,17 @@ int xe_gt_tlb_invalidation_vma(struct xe_gt *gt,
 		 */
 		if (length >= SZ_2M) {
 			length = max_t(u64, SZ_16M, length);
-			start = ALIGN_DOWN(xe_vma_start(vma), length);
+			start = ALIGN_DOWN(orig_start, length);
 		}
 
 		xe_gt_assert(gt, length >= SZ_4K);
 		xe_gt_assert(gt, is_power_of_2(length));
-		xe_gt_assert(gt, !(length & GENMASK(ilog2(SZ_16M) - 1, ilog2(SZ_2M) + 1)));
+		xe_gt_assert(gt, !(length & GENMASK(ilog2(SZ_16M) - 1,
+						    ilog2(SZ_2M) + 1)));
 		xe_gt_assert(gt, IS_ALIGNED(start, length));
 
 		action[len++] = MAKE_INVAL_OP(XE_GUC_TLB_INVAL_PAGE_SELECTIVE);
-		action[len++] = xe_vma_vm(vma)->usm.asid;
+		action[len++] = asid;
 		action[len++] = lower_32_bits(start);
 		action[len++] = upper_32_bits(start);
 		action[len++] = ilog2(length) - ilog2(SZ_4K);
@@ -350,6 +353,32 @@ int xe_gt_tlb_invalidation_vma(struct xe_gt *gt,
 	return send_tlb_invalidation(&gt->uc.guc, fence, action, len);
 }
 
+/**
+ * xe_gt_tlb_invalidation_vma - Issue a TLB invalidation on this GT for a VMA
+ * @gt: graphics tile
+ * @fence: invalidation fence which will be signaled on TLB invalidation
+ * completion; can be NULL
+ * @vma: VMA to invalidate
+ *
+ * Issue a range-based TLB invalidation if supported; if not, fall back to a
+ * full TLB invalidation. TLB invalidation completion is asynchronous and the
+ * caller can either use the invalidation fence or seqno +
+ * xe_gt_tlb_invalidation_wait to wait for completion.
+ *
+ * Return: Seqno which can be passed to xe_gt_tlb_invalidation_wait on success,
+ * negative error code on error.
+ */
+int xe_gt_tlb_invalidation_vma(struct xe_gt *gt,
+			       struct xe_gt_tlb_invalidation_fence *fence,
+			       struct xe_vma *vma)
+{
+	xe_gt_assert(gt, vma);
+
+	return xe_gt_tlb_invalidation_range(gt, fence, xe_vma_start(vma),
+					    xe_vma_end(vma),
+					    xe_vma_vm(vma)->usm.asid);
+}
+
 /**
  * xe_gt_tlb_invalidation_wait - Wait for TLB to complete
  * @gt: graphics tile
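The selective-invalidation math in xe_gt_tlb_invalidation_range() grows a power-of-two length until an aligned window covers the whole requested range. A standalone sketch of just that window computation (hypothetical helper with userspace types; the SZ_2M → SZ_16M bump from the real function is omitted for brevity):

```c
#include <assert.h>
#include <stdint.h>

/* Round v up to the next power of two (v > 0). */
static uint64_t roundup_pow2(uint64_t v)
{
	uint64_t r = 1;

	while (r < v)
		r <<= 1;
	return r;
}

/*
 * Compute a power-of-two-sized, size-aligned window
 * [*out_start, *out_start + *out_len) covering [start, end), mirroring
 * the selective-invalidation math above: double the length and re-align
 * the start until the window reaches the aligned end.
 */
static void inval_window(uint64_t start, uint64_t end,
			 uint64_t *out_start, uint64_t *out_len)
{
	uint64_t orig_start = start;
	uint64_t length = end - start;
	uint64_t align, aend;

	if (length < 4096)		/* SZ_4K floor, as in the driver */
		length = 4096;

	align = roundup_pow2(length);
	start = start & ~(align - 1);			/* ALIGN_DOWN */
	aend = (end + align - 1) & ~(align - 1);	/* ALIGN */
	length = align;
	while (start + length < aend) {
		length <<= 1;
		start = orig_start & ~(length - 1);
	}

	*out_start = start;
	*out_len = length;
}
```

For example, a range straddling an alignment boundary ([0x3000, 0x5000)) cannot be covered by an aligned 8K window, so the loop widens it until one window suffices.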
diff --git a/drivers/gpu/drm/xe/xe_gt_tlb_invalidation.h b/drivers/gpu/drm/xe/xe_gt_tlb_invalidation.h
index fbb743d80d2c..bf3bebd9f985 100644
--- a/drivers/gpu/drm/xe/xe_gt_tlb_invalidation.h
+++ b/drivers/gpu/drm/xe/xe_gt_tlb_invalidation.h
@@ -20,6 +20,9 @@ int xe_gt_tlb_invalidation_ggtt(struct xe_gt *gt);
 int xe_gt_tlb_invalidation_vma(struct xe_gt *gt,
 			       struct xe_gt_tlb_invalidation_fence *fence,
 			       struct xe_vma *vma);
+int xe_gt_tlb_invalidation_range(struct xe_gt *gt,
+				 struct xe_gt_tlb_invalidation_fence *fence,
+				 u64 start, u64 end, u32 asid);
 int xe_gt_tlb_invalidation_wait(struct xe_gt *gt, int seqno);
 int xe_guc_tlb_invalidation_done_handler(struct xe_guc *guc, u32 *msg, u32 len);
 
diff --git a/drivers/gpu/drm/xe/xe_guc_submit.c b/drivers/gpu/drm/xe/xe_guc_submit.c
index 19efdb2f881f..83dc799589db 100644
--- a/drivers/gpu/drm/xe/xe_guc_submit.c
+++ b/drivers/gpu/drm/xe/xe_guc_submit.c
@@ -17,6 +17,7 @@
 #include "abi/guc_klvs_abi.h"
 #include "regs/xe_lrc_layout.h"
 #include "xe_assert.h"
+#include "xe_bo.h"
 #include "xe_devcoredump.h"
 #include "xe_device.h"
 #include "xe_exec_queue.h"
@@ -719,6 +720,11 @@ static void submit_exec_queue(struct xe_exec_queue *q)
 	}
 }
 
+static bool is_pt_job(struct xe_sched_job *job)
+{
+	return test_bit(JOB_FLAG_PT, &job->fence->flags);
+}
+
 static struct dma_fence *
 guc_exec_queue_run_job(struct drm_sched_job *drm_job)
 {
@@ -728,6 +734,8 @@ guc_exec_queue_run_job(struct drm_sched_job *drm_job)
 	struct xe_device *xe = guc_to_xe(guc);
 	bool lr = xe_exec_queue_is_lr(q);
 
+	xe_assert(xe, !is_pt_job(job));
+	xe_assert(xe, !(q->flags & EXEC_QUEUE_FLAG_PT));
 	xe_assert(xe, !(exec_queue_destroyed(q) || exec_queue_pending_disable(q)) ||
 		  exec_queue_banned(q) || exec_queue_suspended(q));
 
@@ -929,6 +937,8 @@ guc_exec_queue_timedout_job(struct drm_sched_job *drm_job)
 	int err = -ETIME;
 	int i = 0;
 
+	xe_assert(xe, !(q->flags & EXEC_QUEUE_FLAG_PT));
+
 	/*
 	 * TDR has fired before free job worker. Common if exec queue
 	 * immediately closed after last fence signaled.
@@ -943,8 +953,6 @@ guc_exec_queue_timedout_job(struct drm_sched_job *drm_job)
 		   xe_sched_job_seqno(job), q->guc->id, q->flags);
 	xe_gt_WARN(q->gt, q->flags & EXEC_QUEUE_FLAG_KERNEL,
 		   "Kernel-submitted job timed out\n");
-	xe_gt_WARN(q->gt, q->flags & EXEC_QUEUE_FLAG_VM && !exec_queue_killed(q),
-		   "VM job timed out on non-killed execqueue\n");
 
 	simple_error_capture(q);
 	xe_devcoredump(job);
@@ -958,8 +966,7 @@ guc_exec_queue_timedout_job(struct drm_sched_job *drm_job)
-	 * Kernel jobs should never fail, nor should VM jobs if they do
-	 * somethings has gone wrong and the GT needs a reset
+	 * Kernel jobs should never fail; if one does, something has gone
+	 * wrong and the GT needs a reset
 	 */
-	if (q->flags & EXEC_QUEUE_FLAG_KERNEL ||
-	    (q->flags & EXEC_QUEUE_FLAG_VM && !exec_queue_killed(q))) {
+	if (q->flags & EXEC_QUEUE_FLAG_KERNEL) {
 		if (!xe_sched_invalidate_job(job, 2)) {
 			xe_sched_add_pending_job(sched, job);
 			xe_sched_submission_start(sched);
@@ -1439,11 +1446,10 @@ static void guc_exec_queue_stop(struct xe_guc *guc, struct xe_exec_queue *q)
 	trace_xe_exec_queue_stop(q);
 
 	/*
-	 * Ban any engine (aside from kernel and engines used for VM ops) with a
-	 * started but not complete job or if a job has gone through a GT reset
-	 * more than twice.
+	 * Ban any engine (aside from kernel) with a started but not complete
+	 * job or if a job has gone through a GT reset more than twice.
 	 */
-	if (!(q->flags & (EXEC_QUEUE_FLAG_KERNEL | EXEC_QUEUE_FLAG_VM))) {
+	if (!(q->flags & EXEC_QUEUE_FLAG_KERNEL)) {
 		struct xe_sched_job *job = xe_sched_first_pending_job(sched);
 
 		if (job) {
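is_pt_job() keys off a private JOB_FLAG_PT bit stored in the dma_fence's flags word, which lets guc_exec_queue_run_job() assert that PT jobs never reach the GuC backend. A plain-C sketch of the set_bit()/test_bit() idiom it relies on (non-atomic stand-ins; the kernel versions are atomic, and the bit index here is hypothetical):

```c
#include <assert.h>

#define JOB_FLAG_PT_SKETCH 3	/* bit index; the real value differs */

static void set_bit_sketch(int nr, unsigned long *addr)
{
	*addr |= 1UL << nr;
}

static int test_bit_sketch(int nr, const unsigned long *addr)
{
	return (*addr >> nr) & 1;
}

/* Stand-in for a dma_fence: only the flags word matters here. */
struct fake_fence { unsigned long flags; };

/* Mirrors is_pt_job(): a job is a PT job iff its fence carries the bit. */
static int is_pt_job_sketch(const struct fake_fence *f)
{
	return test_bit_sketch(JOB_FLAG_PT_SKETCH, &f->flags);
}
```

Tagging the fence rather than the job lets later consumers that only see the fence make the same distinction.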
diff --git a/drivers/gpu/drm/xe/xe_migrate.c b/drivers/gpu/drm/xe/xe_migrate.c
index ee1bb938c493..82b63bdb9c47 100644
--- a/drivers/gpu/drm/xe/xe_migrate.c
+++ b/drivers/gpu/drm/xe/xe_migrate.c
@@ -28,6 +28,7 @@
 #include "xe_map.h"
 #include "xe_mocs.h"
 #include "xe_pt.h"
+#include "xe_pt_exec_queue.h"
 #include "xe_res_cursor.h"
 #include "xe_sched_job.h"
 #include "xe_sync.h"
@@ -41,6 +42,8 @@
 struct xe_migrate {
 	/** @q: Default exec queue used for migration */
 	struct xe_exec_queue *q;
+	/** @bind_q: Default exec queue used for binds */
+	struct xe_exec_queue *bind_q;
 	/** @tile: Backpointer to the tile this struct xe_migrate belongs to. */
 	struct xe_tile *tile;
 	/** @job_mutex: Timeline mutex for @eng. */
@@ -84,19 +87,24 @@ struct xe_migrate {
 #define MAX_PTE_PER_SDI 0x1FE
 
 /**
- * xe_tile_migrate_engine() - Get this tile's migrate engine.
+ * xe_tile_migrate_exec_queue() - Get this tile's migrate exec queue.
  * @tile: The tile.
  *
- * Returns the default migrate engine of this tile.
+ * Returns the default migrate exec queue of this tile.
  * TODO: Perhaps this function is slightly misplaced, and even unneeded?
  *
- * Return: The default migrate engine
+ * Return: The default migrate exec queue
  */
-struct xe_exec_queue *xe_tile_migrate_engine(struct xe_tile *tile)
+struct xe_exec_queue *xe_tile_migrate_exec_queue(struct xe_tile *tile)
 {
 	return tile->migrate->q;
 }
 
+struct xe_exec_queue *xe_tile_migrate_bind_exec_queue(struct xe_tile *tile)
+{
+	return tile->migrate->bind_q;
+}
+
 static void xe_migrate_fini(struct drm_device *dev, void *arg)
 {
 	struct xe_migrate *m = arg;
@@ -111,6 +119,8 @@ static void xe_migrate_fini(struct drm_device *dev, void *arg)
 	mutex_destroy(&m->job_mutex);
 	xe_vm_close_and_put(m->q->vm);
 	xe_exec_queue_put(m->q);
+	if (m->bind_q)
+		xe_exec_queue_put(m->bind_q);
 }
 
 static u64 xe_migrate_vm_addr(u64 slot, u32 level)
@@ -368,6 +378,12 @@ struct xe_migrate *xe_migrate_init(struct xe_tile *tile)
 		if (!hwe || !logical_mask)
 			return ERR_PTR(-EINVAL);
 
+		m->bind_q = xe_pt_exec_queue_create(xe);
+		if (IS_ERR(m->bind_q)) {
+			xe_vm_close_and_put(vm);
+			return ERR_CAST(m->bind_q);
+		}
+
 		m->q = xe_exec_queue_create(xe, vm, logical_mask, 1, hwe,
 					    EXEC_QUEUE_FLAG_KERNEL |
 					    EXEC_QUEUE_FLAG_PERMANENT |
@@ -379,6 +395,8 @@ struct xe_migrate *xe_migrate_init(struct xe_tile *tile)
 						  EXEC_QUEUE_FLAG_PERMANENT);
 	}
 	if (IS_ERR(m->q)) {
+		if (m->bind_q)
+			xe_exec_queue_put(m->bind_q);
 		xe_vm_close_and_put(vm);
 		return ERR_CAST(m->q);
 	}
@@ -1105,50 +1123,6 @@ struct dma_fence *xe_migrate_clear(struct xe_migrate *m,
 	return fence;
 }
 
-static void write_pgtable(struct xe_tile *tile, struct xe_bb *bb, u64 ppgtt_ofs,
-			  const struct xe_vm_pgtable_update *update,
-			  struct xe_migrate_pt_update *pt_update)
-{
-	const struct xe_migrate_pt_update_ops *ops = pt_update->ops;
-	u32 chunk;
-	u32 ofs = update->ofs, size = update->qwords;
-
-	/*
-	 * If we have 512 entries (max), we would populate it ourselves,
-	 * and update the PDE above it to the new pointer.
-	 * The only time this can only happen if we have to update the top
-	 * PDE. This requires a BO that is almost vm->size big.
-	 *
-	 * This shouldn't be possible in practice.. might change when 16K
-	 * pages are used. Hence the assert.
-	 */
-	xe_tile_assert(tile, update->qwords < MAX_NUM_PTE);
-	if (!ppgtt_ofs)
-		ppgtt_ofs = xe_migrate_vram_ofs(tile_to_xe(tile),
-						xe_bo_addr(update->pt_bo, 0,
-							   XE_PAGE_SIZE));
-
-	do {
-		u64 addr = ppgtt_ofs + ofs * 8;
-
-		chunk = min(size, MAX_PTE_PER_SDI);
-
-		/* Ensure populatefn can do memset64 by aligning bb->cs */
-		if (!(bb->len & 1))
-			bb->cs[bb->len++] = MI_NOOP;
-
-		bb->cs[bb->len++] = MI_STORE_DATA_IMM | MI_SDI_NUM_QW(chunk);
-		bb->cs[bb->len++] = lower_32_bits(addr);
-		bb->cs[bb->len++] = upper_32_bits(addr);
-		ops->populate(pt_update, tile, NULL, bb->cs + bb->len, ofs, chunk,
-			      update);
-
-		bb->len += chunk * 2;
-		ofs += chunk;
-		size -= chunk;
-	} while (size);
-}
-
 struct xe_vm *xe_migrate_get_vm(struct xe_migrate *m)
 {
 	return xe_vm_get(m->q->vm);
@@ -1164,289 +1138,152 @@ struct migrate_test_params {
 	container_of(_priv, struct migrate_test_params, base)
 #endif
 
+void __xe_migrate_update_pgtables_cpu(struct xe_vm *vm, struct xe_tile *tile,
+				      const struct xe_migrate_pt_update_ops *ops,
+				      struct xe_vm_pgtable_update_op *pt_op,
+				      int num_ops)
+{
+	u32 j, i;
+
+	for (j = 0; j < num_ops; ++j, ++pt_op) {
+		for (i = 0; i < pt_op->num_entries; i++) {
+			const struct xe_vm_pgtable_update *update =
+				&pt_op->entries[i];
+
+			if (pt_op->bind)
+				ops->populate(tile, &update->pt_bo->vmap,
+					      NULL, update->ofs, update->qwords,
+					      update);
+			else
+				ops->clear(vm, tile, &update->pt_bo->vmap,
+					   NULL, update->ofs, update->qwords,
+					   update);
+		}
+	}
+
+	trace_xe_vm_cpu_bind(vm);
+	xe_device_wmb(vm->xe);
+}
+
 static struct dma_fence *
 xe_migrate_update_pgtables_cpu(struct xe_migrate *m,
-			       struct xe_vm *vm, struct xe_bo *bo,
-			       const struct  xe_vm_pgtable_update *updates,
-			       u32 num_updates, bool wait_vm,
 			       struct xe_migrate_pt_update *pt_update)
 {
 	XE_TEST_DECLARE(struct migrate_test_params *test =
 			to_migrate_test_params
 			(xe_cur_kunit_priv(XE_TEST_LIVE_MIGRATE));)
 	const struct xe_migrate_pt_update_ops *ops = pt_update->ops;
-	struct dma_fence *fence;
+	struct xe_vm *vm = pt_update->vops->vm;
+	struct xe_vm_pgtable_update_ops *pt_update_ops =
+		&pt_update->vops->pt_update_ops[pt_update->tile_id];
 	int err;
-	u32 i;
 
 	if (XE_TEST_ONLY(test && test->force_gpu))
 		return ERR_PTR(-ETIME);
 
-	if (bo && !dma_resv_test_signaled(bo->ttm.base.resv,
-					  DMA_RESV_USAGE_KERNEL))
-		return ERR_PTR(-ETIME);
-
-	if (wait_vm && !dma_resv_test_signaled(xe_vm_resv(vm),
-					       DMA_RESV_USAGE_BOOKKEEP))
-		return ERR_PTR(-ETIME);
-
 	if (ops->pre_commit) {
 		pt_update->job = NULL;
 		err = ops->pre_commit(pt_update);
 		if (err)
 			return ERR_PTR(err);
 	}
-	for (i = 0; i < num_updates; i++) {
-		const struct xe_vm_pgtable_update *update = &updates[i];
-
-		ops->populate(pt_update, m->tile, &update->pt_bo->vmap, NULL,
-			      update->ofs, update->qwords, update);
-	}
-
-	if (vm) {
-		trace_xe_vm_cpu_bind(vm);
-		xe_device_wmb(vm->xe);
-	}
-
-	fence = dma_fence_get_stub();
-
-	return fence;
-}
-
-static bool no_in_syncs(struct xe_vm *vm, struct xe_exec_queue *q,
-			struct xe_sync_entry *syncs, u32 num_syncs)
-{
-	struct dma_fence *fence;
-	int i;
-
-	for (i = 0; i < num_syncs; i++) {
-		fence = syncs[i].fence;
 
-		if (fence && !test_bit(DMA_FENCE_FLAG_SIGNALED_BIT,
-				       &fence->flags))
-			return false;
-	}
-	if (q) {
-		fence = xe_exec_queue_last_fence_get(q, vm);
-		if (!test_bit(DMA_FENCE_FLAG_SIGNALED_BIT, &fence->flags)) {
-			dma_fence_put(fence);
-			return false;
-		}
-		dma_fence_put(fence);
-	}
+	__xe_migrate_update_pgtables_cpu(vm, m->tile, ops,
+					 pt_update_ops->ops,
+					 pt_update_ops->num_ops);
 
-	return true;
+	return dma_fence_get_stub();
 }
 
-/**
- * xe_migrate_update_pgtables() - Pipelined page-table update
- * @m: The migrate context.
- * @vm: The vm we'll be updating.
- * @bo: The bo whose dma-resv we will await before updating, or NULL if userptr.
- * @q: The exec queue to be used for the update or NULL if the default
- * migration engine is to be used.
- * @updates: An array of update descriptors.
- * @num_updates: Number of descriptors in @updates.
- * @syncs: Array of xe_sync_entry to await before updating. Note that waits
- * will block the engine timeline.
- * @num_syncs: Number of entries in @syncs.
- * @pt_update: Pointer to a struct xe_migrate_pt_update, which contains
- * pointers to callback functions and, if subclassed, private arguments to
- * those.
- *
- * Perform a pipelined page-table update. The update descriptors are typically
- * built under the same lock critical section as a call to this function. If
- * using the default engine for the updates, they will be performed in the
- * order they grab the job_mutex. If different engines are used, external
- * synchronization is needed for overlapping updates to maintain page-table
- * consistency. Note that the meaing of "overlapping" is that the updates
- * touch the same page-table, which might be a higher-level page-directory.
- * If no pipelining is needed, then updates may be performed by the cpu.
- *
- * Return: A dma_fence that, when signaled, indicates the update completion.
- */
-struct dma_fence *
-xe_migrate_update_pgtables(struct xe_migrate *m,
-			   struct xe_vm *vm,
-			   struct xe_bo *bo,
-			   struct xe_exec_queue *q,
-			   const struct xe_vm_pgtable_update *updates,
-			   u32 num_updates,
-			   struct xe_sync_entry *syncs, u32 num_syncs,
-			   struct xe_migrate_pt_update *pt_update)
+static struct dma_fence *
+__xe_migrate_update_pgtables(struct xe_migrate *m,
+			     struct xe_migrate_pt_update *pt_update,
+			     struct xe_vm_pgtable_update_ops *pt_update_ops)
 {
 	const struct xe_migrate_pt_update_ops *ops = pt_update->ops;
 	struct xe_tile *tile = m->tile;
-	struct xe_gt *gt = tile->primary_gt;
-	struct xe_device *xe = tile_to_xe(tile);
 	struct xe_sched_job *job;
 	struct dma_fence *fence;
-	struct drm_suballoc *sa_bo = NULL;
-	struct xe_vma *vma = pt_update->vma;
-	struct xe_bb *bb;
-	u32 i, batch_size, ppgtt_ofs, update_idx, page_ofs = 0;
-	u64 addr;
-	int err = 0;
-	bool usm = !q && xe->info.has_usm;
-	bool first_munmap_rebind = vma &&
-		vma->gpuva.flags & XE_VMA_FIRST_REBIND;
-	struct xe_exec_queue *q_override = !q ? m->q : q;
-	u16 pat_index = xe->pat.idx[XE_CACHE_WB];
-
-	/* Use the CPU if no in syncs and engine is idle */
-	if (no_in_syncs(vm, q, syncs, num_syncs) && xe_exec_queue_is_idle(q_override)) {
-		fence =  xe_migrate_update_pgtables_cpu(m, vm, bo, updates,
-							num_updates,
-							first_munmap_rebind,
-							pt_update);
-		if (!IS_ERR(fence) || fence == ERR_PTR(-EAGAIN))
-			return fence;
-	}
-
-	/* fixed + PTE entries */
-	if (IS_DGFX(xe))
-		batch_size = 2;
-	else
-		batch_size = 6 + num_updates * 2;
-
-	for (i = 0; i < num_updates; i++) {
-		u32 num_cmds = DIV_ROUND_UP(updates[i].qwords, MAX_PTE_PER_SDI);
-
-		/* align noop + MI_STORE_DATA_IMM cmd prefix */
-		batch_size += 4 * num_cmds + updates[i].qwords * 2;
-	}
-
-	/*
-	 * XXX: Create temp bo to copy from, if batch_size becomes too big?
-	 *
-	 * Worst case: Sum(2 * (each lower level page size) + (top level page size))
-	 * Should be reasonably bound..
-	 */
-	xe_tile_assert(tile, batch_size < SZ_128K);
-
-	bb = xe_bb_new(gt, batch_size, !q && xe->info.has_usm);
-	if (IS_ERR(bb))
-		return ERR_CAST(bb);
-
-	/* For sysmem PTE's, need to map them in our hole.. */
-	if (!IS_DGFX(xe)) {
-		ppgtt_ofs = NUM_KERNEL_PDE - 1;
-		if (q) {
-			xe_tile_assert(tile, num_updates <= NUM_VMUSA_WRITES_PER_UNIT);
-
-			sa_bo = drm_suballoc_new(&m->vm_update_sa, 1,
-						 GFP_KERNEL, true, 0);
-			if (IS_ERR(sa_bo)) {
-				err = PTR_ERR(sa_bo);
-				goto err;
-			}
-
-			ppgtt_ofs = NUM_KERNEL_PDE +
-				(drm_suballoc_soffset(sa_bo) /
-				 NUM_VMUSA_UNIT_PER_PAGE);
-			page_ofs = (drm_suballoc_soffset(sa_bo) %
-				    NUM_VMUSA_UNIT_PER_PAGE) *
-				VM_SA_UPDATE_UNIT_SIZE;
-		}
-
-		/* Map our PT's to gtt */
-		bb->cs[bb->len++] = MI_STORE_DATA_IMM | MI_SDI_NUM_QW(num_updates);
-		bb->cs[bb->len++] = ppgtt_ofs * XE_PAGE_SIZE + page_ofs;
-		bb->cs[bb->len++] = 0; /* upper_32_bits */
-
-		for (i = 0; i < num_updates; i++) {
-			struct xe_bo *pt_bo = updates[i].pt_bo;
-
-			xe_tile_assert(tile, pt_bo->size == SZ_4K);
-
-			addr = vm->pt_ops->pte_encode_bo(pt_bo, 0, pat_index, 0);
-			bb->cs[bb->len++] = lower_32_bits(addr);
-			bb->cs[bb->len++] = upper_32_bits(addr);
-		}
-
-		bb->cs[bb->len++] = MI_BATCH_BUFFER_END;
-		update_idx = bb->len;
-
-		addr = xe_migrate_vm_addr(ppgtt_ofs, 0) +
-			(page_ofs / sizeof(u64)) * XE_PAGE_SIZE;
-		for (i = 0; i < num_updates; i++)
-			write_pgtable(tile, bb, addr + i * XE_PAGE_SIZE,
-				      &updates[i], pt_update);
-	} else {
-		/* phys pages, no preamble required */
-		bb->cs[bb->len++] = MI_BATCH_BUFFER_END;
-		update_idx = bb->len;
-
-		for (i = 0; i < num_updates; i++)
-			write_pgtable(tile, bb, 0, &updates[i], pt_update);
-	}
+	bool is_migrate = pt_update_ops->q == m->bind_q;
+	int err;
 
-	if (!q)
+	if (is_migrate)
 		mutex_lock(&m->job_mutex);
 
-	job = xe_bb_create_migration_job(q ?: m->q, bb,
-					 xe_migrate_batch_base(m, usm),
-					 update_idx);
+	job = xe_sched_job_create(pt_update_ops->q, NULL);
 	if (IS_ERR(job)) {
 		err = PTR_ERR(job);
 		goto err_bb;
 	}
 
-	/* Wait on BO move */
-	if (bo) {
-		err = job_add_deps(job, bo->ttm.base.resv,
-				   DMA_RESV_USAGE_KERNEL);
-		if (err)
-			goto err_job;
-	}
-
-	/*
-	 * Munmap style VM unbind, need to wait for all jobs to be complete /
-	 * trigger preempts before moving forward
-	 */
-	if (first_munmap_rebind) {
-		err = job_add_deps(job, xe_vm_resv(vm),
-				   DMA_RESV_USAGE_BOOKKEEP);
-		if (err)
-			goto err_job;
-	}
-
-	err = xe_sched_job_last_fence_add_dep(job, vm);
-	for (i = 0; !err && i < num_syncs; i++)
-		err = xe_sync_entry_add_deps(&syncs[i], job);
-
-	if (err)
-		goto err_job;
-
 	if (ops->pre_commit) {
 		pt_update->job = job;
 		err = ops->pre_commit(pt_update);
 		if (err)
 			goto err_job;
 	}
+
+	set_bit(JOB_FLAG_PT, &job->fence->flags);
+	job->pt_update[0].vm = pt_update->vops->vm;
+	job->pt_update[0].tile = tile;
+	job->pt_update[0].ops = ops;
+	job->pt_update[0].pt_op = pt_update_ops->ops;
+	job->pt_update[0].num_ops = pt_update_ops->num_ops;
+	job->pt_update[0].deferred = pt_update_ops->deferred;
+
+	/* Submission backend now owns freeing of pt_update_ops->ops */
+	init_llist_head(&pt_update_ops->deferred);
+	pt_update_ops->skip_free = true;
+
 	xe_sched_job_arm(job);
 	fence = dma_fence_get(&job->drm.s_fence->finished);
 	xe_sched_job_push(job);
 
-	if (!q)
+	if (is_migrate)
 		mutex_unlock(&m->job_mutex);
 
-	xe_bb_free(bb, fence);
-	drm_suballoc_free(sa_bo, fence);
-
 	return fence;
 
 err_job:
 	xe_sched_job_put(job);
 err_bb:
-	if (!q)
+	if (is_migrate)
 		mutex_unlock(&m->job_mutex);
-	xe_bb_free(bb, NULL);
-err:
-	drm_suballoc_free(sa_bo, NULL);
 	return ERR_PTR(err);
 }
 
+/**
+ * xe_migrate_update_pgtables() - Pipelined page-table update
+ * @m: The migrate context.
+ * @pt_update: PT update arguments
+ *
+ * Perform a pipelined page-table update. The update descriptors are typically
+ * built under the same lock critical section as a call to this function. If
+ * using the default engine for the updates, they will be performed in the
+ * order they grab the job_mutex. If different engines are used, external
+ * synchronization is needed for overlapping updates to maintain page-table
+ * consistency. Note that the meaning of "overlapping" is that the updates
+ * touch the same page-table, which might be a higher-level page-directory.
+ * If no pipelining is needed, then updates may be performed by the CPU.
+ *
+ * Return: A dma_fence that, when signaled, indicates the update completion.
+ */
+struct dma_fence *
+xe_migrate_update_pgtables(struct xe_migrate *m,
+			   struct xe_migrate_pt_update *pt_update)
+{
+	struct xe_vm_pgtable_update_ops *pt_update_ops =
+		&pt_update->vops->pt_update_ops[pt_update->tile_id];
+	struct dma_fence *fence;
+
+	fence = xe_migrate_update_pgtables_cpu(m, pt_update);
+	if (!IS_ERR(fence))
+		return fence;
+
+	return __xe_migrate_update_pgtables(m, pt_update, pt_update_ops);
+}
+
 /**
  * xe_migrate_wait() - Complete all operations using the xe_migrate context
  * @m: Migrate context to wait for.
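__xe_migrate_update_pgtables_cpu() above walks the pgtable-update ops and dispatches each to either the populate or the clear callback, then issues a write barrier. A minimal sketch of that dispatch loop with stand-in types (not the driver's; the 0xDEADBEEF "PTE" is purely illustrative):

```c
#include <assert.h>
#include <stdint.h>

/* Simplified stand-ins for xe_vm_pgtable_update_op and its entry. */
struct fake_update { uint64_t *base; uint32_t ofs; uint32_t qwords; };
struct fake_pt_op { int bind; struct fake_update entry; };

static void fake_populate(const struct fake_update *u)
{
	for (uint32_t i = 0; i < u->qwords; i++)
		u->base[u->ofs + i] = 0xDEADBEEFull;	/* pretend PTE */
}

static void fake_clear(const struct fake_update *u)
{
	for (uint32_t i = 0; i < u->qwords; i++)
		u->base[u->ofs + i] = 0;
}

/* Mirrors the bind/clear dispatch in __xe_migrate_update_pgtables_cpu(). */
static void apply_ops(struct fake_pt_op *ops, int num_ops)
{
	for (int j = 0; j < num_ops; j++) {
		const struct fake_update *u = &ops[j].entry;

		if (ops[j].bind)
			fake_populate(u);
		else
			fake_clear(u);
	}
	/* the driver follows this with xe_device_wmb() */
}
```

Since the CPU path writes PTEs directly through the BO's vmap, no batch buffer or suballocation is needed, which is why write_pgtable() and its MI_STORE_DATA_IMM bookkeeping could be deleted.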
diff --git a/drivers/gpu/drm/xe/xe_migrate.h b/drivers/gpu/drm/xe/xe_migrate.h
index 951f19318ea4..701bb27349b0 100644
--- a/drivers/gpu/drm/xe/xe_migrate.h
+++ b/drivers/gpu/drm/xe/xe_migrate.h
@@ -22,6 +22,7 @@ struct xe_pt;
 struct xe_tile;
 struct xe_vm;
 struct xe_vm_pgtable_update;
+struct xe_vm_pgtable_update_op;
 struct xe_vma;
 
 /**
@@ -31,7 +32,6 @@ struct xe_vma;
 struct xe_migrate_pt_update_ops {
 	/**
 	 * @populate: Populate a command buffer or page-table with ptes.
-	 * @pt_update: Embeddable callback argument.
 	 * @tile: The tile for the current operation.
 	 * @map: struct iosys_map into the memory to be populated.
 	 * @pos: If @map is NULL, map into the memory to be populated.
@@ -43,10 +43,27 @@ struct xe_migrate_pt_update_ops {
 	 * page-table system to populate command buffers or shared
 	 * page-tables with PTEs.
 	 */
-	void (*populate)(struct xe_migrate_pt_update *pt_update,
-			 struct xe_tile *tile, struct iosys_map *map,
+	void (*populate)(struct xe_tile *tile, struct iosys_map *map,
 			 void *pos, u32 ofs, u32 num_qwords,
 			 const struct xe_vm_pgtable_update *update);
+	/**
+	 * @clear: Clear a command buffer or page-table with ptes.
+	 * @vm: VM being updated
+	 * @tile: The tile for the current operation.
+	 * @map: struct iosys_map into the memory to be populated.
+	 * @pos: If @map is NULL, map into the memory to be populated.
+	 * @ofs: qword offset into @map, unused if @map is NULL.
+	 * @num_qwords: Number of qwords to write.
+	 * @update: Information about the PTEs to be inserted.
+	 *
+	 * This interface is intended to be used as a callback into the
+	 * page-table system to clear command buffers or shared
+	 * page-tables of PTEs.
+	 */
+	void (*clear)(struct xe_vm *vm, struct xe_tile *tile,
+		      struct iosys_map *map, void *pos, u32 ofs,
+		      u32 num_qwords,
+		      const struct xe_vm_pgtable_update *update);
 
 	/**
 	 * @pre_commit: Callback to be called just before arming the
@@ -67,14 +84,10 @@ struct xe_migrate_pt_update_ops {
 struct xe_migrate_pt_update {
 	/** @ops: Pointer to the struct xe_migrate_pt_update_ops callbacks */
 	const struct xe_migrate_pt_update_ops *ops;
-	/** @vma: The vma we're updating the pagetable for. */
-	struct xe_vma *vma;
+	/** @vops: VMA operations */
+	struct xe_vma_ops *vops;
 	/** @job: The job if a GPU page-table update. NULL otherwise */
 	struct xe_sched_job *job;
-	/** @start: Start of update for the range fence */
-	u64 start;
-	/** @last: Last of update for the range fence */
-	u64 last;
 	/** @tile_id: Tile ID of the update */
 	u8 tile_id;
 };
@@ -94,17 +107,18 @@ struct dma_fence *xe_migrate_clear(struct xe_migrate *m,
 
 struct xe_vm *xe_migrate_get_vm(struct xe_migrate *m);
 
+void __xe_migrate_update_pgtables_cpu(struct xe_vm *vm, struct xe_tile *tile,
+				      const struct xe_migrate_pt_update_ops *ops,
+				      struct xe_vm_pgtable_update_op *pt_op,
+				      int num_ops);
+
 struct dma_fence *
 xe_migrate_update_pgtables(struct xe_migrate *m,
-			   struct xe_vm *vm,
-			   struct xe_bo *bo,
-			   struct xe_exec_queue *q,
-			   const struct xe_vm_pgtable_update *updates,
-			   u32 num_updates,
-			   struct xe_sync_entry *syncs, u32 num_syncs,
 			   struct xe_migrate_pt_update *pt_update);
 
 void xe_migrate_wait(struct xe_migrate *m);
 
-struct xe_exec_queue *xe_tile_migrate_engine(struct xe_tile *tile);
+struct xe_exec_queue *xe_tile_migrate_exec_queue(struct xe_tile *tile);
+struct xe_exec_queue *xe_tile_migrate_bind_exec_queue(struct xe_tile *tile);
+
 #endif
diff --git a/drivers/gpu/drm/xe/xe_pci.c b/drivers/gpu/drm/xe/xe_pci.c
index c401d4890386..99968762306c 100644
--- a/drivers/gpu/drm/xe/xe_pci.c
+++ b/drivers/gpu/drm/xe/xe_pci.c
@@ -375,6 +375,7 @@ static const struct pci_device_id pciidlist[] = {
 	XE_DG1_IDS(INTEL_VGA_DEVICE, &dg1_desc),
 	XE_ATS_M_IDS(INTEL_VGA_DEVICE, &ats_m_desc),
 	XE_DG2_IDS(INTEL_VGA_DEVICE, &dg2_desc),
+	XE_PVC_IDS(INTEL_VGA_DEVICE, &pvc_desc),
 	XE_MTL_IDS(INTEL_VGA_DEVICE, &mtl_desc),
 	XE_LNL_IDS(INTEL_VGA_DEVICE, &lnl_desc),
 	{ }
diff --git a/drivers/gpu/drm/xe/xe_pt.c b/drivers/gpu/drm/xe/xe_pt.c
index 7f54bc3e389d..1ff01d616dac 100644
--- a/drivers/gpu/drm/xe/xe_pt.c
+++ b/drivers/gpu/drm/xe/xe_pt.c
@@ -8,12 +8,14 @@
 #include "xe_bo.h"
 #include "xe_device.h"
 #include "xe_drm_client.h"
+#include "xe_exec_queue.h"
 #include "xe_gt.h"
 #include "xe_gt_tlb_invalidation.h"
 #include "xe_migrate.h"
 #include "xe_pt_types.h"
 #include "xe_pt_walk.h"
 #include "xe_res_cursor.h"
+#include "xe_sync.h"
 #include "xe_trace.h"
 #include "xe_ttm_stolen_mgr.h"
 #include "xe_vm.h"
@@ -324,6 +326,7 @@ xe_pt_new_shared(struct xe_walk_update *wupd, struct xe_pt *parent,
 	entry->pt = parent;
 	entry->flags = 0;
 	entry->qwords = 0;
+	entry->level = parent->level;
 
 	if (alloc_entries) {
 		entry->pt_entries = kmalloc_array(XE_PDES,
@@ -791,9 +794,8 @@ bool xe_pt_zap_ptes(struct xe_tile *tile, struct xe_vma *vma)
 }
 
 static void
-xe_vm_populate_pgtable(struct xe_migrate_pt_update *pt_update, struct xe_tile *tile,
-		       struct iosys_map *map, void *data,
-		       u32 qword_ofs, u32 num_qwords,
+xe_vm_populate_pgtable(struct xe_tile *tile, struct iosys_map *map,
+		       void *data, u32 qword_ofs, u32 num_qwords,
 		       const struct xe_vm_pgtable_update *update)
 {
 	struct xe_pt_entry *ptes = update->pt_entries;
@@ -809,19 +811,27 @@ xe_vm_populate_pgtable(struct xe_migrate_pt_update *pt_update, struct xe_tile *t
 	}
 }
 
-static void xe_pt_abort_bind(struct xe_vma *vma,
-			     struct xe_vm_pgtable_update *entries,
-			     u32 num_entries)
+static void xe_pt_cancel_bind(struct xe_vma *vma,
+			      struct xe_vm_pgtable_update *entries,
+			      u32 num_entries)
 {
 	u32 i, j;
 
 	for (i = 0; i < num_entries; i++) {
-		if (!entries[i].pt_entries)
+		struct xe_pt *pt = entries[i].pt;
+
+		if (!pt)
 			continue;
 
-		for (j = 0; j < entries[i].qwords; j++)
-			xe_pt_destroy(entries[i].pt_entries[j].pt, xe_vma_vm(vma)->flags, NULL);
+		if (pt->level) {
+			for (j = 0; j < entries[i].qwords; j++)
+				xe_pt_destroy(entries[i].pt_entries[j].pt,
+					      xe_vma_vm(vma)->flags, NULL);
+		}
+
 		kfree(entries[i].pt_entries);
+		entries[i].pt_entries = NULL;
+		entries[i].qwords = 0;
 	}
 }
 
@@ -831,18 +841,15 @@ static void xe_pt_commit_locks_assert(struct xe_vma *vma)
 
 	lockdep_assert_held(&vm->lock);
 
-	if (xe_vma_is_userptr(vma))
-		lockdep_assert_held_read(&vm->userptr.notifier_lock);
-	else if (!xe_vma_is_null(vma))
+	if (!xe_vma_is_userptr(vma) && !xe_vma_is_null(vma))
 		dma_resv_assert_held(xe_vma_bo(vma)->ttm.base.resv);
 
 	xe_vm_assert_held(vm);
 }
 
-static void xe_pt_commit_bind(struct xe_vma *vma,
-			      struct xe_vm_pgtable_update *entries,
-			      u32 num_entries, bool rebind,
-			      struct llist_head *deferred)
+static void xe_pt_commit(struct xe_vma *vma,
+			 struct xe_vm_pgtable_update *entries,
+			 u32 num_entries, struct llist_head *deferred)
 {
 	u32 i, j;
 
@@ -850,31 +857,90 @@ static void xe_pt_commit_bind(struct xe_vma *vma,
 
 	for (i = 0; i < num_entries; i++) {
 		struct xe_pt *pt = entries[i].pt;
+
+		if (!pt->level)
+			continue;
+
+		for (j = 0; j < entries[i].qwords; j++) {
+			struct xe_pt *oldpte = entries[i].pt_entries[j].pt;
+
+			xe_pt_destroy(oldpte, xe_vma_vm(vma)->flags, deferred);
+		}
+	}
+}
+
+static void xe_pt_abort_bind(struct xe_vma *vma,
+			     struct xe_vm_pgtable_update *entries,
+			     u32 num_entries, bool rebind)
+{
+	int i, j;
+
+	xe_pt_commit_locks_assert(vma);
+
+	for (i = num_entries - 1; i >= 0; --i) {
+		struct xe_pt *pt = entries[i].pt;
 		struct xe_pt_dir *pt_dir;
 
 		if (!rebind)
-			pt->num_live += entries[i].qwords;
+			pt->num_live -= entries[i].qwords;
 
-		if (!pt->level) {
-			kfree(entries[i].pt_entries);
+		if (!pt->level)
 			continue;
+
+		pt_dir = as_xe_pt_dir(pt);
+		for (j = 0; j < entries[i].qwords; j++) {
+			u32 j_ = j + entries[i].ofs;
+			struct xe_pt *newpte = xe_pt_entry(pt_dir, j_);
+			struct xe_pt *oldpte = entries[i].pt_entries[j].pt;
+
+			pt_dir->children[j_] = oldpte ? &oldpte->base : NULL;
+			xe_pt_destroy(newpte, xe_vma_vm(vma)->flags, NULL);
 		}
+	}
+}
+
+static void xe_pt_commit_prepare_bind(struct xe_vma *vma,
+				      struct xe_vm_pgtable_update *entries,
+				      u32 num_entries, bool rebind)
+{
+	u32 i, j;
+
+	xe_pt_commit_locks_assert(vma);
+
+	for (i = 0; i < num_entries; i++) {
+		struct xe_pt *pt = entries[i].pt;
+		struct xe_pt_dir *pt_dir;
+
+		if (!rebind)
+			pt->num_live += entries[i].qwords;
+
+		if (!pt->level)
+			continue;
 
 		pt_dir = as_xe_pt_dir(pt);
 		for (j = 0; j < entries[i].qwords; j++) {
 			u32 j_ = j + entries[i].ofs;
 			struct xe_pt *newpte = entries[i].pt_entries[j].pt;
+			struct xe_pt *oldpte = NULL;
 
 			if (xe_pt_entry(pt_dir, j_))
-				xe_pt_destroy(xe_pt_entry(pt_dir, j_),
-					      xe_vma_vm(vma)->flags, deferred);
+				oldpte = xe_pt_entry(pt_dir, j_);
 
 			pt_dir->children[j_] = &newpte->base;
+			entries[i].pt_entries[j].pt = oldpte;
 		}
-		kfree(entries[i].pt_entries);
 	}
 }
 
+static void xe_pt_free_bind(struct xe_vm_pgtable_update *entries,
+			    u32 num_entries)
+{
+	u32 i;
+
+	for (i = 0; i < num_entries; i++)
+		kfree(entries[i].pt_entries);
+}
+
 static int
 xe_pt_prepare_bind(struct xe_tile *tile, struct xe_vma *vma,
 		   struct xe_vm_pgtable_update *entries, u32 *num_entries)
@@ -885,20 +951,19 @@ xe_pt_prepare_bind(struct xe_tile *tile, struct xe_vma *vma,
 	err = xe_pt_stage_bind(tile, vma, entries, num_entries);
 	if (!err)
 		xe_tile_assert(tile, *num_entries);
-	else /* abort! */
-		xe_pt_abort_bind(vma, entries, *num_entries);
 
 	return err;
 }
 
 static void xe_vm_dbg_print_entries(struct xe_device *xe,
 				    const struct xe_vm_pgtable_update *entries,
-				    unsigned int num_entries)
+				    unsigned int num_entries, bool bind)
 #if (IS_ENABLED(CONFIG_DRM_XE_DEBUG_VM))
 {
 	unsigned int i;
 
-	vm_dbg(&xe->drm, "%u entries to update\n", num_entries);
+	vm_dbg(&xe->drm, "%s: %u entries to update\n", bind ? "bind" : "unbind",
+	       num_entries);
 	for (i = 0; i < num_entries; i++) {
 		const struct xe_vm_pgtable_update *entry = &entries[i];
 		struct xe_pt *xe_pt = entry->pt;
@@ -919,66 +984,122 @@ static void xe_vm_dbg_print_entries(struct xe_device *xe,
 {}
 #endif
 
-#ifdef CONFIG_DRM_XE_USERPTR_INVAL_INJECT
+static int job_add_deps(struct xe_sched_job *job, struct dma_resv *resv,
+			enum dma_resv_usage usage)
+{
+	return drm_sched_job_add_resv_dependencies(&job->drm, resv, usage);
+}
 
-static int xe_pt_userptr_inject_eagain(struct xe_userptr_vma *uvma)
+static bool no_in_syncs(struct xe_sync_entry *syncs, u32 num_syncs)
 {
-	u32 divisor = uvma->userptr.divisor ? uvma->userptr.divisor : 2;
-	static u32 count;
+	int i;
 
-	if (count++ % divisor == divisor - 1) {
-		struct xe_vm *vm = xe_vma_vm(&uvma->vma);
+	for (i = 0; i < num_syncs; i++) {
+		struct dma_fence *fence = syncs[i].fence;
 
-		uvma->userptr.divisor = divisor << 1;
-		spin_lock(&vm->userptr.invalidated_lock);
-		list_move_tail(&uvma->userptr.invalidate_link,
-			       &vm->userptr.invalidated);
-		spin_unlock(&vm->userptr.invalidated_lock);
-		return true;
+		if (fence && !test_bit(DMA_FENCE_FLAG_SIGNALED_BIT,
+				       &fence->flags))
+			return false;
 	}
 
-	return false;
+	return true;
 }
 
-#else
-
-static bool xe_pt_userptr_inject_eagain(struct xe_userptr_vma *uvma)
+static int vma_add_deps(struct xe_vma *vma, struct xe_sched_job *job)
 {
-	return false;
+	struct xe_bo *bo = xe_vma_bo(vma);
+
+	xe_bo_assert_held(bo);
+
+	if (bo && !bo->vm) {
+		if (!job) {
+			if (!dma_resv_test_signaled(bo->ttm.base.resv,
+						    DMA_RESV_USAGE_KERNEL))
+				return -ETIME;
+		} else {
+			return job_add_deps(job, bo->ttm.base.resv,
+					    DMA_RESV_USAGE_KERNEL);
+		}
+	}
+
+	return 0;
 }
 
-#endif
+static int op_add_deps(struct xe_vm *vm, struct xe_vma_op *op,
+		       struct xe_sched_job *job)
+{
+	int err = 0;
 
-/**
- * struct xe_pt_migrate_pt_update - Callback argument for pre-commit callbacks
- * @base: Base we derive from.
- * @bind: Whether this is a bind or an unbind operation. A bind operation
- *        makes the pre-commit callback error with -EAGAIN if it detects a
- *        pending invalidation.
- * @locked: Whether the pre-commit callback locked the userptr notifier lock
- *          and it needs unlocking.
- */
-struct xe_pt_migrate_pt_update {
-	struct xe_migrate_pt_update base;
-	bool bind;
-	bool locked;
-};
+	switch (op->base.op) {
+	case DRM_GPUVA_OP_MAP:
+		if (!op->map.immediate && xe_vm_in_fault_mode(vm))
+			break;
+
+		err = vma_add_deps(op->map.vma, job);
+		break;
+	case DRM_GPUVA_OP_REMAP:
+		if (op->remap.prev)
+			err = vma_add_deps(op->remap.prev, job);
+		if (!err && op->remap.next)
+			err = vma_add_deps(op->remap.next, job);
+		break;
+	case DRM_GPUVA_OP_UNMAP:
+		break;
+	case DRM_GPUVA_OP_PREFETCH:
+		err = vma_add_deps(gpuva_to_vma(op->base.prefetch.va), job);
+		break;
+	default:
+		drm_warn(&vm->xe->drm, "NOT POSSIBLE");
+	}
+
+	return err;
+}
 
-/*
- * This function adds the needed dependencies to a page-table update job
- * to make sure racing jobs for separate bind engines don't race writing
- * to the same page-table range, wreaking havoc. Initially use a single
- * fence for the entire VM. An optimization would use smaller granularity.
- */
 static int xe_pt_vm_dependencies(struct xe_sched_job *job,
-				 struct xe_range_fence_tree *rftree,
-				 u64 start, u64 last)
+				 struct xe_vm *vm,
+				 struct xe_vma_ops *vops,
+				 struct xe_vm_pgtable_update_ops *pt_update_ops,
+				 struct xe_range_fence_tree *rftree)
 {
 	struct xe_range_fence *rtfence;
 	struct dma_fence *fence;
-	int err;
+	struct xe_vma_op *op;
+	int err = 0, i;
+
+	xe_vm_assert_held(vm);
 
-	rtfence = xe_range_fence_tree_first(rftree, start, last);
+	if (!job && !no_in_syncs(vops->syncs, vops->num_syncs))
+		return -ETIME;
+
+	if (!job && !xe_exec_queue_is_idle(pt_update_ops->q))
+		return -ETIME;
+
+	if (pt_update_ops->wait_vm_bookkeep) {
+		if (!job) {
+			if (!dma_resv_test_signaled(xe_vm_resv(vm),
+						    DMA_RESV_USAGE_BOOKKEEP))
+				return -ETIME;
+		} else {
+			err = job_add_deps(job, xe_vm_resv(vm),
+					   DMA_RESV_USAGE_BOOKKEEP);
+			if (err)
+				return err;
+		}
+	} else if (pt_update_ops->wait_vm_kernel) {
+		if (!job) {
+			if (!dma_resv_test_signaled(xe_vm_resv(vm),
+						    DMA_RESV_USAGE_KERNEL))
+				return -ETIME;
+		} else {
+			err = job_add_deps(job, xe_vm_resv(vm),
+					   DMA_RESV_USAGE_KERNEL);
+			if (err)
+				return err;
+		}
+	}
+
+	rtfence = xe_range_fence_tree_first(rftree, pt_update_ops->start,
+					    pt_update_ops->last);
 	while (rtfence) {
 		fence = rtfence->fence;
 
@@ -996,88 +1117,152 @@ static int xe_pt_vm_dependencies(struct xe_sched_job *job,
 				return err;
 		}
 
-		rtfence = xe_range_fence_tree_next(rtfence, start, last);
+		rtfence = xe_range_fence_tree_next(rtfence,
+						   pt_update_ops->start,
+						   pt_update_ops->last);
 	}
 
-	return 0;
+	list_for_each_entry(op, &vops->list, link) {
+		err = op_add_deps(vm, op, job);
+		if (err)
+			return err;
+	}
+
+	for (i = 0; job && !err && i < vops->num_syncs; i++)
+		err = xe_sync_entry_add_deps(&vops->syncs[i], job);
+
+	return err;
 }
 
 static int xe_pt_pre_commit(struct xe_migrate_pt_update *pt_update)
 {
-	struct xe_range_fence_tree *rftree =
-		&xe_vma_vm(pt_update->vma)->rftree[pt_update->tile_id];
+	struct xe_vma_ops *vops = pt_update->vops;
+	struct xe_vm *vm = vops->vm;
+	struct xe_range_fence_tree *rftree = &vm->rftree[pt_update->tile_id];
+	struct xe_vm_pgtable_update_ops *pt_update_ops =
+		&vops->pt_update_ops[pt_update->tile_id];
+
+	return xe_pt_vm_dependencies(pt_update->job, vm, pt_update->vops,
+				     pt_update_ops, rftree);
+}
+
+#ifdef CONFIG_DRM_XE_USERPTR_INVAL_INJECT
+
+static bool xe_pt_userptr_inject_eagain(struct xe_userptr_vma *uvma)
+{
+	u32 divisor = uvma->userptr.divisor ? uvma->userptr.divisor : 2;
+	static u32 count;
+
+	if (count++ % divisor == divisor - 1) {
+		uvma->userptr.divisor = divisor << 1;
+		return true;
+	}
 
-	return xe_pt_vm_dependencies(pt_update->job, rftree,
-				     pt_update->start, pt_update->last);
+	return false;
 }
 
-static int xe_pt_userptr_pre_commit(struct xe_migrate_pt_update *pt_update)
+#else
+
+static bool xe_pt_userptr_inject_eagain(struct xe_userptr_vma *uvma)
 {
-	struct xe_pt_migrate_pt_update *userptr_update =
-		container_of(pt_update, typeof(*userptr_update), base);
-	struct xe_userptr_vma *uvma = to_userptr_vma(pt_update->vma);
-	unsigned long notifier_seq = uvma->userptr.notifier_seq;
-	struct xe_vm *vm = xe_vma_vm(&uvma->vma);
-	int err = xe_pt_vm_dependencies(pt_update->job,
-					&vm->rftree[pt_update->tile_id],
-					pt_update->start,
-					pt_update->last);
+	return false;
+}
 
-	if (err)
-		return err;
+#endif
 
-	userptr_update->locked = false;
+static void vma_check_userptr(struct xe_vm *vm, struct xe_vma *vma)
+{
+	struct xe_userptr_vma *uvma;
+	unsigned long notifier_seq;
 
-	/*
-	 * Wait until nobody is running the invalidation notifier, and
-	 * since we're exiting the loop holding the notifier lock,
-	 * nobody can proceed invalidating either.
-	 *
-	 * Note that we don't update the vma->userptr.notifier_seq since
-	 * we don't update the userptr pages.
-	 */
-	do {
-		down_read(&vm->userptr.notifier_lock);
-		if (!mmu_interval_read_retry(&uvma->userptr.notifier,
-					     notifier_seq))
-			break;
+	lockdep_assert_held_read(&vm->userptr.notifier_lock);
 
-		up_read(&vm->userptr.notifier_lock);
+	if (!xe_vma_is_userptr(vma))
+		return;
 
-		if (userptr_update->bind)
-			return -EAGAIN;
+	uvma = to_userptr_vma(vma);
+	notifier_seq = uvma->userptr.notifier_seq;
 
-		notifier_seq = mmu_interval_read_begin(&uvma->userptr.notifier);
-	} while (true);
+	if (uvma->userptr.initial_bind || xe_vm_in_fault_mode(vm))
+		return;
 
-	/* Inject errors to test_whether they are handled correctly */
-	if (userptr_update->bind && xe_pt_userptr_inject_eagain(uvma)) {
-		up_read(&vm->userptr.notifier_lock);
-		return -EAGAIN;
+	if (!mmu_interval_read_retry(&uvma->userptr.notifier,
+				     notifier_seq) &&
+	    !xe_pt_userptr_inject_eagain(uvma))
+		return;
+
+	spin_lock(&vm->userptr.invalidated_lock);
+	list_move_tail(&uvma->userptr.invalidate_link,
+		       &vm->userptr.invalidated);
+	spin_unlock(&vm->userptr.invalidated_lock);
+
+	if (xe_vm_in_preempt_fence_mode(vm)) {
+		struct dma_resv_iter cursor;
+		struct dma_fence *fence;
+
+		dma_resv_iter_begin(&cursor, xe_vm_resv(vm),
+				    DMA_RESV_USAGE_BOOKKEEP);
+		dma_resv_for_each_fence_unlocked(&cursor, fence)
+			dma_fence_enable_sw_signaling(fence);
+		dma_resv_iter_end(&cursor);
 	}
+}
 
-	userptr_update->locked = true;
+static void op_check_userptr(struct xe_vm *vm, struct xe_vma_op *op)
+{
+	lockdep_assert_held_read(&vm->userptr.notifier_lock);
 
-	return 0;
+	switch (op->base.op) {
+	case DRM_GPUVA_OP_MAP:
+		if (!op->map.immediate && xe_vm_in_fault_mode(vm))
+			break;
+
+		vma_check_userptr(vm, op->map.vma);
+		break;
+	case DRM_GPUVA_OP_REMAP:
+		if (op->remap.prev)
+			vma_check_userptr(vm, op->remap.prev);
+		if (op->remap.next)
+			vma_check_userptr(vm, op->remap.next);
+		break;
+	case DRM_GPUVA_OP_UNMAP:
+		break;
+	case DRM_GPUVA_OP_PREFETCH:
+		vma_check_userptr(vm, gpuva_to_vma(op->base.prefetch.va));
+		break;
+	default:
+		drm_warn(&vm->xe->drm, "NOT POSSIBLE");
+	}
 }
 
-static const struct xe_migrate_pt_update_ops bind_ops = {
-	.populate = xe_vm_populate_pgtable,
-	.pre_commit = xe_pt_pre_commit,
-};
+static int xe_pt_userptr_pre_commit(struct xe_migrate_pt_update *pt_update)
+{
+	struct xe_vm *vm = pt_update->vops->vm;
+	struct xe_vma_ops *vops = pt_update->vops;
+	struct xe_vma_op *op;
+	int err;
 
-static const struct xe_migrate_pt_update_ops userptr_bind_ops = {
-	.populate = xe_vm_populate_pgtable,
-	.pre_commit = xe_pt_userptr_pre_commit,
-};
+	err = xe_pt_pre_commit(pt_update);
+	if (err)
+		return err;
+
+	down_read(&vm->userptr.notifier_lock);
+
+	list_for_each_entry(op, &vops->list, link)
+		op_check_userptr(vm, op);
+
+	return 0;
+}
 
 struct invalidation_fence {
 	struct xe_gt_tlb_invalidation_fence base;
 	struct xe_gt *gt;
-	struct xe_vma *vma;
 	struct dma_fence *fence;
 	struct dma_fence_cb cb;
 	struct work_struct work;
+	u64 start;
+	u64 end;
+	u32 asid;
 };
 
 static const char *
@@ -1105,7 +1290,7 @@ static void invalidation_fence_cb(struct dma_fence *fence,
 
 	trace_xe_gt_tlb_invalidation_fence_cb(&ifence->base);
 	if (!ifence->fence->error) {
-		queue_work(system_wq, &ifence->work);
+		queue_work(ifence->gt->ordered_wq, &ifence->work);
 	} else {
 		ifence->base.base.error = ifence->fence->error;
 		dma_fence_signal(&ifence->base.base);
@@ -1120,13 +1305,14 @@ static void invalidation_fence_work_func(struct work_struct *w)
 		container_of(w, struct invalidation_fence, work);
 
 	trace_xe_gt_tlb_invalidation_fence_work_func(&ifence->base);
-	xe_gt_tlb_invalidation_vma(ifence->gt, &ifence->base, ifence->vma);
+	xe_gt_tlb_invalidation_range(ifence->gt, &ifence->base, ifence->start,
+				     ifence->end, ifence->asid);
 }
 
 static int invalidation_fence_init(struct xe_gt *gt,
 				   struct invalidation_fence *ifence,
 				   struct dma_fence *fence,
-				   struct xe_vma *vma)
+				   u64 start, u64 end, u32 asid)
 {
 	int ret;
 
@@ -1144,7 +1330,9 @@ static int invalidation_fence_init(struct xe_gt *gt,
 	dma_fence_get(&ifence->base.base);	/* Ref for caller */
 	ifence->fence = fence;
 	ifence->gt = gt;
-	ifence->vma = vma;
+	ifence->start = start;
+	ifence->end = end;
+	ifence->asid = asid;
 
 	INIT_WORK(&ifence->work, invalidation_fence_work_func);
 	ret = dma_fence_add_callback(fence, &ifence->cb, invalidation_fence_cb);
@@ -1161,178 +1349,6 @@ static int invalidation_fence_init(struct xe_gt *gt,
 	return ret && ret != -ENOENT ? ret : 0;
 }
 
-static void xe_pt_calc_rfence_interval(struct xe_vma *vma,
-				       struct xe_pt_migrate_pt_update *update,
-				       struct xe_vm_pgtable_update *entries,
-				       u32 num_entries)
-{
-	int i, level = 0;
-
-	for (i = 0; i < num_entries; i++) {
-		const struct xe_vm_pgtable_update *entry = &entries[i];
-
-		if (entry->pt->level > level)
-			level = entry->pt->level;
-	}
-
-	/* Greedy (non-optimal) calculation but simple */
-	update->base.start = ALIGN_DOWN(xe_vma_start(vma),
-					0x1ull << xe_pt_shift(level));
-	update->base.last = ALIGN(xe_vma_end(vma),
-				  0x1ull << xe_pt_shift(level)) - 1;
-}
-
-/**
- * __xe_pt_bind_vma() - Build and connect a page-table tree for the vma
- * address range.
- * @tile: The tile to bind for.
- * @vma: The vma to bind.
- * @q: The exec_queue with which to do pipelined page-table updates.
- * @syncs: Entries to sync on before binding the built tree to the live vm tree.
- * @num_syncs: Number of @sync entries.
- * @rebind: Whether we're rebinding this vma to the same address range without
- * an unbind in-between.
- *
- * This function builds a page-table tree (see xe_pt_stage_bind() for more
- * information on page-table building), and the xe_vm_pgtable_update entries
- * abstracting the operations needed to attach it to the main vm tree. It
- * then takes the relevant locks and updates the metadata side of the main
- * vm tree and submits the operations for pipelined attachment of the
- * gpu page-table to the vm main tree, (which can be done either by the
- * cpu and the GPU).
- *
- * Return: A valid dma-fence representing the pipelined attachment operation
- * on success, an error pointer on error.
- */
-struct dma_fence *
-__xe_pt_bind_vma(struct xe_tile *tile, struct xe_vma *vma, struct xe_exec_queue *q,
-		 struct xe_sync_entry *syncs, u32 num_syncs,
-		 bool rebind)
-{
-	struct xe_vm_pgtable_update entries[XE_VM_MAX_LEVEL * 2 + 1];
-	struct xe_pt_migrate_pt_update bind_pt_update = {
-		.base = {
-			.ops = xe_vma_is_userptr(vma) ? &userptr_bind_ops : &bind_ops,
-			.vma = vma,
-			.tile_id = tile->id,
-		},
-		.bind = true,
-	};
-	struct xe_vm *vm = xe_vma_vm(vma);
-	u32 num_entries;
-	struct dma_fence *fence;
-	struct invalidation_fence *ifence = NULL;
-	struct xe_range_fence *rfence;
-	int err;
-
-	bind_pt_update.locked = false;
-	xe_bo_assert_held(xe_vma_bo(vma));
-	xe_vm_assert_held(vm);
-
-	vm_dbg(&xe_vma_vm(vma)->xe->drm,
-	       "Preparing bind, with range [%llx...%llx) engine %p.\n",
-	       xe_vma_start(vma), xe_vma_end(vma), q);
-
-	err = xe_pt_prepare_bind(tile, vma, entries, &num_entries);
-	if (err)
-		goto err;
-	xe_tile_assert(tile, num_entries <= ARRAY_SIZE(entries));
-
-	xe_vm_dbg_print_entries(tile_to_xe(tile), entries, num_entries);
-	xe_pt_calc_rfence_interval(vma, &bind_pt_update, entries,
-				   num_entries);
-
-	/*
-	 * If rebind, we have to invalidate TLB on !LR vms to invalidate
-	 * cached PTEs point to freed memory. on LR vms this is done
-	 * automatically when the context is re-enabled by the rebind worker,
-	 * or in fault mode it was invalidated on PTE zapping.
-	 *
-	 * If !rebind, and scratch enabled VMs, there is a chance the scratch
-	 * PTE is already cached in the TLB so it needs to be invalidated.
-	 * on !LR VMs this is done in the ring ops preceding a batch, but on
-	 * non-faulting LR, in particular on user-space batch buffer chaining,
-	 * it needs to be done here.
-	 */
-	if ((rebind && !xe_vm_in_lr_mode(vm) && !vm->batch_invalidate_tlb) ||
-	    (!rebind && xe_vm_has_scratch(vm) && xe_vm_in_preempt_fence_mode(vm))) {
-		ifence = kzalloc(sizeof(*ifence), GFP_KERNEL);
-		if (!ifence)
-			return ERR_PTR(-ENOMEM);
-	}
-
-	rfence = kzalloc(sizeof(*rfence), GFP_KERNEL);
-	if (!rfence) {
-		kfree(ifence);
-		return ERR_PTR(-ENOMEM);
-	}
-
-	fence = xe_migrate_update_pgtables(tile->migrate,
-					   vm, xe_vma_bo(vma), q,
-					   entries, num_entries,
-					   syncs, num_syncs,
-					   &bind_pt_update.base);
-	if (!IS_ERR(fence)) {
-		bool last_munmap_rebind = vma->gpuva.flags & XE_VMA_LAST_REBIND;
-		LLIST_HEAD(deferred);
-		int err;
-
-		err = xe_range_fence_insert(&vm->rftree[tile->id], rfence,
-					    &xe_range_fence_kfree_ops,
-					    bind_pt_update.base.start,
-					    bind_pt_update.base.last, fence);
-		if (err)
-			dma_fence_wait(fence, false);
-
-		/* TLB invalidation must be done before signaling rebind */
-		if (ifence) {
-			int err = invalidation_fence_init(tile->primary_gt, ifence, fence,
-							  vma);
-			if (err) {
-				dma_fence_put(fence);
-				kfree(ifence);
-				return ERR_PTR(err);
-			}
-			fence = &ifence->base.base;
-		}
-
-		/* add shared fence now for pagetable delayed destroy */
-		dma_resv_add_fence(xe_vm_resv(vm), fence, !rebind &&
-				   last_munmap_rebind ?
-				   DMA_RESV_USAGE_KERNEL :
-				   DMA_RESV_USAGE_BOOKKEEP);
-
-		if (!xe_vma_has_no_bo(vma) && !xe_vma_bo(vma)->vm)
-			dma_resv_add_fence(xe_vma_bo(vma)->ttm.base.resv, fence,
-					   DMA_RESV_USAGE_BOOKKEEP);
-		xe_pt_commit_bind(vma, entries, num_entries, rebind,
-				  bind_pt_update.locked ? &deferred : NULL);
-
-		/* This vma is live (again?) now */
-		vma->tile_present |= BIT(tile->id);
-
-		if (bind_pt_update.locked) {
-			to_userptr_vma(vma)->userptr.initial_bind = true;
-			up_read(&vm->userptr.notifier_lock);
-			xe_bo_put_commit(&deferred);
-		}
-		if (!rebind && last_munmap_rebind &&
-		    xe_vm_in_preempt_fence_mode(vm))
-			xe_vm_queue_rebind_worker(vm);
-	} else {
-		kfree(rfence);
-		kfree(ifence);
-		if (bind_pt_update.locked)
-			up_read(&vm->userptr.notifier_lock);
-		xe_pt_abort_bind(vma, entries, num_entries);
-	}
-
-	return fence;
-
-err:
-	return ERR_PTR(err);
-}
-
 struct xe_pt_stage_unbind_walk {
 	/** @base: The pagewalk base-class. */
 	struct xe_pt_walk base;
@@ -1430,7 +1446,7 @@ xe_pt_stage_unbind_post_descend(struct xe_ptw *parent, pgoff_t offset,
 				     &end_offset))
 		return 0;
 
-	(void)xe_pt_new_shared(&xe_walk->wupd, xe_child, offset, false);
+	(void)xe_pt_new_shared(&xe_walk->wupd, xe_child, offset, true);
 	xe_walk->wupd.updates[level].update->qwords = end_offset - offset;
 
 	return 0;
@@ -1478,13 +1494,12 @@ static unsigned int xe_pt_stage_unbind(struct xe_tile *tile, struct xe_vma *vma,
 }
 
 static void
-xe_migrate_clear_pgtable_callback(struct xe_migrate_pt_update *pt_update,
-				  struct xe_tile *tile, struct iosys_map *map,
-				  void *ptr, u32 qword_ofs, u32 num_qwords,
+xe_migrate_clear_pgtable_callback(struct xe_vm *vm, struct xe_tile *tile,
+				  struct iosys_map *map, void *ptr,
+				  u32 qword_ofs, u32 num_qwords,
 				  const struct xe_vm_pgtable_update *update)
 {
-	struct xe_vma *vma = pt_update->vma;
-	u64 empty = __xe_pt_empty_pte(tile, xe_vma_vm(vma), update->pt->level);
+	u64 empty = __xe_pt_empty_pte(tile, vm, update->level);
 	int i;
 
 	if (map && map->is_iomem)
@@ -1498,171 +1513,556 @@ xe_migrate_clear_pgtable_callback(struct xe_migrate_pt_update *pt_update,
 		memset64(ptr, empty, num_qwords);
 }
 
+static void xe_pt_abort_unbind(struct xe_vma *vma,
+			       struct xe_vm_pgtable_update *entries,
+			       u32 num_entries)
+{
+	int j, i;
+
+	xe_pt_commit_locks_assert(vma);
+
+	for (j = num_entries - 1; j >= 0; --j) {
+		struct xe_vm_pgtable_update *entry = &entries[j];
+		struct xe_pt *pt = entry->pt;
+		struct xe_pt_dir *pt_dir = as_xe_pt_dir(pt);
+
+		pt->num_live += entry->qwords;
+
+		if (!pt->level)
+			continue;
+
+		for (i = entry->ofs; i < entry->ofs + entry->qwords; i++)
+			pt_dir->children[i] =
+				entries[j].pt_entries[i - entry->ofs].pt ?
+				&entries[j].pt_entries[i - entry->ofs].pt->base : NULL;
+	}
+}
+
 static void
-xe_pt_commit_unbind(struct xe_vma *vma,
-		    struct xe_vm_pgtable_update *entries, u32 num_entries,
-		    struct llist_head *deferred)
+xe_pt_commit_prepare_unbind(struct xe_vma *vma,
+			    struct xe_vm_pgtable_update *entries,
+			    u32 num_entries)
 {
-	u32 j;
+	int j, i;
 
 	xe_pt_commit_locks_assert(vma);
 
 	for (j = 0; j < num_entries; ++j) {
 		struct xe_vm_pgtable_update *entry = &entries[j];
 		struct xe_pt *pt = entry->pt;
+		struct xe_pt_dir *pt_dir;
 
 		pt->num_live -= entry->qwords;
-		if (pt->level) {
-			struct xe_pt_dir *pt_dir = as_xe_pt_dir(pt);
-			u32 i;
+		if (!pt->level)
+			continue;
 
-			for (i = entry->ofs; i < entry->ofs + entry->qwords;
-			     i++) {
-				if (xe_pt_entry(pt_dir, i))
-					xe_pt_destroy(xe_pt_entry(pt_dir, i),
-						      xe_vma_vm(vma)->flags, deferred);
+		pt_dir = as_xe_pt_dir(pt);
+		for (i = entry->ofs; i < entry->ofs + entry->qwords; i++) {
+			if (xe_pt_entry(pt_dir, i))
+				entries[j].pt_entries[i - entry->ofs].pt =
+					xe_pt_entry(pt_dir, i);
+			else
+				entries[j].pt_entries[i - entry->ofs].pt = NULL;
+			pt_dir->children[i] = NULL;
+		}
+	}
+}
 
-				pt_dir->children[i] = NULL;
-			}
+static void
+xe_pt_update_ops_rfence_interval(struct xe_vm_pgtable_update_ops *pt_update_ops,
+				 struct xe_vma *vma)
+{
+	u32 current_op = pt_update_ops->current_op;
+	struct xe_vm_pgtable_update_op *pt_op = &pt_update_ops->ops[current_op];
+	int i, level = 0;
+	u64 start, last;
+
+	for (i = 0; i < pt_op->num_entries; i++) {
+		const struct xe_vm_pgtable_update *entry = &pt_op->entries[i];
+
+		if (entry->pt->level > level)
+			level = entry->pt->level;
+	}
+
+	/* Greedy (non-optimal) calculation but simple */
+	start = ALIGN_DOWN(xe_vma_start(vma), 0x1ull << xe_pt_shift(level));
+	last = ALIGN(xe_vma_end(vma), 0x1ull << xe_pt_shift(level)) - 1;
+
+	if (start < pt_update_ops->start)
+		pt_update_ops->start = start;
+	if (last > pt_update_ops->last)
+		pt_update_ops->last = last;
+}
+
+static int bind_op_prepare(struct xe_vm *vm, struct xe_tile *tile,
+			   struct xe_vm_pgtable_update_ops *pt_update_ops,
+			   struct xe_vma *vma)
+{
+	u32 current_op = pt_update_ops->current_op;
+	struct xe_vm_pgtable_update_op *pt_op = &pt_update_ops->ops[current_op];
+	int err;
+
+	xe_bo_assert_held(xe_vma_bo(vma));
+
+	vm_dbg(&xe_vma_vm(vma)->xe->drm,
+	       "Preparing bind, with range [%llx...%llx)\n",
+	       xe_vma_start(vma), xe_vma_end(vma) - 1);
+
+	pt_op->vma = NULL;
+	pt_op->bind = true;
+	pt_op->rebind = BIT(tile->id) & vma->tile_present;
+
+	err = xe_pt_prepare_bind(tile, vma, pt_op->entries,
+				 &pt_op->num_entries);
+	if (!err) {
+		xe_tile_assert(tile, pt_op->num_entries <=
+			       ARRAY_SIZE(pt_op->entries));
+		xe_vm_dbg_print_entries(tile_to_xe(tile), pt_op->entries,
+					pt_op->num_entries, true);
+
+		xe_pt_update_ops_rfence_interval(pt_update_ops, vma);
+		++pt_update_ops->current_op;
+		pt_update_ops->needs_userptr_lock |= xe_vma_is_userptr(vma);
+
+		/*
+		 * If rebind, we have to invalidate the TLB on !LR VMs to
+		 * invalidate cached PTEs that point to freed memory. On LR VMs
+		 * this is done automatically when the context is re-enabled by
+		 * the rebind worker, or in fault mode it was invalidated on
+		 * PTE zapping.
+		 *
+		 * If !rebind, and scratch enabled VMs, there is a chance the
+		 * scratch PTE is already cached in the TLB so it needs to be
+		 * invalidated. On !LR VMs this is done in the ring ops
+		 * preceding a batch, but on non-faulting LR, in particular on
+		 * user-space batch buffer chaining, it needs to be done here.
+		 */
+		pt_update_ops->needs_invalidation |=
+			(pt_op->rebind && xe_vm_in_lr_mode(vm) &&
+			!vm->batch_invalidate_tlb) ||
+			(!pt_op->rebind && vm->scratch_pt[tile->id] &&
+			 xe_vm_in_preempt_fence_mode(vm));
+
+		pt_op->vma = vma;
+		xe_pt_commit_prepare_bind(vma, pt_op->entries,
+					  pt_op->num_entries, pt_op->rebind);
+	} else {
+		xe_pt_cancel_bind(vma, pt_op->entries, pt_op->num_entries);
+	}
+
+	return err;
+}
+
+static int unbind_op_prepare(struct xe_tile *tile,
+			     struct xe_vm_pgtable_update_ops *pt_update_ops,
+			     struct xe_vma *vma)
+{
+	u32 current_op = pt_update_ops->current_op;
+	struct xe_vm_pgtable_update_op *pt_op = &pt_update_ops->ops[current_op];
+
+	xe_bo_assert_held(xe_vma_bo(vma));
+
+	vm_dbg(&xe_vma_vm(vma)->xe->drm,
+	       "Preparing unbind, with range [%llx...%llx)\n",
+	       xe_vma_start(vma), xe_vma_end(vma) - 1);
+
+	pt_op->vma = vma;
+	pt_op->bind = false;
+	pt_op->rebind = false;
+
+	pt_op->num_entries = xe_pt_stage_unbind(tile, vma, pt_op->entries);
+
+	xe_vm_dbg_print_entries(tile_to_xe(tile), pt_op->entries,
+				pt_op->num_entries, false);
+	xe_pt_update_ops_rfence_interval(pt_update_ops, vma);
+	++pt_update_ops->current_op;
+	pt_update_ops->needs_userptr_lock |= xe_vma_is_userptr(vma);
+	pt_update_ops->needs_invalidation = true;
+
+	xe_pt_commit_prepare_unbind(vma, pt_op->entries, pt_op->num_entries);
+
+	return 0;
+}
+
+static int op_prepare(struct xe_vm *vm,
+		      struct xe_tile *tile,
+		      struct xe_vm_pgtable_update_ops *pt_update_ops,
+		      struct xe_vma_op *op)
+{
+	int err = 0;
+
+	xe_vm_assert_held(vm);
+
+	switch (op->base.op) {
+	case DRM_GPUVA_OP_MAP:
+		if (!op->map.immediate && xe_vm_in_fault_mode(vm))
+			break;
+
+		err = bind_op_prepare(vm, tile, pt_update_ops, op->map.vma);
+		pt_update_ops->wait_vm_kernel = true;
+		break;
+	case DRM_GPUVA_OP_REMAP:
+		err = unbind_op_prepare(tile, pt_update_ops,
+					gpuva_to_vma(op->base.remap.unmap->va));
+
+		if (!err && op->remap.prev) {
+			err = bind_op_prepare(vm, tile, pt_update_ops,
+					      op->remap.prev);
+			pt_update_ops->wait_vm_bookkeep = true;
+		}
+		if (!err && op->remap.next) {
+			err = bind_op_prepare(vm, tile, pt_update_ops,
+					      op->remap.next);
+			pt_update_ops->wait_vm_bookkeep = true;
+		}
+		break;
+	case DRM_GPUVA_OP_UNMAP:
+		err = unbind_op_prepare(tile, pt_update_ops,
+					gpuva_to_vma(op->base.unmap.va));
+		break;
+	case DRM_GPUVA_OP_PREFETCH:
+		err = bind_op_prepare(vm, tile, pt_update_ops,
+				      gpuva_to_vma(op->base.prefetch.va));
+		pt_update_ops->wait_vm_kernel = true;
+		break;
+	default:
+		drm_warn(&vm->xe->drm, "NOT POSSIBLE");
+	}
+
+	return err;
+}
+
+static void
+xe_pt_update_ops_init(struct xe_vm_pgtable_update_ops *pt_update_ops)
+{
+	init_llist_head(&pt_update_ops->deferred);
+	pt_update_ops->start = ~0x0ull;
+	pt_update_ops->last = 0x0ull;
+}
+
+/**
+ * xe_pt_update_ops_prepare() - Prepare PT update operations
+ * @tile: Tile of PT update operations
+ * @vops: VMA operations
+ *
+ * Prepare PT update operations, which includes updating internal PT state,
+ * allocating memory for new page tables, populating those page tables, and
+ * creating PT update operations for leaf insertion / removal.
+ *
+ * Return: 0 on success, negative error code on error.
+ */
+int xe_pt_update_ops_prepare(struct xe_tile *tile, struct xe_vma_ops *vops)
+{
+	struct xe_vm_pgtable_update_ops *pt_update_ops =
+		&vops->pt_update_ops[tile->id];
+	struct xe_vma_op *op;
+	int err;
+
+	lockdep_assert_held(&vops->vm->lock);
+	xe_vm_assert_held(vops->vm);
+
+	xe_pt_update_ops_init(pt_update_ops);
+
+	list_for_each_entry(op, &vops->list, link) {
+		err = op_prepare(vops->vm, tile, pt_update_ops, op);
+
+		if (err)
+			return err;
+	}
+
+	xe_tile_assert(tile, pt_update_ops->current_op ==
+		       pt_update_ops->num_ops);
+
+#ifdef TEST_VM_OPS_ERROR
+	if (vops->inject_error &&
+	    vops->vm->xe->vm_inject_error_position == FORCE_OP_ERROR_PREPARE)
+		return -ENOSPC;
+#endif
+
+	return 0;
+}
+
+static void bind_op_commit(struct xe_vm *vm, struct xe_tile *tile,
+			   struct xe_vm_pgtable_update_ops *pt_update_ops,
+			   struct xe_vma *vma, struct dma_fence *fence)
+{
+	if (!xe_vma_has_no_bo(vma) && !xe_vma_bo(vma)->vm)
+		dma_resv_add_fence(xe_vma_bo(vma)->ttm.base.resv, fence,
+				   pt_update_ops->wait_vm_bookkeep ?
+				   DMA_RESV_USAGE_KERNEL :
+				   DMA_RESV_USAGE_BOOKKEEP);
+	vma->tile_present |= BIT(tile->id);
+	if (xe_vma_is_userptr(vma)) {
+		lockdep_assert_held_read(&vm->userptr.notifier_lock);
+		to_userptr_vma(vma)->userptr.initial_bind = true;
+	}
+
+	/*
+	 * Kick rebind worker if this bind triggers preempt fences and not in
+	 * the rebind worker
+	 */
+	if (pt_update_ops->wait_vm_bookkeep &&
+	    xe_vm_in_preempt_fence_mode(vm) &&
+	    !current->mm)
+		xe_vm_queue_rebind_worker(vm);
+}
+
+static void unbind_op_commit(struct xe_vm *vm, struct xe_tile *tile,
+			     struct xe_vm_pgtable_update_ops *pt_update_ops,
+			     struct xe_vma *vma, struct dma_fence *fence)
+{
+	if (!xe_vma_has_no_bo(vma) && !xe_vma_bo(vma)->vm)
+		dma_resv_add_fence(xe_vma_bo(vma)->ttm.base.resv, fence,
+				   pt_update_ops->wait_vm_bookkeep ?
+				   DMA_RESV_USAGE_KERNEL :
+				   DMA_RESV_USAGE_BOOKKEEP);
+	vma->tile_present &= ~BIT(tile->id);
+	if (!vma->tile_present) {
+		list_del_init(&vma->combined_links.rebind);
+		if (xe_vma_is_userptr(vma)) {
+			lockdep_assert_held_read(&vm->userptr.notifier_lock);
+
+			spin_lock(&vm->userptr.invalidated_lock);
+			list_del_init(&to_userptr_vma(vma)->userptr.invalidate_link);
+			spin_unlock(&vm->userptr.invalidated_lock);
 		}
 	}
 }
 
-static const struct xe_migrate_pt_update_ops unbind_ops = {
-	.populate = xe_migrate_clear_pgtable_callback,
+static void op_commit(struct xe_vm *vm,
+		      struct xe_tile *tile,
+		      struct xe_vm_pgtable_update_ops *pt_update_ops,
+		      struct xe_vma_op *op, struct dma_fence *fence)
+{
+	xe_vm_assert_held(vm);
+
+	switch (op->base.op) {
+	case DRM_GPUVA_OP_MAP:
+		if (!op->map.immediate && xe_vm_in_fault_mode(vm))
+			break;
+
+		bind_op_commit(vm, tile, pt_update_ops, op->map.vma, fence);
+		break;
+	case DRM_GPUVA_OP_REMAP:
+		unbind_op_commit(vm, tile, pt_update_ops,
+				 gpuva_to_vma(op->base.remap.unmap->va), fence);
+
+		if (op->remap.prev)
+			bind_op_commit(vm, tile, pt_update_ops, op->remap.prev,
+				       fence);
+		if (op->remap.next)
+			bind_op_commit(vm, tile, pt_update_ops, op->remap.next,
+				       fence);
+		break;
+	case DRM_GPUVA_OP_UNMAP:
+		unbind_op_commit(vm, tile, pt_update_ops,
+				 gpuva_to_vma(op->base.unmap.va), fence);
+		break;
+	case DRM_GPUVA_OP_PREFETCH:
+		bind_op_commit(vm, tile, pt_update_ops,
+			       gpuva_to_vma(op->base.prefetch.va), fence);
+		break;
+	default:
+		drm_warn(&vm->xe->drm, "NOT POSSIBLE");
+	}
+}
+
+static const struct xe_migrate_pt_update_ops migrate_ops = {
+	.populate = xe_vm_populate_pgtable,
+	.clear = xe_migrate_clear_pgtable_callback,
 	.pre_commit = xe_pt_pre_commit,
 };
 
-static const struct xe_migrate_pt_update_ops userptr_unbind_ops = {
-	.populate = xe_migrate_clear_pgtable_callback,
+static const struct xe_migrate_pt_update_ops userptr_migrate_ops = {
+	.populate = xe_vm_populate_pgtable,
+	.clear = xe_migrate_clear_pgtable_callback,
 	.pre_commit = xe_pt_userptr_pre_commit,
 };
 
 /**
- * __xe_pt_unbind_vma() - Disconnect and free a page-table tree for the vma
- * address range.
- * @tile: The tile to unbind for.
- * @vma: The vma to unbind.
- * @q: The exec_queue with which to do pipelined page-table updates.
- * @syncs: Entries to sync on before disconnecting the tree to be destroyed.
- * @num_syncs: Number of @sync entries.
+ * xe_pt_update_ops_run() - Run PT update operations
+ * @tile: Tile of PT update operations
+ * @vops: VMA operations
  *
- * This function builds a the xe_vm_pgtable_update entries abstracting the
- * operations needed to detach the page-table tree to be destroyed from the
- * man vm tree.
- * It then takes the relevant locks and submits the operations for
- * pipelined detachment of the gpu page-table from  the vm main tree,
- * (which can be done either by the cpu and the GPU), Finally it frees the
- * detached page-table tree.
+ * Run PT update operations, which includes committing internal PT state
+ * changes, creating a job for the PT update operations (leaf insertion /
+ * removal), and installing the job fence in various places.
  *
- * Return: A valid dma-fence representing the pipelined detachment operation
- * on success, an error pointer on error.
+ * Return: fence on success, negative ERR_PTR on error.
  */
 struct dma_fence *
-__xe_pt_unbind_vma(struct xe_tile *tile, struct xe_vma *vma, struct xe_exec_queue *q,
-		   struct xe_sync_entry *syncs, u32 num_syncs)
+xe_pt_update_ops_run(struct xe_tile *tile, struct xe_vma_ops *vops)
 {
-	struct xe_vm_pgtable_update entries[XE_VM_MAX_LEVEL * 2 + 1];
-	struct xe_pt_migrate_pt_update unbind_pt_update = {
-		.base = {
-			.ops = xe_vma_is_userptr(vma) ? &userptr_unbind_ops :
-			&unbind_ops,
-			.vma = vma,
-			.tile_id = tile->id,
-		},
-	};
-	struct xe_vm *vm = xe_vma_vm(vma);
-	u32 num_entries;
-	struct dma_fence *fence = NULL;
-	struct invalidation_fence *ifence;
+	struct xe_vm *vm = vops->vm;
+	struct xe_vm_pgtable_update_ops *pt_update_ops =
+		&vops->pt_update_ops[tile->id];
+	struct dma_fence *fence;
+	struct invalidation_fence *ifence = NULL;
 	struct xe_range_fence *rfence;
+	struct xe_vma_op *op;
+	int err = 0, i;
+	struct xe_migrate_pt_update update = {
+		.ops = pt_update_ops->needs_userptr_lock ?
+			&userptr_migrate_ops :
+			&migrate_ops,
+		.vops = vops,
+		.tile_id = tile->id
+	};
 
-	LLIST_HEAD(deferred);
-
-	xe_bo_assert_held(xe_vma_bo(vma));
+	lockdep_assert_held(&vm->lock);
 	xe_vm_assert_held(vm);
 
-	vm_dbg(&xe_vma_vm(vma)->xe->drm,
-	       "Preparing unbind, with range [%llx...%llx) engine %p.\n",
-	       xe_vma_start(vma), xe_vma_end(vma), q);
-
-	num_entries = xe_pt_stage_unbind(tile, vma, entries);
-	xe_tile_assert(tile, num_entries <= ARRAY_SIZE(entries));
-
-	xe_vm_dbg_print_entries(tile_to_xe(tile), entries, num_entries);
-	xe_pt_calc_rfence_interval(vma, &unbind_pt_update, entries,
-				   num_entries);
+#ifdef TEST_VM_OPS_ERROR
+	if (vops->inject_error &&
+	    vm->xe->vm_inject_error_position == FORCE_OP_ERROR_RUN)
+		return ERR_PTR(-ENOSPC);
+#endif
 
-	ifence = kzalloc(sizeof(*ifence), GFP_KERNEL);
-	if (!ifence)
-		return ERR_PTR(-ENOMEM);
+	if (pt_update_ops->needs_invalidation) {
+		ifence = kzalloc(sizeof(*ifence), GFP_KERNEL);
+		if (!ifence) {
+			err = -ENOMEM;
+			goto kill_vm_tile1;
+		}
+	}
 
 	rfence = kzalloc(sizeof(*rfence), GFP_KERNEL);
 	if (!rfence) {
-		kfree(ifence);
-		return ERR_PTR(-ENOMEM);
+		err = -ENOMEM;
+		goto free_ifence;
 	}
 
-	/*
-	 * Even if we were already evicted and unbind to destroy, we need to
-	 * clear again here. The eviction may have updated pagetables at a
-	 * lower level, because it needs to be more conservative.
-	 */
-	fence = xe_migrate_update_pgtables(tile->migrate,
-					   vm, NULL, q ? q :
-					   vm->q[tile->id],
-					   entries, num_entries,
-					   syncs, num_syncs,
-					   &unbind_pt_update.base);
-	if (!IS_ERR(fence)) {
-		int err;
-
-		err = xe_range_fence_insert(&vm->rftree[tile->id], rfence,
-					    &xe_range_fence_kfree_ops,
-					    unbind_pt_update.base.start,
-					    unbind_pt_update.base.last, fence);
+	/* Point of no return - VM killed if failure after this */
+	for (i = 0; i < pt_update_ops->num_ops; ++i) {
+		struct xe_vm_pgtable_update_op *pt_op = &pt_update_ops->ops[i];
+
+		xe_pt_commit(pt_op->vma, pt_op->entries,
+			     pt_op->num_entries, &pt_update_ops->deferred);
+		pt_op->vma = NULL;	/* skip in xe_pt_update_ops_abort */
+	}
+
+	fence = xe_migrate_update_pgtables(tile->migrate, &update);
+	if (IS_ERR(fence)) {
+		err = PTR_ERR(fence);
+		goto kill_vm_tile0;
+	}
+
+	err = xe_range_fence_insert(&vm->rftree[tile->id], rfence,
+				    &xe_range_fence_kfree_ops,
+				    pt_update_ops->start,
+				    pt_update_ops->last, fence);
+	if (err)
+		dma_fence_wait(fence, false);
+
+	/* TLB invalidation must be done before signaling rebind */
+	if (ifence) {
+		err = invalidation_fence_init(tile->primary_gt, ifence, fence,
+					      pt_update_ops->start,
+					      pt_update_ops->last,
+					      vm->usm.asid);
 		if (err)
-			dma_fence_wait(fence, false);
-
-		/* TLB invalidation must be done before signaling unbind */
-		err = invalidation_fence_init(tile->primary_gt, ifence, fence, vma);
-		if (err) {
-			dma_fence_put(fence);
-			kfree(ifence);
-			return ERR_PTR(err);
-		}
+			goto put_fence;
 		fence = &ifence->base.base;
+	}
 
-		/* add shared fence now for pagetable delayed destroy */
-		dma_resv_add_fence(xe_vm_resv(vm), fence,
-				   DMA_RESV_USAGE_BOOKKEEP);
+	dma_resv_add_fence(xe_vm_resv(vm), fence,
+			   pt_update_ops->wait_vm_bookkeep ?
+			   DMA_RESV_USAGE_KERNEL :
+			   DMA_RESV_USAGE_BOOKKEEP);
 
-		/* This fence will be installed by caller when doing eviction */
-		if (!xe_vma_has_no_bo(vma) && !xe_vma_bo(vma)->vm)
-			dma_resv_add_fence(xe_vma_bo(vma)->ttm.base.resv, fence,
-					   DMA_RESV_USAGE_BOOKKEEP);
-		xe_pt_commit_unbind(vma, entries, num_entries,
-				    unbind_pt_update.locked ? &deferred : NULL);
-		vma->tile_present &= ~BIT(tile->id);
-	} else {
-		kfree(rfence);
-		kfree(ifence);
-	}
+	list_for_each_entry(op, &vops->list, link)
+		op_commit(vops->vm, tile, pt_update_ops, op, fence);
 
-	if (!vma->tile_present)
-		list_del_init(&vma->combined_links.rebind);
+	if (pt_update_ops->needs_userptr_lock)
+		up_read(&vm->userptr.notifier_lock);
 
-	if (unbind_pt_update.locked) {
-		xe_tile_assert(tile, xe_vma_is_userptr(vma));
+	return fence;
 
-		if (!vma->tile_present) {
-			spin_lock(&vm->userptr.invalidated_lock);
-			list_del_init(&to_userptr_vma(vma)->userptr.invalidate_link);
-			spin_unlock(&vm->userptr.invalidated_lock);
-		}
+put_fence:
+	if (pt_update_ops->needs_userptr_lock)
 		up_read(&vm->userptr.notifier_lock);
-		xe_bo_put_commit(&deferred);
+	dma_fence_put(fence);
+kill_vm_tile0:
+	if (!tile->id)
+		xe_vm_kill(vops->vm, false);
+	kfree(rfence);
+free_ifence:
+	kfree(ifence);
+kill_vm_tile1:
+	if (tile->id)
+		xe_vm_kill(vops->vm, false);
+
+	return ERR_PTR(err);
+}
+
+/**
+ * xe_pt_update_ops_free() - Free PT update operations
+ * @pt_op: Array of PT update operations
+ * @num_ops: Number of PT update operations
+ *
+ * Free PT update operations
+ */
+void xe_pt_update_ops_free(struct xe_vm_pgtable_update_op *pt_op, u32 num_ops)
+{
+	u32 i;
+
+	for (i = 0; i < num_ops; ++i, ++pt_op)
+		xe_pt_free_bind(pt_op->entries, pt_op->num_entries);
+}
+
+/**
+ * xe_pt_update_ops_fini() - Finish PT update operations
+ * @tile: Tile of PT update operations
+ * @vops: VMA operations
+ *
+ * Finish PT update operations by committing the destruction of page table memory
+ */
+void xe_pt_update_ops_fini(struct xe_tile *tile, struct xe_vma_ops *vops)
+{
+	struct xe_vm_pgtable_update_ops *pt_update_ops =
+		&vops->pt_update_ops[tile->id];
+
+	lockdep_assert_held(&vops->vm->lock);
+	xe_vm_assert_held(vops->vm);
+
+	xe_bo_put_commit(tile_to_xe(tile), &pt_update_ops->deferred);
+	if (!pt_update_ops->skip_free)
+		xe_pt_update_ops_free(pt_update_ops->ops,
+				      pt_update_ops->num_ops);
+	else
+		pt_update_ops->ops = NULL;
+}
+
+/**
+ * xe_pt_update_ops_abort() - Abort PT update operations
+ * @tile: Tile of PT update operations
+ * @vops: VMA operations
+ *
+ * Abort PT update operations by unwinding internal PT state
+ */
+void xe_pt_update_ops_abort(struct xe_tile *tile, struct xe_vma_ops *vops)
+{
+	struct xe_vm_pgtable_update_ops *pt_update_ops =
+		&vops->pt_update_ops[tile->id];
+	int i;
+
+	lockdep_assert_held(&vops->vm->lock);
+	xe_vm_assert_held(vops->vm);
+
+	for (i = pt_update_ops->num_ops - 1; i >= 0; --i) {
+		struct xe_vm_pgtable_update_op *pt_op =
+			&pt_update_ops->ops[i];
+
+		if (!pt_op->vma || i >= pt_update_ops->current_op)
+			continue;
+
+		if (pt_op->bind)
+			xe_pt_abort_bind(pt_op->vma, pt_op->entries,
+					 pt_op->num_entries,
+					 pt_op->rebind);
+		else
+			xe_pt_abort_unbind(pt_op->vma, pt_op->entries,
+					   pt_op->num_entries);
 	}
 
-	return fence;
+	xe_pt_update_ops_fini(tile, vops);
 }
diff --git a/drivers/gpu/drm/xe/xe_pt.h b/drivers/gpu/drm/xe/xe_pt.h
index 71a4fbfcff43..989c9b190fa0 100644
--- a/drivers/gpu/drm/xe/xe_pt.h
+++ b/drivers/gpu/drm/xe/xe_pt.h
@@ -17,6 +17,7 @@ struct xe_sync_entry;
 struct xe_tile;
 struct xe_vm;
 struct xe_vma;
+struct xe_vma_ops;
 
 /* Largest huge pte is currently 1GiB. May become device dependent. */
 #define MAX_HUGEPTE_LEVEL 2
@@ -34,14 +35,12 @@ void xe_pt_populate_empty(struct xe_tile *tile, struct xe_vm *vm,
 
 void xe_pt_destroy(struct xe_pt *pt, u32 flags, struct llist_head *deferred);
 
-struct dma_fence *
-__xe_pt_bind_vma(struct xe_tile *tile, struct xe_vma *vma, struct xe_exec_queue *q,
-		 struct xe_sync_entry *syncs, u32 num_syncs,
-		 bool rebind);
-
-struct dma_fence *
-__xe_pt_unbind_vma(struct xe_tile *tile, struct xe_vma *vma, struct xe_exec_queue *q,
-		   struct xe_sync_entry *syncs, u32 num_syncs);
+int xe_pt_update_ops_prepare(struct xe_tile *tile, struct xe_vma_ops *vops);
+struct dma_fence *xe_pt_update_ops_run(struct xe_tile *tile,
+				       struct xe_vma_ops *vops);
+void xe_pt_update_ops_fini(struct xe_tile *tile, struct xe_vma_ops *vops);
+void xe_pt_update_ops_abort(struct xe_tile *tile, struct xe_vma_ops *vops);
+void xe_pt_update_ops_free(struct xe_vm_pgtable_update_op *pt_op, u32 num_ops);
 
 bool xe_pt_zap_ptes(struct xe_tile *tile, struct xe_vma *vma);
 
diff --git a/drivers/gpu/drm/xe/xe_pt_exec_queue.c b/drivers/gpu/drm/xe/xe_pt_exec_queue.c
new file mode 100644
index 000000000000..2a6ae6267594
--- /dev/null
+++ b/drivers/gpu/drm/xe/xe_pt_exec_queue.c
@@ -0,0 +1,180 @@
+// SPDX-License-Identifier: MIT
+/*
+ * Copyright © 2024 Intel Corporation
+ */
+
+#include <drm/gpu_scheduler.h>
+
+#include "xe_bo.h"
+#include "xe_device.h"
+#include "xe_exec_queue.h"
+#include "xe_migrate.h"
+#include "xe_pt.h"
+#include "xe_pt_exec_queue.h"
+#include "xe_sched_job.h"
+#include "xe_trace.h"
+
+/**
+ * struct xe_pt_exec_queue - PT specific state for an xe_exec_queue
+ */
+struct xe_pt_exec_queue {
+	/** @q: Backpointer to parent xe_exec_queue */
+	struct xe_exec_queue *q;
+	/** @sched: GPU scheduler for this xe_exec_queue */
+	struct drm_gpu_scheduler sched;
+	/** @entity: Scheduler entity for this xe_exec_queue */
+	struct drm_sched_entity entity;
+	/** @fini_async: do final fini async from this worker */
+	struct work_struct fini_async;
+};
+
+static bool is_pt_job(struct xe_sched_job *job)
+{
+	return test_bit(JOB_FLAG_PT, &job->fence->flags);
+}
+
+static void cleanup_pt_job(struct xe_device *xe, struct xe_sched_job *job)
+{
+	xe_pt_update_ops_free(job->pt_update[0].pt_op,
+			      job->pt_update[0].num_ops);
+	xe_bo_put_commit(xe, &job->pt_update[0].deferred);
+	kfree(job->pt_update[0].pt_op);
+}
+
+static void run_pt_job(struct xe_device *xe, struct xe_sched_job *job)
+{
+	__xe_migrate_update_pgtables_cpu(job->pt_update[0].vm,
+					 job->pt_update[0].tile,
+					 job->pt_update[0].ops,
+					 job->pt_update[0].pt_op,
+					 job->pt_update[0].num_ops);
+	cleanup_pt_job(xe, job);
+}
+
+static struct dma_fence *
+pt_exec_queue_run_job(struct drm_sched_job *drm_job)
+{
+	struct xe_sched_job *job = to_xe_sched_job(drm_job);
+	struct xe_exec_queue *q = job->q;
+	struct xe_device *xe = q->xe;
+
+	xe_assert(xe, is_pt_job(job));
+	xe_assert(xe, q->flags & EXEC_QUEUE_FLAG_PT);
+
+	trace_xe_sched_job_run(job);
+	run_pt_job(xe, job);
+
+	return NULL;
+}
+
+static void pt_exec_queue_free_job(struct drm_sched_job *drm_job)
+{
+	struct xe_sched_job *job = to_xe_sched_job(drm_job);
+
+	trace_xe_sched_job_free(job);
+	xe_sched_job_put(job);
+}
+
+static const struct drm_sched_backend_ops drm_sched_ops = {
+	.run_job = pt_exec_queue_run_job,
+	.free_job = pt_exec_queue_free_job,
+};
+
+static void pt_exec_queue_kill(struct xe_exec_queue *q)
+{
+}
+
+static void __pt_exec_queue_fini_async(struct work_struct *w)
+{
+	struct xe_pt_exec_queue *pe =
+		container_of(w, struct xe_pt_exec_queue, fini_async);
+	struct xe_exec_queue *q = pe->q;
+
+	trace_xe_exec_queue_destroy(q);
+
+	drm_sched_entity_fini(&pe->entity);
+	drm_sched_fini(&pe->sched);
+
+	kfree(pe);
+
+	xe_device_mem_access_put(q->xe);
+	xe_exec_queue_fini(q);
+}
+
+static void pt_exec_queue_fini(struct xe_exec_queue *q)
+{
+	INIT_WORK(&q->pt->fini_async, __pt_exec_queue_fini_async);
+	queue_work(system_wq, &q->pt->fini_async);
+}
+
+static bool pt_exec_queue_reset_status(struct xe_exec_queue *q)
+{
+	return false;
+}
+
+static const struct xe_exec_queue_ops pt_exec_queue_ops = {
+	.kill = pt_exec_queue_kill,
+	.fini = pt_exec_queue_fini,
+	.reset_status = pt_exec_queue_reset_status,
+};
+
+struct xe_exec_queue *xe_pt_exec_queue_create(struct xe_device *xe)
+{
+	struct drm_gpu_scheduler *sched;
+	struct xe_exec_queue *q;
+	struct xe_pt_exec_queue *pe;
+	int err;
+
+	q = kzalloc(sizeof(*q), GFP_KERNEL);
+	if (!q)
+		return ERR_PTR(-ENOMEM);
+
+	kref_init(&q->refcount);
+	q->flags = EXEC_QUEUE_FLAG_PT;
+	q->ops = &pt_exec_queue_ops;
+
+	pe = kzalloc(sizeof(*pe), GFP_KERNEL);
+	if (!pe) {
+		err = -ENOMEM;
+		goto err_free;
+	}
+
+	err = drm_sched_init(&pe->sched, &drm_sched_ops, system_wq, 1, 64, 64,
+			     MAX_SCHEDULE_TIMEOUT, system_wq, NULL,
+			     q->name, xe->drm.dev);
+	if (err)
+		goto err_free;
+
+	sched = &pe->sched;
+	err = drm_sched_entity_init(&pe->entity, 0, &sched, 1, NULL);
+	if (err)
+		goto err_sched;
+
+	q->xe = xe;
+	q->pt = pe;
+	pe->q = q;
+	q->entity = &pe->entity;
+
+	xe_exec_queue_assign_name(q, 0);
+	trace_xe_exec_queue_create(q);
+
+	/*
+	 * Normally the user vm holds an rpm ref to keep the device
+	 * awake, and the context holds a ref for the vm, however for
+	 * some engines we use the kernels migrate vm underneath which offers no
+	 * such rpm ref, or we lack a vm. Make sure we keep a ref here, so we
+	 * can perform GuC CT actions when needed. Caller is expected to have
+	 * already grabbed the rpm ref outside any sensitive locks.
+	 */
+	drm_WARN_ON(&xe->drm, !xe_device_mem_access_get_if_ongoing(xe));
+
+	return q;
+
+err_sched:
+	drm_sched_fini(&pe->sched);
+err_free:
+	kfree(pe);
+	kfree(q);
+
+	return ERR_PTR(err);
+}
diff --git a/drivers/gpu/drm/xe/xe_pt_exec_queue.h b/drivers/gpu/drm/xe/xe_pt_exec_queue.h
new file mode 100644
index 000000000000..a4d16b845418
--- /dev/null
+++ b/drivers/gpu/drm/xe/xe_pt_exec_queue.h
@@ -0,0 +1,14 @@
+/* SPDX-License-Identifier: MIT */
+/*
+ * Copyright © 2024 Intel Corporation
+ */
+
+#ifndef _XE_PT_EXEC_QUEUE_H_
+#define _XE_PT_EXEC_QUEUE_H_
+
+struct xe_device;
+struct xe_exec_queue;
+
+struct xe_exec_queue *xe_pt_exec_queue_create(struct xe_device *xe);
+
+#endif
diff --git a/drivers/gpu/drm/xe/xe_pt_types.h b/drivers/gpu/drm/xe/xe_pt_types.h
index cee70cb0f014..cfd0d35408a5 100644
--- a/drivers/gpu/drm/xe/xe_pt_types.h
+++ b/drivers/gpu/drm/xe/xe_pt_types.h
@@ -70,8 +70,61 @@ struct xe_vm_pgtable_update {
 	/** @pt_entries: Newly added pagetable entries */
 	struct xe_pt_entry *pt_entries;
 
+	/** @level: level of update */
+	unsigned int level;
+
 	/** @flags: Target flags */
 	u32 flags;
 };
 
+/** struct xe_vm_pgtable_update_op - Page table update operation */
+struct xe_vm_pgtable_update_op {
+	/** @entries: entries to update for this operation */
+	struct xe_vm_pgtable_update entries[XE_VM_MAX_LEVEL * 2 + 1];
+	/** @vma: VMA for operation, operation not valid if NULL */
+	struct xe_vma *vma;
+	/** @num_entries: number of entries for this update operation */
+	u32 num_entries;
+	/** @bind: is a bind */
+	bool bind;
+	/** @rebind: is a rebind */
+	bool rebind;
+};
+
+/** struct xe_vm_pgtable_update_ops - Page table update operations */
+struct xe_vm_pgtable_update_ops {
+	/** @ops: operations */
+	struct xe_vm_pgtable_update_op *ops;
+	/** @deferred: deferred list to destroy PT entries */
+	struct llist_head deferred;
+	/** @q: exec queue for PT operations */
+	struct xe_exec_queue *q;
+	/** @start: start address of ops */
+	u64 start;
+	/** @last: last address of ops */
+	u64 last;
+	/** @num_ops: number of operations */
+	u32 num_ops;
+	/** @current_op: current operation */
+	u32 current_op;
+	/** @needs_userptr_lock: Needs userptr lock */
+	bool needs_userptr_lock;
+	/** @needs_invalidation: Needs invalidation */
+	bool needs_invalidation;
+	/**
+	 * @wait_vm_bookkeep: PT operations need to wait until VM is idle
+	 * (bookkeep dma-resv slots are idle) and stage all future VM activity
+	 * behind these operations (install PT operations into VM kernel
+	 * dma-resv slot).
+	 */
+	bool wait_vm_bookkeep;
+	/**
+	 * @wait_vm_kernel: PT operations need to wait until VM kernel dma-resv
+	 * slots are idle.
+	 */
+	bool wait_vm_kernel;
+	/** @skip_free: Free @ops in submission backend rather than in IOCTL */
+	bool skip_free;
+};
+
 #endif
diff --git a/drivers/gpu/drm/xe/xe_sched_job.c b/drivers/gpu/drm/xe/xe_sched_job.c
index 8151ddafb940..fc24e675f922 100644
--- a/drivers/gpu/drm/xe/xe_sched_job.c
+++ b/drivers/gpu/drm/xe/xe_sched_job.c
@@ -23,19 +23,22 @@ static struct kmem_cache *xe_sched_job_parallel_slab;
 
 int __init xe_sched_job_module_init(void)
 {
+	struct xe_sched_job *job;
+	size_t size;
+
+	size = struct_size(job, batch_addr, 1);
 	xe_sched_job_slab =
-		kmem_cache_create("xe_sched_job",
-				  sizeof(struct xe_sched_job) +
-				  sizeof(u64), 0,
+		kmem_cache_create("xe_sched_job", size, 0,
 				  SLAB_HWCACHE_ALIGN, NULL);
 	if (!xe_sched_job_slab)
 		return -ENOMEM;
 
+	size = max_t(size_t,
+		     struct_size(job, batch_addr,
+				 XE_HW_ENGINE_MAX_INSTANCE),
+		     struct_size(job, pt_update, 1));
 	xe_sched_job_parallel_slab =
-		kmem_cache_create("xe_sched_job_parallel",
-				  sizeof(struct xe_sched_job) +
-				  sizeof(u64) *
-				  XE_HW_ENGINE_MAX_INSTANCE, 0,
+		kmem_cache_create("xe_sched_job_parallel", size, 0,
 				  SLAB_HWCACHE_ALIGN, NULL);
 	if (!xe_sched_job_parallel_slab) {
 		kmem_cache_destroy(xe_sched_job_slab);
@@ -62,18 +65,21 @@ bool xe_sched_job_is_migration(struct xe_exec_queue *q)
 	return q->vm && (q->vm->flags & XE_VM_FLAG_MIGRATION);
 }
 
-static void job_free(struct xe_sched_job *job)
+static bool parallel_slab(struct xe_exec_queue *q)
 {
-	struct xe_exec_queue *q = job->q;
-	bool is_migration = xe_sched_job_is_migration(q);
+	return !q->width || xe_exec_queue_is_parallel(q) ||
+		xe_sched_job_is_migration(q);
+}
 
-	kmem_cache_free(xe_exec_queue_is_parallel(job->q) || is_migration ?
-			xe_sched_job_parallel_slab : xe_sched_job_slab, job);
+static void job_free(struct xe_sched_job *job)
+{
+	kmem_cache_free(parallel_slab(job->q) ? xe_sched_job_parallel_slab :
+			xe_sched_job_slab, job);
 }
 
 static struct xe_device *job_to_xe(struct xe_sched_job *job)
 {
-	return gt_to_xe(job->q->gt);
+	return job->q->xe;
 }
 
 struct xe_sched_job *xe_sched_job_create(struct xe_exec_queue *q,
@@ -86,17 +92,19 @@ struct xe_sched_job *xe_sched_job_create(struct xe_exec_queue *q,
 	int i, j;
 	u32 width;
 
-	/* only a kernel context can submit a vm-less job */
-	XE_WARN_ON(!q->vm && !(q->flags & EXEC_QUEUE_FLAG_KERNEL));
+	/* only a kernel and pt exec queue can submit a vm-less job */
+	XE_WARN_ON(!q->vm && !(q->flags & EXEC_QUEUE_FLAG_KERNEL) &&
+		   !(q->flags & EXEC_QUEUE_FLAG_PT));
 
-	/* Migration and kernel engines have their own locking */
-	if (!(q->flags & (EXEC_QUEUE_FLAG_KERNEL | EXEC_QUEUE_FLAG_VM))) {
+	/* Kernel and pt exec queues have their own locking */
+	if (!(q->flags & EXEC_QUEUE_FLAG_KERNEL) &&
+	    !(q->flags & EXEC_QUEUE_FLAG_PT)) {
 		lockdep_assert_held(&q->vm->lock);
 		if (!xe_vm_in_lr_mode(q->vm))
 			xe_vm_assert_held(q->vm);
 	}
 
-	job = job_alloc(xe_exec_queue_is_parallel(q) || is_migration);
+	job = job_alloc(parallel_slab(q));
 	if (!job)
 		return ERR_PTR(-ENOMEM);
 
@@ -108,7 +116,15 @@ struct xe_sched_job *xe_sched_job_create(struct xe_exec_queue *q,
 	if (err)
 		goto err_free;
 
-	if (!xe_exec_queue_is_parallel(q)) {
+	if (!batch_addr) {
+		xe_assert(q->xe, q->flags & EXEC_QUEUE_FLAG_PT);
+
+		job->fence = dma_fence_allocate_private_stub(ktime_get());
+		if (!job->fence) {
+			err = -ENOMEM;
+			goto err_sched_job;
+		}
+	} else if (!xe_exec_queue_is_parallel(q)) {
 		job->fence = xe_lrc_create_seqno_fence(q->lrc);
 		if (IS_ERR(job->fence)) {
 			err = PTR_ERR(job->fence);
@@ -148,12 +164,14 @@ struct xe_sched_job *xe_sched_job_create(struct xe_exec_queue *q,
 		job->fence = &cf->base;
 	}
 
-	width = q->width;
-	if (is_migration)
-		width = 2;
+	if (batch_addr) {
+		width = q->width;
+		if (is_migration)
+			width = 2;
 
-	for (i = 0; i < width; ++i)
-		job->batch_addr[i] = batch_addr[i];
+		for (i = 0; i < width; ++i)
+			job->batch_addr[i] = batch_addr[i];
+	}
 
 	/* All other jobs require a VM to be open which has a ref */
 	if (unlikely(q->flags & EXEC_QUEUE_FLAG_KERNEL))
@@ -282,7 +300,7 @@ struct xe_sched_job_snapshot *
 xe_sched_job_snapshot_capture(struct xe_sched_job *job)
 {
 	struct xe_exec_queue *q = job->q;
-	struct xe_device *xe = q->gt->tile->xe;
+	struct xe_device *xe = job_to_xe(job);
 	struct xe_sched_job_snapshot *snapshot;
 	size_t len = sizeof(*snapshot) + (sizeof(u64) * q->width);
 	u16 i;
diff --git a/drivers/gpu/drm/xe/xe_sched_job_types.h b/drivers/gpu/drm/xe/xe_sched_job_types.h
index b1d83da50a53..29ca43d1eb65 100644
--- a/drivers/gpu/drm/xe/xe_sched_job_types.h
+++ b/drivers/gpu/drm/xe/xe_sched_job_types.h
@@ -11,6 +11,28 @@
 #include <drm/gpu_scheduler.h>
 
 struct xe_exec_queue;
+struct xe_migrate_pt_update_ops;
+struct xe_tile;
+struct xe_vm;
+struct xe_vm_pgtable_update_op;
+
+/**
+ * struct pt_update_args - PT update arguments
+ */
+struct pt_update_args {
+	/** @vm: VM */
+	struct xe_vm *vm;
+	/** @tile: Tile */
+	struct xe_tile *tile;
+	/** @ops: Migrate PT update ops */
+	const struct xe_migrate_pt_update_ops *ops;
+	/** @pt_op: PT update ops */
+	struct xe_vm_pgtable_update_op *pt_op;
+	/** @deferred: deferred list to destroy PT entries */
+	struct llist_head deferred;
+	/** @num_ops: number of PT update ops */
+	int num_ops;
+};
 
 /**
  * struct xe_sched_job - XE schedule job (batch buffer tracking)
@@ -27,6 +49,7 @@ struct xe_sched_job {
 	 * can safely reference fence, fence cannot safely reference job.
 	 */
 #define JOB_FLAG_SUBMIT		DMA_FENCE_FLAG_USER_BITS
+#define JOB_FLAG_PT		(DMA_FENCE_FLAG_USER_BITS << 1)
 	struct dma_fence *fence;
 	/** @user_fence: write back value when BB is complete */
 	struct {
@@ -39,8 +62,12 @@ struct xe_sched_job {
 	} user_fence;
 	/** @migrate_flush_flags: Additional flush flags for migration jobs */
 	u32 migrate_flush_flags;
-	/** @batch_addr: batch buffer address of job */
-	u64 batch_addr[];
+	union {
+		/** @batch_addr: batch buffer address of job */
+		DECLARE_FLEX_ARRAY(u64, batch_addr);
+		/** @pt_update: PT update arguments */
+		DECLARE_FLEX_ARRAY(struct pt_update_args, pt_update);
+	};
 };
 
 struct xe_sched_job_snapshot {
diff --git a/drivers/gpu/drm/xe/xe_sync.c b/drivers/gpu/drm/xe/xe_sync.c
index 02c9577fe418..07aa65d9bcab 100644
--- a/drivers/gpu/drm/xe/xe_sync.c
+++ b/drivers/gpu/drm/xe/xe_sync.c
@@ -343,6 +343,21 @@ xe_sync_in_fence_get(struct xe_sync_entry *sync, int num_sync,
 	return ERR_PTR(-ENOMEM);
 }
 
+/**
+ * __xe_sync_ufence_get() - Take a reference on a user fence
+ * @ufence: input user fence
+ *
+ * Take an additional reference on @ufence and return it.
+ *
+ * Return: xe_user_fence pointer with reference
+ */
+struct xe_user_fence *__xe_sync_ufence_get(struct xe_user_fence *ufence)
+{
+	user_fence_get(ufence);
+
+	return ufence;
+}
+
 /**
  * xe_sync_ufence_get() - Get user fence from sync
  * @sync: input sync
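The new helper returns the pointer it was handed so callers can take a reference and assign in one expression. A toy refcount sketch of that get/put convention (the real code delegates to an internal user_fence_get(); these names are illustrative):

```c
#include <assert.h>
#include <stdlib.h>

/* Toy stand-in for struct xe_user_fence with a plain int refcount. */
struct ufence {
	int refcount;
};

/* Bump the count and hand the same pointer back, mirroring the
 * __xe_sync_ufence_get() calling convention. */
static struct ufence *ufence_get(struct ufence *f)
{
	f->refcount++;
	return f;
}

static void ufence_put(struct ufence *f)
{
	if (--f->refcount == 0)
		free(f);
}
```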
diff --git a/drivers/gpu/drm/xe/xe_sync.h b/drivers/gpu/drm/xe/xe_sync.h
index 0fd0d51208e6..26e9ec9de1a8 100644
--- a/drivers/gpu/drm/xe/xe_sync.h
+++ b/drivers/gpu/drm/xe/xe_sync.h
@@ -38,6 +38,7 @@ static inline bool xe_sync_is_ufence(struct xe_sync_entry *sync)
 	return !!sync->ufence;
 }
 
+struct xe_user_fence *__xe_sync_ufence_get(struct xe_user_fence *ufence);
 struct xe_user_fence *xe_sync_ufence_get(struct xe_sync_entry *sync);
 void xe_sync_ufence_put(struct xe_user_fence *ufence);
 int xe_sync_ufence_get_status(struct xe_user_fence *ufence);
diff --git a/drivers/gpu/drm/xe/xe_trace.h b/drivers/gpu/drm/xe/xe_trace.h
index 4ddc55527f9a..c4704c5f3c72 100644
--- a/drivers/gpu/drm/xe/xe_trace.h
+++ b/drivers/gpu/drm/xe/xe_trace.h
@@ -147,8 +147,9 @@ DECLARE_EVENT_CLASS(xe_exec_queue,
 			   __entry->logical_mask = q->logical_mask;
 			   __entry->gt_id = q->gt->info.id;
 			   __entry->width = q->width;
-			   __entry->guc_id = q->guc->id;
-			   __entry->guc_state = atomic_read(&q->guc->state);
+			   __entry->guc_id = q->guc ? q->guc->id : 0;
+			   __entry->guc_state = q->guc ?
+			   atomic_read(&q->guc->state) : 0;
 			   __entry->flags = q->flags;
 			   ),
 
@@ -264,9 +265,9 @@ DECLARE_EVENT_CLASS(xe_sched_job,
 
 		    TP_fast_assign(
 			   __entry->seqno = xe_sched_job_seqno(job);
-			   __entry->guc_id = job->q->guc->id;
-			   __entry->guc_state =
-			   atomic_read(&job->q->guc->state);
+			   __entry->guc_id = job->q->guc ? job->q->guc->id : 0;
+			   __entry->guc_state = job->q->guc ?
+			   atomic_read(&job->q->guc->state) : 0;
 			   __entry->flags = job->q->flags;
 			   __entry->error = job->fence->error;
 			   __entry->fence = (unsigned long)job->fence;
@@ -423,11 +424,6 @@ DEFINE_EVENT(xe_vma, xe_vma_acc,
 	     TP_ARGS(vma)
 );
 
-DEFINE_EVENT(xe_vma, xe_vma_fail,
-	     TP_PROTO(struct xe_vma *vma),
-	     TP_ARGS(vma)
-);
-
 DEFINE_EVENT(xe_vma, xe_vma_bind,
 	     TP_PROTO(struct xe_vma *vma),
 	     TP_ARGS(vma)
@@ -541,6 +537,11 @@ DEFINE_EVENT(xe_vm, xe_vm_rebind_worker_exit,
 	     TP_ARGS(vm)
 );
 
+DEFINE_EVENT(xe_vm, xe_vm_ops_fail,
+	     TP_PROTO(struct xe_vm *vm),
+	     TP_ARGS(vm)
+);
+
 /* GuC */
 DECLARE_EVENT_CLASS(xe_guc_ct_flow_control,
 		    TP_PROTO(u32 _head, u32 _tail, u32 size, u32 space, u32 len),
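The tracepoint changes above guard every q->guc dereference because queues backed by the new PT exec queue carry no GuC state. A minimal sketch of that NULL-guard pattern (struct names are illustrative):

```c
#include <assert.h>
#include <stddef.h>

struct guc_exec_queue { int id; };
struct exec_queue { struct guc_exec_queue *guc; };

/* A queue without GuC backing reports 0 rather than dereferencing a
 * NULL pointer, matching the guarded TP_fast_assign above. */
static int trace_guc_id(const struct exec_queue *q)
{
	return q->guc ? q->guc->id : 0;
}
```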
diff --git a/drivers/gpu/drm/xe/xe_uc_fw.c b/drivers/gpu/drm/xe/xe_uc_fw.c
index a9d25b3fa67c..d6f788a42979 100644
--- a/drivers/gpu/drm/xe/xe_uc_fw.c
+++ b/drivers/gpu/drm/xe/xe_uc_fw.c
@@ -105,6 +105,7 @@ struct fw_blobs_by_type {
 #define XE_GUC_FIRMWARE_DEFS(fw_def, mmp_ver, major_ver)			\
 	fw_def(LUNARLAKE,	major_ver(xe,	guc,	lnl,	70, 19, 2))	\
 	fw_def(METEORLAKE,	major_ver(i915,	guc,	mtl,	70, 19, 2))	\
+	fw_def(PVC,		major_ver(i915,	guc,	pvc,	70, 19, 2))	\
 	fw_def(DG2,		major_ver(i915,	guc,	dg2,	70, 19, 2))	\
 	fw_def(DG1,		major_ver(i915,	guc,	dg1,	70, 19, 2))	\
 	fw_def(ALDERLAKE_N,	major_ver(i915,	guc,	tgl,	70, 19, 2))	\
diff --git a/drivers/gpu/drm/xe/xe_vm.c b/drivers/gpu/drm/xe/xe_vm.c
index 643b3701a738..8ba037e7ce5c 100644
--- a/drivers/gpu/drm/xe/xe_vm.c
+++ b/drivers/gpu/drm/xe/xe_vm.c
@@ -34,6 +34,7 @@
 #include "xe_pm.h"
 #include "xe_preempt_fence.h"
 #include "xe_pt.h"
+#include "xe_pt_exec_queue.h"
 #include "xe_res_cursor.h"
 #include "xe_sync.h"
 #include "xe_trace.h"
@@ -413,19 +414,23 @@ int __xe_vm_userptr_needs_repin(struct xe_vm *vm)
 
 #define XE_VM_REBIND_RETRY_TIMEOUT_MS 1000
 
-static void xe_vm_kill(struct xe_vm *vm)
+void xe_vm_kill(struct xe_vm *vm, bool unlocked)
 {
 	struct xe_exec_queue *q;
 
 	lockdep_assert_held(&vm->lock);
 
-	xe_vm_lock(vm, false);
+	if (unlocked)
+		xe_vm_lock(vm, false);
+
 	vm->flags |= XE_VM_FLAG_BANNED;
 	trace_xe_vm_kill(vm);
 
 	list_for_each_entry(q, &vm->preempt.exec_queues, compute.link)
 		q->ops->kill(q);
-	xe_vm_unlock(vm);
+
+	if (unlocked)
+		xe_vm_unlock(vm);
 
 	/* TODO: Inform user the VM is banned */
 }
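The new `unlocked` parameter lets xe_vm_kill() serve callers that do and do not already hold the VM's reservation lock. A sketch of that calling convention, with a depth counter standing in for the real lock (all names illustrative):

```c
#include <assert.h>

static int lock_depth;
static int banned;

static void fake_lock(void)   { ++lock_depth; }
static void fake_unlock(void) { --lock_depth; }

/* The caller states whether it already holds the lock; the kill body
 * runs under the lock either way. */
static void vm_kill(int unlocked)
{
	if (unlocked)
		fake_lock();

	banned = 1;

	if (unlocked)
		fake_unlock();
}
```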
@@ -515,14 +520,19 @@ static int xe_preempt_work_begin(struct drm_exec *exec, struct xe_vm *vm,
 	if (err)
 		return err;
 
-	return drm_gpuvm_validate(&vm->gpuvm, exec);
+	err = drm_gpuvm_validate(&vm->gpuvm, exec);
+	if (err)
+		return err;
+
+	err = xe_vm_rebind(vm, true);
+
+	return err;
 }
 
 static void preempt_rebind_work_func(struct work_struct *w)
 {
 	struct xe_vm *vm = container_of(w, struct xe_vm, preempt.rebind_work);
 	struct drm_exec exec;
-	struct dma_fence *rebind_fence;
 	unsigned int fence_count = 0;
 	LIST_HEAD(preempt_fences);
 	ktime_t end = 0;
@@ -568,18 +578,7 @@ static void preempt_rebind_work_func(struct work_struct *w)
 	if (err)
 		goto out_unlock;
 
-	rebind_fence = xe_vm_rebind(vm, true);
-	if (IS_ERR(rebind_fence)) {
-		err = PTR_ERR(rebind_fence);
-		goto out_unlock;
-	}
-
-	if (rebind_fence) {
-		dma_fence_wait(rebind_fence, false);
-		dma_fence_put(rebind_fence);
-	}
-
-	/* Wait on munmap style VM unbinds */
+	/* Wait on rebinds */
 	wait = dma_resv_wait_timeout(xe_vm_resv(vm),
 				     DMA_RESV_USAGE_KERNEL,
 				     false, MAX_SCHEDULE_TIMEOUT);
@@ -621,7 +620,7 @@ static void preempt_rebind_work_func(struct work_struct *w)
 
 	if (err) {
 		drm_warn(&vm->xe->drm, "VM worker error: %d\n", err);
-		xe_vm_kill(vm);
+		xe_vm_kill(vm, true);
 	}
 	up_write(&vm->lock);
 
@@ -751,19 +750,103 @@ int xe_vm_userptr_check_repin(struct xe_vm *vm)
 		list_empty_careful(&vm->userptr.invalidated)) ? 0 : -EAGAIN;
 }
 
-static struct dma_fence *
-xe_vm_bind_vma(struct xe_vma *vma, struct xe_exec_queue *q,
-	       struct xe_sync_entry *syncs, u32 num_syncs,
-	       bool first_op, bool last_op);
+static void xe_vma_ops_init(struct xe_vma_ops *vops, struct xe_vm *vm,
+			    struct xe_exec_queue *q,
+			    struct xe_sync_entry *syncs, u32 num_syncs)
+{
+	memset(vops, 0, sizeof(*vops));
+	INIT_LIST_HEAD(&vops->list);
+	vops->vm = vm;
+	vops->q = q;
+	vops->syncs = syncs;
+	vops->num_syncs = num_syncs;
+}
+
+static int xe_vma_ops_alloc(struct xe_vma_ops *vops)
+{
+	int i;
+
+	for (i = 0; i < XE_MAX_TILES_PER_DEVICE; ++i) {
+		if (!vops->pt_update_ops[i].num_ops)
+			continue;
+
+		vops->pt_update_ops[i].ops =
+			kmalloc_array(vops->pt_update_ops[i].num_ops,
+				      sizeof(*vops->pt_update_ops[i].ops),
+				      GFP_KERNEL);
+		if (!vops->pt_update_ops[i].ops)
+			return -ENOMEM;
+	}
+
+	return 0;
+}
+
+void xe_vma_ops_free(struct xe_vma_ops *vops)
+{
+	int i;
+
+	for (i = 0; i < XE_MAX_TILES_PER_DEVICE; ++i)
+		kfree(vops->pt_update_ops[i].ops);
+}
+
+/**
+ * xe_vm_populate_dummy_rebind() - Populate dummy rebind VMA ops
+ * @vm: The VM.
+ * @vma: VMA to create the dummy ops for
+ * @tile_mask: tile mask for VMA ops
+ *
+ * Populate the VM's dummy VMA ops, which can then be used to issue a
+ * rebind of @vma.
+ *
+ * Return: 0 on success, -ENOMEM on failure
+ */
+int xe_vm_populate_dummy_rebind(struct xe_vm *vm, struct xe_vma *vma,
+				u8 tile_mask)
+{
+	int i;
+
+	for (i = 0; i < XE_MAX_TILES_PER_DEVICE; ++i) {
+		if (BIT(i) & tile_mask) {
+			struct xe_vm_pgtable_update_op *pt_op =
+				vm->dummy_ops.vops.pt_update_ops[i].ops;
+
+			memset(&vm->dummy_ops.vops.pt_update_ops[i], 0,
+			       sizeof(vm->dummy_ops.vops.pt_update_ops[i]));
+			vm->dummy_ops.vops.pt_update_ops[i].ops = pt_op;
+			vm->dummy_ops.vops.pt_update_ops[i].num_ops = 1;
+
+			/*
+			 * Wait for VM to be idle / schedule execs + resume
+			 * behind rebinds
+			 */
+			vm->dummy_ops.vops.pt_update_ops[i].wait_vm_bookkeep =
+				true;
+		} else {
+			vm->dummy_ops.vops.pt_update_ops[i].num_ops = 0;
+		}
+	}
+	vm->dummy_ops.op.base.op = DRM_GPUVA_OP_MAP;
+	vm->dummy_ops.op.base.map.va.addr = vma->gpuva.va.addr;
+	vm->dummy_ops.op.base.map.va.range = vma->gpuva.va.range;
+	vm->dummy_ops.op.base.map.gem.obj = vma->gpuva.gem.obj;
+	vm->dummy_ops.op.base.map.gem.offset = vma->gpuva.gem.offset;
+	vm->dummy_ops.op.tile_mask = vma->tile_mask;
+	vm->dummy_ops.op.map.vma = vma;
+	vm->dummy_ops.op.map.immediate = true;
+	vm->dummy_ops.op.map.dumpable = vma->gpuva.flags & XE_VMA_DUMPABLE;
+	vm->dummy_ops.op.map.is_null = xe_vma_is_null(vma);
+
+	return xe_vma_ops_alloc(&vm->dummy_ops.vops);
+}
 
-struct dma_fence *xe_vm_rebind(struct xe_vm *vm, bool rebind_worker)
+int xe_vm_rebind(struct xe_vm *vm, bool rebind_worker)
 {
 	struct dma_fence *fence = NULL;
 	struct xe_vma *vma, *next;
+	int err;
 
 	lockdep_assert_held(&vm->lock);
 	if (xe_vm_in_lr_mode(vm) && !rebind_worker)
-		return NULL;
+		return 0;
 
 	xe_vm_assert_held(vm);
 	list_for_each_entry_safe(vma, next, &vm->rebind_list,
@@ -776,12 +859,19 @@ struct dma_fence *xe_vm_rebind(struct xe_vm *vm, bool rebind_worker)
 			trace_xe_vma_rebind_worker(vma);
 		else
 			trace_xe_vma_rebind_exec(vma);
-		fence = xe_vm_bind_vma(vma, NULL, NULL, 0, false, false);
+
+		err = xe_vm_populate_dummy_rebind(vm, vma, vma->tile_present);
+		if (err)
+			return err;
+
+		fence = xe_vm_ops_execute(vm, &vm->dummy_ops.vops);
+		xe_vma_ops_free(&vm->dummy_ops.vops);
 		if (IS_ERR(fence))
-			return fence;
+			return PTR_ERR(fence);
 	}
 
-	return fence;
+	dma_fence_put(fence);
+	return 0;
 }
 
 static void xe_vma_free(struct xe_vma *vma)
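The xe_vma_ops_alloc()/xe_vma_ops_free() pair added above relies on kfree(NULL) being a no-op, so a partial allocation failure can be unwound by one unconditional free loop. A plain-C sketch under those assumptions (sizes and names are illustrative):

```c
#include <assert.h>
#include <stdlib.h>

#define SKETCH_TILES 4	/* stand-in for XE_MAX_TILES_PER_DEVICE */

struct tile_ops {
	void *ops;
	int num_ops;
};

/* Allocation may fail partway through the tiles; the caller then runs
 * vops_free(), which walks every slot -- free(NULL) is a no-op. */
static int vops_alloc(struct tile_ops tiles[SKETCH_TILES], size_t elem_size)
{
	for (int i = 0; i < SKETCH_TILES; ++i) {
		if (!tiles[i].num_ops)
			continue;
		tiles[i].ops = calloc(tiles[i].num_ops, elem_size);
		if (!tiles[i].ops)
			return -1;
	}
	return 0;
}

static void vops_free(struct tile_ops tiles[SKETCH_TILES])
{
	for (int i = 0; i < SKETCH_TILES; ++i) {
		free(tiles[i].ops);
		tiles[i].ops = NULL;
	}
}
```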
@@ -1285,6 +1375,15 @@ static void xe_vm_free_scratch(struct xe_vm *vm)
 	}
 }
 
+static void xe_vma_ops_incr_pt_update_ops(struct xe_vma_ops *vops, u8 tile_mask)
+{
+	int i;
+
+	for (i = 0; i < XE_MAX_TILES_PER_DEVICE; ++i)
+		if (BIT(i) & tile_mask)
+			++vops->pt_update_ops[i].num_ops;
+}
+
 struct xe_vm *xe_vm_create(struct xe_device *xe, u32 flags)
 {
 	struct drm_gem_object *vm_resv_obj;
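The helper above counts one pending PT-update op per tile selected by the mask, so the later per-tile allocation knows exactly how many ops to reserve. A standalone sketch of that bit-mask accounting (illustrative names):

```c
#include <assert.h>

#define NUM_TILES 4	/* stand-in for XE_MAX_TILES_PER_DEVICE */

/* Each set bit in tile_mask selects one tile whose pending op count is
 * bumped, mirroring xe_vma_ops_incr_pt_update_ops(). */
static void incr_pt_update_ops(int num_ops[NUM_TILES], unsigned int tile_mask)
{
	for (int i = 0; i < NUM_TILES; ++i)
		if ((1u << i) & tile_mask)
			++num_ops[i];
}
```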
@@ -1306,6 +1405,12 @@ struct xe_vm *xe_vm_create(struct xe_device *xe, u32 flags)
 	init_rwsem(&vm->lock);
 	mutex_init(&vm->snap_mutex);
 
+	xe_vma_ops_init(&vm->dummy_ops.vops, vm, NULL, NULL, 0);
+	INIT_LIST_HEAD(&vm->dummy_ops.op.link);
+	list_add(&vm->dummy_ops.op.link, &vm->dummy_ops.vops.list);
+	for (id = 0; id < XE_MAX_TILES_PER_DEVICE; ++id)
+		vm->dummy_ops.vops.pt_update_ops[id].num_ops = 1;
+
 	INIT_LIST_HEAD(&vm->rebind_list);
 
 	INIT_LIST_HEAD(&vm->userptr.repin_list);
@@ -1381,32 +1486,20 @@ struct xe_vm *xe_vm_create(struct xe_device *xe, u32 flags)
 			continue;
 
 		xe_pt_populate_empty(tile, vm, vm->pt_root[id]);
+		number_tiles++;
 	}
 	dma_resv_unlock(xe_vm_resv(vm));
 
 	/* Kernel migration VM shouldn't have a circular loop.. */
 	if (!(flags & XE_VM_FLAG_MIGRATION)) {
-		for_each_tile(tile, xe, id) {
-			struct xe_gt *gt = tile->primary_gt;
-			struct xe_vm *migrate_vm;
-			struct xe_exec_queue *q;
-			u32 create_flags = EXEC_QUEUE_FLAG_VM;
+		struct xe_exec_queue *q;
 
-			if (!vm->pt_root[id])
-				continue;
-
-			migrate_vm = xe_migrate_get_vm(tile->migrate);
-			q = xe_exec_queue_create_class(xe, gt, migrate_vm,
-						       XE_ENGINE_CLASS_COPY,
-						       create_flags);
-			xe_vm_put(migrate_vm);
-			if (IS_ERR(q)) {
-				err = PTR_ERR(q);
-				goto err_close;
-			}
-			vm->q[id] = q;
-			number_tiles++;
+		q = xe_pt_exec_queue_create(xe);
+		if (IS_ERR(q)) {
+			err = PTR_ERR(q);
+			goto err_close;
 		}
+		vm->q = q;
 	}
 
 	if (number_tiles > 1)
@@ -1430,12 +1523,12 @@ struct xe_vm *xe_vm_create(struct xe_device *xe, u32 flags)
 	return ERR_PTR(err);
 
 err_no_resv:
-	mutex_destroy(&vm->snap_mutex);
+	if (!(flags & XE_VM_FLAG_MIGRATION))
+		xe_device_mem_access_put(xe);
 	for_each_tile(tile, xe, id)
 		xe_range_fence_tree_fini(&vm->rftree[id]);
+	mutex_destroy(&vm->snap_mutex);
 	kfree(vm);
-	if (!(flags & XE_VM_FLAG_MIGRATION))
-		xe_device_mem_access_put(xe);
 	return ERR_PTR(err);
 }
 
@@ -1461,19 +1554,13 @@ void xe_vm_close_and_put(struct xe_vm *vm)
 	if (xe_vm_in_preempt_fence_mode(vm))
 		flush_work(&vm->preempt.rebind_work);
 
-	down_write(&vm->lock);
-	for_each_tile(tile, xe, id) {
-		if (vm->q[id])
-			xe_exec_queue_last_fence_put(vm->q[id], vm);
-	}
-	up_write(&vm->lock);
+	if (vm->q) {
+		down_write(&vm->lock);
+		xe_exec_queue_last_fence_put(vm->q, vm);
+		up_write(&vm->lock);
 
-	for_each_tile(tile, xe, id) {
-		if (vm->q[id]) {
-			xe_exec_queue_kill(vm->q[id]);
-			xe_exec_queue_put(vm->q[id]);
-			vm->q[id] = NULL;
-		}
+		xe_exec_queue_kill(vm->q);
+		xe_exec_queue_put(vm->q);
 	}
 
 	down_write(&vm->lock);
@@ -1572,7 +1659,6 @@ static void vm_destroy_work_func(struct work_struct *w)
 		XE_WARN_ON(vm->pt_root[id]);
 
 	trace_xe_vm_free(vm);
-	dma_fence_put(vm->rebind_fence);
 	kfree(vm);
 }
 
@@ -1606,168 +1692,7 @@ u64 xe_vm_pdp4_descriptor(struct xe_vm *vm, struct xe_tile *tile)
 static struct xe_exec_queue *
 to_wait_exec_queue(struct xe_vm *vm, struct xe_exec_queue *q)
 {
-	return q ? q : vm->q[0];
-}
-
-static struct dma_fence *
-xe_vm_unbind_vma(struct xe_vma *vma, struct xe_exec_queue *q,
-		 struct xe_sync_entry *syncs, u32 num_syncs,
-		 bool first_op, bool last_op)
-{
-	struct xe_vm *vm = xe_vma_vm(vma);
-	struct xe_exec_queue *wait_exec_queue = to_wait_exec_queue(vm, q);
-	struct xe_tile *tile;
-	struct dma_fence *fence = NULL;
-	struct dma_fence **fences = NULL;
-	struct dma_fence_array *cf = NULL;
-	int cur_fence = 0, i;
-	int number_tiles = hweight8(vma->tile_present);
-	int err;
-	u8 id;
-
-	trace_xe_vma_unbind(vma);
-
-	if (vma->ufence) {
-		struct xe_user_fence * const f = vma->ufence;
-
-		if (!xe_sync_ufence_get_status(f))
-			return ERR_PTR(-EBUSY);
-
-		vma->ufence = NULL;
-		xe_sync_ufence_put(f);
-	}
-
-	if (number_tiles > 1) {
-		fences = kmalloc_array(number_tiles, sizeof(*fences),
-				       GFP_KERNEL);
-		if (!fences)
-			return ERR_PTR(-ENOMEM);
-	}
-
-	for_each_tile(tile, vm->xe, id) {
-		if (!(vma->tile_present & BIT(id)))
-			goto next;
-
-		fence = __xe_pt_unbind_vma(tile, vma, q ? q : vm->q[id],
-					   first_op ? syncs : NULL,
-					   first_op ? num_syncs : 0);
-		if (IS_ERR(fence)) {
-			err = PTR_ERR(fence);
-			goto err_fences;
-		}
-
-		if (fences)
-			fences[cur_fence++] = fence;
-
-next:
-		if (q && vm->pt_root[id] && !list_empty(&q->multi_gt_list))
-			q = list_next_entry(q, multi_gt_list);
-	}
-
-	if (fences) {
-		cf = dma_fence_array_create(number_tiles, fences,
-					    vm->composite_fence_ctx,
-					    vm->composite_fence_seqno++,
-					    false);
-		if (!cf) {
-			--vm->composite_fence_seqno;
-			err = -ENOMEM;
-			goto err_fences;
-		}
-	}
-
-	fence = cf ? &cf->base : !fence ?
-		xe_exec_queue_last_fence_get(wait_exec_queue, vm) : fence;
-	if (last_op) {
-		for (i = 0; i < num_syncs; i++)
-			xe_sync_entry_signal(&syncs[i], NULL, fence);
-	}
-
-	return fence;
-
-err_fences:
-	if (fences) {
-		while (cur_fence)
-			dma_fence_put(fences[--cur_fence]);
-		kfree(fences);
-	}
-
-	return ERR_PTR(err);
-}
-
-static struct dma_fence *
-xe_vm_bind_vma(struct xe_vma *vma, struct xe_exec_queue *q,
-	       struct xe_sync_entry *syncs, u32 num_syncs,
-	       bool first_op, bool last_op)
-{
-	struct xe_tile *tile;
-	struct dma_fence *fence;
-	struct dma_fence **fences = NULL;
-	struct dma_fence_array *cf = NULL;
-	struct xe_vm *vm = xe_vma_vm(vma);
-	int cur_fence = 0, i;
-	int number_tiles = hweight8(vma->tile_mask);
-	int err;
-	u8 id;
-
-	trace_xe_vma_bind(vma);
-
-	if (number_tiles > 1) {
-		fences = kmalloc_array(number_tiles, sizeof(*fences),
-				       GFP_KERNEL);
-		if (!fences)
-			return ERR_PTR(-ENOMEM);
-	}
-
-	for_each_tile(tile, vm->xe, id) {
-		if (!(vma->tile_mask & BIT(id)))
-			goto next;
-
-		fence = __xe_pt_bind_vma(tile, vma, q ? q : vm->q[id],
-					 first_op ? syncs : NULL,
-					 first_op ? num_syncs : 0,
-					 vma->tile_present & BIT(id));
-		if (IS_ERR(fence)) {
-			err = PTR_ERR(fence);
-			goto err_fences;
-		}
-
-		if (fences)
-			fences[cur_fence++] = fence;
-
-next:
-		if (q && vm->pt_root[id] && !list_empty(&q->multi_gt_list))
-			q = list_next_entry(q, multi_gt_list);
-	}
-
-	if (fences) {
-		cf = dma_fence_array_create(number_tiles, fences,
-					    vm->composite_fence_ctx,
-					    vm->composite_fence_seqno++,
-					    false);
-		if (!cf) {
-			--vm->composite_fence_seqno;
-			err = -ENOMEM;
-			goto err_fences;
-		}
-	}
-
-	if (last_op) {
-		for (i = 0; i < num_syncs; i++)
-			xe_sync_entry_signal(&syncs[i], NULL,
-					     cf ? &cf->base : fence);
-	}
-
-	return cf ? &cf->base : fence;
-
-err_fences:
-	if (fences) {
-		while (cur_fence)
-			dma_fence_put(fences[--cur_fence]);
-		kfree(fences);
-	}
-
-	return ERR_PTR(err);
+	return q ? q : vm->q;
 }
 
 static struct xe_user_fence *
@@ -1785,89 +1710,6 @@ find_ufence_get(struct xe_sync_entry *syncs, u32 num_syncs)
 	return NULL;
 }
 
-static int __xe_vm_bind(struct xe_vm *vm, struct xe_vma *vma,
-			struct xe_exec_queue *q, struct xe_sync_entry *syncs,
-			u32 num_syncs, bool immediate, bool first_op,
-			bool last_op)
-{
-	struct dma_fence *fence;
-	struct xe_exec_queue *wait_exec_queue = to_wait_exec_queue(vm, q);
-	struct xe_user_fence *ufence;
-
-	xe_vm_assert_held(vm);
-
-	ufence = find_ufence_get(syncs, num_syncs);
-	if (vma->ufence && ufence)
-		xe_sync_ufence_put(vma->ufence);
-
-	vma->ufence = ufence ?: vma->ufence;
-
-	if (immediate) {
-		fence = xe_vm_bind_vma(vma, q, syncs, num_syncs, first_op,
-				       last_op);
-		if (IS_ERR(fence))
-			return PTR_ERR(fence);
-	} else {
-		int i;
-
-		xe_assert(vm->xe, xe_vm_in_fault_mode(vm));
-
-		fence = xe_exec_queue_last_fence_get(wait_exec_queue, vm);
-		if (last_op) {
-			for (i = 0; i < num_syncs; i++)
-				xe_sync_entry_signal(&syncs[i], NULL, fence);
-		}
-	}
-
-	if (last_op)
-		xe_exec_queue_last_fence_set(wait_exec_queue, vm, fence);
-	dma_fence_put(fence);
-
-	return 0;
-}
-
-static int xe_vm_bind(struct xe_vm *vm, struct xe_vma *vma, struct xe_exec_queue *q,
-		      struct xe_bo *bo, struct xe_sync_entry *syncs,
-		      u32 num_syncs, bool immediate, bool first_op,
-		      bool last_op)
-{
-	int err;
-
-	xe_vm_assert_held(vm);
-	xe_bo_assert_held(bo);
-
-	if (bo && immediate) {
-		err = xe_bo_validate(bo, vm, true);
-		if (err)
-			return err;
-	}
-
-	return __xe_vm_bind(vm, vma, q, syncs, num_syncs, immediate, first_op,
-			    last_op);
-}
-
-static int xe_vm_unbind(struct xe_vm *vm, struct xe_vma *vma,
-			struct xe_exec_queue *q, struct xe_sync_entry *syncs,
-			u32 num_syncs, bool first_op, bool last_op)
-{
-	struct dma_fence *fence;
-	struct xe_exec_queue *wait_exec_queue = to_wait_exec_queue(vm, q);
-
-	xe_vm_assert_held(vm);
-	xe_bo_assert_held(xe_vma_bo(vma));
-
-	fence = xe_vm_unbind_vma(vma, q, syncs, num_syncs, first_op, last_op);
-	if (IS_ERR(fence))
-		return PTR_ERR(fence);
-
-	xe_vma_destroy(vma, fence);
-	if (last_op)
-		xe_exec_queue_last_fence_set(wait_exec_queue, vm, fence);
-	dma_fence_put(fence);
-
-	return 0;
-}
-
 #define ALL_DRM_XE_VM_CREATE_FLAGS (DRM_XE_VM_CREATE_FLAG_SCRATCH_PAGE | \
 				    DRM_XE_VM_CREATE_FLAG_LR_MODE | \
 				    DRM_XE_VM_CREATE_FLAG_FAULT_MODE)
@@ -2008,43 +1850,6 @@ static const u32 region_to_mem_type[] = {
 	XE_PL_VRAM1,
 };
 
-static int xe_vm_prefetch(struct xe_vm *vm, struct xe_vma *vma,
-			  struct xe_exec_queue *q, u32 region,
-			  struct xe_sync_entry *syncs, u32 num_syncs,
-			  bool first_op, bool last_op)
-{
-	struct xe_exec_queue *wait_exec_queue = to_wait_exec_queue(vm, q);
-	int err;
-
-	xe_assert(vm->xe, region <= ARRAY_SIZE(region_to_mem_type));
-
-	if (!xe_vma_has_no_bo(vma)) {
-		err = xe_bo_migrate(xe_vma_bo(vma), region_to_mem_type[region]);
-		if (err)
-			return err;
-	}
-
-	if (vma->tile_mask != (vma->tile_present & ~vma->usm.tile_invalidated)) {
-		return xe_vm_bind(vm, vma, q, xe_vma_bo(vma), syncs, num_syncs,
-				  true, first_op, last_op);
-	} else {
-		int i;
-
-		/* Nothing to do, signal fences now */
-		if (last_op) {
-			for (i = 0; i < num_syncs; i++) {
-				struct dma_fence *fence =
-					xe_exec_queue_last_fence_get(wait_exec_queue, vm);
-
-				xe_sync_entry_signal(&syncs[i], NULL, fence);
-				dma_fence_put(fence);
-			}
-		}
-
-		return 0;
-	}
-}
-
 static void prep_vma_destroy(struct xe_vm *vm, struct xe_vma *vma,
 			     bool post_commit)
 {
@@ -2168,6 +1973,7 @@ vm_bind_ioctl_ops_create(struct xe_vm *vm, struct xe_bo *bo,
 		struct xe_vma_op *op = gpuva_op_to_vma_op(__op);
 
 		if (__op->op == DRM_GPUVA_OP_MAP) {
+			op->map.immediate = !xe_vm_in_fault_mode(vm);
 			op->map.is_null = flags & DRM_XE_VM_BIND_FLAG_NULL;
 			op->map.dumpable = flags & DRM_XE_VM_BIND_FLAG_DUMPABLE;
 			op->map.pat_index = pat_index;
@@ -2329,35 +2135,30 @@ static int xe_vma_op_commit(struct xe_vm *vm, struct xe_vma_op *op)
 	return err;
 }
 
-
 static int vm_bind_ioctl_ops_parse(struct xe_vm *vm, struct xe_exec_queue *q,
 				   struct drm_gpuva_ops *ops,
 				   struct xe_sync_entry *syncs, u32 num_syncs,
-				   struct list_head *ops_list, bool last)
+				   struct xe_vma_ops *vops, bool last)
 {
 	struct xe_device *xe = vm->xe;
-	struct xe_vma_op *last_op = NULL;
 	struct drm_gpuva_op *__op;
+	struct xe_tile *tile;
+	u8 id, tile_mask = 0;
 	int err = 0;
 
 	lockdep_assert_held_write(&vm->lock);
 
+	for_each_tile(tile, vm->xe, id)
+		tile_mask |= 0x1 << id;
+
 	drm_gpuva_for_each_op(__op, ops) {
 		struct xe_vma_op *op = gpuva_op_to_vma_op(__op);
 		struct xe_vma *vma;
-		bool first = list_empty(ops_list);
 		unsigned int flags = 0;
 
 		INIT_LIST_HEAD(&op->link);
-		list_add_tail(&op->link, ops_list);
-
-		if (first) {
-			op->flags |= XE_VMA_OP_FIRST;
-			op->num_syncs = num_syncs;
-			op->syncs = syncs;
-		}
-
-		op->q = q;
+		list_add_tail(&op->link, &vops->list);
+		op->tile_mask = tile_mask;
 
 		switch (op->base.op) {
 		case DRM_GPUVA_OP_MAP:
@@ -2373,6 +2174,9 @@ static int vm_bind_ioctl_ops_parse(struct xe_vm *vm, struct xe_exec_queue *q,
 				return PTR_ERR(vma);
 
 			op->map.vma = vma;
+			if (op->map.immediate || !xe_vm_in_fault_mode(vm))
+				xe_vma_ops_incr_pt_update_ops(vops,
+							      op->tile_mask);
 			break;
 		}
 		case DRM_GPUVA_OP_REMAP:
@@ -2417,6 +2221,8 @@ static int vm_bind_ioctl_ops_parse(struct xe_vm *vm, struct xe_exec_queue *q,
 					vm_dbg(&xe->drm, "REMAP:SKIP_PREV: addr=0x%016llx, range=0x%016llx",
 					       (ULL)op->remap.start,
 					       (ULL)op->remap.range);
+				} else {
+					xe_vma_ops_incr_pt_update_ops(vops, op->tile_mask);
 				}
 			}
 
@@ -2453,228 +2259,30 @@ static int vm_bind_ioctl_ops_parse(struct xe_vm *vm, struct xe_exec_queue *q,
 					vm_dbg(&xe->drm, "REMAP:SKIP_NEXT: addr=0x%016llx, range=0x%016llx",
 					       (ULL)op->remap.start,
 					       (ULL)op->remap.range);
+				} else {
+					xe_vma_ops_incr_pt_update_ops(vops, op->tile_mask);
 				}
 			}
+			xe_vma_ops_incr_pt_update_ops(vops, op->tile_mask);
 			break;
 		}
 		case DRM_GPUVA_OP_UNMAP:
 		case DRM_GPUVA_OP_PREFETCH:
-			/* Nothing to do */
+			/* FIXME: Need to skip some prefetch ops */
+			xe_vma_ops_incr_pt_update_ops(vops, op->tile_mask);
 			break;
 		default:
 			drm_warn(&vm->xe->drm, "NOT POSSIBLE");
 		}
 
-		last_op = op;
-
 		err = xe_vma_op_commit(vm, op);
 		if (err)
 			return err;
 	}
 
-	/* FIXME: Unhandled corner case */
-	XE_WARN_ON(!last_op && last && !list_empty(ops_list));
-
-	if (!last_op)
-		return 0;
-
-	last_op->ops = ops;
-	if (last) {
-		last_op->flags |= XE_VMA_OP_LAST;
-		last_op->num_syncs = num_syncs;
-		last_op->syncs = syncs;
-	}
-
 	return 0;
 }
 
-static int op_execute(struct drm_exec *exec, struct xe_vm *vm,
-		      struct xe_vma *vma, struct xe_vma_op *op)
-{
-	int err;
-
-	lockdep_assert_held_write(&vm->lock);
-
-	err = xe_vm_prepare_vma(exec, vma, 1);
-	if (err)
-		return err;
-
-	xe_vm_assert_held(vm);
-	xe_bo_assert_held(xe_vma_bo(vma));
-
-	switch (op->base.op) {
-	case DRM_GPUVA_OP_MAP:
-		err = xe_vm_bind(vm, vma, op->q, xe_vma_bo(vma),
-				 op->syncs, op->num_syncs,
-				 !xe_vm_in_fault_mode(vm),
-				 op->flags & XE_VMA_OP_FIRST,
-				 op->flags & XE_VMA_OP_LAST);
-		break;
-	case DRM_GPUVA_OP_REMAP:
-	{
-		bool prev = !!op->remap.prev;
-		bool next = !!op->remap.next;
-
-		if (!op->remap.unmap_done) {
-			if (prev || next)
-				vma->gpuva.flags |= XE_VMA_FIRST_REBIND;
-			err = xe_vm_unbind(vm, vma, op->q, op->syncs,
-					   op->num_syncs,
-					   op->flags & XE_VMA_OP_FIRST,
-					   op->flags & XE_VMA_OP_LAST &&
-					   !prev && !next);
-			if (err)
-				break;
-			op->remap.unmap_done = true;
-		}
-
-		if (prev) {
-			op->remap.prev->gpuva.flags |= XE_VMA_LAST_REBIND;
-			err = xe_vm_bind(vm, op->remap.prev, op->q,
-					 xe_vma_bo(op->remap.prev), op->syncs,
-					 op->num_syncs, true, false,
-					 op->flags & XE_VMA_OP_LAST && !next);
-			op->remap.prev->gpuva.flags &= ~XE_VMA_LAST_REBIND;
-			if (err)
-				break;
-			op->remap.prev = NULL;
-		}
-
-		if (next) {
-			op->remap.next->gpuva.flags |= XE_VMA_LAST_REBIND;
-			err = xe_vm_bind(vm, op->remap.next, op->q,
-					 xe_vma_bo(op->remap.next),
-					 op->syncs, op->num_syncs,
-					 true, false,
-					 op->flags & XE_VMA_OP_LAST);
-			op->remap.next->gpuva.flags &= ~XE_VMA_LAST_REBIND;
-			if (err)
-				break;
-			op->remap.next = NULL;
-		}
-
-		break;
-	}
-	case DRM_GPUVA_OP_UNMAP:
-		err = xe_vm_unbind(vm, vma, op->q, op->syncs,
-				   op->num_syncs, op->flags & XE_VMA_OP_FIRST,
-				   op->flags & XE_VMA_OP_LAST);
-		break;
-	case DRM_GPUVA_OP_PREFETCH:
-		err = xe_vm_prefetch(vm, vma, op->q, op->prefetch.region,
-				     op->syncs, op->num_syncs,
-				     op->flags & XE_VMA_OP_FIRST,
-				     op->flags & XE_VMA_OP_LAST);
-		break;
-	default:
-		drm_warn(&vm->xe->drm, "NOT POSSIBLE");
-	}
-
-	if (err)
-		trace_xe_vma_fail(vma);
-
-	return err;
-}
-
-static int __xe_vma_op_execute(struct xe_vm *vm, struct xe_vma *vma,
-			       struct xe_vma_op *op)
-{
-	struct drm_exec exec;
-	int err;
-
-retry_userptr:
-	drm_exec_init(&exec, DRM_EXEC_INTERRUPTIBLE_WAIT, 0);
-	drm_exec_until_all_locked(&exec) {
-		err = op_execute(&exec, vm, vma, op);
-		drm_exec_retry_on_contention(&exec);
-		if (err)
-			break;
-	}
-	drm_exec_fini(&exec);
-
-	if (err == -EAGAIN) {
-		lockdep_assert_held_write(&vm->lock);
-
-		if (op->base.op == DRM_GPUVA_OP_REMAP) {
-			if (!op->remap.unmap_done)
-				vma = gpuva_to_vma(op->base.remap.unmap->va);
-			else if (op->remap.prev)
-				vma = op->remap.prev;
-			else
-				vma = op->remap.next;
-		}
-
-		if (xe_vma_is_userptr(vma)) {
-			err = xe_vma_userptr_pin_pages(to_userptr_vma(vma));
-			if (!err)
-				goto retry_userptr;
-
-			trace_xe_vma_fail(vma);
-		}
-	}
-
-	return err;
-}
-
-static int xe_vma_op_execute(struct xe_vm *vm, struct xe_vma_op *op)
-{
-	int ret = 0;
-
-	lockdep_assert_held_write(&vm->lock);
-
-	switch (op->base.op) {
-	case DRM_GPUVA_OP_MAP:
-		ret = __xe_vma_op_execute(vm, op->map.vma, op);
-		break;
-	case DRM_GPUVA_OP_REMAP:
-	{
-		struct xe_vma *vma;
-
-		if (!op->remap.unmap_done)
-			vma = gpuva_to_vma(op->base.remap.unmap->va);
-		else if (op->remap.prev)
-			vma = op->remap.prev;
-		else
-			vma = op->remap.next;
-
-		ret = __xe_vma_op_execute(vm, vma, op);
-		break;
-	}
-	case DRM_GPUVA_OP_UNMAP:
-		ret = __xe_vma_op_execute(vm, gpuva_to_vma(op->base.unmap.va),
-					  op);
-		break;
-	case DRM_GPUVA_OP_PREFETCH:
-		ret = __xe_vma_op_execute(vm,
-					  gpuva_to_vma(op->base.prefetch.va),
-					  op);
-		break;
-	default:
-		drm_warn(&vm->xe->drm, "NOT POSSIBLE");
-	}
-
-	return ret;
-}
-
-static void xe_vma_op_cleanup(struct xe_vm *vm, struct xe_vma_op *op)
-{
-	bool last = op->flags & XE_VMA_OP_LAST;
-
-	if (last) {
-		while (op->num_syncs--)
-			xe_sync_entry_cleanup(&op->syncs[op->num_syncs]);
-		kfree(op->syncs);
-		if (op->q)
-			xe_exec_queue_put(op->q);
-	}
-	if (!list_empty(&op->link))
-		list_del(&op->link);
-	if (op->ops)
-		drm_gpuva_ops_free(&vm->gpuvm, op->ops);
-	if (last)
-		xe_vm_put(vm);
-}
-
 static void xe_vma_op_unwind(struct xe_vm *vm, struct xe_vma_op *op,
 			     bool post_commit, bool prev_post_commit,
 			     bool next_post_commit)
@@ -2751,38 +2359,354 @@ static void vm_bind_ioctl_ops_unwind(struct xe_vm *vm,
 					 op->flags & XE_VMA_OP_PREV_COMMITTED,
 					 op->flags & XE_VMA_OP_NEXT_COMMITTED);
 		}
+	}
+}
+
+static int vma_lock(struct drm_exec *exec, struct xe_vma *vma, bool validate)
+{
+	struct xe_bo *bo = xe_vma_bo(vma);
+	int err = 0;
+
+	if (bo) {
+		if (!bo->vm)
+			err = drm_exec_prepare_obj(exec, &bo->ttm.base, 1);
+		if (!err && validate)
+			err = xe_bo_validate(bo, xe_vma_vm(vma), true);
+	}
+
+	return err;
+}
+
+static int check_ufence(struct xe_vma *vma)
+{
+	if (vma->ufence) {
+		struct xe_user_fence * const f = vma->ufence;
+
+		if (!xe_sync_ufence_get_status(f))
+			return -EBUSY;
+
+		vma->ufence = NULL;
+		xe_sync_ufence_put(f);
+	}
+
+	return 0;
+}
+
+static int op_lock(struct drm_exec *exec, struct xe_vm *vm,
+		   struct xe_vma_op *op)
+{
+	int err = 0;
+
+	switch (op->base.op) {
+	case DRM_GPUVA_OP_MAP:
+		err = vma_lock(exec, op->map.vma, !xe_vm_in_fault_mode(vm));
+		break;
+	case DRM_GPUVA_OP_REMAP:
+		err = check_ufence(gpuva_to_vma(op->base.remap.unmap->va));
+		if (err)
+			break;
+
+		err = vma_lock(exec, gpuva_to_vma(op->base.remap.unmap->va),
+			       false);
+		if (!err && op->remap.prev)
+			err = vma_lock(exec, op->remap.prev, true);
+		if (!err && op->remap.next)
+			err = vma_lock(exec, op->remap.next, true);
+		break;
+	case DRM_GPUVA_OP_UNMAP:
+		err = check_ufence(gpuva_to_vma(op->base.unmap.va));
+		if (err)
+			break;
+
+		err = vma_lock(exec, gpuva_to_vma(op->base.unmap.va), false);
+		break;
+	case DRM_GPUVA_OP_PREFETCH:
+	{
+		struct xe_vma *vma = gpuva_to_vma(op->base.prefetch.va);
+		u32 region = op->prefetch.region;
+
+		xe_assert(vm->xe, region <= ARRAY_SIZE(region_to_mem_type));
+
+		err = vma_lock(exec, vma, false);
+		if (!err && !xe_vma_has_no_bo(vma))
+			err = xe_bo_migrate(xe_vma_bo(vma), region);
+		break;
+	}
+	default:
+		drm_warn(&vm->xe->drm, "NOT POSSIBLE");
+	}
+
+	return err;
+}
+
+static int vm_bind_ioctl_ops_lock(struct drm_exec *exec,
+				  struct xe_vm *vm,
+				  struct xe_vma_ops *vops)
+{
+	struct xe_vma_op *op;
+	int err;
+
+	err = drm_exec_prepare_obj(exec, xe_vm_obj(vm), 1);
+	if (err)
+		return err;
+
+	list_for_each_entry(op, &vops->list, link) {
+		err = op_lock(exec, vm, op);
+		if (err)
+			return err;
+	}
+
+#ifdef TEST_VM_OPS_ERROR
+	if (vops->inject_error &&
+	    vm->xe->vm_inject_error_position == FORCE_OP_ERROR_LOCK)
+		return -ENOSPC;
+#endif
+
+	return 0;
+}
+
+static void op_trace(struct xe_vma_op *op)
+{
+	switch (op->base.op) {
+	case DRM_GPUVA_OP_MAP:
+		trace_xe_vma_bind(op->map.vma);
+		break;
+	case DRM_GPUVA_OP_REMAP:
+		trace_xe_vma_unbind(gpuva_to_vma(op->base.remap.unmap->va));
+		if (op->remap.prev)
+			trace_xe_vma_bind(op->remap.prev);
+		if (op->remap.next)
+			trace_xe_vma_bind(op->remap.next);
+		break;
+	case DRM_GPUVA_OP_UNMAP:
+		trace_xe_vma_unbind(gpuva_to_vma(op->base.unmap.va));
+		break;
+	case DRM_GPUVA_OP_PREFETCH:
+		trace_xe_vma_bind(gpuva_to_vma(op->base.prefetch.va));
+		break;
+	default:
+		XE_WARN_ON("NOT POSSIBLE");
+	}
+}
+
+static void trace_xe_vm_ops_execute(struct xe_vma_ops *vops)
+{
+	struct xe_vma_op *op;
+
+	list_for_each_entry(op, &vops->list, link)
+		op_trace(op);
+}
+
+static int vm_ops_setup_tile_args(struct xe_vm *vm, struct xe_vma_ops *vops)
+{
+	struct xe_tile *tile;
+	int number_tiles = 0;
+	u8 id;
+
+	for_each_tile(tile, vm->xe, id) {
+		if (vops->pt_update_ops[id].num_ops)
+			++number_tiles;
+
+		if (vops->pt_update_ops[id].q)
+			continue;
+
+		vops->pt_update_ops[id].q = vops->q ?: vm->q;
+	}
+
+	return number_tiles;
+}
+
+/**
+ * xe_vm_ops_execute() - Execute VMA ops
+ * @vm: The VM.
+ * @vops: VMA ops to execute
+ *
+ * Execute the VMA ops, binding / unbinding VMAs as required
+ *
+ * Return: A fence for the VMA ops on success, ERR_PTR on failure
+ */
+struct dma_fence *xe_vm_ops_execute(struct xe_vm *vm, struct xe_vma_ops *vops)
+{
+	struct xe_tile *tile;
+	struct dma_fence *fence = NULL;
+	struct dma_fence **fences = NULL;
+	struct dma_fence_array *cf = NULL;
+	int number_tiles = 0, current_fence = 0, err;
+	u8 id;
+
+	number_tiles = vm_ops_setup_tile_args(vm, vops);
+	if (number_tiles == 0)
+		return ERR_PTR(-ENODATA);
+
+	if (number_tiles > 1) {
+		fences = kmalloc_array(number_tiles, sizeof(*fences),
+				       GFP_KERNEL);
+		if (!fences) {
+			fence = ERR_PTR(-ENOMEM);
+			goto err_trace;
+		}
+	}
 
-		drm_gpuva_ops_free(&vm->gpuvm, __ops);
+	for_each_tile(tile, vm->xe, id) {
+		if (!vops->pt_update_ops[id].num_ops)
+			continue;
+
+		err = xe_pt_update_ops_prepare(tile, vops);
+		if (err) {
+			fence = ERR_PTR(err);
+			goto err_out;
+		}
+	}
+
+	trace_xe_vm_ops_execute(vops);
+
+	for_each_tile(tile, vm->xe, id) {
+		if (!vops->pt_update_ops[id].num_ops)
+			continue;
+
+		fence = xe_pt_update_ops_run(tile, vops);
+		if (IS_ERR(fence))
+			goto err_out;
+
+		if (fences)
+			fences[current_fence++] = fence;
+	}
+
+	if (fences) {
+		cf = dma_fence_array_create(number_tiles, fences,
+					    vm->composite_fence_ctx,
+					    vm->composite_fence_seqno++,
+					    false);
+		if (!cf) {
+			--vm->composite_fence_seqno;
+			fence = ERR_PTR(-ENOMEM);
+			goto err_out;
+		}
+		fence = &cf->base;
 	}
+
+	for_each_tile(tile, vm->xe, id) {
+		if (!vops->pt_update_ops[id].num_ops)
+			continue;
+
+		xe_pt_update_ops_fini(tile, vops);
+	}
+
+	return fence;
+
+err_out:
+	for_each_tile(tile, vm->xe, id) {
+		if (!vops->pt_update_ops[id].num_ops)
+			continue;
+
+		xe_pt_update_ops_abort(tile, vops);
+	}
+	while (current_fence)
+		dma_fence_put(fences[--current_fence]);
+	kfree(fences);
+	kfree(cf);
+
+err_trace:
+	trace_xe_vm_ops_fail(vm);
+	return fence;
+}
+
+static void vma_add_ufence(struct xe_vma *vma, struct xe_user_fence *ufence)
+{
+	if (vma->ufence)
+		xe_sync_ufence_put(vma->ufence);
+	vma->ufence = __xe_sync_ufence_get(ufence);
+}
+
+static void op_add_ufence(struct xe_vm *vm, struct xe_vma_op *op,
+			  struct xe_user_fence *ufence)
+{
+	switch (op->base.op) {
+	case DRM_GPUVA_OP_MAP:
+		vma_add_ufence(op->map.vma, ufence);
+		break;
+	case DRM_GPUVA_OP_REMAP:
+		if (op->remap.prev)
+			vma_add_ufence(op->remap.prev, ufence);
+		if (op->remap.next)
+			vma_add_ufence(op->remap.next, ufence);
+		break;
+	case DRM_GPUVA_OP_UNMAP:
+		break;
+	case DRM_GPUVA_OP_PREFETCH:
+		vma_add_ufence(gpuva_to_vma(op->base.prefetch.va), ufence);
+		break;
+	default:
+		drm_warn(&vm->xe->drm, "NOT POSSIBLE");
+	}
+}
+
+static void vm_bind_ioctl_ops_install_fences(struct xe_vm *vm,
+					     struct xe_vma_ops *vops,
+					     struct dma_fence *fence)
+{
+	struct xe_exec_queue *wait_exec_queue = to_wait_exec_queue(vm, vops->q);
+	struct xe_user_fence *ufence;
+	struct xe_vma_op *op;
+	int i;
+
+	ufence = find_ufence_get(vops->syncs, vops->num_syncs);
+	list_for_each_entry(op, &vops->list, link) {
+		if (ufence)
+			op_add_ufence(vm, op, ufence);
+
+		if (op->base.op == DRM_GPUVA_OP_UNMAP)
+			xe_vma_destroy(gpuva_to_vma(op->base.unmap.va), fence);
+		else if (op->base.op == DRM_GPUVA_OP_REMAP)
+			xe_vma_destroy(gpuva_to_vma(op->base.remap.unmap->va),
+				       fence);
+	}
+	if (ufence)
+		xe_sync_ufence_put(ufence);
+	for (i = 0; i < vops->num_syncs; i++)
+		xe_sync_entry_signal(vops->syncs + i, NULL, fence);
+	xe_exec_queue_last_fence_set(wait_exec_queue, vm, fence);
+	dma_fence_put(fence);
 }
 
 static int vm_bind_ioctl_ops_execute(struct xe_vm *vm,
-				     struct list_head *ops_list)
+				     struct xe_vma_ops *vops)
 {
-	struct xe_vma_op *op, *next;
+	struct drm_exec exec;
+	struct dma_fence *fence;
 	int err;
 
 	lockdep_assert_held_write(&vm->lock);
 
-	list_for_each_entry_safe(op, next, ops_list, link) {
-		err = xe_vma_op_execute(vm, op);
-		if (err) {
-			drm_warn(&vm->xe->drm, "VM op(%d) failed with %d",
-				 op->base.op, err);
-			/*
-			 * FIXME: Killing VM rather than proper error handling
-			 */
-			xe_vm_kill(vm);
-			return -ENOSPC;
+	drm_exec_init(&exec, DRM_EXEC_INTERRUPTIBLE_WAIT |
+		      DRM_EXEC_IGNORE_DUPLICATES, 0);
+	drm_exec_until_all_locked(&exec) {
+		err = vm_bind_ioctl_ops_lock(&exec, vm, vops);
+		drm_exec_retry_on_contention(&exec);
+		if (err)
+			goto unlock;
+
+		fence = xe_vm_ops_execute(vm, vops);
+		if (IS_ERR(fence)) {
+			err = PTR_ERR(fence);
+			goto unlock;
 		}
-		xe_vma_op_cleanup(vm, op);
+
+		vm_bind_ioctl_ops_install_fences(vm, vops, fence);
 	}
 
-	return 0;
+unlock:
+	drm_exec_fini(&exec);
+	return err;
 }
 
+#ifdef TEST_VM_OPS_ERROR
+#define SUPPORTED_FLAGS	(FORCE_OP_ERROR | DRM_XE_VM_BIND_FLAG_NULL | \
+	 DRM_XE_VM_BIND_FLAG_DUMPABLE)
+#else
 #define SUPPORTED_FLAGS	(DRM_XE_VM_BIND_FLAG_NULL | \
 	 DRM_XE_VM_BIND_FLAG_DUMPABLE)
+#endif
 #define XE_64K_PAGE_MASK 0xffffull
 #define ALL_DRM_XE_SYNCS_FLAGS (DRM_XE_SYNCS_FLAG_WAIT_FOR_OP)
 
@@ -2936,7 +2860,7 @@ int xe_vm_bind_ioctl(struct drm_device *dev, void *data, struct drm_file *file)
 	u32 num_syncs, num_ufence = 0;
 	struct xe_sync_entry *syncs = NULL;
 	struct drm_xe_vm_bind_op *bind_ops;
-	LIST_HEAD(ops_list);
+	struct xe_vma_ops vops;
 	int err;
 	int i;
 
@@ -2951,7 +2875,7 @@ int xe_vm_bind_ioctl(struct drm_device *dev, void *data, struct drm_file *file)
 			goto free_objs;
 		}
 
-		if (XE_IOCTL_DBG(xe, !(q->flags & EXEC_QUEUE_FLAG_VM))) {
+		if (XE_IOCTL_DBG(xe, !(q->flags & EXEC_QUEUE_FLAG_PT))) {
 			err = -EINVAL;
 			goto put_exec_queue;
 		}
@@ -3087,6 +3011,7 @@ int xe_vm_bind_ioctl(struct drm_device *dev, void *data, struct drm_file *file)
 		goto free_syncs;
 	}
 
+	xe_vma_ops_init(&vops, vm, q, syncs, num_syncs);
 	for (i = 0; i < args->num_binds; ++i) {
 		u64 range = bind_ops[i].range;
 		u64 addr = bind_ops[i].addr;
@@ -3106,42 +3031,39 @@ int xe_vm_bind_ioctl(struct drm_device *dev, void *data, struct drm_file *file)
 		}
 
 		err = vm_bind_ioctl_ops_parse(vm, q, ops[i], syncs, num_syncs,
-					      &ops_list,
-					      i == args->num_binds - 1);
+					      &vops, i == args->num_binds - 1);
 		if (err)
 			goto unwind_ops;
+
+#ifdef TEST_VM_OPS_ERROR
+		if (flags & FORCE_OP_ERROR) {
+			vops.inject_error = true;
+			vm->xe->vm_inject_error_position =
+				(vm->xe->vm_inject_error_position + 1) %
+				FORCE_OP_ERROR_COUNT;
+		}
+#endif
 	}
 
 	/* Nothing to do */
-	if (list_empty(&ops_list)) {
+	if (list_empty(&vops.list)) {
 		err = -ENODATA;
 		goto unwind_ops;
 	}
 
-	xe_vm_get(vm);
-	if (q)
-		xe_exec_queue_get(q);
-
-	err = vm_bind_ioctl_ops_execute(vm, &ops_list);
-
-	up_write(&vm->lock);
-
-	if (q)
-		xe_exec_queue_put(q);
-	xe_vm_put(vm);
-
-	for (i = 0; bos && i < args->num_binds; ++i)
-		xe_bo_put(bos[i]);
-
-	kvfree(bos);
-	kvfree(ops);
-	if (args->num_binds > 1)
-		kvfree(bind_ops);
+	err = xe_vma_ops_alloc(&vops);
+	if (err)
+		goto unwind_ops;
 
-	return err;
+	err = vm_bind_ioctl_ops_execute(vm, &vops);
 
 unwind_ops:
-	vm_bind_ioctl_ops_unwind(vm, ops, args->num_binds);
+	if (err && err != -ENODATA)
+		vm_bind_ioctl_ops_unwind(vm, ops, args->num_binds);
+	xe_vma_ops_free(&vops);
+	for (i = args->num_binds - 1; i >= 0; --i)
+		if (ops[i])
+			drm_gpuva_ops_free(&vm->gpuvm, ops[i]);
 free_syncs:
 	if (err == -ENODATA)
 		err = vm_bind_ioctl_signal_fences(vm, q, syncs, num_syncs);
diff --git a/drivers/gpu/drm/xe/xe_vm.h b/drivers/gpu/drm/xe/xe_vm.h
index 6df1f1c7f85d..492237b60341 100644
--- a/drivers/gpu/drm/xe/xe_vm.h
+++ b/drivers/gpu/drm/xe/xe_vm.h
@@ -207,7 +207,7 @@ int __xe_vm_userptr_needs_repin(struct xe_vm *vm);
 
 int xe_vm_userptr_check_repin(struct xe_vm *vm);
 
-struct dma_fence *xe_vm_rebind(struct xe_vm *vm, bool rebind_worker);
+int xe_vm_rebind(struct xe_vm *vm, bool rebind_worker);
 
 int xe_vm_invalidate_vma(struct xe_vma *vma);
 
@@ -262,6 +262,13 @@ static inline struct dma_resv *xe_vm_resv(struct xe_vm *vm)
  */
 #define xe_vm_assert_held(vm) dma_resv_assert_held(xe_vm_resv(vm))
 
+int xe_vm_populate_dummy_rebind(struct xe_vm *vm, struct xe_vma *vma,
+				u8 tile_mask);
+void xe_vma_ops_free(struct xe_vma_ops *vops);
+struct dma_fence *xe_vm_ops_execute(struct xe_vm *vm, struct xe_vma_ops *vops);
+
+void xe_vm_kill(struct xe_vm *vm, bool unlocked);
+
 #if IS_ENABLED(CONFIG_DRM_XE_DEBUG_VM)
 #define vm_dbg drm_dbg
 #else
diff --git a/drivers/gpu/drm/xe/xe_vm_types.h b/drivers/gpu/drm/xe/xe_vm_types.h
index 79b5cab57711..d0a08e927db7 100644
--- a/drivers/gpu/drm/xe/xe_vm_types.h
+++ b/drivers/gpu/drm/xe/xe_vm_types.h
@@ -18,9 +18,21 @@
 #include "xe_range_fence.h"
 
 struct xe_bo;
+struct xe_device;
 struct xe_sync_entry;
 struct xe_user_fence;
 struct xe_vm;
+struct xe_vm_pgtable_update_op;
+
+#if IS_ENABLED(CONFIG_DRM_XE_DEBUG)
+#define TEST_VM_OPS_ERROR
+#define FORCE_OP_ERROR	BIT(31)
+
+#define FORCE_OP_ERROR_LOCK	0
+#define FORCE_OP_ERROR_PREPARE	1
+#define FORCE_OP_ERROR_RUN	2
+#define FORCE_OP_ERROR_COUNT	3
+#endif
 
 #define XE_VMA_READ_ONLY	DRM_GPUVA_USERBITS
 #define XE_VMA_DESTROYED	(DRM_GPUVA_USERBITS << 1)
@@ -124,7 +136,96 @@ struct xe_userptr_vma {
 	struct xe_userptr userptr;
 };
 
-struct xe_device;
+/** struct xe_vma_op_map - VMA map operation */
+struct xe_vma_op_map {
+	/** @vma: VMA to map */
+	struct xe_vma *vma;
+	/** @immediate: Immediate bind */
+	bool immediate;
+	/** @is_null: is NULL binding */
+	bool is_null;
+	/** @dumpable: whether BO is dumped on GPU hang */
+	bool dumpable;
+	/** @pat_index: The pat index to use for this operation. */
+	u16 pat_index;
+};
+
+/** struct xe_vma_op_remap - VMA remap operation */
+struct xe_vma_op_remap {
+	/** @prev: VMA preceding part of a split mapping */
+	struct xe_vma *prev;
+	/** @next: VMA subsequent part of a split mapping */
+	struct xe_vma *next;
+	/** @start: start of the VMA unmap */
+	u64 start;
+	/** @range: range of the VMA unmap */
+	u64 range;
+	/** @skip_prev: skip prev rebind */
+	bool skip_prev;
+	/** @skip_next: skip next rebind */
+	bool skip_next;
+	/** @unmap_done: unmap operation is done */
+	bool unmap_done;
+};
+
+/** struct xe_vma_op_prefetch - VMA prefetch operation */
+struct xe_vma_op_prefetch {
+	/** @region: memory region to prefetch to */
+	u32 region;
+};
+
+/** enum xe_vma_op_flags - flags for VMA operation */
+enum xe_vma_op_flags {
+	/** @XE_VMA_OP_COMMITTED: VMA operation committed */
+	XE_VMA_OP_COMMITTED		= BIT(0),
+	/** @XE_VMA_OP_PREV_COMMITTED: Previous VMA operation committed */
+	XE_VMA_OP_PREV_COMMITTED	= BIT(1),
+	/** @XE_VMA_OP_NEXT_COMMITTED: Next VMA operation committed */
+	XE_VMA_OP_NEXT_COMMITTED	= BIT(2),
+};
+
+/** struct xe_vma_op - VMA operation */
+struct xe_vma_op {
+	/** @base: GPUVA base operation */
+	struct drm_gpuva_op base;
+	/** @num_syncs: number of syncs */
+	u32 num_syncs;
+	/** @link: async operation link */
+	struct list_head link;
+	/** @flags: operation flags */
+	enum xe_vma_op_flags flags;
+	/** @tile_mask: Tile mask for operation */
+	u8 tile_mask;
+
+	union {
+		/** @map: VMA map operation specific data */
+		struct xe_vma_op_map map;
+		/** @remap: VMA remap operation specific data */
+		struct xe_vma_op_remap remap;
+		/** @prefetch: VMA prefetch operation specific data */
+		struct xe_vma_op_prefetch prefetch;
+	};
+};
+
+/** struct xe_vma_ops - VMA operations */
+struct xe_vma_ops {
+	/** @list: list of VMA operations */
+	struct list_head list;
+	/** @vm: VM */
+	struct xe_vm *vm;
+	/** @q: exec queue for VMA operations */
+	struct xe_exec_queue *q;
+	/** @syncs: syncs for these operations */
+	struct xe_sync_entry *syncs;
+	/** @num_syncs: number of syncs */
+	u32 num_syncs;
+	/** @pt_update_ops: page table update operations */
+	struct xe_vm_pgtable_update_ops pt_update_ops[XE_MAX_TILES_PER_DEVICE];
+#ifdef TEST_VM_OPS_ERROR
+	/** @inject_error: inject error to test error handling */
+	bool inject_error;
+#endif
+};
 
 struct xe_vm {
 	/** @gpuvm: base GPUVM used to track VMAs */
@@ -133,7 +234,7 @@ struct xe_vm {
 	struct xe_device *xe;
 
 	/* exec queue used for (un)binding vma's */
-	struct xe_exec_queue *q[XE_MAX_TILES_PER_DEVICE];
+	struct xe_exec_queue *q;
 
 	/** @lru_bulk_move: Bulk LRU move list for this VM's BOs */
 	struct ttm_lru_bulk_move lru_bulk_move;
@@ -180,9 +281,6 @@ struct xe_vm {
 	 */
 	struct list_head rebind_list;
 
-	/** @rebind_fence: rebind fence from execbuf */
-	struct dma_fence *rebind_fence;
-
 	/**
 	 * @destroy_work: worker to destroy VM, needed as a dma_fence signaling
 	 * from an irq context can be last put and the destroy needs to be able
@@ -267,92 +365,18 @@ struct xe_vm {
 		bool capture_once;
 	} error_capture;
 
+	/** @dummy_ops: dummy VMA ops to issue rebinds */
+	struct {
+		/** @dummy_ops.ops: dummy VMA ops */
+		struct xe_vma_ops vops;
+		/** @dummy_ops.op: dummy VMA op */
+		struct xe_vma_op op;
+	} dummy_ops;
+
 	/** @batch_invalidate_tlb: Always invalidate TLB before batch start */
 	bool batch_invalidate_tlb;
 	/** @xef: XE file handle for tracking this VM's drm client */
 	struct xe_file *xef;
 };
 
-/** struct xe_vma_op_map - VMA map operation */
-struct xe_vma_op_map {
-	/** @vma: VMA to map */
-	struct xe_vma *vma;
-	/** @is_null: is NULL binding */
-	bool is_null;
-	/** @dumpable: whether BO is dumped on GPU hang */
-	bool dumpable;
-	/** @pat_index: The pat index to use for this operation. */
-	u16 pat_index;
-};
-
-/** struct xe_vma_op_remap - VMA remap operation */
-struct xe_vma_op_remap {
-	/** @prev: VMA preceding part of a split mapping */
-	struct xe_vma *prev;
-	/** @next: VMA subsequent part of a split mapping */
-	struct xe_vma *next;
-	/** @start: start of the VMA unmap */
-	u64 start;
-	/** @range: range of the VMA unmap */
-	u64 range;
-	/** @skip_prev: skip prev rebind */
-	bool skip_prev;
-	/** @skip_next: skip next rebind */
-	bool skip_next;
-	/** @unmap_done: unmap operation in done */
-	bool unmap_done;
-};
-
-/** struct xe_vma_op_prefetch - VMA prefetch operation */
-struct xe_vma_op_prefetch {
-	/** @region: memory region to prefetch to */
-	u32 region;
-};
-
-/** enum xe_vma_op_flags - flags for VMA operation */
-enum xe_vma_op_flags {
-	/** @XE_VMA_OP_FIRST: first VMA operation for a set of syncs */
-	XE_VMA_OP_FIRST			= BIT(0),
-	/** @XE_VMA_OP_LAST: last VMA operation for a set of syncs */
-	XE_VMA_OP_LAST			= BIT(1),
-	/** @XE_VMA_OP_COMMITTED: VMA operation committed */
-	XE_VMA_OP_COMMITTED		= BIT(2),
-	/** @XE_VMA_OP_PREV_COMMITTED: Previous VMA operation committed */
-	XE_VMA_OP_PREV_COMMITTED	= BIT(3),
-	/** @XE_VMA_OP_NEXT_COMMITTED: Next VMA operation committed */
-	XE_VMA_OP_NEXT_COMMITTED	= BIT(4),
-};
-
-/** struct xe_vma_op - VMA operation */
-struct xe_vma_op {
-	/** @base: GPUVA base operation */
-	struct drm_gpuva_op base;
-	/**
-	 * @ops: GPUVA ops, when set call drm_gpuva_ops_free after this
-	 * operations is processed
-	 */
-	struct drm_gpuva_ops *ops;
-	/** @q: exec queue for this operation */
-	struct xe_exec_queue *q;
-	/**
-	 * @syncs: syncs for this operation, only used on first and last
-	 * operation
-	 */
-	struct xe_sync_entry *syncs;
-	/** @num_syncs: number of syncs */
-	u32 num_syncs;
-	/** @link: async operation link */
-	struct list_head link;
-	/** @flags: operation flags */
-	enum xe_vma_op_flags flags;
-
-	union {
-		/** @map: VMA map operation specific data */
-		struct xe_vma_op_map map;
-		/** @remap: VMA remap operation specific data */
-		struct xe_vma_op_remap remap;
-		/** @prefetch: VMA prefetch operation specific data */
-		struct xe_vma_op_prefetch prefetch;
-	};
-};
 #endif
diff --git a/include/drm/xe_pciids.h b/include/drm/xe_pciids.h
index bc7cbef6e9d8..7b62be9bb86e 100644
--- a/include/drm/xe_pciids.h
+++ b/include/drm/xe_pciids.h
@@ -173,6 +173,22 @@
 	XE_ATS_M150_IDS(MACRO__, ## __VA_ARGS__),\
 	XE_ATS_M75_IDS(MACRO__, ## __VA_ARGS__)
 
+/* PVC */
+#define XE_PVC_IDS(MACRO__, ...)		\
+	MACRO__(0x0B69, ## __VA_ARGS__),	\
+	MACRO__(0x0B6E, ## __VA_ARGS__),	\
+	MACRO__(0x0BD5, ## __VA_ARGS__),	\
+	MACRO__(0x0BD4, ## __VA_ARGS__),	\
+	MACRO__(0x0BD6, ## __VA_ARGS__),	\
+	MACRO__(0x0BD7, ## __VA_ARGS__),	\
+	MACRO__(0x0BD8, ## __VA_ARGS__),	\
+	MACRO__(0x0BD9, ## __VA_ARGS__),	\
+	MACRO__(0x0BDA, ## __VA_ARGS__),	\
+	MACRO__(0x0BDB, ## __VA_ARGS__),	\
+	MACRO__(0x0BE0, ## __VA_ARGS__),	\
+	MACRO__(0x0BE1, ## __VA_ARGS__),	\
+	MACRO__(0x0BE5, ## __VA_ARGS__)
+
 /* MTL / ARL */
 #define XE_MTL_IDS(MACRO__, ...)		\
 	MACRO__(0x7D40, ## __VA_ARGS__),	\
-- 
2.26.3


^ permalink raw reply related	[flat|nested] 72+ messages in thread

* [v2 02/31] drm/xe/svm: Add SVM document
  2024-04-09 20:17 [v2 00/31] Basic system allocator support in xe driver Oak Zeng
  2024-04-09 20:17 ` [v2 01/31] drm/xe: Refactor vm_bind Oak Zeng
@ 2024-04-09 20:17 ` Oak Zeng
  2024-04-09 20:17 ` [v2 03/31] drm/xe: Invalidate userptr VMA on page pin fault Oak Zeng
                   ` (29 subsequent siblings)
  31 siblings, 0 replies; 72+ messages in thread
From: Oak Zeng @ 2024-04-09 20:17 UTC (permalink / raw)
  To: intel-xe
  Cc: himal.prasad.ghimiray, krishnaiah.bommu, matthew.brost,
	Thomas.Hellstrom, brian.welty

Add shared virtual memory document.

Signed-off-by: Oak Zeng <oak.zeng@intel.com>
Co-developed-by: Niranjana Vishwanathapura <niranjana.vishwanathapura@intel.com>
Signed-off-by: Niranjana Vishwanathapura <niranjana.vishwanathapura@intel.com>
Cc: Matthew Brost <matthew.brost@intel.com>
Cc: Thomas Hellström <thomas.hellstrom@intel.com>
Cc: Brian Welty <brian.welty@intel.com>
---
 Documentation/gpu/xe/index.rst  |   1 +
 Documentation/gpu/xe/xe_svm.rst |   8 +++
 drivers/gpu/drm/xe/xe_svm_doc.h | 121 ++++++++++++++++++++++++++++++++
 3 files changed, 130 insertions(+)
 create mode 100644 Documentation/gpu/xe/xe_svm.rst
 create mode 100644 drivers/gpu/drm/xe/xe_svm_doc.h

diff --git a/Documentation/gpu/xe/index.rst b/Documentation/gpu/xe/index.rst
index c224ecaee81e..106b60aba1f0 100644
--- a/Documentation/gpu/xe/index.rst
+++ b/Documentation/gpu/xe/index.rst
@@ -23,3 +23,4 @@ DG2, etc is provided to prototype the driver.
    xe_firmware
    xe_tile
    xe_debugging
+   xe_svm
diff --git a/Documentation/gpu/xe/xe_svm.rst b/Documentation/gpu/xe/xe_svm.rst
new file mode 100644
index 000000000000..62954ba1c6f8
--- /dev/null
+++ b/Documentation/gpu/xe/xe_svm.rst
@@ -0,0 +1,8 @@
+.. SPDX-License-Identifier: (GPL-2.0+ OR MIT)
+
+=====================
+Shared virtual memory
+=====================
+
+.. kernel-doc:: drivers/gpu/drm/xe/xe_svm_doc.h
+   :doc: Shared virtual memory
diff --git a/drivers/gpu/drm/xe/xe_svm_doc.h b/drivers/gpu/drm/xe/xe_svm_doc.h
new file mode 100644
index 000000000000..de38ee3585e4
--- /dev/null
+++ b/drivers/gpu/drm/xe/xe_svm_doc.h
@@ -0,0 +1,121 @@
+/* SPDX-License-Identifier: MIT */
+/*
+ * Copyright © 2023 Intel Corporation
+ */
+
+#ifndef _XE_SVM_DOC_H_
+#define _XE_SVM_DOC_H_
+
+/**
+ * DOC: Shared virtual memory
+ *
+ * Shared Virtual Memory (SVM) allows the programmer to use a single virtual
+ * address space shared between threads executing on CPUs and GPUs. It abstracts
+ * away from the user the location of the backing memory, and hence simplifies
+ * the user programming model. In a non-SVM memory model, the user needs to
+ * explicitly decide memory placement, such as device or system memory, and must
+ * explicitly migrate memory between device and system memory.
+ *
+ * Interface
+ * =========
+ *
+ * SVM makes use of the default OS memory allocation and mapping interfaces such
+ * as malloc() and mmap(). The pointers returned from malloc() and mmap() can be
+ * used directly in both CPU and GPU programs.
+ *
+ * SVM also provides an API to set virtual address range based memory attributes
+ * such as preferred memory location, memory migration granularity, and memory
+ * atomic attributes. This is similar to the Linux madvise() API.
+ *
+ * Basic implementation
+ * ====================
+ *
+ * The XeKMD implementation is based on the Linux kernel Heterogeneous Memory
+ * Management (HMM) framework. HMM's address space mirroring support allows
+ * sharing of the address space by duplicating sections of CPU page tables in
+ * the device page tables. This enables both the CPU and GPU to access a
+ * physical memory location using the same virtual address.
+ *
+ * The Linux kernel also provides the ability to plug device memory into the
+ * system (as a special ZONE_DEVICE type) and allocate a struct page for each
+ * device memory page.
+ *
+ * HMM also provides a mechanism to migrate pages from host to device memory and
+ * vice versa.
+ *
+ * More information on HMM can be found here:
+ * https://www.kernel.org/doc/Documentation/vm/hmm.rst
+ *
+ * Unlike the non-SVM memory allocators (such as gem_create, vm_bind, etc.),
+ * there is no buffer object (BO, such as struct ttm_buffer_object or struct
+ * drm_gem_object) in our SVM implementation. We deliberately chose this option
+ * to achieve page-granularity memory placement, validation, eviction and migration.
+ *
+ * The SVM layer allocates device memory directly from the drm buddy subsystem.
+ * The memory is organized as blocks of 2^n pages each. The SVM subsystem then
+ * marks the usage of each page using a simple bitmap. When no page in a block
+ * is used anymore, SVM returns the block to the drm buddy subsystem.
+ *
+ * There are three events that can trigger the SVM subsystem into action:
+ *
+ * 1. A mmu notifier callback
+ *
+ * Since SVM needs to mirror the program's CPU virtual address space on the GPU
+ * side, whenever the program's CPU address space changes, SVM must make an
+ * identical change on the GPU side. SVM/HMM use an mmu interval notifier to
+ * achieve this: SVM registers an mmu interval notifier callback with core mm,
+ * and whenever the CPU-side virtual address space changes (e.g., when a virtual
+ * address range is unmapped by the CPU calling munmap), the registered callback
+ * is invoked by core mm. SVM then mirrors the change on the GPU side, i.e.,
+ * unmaps or invalidates the virtual address range in the GPU page table.
+ *
+ * 2. A GPU page fault
+ *
+ * At the very beginning of a process's life, none of its virtual addresses are
+ * mapped in the GPU page table, so the first GPU access to any virtual address
+ * of the process triggers a GPU page fault. SVM then decides the best memory
+ * location for the faulting address (mainly for performance, but sometimes also
+ * for correctness, e.g., whether the GPU can perform atomic operations on a
+ * given memory location), migrates memory if necessary, and maps the faulting
+ * address in the GPU page table.
+ *
+ * 3. A CPU page fault
+ *
+ * A CPU page fault is usually handled by Linux core mm. But in a mixed CPU and
+ * GPU programming environment, the backing store of a virtual address range
+ * can be in the GPU's local memory, which is not visible to the CPU
+ * (DEVICE_PRIVATE), so the CPU page fault handler needs to migrate such pages
+ * to system memory for the CPU to access them. Such memory migration is device
+ * specific. HMM provides a callback (the migrate_to_ram function of
+ * dev_pagemap_ops) for the device driver to implement.
+ *
+ *
+ * Memory hints: TBD
+ * =================
+ *
+ * Memory eviction: TBD
+ * ====================
+ *
+ * Lock design
+ * ===========
+ *
+ * The "Address space mirroring implementation and API" section of
+ * https://www.kernel.org/doc/Documentation/vm/hmm.rst describes the locking
+ * scheme driver writers have to respect. Three lock mechanisms are involved:
+ *
+ * 1. Use mmap_read/write_lock to protect VMAs and CPU page table operations.
+ * Operations such as munmap/mmap and page table updates during NUMA balancing
+ * must hold this lock. hmm_range_fault() is a helper function provided by HMM
+ * to populate the CPU page table, so it must be called with this lock held.
+ *
+ * 2. Use xe_svm::mutex to protect device-side page table operations. Any attempt
+ * to bind or invalidate an address range on the GPU must hold this device lock.
+ *
+ * 3. In the GPU page fault handler, during the device page table update, we hold
+ * xe_svm::mutex but not the mmap_read/write_lock, so the program's address space
+ * can change during the GPU page table update. The mmu notifier seq# is used to
+ * determine whether an unmap happened during the update; if so, we retry.
+ *
+ */
+
+#endif
-- 
2.26.3


^ permalink raw reply related	[flat|nested] 72+ messages in thread

* [v2 03/31] drm/xe: Invalidate userptr VMA on page pin fault
  2024-04-09 20:17 [v2 00/31] Basic system allocator support in xe driver Oak Zeng
  2024-04-09 20:17 ` [v2 01/31] drm/xe: Refactor vm_bind Oak Zeng
  2024-04-09 20:17 ` [v2 02/31] drm/xe/svm: Add SVM document Oak Zeng
@ 2024-04-09 20:17 ` Oak Zeng
  2024-04-09 20:17 ` [v2 04/31] drm/xe: Drop unused arguments from vm_bind_ioctl_ops_parse Oak Zeng
                   ` (28 subsequent siblings)
  31 siblings, 0 replies; 72+ messages in thread
From: Oak Zeng @ 2024-04-09 20:17 UTC (permalink / raw)
  To: intel-xe
  Cc: himal.prasad.ghimiray, krishnaiah.bommu, matthew.brost,
	Thomas.Hellstrom, brian.welty

From: Matthew Brost <matthew.brost@intel.com>

Rather than return an error to the user or ban the VM when pinning a
userptr VMA's pages fails with -EFAULT, invalidate the VMA's mappings. This
supports the UMD use case of freeing a userptr while still having bindings.

Signed-off-by: Matthew Brost <matthew.brost@intel.com>
---
 drivers/gpu/drm/xe/xe_gt_pagefault.c |  4 ++--
 drivers/gpu/drm/xe/xe_trace.h        |  2 +-
 drivers/gpu/drm/xe/xe_vm.c           | 20 +++++++++++++-------
 drivers/gpu/drm/xe/xe_vm_types.h     |  7 ++-----
 4 files changed, 18 insertions(+), 15 deletions(-)

diff --git a/drivers/gpu/drm/xe/xe_gt_pagefault.c b/drivers/gpu/drm/xe/xe_gt_pagefault.c
index e4f5a80a46fc..c49b1409e168 100644
--- a/drivers/gpu/drm/xe/xe_gt_pagefault.c
+++ b/drivers/gpu/drm/xe/xe_gt_pagefault.c
@@ -68,7 +68,7 @@ static bool access_is_atomic(enum access_type access_type)
 static bool vma_is_valid(struct xe_tile *tile, struct xe_vma *vma)
 {
 	return BIT(tile->id) & vma->tile_present &&
-		!(BIT(tile->id) & vma->usm.tile_invalidated);
+		!(BIT(tile->id) & vma->tile_invalidated);
 }
 
 static bool vma_matches(struct xe_vma *vma, u64 page_addr)
@@ -230,7 +230,7 @@ static int handle_pagefault(struct xe_gt *gt, struct pagefault *pf)
 
 	if (xe_vma_is_userptr(vma))
 		ret = xe_vma_userptr_check_repin(to_userptr_vma(vma));
-	vma->usm.tile_invalidated &= ~BIT(tile->id);
+	vma->tile_invalidated &= ~BIT(tile->id);
 
 unlock_dma_resv:
 	drm_exec_fini(&exec);
diff --git a/drivers/gpu/drm/xe/xe_trace.h b/drivers/gpu/drm/xe/xe_trace.h
index c4704c5f3c72..5f7d26bf4cd7 100644
--- a/drivers/gpu/drm/xe/xe_trace.h
+++ b/drivers/gpu/drm/xe/xe_trace.h
@@ -464,7 +464,7 @@ DEFINE_EVENT(xe_vma, xe_vma_userptr_invalidate,
 	     TP_ARGS(vma)
 );
 
-DEFINE_EVENT(xe_vma, xe_vma_usm_invalidate,
+DEFINE_EVENT(xe_vma, xe_vma_invalidate,
 	     TP_PROTO(struct xe_vma *vma),
 	     TP_ARGS(vma)
 );
diff --git a/drivers/gpu/drm/xe/xe_vm.c b/drivers/gpu/drm/xe/xe_vm.c
index 8ba037e7ce5c..e1c1c18825ff 100644
--- a/drivers/gpu/drm/xe/xe_vm.c
+++ b/drivers/gpu/drm/xe/xe_vm.c
@@ -723,11 +723,18 @@ int xe_vm_userptr_pin(struct xe_vm *vm)
 	list_for_each_entry_safe(uvma, next, &vm->userptr.repin_list,
 				 userptr.repin_link) {
 		err = xe_vma_userptr_pin_pages(uvma);
-		if (err < 0)
-			return err;
-
 		list_del_init(&uvma->userptr.repin_link);
-		list_move_tail(&uvma->vma.combined_links.rebind, &vm->rebind_list);
+		if (err == -EFAULT) {
+			err = xe_vm_invalidate_vma(&uvma->vma);
+			if (err)
+				return err;
+		} else {
+			if (err < 0)
+				return err;
+
+			list_move_tail(&uvma->vma.combined_links.rebind,
+				       &vm->rebind_list);
+		}
 	}
 
 	return 0;
@@ -3136,9 +3143,8 @@ int xe_vm_invalidate_vma(struct xe_vma *vma)
 	u8 id;
 	int ret;
 
-	xe_assert(xe, xe_vm_in_fault_mode(xe_vma_vm(vma)));
 	xe_assert(xe, !xe_vma_is_null(vma));
-	trace_xe_vma_usm_invalidate(vma);
+	trace_xe_vma_invalidate(vma);
 
 	/* Check that we don't race with page-table updates */
 	if (IS_ENABLED(CONFIG_PROVE_LOCKING)) {
@@ -3176,7 +3182,7 @@ int xe_vm_invalidate_vma(struct xe_vma *vma)
 		}
 	}
 
-	vma->usm.tile_invalidated = vma->tile_mask;
+	vma->tile_invalidated = vma->tile_mask;
 
 	return 0;
 }
diff --git a/drivers/gpu/drm/xe/xe_vm_types.h b/drivers/gpu/drm/xe/xe_vm_types.h
index d0a08e927db7..2bb76adf66a1 100644
--- a/drivers/gpu/drm/xe/xe_vm_types.h
+++ b/drivers/gpu/drm/xe/xe_vm_types.h
@@ -96,11 +96,8 @@ struct xe_vma {
 		struct work_struct destroy_work;
 	};
 
-	/** @usm: unified shared memory state */
-	struct {
-		/** @tile_invalidated: VMA has been invalidated */
-		u8 tile_invalidated;
-	} usm;
+	/** @tile_invalidated: VMA has been invalidated */
+	u8 tile_invalidated;
 
 	/** @tile_mask: Tile mask of where to create binding for this VMA */
 	u8 tile_mask;
-- 
2.26.3


^ permalink raw reply related	[flat|nested] 72+ messages in thread

* [v2 04/31] drm/xe: Drop unused arguments from vm_bind_ioctl_ops_parse
  2024-04-09 20:17 [v2 00/31] Basic system allocator support in xe driver Oak Zeng
                   ` (2 preceding siblings ...)
  2024-04-09 20:17 ` [v2 03/31] drm/xe: Invalidate userptr VMA on page pin fault Oak Zeng
@ 2024-04-09 20:17 ` Oak Zeng
  2024-04-09 20:17 ` [v2 05/31] drm/xe: Fix op->tile_mask for fault mode Oak Zeng
                   ` (27 subsequent siblings)
  31 siblings, 0 replies; 72+ messages in thread
From: Oak Zeng @ 2024-04-09 20:17 UTC (permalink / raw)
  To: intel-xe
  Cc: himal.prasad.ghimiray, krishnaiah.bommu, matthew.brost,
	Thomas.Hellstrom, brian.welty

From: Matthew Brost <matthew.brost@intel.com>

Drop exec queue and last arguments from vm_bind_ioctl_ops_parse as these
are unused.

Signed-off-by: Matthew Brost <matthew.brost@intel.com>
---
 drivers/gpu/drm/xe/xe_vm.c | 9 ++++-----
 1 file changed, 4 insertions(+), 5 deletions(-)

diff --git a/drivers/gpu/drm/xe/xe_vm.c b/drivers/gpu/drm/xe/xe_vm.c
index e1c1c18825ff..c0c6bb163a9e 100644
--- a/drivers/gpu/drm/xe/xe_vm.c
+++ b/drivers/gpu/drm/xe/xe_vm.c
@@ -2142,10 +2142,9 @@ static int xe_vma_op_commit(struct xe_vm *vm, struct xe_vma_op *op)
 	return err;
 }
 
-static int vm_bind_ioctl_ops_parse(struct xe_vm *vm, struct xe_exec_queue *q,
-				   struct drm_gpuva_ops *ops,
+static int vm_bind_ioctl_ops_parse(struct xe_vm *vm, struct drm_gpuva_ops *ops,
 				   struct xe_sync_entry *syncs, u32 num_syncs,
-				   struct xe_vma_ops *vops, bool last)
+				   struct xe_vma_ops *vops)
 {
 	struct xe_device *xe = vm->xe;
 	struct drm_gpuva_op *__op;
@@ -3037,8 +3036,8 @@ int xe_vm_bind_ioctl(struct drm_device *dev, void *data, struct drm_file *file)
 			goto unwind_ops;
 		}
 
-		err = vm_bind_ioctl_ops_parse(vm, q, ops[i], syncs, num_syncs,
-					      &vops, i == args->num_binds - 1);
+		err = vm_bind_ioctl_ops_parse(vm, ops[i], syncs, num_syncs,
+					      &vops);
 		if (err)
 			goto unwind_ops;
 
-- 
2.26.3


^ permalink raw reply related	[flat|nested] 72+ messages in thread

* [v2 05/31] drm/xe: Fix op->tile_mask for fault mode
  2024-04-09 20:17 [v2 00/31] Basic system allocator support in xe driver Oak Zeng
                   ` (3 preceding siblings ...)
  2024-04-09 20:17 ` [v2 04/31] drm/xe: Drop unused arguments from vm_bind_ioctl_ops_parse Oak Zeng
@ 2024-04-09 20:17 ` Oak Zeng
  2024-04-09 20:17 ` [v2 06/31] drm/xe/uapi: Add DRM_XE_VM_BIND_FLAG_SYSTEM_ALLOCATOR flag Oak Zeng
                   ` (26 subsequent siblings)
  31 siblings, 0 replies; 72+ messages in thread
From: Oak Zeng @ 2024-04-09 20:17 UTC (permalink / raw)
  To: intel-xe
  Cc: himal.prasad.ghimiray, krishnaiah.bommu, matthew.brost,
	Thomas.Hellstrom, brian.welty

From: Matthew Brost <matthew.brost@intel.com>

op->tile_mask might be a subset of all tiles if in fault mode. Fix
unmaps by setting op->tile_mask to the unmapped VMA's tile_present field.

FIXME: This should be squashed into an earlier patch

Signed-off-by: Matthew Brost <matthew.brost@intel.com>
---
 drivers/gpu/drm/xe/xe_vm.c | 8 ++++++++
 1 file changed, 8 insertions(+)

diff --git a/drivers/gpu/drm/xe/xe_vm.c b/drivers/gpu/drm/xe/xe_vm.c
index c0c6bb163a9e..7ce7dbeb6f0a 100644
--- a/drivers/gpu/drm/xe/xe_vm.c
+++ b/drivers/gpu/drm/xe/xe_vm.c
@@ -2190,6 +2190,7 @@ static int vm_bind_ioctl_ops_parse(struct xe_vm *vm, struct drm_gpuva_ops *ops,
 			struct xe_vma *old =
 				gpuva_to_vma(op->base.remap.unmap->va);
 
+			op->tile_mask = old->tile_present;
 			op->remap.start = xe_vma_start(old);
 			op->remap.range = xe_vma_size(old);
 
@@ -2273,6 +2274,13 @@ static int vm_bind_ioctl_ops_parse(struct xe_vm *vm, struct drm_gpuva_ops *ops,
 			break;
 		}
 		case DRM_GPUVA_OP_UNMAP:
+		{
+			struct xe_vma *vma = gpuva_to_vma(op->base.unmap.va);
+
+			op->tile_mask = vma->tile_present;
+			xe_vma_ops_incr_pt_update_ops(vops, op->tile_mask);
+			break;
+		}
 		case DRM_GPUVA_OP_PREFETCH:
 			/* FIXME: Need to skip some prefetch ops */
 			xe_vma_ops_incr_pt_update_ops(vops, op->tile_mask);
-- 
2.26.3


^ permalink raw reply related	[flat|nested] 72+ messages in thread

* [v2 06/31] drm/xe/uapi: Add DRM_XE_VM_BIND_FLAG_SYSTEM_ALLOCATOR flag
  2024-04-09 20:17 [v2 00/31] Basic system allocator support in xe driver Oak Zeng
                   ` (4 preceding siblings ...)
  2024-04-09 20:17 ` [v2 05/31] drm/xe: Fix op->tile_mask for fault mode Oak Zeng
@ 2024-04-09 20:17 ` Oak Zeng
  2024-04-09 20:17 ` [v2 07/31] drm/xe: Create userptr if page fault occurs on system_allocator VMA Oak Zeng
                   ` (25 subsequent siblings)
  31 siblings, 0 replies; 72+ messages in thread
From: Oak Zeng @ 2024-04-09 20:17 UTC (permalink / raw)
  To: intel-xe
  Cc: himal.prasad.ghimiray, krishnaiah.bommu, matthew.brost,
	Thomas.Hellstrom, brian.welty

From: Matthew Brost <matthew.brost@intel.com>

Add DRM_XE_VM_BIND_FLAG_SYSTEM_ALLOCATOR flag which is used to create
unpopulated (no memory backing or GPU page tables) VMAs. These VMAs are
referred to as system allocator VMAs. The idea is that, on page fault,
the memory backing and GPU page tables will be populated.

FIXME: Only supporting 1 to 1 mapping between user address space and
GPU address space

v1: enforce DRM_XE_VM_BIND_FLAG_SYSTEM_ALLOCATOR for fault mode
VMs

Signed-off-by: Matthew Brost <matthew.brost@intel.com>
Signed-off-by: Oak Zeng <oak.zeng@intel.com>
---
 drivers/gpu/drm/xe/xe_pt.c       |  73 +++++++++++++----
 drivers/gpu/drm/xe/xe_vm.c       | 132 +++++++++++++++++++------------
 drivers/gpu/drm/xe/xe_vm.h       |   8 +-
 drivers/gpu/drm/xe/xe_vm_types.h |   3 +
 include/uapi/drm/xe_drm.h        |  15 +++-
 5 files changed, 161 insertions(+), 70 deletions(-)

diff --git a/drivers/gpu/drm/xe/xe_pt.c b/drivers/gpu/drm/xe/xe_pt.c
index 1ff01d616dac..846e896edcb5 100644
--- a/drivers/gpu/drm/xe/xe_pt.c
+++ b/drivers/gpu/drm/xe/xe_pt.c
@@ -1030,6 +1030,11 @@ static int op_add_deps(struct xe_vm *vm, struct xe_vma_op *op,
 {
 	int err = 0;
 
+	/*
+	 * No need to check for is_system_allocator here as vma_add_deps is a
+	 * NOP if VMA is_system_allocator
+	 */
+
 	switch (op->base.op) {
 	case DRM_GPUVA_OP_MAP:
 		if (!op->map.immediate && xe_vm_in_fault_mode(vm))
@@ -1602,6 +1607,7 @@ static int bind_op_prepare(struct xe_vm *vm, struct xe_tile *tile,
 	struct xe_vm_pgtable_update_op *pt_op = &pt_update_ops->ops[current_op];
 	int err;
 
+	xe_tile_assert(tile, !xe_vma_is_system_allocator(vma));
 	xe_bo_assert_held(xe_vma_bo(vma));
 
 	vm_dbg(&xe_vma_vm(vma)->xe->drm,
@@ -1659,6 +1665,7 @@ static int unbind_op_prepare(struct xe_tile *tile,
 	u32 current_op = pt_update_ops->current_op;
 	struct xe_vm_pgtable_update_op *pt_op = &pt_update_ops->ops[current_op];
 
+	xe_tile_assert(tile, !xe_vma_is_system_allocator(vma));
 	xe_bo_assert_held(xe_vma_bo(vma));
 
 	vm_dbg(&xe_vma_vm(vma)->xe->drm,
@@ -1694,15 +1701,21 @@ static int op_prepare(struct xe_vm *vm,
 
 	switch (op->base.op) {
 	case DRM_GPUVA_OP_MAP:
-		if (!op->map.immediate && xe_vm_in_fault_mode(vm))
+		if ((!op->map.immediate && xe_vm_in_fault_mode(vm)) ||
+		    op->map.is_system_allocator)
 			break;
 
 		err = bind_op_prepare(vm, tile, pt_update_ops, op->map.vma);
 		pt_update_ops->wait_vm_kernel = true;
 		break;
 	case DRM_GPUVA_OP_REMAP:
-		err = unbind_op_prepare(tile, pt_update_ops,
-					gpuva_to_vma(op->base.remap.unmap->va));
+	{
+		struct xe_vma *old = gpuva_to_vma(op->base.remap.unmap->va);
+
+		if (xe_vma_is_system_allocator(old))
+			break;
+
+		err = unbind_op_prepare(tile, pt_update_ops, old);
 
 		if (!err && op->remap.prev) {
 			err = bind_op_prepare(vm, tile, pt_update_ops,
@@ -1715,15 +1728,28 @@ static int op_prepare(struct xe_vm *vm,
 			pt_update_ops->wait_vm_bookkeep = true;
 		}
 		break;
+	}
 	case DRM_GPUVA_OP_UNMAP:
-		err = unbind_op_prepare(tile, pt_update_ops,
-					gpuva_to_vma(op->base.unmap.va));
+	{
+		struct xe_vma *vma = gpuva_to_vma(op->base.unmap.va);
+
+		if (xe_vma_is_system_allocator(vma))
+			break;
+
+		err = unbind_op_prepare(tile, pt_update_ops, vma);
 		break;
+	}
 	case DRM_GPUVA_OP_PREFETCH:
-		err = bind_op_prepare(vm, tile, pt_update_ops,
-				      gpuva_to_vma(op->base.prefetch.va));
+	{
+		struct xe_vma *vma = gpuva_to_vma(op->base.prefetch.va);
+
+		if (xe_vma_is_system_allocator(vma))
+			break;
+
+		err = bind_op_prepare(vm, tile, pt_update_ops, vma);
 		pt_update_ops->wait_vm_kernel = true;
 		break;
+	}
 	default:
 		drm_warn(&vm->xe->drm, "NOT POSSIBLE");
 	}
@@ -1785,6 +1811,8 @@ static void bind_op_commit(struct xe_vm *vm, struct xe_tile *tile,
 			   struct xe_vm_pgtable_update_ops *pt_update_ops,
 			   struct xe_vma *vma, struct dma_fence *fence)
 {
+	xe_tile_assert(tile, !xe_vma_is_system_allocator(vma));
+
 	if (!xe_vma_has_no_bo(vma) && !xe_vma_bo(vma)->vm)
 		dma_resv_add_fence(xe_vma_bo(vma)->ttm.base.resv, fence,
 				   pt_update_ops->wait_vm_bookkeep ?
@@ -1810,6 +1838,8 @@ static void unbind_op_commit(struct xe_vm *vm, struct xe_tile *tile,
 			     struct xe_vm_pgtable_update_ops *pt_update_ops,
 			     struct xe_vma *vma, struct dma_fence *fence)
 {
+	xe_tile_assert(tile, !xe_vma_is_system_allocator(vma));
+
 	if (!xe_vma_has_no_bo(vma) && !xe_vma_bo(vma)->vm)
 		dma_resv_add_fence(xe_vma_bo(vma)->ttm.base.resv, fence,
 				   pt_update_ops->wait_vm_bookkeep ?
@@ -1837,14 +1867,20 @@ static void op_commit(struct xe_vm *vm,
 
 	switch (op->base.op) {
 	case DRM_GPUVA_OP_MAP:
-		if (!op->map.immediate && xe_vm_in_fault_mode(vm))
+		if ((!op->map.immediate && xe_vm_in_fault_mode(vm)) ||
+		    op->map.is_system_allocator)
 			break;
 
 		bind_op_commit(vm, tile, pt_update_ops, op->map.vma, fence);
 		break;
 	case DRM_GPUVA_OP_REMAP:
-		unbind_op_commit(vm, tile, pt_update_ops,
-				 gpuva_to_vma(op->base.remap.unmap->va), fence);
+	{
+		struct xe_vma *old = gpuva_to_vma(op->base.remap.unmap->va);
+
+		if (xe_vma_is_system_allocator(old))
+			break;
+
+		unbind_op_commit(vm, tile, pt_update_ops, old, fence);
 
 		if (op->remap.prev)
 			bind_op_commit(vm, tile, pt_update_ops, op->remap.prev,
@@ -1853,14 +1889,23 @@ static void op_commit(struct xe_vm *vm,
 			bind_op_commit(vm, tile, pt_update_ops, op->remap.next,
 				       fence);
 		break;
+	}
 	case DRM_GPUVA_OP_UNMAP:
-		unbind_op_commit(vm, tile, pt_update_ops,
-				 gpuva_to_vma(op->base.unmap.va), fence);
+	{
+		struct xe_vma *vma = gpuva_to_vma(op->base.unmap.va);
+
+		if (!xe_vma_is_system_allocator(vma))
+			unbind_op_commit(vm, tile, pt_update_ops, vma, fence);
 		break;
+	}
 	case DRM_GPUVA_OP_PREFETCH:
-		bind_op_commit(vm, tile, pt_update_ops,
-			       gpuva_to_vma(op->base.prefetch.va), fence);
+	{
+		struct xe_vma *vma = gpuva_to_vma(op->base.prefetch.va);
+
+		if (!xe_vma_is_system_allocator(vma))
+			bind_op_commit(vm, tile, pt_update_ops, vma, fence);
 		break;
+	}
 	default:
 		drm_warn(&vm->xe->drm, "NOT POSSIBLE");
 	}
diff --git a/drivers/gpu/drm/xe/xe_vm.c b/drivers/gpu/drm/xe/xe_vm.c
index 7ce7dbeb6f0a..d31d067d2e8b 100644
--- a/drivers/gpu/drm/xe/xe_vm.c
+++ b/drivers/gpu/drm/xe/xe_vm.c
@@ -841,6 +841,8 @@ int xe_vm_populate_dummy_rebind(struct xe_vm *vm, struct xe_vma *vma,
 	vm->dummy_ops.op.map.immediate = true;
 	vm->dummy_ops.op.map.dumpable = vma->gpuva.flags & XE_VMA_DUMPABLE;
 	vm->dummy_ops.op.map.is_null = xe_vma_is_null(vma);
+	vm->dummy_ops.op.map.is_system_allocator =
+		xe_vma_is_system_allocator(vma);
 
 	return xe_vma_ops_alloc(&vm->dummy_ops.vops);
 }
@@ -889,9 +891,10 @@ static void xe_vma_free(struct xe_vma *vma)
 		kfree(vma);
 }
 
-#define VMA_CREATE_FLAG_READ_ONLY	BIT(0)
-#define VMA_CREATE_FLAG_IS_NULL		BIT(1)
-#define VMA_CREATE_FLAG_DUMPABLE	BIT(2)
+#define VMA_CREATE_FLAG_READ_ONLY		BIT(0)
+#define VMA_CREATE_FLAG_IS_NULL			BIT(1)
+#define VMA_CREATE_FLAG_DUMPABLE		BIT(2)
+#define VMA_CREATE_FLAG_IS_SYSTEM_ALLOCATOR	BIT(3)
 
 static struct xe_vma *xe_vma_create(struct xe_vm *vm,
 				    struct xe_bo *bo,
@@ -905,6 +908,8 @@ static struct xe_vma *xe_vma_create(struct xe_vm *vm,
 	bool read_only = (flags & VMA_CREATE_FLAG_READ_ONLY);
 	bool is_null = (flags & VMA_CREATE_FLAG_IS_NULL);
 	bool dumpable = (flags & VMA_CREATE_FLAG_DUMPABLE);
+	bool is_system_allocator =
+		(flags & VMA_CREATE_FLAG_IS_SYSTEM_ALLOCATOR);
 
 	xe_assert(vm->xe, start < end);
 	xe_assert(vm->xe, end < vm->size);
@@ -913,7 +918,7 @@ static struct xe_vma *xe_vma_create(struct xe_vm *vm,
 	 * Allocate and ensure that the xe_vma_is_userptr() return
 	 * matches what was allocated.
 	 */
-	if (!bo && !is_null) {
+	if (!bo && !is_null && !is_system_allocator) {
 		struct xe_userptr_vma *uvma = kzalloc(sizeof(*uvma), GFP_KERNEL);
 
 		if (!uvma)
@@ -925,6 +930,8 @@ static struct xe_vma *xe_vma_create(struct xe_vm *vm,
 		if (!vma)
 			return ERR_PTR(-ENOMEM);
 
+		if (is_system_allocator)
+			vma->gpuva.flags |= XE_VMA_SYSTEM_ALLOCATOR;
 		if (is_null)
 			vma->gpuva.flags |= DRM_GPUVA_SPARSE;
 		if (bo)
@@ -967,7 +974,7 @@ static struct xe_vma *xe_vma_create(struct xe_vm *vm,
 		drm_gpuva_link(&vma->gpuva, vm_bo);
 		drm_gpuvm_bo_put(vm_bo);
 	} else /* userptr or null */ {
-		if (!is_null) {
+		if (!is_null && !is_system_allocator) {
 			struct xe_userptr *userptr = &to_userptr_vma(vma)->userptr;
 			u64 size = end - start + 1;
 			int err;
@@ -1024,7 +1031,7 @@ static void xe_vma_destroy_late(struct xe_vma *vma)
 		 */
 		mmu_interval_notifier_remove(&userptr->notifier);
 		xe_vm_put(vm);
-	} else if (xe_vma_is_null(vma)) {
+	} else if (xe_vma_is_null(vma) || xe_vma_is_system_allocator(vma)) {
 		xe_vm_put(vm);
 	} else {
 		xe_bo_put(xe_vma_bo(vma));
@@ -1063,7 +1070,7 @@ static void xe_vma_destroy(struct xe_vma *vma, struct dma_fence *fence)
 		spin_lock(&vm->userptr.invalidated_lock);
 		list_del(&to_userptr_vma(vma)->userptr.invalidate_link);
 		spin_unlock(&vm->userptr.invalidated_lock);
-	} else if (!xe_vma_is_null(vma)) {
+	} else if (!xe_vma_is_null(vma) && !xe_vma_is_system_allocator(vma)) {
 		xe_bo_assert_held(xe_vma_bo(vma));
 
 		drm_gpuva_unlink(&vma->gpuva);
@@ -1982,6 +1989,8 @@ vm_bind_ioctl_ops_create(struct xe_vm *vm, struct xe_bo *bo,
 		if (__op->op == DRM_GPUVA_OP_MAP) {
 			op->map.immediate = !xe_vm_in_fault_mode(vm);
 			op->map.is_null = flags & DRM_XE_VM_BIND_FLAG_NULL;
+			op->map.is_system_allocator = flags &
+				DRM_XE_VM_BIND_FLAG_SYSTEM_ALLOCATOR;
 			op->map.dumpable = flags & DRM_XE_VM_BIND_FLAG_DUMPABLE;
 			op->map.pat_index = pat_index;
 		} else if (__op->op == DRM_GPUVA_OP_PREFETCH) {
@@ -2173,6 +2182,8 @@ static int vm_bind_ioctl_ops_parse(struct xe_vm *vm, struct drm_gpuva_ops *ops,
 				VMA_CREATE_FLAG_IS_NULL : 0;
 			flags |= op->map.dumpable ?
 				VMA_CREATE_FLAG_DUMPABLE : 0;
+			flags |= op->map.is_system_allocator ?
+				VMA_CREATE_FLAG_IS_SYSTEM_ALLOCATOR : 0;
 
 			vma = new_vma(vm, &op->base.map, op->map.pat_index,
 				      flags);
@@ -2180,7 +2191,8 @@ static int vm_bind_ioctl_ops_parse(struct xe_vm *vm, struct drm_gpuva_ops *ops,
 				return PTR_ERR(vma);
 
 			op->map.vma = vma;
-			if (op->map.immediate || !xe_vm_in_fault_mode(vm))
+			if ((op->map.immediate || !xe_vm_in_fault_mode(vm)) &&
+			    !op->map.is_system_allocator)
 				xe_vma_ops_incr_pt_update_ops(vops,
 							      op->tile_mask);
 			break;
@@ -2189,22 +2201,25 @@ static int vm_bind_ioctl_ops_parse(struct xe_vm *vm, struct drm_gpuva_ops *ops,
 		{
 			struct xe_vma *old =
 				gpuva_to_vma(op->base.remap.unmap->va);
+			bool skip = xe_vma_is_system_allocator(old);
 
 			op->tile_mask = old->tile_present;
 			op->remap.start = xe_vma_start(old);
 			op->remap.range = xe_vma_size(old);
 
-			if (op->base.remap.prev) {
-				flags |= op->base.remap.unmap->va->flags &
-					XE_VMA_READ_ONLY ?
-					VMA_CREATE_FLAG_READ_ONLY : 0;
-				flags |= op->base.remap.unmap->va->flags &
-					DRM_GPUVA_SPARSE ?
-					VMA_CREATE_FLAG_IS_NULL : 0;
-				flags |= op->base.remap.unmap->va->flags &
-					XE_VMA_DUMPABLE ?
-					VMA_CREATE_FLAG_DUMPABLE : 0;
+			flags |= op->base.remap.unmap->va->flags &
+				XE_VMA_READ_ONLY ?
+				VMA_CREATE_FLAG_READ_ONLY : 0;
+			flags |= op->base.remap.unmap->va->flags &
+				DRM_GPUVA_SPARSE ?
+				VMA_CREATE_FLAG_IS_NULL : 0;
+			flags |= op->base.remap.unmap->va->flags &
+				XE_VMA_DUMPABLE ?
+				VMA_CREATE_FLAG_DUMPABLE : 0;
+			flags |= xe_vma_is_system_allocator(old) ?
+				VMA_CREATE_FLAG_IS_SYSTEM_ALLOCATOR : 0;
 
+			if (op->base.remap.prev) {
 				vma = new_vma(vm, op->base.remap.prev,
 					      old->pat_index, flags);
 				if (IS_ERR(vma))
@@ -2216,9 +2231,10 @@ static int vm_bind_ioctl_ops_parse(struct xe_vm *vm, struct drm_gpuva_ops *ops,
 				 * Userptr creates a new SG mapping so
 				 * we must also rebind.
 				 */
-				op->remap.skip_prev = !xe_vma_is_userptr(old) &&
+				op->remap.skip_prev = skip ||
+					(!xe_vma_is_userptr(old) &&
 					IS_ALIGNED(xe_vma_end(vma),
-						   xe_vma_max_pte_size(old));
+						   xe_vma_max_pte_size(old)));
 				if (op->remap.skip_prev) {
 					xe_vma_set_pte_size(vma, xe_vma_max_pte_size(old));
 					op->remap.range -=
@@ -2234,16 +2250,6 @@ static int vm_bind_ioctl_ops_parse(struct xe_vm *vm, struct drm_gpuva_ops *ops,
 			}
 
 			if (op->base.remap.next) {
-				flags |= op->base.remap.unmap->va->flags &
-					XE_VMA_READ_ONLY ?
-					VMA_CREATE_FLAG_READ_ONLY : 0;
-				flags |= op->base.remap.unmap->va->flags &
-					DRM_GPUVA_SPARSE ?
-					VMA_CREATE_FLAG_IS_NULL : 0;
-				flags |= op->base.remap.unmap->va->flags &
-					XE_VMA_DUMPABLE ?
-					VMA_CREATE_FLAG_DUMPABLE : 0;
-
 				vma = new_vma(vm, op->base.remap.next,
 					      old->pat_index, flags);
 				if (IS_ERR(vma))
@@ -2255,9 +2261,10 @@ static int vm_bind_ioctl_ops_parse(struct xe_vm *vm, struct drm_gpuva_ops *ops,
 				 * Userptr creates a new SG mapping so
 				 * we must also rebind.
 				 */
-				op->remap.skip_next = !xe_vma_is_userptr(old) &&
+				op->remap.skip_next = skip ||
+					(!xe_vma_is_userptr(old) &&
 					IS_ALIGNED(xe_vma_start(vma),
-						   xe_vma_max_pte_size(old));
+						   xe_vma_max_pte_size(old)));
 				if (op->remap.skip_next) {
 					xe_vma_set_pte_size(vma, xe_vma_max_pte_size(old));
 					op->remap.range -=
@@ -2270,7 +2277,8 @@ static int vm_bind_ioctl_ops_parse(struct xe_vm *vm, struct drm_gpuva_ops *ops,
 					xe_vma_ops_incr_pt_update_ops(vops, op->tile_mask);
 				}
 			}
-			xe_vma_ops_incr_pt_update_ops(vops, op->tile_mask);
+			if (!skip)
+				xe_vma_ops_incr_pt_update_ops(vops, op->tile_mask);
 			break;
 		}
 		case DRM_GPUVA_OP_UNMAP:
@@ -2278,13 +2286,19 @@ static int vm_bind_ioctl_ops_parse(struct xe_vm *vm, struct drm_gpuva_ops *ops,
 			struct xe_vma *vma = gpuva_to_vma(op->base.unmap.va);
 
 			op->tile_mask = vma->tile_present;
-			xe_vma_ops_incr_pt_update_ops(vops, op->tile_mask);
+			if (!xe_vma_is_system_allocator(vma))
+				xe_vma_ops_incr_pt_update_ops(vops, op->tile_mask);
 			break;
 		}
 		case DRM_GPUVA_OP_PREFETCH:
+		{
+			struct xe_vma *vma = gpuva_to_vma(op->base.prefetch.va);
+
 			/* FIXME: Need to skip some prefetch ops */
-			xe_vma_ops_incr_pt_update_ops(vops, op->tile_mask);
+			if (!xe_vma_is_system_allocator(vma))
+				xe_vma_ops_incr_pt_update_ops(vops, op->tile_mask);
 			break;
+		}
 		default:
 			drm_warn(&vm->xe->drm, "NOT POSSIBLE");
 		}
@@ -2715,22 +2729,31 @@ static int vm_bind_ioctl_ops_execute(struct xe_vm *vm,
 }
 
 #ifdef TEST_VM_OPS_ERROR
-#define SUPPORTED_FLAGS	(FORCE_OP_ERROR | DRM_XE_VM_BIND_FLAG_NULL | \
-	 DRM_XE_VM_BIND_FLAG_DUMPABLE)
+#define SUPPORTED_FLAGS	(FORCE_OP_ERROR | \
+			 DRM_XE_VM_BIND_FLAG_SYSTEM_ALLOCATOR | \
+			 DRM_XE_VM_BIND_FLAG_NULL | \
+			 DRM_XE_VM_BIND_FLAG_DUMPABLE)
 #else
 #define SUPPORTED_FLAGS	(DRM_XE_VM_BIND_FLAG_NULL | \
-	 DRM_XE_VM_BIND_FLAG_DUMPABLE)
+			 DRM_XE_VM_BIND_FLAG_SYSTEM_ALLOCATOR | \
+			 DRM_XE_VM_BIND_FLAG_DUMPABLE)
 #endif
 #define XE_64K_PAGE_MASK 0xffffull
 #define ALL_DRM_XE_SYNCS_FLAGS (DRM_XE_SYNCS_FLAG_WAIT_FOR_OP)
 
 static int vm_bind_ioctl_check_args(struct xe_device *xe,
+				    struct xe_file *xef,
 				    struct drm_xe_vm_bind *args,
-				    struct drm_xe_vm_bind_op **bind_ops)
+				    struct drm_xe_vm_bind_op **bind_ops,
+				    struct xe_vm **vm)
 {
 	int err;
 	int i;
 
+	*vm = xe_vm_lookup(xef, args->vm_id);
+	if (XE_IOCTL_DBG(xe, !*vm))
+		return -EINVAL;
+
 	if (XE_IOCTL_DBG(xe, args->pad || args->pad2) ||
 	    XE_IOCTL_DBG(xe, args->reserved[0] || args->reserved[1]))
 		return -EINVAL;
@@ -2768,9 +2791,16 @@ static int vm_bind_ioctl_check_args(struct xe_device *xe,
 		u64 obj_offset = (*bind_ops)[i].obj_offset;
 		u32 prefetch_region = (*bind_ops)[i].prefetch_mem_region_instance;
 		bool is_null = flags & DRM_XE_VM_BIND_FLAG_NULL;
+		bool is_system_allocator = flags &
+			DRM_XE_VM_BIND_FLAG_SYSTEM_ALLOCATOR;
 		u16 pat_index = (*bind_ops)[i].pat_index;
 		u16 coh_mode;
 
+		if (is_system_allocator && !xe_vm_in_fault_mode(*vm)) {
+			err = -EINVAL;
+			goto free_bind_ops;
+		}
+
 		if (XE_IOCTL_DBG(xe, pat_index >= xe->pat.n_entries)) {
 			err = -EINVAL;
 			goto free_bind_ops;
@@ -2791,13 +2821,14 @@ static int vm_bind_ioctl_check_args(struct xe_device *xe,
 
 		if (XE_IOCTL_DBG(xe, op > DRM_XE_VM_BIND_OP_PREFETCH) ||
 		    XE_IOCTL_DBG(xe, flags & ~SUPPORTED_FLAGS) ||
-		    XE_IOCTL_DBG(xe, obj && is_null) ||
-		    XE_IOCTL_DBG(xe, obj_offset && is_null) ||
+		    XE_IOCTL_DBG(xe, obj && (is_null || is_system_allocator)) ||
+		    XE_IOCTL_DBG(xe, obj_offset &&
+				 (is_null || is_system_allocator)) ||
 		    XE_IOCTL_DBG(xe, op != DRM_XE_VM_BIND_OP_MAP &&
-				 is_null) ||
+				 (is_null || is_system_allocator)) ||
 		    XE_IOCTL_DBG(xe, !obj &&
 				 op == DRM_XE_VM_BIND_OP_MAP &&
-				 !is_null) ||
+				 !is_null && !is_system_allocator) ||
 		    XE_IOCTL_DBG(xe, !obj &&
 				 op == DRM_XE_VM_BIND_OP_UNMAP_ALL) ||
 		    XE_IOCTL_DBG(xe, addr &&
@@ -2878,7 +2909,7 @@ int xe_vm_bind_ioctl(struct drm_device *dev, void *data, struct drm_file *file)
 	int err;
 	int i;
 
-	err = vm_bind_ioctl_check_args(xe, args, &bind_ops);
+	err = vm_bind_ioctl_check_args(xe, xef, args, &bind_ops, &vm);
 	if (err)
 		return err;
 
@@ -2895,12 +2926,6 @@ int xe_vm_bind_ioctl(struct drm_device *dev, void *data, struct drm_file *file)
 		}
 	}
 
-	vm = xe_vm_lookup(xef, args->vm_id);
-	if (XE_IOCTL_DBG(xe, !vm)) {
-		err = -EINVAL;
-		goto put_exec_queue;
-	}
-
 	err = down_write_killable(&vm->lock);
 	if (err)
 		goto put_vm;
@@ -3151,6 +3176,7 @@ int xe_vm_invalidate_vma(struct xe_vma *vma)
 	int ret;
 
 	xe_assert(xe, !xe_vma_is_null(vma));
+	xe_assert(xe, !xe_vma_is_system_allocator(vma));
 	trace_xe_vma_invalidate(vma);
 
 	/* Check that we don't race with page-table updates */
@@ -3215,8 +3241,9 @@ int xe_analyze_vm(struct drm_printer *p, struct xe_vm *vm, int gt_id)
 		struct xe_vma *vma = gpuva_to_vma(gpuva);
 		bool is_userptr = xe_vma_is_userptr(vma);
 		bool is_null = xe_vma_is_null(vma);
+		bool is_system_allocator = xe_vma_is_system_allocator(vma);
 
-		if (is_null) {
+		if (is_null || is_system_allocator) {
 			addr = 0;
 		} else if (is_userptr) {
 			struct sg_table *sg = to_userptr_vma(vma)->userptr.sg;
@@ -3235,7 +3262,8 @@ int xe_analyze_vm(struct drm_printer *p, struct xe_vm *vm, int gt_id)
 		drm_printf(p, " [%016llx-%016llx] S:0x%016llx A:%016llx %s\n",
 			   xe_vma_start(vma), xe_vma_end(vma) - 1,
 			   xe_vma_size(vma),
-			   addr, is_null ? "NULL" : is_userptr ? "USR" :
+			   addr, is_system_allocator ? "SYSTEM ALLOCATOR" :
+			   is_null ? "NULL" : is_userptr ? "USR" :
 			   is_vram ? "VRAM" : "SYS");
 	}
 	up_read(&vm->lock);
diff --git a/drivers/gpu/drm/xe/xe_vm.h b/drivers/gpu/drm/xe/xe_vm.h
index 492237b60341..6e5470a409fc 100644
--- a/drivers/gpu/drm/xe/xe_vm.h
+++ b/drivers/gpu/drm/xe/xe_vm.h
@@ -150,6 +150,11 @@ static inline bool xe_vma_is_null(struct xe_vma *vma)
 	return vma->gpuva.flags & DRM_GPUVA_SPARSE;
 }
 
+static inline bool xe_vma_is_system_allocator(struct xe_vma *vma)
+{
+	return vma->gpuva.flags & XE_VMA_SYSTEM_ALLOCATOR;
+}
+
 static inline bool xe_vma_has_no_bo(struct xe_vma *vma)
 {
 	return !xe_vma_bo(vma);
@@ -157,7 +162,8 @@ static inline bool xe_vma_has_no_bo(struct xe_vma *vma)
 
 static inline bool xe_vma_is_userptr(struct xe_vma *vma)
 {
-	return xe_vma_has_no_bo(vma) && !xe_vma_is_null(vma);
+	return xe_vma_has_no_bo(vma) && !xe_vma_is_null(vma) &&
+		!xe_vma_is_system_allocator(vma);
 }
 
 /**
diff --git a/drivers/gpu/drm/xe/xe_vm_types.h b/drivers/gpu/drm/xe/xe_vm_types.h
index 2bb76adf66a1..e5d12bf4cf87 100644
--- a/drivers/gpu/drm/xe/xe_vm_types.h
+++ b/drivers/gpu/drm/xe/xe_vm_types.h
@@ -45,6 +45,7 @@ struct xe_vm_pgtable_update_op;
 #define XE_VMA_PTE_64K		(DRM_GPUVA_USERBITS << 8)
 #define XE_VMA_PTE_COMPACT	(DRM_GPUVA_USERBITS << 9)
 #define XE_VMA_DUMPABLE		(DRM_GPUVA_USERBITS << 10)
+#define XE_VMA_SYSTEM_ALLOCATOR	(DRM_GPUVA_USERBITS << 11)
 
 /** struct xe_userptr - User pointer */
 struct xe_userptr {
@@ -141,6 +142,8 @@ struct xe_vma_op_map {
 	bool immediate;
 	/** @is_null: is NULL binding */
 	bool is_null;
+	/** @is_system_allocator: is system allocator binding */
+	bool is_system_allocator;
 	/** @dumpable: whether BO is dumped on GPU hang */
 	bool dumpable;
 	/** @pat_index: The pat index to use for this operation. */
diff --git a/include/uapi/drm/xe_drm.h b/include/uapi/drm/xe_drm.h
index 2fc19177d2b0..50ab31d59fe2 100644
--- a/include/uapi/drm/xe_drm.h
+++ b/include/uapi/drm/xe_drm.h
@@ -869,6 +869,12 @@ struct drm_xe_vm_destroy {
  *    will only be valid for DRM_XE_VM_BIND_OP_MAP operations, the BO
  *    handle MBZ, and the BO offset MBZ. This flag is intended to
  *    implement VK sparse bindings.
+ *  - %DRM_XE_VM_BIND_FLAG_SYSTEM_ALLOCATOR - When the system allocator flag is
+ *    set, no mappings are created; rather, the range is reserved for system
+ *    allocations which will be populated on GPU page faults. Only valid on VMs
+ *    with DRM_XE_VM_CREATE_FLAG_FAULT_MODE set. The system allocator flag is
+ *    only valid for DRM_XE_VM_BIND_OP_MAP operations, the BO handle MBZ, and
+ *    the BO offset MBZ.
  */
 struct drm_xe_vm_bind_op {
 	/** @extensions: Pointer to the first extension struct, if any */
@@ -921,7 +927,9 @@ struct drm_xe_vm_bind_op {
 	 * on the @pat_index. For such mappings there is no actual memory being
 	 * mapped (the address in the PTE is invalid), so the various PAT memory
 	 * attributes likely do not apply.  Simply leaving as zero is one
-	 * option (still a valid pat_index).
+	 * option (still a valid pat_index). Same applies to
+	 * DRM_XE_VM_BIND_FLAG_SYSTEM_ALLOCATOR bindings, as for such mappings
+	 * there is no actual memory being mapped.
 	 */
 	__u16 pat_index;
 
@@ -955,8 +963,9 @@ struct drm_xe_vm_bind_op {
 	/** @op: Bind operation to perform */
 	__u32 op;
 
-#define DRM_XE_VM_BIND_FLAG_NULL	(1 << 2)
-#define DRM_XE_VM_BIND_FLAG_DUMPABLE	(1 << 3)
+#define DRM_XE_VM_BIND_FLAG_NULL		(1 << 2)
+#define DRM_XE_VM_BIND_FLAG_DUMPABLE		(1 << 3)
+#define DRM_XE_VM_BIND_FLAG_SYSTEM_ALLOCATOR	(1 << 4)
 	/** @flags: Bind flags */
 	__u32 flags;
 
-- 
2.26.3


^ permalink raw reply related	[flat|nested] 72+ messages in thread

* [v2 07/31] drm/xe: Create userptr if page fault occurs on system_allocator VMA
  2024-04-09 20:17 [v2 00/31] Basic system allocator support in xe driver Oak Zeng
                   ` (5 preceding siblings ...)
  2024-04-09 20:17 ` [v2 06/31] drm/xe/uapi: Add DRM_XE_VM_BIND_FLAG_SYSTEM_ALLOCATOR flag Oak Zeng
@ 2024-04-09 20:17 ` Oak Zeng
  2024-04-09 20:17 ` [v2 08/31] drm/xe: Add faulted userptr VMA garbage collector Oak Zeng
                   ` (24 subsequent siblings)
  31 siblings, 0 replies; 72+ messages in thread
From: Oak Zeng @ 2024-04-09 20:17 UTC (permalink / raw)
  To: intel-xe
  Cc: himal.prasad.ghimiray, krishnaiah.bommu, matthew.brost,
	Thomas.Hellstrom, brian.welty

From: Matthew Brost <matthew.brost@intel.com>

If a page fault occurs on a system_allocator VMA, create a userptr VMA
to replace the faulted region and map it to the GPU.

v1: Pass the userptr address as the req_offset of the sm_map_ops_create
    function. This fixes failures with malloc'd memory (Oak)

Signed-off-by: Matthew Brost <matthew.brost@intel.com>
Signed-off-by: Oak Zeng <oak.zeng@intel.com>
---
 drivers/gpu/drm/xe/xe_gt_pagefault.c |  13 +++
 drivers/gpu/drm/xe/xe_vm.c           | 115 +++++++++++++++++++++++++--
 drivers/gpu/drm/xe/xe_vm.h           |   2 +
 drivers/gpu/drm/xe/xe_vm_types.h     |   3 +
 4 files changed, 128 insertions(+), 5 deletions(-)

diff --git a/drivers/gpu/drm/xe/xe_gt_pagefault.c b/drivers/gpu/drm/xe/xe_gt_pagefault.c
index c49b1409e168..c9c2f15d9f5b 100644
--- a/drivers/gpu/drm/xe/xe_gt_pagefault.c
+++ b/drivers/gpu/drm/xe/xe_gt_pagefault.c
@@ -166,6 +166,19 @@ static int handle_pagefault(struct xe_gt *gt, struct pagefault *pf)
 		goto unlock_vm;
 	}
 
+	/*
+	 * Create userptr VMA if fault occurs in a range reserved for system
+	 * allocator.
+	 */
+	if (xe_vma_is_system_allocator(vma)) {
+		vma = xe_vm_fault_userptr(vm, pf->page_addr);
+		if (IS_ERR(vma)) {
+			xe_vm_kill(vm, true);
+			ret = PTR_ERR(vma);
+			goto unlock_vm;
+		}
+	}
+
 	if (!xe_vma_is_userptr(vma) ||
 	    !xe_vma_userptr_check_repin(to_userptr_vma(vma))) {
 		downgrade_write(&vm->lock);
diff --git a/drivers/gpu/drm/xe/xe_vm.c b/drivers/gpu/drm/xe/xe_vm.c
index d31d067d2e8b..1ae7f4160061 100644
--- a/drivers/gpu/drm/xe/xe_vm.c
+++ b/drivers/gpu/drm/xe/xe_vm.c
@@ -1411,6 +1411,7 @@ struct xe_vm *xe_vm_create(struct xe_device *xe, u32 flags)
 		return ERR_PTR(-ENOMEM);
 
 	vm->xe = xe;
+	vm->mm = current->mm;
 
 	vm->size = 1ull << xe->info.va_bits;
 
@@ -2151,9 +2152,11 @@ static int xe_vma_op_commit(struct xe_vm *vm, struct xe_vma_op *op)
 	return err;
 }
 
-static int vm_bind_ioctl_ops_parse(struct xe_vm *vm, struct drm_gpuva_ops *ops,
-				   struct xe_sync_entry *syncs, u32 num_syncs,
-				   struct xe_vma_ops *vops)
+static int vm_bind_ioctl_ops_update_gpuvm_state(struct xe_vm *vm,
+						struct drm_gpuva_ops *ops,
+						struct xe_sync_entry *syncs,
+						u32 num_syncs,
+						struct xe_vma_ops *vops)
 {
 	struct xe_device *xe = vm->xe;
 	struct drm_gpuva_op *__op;
@@ -3069,8 +3072,8 @@ int xe_vm_bind_ioctl(struct drm_device *dev, void *data, struct drm_file *file)
 			goto unwind_ops;
 		}
 
-		err = vm_bind_ioctl_ops_parse(vm, ops[i], syncs, num_syncs,
-					      &vops);
+		err = vm_bind_ioctl_ops_update_gpuvm_state(vm, ops[i], syncs,
+							   num_syncs, &vops);
 		if (err)
 			goto unwind_ops;
 
@@ -3438,3 +3441,105 @@ void xe_vm_snapshot_free(struct xe_vm_snapshot *snap)
 	}
 	kvfree(snap);
 }
+
+/**
+ * xe_vm_fault_userptr() - VM fault userptr
+ * @vm: VM
+ * @fault_addr: fault address
+ *
+ * Create userptr VMA from fault address
+ *
+ * Return: newly created userptr VMA on success, ERR_PTR on failure
+ */
+struct xe_vma *xe_vm_fault_userptr(struct xe_vm *vm, u64 fault_addr)
+{
+	struct vm_area_struct *vas;
+	struct mm_struct *mm = vm->mm;
+	struct xe_vma_ops vops;
+	struct drm_gpuva_ops *ops = NULL;
+	struct drm_gpuva_op *__op;
+	struct xe_vma *vma = NULL;
+	u64 start, range;
+	int err;
+
+	vm_dbg(&vm->xe->drm, "FAULT: addr=0x%016llx", fault_addr);
+
+	if (!mmget_not_zero(mm))
+		return ERR_PTR(-EFAULT);
+
+	kthread_use_mm(mm);
+
+	mmap_read_lock(mm);
+	vas = find_vma_intersection(mm, fault_addr, fault_addr + 4);
+	if (!vas) {
+		err = -ENOENT;
+		goto err_unlock;
+	}
+
+	vm_dbg(&vm->xe->drm, "FOUND VAS: vm_start=0x%016lx, vm_end=0x%016lx",
+	       vas->vm_start, vas->vm_end);
+
+	start = vas->vm_start;
+	range = vas->vm_end - vas->vm_start;
+	mmap_read_unlock(mm);
+
+	ops = drm_gpuvm_sm_map_ops_create(&vm->gpuvm, start, range, 0, start);
+	if (IS_ERR(ops)) {
+		err = PTR_ERR(ops);
+		goto err_kthread;
+	}
+
+	drm_gpuva_for_each_op(__op, ops)
+		print_op(vm->xe, __op);
+
+	xe_vma_ops_init(&vops, vm, NULL, NULL, 0);
+	err = vm_bind_ioctl_ops_update_gpuvm_state(vm, ops, NULL, 0, &vops);
+	if (err)
+		goto err_kthread;
+
+	/*
+	 * No need to execute ops as we just want to update GPUVM state, page
+	 * fault handler will update GPU page tables. Find VMA that needs GPU
+	 * mapping and return to page fault handler.
+	 */
+	xe_vm_lock(vm, false);
+	drm_gpuva_for_each_op(__op, ops) {
+		struct xe_vma_op *op = gpuva_op_to_vma_op(__op);
+
+		if (__op->op == DRM_GPUVA_OP_MAP) {
+			xe_assert(vm->xe, !vma);
+			vma = op->map.vma;
+		} else if (__op->op == DRM_GPUVA_OP_UNMAP) {
+			xe_vma_destroy(gpuva_to_vma(op->base.unmap.va), NULL);
+		} else if (__op->op == DRM_GPUVA_OP_REMAP) {
+			xe_vma_destroy(gpuva_to_vma(op->base.remap.unmap->va),
+				       NULL);
+		}
+	}
+	xe_vm_unlock(vm);
+
+	kthread_unuse_mm(mm);
+	mmput(mm);
+	drm_gpuva_ops_free(&vm->gpuvm, ops);
+
+	return vma;
+
+err_unlock:
+	mmap_read_unlock(mm);
+err_kthread:
+	kthread_unuse_mm(mm);
+	mmput(mm);
+	if (ops) {
+		drm_gpuva_for_each_op_reverse(__op, ops) {
+			struct xe_vma_op *op = gpuva_op_to_vma_op(__op);
+
+			xe_vma_op_unwind(vm, op,
+					 op->flags & XE_VMA_OP_COMMITTED,
+					 op->flags & XE_VMA_OP_PREV_COMMITTED,
+					 op->flags & XE_VMA_OP_NEXT_COMMITTED);
+		}
+		drm_gpuva_ops_free(&vm->gpuvm, ops);
+	}
+
+	return ERR_PTR(err);
+}
diff --git a/drivers/gpu/drm/xe/xe_vm.h b/drivers/gpu/drm/xe/xe_vm.h
index 6e5470a409fc..97d38daf0e9a 100644
--- a/drivers/gpu/drm/xe/xe_vm.h
+++ b/drivers/gpu/drm/xe/xe_vm.h
@@ -244,6 +244,8 @@ int xe_vma_userptr_pin_pages(struct xe_userptr_vma *uvma);
 
 int xe_vma_userptr_check_repin(struct xe_userptr_vma *uvma);
 
+struct xe_vma *xe_vm_fault_userptr(struct xe_vm *vm, u64 fault_addr);
+
 bool xe_vm_validate_should_retry(struct drm_exec *exec, int err, ktime_t *end);
 
 int xe_analyze_vm(struct drm_printer *p, struct xe_vm *vm, int gt_id);
diff --git a/drivers/gpu/drm/xe/xe_vm_types.h b/drivers/gpu/drm/xe/xe_vm_types.h
index e5d12bf4cf87..cb67a3918990 100644
--- a/drivers/gpu/drm/xe/xe_vm_types.h
+++ b/drivers/gpu/drm/xe/xe_vm_types.h
@@ -233,6 +233,9 @@ struct xe_vm {
 
 	struct xe_device *xe;
 
+	/** @mm: user MM of VM */
+	struct mm_struct *mm;
+
 	/* exec queue used for (un)binding vma's */
 	struct xe_exec_queue *q;
 
-- 
2.26.3


^ permalink raw reply related	[flat|nested] 72+ messages in thread

* [v2 08/31] drm/xe: Add faulted userptr VMA garbage collector
  2024-04-09 20:17 [v2 00/31] Basic system allocator support in xe driver Oak Zeng
                   ` (6 preceding siblings ...)
  2024-04-09 20:17 ` [v2 07/31] drm/xe: Create userptr if page fault occurs on system_allocator VMA Oak Zeng
@ 2024-04-09 20:17 ` Oak Zeng
  2024-04-09 20:17 ` [v2 09/31] drm/xe: Introduce helper to populate userptr Oak Zeng
                   ` (23 subsequent siblings)
  31 siblings, 0 replies; 72+ messages in thread
From: Oak Zeng @ 2024-04-09 20:17 UTC (permalink / raw)
  To: intel-xe
  Cc: himal.prasad.ghimiray, krishnaiah.bommu, matthew.brost,
	Thomas.Hellstrom, brian.welty

From: Matthew Brost <matthew.brost@intel.com>

When a faulted userptr VMA (allocated by the page fault handler) is
invalidated, add it to a list from which a garbage collector will unmap
it from the GPU, destroy the faulted userptr VMA, and replace it with a
system_allocator VMA.

v1: Run garbage collector only on the MMU_NOTIFY_UNMAP event. For other
    events, we just invalidate the GPU page table but keep the VMA
    because the userptr still exists. On the next GPU access, we will
    revalidate and rebind this userptr to the GPU (Oak)

Signed-off-by: Matthew Brost <matthew.brost@intel.com>
Signed-off-by: Oak Zeng <oak.zeng@intel.com>
---
 drivers/gpu/drm/xe/xe_gt_pagefault.c |   6 ++
 drivers/gpu/drm/xe/xe_vm.c           | 151 +++++++++++++++++++++++++++
 drivers/gpu/drm/xe/xe_vm.h           |   1 +
 drivers/gpu/drm/xe/xe_vm_types.h     |  12 +++
 4 files changed, 170 insertions(+)

diff --git a/drivers/gpu/drm/xe/xe_gt_pagefault.c b/drivers/gpu/drm/xe/xe_gt_pagefault.c
index c9c2f15d9f5b..707a3466f36b 100644
--- a/drivers/gpu/drm/xe/xe_gt_pagefault.c
+++ b/drivers/gpu/drm/xe/xe_gt_pagefault.c
@@ -154,12 +154,18 @@ static int handle_pagefault(struct xe_gt *gt, struct pagefault *pf)
 		return -EINVAL;
 
 retry_userptr:
+	xe_vm_userptr_garbage_collector(vm);
+
 	/*
 	 * TODO: Avoid exclusive lock if VM doesn't have userptrs, or
 	 * start out read-locked?
 	 */
 	down_write(&vm->lock);
 	write_locked = true;
+	if (xe_vm_is_closed_or_banned(vm)) {
+		ret = -ENOENT;
+		goto unlock_vm;
+	}
 	vma = lookup_vma(vm, pf->page_addr);
 	if (!vma) {
 		ret = -EINVAL;
diff --git a/drivers/gpu/drm/xe/xe_vm.c b/drivers/gpu/drm/xe/xe_vm.c
index 1ae7f4160061..95dda229a9fe 100644
--- a/drivers/gpu/drm/xe/xe_vm.c
+++ b/drivers/gpu/drm/xe/xe_vm.c
@@ -692,6 +692,18 @@ static bool vma_userptr_invalidate(struct mmu_interval_notifier *mni,
 		XE_WARN_ON(err);
 	}
 
+	if (range->event == MMU_NOTIFY_UNMAP &&
+	    vma->gpuva.flags & XE_VMA_FAULT_USERPTR &&
+	    !xe_vm_is_closed(vm) && !xe_vm_is_banned(vm) &&
+	    !(vma->gpuva.flags & XE_VMA_DESTROYED) && vma->tile_present) {
+		spin_lock(&vm->userptr.invalidated_lock);
+		list_move_tail(&userptr->invalidate_link,
+			       &vm->userptr.fault_invalidated);
+		spin_unlock(&vm->userptr.invalidated_lock);
+
+		queue_work(system_wq, &vm->userptr.garbage_collector);
+	}
+
 	trace_xe_vma_userptr_invalidate_complete(vma);
 
 	return true;
@@ -1398,6 +1410,8 @@ static void xe_vma_ops_incr_pt_update_ops(struct xe_vma_ops *vops, u8 tile_mask)
 			++vops->pt_update_ops[i].num_ops;
 }
 
+static void vm_userptr_garbage_collector(struct work_struct *w);
+
 struct xe_vm *xe_vm_create(struct xe_device *xe, u32 flags)
 {
 	struct drm_gem_object *vm_resv_obj;
@@ -1430,8 +1444,10 @@ struct xe_vm *xe_vm_create(struct xe_device *xe, u32 flags)
 
 	INIT_LIST_HEAD(&vm->userptr.repin_list);
 	INIT_LIST_HEAD(&vm->userptr.invalidated);
+	INIT_LIST_HEAD(&vm->userptr.fault_invalidated);
 	init_rwsem(&vm->userptr.notifier_lock);
 	spin_lock_init(&vm->userptr.invalidated_lock);
+	INIT_WORK(&vm->userptr.garbage_collector, vm_userptr_garbage_collector);
 
 	INIT_WORK(&vm->destroy_work, vm_destroy_work_func);
 
@@ -1568,6 +1584,8 @@ void xe_vm_close_and_put(struct xe_vm *vm)
 	xe_vm_close(vm);
 	if (xe_vm_in_preempt_fence_mode(vm))
 		flush_work(&vm->preempt.rebind_work);
+	if (xe_vm_in_fault_mode(vm))
+		flush_work(&vm->userptr.garbage_collector);
 
 	if (vm->q) {
 		down_write(&vm->lock);
@@ -3509,6 +3527,7 @@ struct xe_vma *xe_vm_fault_userptr(struct xe_vm *vm, u64 fault_addr)
 		if (__op->op == DRM_GPUVA_OP_MAP) {
 			xe_assert(vm->xe, !vma);
 			vma = op->map.vma;
+			vma->gpuva.flags |= XE_VMA_FAULT_USERPTR;
 		} else if (__op->op == DRM_GPUVA_OP_UNMAP) {
 			xe_vma_destroy(gpuva_to_vma(op->base.unmap.va), NULL);
 		} else if (__op->op == DRM_GPUVA_OP_REMAP) {
@@ -3543,3 +3562,135 @@ struct xe_vma *xe_vm_fault_userptr(struct xe_vm *vm, u64 fault_addr)
 
 	return ERR_PTR(err);
 }
+
+static int
+vm_userptr_garbage_collector_destroy_uvma(struct xe_vm *vm,
+					  struct xe_userptr_vma *uvma)
+{
+	struct mm_struct *mm = vm->mm;
+	struct xe_vma_ops vops;
+	struct drm_gpuva_ops *ops = NULL;
+	struct drm_gpuva_op *__op;
+	struct xe_tile *tile;
+	u8 id;
+	int err;
+
+	vm_dbg(&vm->xe->drm, "GARBAGE COLLECTOR: addr=0x%016llx, range=0x%016llx",
+	       xe_vma_start(&uvma->vma), xe_vma_size(&uvma->vma));
+
+	xe_assert(vm->xe, uvma->vma.gpuva.flags & XE_VMA_FAULT_USERPTR);
+	lockdep_assert_held_write(&vm->lock);
+
+	if (!mmget_not_zero(mm))
+		return -EFAULT;
+
+	kthread_use_mm(mm);
+
+	/* Blow away xe_userptr_vma with system_allocator VMA */
+	ops = drm_gpuvm_sm_map_ops_create(&vm->gpuvm,
+					  xe_vma_start(&uvma->vma),
+					  xe_vma_size(&uvma->vma), 0, 0);
+	if (IS_ERR(ops)) {
+		err = PTR_ERR(ops);
+		goto err_kthread;
+	}
+
+	drm_gpuva_for_each_op(__op, ops) {
+		struct xe_vma_op *op = gpuva_op_to_vma_op(__op);
+
+		if (__op->op == DRM_GPUVA_OP_MAP) {
+			op->map.immediate = true;
+			op->map.is_system_allocator = true;
+		}
+
+		print_op(vm->xe, __op);
+	}
+
+	xe_vma_ops_init(&vops, vm, NULL, NULL, 0);
+	err = vm_bind_ioctl_ops_update_gpuvm_state(vm, ops, NULL, 0, &vops);
+	if (err)
+		goto err_kthread;
+
+	/*
+	 * Order behind any user operations and use same exec queue as page
+	 * fault handler.
+	 */
+	for_each_tile(tile, vm->xe, id) {
+		vops.pt_update_ops[tile->id].wait_vm_bookkeep = true;
+		vops.pt_update_ops[tile->id].q =
+			xe_tile_migrate_bind_exec_queue(tile);
+	}
+
+	err = xe_vma_ops_alloc(&vops);
+	if (err)
+		goto err_kthread;
+
+	err = vm_bind_ioctl_ops_execute(vm, &vops);
+
+	xe_vma_ops_free(&vops);
+	kthread_unuse_mm(mm);
+	mmput(mm);
+	drm_gpuva_ops_free(&vm->gpuvm, ops);
+
+	return err;
+
+err_kthread:
+	kthread_unuse_mm(mm);
+	mmput(mm);
+	if (ops)
+		drm_gpuva_ops_free(&vm->gpuvm, ops);
+
+	return err;
+}
+
+static void vm_userptr_garbage_collector(struct work_struct *w)
+{
+	struct xe_vm *vm =
+		container_of(w, struct xe_vm, userptr.garbage_collector);
+	struct xe_userptr_vma *uvma, *next;
+	int err;
+
+	xe_assert(vm->xe, xe_vm_in_fault_mode(vm));
+
+	down_write(&vm->lock);
+
+	if (xe_vm_is_closed_or_banned(vm))
+		goto unlock;
+
+	/*
+	 * FIXME: Could create 1 set of VMA ops for all VMAs on
+	 * fault_invalidated list
+	 */
+
+	spin_lock(&vm->userptr.invalidated_lock);
+	list_for_each_entry_safe(uvma, next, &vm->userptr.fault_invalidated,
+				 userptr.invalidate_link) {
+		list_del_init(&uvma->userptr.invalidate_link);
+		spin_unlock(&vm->userptr.invalidated_lock);
+
+		err = vm_userptr_garbage_collector_destroy_uvma(vm, uvma);
+		if (err) {
+			XE_WARN_ON("Garbage collection failed, killing VM");
+			xe_vm_kill(vm, true);
+		}
+
+		spin_lock(&vm->userptr.invalidated_lock);
+	}
+	spin_unlock(&vm->userptr.invalidated_lock);
+
+unlock:
+	up_write(&vm->lock);
+}
+
+/**
+ * xe_vm_userptr_garbage_collector() - VM userptr garbage collector
+ * @vm: VM
+ *
+ * For all invalidated faulted userptr VMAs (created by page fault handler)
+ * unmap from GPU, destroy faulted userptr VMA, and replace with
+ * system_allocator VMA.
+ */
+void xe_vm_userptr_garbage_collector(struct xe_vm *vm)
+{
+	vm_userptr_garbage_collector(&vm->userptr.garbage_collector);
+}
diff --git a/drivers/gpu/drm/xe/xe_vm.h b/drivers/gpu/drm/xe/xe_vm.h
index 97d38daf0e9a..0b2790f697db 100644
--- a/drivers/gpu/drm/xe/xe_vm.h
+++ b/drivers/gpu/drm/xe/xe_vm.h
@@ -276,6 +276,7 @@ void xe_vma_ops_free(struct xe_vma_ops *vops);
 struct dma_fence *xe_vm_ops_execute(struct xe_vm *vm, struct xe_vma_ops *vops);
 
 void xe_vm_kill(struct xe_vm *vm, bool unlocked);
+void xe_vm_userptr_garbage_collector(struct xe_vm *vm);
 
 #if IS_ENABLED(CONFIG_DRM_XE_DEBUG_VM)
 #define vm_dbg drm_dbg
diff --git a/drivers/gpu/drm/xe/xe_vm_types.h b/drivers/gpu/drm/xe/xe_vm_types.h
index cb67a3918990..fbf6bfcf59a8 100644
--- a/drivers/gpu/drm/xe/xe_vm_types.h
+++ b/drivers/gpu/drm/xe/xe_vm_types.h
@@ -46,6 +46,7 @@ struct xe_vm_pgtable_update_op;
 #define XE_VMA_PTE_COMPACT	(DRM_GPUVA_USERBITS << 9)
 #define XE_VMA_DUMPABLE		(DRM_GPUVA_USERBITS << 10)
 #define XE_VMA_SYSTEM_ALLOCATOR	(DRM_GPUVA_USERBITS << 11)
+#define XE_VMA_FAULT_USERPTR	(DRM_GPUVA_USERBITS << 12)
 
 /** struct xe_userptr - User pointer */
 struct xe_userptr {
@@ -326,6 +327,17 @@ struct xe_vm {
 		 * write mode.
 		 */
 		struct list_head invalidated;
+		/**
+		 * @userptr.fault_invalidated: List of invalidated userptrs,
+		 * created by page fault, which will be destroyed by the garbage
+		 * collector. Protected by the @invalidated_lock.
+		 */
+		struct list_head fault_invalidated;
+		/**
+		 * @userptr.garbage_collector: worker to implement destroying of
+		 * userptrs on @userptr.fault_invalidated list.
+		 */
+		struct work_struct garbage_collector;
 	} userptr;
 
 	/** @preempt: preempt state */
-- 
2.26.3


^ permalink raw reply related	[flat|nested] 72+ messages in thread

* [v2 09/31] drm/xe: Introduce helper to populate userptr
  2024-04-09 20:17 [v2 00/31] Basic system allocator support in xe driver Oak Zeng
                   ` (7 preceding siblings ...)
  2024-04-09 20:17 ` [v2 08/31] drm/xe: Add faulted userptr VMA garbage collector Oak Zeng
@ 2024-04-09 20:17 ` Oak Zeng
  2024-04-09 20:17 ` [v2 10/31] drm/xe: Introduce a helper to free sg table Oak Zeng
                   ` (22 subsequent siblings)
  31 siblings, 0 replies; 72+ messages in thread
From: Oak Zeng @ 2024-04-09 20:17 UTC (permalink / raw)
  To: intel-xe
  Cc: himal.prasad.ghimiray, krishnaiah.bommu, matthew.brost,
	Thomas.Hellstrom, brian.welty

Introduce a helper function xe_userptr_populate_range to populate
a userptr range. This function calls hmm_range_fault to read
CPU page tables and populate all pfns/pages of this virtual address
range.

If the populated page is a system memory page, dma-mapping is
performed to get a dma-address which can be used later for GPU to
access pages.

If the populated page is a device private page, we calculate the dpa
(device physical address) of the page. This will be handled in future
patches.

The dma-address or dpa is then saved in userptr's sg table. This is
preparatory work to replace the get_user_pages_fast code in the
userptr code path.

v1: Address review comments:
    separate a npage_in_range function (Matt)
    reparameterize function xe_userptr_populate_range function (Matt)
    move mmu_interval_read_begin() call into while loop (Thomas)
    s/mark_range_accessed/xe_mark_range_accessed (Thomas)
    use set_page_dirty_lock (vs set_page_dirty) (Thomas)
    move a few checking in xe_vma_userptr_pin_pages to hmm.c (Matt)
v2: Remove device private page support. Only support system
    pages for now. use dma-map-sg rather than dma-map-page (Matt/Thomas)

Signed-off-by: Oak Zeng <oak.zeng@intel.com>
Co-developed-by: Niranjana Vishwanathapura <niranjana.vishwanathapura@intel.com>
Signed-off-by: Niranjana Vishwanathapura <niranjana.vishwanathapura@intel.com>
Cc: Matthew Brost <matthew.brost@intel.com>
Cc: Thomas Hellström <thomas.hellstrom@intel.com>
Cc: Brian Welty <brian.welty@intel.com>
---
 drivers/gpu/drm/xe/Kconfig  |   1 +
 drivers/gpu/drm/xe/Makefile |   2 +
 drivers/gpu/drm/xe/xe_hmm.c | 224 ++++++++++++++++++++++++++++++++++++
 drivers/gpu/drm/xe/xe_hmm.h |  17 +++
 4 files changed, 244 insertions(+)
 create mode 100644 drivers/gpu/drm/xe/xe_hmm.c
 create mode 100644 drivers/gpu/drm/xe/xe_hmm.h

diff --git a/drivers/gpu/drm/xe/Kconfig b/drivers/gpu/drm/xe/Kconfig
index 1a556d087e63..449a1ecbc92a 100644
--- a/drivers/gpu/drm/xe/Kconfig
+++ b/drivers/gpu/drm/xe/Kconfig
@@ -41,6 +41,7 @@ config DRM_XE
 	select MMU_NOTIFIER
 	select WANT_DEV_COREDUMP
 	select AUXILIARY_BUS
+	select HMM_MIRROR
 	help
 	  Experimental driver for Intel Xe series GPUs
 
diff --git a/drivers/gpu/drm/xe/Makefile b/drivers/gpu/drm/xe/Makefile
index bf43a3690e13..fff70fc9a09e 100644
--- a/drivers/gpu/drm/xe/Makefile
+++ b/drivers/gpu/drm/xe/Makefile
@@ -146,6 +146,8 @@ xe-y += xe_bb.o \
 	xe_wa.o \
 	xe_wopcm.o
 
+xe-$(CONFIG_HMM_MIRROR) += xe_hmm.o
+
 # graphics hardware monitoring (HWMON) support
 xe-$(CONFIG_HWMON) += xe_hwmon.o
 
diff --git a/drivers/gpu/drm/xe/xe_hmm.c b/drivers/gpu/drm/xe/xe_hmm.c
new file mode 100644
index 000000000000..4011207630a5
--- /dev/null
+++ b/drivers/gpu/drm/xe/xe_hmm.c
@@ -0,0 +1,224 @@
+// SPDX-License-Identifier: MIT
+/*
+ * Copyright © 2024 Intel Corporation
+ */
+
+#include <linux/mmu_notifier.h>
+#include <linux/dma-mapping.h>
+#include <linux/memremap.h>
+#include <linux/swap.h>
+#include <linux/hmm.h>
+#include <linux/mm.h>
+#include "xe_hmm.h"
+#include "xe_vm.h"
+#include "xe_bo.h"
+
+static u64 xe_npages_in_range(unsigned long start, unsigned long end)
+{
+	return (PAGE_ALIGN(end) - PAGE_ALIGN_DOWN(start)) >> PAGE_SHIFT;
+}
+
+/**
+ * xe_mark_range_accessed() - mark a range as accessed, so the core mm
+ * has such information for memory eviction or writeback to
+ * hard disk
+ *
+ * @range: the range to mark
+ * @write: if we write to this range, we mark pages in this range
+ * as dirty
+ */
+static void xe_mark_range_accessed(struct hmm_range *range, bool write)
+{
+	struct page *page;
+	u64 i, npages;
+
+	npages = xe_npages_in_range(range->start, range->end);
+	for (i = 0; i < npages; i++) {
+		page = hmm_pfn_to_page(range->hmm_pfns[i]);
+		if (write)
+			set_page_dirty_lock(page);
+
+		mark_page_accessed(page);
+	}
+}
+
+/**
+ * xe_build_sg() - build a scatter gather table for all the physical pages/pfn
+ * in an hmm_range. dma-map pages if necessary. dma-address is saved in sg table
+ * and will be used to program GPU page table later.
+ *
+ * @xe: the xe device who will access the dma-address in sg table
+ * @range: the hmm range that we build the sg table from. range->hmm_pfns[]
+ * has the pfn numbers of pages that back up this hmm address range.
+ * @st: pointer to the sg table.
+ * @write: whether we write to this range. This decides dma map direction
+ * for system pages. If write, we map it bidirectional; otherwise
+ * DMA_TO_DEVICE
+ *
+ * All the contiguous pfns will be collapsed into one entry in
+ * the scatter gather table. This is for the purpose of efficiently
+ * programming GPU page table.
+ *
+ * The dma_address in the sg table will later be used by GPU to
+ * access memory. So if the memory is system memory, we need to
+ * do a dma-mapping so it can be accessed by GPU/DMA.
+ *
+ * FIXME: This function currently only supports pages in system
+ * memory. If the memory is GPU local memory (of the GPU who
+ * is going to access memory), we need gpu dpa (device physical
+ * address), and there is no need of dma-mapping. This is TBD.
+ *
+ * FIXME: dma-mapping for peer gpu device to access remote gpu's
+ * memory. Add this when you support p2p
+ *
+ * This function allocates the storage of the sg table. It is
+ * caller's responsibility to free it by calling sg_free_table.
+ *
+ * Returns 0 if successful; -ENOMEM if it fails to allocate memory
+ */
+static int xe_build_sg(struct xe_device *xe, struct hmm_range *range,
+			     struct sg_table *st, bool write)
+{
+	struct device *dev = xe->drm.dev;
+	struct page **pages;
+	u64 i, npages;
+	int ret;
+
+	npages = xe_npages_in_range(range->start, range->end);
+	pages = kvmalloc_array(npages, sizeof(*pages), GFP_KERNEL);
+	if (!pages)
+		return -ENOMEM;
+
+	for (i = 0; i < npages; i++) {
+		pages[i] = hmm_pfn_to_page(range->hmm_pfns[i]);
+		xe_assert(xe, !is_device_private_page(pages[i]));
+	}
+
+	ret = sg_alloc_table_from_pages_segment(st, pages, npages, 0,
+			npages << PAGE_SHIFT, xe_sg_segment_size(dev), GFP_KERNEL);
+	if (ret)
+		goto free_pages;
+
+	ret = dma_map_sgtable(dev, st, write ? DMA_BIDIRECTIONAL : DMA_TO_DEVICE,
+			DMA_ATTR_SKIP_CPU_SYNC | DMA_ATTR_NO_KERNEL_MAPPING);
+
+free_pages:
+	kvfree(pages);
+	return ret;
+}
+
+/**
+ * xe_userptr_populate_range() - Populate physical pages of a virtual
+ * address range
+ *
+ * @uvma: userptr vma which has information of the range to populate.
+ *
+ * This function populates the physical pages of a virtual
+ * address range. The populated physical pages are saved in
+ * userptr's sg table. It is similar to get_user_pages but calls
+ * hmm_range_fault.
+ *
+ * This function also reads the mmu notifier sequence number
+ * (mmu_interval_read_begin), for the purpose of later
+ * comparison (through mmu_interval_read_retry).
+ *
+ * This must be called with mmap read or write lock held.
+ *
+ * This function allocates the storage of the userptr sg table.
+ * It is the caller's responsibility to free it by calling sg_free_table.
+ *
+ * Returns: 0 for success; negative error code on failure
+ */
+int xe_userptr_populate_range(struct xe_userptr_vma *uvma)
+{
+	unsigned long timeout =
+		jiffies + msecs_to_jiffies(HMM_RANGE_DEFAULT_TIMEOUT);
+	unsigned long *pfns, flags = HMM_PFN_REQ_FAULT;
+	struct xe_userptr *userptr;
+	struct xe_vma *vma = &uvma->vma;
+	u64 start = xe_vma_userptr(vma);
+	u64 end = start + xe_vma_size(vma);
+	struct xe_vm *vm = xe_vma_vm(vma);
+	struct hmm_range hmm_range;
+	bool write = !xe_vma_read_only(vma);
+	bool in_kthread = !current->mm;
+	unsigned long notifier_seq;
+	u64 npages;
+	int ret;
+
+	userptr = &uvma->userptr;
+	mmap_assert_locked(userptr->notifier.mm);
+
+	if (vma->gpuva.flags & XE_VMA_DESTROYED)
+		return 0;
+
+	notifier_seq = mmu_interval_read_begin(&userptr->notifier);
+	if (notifier_seq == userptr->notifier_seq)
+		return 0;
+
+	npages = xe_npages_in_range(start, end);
+	pfns = kvmalloc_array(npages, sizeof(*pfns), GFP_KERNEL);
+	if (unlikely(!pfns))
+		return -ENOMEM;
+
+	if (write)
+		flags |= HMM_PFN_REQ_WRITE;
+
+	if (in_kthread) {
+		if (!mmget_not_zero(userptr->notifier.mm)) {
+			ret = -EFAULT;
+			goto free_pfns;
+		}
+		kthread_use_mm(userptr->notifier.mm);
+	}
+
+	memset64((u64 *)pfns, (u64)flags, npages);
+	hmm_range.hmm_pfns = pfns;
+	hmm_range.notifier = &userptr->notifier;
+	hmm_range.start = start;
+	hmm_range.end = end;
+	hmm_range.pfn_flags_mask = HMM_PFN_REQ_FAULT | HMM_PFN_REQ_WRITE;
+	/**
+	 * FIXME:
+	 * Setting dev_private_owner can prevent hmm_range_fault from faulting
+	 * in the device private pages owned by the caller. See function
+	 * hmm_vma_handle_pte. In multiple GPU case, this should be set to the
+	 * device owner of the best migration destination. e.g., device0/vm0
+	 * has a page fault, but we have determined the best placement of
+	 * the fault address should be on device1, we should set below to
+	 * device1 instead of device0.
+	 */
+	hmm_range.dev_private_owner = vm->xe;
+
+	while (true) {
+		hmm_range.notifier_seq = mmu_interval_read_begin(&userptr->notifier);
+		ret = hmm_range_fault(&hmm_range);
+		if (time_after(jiffies, timeout))
+			break;
+
+		if (ret == -EBUSY)
+			continue;
+		break;
+	}
+
+	if (in_kthread) {
+		kthread_unuse_mm(userptr->notifier.mm);
+		mmput(userptr->notifier.mm);
+	}
+
+	if (ret)
+		goto free_pfns;
+
+	ret = xe_build_sg(vm->xe, &hmm_range, &userptr->sgt, write);
+	if (ret)
+		goto free_pfns;
+
+	xe_mark_range_accessed(&hmm_range, write);
+	userptr->sg = &userptr->sgt;
+	userptr->notifier_seq = hmm_range.notifier_seq;
+
+free_pfns:
+	kvfree(pfns);
+	return ret;
+}
+
diff --git a/drivers/gpu/drm/xe/xe_hmm.h b/drivers/gpu/drm/xe/xe_hmm.h
new file mode 100644
index 000000000000..91686a751711
--- /dev/null
+++ b/drivers/gpu/drm/xe/xe_hmm.h
@@ -0,0 +1,17 @@
+/* SPDX-License-Identifier: MIT */
+/*
+ * Copyright © 2024 Intel Corporation
+ */
+
+#include <linux/types.h>
+
+struct xe_userptr_vma;
+
+#if IS_ENABLED(CONFIG_HMM_MIRROR)
+int xe_userptr_populate_range(struct xe_userptr_vma *uvma);
+#else
+static inline int xe_userptr_populate_range(struct xe_userptr_vma *uvma)
+{
+	return -ENODEV;
+}
+#endif
-- 
2.26.3


^ permalink raw reply related	[flat|nested] 72+ messages in thread

* [v2 10/31] drm/xe: Introduce a helper to free sg table
  2024-04-09 20:17 [v2 00/31] Basic system allocator support in xe driver Oak Zeng
                   ` (8 preceding siblings ...)
  2024-04-09 20:17 ` [v2 09/31] drm/xe: Introduce helper to populate userptr Oak Zeng
@ 2024-04-09 20:17 ` Oak Zeng
  2024-04-09 20:17 ` [v2 11/31] drm/xe: Use hmm_range_fault to populate user pages Oak Zeng
                   ` (21 subsequent siblings)
  31 siblings, 0 replies; 72+ messages in thread
From: Oak Zeng @ 2024-04-09 20:17 UTC (permalink / raw)
  To: intel-xe
  Cc: himal.prasad.ghimiray, krishnaiah.bommu, matthew.brost,
	Thomas.Hellstrom, brian.welty

Introduce the xe_userptr_free_sg helper to dma-unmap all
addresses in userptr's sg table and free the sg table.

Signed-off-by: Oak Zeng <oak.zeng@intel.com>
Suggested-by: Matthew Brost <matthew.brost@intel.com>
---
 drivers/gpu/drm/xe/xe_hmm.c | 30 ++++++++++++++++++++++++++++++
 drivers/gpu/drm/xe/xe_hmm.h |  1 +
 2 files changed, 31 insertions(+)

diff --git a/drivers/gpu/drm/xe/xe_hmm.c b/drivers/gpu/drm/xe/xe_hmm.c
index 4011207630a5..427c6bc49949 100644
--- a/drivers/gpu/drm/xe/xe_hmm.c
+++ b/drivers/gpu/drm/xe/xe_hmm.c
@@ -3,6 +3,7 @@
  * Copyright © 2024 Intel Corporation
  */
 
+#include <linux/scatterlist.h>
 #include <linux/mmu_notifier.h>
 #include <linux/dma-mapping.h>
 #include <linux/memremap.h>
@@ -107,6 +108,32 @@ static int xe_build_sg(struct xe_device *xe, struct hmm_range *range,
 	return ret;
 }
 
+/**
+ * xe_userptr_free_sg() - Free the scatter gather table of userptr
+ *
+ * @uvma: the userptr vma which holds the scatter gather table
+ *
+ * With function xe_userptr_populate_range, we allocate storage of
+ * the userptr sg table. This is a helper function to free this
+ * sg table, and dma-unmap the addresses in the table.
+ */
+void xe_userptr_free_sg(struct xe_userptr_vma *uvma)
+{
+	struct xe_userptr *userptr = &uvma->userptr;
+	struct xe_vma *vma = &uvma->vma;
+	bool write = !xe_vma_read_only(vma);
+	struct xe_vm *vm = xe_vma_vm(vma);
+	struct xe_device *xe = vm->xe;
+	struct device *dev = xe->drm.dev;
+
+	xe_assert(xe, userptr->sg);
+	dma_unmap_sgtable(dev, userptr->sg,
+			write ? DMA_BIDIRECTIONAL : DMA_TO_DEVICE, 0);
+
+	sg_free_table(userptr->sg);
+	userptr->sg = NULL;
+}
+
 /**
  * xe_userptr_populate_range() - Populate physical pages of a virtual
  * address range
@@ -156,6 +183,9 @@ int xe_userptr_populate_range(struct xe_userptr_vma *uvma)
 	if (notifier_seq == userptr->notifier_seq)
 		return 0;
 
+	if (userptr->sg)
+		xe_userptr_free_sg(uvma);
+
 	npages = xe_npages_in_range(start, end);
 	pfns = kvmalloc_array(npages, sizeof(*pfns), GFP_KERNEL);
 	if (unlikely(!pfns))
diff --git a/drivers/gpu/drm/xe/xe_hmm.h b/drivers/gpu/drm/xe/xe_hmm.h
index 91686a751711..7bb49bbde5a4 100644
--- a/drivers/gpu/drm/xe/xe_hmm.h
+++ b/drivers/gpu/drm/xe/xe_hmm.h
@@ -15,3 +15,4 @@ static inline int xe_userptr_populate_range(struct xe_userptr_vma *uvma)
 	return -ENODEV;
 }
 #endif
+void xe_userptr_free_sg(struct xe_userptr_vma *uvma);
-- 
2.26.3


^ permalink raw reply related	[flat|nested] 72+ messages in thread

* [v2 11/31] drm/xe: Use hmm_range_fault to populate user pages
  2024-04-09 20:17 [v2 00/31] Basic system allocator support in xe driver Oak Zeng
                   ` (9 preceding siblings ...)
  2024-04-09 20:17 ` [v2 10/31] drm/xe: Introduce a helper to free sg table Oak Zeng
@ 2024-04-09 20:17 ` Oak Zeng
  2024-04-09 20:17 ` [v2 12/31] drm/xe/svm: Remap and provide memmap backing for GPU vram Oak Zeng
                   ` (20 subsequent siblings)
  31 siblings, 0 replies; 72+ messages in thread
From: Oak Zeng @ 2024-04-09 20:17 UTC (permalink / raw)
  To: intel-xe
  Cc: himal.prasad.ghimiray, krishnaiah.bommu, matthew.brost,
	Thomas.Hellstrom, brian.welty

This is an effort to unify hmmptr (aka system allocator)
and userptr code. hmm_range_fault is used to populate
a virtual address range for both hmmptr and userptr,
instead of hmmptr using hmm_range_fault and userptr
using get_user_pages_fast.

This also aligns with the AMD gpu driver's behavior. In the
long term, we plan to move some common helpers in this
area to the drm layer so they can be re-used by different
vendors.

Signed-off-by: Oak Zeng <oak.zeng@intel.com>
---
 drivers/gpu/drm/xe/xe_vm.c | 122 ++++---------------------------------
 1 file changed, 12 insertions(+), 110 deletions(-)

diff --git a/drivers/gpu/drm/xe/xe_vm.c b/drivers/gpu/drm/xe/xe_vm.c
index 95dda229a9fe..61d336f24a65 100644
--- a/drivers/gpu/drm/xe/xe_vm.c
+++ b/drivers/gpu/drm/xe/xe_vm.c
@@ -39,6 +39,7 @@
 #include "xe_sync.h"
 #include "xe_trace.h"
 #include "xe_wa.h"
+#include "xe_hmm.h"
 
 static struct drm_gem_object *xe_vm_obj(struct xe_vm *vm)
 {
@@ -66,113 +67,21 @@ int xe_vma_userptr_check_repin(struct xe_userptr_vma *uvma)
 
 int xe_vma_userptr_pin_pages(struct xe_userptr_vma *uvma)
 {
-	struct xe_userptr *userptr = &uvma->userptr;
 	struct xe_vma *vma = &uvma->vma;
 	struct xe_vm *vm = xe_vma_vm(vma);
 	struct xe_device *xe = vm->xe;
-	const unsigned long num_pages = xe_vma_size(vma) >> PAGE_SHIFT;
-	struct page **pages;
-	bool in_kthread = !current->mm;
-	unsigned long notifier_seq;
-	int pinned, ret, i;
-	bool read_only = xe_vma_read_only(vma);
+	struct xe_userptr *userptr;
+	int ret;
 
 	lockdep_assert_held(&vm->lock);
 	xe_assert(xe, xe_vma_is_userptr(vma));
-retry:
-	if (vma->gpuva.flags & XE_VMA_DESTROYED)
-		return 0;
-
-	notifier_seq = mmu_interval_read_begin(&userptr->notifier);
-	if (notifier_seq == userptr->notifier_seq)
-		return 0;
-
-	pages = kvmalloc_array(num_pages, sizeof(*pages), GFP_KERNEL);
-	if (!pages)
-		return -ENOMEM;
-
-	if (userptr->sg) {
-		dma_unmap_sgtable(xe->drm.dev,
-				  userptr->sg,
-				  read_only ? DMA_TO_DEVICE :
-				  DMA_BIDIRECTIONAL, 0);
-		sg_free_table(userptr->sg);
-		userptr->sg = NULL;
-	}
-
-	pinned = ret = 0;
-	if (in_kthread) {
-		if (!mmget_not_zero(userptr->notifier.mm)) {
-			ret = -EFAULT;
-			goto mm_closed;
-		}
-		kthread_use_mm(userptr->notifier.mm);
-	}
-
-	while (pinned < num_pages) {
-		ret = get_user_pages_fast(xe_vma_userptr(vma) +
-					  pinned * PAGE_SIZE,
-					  num_pages - pinned,
-					  read_only ? 0 : FOLL_WRITE,
-					  &pages[pinned]);
-		if (ret < 0)
-			break;
-
-		pinned += ret;
-		ret = 0;
-	}
 
-	if (in_kthread) {
-		kthread_unuse_mm(userptr->notifier.mm);
-		mmput(userptr->notifier.mm);
-	}
-mm_closed:
-	if (ret)
-		goto out;
-
-	ret = sg_alloc_table_from_pages_segment(&userptr->sgt, pages,
-						pinned, 0,
-						(u64)pinned << PAGE_SHIFT,
-						xe_sg_segment_size(xe->drm.dev),
-						GFP_KERNEL);
-	if (ret) {
-		userptr->sg = NULL;
-		goto out;
-	}
-	userptr->sg = &userptr->sgt;
-
-	ret = dma_map_sgtable(xe->drm.dev, userptr->sg,
-			      read_only ? DMA_TO_DEVICE :
-			      DMA_BIDIRECTIONAL,
-			      DMA_ATTR_SKIP_CPU_SYNC |
-			      DMA_ATTR_NO_KERNEL_MAPPING);
-	if (ret) {
-		sg_free_table(userptr->sg);
-		userptr->sg = NULL;
-		goto out;
-	}
-
-	for (i = 0; i < pinned; ++i) {
-		if (!read_only) {
-			lock_page(pages[i]);
-			set_page_dirty(pages[i]);
-			unlock_page(pages[i]);
-		}
+	userptr = &uvma->userptr;
+	mmap_read_lock(userptr->notifier.mm);
+	ret = xe_userptr_populate_range(uvma);
+	mmap_read_unlock(userptr->notifier.mm);
 
-		mark_page_accessed(pages[i]);
-	}
-
-out:
-	release_pages(pages, pinned);
-	kvfree(pages);
-
-	if (!(ret < 0)) {
-		userptr->notifier_seq = notifier_seq;
-		if (xe_vma_userptr_check_repin(uvma) == -EAGAIN)
-			goto retry;
-	}
-
-	return ret < 0 ? ret : 0;
+	return ret;
 }
 
 static bool preempt_fences_waiting(struct xe_vm *vm)
@@ -1016,8 +925,6 @@ static struct xe_vma *xe_vma_create(struct xe_vm *vm,
 static void xe_vma_destroy_late(struct xe_vma *vma)
 {
 	struct xe_vm *vm = xe_vma_vm(vma);
-	struct xe_device *xe = vm->xe;
-	bool read_only = xe_vma_read_only(vma);
 
 	if (vma->ufence) {
 		xe_sync_ufence_put(vma->ufence);
@@ -1025,16 +932,11 @@ static void xe_vma_destroy_late(struct xe_vma *vma)
 	}
 
 	if (xe_vma_is_userptr(vma)) {
-		struct xe_userptr *userptr = &to_userptr_vma(vma)->userptr;
+		struct xe_userptr_vma *uvma = to_userptr_vma(vma);
+		struct xe_userptr *userptr = &uvma->userptr;
 
-		if (userptr->sg) {
-			dma_unmap_sgtable(xe->drm.dev,
-					  userptr->sg,
-					  read_only ? DMA_TO_DEVICE :
-					  DMA_BIDIRECTIONAL, 0);
-			sg_free_table(userptr->sg);
-			userptr->sg = NULL;
-		}
+		if (userptr->sg)
+			xe_userptr_free_sg(uvma);
 
 		/*
 		 * Since userptr pages are not pinned, we can't remove
-- 
2.26.3


^ permalink raw reply related	[flat|nested] 72+ messages in thread

* [v2 12/31] drm/xe/svm: Remap and provide memmap backing for GPU vram
  2024-04-09 20:17 [v2 00/31] Basic system allocator support in xe driver Oak Zeng
                   ` (10 preceding siblings ...)
  2024-04-09 20:17 ` [v2 11/31] drm/xe: Use hmm_range_fault to populate user pages Oak Zeng
@ 2024-04-09 20:17 ` Oak Zeng
  2024-04-10 21:09   ` Matthew Brost
  2024-04-16 19:01   ` Matthew Brost
  2024-04-09 20:17 ` [v2 13/31] drm/xe/svm: Introduce DRM_XE_SVM kernel config Oak Zeng
                   ` (19 subsequent siblings)
  31 siblings, 2 replies; 72+ messages in thread
From: Oak Zeng @ 2024-04-09 20:17 UTC (permalink / raw)
  To: intel-xe
  Cc: himal.prasad.ghimiray, krishnaiah.bommu, matthew.brost,
	Thomas.Hellstrom, brian.welty

Memory remap GPU vram using devm_memremap_pages, so each GPU vram
page is backed by a struct page.

Those struct pages are created to allow hmm to migrate buffers b/t
GPU vram and CPU system memory using the existing Linux migration
mechanism (i.e., the one used to migrate b/t CPU system memory and
hard disk).

This is preparatory work to enable svm (shared virtual memory) through
the Linux kernel hmm framework. The memory remap's page map type is set
to MEMORY_DEVICE_PRIVATE for now. This means that even though each GPU
vram page gets a struct page and can be mapped in the CPU page table,
such pages are treated as the GPU's private resource, so the CPU can't
access them. If the CPU accesses such a page, a page fault is triggered
and the page is migrated to system memory.

For GPU devices which support a coherent memory protocol b/t CPU and
GPU (such as the CXL and CAPI protocols), we can remap device memory as
MEMORY_DEVICE_COHERENT. This is TBD.

v1:
Changes per code review feedback from Matt:
    change .o order in Makefile
    fix indentation
    change code order in mmio_fini
    remove unnecessary header file
    uniform xe_svm_devm_add/_remove parameter
    use tile (vs dev) as pagemap.owner during memremap
    only remap vram for platform that support usm
Changes per review feedback from Brian:
    s/xe_svm_devm_add/xe_devm_add
    s/xe_svm_devm_remove/xe_devm_remove
    move calling of xe_devm_add to xe_tile.c

Signed-off-by: Oak Zeng <oak.zeng@intel.com>
Co-developed-by: Niranjana Vishwanathapura <niranjana.vishwanathapura@intel.com>
Signed-off-by: Niranjana Vishwanathapura <niranjana.vishwanathapura@intel.com>
Cc: Matthew Brost <matthew.brost@intel.com>
Cc: Thomas Hellström <thomas.hellstrom@intel.com>
Cc: Brian Welty <brian.welty@intel.com>
---
 drivers/gpu/drm/xe/Makefile          |  1 +
 drivers/gpu/drm/xe/xe_device_types.h |  8 +++
 drivers/gpu/drm/xe/xe_mmio.c         |  6 ++
 drivers/gpu/drm/xe/xe_svm.h          | 15 +++++
 drivers/gpu/drm/xe/xe_svm_devmem.c   | 89 ++++++++++++++++++++++++++++
 drivers/gpu/drm/xe/xe_tile.c         |  4 ++
 6 files changed, 123 insertions(+)
 create mode 100644 drivers/gpu/drm/xe/xe_svm.h
 create mode 100644 drivers/gpu/drm/xe/xe_svm_devmem.c

diff --git a/drivers/gpu/drm/xe/Makefile b/drivers/gpu/drm/xe/Makefile
index fff70fc9a09e..cd5213ba182b 100644
--- a/drivers/gpu/drm/xe/Makefile
+++ b/drivers/gpu/drm/xe/Makefile
@@ -129,6 +129,7 @@ xe-y += xe_bb.o \
 	xe_sa.o \
 	xe_sched_job.o \
 	xe_step.o \
+	xe_svm_devmem.o \
 	xe_sync.o \
 	xe_tile.o \
 	xe_tile_sysfs.o \
diff --git a/drivers/gpu/drm/xe/xe_device_types.h b/drivers/gpu/drm/xe/xe_device_types.h
index e73b9a086718..d6a14327986b 100644
--- a/drivers/gpu/drm/xe/xe_device_types.h
+++ b/drivers/gpu/drm/xe/xe_device_types.h
@@ -103,6 +103,14 @@ struct xe_mem_region {
 	resource_size_t actual_physical_size;
 	/** @mapping: pointer to VRAM mappable space */
 	void __iomem *mapping;
+	/** @pagemap: Used to remap device memory as ZONE_DEVICE */
+	struct dev_pagemap pagemap;
+	/**
+	 * @hpa_base: base host physical address
+	 *
+	 * This is generated when remapping device memory as ZONE_DEVICE
+	 */
+	resource_size_t hpa_base;
 };
 
 /**
diff --git a/drivers/gpu/drm/xe/xe_mmio.c b/drivers/gpu/drm/xe/xe_mmio.c
index 7ba2477452d7..12923fe6abae 100644
--- a/drivers/gpu/drm/xe/xe_mmio.c
+++ b/drivers/gpu/drm/xe/xe_mmio.c
@@ -22,6 +22,7 @@
 #include "xe_module.h"
 #include "xe_sriov.h"
 #include "xe_tile.h"
+#include "xe_svm.h"
 
 #define XEHP_MTCFG_ADDR		XE_REG(0x101800)
 #define TILE_COUNT		REG_GENMASK(15, 8)
@@ -354,6 +355,11 @@ void xe_mmio_probe_tiles(struct xe_device *xe)
 static void mmio_fini(struct drm_device *drm, void *arg)
 {
 	struct xe_device *xe = arg;
+	struct xe_tile *tile;
+	u8 id;
+
+	for_each_tile(tile, xe, id)
+		xe_devm_remove(tile, &tile->mem.vram);
 
 	pci_iounmap(to_pci_dev(xe->drm.dev), xe->mmio.regs);
 	if (xe->mem.vram.mapping)
diff --git a/drivers/gpu/drm/xe/xe_svm.h b/drivers/gpu/drm/xe/xe_svm.h
new file mode 100644
index 000000000000..e944971cfc6d
--- /dev/null
+++ b/drivers/gpu/drm/xe/xe_svm.h
@@ -0,0 +1,15 @@
+/* SPDX-License-Identifier: MIT */
+/*
+ * Copyright © 2023 Intel Corporation
+ */
+
+#ifndef __XE_SVM_H
+#define __XE_SVM_H
+
+struct xe_tile;
+struct xe_mem_region;
+
+int xe_devm_add(struct xe_tile *tile, struct xe_mem_region *mr);
+void xe_devm_remove(struct xe_tile *tile, struct xe_mem_region *mr);
+
+#endif
diff --git a/drivers/gpu/drm/xe/xe_svm_devmem.c b/drivers/gpu/drm/xe/xe_svm_devmem.c
new file mode 100644
index 000000000000..31af56e8285a
--- /dev/null
+++ b/drivers/gpu/drm/xe/xe_svm_devmem.c
@@ -0,0 +1,89 @@
+// SPDX-License-Identifier: MIT
+/*
+ * Copyright © 2023 Intel Corporation
+ */
+
+#include <linux/mm_types.h>
+#include <linux/sched/mm.h>
+
+#include "xe_device_types.h"
+#include "xe_svm.h"
+
+
+static vm_fault_t xe_devm_migrate_to_ram(struct vm_fault *vmf)
+{
+	return 0;
+}
+
+static void xe_devm_page_free(struct page *page)
+{
+}
+
+static const struct dev_pagemap_ops xe_devm_pagemap_ops = {
+	.page_free = xe_devm_page_free,
+	.migrate_to_ram = xe_devm_migrate_to_ram,
+};
+
+/**
+ * xe_devm_add() - Remap and provide memmap backing for device memory
+ * @tile: tile that the memory region belongs to
+ * @mr: memory region to remap
+ *
+ * This remaps device memory to the host physical address space and creates
+ * struct pages to back the device memory.
+ *
+ * Return: 0 on success, standard error code otherwise
+ */
+int xe_devm_add(struct xe_tile *tile, struct xe_mem_region *mr)
+{
+	struct xe_device *xe = tile_to_xe(tile);
+	struct device *dev = &to_pci_dev(xe->drm.dev)->dev;
+	struct resource *res;
+	void *addr;
+	int ret;
+
+	res = devm_request_free_mem_region(dev, &iomem_resource,
+					   mr->usable_size);
+	if (IS_ERR(res)) {
+		ret = PTR_ERR(res);
+		return ret;
+	}
+
+	mr->pagemap.type = MEMORY_DEVICE_PRIVATE;
+	mr->pagemap.range.start = res->start;
+	mr->pagemap.range.end = res->end;
+	mr->pagemap.nr_range = 1;
+	mr->pagemap.ops = &xe_devm_pagemap_ops;
+	mr->pagemap.owner = xe;
+	addr = devm_memremap_pages(dev, &mr->pagemap);
+	if (IS_ERR(addr)) {
+		devm_release_mem_region(dev, res->start, resource_size(res));
+		ret = PTR_ERR(addr);
+		drm_err(&xe->drm, "Failed to remap tile %d memory, errno %d\n",
+				tile->id, ret);
+		return ret;
+	}
+	mr->hpa_base = res->start;
+
+	drm_info(&xe->drm, "Added tile %d memory [%llx-%llx] to devm, remapped to %pr\n",
+			tile->id, mr->io_start, mr->io_start + mr->usable_size, res);
+	return 0;
+}
+
+/**
+ * xe_devm_remove() - Unmap device memory and free resources
+ * @tile: xe tile
+ * @mr: memory region to remove
+ */
+void xe_devm_remove(struct xe_tile *tile, struct xe_mem_region *mr)
+{
+	struct device *dev = &to_pci_dev(tile->xe->drm.dev)->dev;
+
+	/* FIXME: Does the below cause a kernel hang during module removal? */
+	if (mr->hpa_base) {
+		devm_memunmap_pages(dev, &mr->pagemap);
+		devm_release_mem_region(dev, mr->pagemap.range.start,
+			mr->pagemap.range.end - mr->pagemap.range.start + 1);
+	}
+}
+
diff --git a/drivers/gpu/drm/xe/xe_tile.c b/drivers/gpu/drm/xe/xe_tile.c
index 0650b2fa75ef..f1c4f9de51df 100644
--- a/drivers/gpu/drm/xe/xe_tile.c
+++ b/drivers/gpu/drm/xe/xe_tile.c
@@ -14,6 +14,7 @@
 #include "xe_tile_sysfs.h"
 #include "xe_ttm_vram_mgr.h"
 #include "xe_wa.h"
+#include "xe_svm.h"
 
 /**
  * DOC: Multi-tile Design
@@ -158,6 +159,7 @@ static int tile_ttm_mgr_init(struct xe_tile *tile)
  */
 int xe_tile_init_noalloc(struct xe_tile *tile)
 {
+	struct xe_device *xe = tile_to_xe(tile);
 	int err;
 
 	xe_device_mem_access_get(tile_to_xe(tile));
@@ -175,6 +177,8 @@ int xe_tile_init_noalloc(struct xe_tile *tile)
 
 	xe_tile_sysfs_init(tile);
 
+	if (xe->info.has_usm)
+		xe_devm_add(tile, &tile->mem.vram);
 err_mem_access:
 	xe_device_mem_access_put(tile_to_xe(tile));
 	return err;
-- 
2.26.3


^ permalink raw reply related	[flat|nested] 72+ messages in thread

* [v2 13/31] drm/xe/svm: Introduce DRM_XE_SVM kernel config
  2024-04-09 20:17 [v2 00/31] Basic system allocator support in xe driver Oak Zeng
                   ` (11 preceding siblings ...)
  2024-04-09 20:17 ` [v2 12/31] drm/xe/svm: Remap and provide memmap backing for GPU vram Oak Zeng
@ 2024-04-09 20:17 ` Oak Zeng
  2024-04-10 21:13   ` Matthew Brost
  2024-04-09 20:17 ` [v2 14/31] drm/xe: Introduce helper to get tile from memory region Oak Zeng
                   ` (18 subsequent siblings)
  31 siblings, 1 reply; 72+ messages in thread
From: Oak Zeng @ 2024-04-09 20:17 UTC (permalink / raw)
  To: intel-xe
  Cc: himal.prasad.ghimiray, krishnaiah.bommu, matthew.brost,
	Thomas.Hellstrom, brian.welty

Introduce a DRM_XE_SVM kernel config entry for the xe SVM
feature. SVM allows sharing a virtual address space between
CPU and GPU programs.

v1: Improve commit message (Thomas)
    Avoid using #if directive (Thomas)

Signed-off-by: Oak Zeng <oak.zeng@intel.com>
Co-developed-by: Niranjana Vishwanathapura <niranjana.vishwanathapura@intel.com>
Signed-off-by: Niranjana Vishwanathapura <niranjana.vishwanathapura@intel.com>
Cc: Matthew Brost <matthew.brost@intel.com>
Cc: Thomas Hellström <thomas.hellstrom@intel.com>
Cc: Brian Welty <brian.welty@intel.com>
---
 drivers/gpu/drm/xe/Kconfig   | 21 +++++++++++++++++++++
 drivers/gpu/drm/xe/xe_tile.c |  7 +++++--
 2 files changed, 26 insertions(+), 2 deletions(-)

diff --git a/drivers/gpu/drm/xe/Kconfig b/drivers/gpu/drm/xe/Kconfig
index 449a1ecbc92a..0accb2cb81d6 100644
--- a/drivers/gpu/drm/xe/Kconfig
+++ b/drivers/gpu/drm/xe/Kconfig
@@ -84,6 +84,27 @@ config DRM_XE_FORCE_PROBE
 	  4571.
 
 	  Use "!*" to block the probe of the driver for all known devices.
+config DRM_XE_SVM
+	bool "Enable Shared Virtual Memory support in xe"
+	depends on DRM_XE
+	depends on ARCH_ENABLE_MEMORY_HOTPLUG
+	depends on ARCH_ENABLE_MEMORY_HOTREMOVE
+	depends on MEMORY_HOTPLUG
+	depends on MEMORY_HOTREMOVE
+	depends on ARCH_HAS_PTE_DEVMAP
+	depends on SPARSEMEM_VMEMMAP
+	depends on ZONE_DEVICE
+	depends on DEVICE_PRIVATE
+	depends on MMU
+	select HMM_MIRROR
+	select MMU_NOTIFIER
+	default y
+	help
+	  Choose this option if you want Shared Virtual Memory (SVM)
+	  support in xe. With SVM, the virtual address space is shared
+	  between the CPU and GPU. This means any virtual address, such
+	  as one returned by malloc or mmap, a variable on the stack, or
+	  a global memory pointer, can be used by the GPU transparently.
 
 menu "drm/Xe Debugging"
 depends on DRM_XE
diff --git a/drivers/gpu/drm/xe/xe_tile.c b/drivers/gpu/drm/xe/xe_tile.c
index f1c4f9de51df..a1a436912fe3 100644
--- a/drivers/gpu/drm/xe/xe_tile.c
+++ b/drivers/gpu/drm/xe/xe_tile.c
@@ -159,9 +159,12 @@ static int tile_ttm_mgr_init(struct xe_tile *tile)
  */
 int xe_tile_init_noalloc(struct xe_tile *tile)
 {
-	struct xe_device *xe = tile_to_xe(tile);
+	struct xe_device __maybe_unused *xe;
 	int err;
 
+	if (IS_ENABLED(CONFIG_DRM_XE_SVM))
+		xe = tile_to_xe(tile);
+
 	xe_device_mem_access_get(tile_to_xe(tile));
 
 	err = tile_ttm_mgr_init(tile);
@@ -177,7 +180,7 @@ int xe_tile_init_noalloc(struct xe_tile *tile)
 
 	xe_tile_sysfs_init(tile);
 
-	if (xe->info.has_usm)
+	if (IS_ENABLED(CONFIG_DRM_XE_SVM) && xe->info.has_usm)
 		xe_devm_add(tile, &tile->mem.vram);
 err_mem_access:
 	xe_device_mem_access_put(tile_to_xe(tile));
-- 
2.26.3



* [v2 14/31] drm/xe: Introduce helper to get tile from memory region
  2024-04-09 20:17 [v2 00/31] Basic system allocator support in xe driver Oak Zeng
                   ` (12 preceding siblings ...)
  2024-04-09 20:17 ` [v2 13/31] drm/xe/svm: Introduce DRM_XE_SVM kernel config Oak Zeng
@ 2024-04-09 20:17 ` Oak Zeng
  2024-04-10 21:17   ` Matthew Brost
  2024-04-09 20:17 ` [v2 15/31] drm/xe: Introduce a helper to get dpa from pfn Oak Zeng
                   ` (17 subsequent siblings)
  31 siblings, 1 reply; 72+ messages in thread
From: Oak Zeng @ 2024-04-09 20:17 UTC (permalink / raw)
  To: intel-xe
  Cc: himal.prasad.ghimiray, krishnaiah.bommu, matthew.brost,
	Thomas.Hellstrom, brian.welty

Introduce a simple helper to retrieve the tile from a memory region.

v1: move the function to xe_device.h (Matt)
    improve commit message, add kerneldoc (Thomas)

Signed-off-by: Oak Zeng <oak.zeng@intel.com>
---
 drivers/gpu/drm/xe/xe_device.h | 8 ++++++++
 1 file changed, 8 insertions(+)

diff --git a/drivers/gpu/drm/xe/xe_device.h b/drivers/gpu/drm/xe/xe_device.h
index 74eb9833d4d8..68082357aebd 100644
--- a/drivers/gpu/drm/xe/xe_device.h
+++ b/drivers/gpu/drm/xe/xe_device.h
@@ -178,4 +178,12 @@ u64 xe_device_uncanonicalize_addr(struct xe_device *xe, u64 address);
 
 void xe_device_put_deferred(struct xe_device *xe, struct llist_node *deferred);
 
+/**
+ * xe_mem_region_to_tile() - retrieve tile from memory region
+ * @mr: the memory region we retrieve tile from
+ */
+static inline struct xe_tile *xe_mem_region_to_tile(struct xe_mem_region *mr)
+{
+	return container_of(mr, struct xe_tile, mem.vram);
+}
 #endif
-- 
2.26.3



* [v2 15/31] drm/xe: Introduce a helper to get dpa from pfn
  2024-04-09 20:17 [v2 00/31] Basic system allocator support in xe driver Oak Zeng
                   ` (13 preceding siblings ...)
  2024-04-09 20:17 ` [v2 14/31] drm/xe: Introduce helper to get tile from memory region Oak Zeng
@ 2024-04-09 20:17 ` Oak Zeng
  2024-04-10 21:35   ` Matthew Brost
  2024-04-09 20:17 ` [v2 16/31] drm/xe/svm: Get xe memory region from page Oak Zeng
                   ` (16 subsequent siblings)
  31 siblings, 1 reply; 72+ messages in thread
From: Oak Zeng @ 2024-04-09 20:17 UTC (permalink / raw)
  To: intel-xe
  Cc: himal.prasad.ghimiray, krishnaiah.bommu, matthew.brost,
	Thomas.Hellstrom, brian.welty

Since we now create struct page backing for each vram page,
each vram page now also has a pfn, just like system memory.
This allows us to calculate the device physical address from
the pfn.

v1: move the function to xe_svm.h (Matt)
    s/vram_pfn_to_dpa/xe_mem_region_pfn_to_dpa (Matt)
    add kernel document for the helper (Thomas)

Signed-off-by: Oak Zeng <oak.zeng@intel.com>
---
 drivers/gpu/drm/xe/xe_svm.h | 27 +++++++++++++++++++++++++--
 1 file changed, 25 insertions(+), 2 deletions(-)

diff --git a/drivers/gpu/drm/xe/xe_svm.h b/drivers/gpu/drm/xe/xe_svm.h
index e944971cfc6d..8a34429eb674 100644
--- a/drivers/gpu/drm/xe/xe_svm.h
+++ b/drivers/gpu/drm/xe/xe_svm.h
@@ -6,8 +6,31 @@
 #ifndef __XE_SVM_H
 #define __XE_SVM_H
 
-struct xe_tile;
-struct xe_mem_region;
+#include "xe_device_types.h"
+#include "xe_device.h"
+#include "xe_assert.h"
+
+/**
+ * xe_mem_region_pfn_to_dpa() - Calculate page's dpa from pfn
+ *
+ * @mr: The memory region that page resides in
+ * @pfn: page frame number of the page
+ *
+ * Returns: the device physical address of the page
+ */
+static inline u64 xe_mem_region_pfn_to_dpa(struct xe_mem_region *mr, u64 pfn)
+{
+	u64 dpa;
+	struct xe_tile *tile = xe_mem_region_to_tile(mr);
+	struct xe_device *xe = tile_to_xe(tile);
+	u64 offset;
+
+	xe_assert(xe, (pfn << PAGE_SHIFT) >= mr->hpa_base);
+	offset = (pfn << PAGE_SHIFT) - mr->hpa_base;
+	dpa = mr->dpa_base + offset;
+
+	return dpa;
+}
 
 int xe_devm_add(struct xe_tile *tile, struct xe_mem_region *mr);
 void xe_devm_remove(struct xe_tile *tile, struct xe_mem_region *mr);
-- 
2.26.3



* [v2 16/31] drm/xe/svm: Get xe memory region from page
  2024-04-09 20:17 [v2 00/31] Basic system allocator support in xe driver Oak Zeng
                   ` (14 preceding siblings ...)
  2024-04-09 20:17 ` [v2 15/31] drm/xe: Introduce a helper to get dpa from pfn Oak Zeng
@ 2024-04-09 20:17 ` Oak Zeng
  2024-04-10 21:38   ` Matthew Brost
  2024-04-09 20:17 ` [v2 17/31] drm/xe: Get xe_vma from xe_userptr Oak Zeng
                   ` (15 subsequent siblings)
  31 siblings, 1 reply; 72+ messages in thread
From: Oak Zeng @ 2024-04-09 20:17 UTC (permalink / raw)
  To: intel-xe
  Cc: himal.prasad.ghimiray, krishnaiah.bommu, matthew.brost,
	Thomas.Hellstrom, brian.welty

For each GPU vram page, we now have a struct page backing it.
The struct page's pgmap points to the xe_mem_region's pagemap.
This allows us to retrieve the xe_mem_region from a struct
page.

v1: move the function to xe_svm.h

Signed-off-by: Oak Zeng <oak.zeng@intel.com>
---
 drivers/gpu/drm/xe/xe_svm.h | 11 +++++++++++
 1 file changed, 11 insertions(+)

diff --git a/drivers/gpu/drm/xe/xe_svm.h b/drivers/gpu/drm/xe/xe_svm.h
index 8a34429eb674..624c1581f8ba 100644
--- a/drivers/gpu/drm/xe/xe_svm.h
+++ b/drivers/gpu/drm/xe/xe_svm.h
@@ -6,6 +6,7 @@
 #ifndef __XE_SVM_H
 #define __XE_SVM_H
 
+#include <linux/mm_types.h>
 #include "xe_device_types.h"
 #include "xe_device.h"
 #include "xe_assert.h"
@@ -35,4 +36,14 @@ static inline u64 xe_mem_region_pfn_to_dpa(struct xe_mem_region *mr, u64 pfn)
 int xe_devm_add(struct xe_tile *tile, struct xe_mem_region *mr);
 void xe_devm_remove(struct xe_tile *tile, struct xe_mem_region *mr);
 
+/**
+ * xe_page_to_mem_region() - Get a page's memory region
+ *
+ * @page: a struct page pointer pointing to a page in a vram memory region
+ */
+static inline struct xe_mem_region *xe_page_to_mem_region(struct page *page)
+{
+	return container_of(page->pgmap, struct xe_mem_region, pagemap);
+}
+
 #endif
-- 
2.26.3



* [v2 17/31] drm/xe: Get xe_vma from xe_userptr
  2024-04-09 20:17 [v2 00/31] Basic system allocator support in xe driver Oak Zeng
                   ` (15 preceding siblings ...)
  2024-04-09 20:17 ` [v2 16/31] drm/xe/svm: Get xe memory region from page Oak Zeng
@ 2024-04-09 20:17 ` Oak Zeng
  2024-04-10 21:42   ` Matthew Brost
  2024-04-09 20:17 ` [v2 18/31] drm/xe/svm: Build userptr sg table for device pages Oak Zeng
                   ` (14 subsequent siblings)
  31 siblings, 1 reply; 72+ messages in thread
From: Oak Zeng @ 2024-04-09 20:17 UTC (permalink / raw)
  To: intel-xe
  Cc: himal.prasad.ghimiray, krishnaiah.bommu, matthew.brost,
	Thomas.Hellstrom, brian.welty

Introduce a helper to get xe_vma from xe_userptr.

Signed-off-by: Oak Zeng <oak.zeng@intel.com>
---
 drivers/gpu/drm/xe/xe_vm.h | 14 ++++++++++++++
 1 file changed, 14 insertions(+)

diff --git a/drivers/gpu/drm/xe/xe_vm.h b/drivers/gpu/drm/xe/xe_vm.h
index 0b2790f697db..4860747592ad 100644
--- a/drivers/gpu/drm/xe/xe_vm.h
+++ b/drivers/gpu/drm/xe/xe_vm.h
@@ -178,6 +178,20 @@ static inline struct xe_userptr_vma *to_userptr_vma(struct xe_vma *vma)
 	return container_of(vma, struct xe_userptr_vma, vma);
 }
 
+/**
+ * xe_userptr_to_vma() - Return xe_vma from a xe_userptr pointer
+ *
+ * @userptr: The userptr struct pointer
+ */
+
+static inline struct xe_vma *xe_userptr_to_vma(struct xe_userptr *userptr)
+{
+	struct xe_userptr_vma *uvma;
+
+	uvma = container_of(userptr, struct xe_userptr_vma, userptr);
+	return &uvma->vma;
+}
+
 u64 xe_vm_pdp4_descriptor(struct xe_vm *vm, struct xe_tile *tile);
 
 int xe_vm_create_ioctl(struct drm_device *dev, void *data,
-- 
2.26.3



* [v2 18/31] drm/xe/svm: Build userptr sg table for device pages
  2024-04-09 20:17 [v2 00/31] Basic system allocator support in xe driver Oak Zeng
                   ` (16 preceding siblings ...)
  2024-04-09 20:17 ` [v2 17/31] drm/xe: Get xe_vma from xe_userptr Oak Zeng
@ 2024-04-09 20:17 ` Oak Zeng
  2024-04-10 21:52   ` Matthew Brost
  2024-04-09 20:17 ` [v2 19/31] drm/xe/svm: Determine a vma is backed by device memory Oak Zeng
                   ` (13 subsequent siblings)
  31 siblings, 1 reply; 72+ messages in thread
From: Oak Zeng @ 2024-04-09 20:17 UTC (permalink / raw)
  To: intel-xe
  Cc: himal.prasad.ghimiray, krishnaiah.bommu, matthew.brost,
	Thomas.Hellstrom, brian.welty

Previously, the function xe_build_sg only supported userptrs
backed by system memory pages. Now it is extended to also
support userptrs backed by device pages.

For device pages, there is no need for dma-mapping. Instead,
we calculate each device page's dpa (device physical address)
and use it to fill the sg table.

As of now, we assume each userptr is backed entirely by either
system memory pages or device pages; a mixed backing of device
and system memory pages is not supported.

Signed-off-by: Oak Zeng <oak.zeng@intel.com>
---
 drivers/gpu/drm/xe/xe_hmm.c      | 121 +++++++++++++++++++++++++------
 drivers/gpu/drm/xe/xe_vm_types.h |   2 +
 2 files changed, 100 insertions(+), 23 deletions(-)

diff --git a/drivers/gpu/drm/xe/xe_hmm.c b/drivers/gpu/drm/xe/xe_hmm.c
index 427c6bc49949..a261c1dd2060 100644
--- a/drivers/gpu/drm/xe/xe_hmm.c
+++ b/drivers/gpu/drm/xe/xe_hmm.c
@@ -11,6 +11,7 @@
 #include <linux/hmm.h>
 #include <linux/mm.h>
 #include "xe_hmm.h"
+#include "xe_svm.h"
 #include "xe_vm.h"
 #include "xe_bo.h"
 
@@ -43,15 +44,90 @@ static void xe_mark_range_accessed(struct hmm_range *range, bool write)
 	}
 }
 
+/**
+ * xe_build_sg_device_pages() - build sg table for userptr when the backing store
+ * is device pages
+ *
+ * @st: sg table to build
+ * @hmm_pfns: pfn array of the userptr
+ * @pages: struct page array of this userptr
+ * @npages: how many pages in this userptr
+ */
+static int xe_build_sg_device_pages(struct sg_table *st, unsigned long *hmm_pfns,
+						struct page **pages, uint64_t npages)
+{
+	struct scatterlist *sg;
+	int i;
+
+	sg = NULL;
+	st->nents = 0;
+	if (unlikely(sg_alloc_table(st, npages, GFP_KERNEL)))
+		return -ENOMEM;
+
+	for (i = 0; i < npages; i++) {
+		unsigned long addr;
+		struct xe_mem_region *mr;
+
+		mr = xe_page_to_mem_region(pages[i]);
+		addr = xe_mem_region_pfn_to_dpa(mr, hmm_pfns[i]);
+		if (sg && (addr == (sg_dma_address(sg) + sg->length))) {
+			sg->length += PAGE_SIZE;
+			sg_dma_len(sg) += PAGE_SIZE;
+			continue;
+		}
+
+		sg =  sg ? sg_next(sg) : st->sgl;
+		sg_dma_address(sg) = addr;
+		sg_dma_len(sg) = PAGE_SIZE;
+		sg->length = PAGE_SIZE;
+		st->nents++;
+	}
+
+	sg_mark_end(sg);
+	return 0;
+}
+
+/**
+ * xe_validate_hmm_pfns() - validate that all pages in a userptr belong to one
+ * memory region, and populate the pages array.
+ *
+ * @userptr: The userptr to validate
+ * @hmm_pfns: an array holding hmm pfns
+ * @npages: number of pages of this userptr
+ * @pages: output parameter to hold the populated pages from pfn.
+ */
+static void xe_validate_hmm_pfns(struct xe_userptr *userptr, unsigned long *hmm_pfns,
+						uint64_t npages, struct page **pages)
+{
+	int i;
+	struct xe_vma *vma = xe_userptr_to_vma(userptr);
+	struct xe_vm *vm = xe_vma_vm(vma);
+
+	pages[0] = hmm_pfn_to_page(hmm_pfns[0]);
+	userptr->is_device_pages = is_device_private_page(pages[0]);
+	for (i = 1; i < npages; i++) {
+		pages[i] = hmm_pfn_to_page(hmm_pfns[i]);
+		/*
+		 * We currently assume no mixture of device pages and system memory
+		 * pages in one userptr. If that turns out to be false, we will
+		 * either split the userptr into device-page-based and system-
+		 * memory-based parts, or support a mixed backing store in one userptr.
+		 */
+		xe_assert(vm->xe,
+			userptr->is_device_pages == is_device_private_page(pages[i]));
+	}
+}
+
+
 /**
  * xe_build_sg() - build a scatter gather table for all the physical pages/pfn
  * in a hmm_range. dma-map pages if necessary. dma-address is save in sg table
  * and will be used to program GPU page table later.
  *
  * @xe: the xe device who will access the dma-address in sg table
+ * @userptr: the userptr that we build the sg table for
  * @range: the hmm range that we build the sg table from. range->hmm_pfns[]
  * has the pfn numbers of pages that back up this hmm address range.
- * @st: pointer to the sg table.
  * @write: whether we write to this range. This decides dma map direction
  * for system pages. If write we map it bi-diretional; otherwise
  * DMA_TO_DEVICE
@@ -64,11 +140,6 @@ static void xe_mark_range_accessed(struct hmm_range *range, bool write)
  * access memory. So if the memory is system memory, we need to
  * do a dma-mapping so it can be accessed by GPU/DMA.
  *
- * FIXME: This function currently only support pages in system
- * memory. If the memory is GPU local memory (of the GPU who
- * is going to access memory), we need gpu dpa (device physical
- * address), and there is no need of dma-mapping. This is TBD.
- *
  * FIXME: dma-mapping for peer gpu device to access remote gpu's
  * memory. Add this when you support p2p
  *
@@ -77,12 +148,13 @@ static void xe_mark_range_accessed(struct hmm_range *range, bool write)
  *
  * Returns 0 if successful; -ENOMEM if fails to allocate memory
  */
-static int xe_build_sg(struct xe_device *xe, struct hmm_range *range,
-			     struct sg_table *st, bool write)
+static int xe_build_sg(struct xe_device *xe, struct xe_userptr *userptr,
+					struct hmm_range *range, bool write)
 {
+	struct sg_table *st = &userptr->sgt;
 	struct device *dev = xe->drm.dev;
 	struct page **pages;
-	u64 i, npages;
+	u64 npages;
 	int ret;
 
 	npages = xe_npages_in_range(range->start, range->end);
@@ -90,19 +162,22 @@ static int xe_build_sg(struct xe_device *xe, struct hmm_range *range,
 	if (!pages)
 		return -ENOMEM;
 
-	for (i = 0; i < npages; i++) {
-		pages[i] = hmm_pfn_to_page(range->hmm_pfns[i]);
-		xe_assert(xe, !is_device_private_page(pages[i]));
-	}
-
-	ret = sg_alloc_table_from_pages_segment(st, pages, npages, 0,
-			npages << PAGE_SHIFT, xe_sg_segment_size(dev), GFP_KERNEL);
-	if (ret)
-		goto free_pages;
+	xe_validate_hmm_pfns(userptr, range->hmm_pfns, npages, pages);
+	if (!userptr->is_device_pages) {
+		ret = sg_alloc_table_from_pages_segment(st, pages, npages, 0,
+				npages << PAGE_SHIFT, xe_sg_segment_size(dev), GFP_KERNEL);
+		if (ret)
+			goto free_pages;
 
-	ret = dma_map_sgtable(dev, st, write ? DMA_BIDIRECTIONAL : DMA_TO_DEVICE,
-			DMA_ATTR_SKIP_CPU_SYNC | DMA_ATTR_NO_KERNEL_MAPPING);
+		ret = dma_map_sgtable(dev, st, write ? DMA_BIDIRECTIONAL : DMA_TO_DEVICE,
+				DMA_ATTR_SKIP_CPU_SYNC | DMA_ATTR_NO_KERNEL_MAPPING);
+	} else {
+		ret = xe_build_sg_device_pages(st, range->hmm_pfns, pages, npages);
+		if (ret)
+			goto free_pages;
+	}
 
+	userptr->sg = st;
 free_pages:
 	kvfree(pages);
 	return ret;
@@ -127,7 +202,8 @@ void xe_userptr_free_sg(struct xe_userptr_vma *uvma)
 	struct device *dev = xe->drm.dev;
 
 	xe_assert(xe, userptr->sg);
-	dma_unmap_sgtable(dev, userptr->sg,
+	if (!userptr->is_device_pages)
+		dma_unmap_sgtable(dev, userptr->sg,
 			write ? DMA_BIDIRECTIONAL : DMA_TO_DEVICE, 0);
 
 	sg_free_table(userptr->sg);
@@ -239,12 +315,11 @@ int xe_userptr_populate_range(struct xe_userptr_vma *uvma)
 	if (ret)
 		goto free_pfns;
 
-	ret = xe_build_sg(vm->xe, &hmm_range, &userptr->sgt, write);
+	ret = xe_build_sg(vm->xe, userptr, &hmm_range, write);
 	if (ret)
 		goto free_pfns;
 
 	xe_mark_range_accessed(&hmm_range, write);
-	userptr->sg = &userptr->sgt;
 	userptr->notifier_seq = hmm_range.notifier_seq;
 
 free_pfns:
diff --git a/drivers/gpu/drm/xe/xe_vm_types.h b/drivers/gpu/drm/xe/xe_vm_types.h
index fbf6bfcf59a8..3b4debfecc9b 100644
--- a/drivers/gpu/drm/xe/xe_vm_types.h
+++ b/drivers/gpu/drm/xe/xe_vm_types.h
@@ -64,6 +64,8 @@ struct xe_userptr {
 	struct sg_table *sg;
 	/** @notifier_seq: notifier sequence number */
 	unsigned long notifier_seq;
+	/** @is_device_pages: the backing store is in device memory */
+	bool is_device_pages;
 	/**
 	 * @initial_bind: user pointer has been bound at least once.
 	 * write: vm->userptr.notifier_lock in read mode and vm->resv held.
-- 
2.26.3



* [v2 19/31] drm/xe/svm: Determine a vma is backed by device memory
  2024-04-09 20:17 [v2 00/31] Basic system allocator support in xe driver Oak Zeng
                   ` (17 preceding siblings ...)
  2024-04-09 20:17 ` [v2 18/31] drm/xe/svm: Build userptr sg table for device pages Oak Zeng
@ 2024-04-09 20:17 ` Oak Zeng
  2024-04-10 21:56   ` Matthew Brost
  2024-04-09 20:17 ` [v2 20/31] drm/xe: add xe lock document Oak Zeng
                   ` (12 subsequent siblings)
  31 siblings, 1 reply; 72+ messages in thread
From: Oak Zeng @ 2024-04-09 20:17 UTC (permalink / raw)
  To: intel-xe
  Cc: himal.prasad.ghimiray, krishnaiah.bommu, matthew.brost,
	Thomas.Hellstrom, brian.welty

With the system allocator, a userptr can now also be backed by
device memory. Introduce a helper function, xe_vma_is_devmem,
to determine whether a vma is backed by device memory.

Signed-off-by: Oak Zeng <oak.zeng@intel.com>
---
 drivers/gpu/drm/xe/xe_pt.c | 14 ++++++++++++--
 1 file changed, 12 insertions(+), 2 deletions(-)

diff --git a/drivers/gpu/drm/xe/xe_pt.c b/drivers/gpu/drm/xe/xe_pt.c
index 846e896edcb5..525092111be9 100644
--- a/drivers/gpu/drm/xe/xe_pt.c
+++ b/drivers/gpu/drm/xe/xe_pt.c
@@ -577,6 +577,17 @@ static const struct xe_pt_walk_ops xe_pt_stage_bind_ops = {
 	.pt_entry = xe_pt_stage_bind_entry,
 };
 
+static bool xe_vma_is_devmem(struct xe_vma *vma)
+{
+	if (xe_vma_is_userptr(vma)) {
+		struct xe_userptr_vma *uvma = to_userptr_vma(vma);
+		return uvma->userptr.is_device_pages;
+	} else {
+		struct xe_bo *bo = xe_vma_bo(vma);
+		return bo && (xe_bo_is_vram(bo) || xe_bo_is_stolen_devmem(bo));
+	}
+}
+
 /**
  * xe_pt_stage_bind() - Build a disconnected page-table tree for a given address
  * range.
@@ -601,8 +612,7 @@ xe_pt_stage_bind(struct xe_tile *tile, struct xe_vma *vma,
 {
 	struct xe_device *xe = tile_to_xe(tile);
 	struct xe_bo *bo = xe_vma_bo(vma);
-	bool is_devmem = !xe_vma_is_userptr(vma) && bo &&
-		(xe_bo_is_vram(bo) || xe_bo_is_stolen_devmem(bo));
+	bool is_devmem = xe_vma_is_devmem(vma);
 	struct xe_res_cursor curs;
 	struct xe_pt_stage_bind_walk xe_walk = {
 		.base = {
-- 
2.26.3



* [v2 20/31] drm/xe: add xe lock document
  2024-04-09 20:17 [v2 00/31] Basic system allocator support in xe driver Oak Zeng
                   ` (18 preceding siblings ...)
  2024-04-09 20:17 ` [v2 19/31] drm/xe/svm: Determine a vma is backed by device memory Oak Zeng
@ 2024-04-09 20:17 ` Oak Zeng
  2024-04-09 20:17 ` [v2 21/31] drm/xe/svm: Introduce svm migration function Oak Zeng
                   ` (11 subsequent siblings)
  31 siblings, 0 replies; 72+ messages in thread
From: Oak Zeng @ 2024-04-09 20:17 UTC (permalink / raw)
  To: intel-xe
  Cc: himal.prasad.ghimiray, krishnaiah.bommu, matthew.brost,
	Thomas.Hellstrom, brian.welty

This is not intended to be complete documentation of xe locks.
It only documents some key locks used in the xe driver and
gives an example to illustrate their usage.

This is just a start; we should eventually refine this document.

Signed-off-by: Oak Zeng <oak.zeng@intel.com>
---
 Documentation/gpu/xe/index.rst   |   1 +
 Documentation/gpu/xe/xe_lock.rst |   8 +++
 drivers/gpu/drm/xe/xe_lock_doc.h | 113 +++++++++++++++++++++++++++++++
 drivers/gpu/drm/xe/xe_vm_types.h |   2 +-
 4 files changed, 123 insertions(+), 1 deletion(-)
 create mode 100644 Documentation/gpu/xe/xe_lock.rst
 create mode 100644 drivers/gpu/drm/xe/xe_lock_doc.h

diff --git a/Documentation/gpu/xe/index.rst b/Documentation/gpu/xe/index.rst
index 106b60aba1f0..6ae2c8e7bbb4 100644
--- a/Documentation/gpu/xe/index.rst
+++ b/Documentation/gpu/xe/index.rst
@@ -24,3 +24,4 @@ DG2, etc is provided to prototype the driver.
    xe_tile
    xe_debugging
    xe_svm
+   xe_lock
diff --git a/Documentation/gpu/xe/xe_lock.rst b/Documentation/gpu/xe/xe_lock.rst
new file mode 100644
index 000000000000..24e4c2e7c5d1
--- /dev/null
+++ b/Documentation/gpu/xe/xe_lock.rst
@@ -0,0 +1,8 @@
+.. SPDX-License-Identifier: (GPL-2.0+ OR MIT)
+
+==============
+xe lock design
+==============
+
+.. kernel-doc:: drivers/gpu/drm/xe/xe_lock_doc.h
+   :doc: xe lock design
diff --git a/drivers/gpu/drm/xe/xe_lock_doc.h b/drivers/gpu/drm/xe/xe_lock_doc.h
new file mode 100644
index 000000000000..0fab623ce056
--- /dev/null
+++ b/drivers/gpu/drm/xe/xe_lock_doc.h
@@ -0,0 +1,113 @@
+/* SPDX-License-Identifier: MIT */
+/*
+ * Copyright © 2024 Intel Corporation
+ */
+
+#ifndef _XE_LOCK_DOC_H_
+#define _XE_LOCK_DOC_H_
+
+/**
+ * DOC: XE lock design
+ *
+ * Locks used in xekmd are complicated. This document tries to cover the
+ * fundamentals, such as the key locks used, their purpose, and the
+ * order of locking when you need to hold multiple locks.
+ *
+ * Locks used in xekmd
+ * ===================
+ * 1. xe_vm::lock
+ * xe_vm::lock is used mainly to protect data in xe_vm struct, more specifically
+ * this includes below:
+ *
+ * 1) vm::rebind_list
+ * 2) vm::flags, only the XE_VM_FLAG_BANNED bit
+ * 3) vma::tile_present
+ * 4) userptr::repin_list
+ * 5) userptr::invalidated list
+ * 6) vm::preempt::exec_queue
+ * 7) drm_gpuvm::rb list and tree
+ * 8) vm::size
+ * 9) vm::q[]->last_fence, only if q->flags' EXEC_QUEUE_FLAG_VM is set,
+ *    see xe_exec_queue_last_fence_lockdep_assert
+ * 10) a contested list during vm close. see xe_vm_close_and_put
+ *
+ * 2. mm mmap_lock
+ * mm's mmap_lock is used to protect mm's memory mapping such as CPU page
+ * tables. Linux core mm holds this lock whenever it needs to change a
+ * process's memory mapping, for example, during a user munmap call.
+ *
+ * xe holds mmap_lock when it needs to walk the CPU page table, such as
+ * when it calls hmm_range_fault to populate CPU page tables.
+ *
+ * 3. xe_vm's dma-resv
+ * xe_vm's dma reservation object is used to protect GPU page table updates.
+ * For BO type vmas, dma resv is enough for page table updates. For userptr
+ * and hmmptr, besides dma resv, we need an extra notifier_lock to avoid
+ * page table update collisions with userptr invalidation. See below.
+ *
+ * 4. xe_vm::userptr::notifier_lock
+ * notifier_lock is used to protect userptr/hmmptr GPU page table updates,
+ * to avoid an update collision with userptr invalidation. So notifier_lock
+ * is required in the userptr invalidate callback function. notifier_lock
+ * is the "user_lock" in the documentation of mmu_interval_read_begin().
+ *
+ * Lock order
+ * ==========
+ * Acquiring locks in a consistent order avoids deadlocks. The locking
+ * order of the above locks is:
+ *
+ * xe_vm::lock => mmap_lock => xe_vm::dma-resv => notifier_lock
+ *
+ *
+ * Use case, pseudo code
+ * =====================
+ *
+ * Below is pseudo code for hmmptr's GPU page fault handler:
+ *
+ * get gpu vm from page fault asid
+ * down_write(&vm->lock)
+ * walk vma tree, get vma of fault address
+ *
+ * Again:
+ * mmap_read_lock()
+ * do page migration for vma if needed
+ * vma->userptr.notifier_seq = mmu_interval_read_begin(&vma->userptr.notifier)
+ * call hmm_range_fault to retrieve vma's pfns/pages
+ * mmap_read_unlock()
+ *
+ * xe_vm_lock(vm)
+ * down_read(&vm->userptr.notifier_lock);
+ * if (mmu_interval_read_retry()) {
+ *     up_read(&vm->userptr.notifier_lock);
+ *     goto Again; //collision happened with userptr invalidation, retry
+ * }
+ *
+ * xe_vm_populate_pgtable or submit gpu job to update page table
+ * up_read(&vm->userptr.notifier_lock);
+ *
+ * xe_vm_unlock(vm)
+ * up_write(&vm->lock)
+ *
+ * In the above code, we first hold vm->lock so we can walk the vm's vma
+ * tree to get the vma of the fault address.
+ *
+ * Then we do page migration if needed. Page migration is not needed for
+ * userptr but might be needed for hmmptr. After migration, we populate
+ * the pfns of the vma. Since this requires walking the CPU page table,
+ * we hold the mmap_lock in this step.
+ *
+ * After that, the remaining work is to update the GPU page table with the
+ * pfns/pages populated above. Since we use the vm's dma-resv object to
+ * protect GPU page table updates, we need to hold it in this step.
+ *
+ * Since we don't hold the mmap_lock during the GPU page table update, the
+ * user might perform a munmap simultaneously, which causes userptr
+ * invalidation. If such a collision happens, we retry.
+ *
+ * The notifier_lock is held in both the mmu notifier callback (not listed
+ * above) and the GPU page table update.
+ *
+ */
+#endif
+
+
diff --git a/drivers/gpu/drm/xe/xe_vm_types.h b/drivers/gpu/drm/xe/xe_vm_types.h
index 3b4debfecc9b..d1f5949d4a3b 100644
--- a/drivers/gpu/drm/xe/xe_vm_types.h
+++ b/drivers/gpu/drm/xe/xe_vm_types.h
@@ -271,7 +271,7 @@ struct xe_vm {
 
 	/**
 	 * @lock: outer most lock, protects objects of anything attached to this
-	 * VM
+	 * VM. See more details in xe_lock_doc.h
 	 */
 	struct rw_semaphore lock;
 	/**
-- 
2.26.3


^ permalink raw reply related	[flat|nested] 72+ messages in thread
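[Editorial note] The collision-retry pattern in the lock document's pseudo code can be sketched as self-contained userspace C, with a plain sequence counter standing in for the mmu interval notifier. All names below are illustrative stand-ins, not the xe or hmm API:

```c
#include <assert.h>
#include <stdbool.h>

/* Stand-in for the mmu interval notifier: a bumped sequence number
 * means an invalidation ran since the last read_begin(). */
static unsigned long notifier_seq;

static unsigned long read_begin(void) { return notifier_seq; }
static bool read_retry(unsigned long seq) { return seq != notifier_seq; }
static void invalidate(void) { notifier_seq++; } /* simulated munmap */

/* Mirrors the "Again:" loop: sample the sequence, fault in pages,
 * and commit only if no invalidation collided in between. Returns
 * the number of attempts taken. */
static int fault_handler(bool collide_once)
{
	int attempts = 0;

	for (;;) {
		unsigned long seq = read_begin();

		attempts++;
		/* ... hmm_range_fault() would populate pfns here ... */
		if (collide_once && attempts == 1)
			invalidate();
		if (read_retry(seq))
			continue; /* collision with invalidation: retry */
		/* ... GPU page table update would commit here ... */
		return attempts;
	}
}
```

With no concurrent invalidation the handler commits on the first pass; a single simulated collision forces exactly one extra trip through the loop.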

* [v2 21/31] drm/xe/svm: Introduce svm migration function
  2024-04-09 20:17 [v2 00/31] Basic system allocator support in xe driver Oak Zeng
                   ` (19 preceding siblings ...)
  2024-04-09 20:17 ` [v2 20/31] drm/xe: add xe lock document Oak Zeng
@ 2024-04-09 20:17 ` Oak Zeng
  2024-04-10 22:06   ` Matthew Brost
  2024-04-09 20:17 ` [v2 22/31] drm/xe/svm: implement functions to allocate and free device memory Oak Zeng
                   ` (10 subsequent siblings)
  31 siblings, 1 reply; 72+ messages in thread
From: Oak Zeng @ 2024-04-09 20:17 UTC (permalink / raw)
  To: intel-xe
  Cc: himal.prasad.ghimiray, krishnaiah.bommu, matthew.brost,
	Thomas.Hellstrom, brian.welty

Introduce the xe_migrate_pa() function for data migration.
This function is similar to xe_migrate_copy() but has
different parameters: instead of BO and TTM resource
parameters, it takes the source and destination buffers'
physical addresses. This function is intended to be used
by the svm subsystem, which has no BO or TTM concepts.

Signed-off-by: Oak Zeng <oak.zeng@intel.com>
Cc: Niranjana Vishwanathapura <niranjana.vishwanathapura@intel.com>
Cc: Matthew Brost <matthew.brost@intel.com>
Cc: Thomas Hellström <thomas.hellstrom@intel.com>
Cc: Brian Welty <brian.welty@intel.com>
---
 drivers/gpu/drm/xe/xe_migrate.c | 217 ++++++++++++++++++++++++++++++++
 drivers/gpu/drm/xe/xe_migrate.h |   7 ++
 2 files changed, 224 insertions(+)

diff --git a/drivers/gpu/drm/xe/xe_migrate.c b/drivers/gpu/drm/xe/xe_migrate.c
index 82b63bdb9c47..f1d53911253b 100644
--- a/drivers/gpu/drm/xe/xe_migrate.c
+++ b/drivers/gpu/drm/xe/xe_migrate.c
@@ -462,6 +462,37 @@ static bool xe_migrate_allow_identity(u64 size, const struct xe_res_cursor *cur)
 	return cur->size >= size;
 }
 
+/**
+ * pte_update_cmd_size() - calculate the batch buffer command size
+ * to update a flat page table.
+ *
+ * @size: The virtual address range size of the page table to update
+ *
+ * The page table to update is supposed to be a flat 1 level page
+ * table with all entries pointing to 4k pages.
+ *
+ * Return: the number of dwords of the update command
+ */
+static u32 pte_update_cmd_size(u64 size)
+{
+	u32 dword;
+	u64 entries = DIV_ROUND_UP(size, XE_PAGE_SIZE);
+
+	XE_WARN_ON(size > MAX_PREEMPTDISABLE_TRANSFER);
+	/*
+	 * MI_STORE_DATA_IMM commands are used to update the page table. Each
+	 * instruction can update at most 0x1ff pte entries. To update
+	 * n (n <= 0x1ff) pte entries, we need:
+	 * 1 dword for the MI_STORE_DATA_IMM command header (opcode etc.)
+	 * 2 dwords for the page table's physical location
+	 * 2*n dwords for the pte values to fill (each pte entry is 2 dwords)
+	 */
+	dword = (1 + 2) * DIV_ROUND_UP(entries, 0x1ff);
+	dword += entries * 2;
+
+	return dword;
+}
+
 static u32 pte_update_size(struct xe_migrate *m,
 			   bool is_vram,
 			   struct ttm_resource *res,
@@ -562,6 +593,48 @@ static void emit_pte(struct xe_migrate *m,
 	}
 }
 
+/**
+ * build_pt_update_batch_sram() - build batch buffer commands to update
+ * migration vm page table for system memory
+ *
+ * @m: The migration context
+ * @bb: The batch buffer which hold the page table update commands
+ * @pt_offset: The offset of page table to update, in byte
+ * @pa: device physical address you want the page table to point to
+ * @size: size of the virtual address space you want the page table to cover
+ */
+static void build_pt_update_batch_sram(struct xe_migrate *m,
+		     struct xe_bb *bb, u32 pt_offset,
+		     u64 pa, u32 size)
+{
+	u16 pat_index = tile_to_xe(m->tile)->pat.idx[XE_CACHE_WB];
+	u32 ptes;
+
+	ptes = DIV_ROUND_UP(size, XE_PAGE_SIZE);
+	while (ptes) {
+		u32 chunk = min(0x1ffU, ptes);
+
+		bb->cs[bb->len++] = MI_STORE_DATA_IMM | MI_SDI_NUM_QW(chunk);
+		bb->cs[bb->len++] = pt_offset;
+		bb->cs[bb->len++] = 0;
+
+		pt_offset += chunk * 8;
+		ptes -= chunk;
+
+		while (chunk--) {
+			u64 addr;
+
+			addr = pa & PAGE_MASK;
+			addr = m->q->vm->pt_ops->pte_encode_addr(m->tile->xe,
+								 addr, pat_index,
+								 0, false, 0);
+			bb->cs[bb->len++] = lower_32_bits(addr);
+			bb->cs[bb->len++] = upper_32_bits(addr);
+			pa += XE_PAGE_SIZE;
+		}
+	}
+}
+
 #define EMIT_COPY_CCS_DW 5
 static void emit_copy_ccs(struct xe_gt *gt, struct xe_bb *bb,
 			  u64 dst_ofs, bool dst_is_indirect,
@@ -879,6 +952,150 @@ struct dma_fence *xe_migrate_copy(struct xe_migrate *m,
 	return fence;
 }
 
+/**
+ * xe_migrate_pa() - Migrate buffers with src and dst physical address
+ *
+ * @m: The migration context
+ * @src_pa: physical address of source, from GPU's point of view. This is a
+ * device physical address (dpa) when source is in vram. When source is in
+ * system memory, this is a dma mapped host physical address
+ * @src_is_vram: True if source buffer is in vram.
+ * @dst_pa: physical address of destination, from GPU's point of view. This is a
+ * device physical address (dpa) when the destination is in vram. When the
+ * destination is in system memory, this is a dma mapped host physical address
+ * @dst_is_vram: True if destination buffer is in vram.
+ * @size: The size of data to copy.
+ *
+ * Copy @size bytes of data from @src_pa to @dst_pa. The functionality
+ * and behavior of this function is similar to xe_migrate_copy function, but
+ * the interface is different. This function is a helper function supposed to
+ * be used by the SVM subsystem. Since in the SVM subsystem there is no buffer
+ * object and ttm, there is no src/dst bo as function input. Instead, we
+ * directly use the src/dst physical addresses as function input.
+ *
+ * Since the backing store of any user malloc'ed or mmap'ed memory can be placed
+ * in system memory, it cannot be compressed. Thus this function doesn't need
+ * to copy CCS (compression control surface) data as xe_migrate_copy() does.
+ *
+ * This function assumes the source and destination buffers are both physically
+ * contiguous.
+ *
+ * We use the gpu blitter to copy data. Source and destination are first mapped
+ * into the migration vm, which is a flat one-level (L0) page table; then the
+ * blitter performs the copy.
+ *
+ * Return: Pointer to a dma_fence representing the last copy batch, or
+ * an error pointer on failure. If there is a failure, any copy operation
+ * started by the function call has been synced.
+ */
+struct dma_fence *xe_migrate_pa(struct xe_migrate *m,
+				  u64 src_pa,
+				  bool src_is_vram,
+				  u64 dst_pa,
+				  bool dst_is_vram,
+				  u64 size)
+{
+#define NUM_PT_PER_BLIT (MAX_PREEMPTDISABLE_TRANSFER / SZ_2M)
+	struct xe_gt *gt = m->tile->primary_gt;
+	struct xe_device *xe = gt_to_xe(gt);
+	struct dma_fence *fence = NULL;
+	u64 src_L0_ofs, dst_L0_ofs;
+	u64 round_update_size;
+	/* A slot is a 4K page table page covering 2M of virtual address space */
+	u32 pt_slot;
+	int err;
+
+	while (size) {
+		u32 batch_size = 2; /* arb_clear() + MI_BATCH_BUFFER_END */
+		struct xe_sched_job *job;
+		struct xe_bb *bb;
+		u32 update_idx;
+
+		/* Copy at most MAX_PREEMPTDISABLE_TRANSFER bytes per batch */
+		round_update_size = min_t(u64, size, MAX_PREEMPTDISABLE_TRANSFER);
+
+		/* src pte update */
+		if (!src_is_vram)
+			batch_size += pte_update_cmd_size(round_update_size);
+		/* dst pte update */
+		if (!dst_is_vram)
+			batch_size += pte_update_cmd_size(round_update_size);
+
+		/* Copy command size */
+		batch_size += EMIT_COPY_DW;
+
+		bb = xe_bb_new(gt, batch_size, true);
+		if (IS_ERR(bb)) {
+			err = PTR_ERR(bb);
+			goto err_sync;
+		}
+
+		if (!src_is_vram) {
+			pt_slot = 0;
+			build_pt_update_batch_sram(m, bb, pt_slot * XE_PAGE_SIZE,
+					src_pa, round_update_size);
+			src_L0_ofs = xe_migrate_vm_addr(pt_slot, 0);
+		} else {
+			src_L0_ofs = xe_migrate_vram_ofs(xe, src_pa);
+		}
+
+		if (!dst_is_vram) {
+			pt_slot = NUM_PT_PER_BLIT;
+			build_pt_update_batch_sram(m, bb, pt_slot * XE_PAGE_SIZE,
+					dst_pa, round_update_size);
+			dst_L0_ofs = xe_migrate_vm_addr(pt_slot, 0);
+		} else {
+			dst_L0_ofs = xe_migrate_vram_ofs(xe, dst_pa);
+		}
+
+
+		bb->cs[bb->len++] = MI_BATCH_BUFFER_END;
+		update_idx = bb->len;
+
+		emit_copy(gt, bb, src_L0_ofs, dst_L0_ofs, round_update_size,
+			  XE_PAGE_SIZE);
+
+		mutex_lock(&m->job_mutex);
+		job = xe_bb_create_migration_job(m->q, bb,
+						 xe_migrate_batch_base(m, true),
+						 update_idx);
+		if (IS_ERR(job)) {
+			err = PTR_ERR(job);
+			goto err;
+		}
+
+		xe_sched_job_add_migrate_flush(job, 0);
+		xe_sched_job_arm(job);
+		dma_fence_put(fence);
+		fence = dma_fence_get(&job->drm.s_fence->finished);
+		xe_sched_job_push(job);
+		dma_fence_put(m->fence);
+		m->fence = dma_fence_get(fence);
+
+		mutex_unlock(&m->job_mutex);
+
+		xe_bb_free(bb, fence);
+		size -= round_update_size;
+		src_pa += round_update_size;
+		dst_pa += round_update_size;
+		continue;
+
+err:
+		mutex_unlock(&m->job_mutex);
+		xe_bb_free(bb, NULL);
+
+err_sync:
+		/* Sync partial copy if any. FIXME: under job_mutex? */
+		if (fence) {
+			dma_fence_wait(fence, false);
+			dma_fence_put(fence);
+		}
+
+		return ERR_PTR(err);
+	}
+
+	return fence;
+}
 static void emit_clear_link_copy(struct xe_gt *gt, struct xe_bb *bb, u64 src_ofs,
 				 u32 size, u32 pitch)
 {
diff --git a/drivers/gpu/drm/xe/xe_migrate.h b/drivers/gpu/drm/xe/xe_migrate.h
index 701bb27349b0..98b480244265 100644
--- a/drivers/gpu/drm/xe/xe_migrate.h
+++ b/drivers/gpu/drm/xe/xe_migrate.h
@@ -101,6 +101,13 @@ struct dma_fence *xe_migrate_copy(struct xe_migrate *m,
 				  struct ttm_resource *dst,
 				  bool copy_only_ccs);
 
+struct dma_fence *xe_migrate_pa(struct xe_migrate *m,
+				  u64 src_pa,
+				  bool src_is_vram,
+				  u64 dst_pa,
+				  bool dst_is_vram,
+				  u64 size);
+
 struct dma_fence *xe_migrate_clear(struct xe_migrate *m,
 				   struct xe_bo *bo,
 				   struct ttm_resource *dst);
-- 
2.26.3


^ permalink raw reply related	[flat|nested] 72+ messages in thread
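[Editorial note] The dword accounting in pte_update_cmd_size() above can be checked with a small self-contained C sketch. The formula mirrors the patch comment (3 command dwords plus 2 dwords per PTE, at most 0x1ff PTEs per MI_STORE_DATA_IMM); the macro names here are local stand-ins:

```c
#include <assert.h>

#define XE_PAGE_SIZE 4096ull
#define MAX_PTES_PER_CMD 0x1ffull /* qword PTEs per MI_STORE_DATA_IMM */

/* Dwords needed to fill a flat, 4k-granular page table covering
 * `size` bytes: 3 header/address dwords per MI_STORE_DATA_IMM plus
 * two dwords per PTE, as described in the patch comment. */
static unsigned int pte_update_cmd_size(unsigned long long size)
{
	unsigned long long entries = (size + XE_PAGE_SIZE - 1) / XE_PAGE_SIZE;
	unsigned long long cmds =
		(entries + MAX_PTES_PER_CMD - 1) / MAX_PTES_PER_CMD;

	return (unsigned int)(3 * cmds + 2 * entries);
}
```

For one page this yields 5 dwords (one command, one PTE); crossing the 0x1ff-entry boundary adds a second 3-dword command header.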

* [v2 22/31] drm/xe/svm: implement functions to allocate and free device memory
  2024-04-09 20:17 [v2 00/31] Basic system allocator support in xe driver Oak Zeng
                   ` (20 preceding siblings ...)
  2024-04-09 20:17 ` [v2 21/31] drm/xe/svm: Introduce svm migration function Oak Zeng
@ 2024-04-09 20:17 ` Oak Zeng
  2024-04-10 22:23   ` Matthew Brost
  2024-04-17 20:55   ` Matthew Brost
  2024-04-09 20:17 ` [v2 23/31] drm/xe/svm: Trace buddy block allocation and free Oak Zeng
                   ` (9 subsequent siblings)
  31 siblings, 2 replies; 72+ messages in thread
From: Oak Zeng @ 2024-04-09 20:17 UTC (permalink / raw)
  To: intel-xe
  Cc: himal.prasad.ghimiray, krishnaiah.bommu, matthew.brost,
	Thomas.Hellstrom, brian.welty

Function xe_devm_alloc_pages() allocates pages from the drm buddy
allocator and performs housekeeping for all the pages allocated,
such as getting a page refcount, keeping a bitmap of all pages to
denote whether a page is in use, and putting pages on a drm lru
list for eviction purposes.

Function xe_devm_free_blocks() returns a list of memory blocks to
the drm buddy allocator.

Function xe_devm_page_free() is a callback from the hmm layer. It
is called whenever a page's refcount reaches 1. It clears this
page's bit in the bitmap. If all bits in the bitmap are cleared,
all the pages in this memory block have been freed, and we return
the whole block to the drm buddy allocator.

Signed-off-by: Oak Zeng <oak.zeng@intel.com>
Co-developed-by: Niranjana Vishwanathapura <niranjana.vishwanathapura@intel.com>
Signed-off-by: Niranjana Vishwanathapura <niranjana.vishwanathapura@intel.com>
Cc: Matthew Brost <matthew.brost@intel.com>
Cc: Thomas Hellström <thomas.hellstrom@intel.com>
Cc: Brian Welty <brian.welty@intel.com>
---
 drivers/gpu/drm/xe/xe_svm.h        |   7 ++
 drivers/gpu/drm/xe/xe_svm_devmem.c | 147 ++++++++++++++++++++++++++++-
 2 files changed, 152 insertions(+), 2 deletions(-)

diff --git a/drivers/gpu/drm/xe/xe_svm.h b/drivers/gpu/drm/xe/xe_svm.h
index 624c1581f8ba..92a3ee90d5a7 100644
--- a/drivers/gpu/drm/xe/xe_svm.h
+++ b/drivers/gpu/drm/xe/xe_svm.h
@@ -46,4 +46,11 @@ static inline struct xe_mem_region *xe_page_to_mem_region(struct page *page)
 	return container_of(page->pgmap, struct xe_mem_region, pagemap);
 }
 
+int xe_devm_alloc_pages(struct xe_tile *tile,
+						unsigned long npages,
+						struct list_head *blocks,
+						unsigned long *pfn);
+
+void xe_devm_free_blocks(struct list_head *blocks);
+void xe_devm_page_free(struct page *page);
 #endif
diff --git a/drivers/gpu/drm/xe/xe_svm_devmem.c b/drivers/gpu/drm/xe/xe_svm_devmem.c
index 31af56e8285a..5ba0cd9a70b0 100644
--- a/drivers/gpu/drm/xe/xe_svm_devmem.c
+++ b/drivers/gpu/drm/xe/xe_svm_devmem.c
@@ -5,18 +5,161 @@
 
 #include <linux/mm_types.h>
 #include <linux/sched/mm.h>
-
+#include <linux/gfp.h>
+#include <linux/migrate.h>
+#include <linux/dma-mapping.h>
+#include <linux/dma-fence.h>
+#include <linux/bitops.h>
+#include <linux/bitmap.h>
+#include <drm/drm_buddy.h>
 #include "xe_device_types.h"
 #include "xe_svm.h"
+#include "xe_migrate.h"
+#include "xe_ttm_vram_mgr_types.h"
+#include "xe_assert.h"
 
+/**
+ * struct xe_svm_block_meta - svm uses this data structure to manage each
+ * block allocated from drm buddy. This will be set to the drm_buddy_block's
+ * private field.
+ *
+ * @lru: used to link this block to drm's lru lists. This will be replaced
+ * with struct drm_lru_entity later.
+ * @tile: tile from which we allocated this block
+ * @bitmap: A bitmap of each page in this block. 1 means this page is used,
+ * 0 means this page is idle. When all bits of this block are 0, it is time
+ * to return this block to drm buddy subsystem.
+ */
+struct xe_svm_block_meta {
+	struct list_head lru;
+	struct xe_tile *tile;
+	unsigned long bitmap[];
+};
 
 static vm_fault_t xe_devm_migrate_to_ram(struct vm_fault *vmf)
 {
 	return 0;
 }
 
-static void xe_devm_page_free(struct page *page)
+static u64 block_offset_to_pfn(struct xe_mem_region *mr, u64 offset)
+{
+	/* DRM buddy's block offset is 0-based */
+	offset += mr->hpa_base;
+
+	return PHYS_PFN(offset);
+}
+
+/* FIXME: we locked the page by calling zone_device_page_init()
+ * in xe_devm_alloc_pages(). Should we unlock pages here?
+ */
+static void free_block(struct drm_buddy_block *block)
+{
+	struct xe_svm_block_meta *meta =
+		(struct xe_svm_block_meta *)block->private;
+	struct xe_tile *tile  = meta->tile;
+	struct drm_buddy *mm = &tile->mem.vram_mgr->mm;
+
+	kfree(block->private);
+	drm_buddy_free_block(mm, block);
+}
+
+void xe_devm_page_free(struct page *page)
+{
+	struct drm_buddy_block *block =
+					(struct drm_buddy_block *)page->zone_device_data;
+	struct xe_svm_block_meta *meta =
+					(struct xe_svm_block_meta *)block->private;
+	struct xe_tile *tile  = meta->tile;
+	struct xe_mem_region *mr = &tile->mem.vram;
+	struct drm_buddy *mm = &tile->mem.vram_mgr->mm;
+	u64 size = drm_buddy_block_size(mm, block);
+	u64 pages_per_block = size >> PAGE_SHIFT;
+	u64 block_pfn_first =
+					block_offset_to_pfn(mr, drm_buddy_block_offset(block));
+	u64 page_pfn = page_to_pfn(page);
+	u64 i = page_pfn - block_pfn_first;
+
+	xe_assert(tile->xe, i < pages_per_block);
+	clear_bit(i, meta->bitmap);
+	if (bitmap_empty(meta->bitmap, pages_per_block))
+		free_block(block);
+}
+
+/**
+ * xe_devm_alloc_pages() - allocate device pages from buddy allocator
+ *
+ * @tile: the tile to allocate device memory from
+ * @npages: how many pages to allocate
+ * @blocks: used to return the allocated blocks
+ * @pfn: used to return the pfns of all allocated pages. Must be big
+ * enough to hold at least @npages entries.
+ *
+ * This function allocates blocks of memory from the drm buddy allocator
+ * and performs initialization work: set struct page::zone_device_data to
+ * point to the memory block; set/initialize the drm_buddy_block::private
+ * field; lock_page for each page allocated; add the memory block to the
+ * lru manager's lru list (TBD).
+ *
+ * Return: 0 on success,
+ * error code otherwise
+ */
+int xe_devm_alloc_pages(struct xe_tile *tile,
+						unsigned long npages,
+						struct list_head *blocks,
+						unsigned long *pfn)
+{
+	struct drm_buddy *mm = &tile->mem.vram_mgr->mm;
+	struct drm_buddy_block *block, *tmp;
+	u64 size = npages << PAGE_SHIFT;
+	int ret = 0, i, j = 0;
+
+	ret = drm_buddy_alloc_blocks(mm, 0, mm->size, size, PAGE_SIZE,
+						blocks, DRM_BUDDY_TOPDOWN_ALLOCATION);
+
+	if (unlikely(ret))
+		return ret;
+
+	list_for_each_entry_safe(block, tmp, blocks, link) {
+		struct xe_mem_region *mr = &tile->mem.vram;
+		u64 block_pfn_first, pages_per_block;
+		struct xe_svm_block_meta *meta;
+		u32 meta_size;
+
+		size = drm_buddy_block_size(mm, block);
+		pages_per_block = size >> PAGE_SHIFT;
+		meta_size = BITS_TO_BYTES(pages_per_block) +
+					sizeof(struct xe_svm_block_meta);
+		meta = kzalloc(meta_size, GFP_KERNEL);
+		bitmap_fill(meta->bitmap, pages_per_block);
+		meta->tile = tile;
+		block->private = meta;
+		block_pfn_first =
+					block_offset_to_pfn(mr, drm_buddy_block_offset(block));
+		for (i = 0; i < pages_per_block; i++) {
+			struct page *page;
+
+			pfn[j++] = block_pfn_first + i;
+			page = pfn_to_page(block_pfn_first + i);
+			/* Lock page per hmm requirement, see hmm.rst. */
+			zone_device_page_init(page);
+			page->zone_device_data = block;
+		}
+	}
+
+	return ret;
+}
+
+/**
+ * xe_devm_free_blocks() - free all memory blocks
+ *
+ * @blocks: memory blocks list head
+ */
+void xe_devm_free_blocks(struct list_head *blocks)
 {
+	struct drm_buddy_block *block, *tmp;
+
+	list_for_each_entry_safe(block, tmp, blocks, link)
+		free_block(block);
 }
 
 static const struct dev_pagemap_ops xe_devm_pagemap_ops = {
-- 
2.26.3


^ permalink raw reply related	[flat|nested] 72+ messages in thread
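[Editorial note] The per-block page accounting in this patch (bitmap_fill() on allocation, clear_bit() in the page-free callback, free_block() when the bitmap empties) can be sketched as a self-contained userspace analog. The structure and names below are simplified stand-ins for xe_svm_block_meta, not the xe API:

```c
#include <assert.h>
#include <stdbool.h>

#define PAGES_PER_BLOCK 64 /* illustrative; real blocks vary in size */

/* Simplified analog of xe_svm_block_meta: one bit per page in the
 * block, set on allocation, cleared in the page-free callback. The
 * block goes back to the buddy allocator when the last bit clears. */
struct block_meta {
	unsigned long long bitmap;
	bool returned_to_buddy; /* stands in for free_block() */
};

static void block_alloc(struct block_meta *m)
{
	m->bitmap = ~0ull; /* bitmap_fill(): all pages in use */
	m->returned_to_buddy = false;
}

static void block_page_free(struct block_meta *m, unsigned int page)
{
	m->bitmap &= ~(1ull << page);  /* clear_bit() */
	if (m->bitmap == 0)            /* bitmap_empty() */
		m->returned_to_buddy = true;
}
```

Freeing all pages but one leaves the block alive; freeing the last page returns it, mirroring the control flow of xe_devm_page_free().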

* [v2 23/31] drm/xe/svm: Trace buddy block allocation and free
  2024-04-09 20:17 [v2 00/31] Basic system allocator support in xe driver Oak Zeng
                   ` (21 preceding siblings ...)
  2024-04-09 20:17 ` [v2 22/31] drm/xe/svm: implement functions to allocate and free device memory Oak Zeng
@ 2024-04-09 20:17 ` Oak Zeng
  2024-04-09 20:17 ` [v2 24/31] drm/xe/svm: Create and destroy xe svm Oak Zeng
                   ` (8 subsequent siblings)
  31 siblings, 0 replies; 72+ messages in thread
From: Oak Zeng @ 2024-04-09 20:17 UTC (permalink / raw)
  To: intel-xe
  Cc: himal.prasad.ghimiray, krishnaiah.bommu, matthew.brost,
	Thomas.Hellstrom, brian.welty

trace_xe_buddy_block_alloc and trace_xe_buddy_block_free
are added to trace buddy allocation and free.

Signed-off-by: Oak Zeng <oak.zeng@intel.com>
Cc: Niranjana Vishwanathapura <niranjana.vishwanathapura@intel.com>
Cc: Matthew Brost <matthew.brost@intel.com>
Cc: Thomas Hellström <thomas.hellstrom@intel.com>
Cc: Brian Welty <brian.welty@intel.com>
---
 drivers/gpu/drm/xe/xe_svm_devmem.c |  6 ++++-
 drivers/gpu/drm/xe/xe_trace.h      | 35 ++++++++++++++++++++++++++++++
 2 files changed, 40 insertions(+), 1 deletion(-)

diff --git a/drivers/gpu/drm/xe/xe_svm_devmem.c b/drivers/gpu/drm/xe/xe_svm_devmem.c
index 5ba0cd9a70b0..088ac209ad80 100644
--- a/drivers/gpu/drm/xe/xe_svm_devmem.c
+++ b/drivers/gpu/drm/xe/xe_svm_devmem.c
@@ -17,6 +17,7 @@
 #include "xe_migrate.h"
 #include "xe_ttm_vram_mgr_types.h"
 #include "xe_assert.h"
+#include "xe_trace.h"
 
 /**
  * struct xe_svm_block_meta - svm uses this data structure to manage each
@@ -81,8 +82,10 @@ void xe_devm_page_free(struct page *page)
 
 	xe_assert(tile->xe, i < pages_per_block);
 	clear_bit(i, meta->bitmap);
-	if (bitmap_empty(meta->bitmap, pages_per_block))
+	if (bitmap_empty(meta->bitmap, pages_per_block)) {
 		free_block(block);
+		trace_xe_buddy_block_free(block, size, block_pfn_first);
+	}
 }
 
 /**
@@ -135,6 +138,7 @@ int xe_devm_alloc_pages(struct xe_tile *tile,
 		block->private = meta;
 		block_pfn_first =
 					block_offset_to_pfn(mr, drm_buddy_block_offset(block));
+		trace_xe_buddy_block_alloc(block, size, block_pfn_first);
 		for (i = 0; i < pages_per_block; i++) {
 			struct page *page;
 
diff --git a/drivers/gpu/drm/xe/xe_trace.h b/drivers/gpu/drm/xe/xe_trace.h
index 5f7d26bf4cd7..f3fcce9f1434 100644
--- a/drivers/gpu/drm/xe/xe_trace.h
+++ b/drivers/gpu/drm/xe/xe_trace.h
@@ -21,6 +21,7 @@
 #include "xe_guc_exec_queue_types.h"
 #include "xe_sched_job.h"
 #include "xe_vm.h"
+#include <drm/drm_buddy.h>
 
 DECLARE_EVENT_CLASS(xe_gt_tlb_invalidation_fence,
 		    TP_PROTO(struct xe_gt_tlb_invalidation_fence *fence),
@@ -622,6 +623,40 @@ DEFINE_EVENT_PRINT(xe_guc_ctb, xe_guc_ctb_g2h,
 
 );
 
+DECLARE_EVENT_CLASS(xe_buddy_block,
+               TP_PROTO(struct drm_buddy_block *block, u64 size, u64 pfn),
+               TP_ARGS(block, size, pfn),
+
+               TP_STRUCT__entry(
+                               __field(u64, block)
+                               __field(u64, header)
+                               __field(u64, size)
+                               __field(u64, pfn)
+               ),
+
+               TP_fast_assign(
+                               __entry->block = (u64)block;
+                               __entry->header = block->header;
+                               __entry->size = size;
+                               __entry->pfn = pfn;
+               ),
+
+               TP_printk("xe svm: allocated block %llx, block header %llx, size %llx, pfn %llx\n",
+                       __entry->block, __entry->header, __entry->size, __entry->pfn)
+);
+
+
+DEFINE_EVENT(xe_buddy_block, xe_buddy_block_alloc,
+               TP_PROTO(struct drm_buddy_block *block, u64 size, u64 pfn),
+               TP_ARGS(block, size, pfn)
+);
+
+
+DEFINE_EVENT(xe_buddy_block, xe_buddy_block_free,
+               TP_PROTO(struct drm_buddy_block *block, u64 size, u64 pfn),
+               TP_ARGS(block, size, pfn)
+);
+
 #endif
 
 /* This part must be outside protection */
-- 
2.26.3


^ permalink raw reply related	[flat|nested] 72+ messages in thread

* [v2 24/31] drm/xe/svm: Create and destroy xe svm
  2024-04-09 20:17 [v2 00/31] Basic system allocator support in xe driver Oak Zeng
                   ` (22 preceding siblings ...)
  2024-04-09 20:17 ` [v2 23/31] drm/xe/svm: Trace buddy block allocation and free Oak Zeng
@ 2024-04-09 20:17 ` Oak Zeng
  2024-04-10 22:25   ` Matthew Brost
  2024-04-09 20:17 ` [v2 25/31] drm/xe/svm: Add vm to xe_svm process Oak Zeng
                   ` (7 subsequent siblings)
  31 siblings, 1 reply; 72+ messages in thread
From: Oak Zeng @ 2024-04-09 20:17 UTC (permalink / raw)
  To: intel-xe
  Cc: himal.prasad.ghimiray, krishnaiah.bommu, matthew.brost,
	Thomas.Hellstrom, brian.welty

Introduce a data structure xe_svm to represent a shared virtual
address space b/t a CPU program and GPU program. Each process can
have at most one xe_svm instance, while one xe_svm can have
multiple gpu vms.

Introduce helper functions to create and destroy an xe_svm
instance. Once an xe_svm instance is created, it is added to a
global hash table keyed by mm_struct, so we can later retrieve
the xe_svm using its mm_struct.

Signed-off-by: Oak Zeng <oak.zeng@intel.com>
---
 drivers/gpu/drm/xe/Makefile |  1 +
 drivers/gpu/drm/xe/xe_svm.c | 77 +++++++++++++++++++++++++++++++++++++
 drivers/gpu/drm/xe/xe_svm.h | 23 +++++++++++
 3 files changed, 101 insertions(+)
 create mode 100644 drivers/gpu/drm/xe/xe_svm.c

diff --git a/drivers/gpu/drm/xe/Makefile b/drivers/gpu/drm/xe/Makefile
index cd5213ba182b..f89d77b6d654 100644
--- a/drivers/gpu/drm/xe/Makefile
+++ b/drivers/gpu/drm/xe/Makefile
@@ -129,6 +129,7 @@ xe-y += xe_bb.o \
 	xe_sa.o \
 	xe_sched_job.o \
 	xe_step.o \
+	xe_svm.o \
 	xe_svm_devmem.o \
 	xe_sync.o \
 	xe_tile.o \
diff --git a/drivers/gpu/drm/xe/xe_svm.c b/drivers/gpu/drm/xe/xe_svm.c
new file mode 100644
index 000000000000..416cfc81c053
--- /dev/null
+++ b/drivers/gpu/drm/xe/xe_svm.c
@@ -0,0 +1,77 @@
+// SPDX-License-Identifier: MIT
+/*
+ * Copyright © 2023 Intel Corporation
+ */
+
+#include <linux/mutex.h>
+#include <linux/mm_types.h>
+#include <linux/kernel.h>
+#include <linux/hashtable.h>
+#include "xe_svm.h"
+
+#define XE_MAX_SVM_PROCESS 5 /* 2^5 = 32 hash buckets for SVM processes */
+DEFINE_HASHTABLE(xe_svm_table, XE_MAX_SVM_PROCESS);
+
+/**
+ * xe_create_svm() - create a svm instance
+ *
+ * One xe_svm struct represents a shared address space
+ * between a cpu program and gpu program, so one xe_svm is
+ * associated with one mm_struct.
+ *
+ * If xe_svm for this process already exists, just return
+ * it; otherwise create one.
+ *
+ * Return: the created xe_svm struct pointer
+ */
+struct xe_svm *xe_create_svm(void)
+{
+	struct mm_struct *mm = current->mm;
+	struct xe_svm *svm;
+
+	svm = xe_lookup_svm_by_mm(mm);
+	if (svm)
+		return svm;
+
+	svm = kzalloc(sizeof(struct xe_svm), GFP_KERNEL);
+	svm->mm = mm;
+	mutex_init(&svm->mutex);
+	INIT_LIST_HEAD(&svm->vm_list);
+	/** Add svm to global xe_svm_table hash table
+	 *  use mm as key so later we can retrieve svm using mm
+	 */
+	hash_add_rcu(xe_svm_table, &svm->hnode, (uintptr_t)mm);
+	return svm;
+}
+
+/**
+ * xe_destroy_svm() - destroy a svm process
+ *
+ * @svm: the xe_svm to destroy
+ */
+void xe_destroy_svm(struct xe_svm *svm)
+{
+	BUG_ON(!list_empty(&svm->vm_list));
+	hash_del_rcu(&svm->hnode);
+	mutex_destroy(&svm->mutex);
+	kfree(svm);
+}
+
+
+/**
+ * xe_lookup_svm_by_mm() - retrieve xe_svm from mm struct
+ *
+ * @mm: the mm struct of the svm to retrieve
+ *
+ * Return: the xe_svm struct pointer, or NULL if not found
+ */
+struct xe_svm *xe_lookup_svm_by_mm(struct mm_struct *mm)
+{
+	struct xe_svm *svm;
+
+	hash_for_each_possible_rcu(xe_svm_table, svm, hnode, (uintptr_t)mm)
+		if (svm->mm == mm)
+			return svm;
+
+	return NULL;
+}
diff --git a/drivers/gpu/drm/xe/xe_svm.h b/drivers/gpu/drm/xe/xe_svm.h
index 92a3ee90d5a7..066740fb93f5 100644
--- a/drivers/gpu/drm/xe/xe_svm.h
+++ b/drivers/gpu/drm/xe/xe_svm.h
@@ -11,6 +11,29 @@
 #include "xe_device.h"
 #include "xe_assert.h"
 
+
+/**
+ * struct xe_svm - data structure to represent a shared
+ * virtual address space from the device side. xe_svm and
+ * mm_struct have a 1:1 relationship.
+ */
+struct xe_svm {
+	/** @mm: The mm_struct corresponding to this xe_svm */
+	struct mm_struct *mm;
+	/**
+	 * @mutex: protects the vm_list below
+	 */
+	struct mutex mutex;
+	/** @hnode: used to add this svm to the global xe_svm_table hash table */
+	struct hlist_node hnode;
+	/** @vm_list: a list of gpu vms in this svm space */
+	struct list_head vm_list;
+};
+
+extern struct xe_svm *xe_create_svm(void);
+void xe_destroy_svm(struct xe_svm *svm);
+extern struct xe_svm *xe_lookup_svm_by_mm(struct mm_struct *mm);
+
 /**
  * xe_mem_region_pfn_to_dpa() - Calculate page's dpa from pfn
  *
-- 
2.26.3


^ permalink raw reply related	[flat|nested] 72+ messages in thread
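[Editorial note] The mm-keyed lookup this patch builds on the kernel hashtable can be sketched as a self-contained userspace C analog: bucket by the mm pointer's value, chain on collisions, and resolve by pointer equality as xe_lookup_svm_by_mm() does. All names below are illustrative stand-ins:

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define SVM_TABLE_BITS 5
#define SVM_TABLE_SIZE (1u << SVM_TABLE_BITS) /* 32 buckets, as in the patch */

/* Userspace stand-in for the kernel hashtable keyed by mm_struct *:
 * bucket by pointer value, chain on collision, scan the bucket to
 * resolve. */
struct svm {
	const void *mm;
	struct svm *next;
};

static struct svm *svm_table[SVM_TABLE_SIZE];

static unsigned int svm_bucket(const void *mm)
{
	/* crude pointer hash: drop alignment bits, mask to table size */
	return (unsigned int)(((uintptr_t)mm >> 4) & (SVM_TABLE_SIZE - 1));
}

static void svm_add(struct svm *s, const void *mm)
{
	unsigned int b = svm_bucket(mm);

	s->mm = mm;
	s->next = svm_table[b];
	svm_table[b] = s;
}

/* Analog of xe_lookup_svm_by_mm(): walk the bucket, match on mm. */
static struct svm *svm_lookup(const void *mm)
{
	for (struct svm *s = svm_table[svm_bucket(mm)]; s; s = s->next)
		if (s->mm == mm)
			return s;
	return NULL;
}
```

Two distinct mm keys resolve to their own svm entries even if they land in the same bucket, and an unknown key returns NULL.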

* [v2 25/31] drm/xe/svm: Add vm to xe_svm process
  2024-04-09 20:17 [v2 00/31] Basic system allocator support in xe driver Oak Zeng
                   ` (23 preceding siblings ...)
  2024-04-09 20:17 ` [v2 24/31] drm/xe/svm: Create and destroy xe svm Oak Zeng
@ 2024-04-09 20:17 ` Oak Zeng
  2024-04-09 20:17 ` [v2 26/31] drm/xe: Make function lookup_vma public Oak Zeng
                   ` (6 subsequent siblings)
  31 siblings, 0 replies; 72+ messages in thread
From: Oak Zeng @ 2024-04-09 20:17 UTC (permalink / raw)
  To: intel-xe
  Cc: himal.prasad.ghimiray, krishnaiah.bommu, matthew.brost,
	Thomas.Hellstrom, brian.welty

One shared virtual address space (xe_svm) works across the CPU
and multiple GPUs under one CPU process. Each xe_svm process
can have multiple gpu vms, for example, one gpu vm per gpu
card. Add the gpu vm to the current xe_svm process during
xe_vm creation to note that this gpu vm participates in the
shared virtual address space of the current CPU process; also
remove the xe_vm from the xe_svm on xe_vm destroy.

FIXME: right now we blindly add all xe_vms to svm. Should
we introduce uAPI to let the user decide which xe_vms
participate in svm?

Signed-off-by: Oak Zeng <oak.zeng@intel.com>
---
 drivers/gpu/drm/xe/xe_svm.c      | 45 ++++++++++++++++++++++++++++++++
 drivers/gpu/drm/xe/xe_svm.h      |  3 +++
 drivers/gpu/drm/xe/xe_vm.c       |  5 ++++
 drivers/gpu/drm/xe/xe_vm_types.h |  2 ++
 4 files changed, 55 insertions(+)

diff --git a/drivers/gpu/drm/xe/xe_svm.c b/drivers/gpu/drm/xe/xe_svm.c
index 416cfc81c053..1f4c2d32121a 100644
--- a/drivers/gpu/drm/xe/xe_svm.c
+++ b/drivers/gpu/drm/xe/xe_svm.c
@@ -8,6 +8,7 @@
 #include <linux/kernel.h>
 #include <linux/hashtable.h>
 #include "xe_svm.h"
+#include "xe_vm_types.h"
 
 #define XE_MAX_SVM_PROCESS 5 /* 2^5 = 32 hash buckets for SVM processes */
 DEFINE_HASHTABLE(xe_svm_table, XE_MAX_SVM_PROCESS);
@@ -75,3 +76,47 @@ struct xe_svm *xe_lookup_svm_by_mm(struct mm_struct *mm)
 
 	return NULL;
 }
+
+/**
+ * xe_svm_add_vm() - add a gpu vm to the current svm process
+ *
+ * @vm: The gpu vm to add to the current svm process.
+ *
+ * One shared virtual address space (xe_svm) works across the CPU
+ * and multiple GPUs. So each xe_svm process can have N gpu
+ * vms, for example, one gpu vm per gpu card. This function
+ * adds a gpu vm to the current xe_svm process.
+ */
+void xe_svm_add_vm(struct xe_vm *vm)
+{
+	struct xe_svm *svm;
+
+	svm = xe_lookup_svm_by_mm(current->mm);
+	if (!svm)
+		svm = xe_create_svm();
+
+	mutex_lock(&svm->mutex);
+	list_add(&vm->svm_link, &svm->vm_list);
+	mutex_unlock(&svm->mutex);
+}
+
+/**
+ * xe_svm_remove_vm() - remove a gpu vm from svm process
+ *
+ * @vm: The gpu vm to remove from svm process.
+ */
+void xe_svm_remove_vm(struct xe_vm *vm)
+{
+	struct xe_svm *svm;
+
+	svm = xe_lookup_svm_by_mm(current->mm);
+	if (!svm)
+		return;
+
+	mutex_lock(&svm->mutex);
+	list_del(&vm->svm_link);
+	mutex_unlock(&svm->mutex);
+
+	if (list_empty(&svm->vm_list))
+		xe_destroy_svm(svm);
+}
diff --git a/drivers/gpu/drm/xe/xe_svm.h b/drivers/gpu/drm/xe/xe_svm.h
index 066740fb93f5..f601dffe3fc1 100644
--- a/drivers/gpu/drm/xe/xe_svm.h
+++ b/drivers/gpu/drm/xe/xe_svm.h
@@ -11,6 +11,7 @@
 #include "xe_device.h"
 #include "xe_assert.h"
 
+struct xe_vm;
 
 /**
  * struct xe_svm - data structure to represent a shared
@@ -33,6 +34,8 @@ struct xe_svm {
 extern struct xe_svm *xe_create_svm(void);
 void xe_destroy_svm(struct xe_svm *svm);
 extern struct xe_svm *xe_lookup_svm_by_mm(struct mm_struct *mm);
+void xe_svm_add_vm(struct xe_vm *vm);
+void xe_svm_remove_vm(struct xe_vm *vm);
 
 /**
  * xe_mem_region_pfn_to_dpa() - Calculate page's dpa from pfn
diff --git a/drivers/gpu/drm/xe/xe_vm.c b/drivers/gpu/drm/xe/xe_vm.c
index 61d336f24a65..498b36469d00 100644
--- a/drivers/gpu/drm/xe/xe_vm.c
+++ b/drivers/gpu/drm/xe/xe_vm.c
@@ -40,6 +40,7 @@
 #include "xe_trace.h"
 #include "xe_wa.h"
 #include "xe_hmm.h"
+#include "xe_svm.h"
 
 static struct drm_gem_object *xe_vm_obj(struct xe_vm *vm)
 {
@@ -1347,6 +1348,7 @@ struct xe_vm *xe_vm_create(struct xe_device *xe, u32 flags)
 	INIT_LIST_HEAD(&vm->userptr.repin_list);
 	INIT_LIST_HEAD(&vm->userptr.invalidated);
 	INIT_LIST_HEAD(&vm->userptr.fault_invalidated);
+	INIT_LIST_HEAD(&vm->svm_link);
 	init_rwsem(&vm->userptr.notifier_lock);
 	spin_lock_init(&vm->userptr.invalidated_lock);
 	INIT_WORK(&vm->userptr.garbage_collector, vm_userptr_garbage_collector);
@@ -1445,6 +1447,8 @@ struct xe_vm *xe_vm_create(struct xe_device *xe, u32 flags)
 		xe->usm.num_vm_in_non_fault_mode++;
 	mutex_unlock(&xe->usm.lock);
 
+	/* FIXME: Should we add the vm to svm conditionally, per uAPI? */
+	xe_svm_add_vm(vm);
 	trace_xe_vm_create(vm);
 
 	return vm;
@@ -1562,6 +1566,7 @@ void xe_vm_close_and_put(struct xe_vm *vm)
 	for_each_tile(tile, xe, id)
 		xe_range_fence_tree_fini(&vm->rftree[id]);
 
+	xe_svm_remove_vm(vm);
 	xe_vm_put(vm);
 }
 
diff --git a/drivers/gpu/drm/xe/xe_vm_types.h b/drivers/gpu/drm/xe/xe_vm_types.h
index d1f5949d4a3b..eb797195c374 100644
--- a/drivers/gpu/drm/xe/xe_vm_types.h
+++ b/drivers/gpu/drm/xe/xe_vm_types.h
@@ -394,6 +394,8 @@ struct xe_vm {
 	bool batch_invalidate_tlb;
 	/** @xef: XE file handle for tracking this VM's drm client */
 	struct xe_file *xef;
+	/** @svm_link: used to link this vm to xe_svm's vm_list */
+	struct list_head svm_link;
 };
 
 #endif
-- 
2.26.3


^ permalink raw reply related	[flat|nested] 72+ messages in thread

* [v2 26/31] drm/xe: Make function lookup_vma public
  2024-04-09 20:17 [v2 00/31] Basic system allocator support in xe driver Oak Zeng
                   ` (24 preceding siblings ...)
  2024-04-09 20:17 ` [v2 25/31] drm/xe/svm: Add vm to xe_svm process Oak Zeng
@ 2024-04-09 20:17 ` Oak Zeng
  2024-04-10 22:26   ` Matthew Brost
  2024-04-09 20:17 ` [v2 27/31] drm/xe/svm: Handle CPU page fault Oak Zeng
                   ` (5 subsequent siblings)
  31 siblings, 1 reply; 72+ messages in thread
From: Oak Zeng @ 2024-04-09 20:17 UTC (permalink / raw)
  To: intel-xe
  Cc: himal.prasad.ghimiray, krishnaiah.bommu, matthew.brost,
	Thomas.Hellstrom, brian.welty

Make this function public as it will be used by later patches. Also
rename it to xe_vm_lookup_vma.

Signed-off-by: Oak Zeng <oak.zeng@intel.com>
---
 drivers/gpu/drm/xe/xe_gt_pagefault.c | 10 ++++++++--
 drivers/gpu/drm/xe/xe_vm.h           |  1 +
 2 files changed, 9 insertions(+), 2 deletions(-)

diff --git a/drivers/gpu/drm/xe/xe_gt_pagefault.c b/drivers/gpu/drm/xe/xe_gt_pagefault.c
index 707a3466f36b..668984f0769e 100644
--- a/drivers/gpu/drm/xe/xe_gt_pagefault.c
+++ b/drivers/gpu/drm/xe/xe_gt_pagefault.c
@@ -80,7 +80,13 @@ static bool vma_matches(struct xe_vma *vma, u64 page_addr)
 	return true;
 }
 
-static struct xe_vma *lookup_vma(struct xe_vm *vm, u64 page_addr)
+/**
+ * xe_vm_lookup_vma() - look up a vma from an address
+ * @vm: the xe_vm that the vma resides in
+ * @page_addr: address to look up
+ * Return: the vma covering @page_addr, or NULL if none is found
+ */
+struct xe_vma *xe_vm_lookup_vma(struct xe_vm *vm, u64 page_addr)
 {
 	struct xe_vma *vma = NULL;
 
@@ -166,7 +172,7 @@ static int handle_pagefault(struct xe_gt *gt, struct pagefault *pf)
 		ret = -ENOENT;
 		goto unlock_vm;
 	}
-	vma = lookup_vma(vm, pf->page_addr);
+	vma = xe_vm_lookup_vma(vm, pf->page_addr);
 	if (!vma) {
 		ret = -EINVAL;
 		goto unlock_vm;
diff --git a/drivers/gpu/drm/xe/xe_vm.h b/drivers/gpu/drm/xe/xe_vm.h
index 4860747592ad..d55330988e32 100644
--- a/drivers/gpu/drm/xe/xe_vm.h
+++ b/drivers/gpu/drm/xe/xe_vm.h
@@ -306,3 +306,4 @@ struct xe_vm_snapshot *xe_vm_snapshot_capture(struct xe_vm *vm);
 void xe_vm_snapshot_capture_delayed(struct xe_vm_snapshot *snap);
 void xe_vm_snapshot_print(struct xe_vm_snapshot *snap, struct drm_printer *p);
 void xe_vm_snapshot_free(struct xe_vm_snapshot *snap);
+struct xe_vma *xe_vm_lookup_vma(struct xe_vm *vm, u64 page_addr);
-- 
2.26.3


^ permalink raw reply related	[flat|nested] 72+ messages in thread

* [v2 27/31] drm/xe/svm: Handle CPU page fault
  2024-04-09 20:17 [v2 00/31] Basic system allocator support in xe driver Oak Zeng
                   ` (25 preceding siblings ...)
  2024-04-09 20:17 ` [v2 26/31] drm/xe: Make function lookup_vma public Oak Zeng
@ 2024-04-09 20:17 ` Oak Zeng
  2024-04-11  2:07   ` Matthew Brost
  2024-04-09 20:17 ` [v2 28/31] drm/xe/svm: Introduce helper to migrate vma to vram Oak Zeng
                   ` (4 subsequent siblings)
  31 siblings, 1 reply; 72+ messages in thread
From: Oak Zeng @ 2024-04-09 20:17 UTC (permalink / raw)
  To: intel-xe
  Cc: himal.prasad.ghimiray, krishnaiah.bommu, matthew.brost,
	Thomas.Hellstrom, brian.welty

Under SVM, the CPU and GPU programs share the same virtual address
space. The backing store of this address space can be either system
memory or device memory. Since GPU device memory is remapped as
DEVICE_PRIVATE, the CPU can't access it: any CPU access to device
memory causes a page fault. Implement a page fault handler that
migrates the memory back to system memory and maps it into the CPU
page table so the CPU program can proceed.

Also unbind the page from the GPU side and free the original GPU
device page.
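The copy-back flow described above can be sketched as a small userspace
model. This is illustrative only, not kernel code: PFN_MIGRATE, PGSZ and
copy_back_to_sram are invented stand-ins for MIGRATE_PFN_MIGRATE,
PAGE_SIZE and the per-page loop in the fault handler.

```c
#include <assert.h>
#include <string.h>

#define PFN_MIGRATE 0x1UL	/* stands in for MIGRATE_PFN_MIGRATE */
#define PGSZ 8			/* toy page size for the model */

/*
 * Userspace model of the per-page copy-back loop: for each source
 * entry that hmm marked migratable, copy the vram contents into a
 * fresh sram page and publish its pfn in dst[]; other entries are
 * skipped.  Returns the number of pages actually migrated.
 */
static int copy_back_to_sram(const unsigned long *src, unsigned long *dst,
			     char vram[][PGSZ], char sram[][PGSZ], int npages)
{
	int migrated = 0;

	for (int i = 0; i < npages; i++) {
		if (!(src[i] & PFN_MIGRATE))
			continue;	/* page already in sram, skip */
		memcpy(sram[i], vram[i], PGSZ);	/* stands in for xe_migrate_pa() */
		dst[i] = (unsigned long)i | PFN_MIGRATE;
		migrated++;
	}
	return migrated;
}
```

The real handler additionally dma-maps each destination page and waits on
the blitter fence per page; the model above only keeps the bookkeeping.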

Signed-off-by: Oak Zeng <oak.zeng@intel.com>
Co-developed-by: Niranjana Vishwanathapura <niranjana.vishwanathapura@intel.com>
Signed-off-by: Niranjana Vishwanathapura <niranjana.vishwanathapura@intel.com>
Cc: Matthew Brost <matthew.brost@intel.com>
Cc: Thomas Hellström <thomas.hellstrom@intel.com>
Cc: Brian Welty <brian.welty@intel.com>
---
 drivers/gpu/drm/xe/Makefile         |   1 +
 drivers/gpu/drm/xe/xe_svm.h         |   8 +-
 drivers/gpu/drm/xe/xe_svm_devmem.c  |   7 +-
 drivers/gpu/drm/xe/xe_svm_migrate.c | 222 ++++++++++++++++++++++++++++
 4 files changed, 230 insertions(+), 8 deletions(-)
 create mode 100644 drivers/gpu/drm/xe/xe_svm_migrate.c

diff --git a/drivers/gpu/drm/xe/Makefile b/drivers/gpu/drm/xe/Makefile
index f89d77b6d654..65289acdd563 100644
--- a/drivers/gpu/drm/xe/Makefile
+++ b/drivers/gpu/drm/xe/Makefile
@@ -131,6 +131,7 @@ xe-y += xe_bb.o \
 	xe_step.o \
 	xe_svm.o \
 	xe_svm_devmem.o \
+	xe_svm_migrate.o \
 	xe_sync.o \
 	xe_tile.o \
 	xe_tile_sysfs.o \
diff --git a/drivers/gpu/drm/xe/xe_svm.h b/drivers/gpu/drm/xe/xe_svm.h
index f601dffe3fc1..c9e4239c44b4 100644
--- a/drivers/gpu/drm/xe/xe_svm.h
+++ b/drivers/gpu/drm/xe/xe_svm.h
@@ -7,11 +7,11 @@
 #define __XE_SVM_H
 
 #include <linux/mm_types.h>
+#include <linux/mm.h>
 #include "xe_device_types.h"
 #include "xe_device.h"
 #include "xe_assert.h"
-
-struct xe_vm;
+#include "xe_vm_types.h"
 
 /**
  * struct xe_svm - data structure to represent a shared
@@ -31,6 +31,9 @@ struct xe_svm {
 	struct list_head vm_list;
 };
 
+#define xe_svm_for_each_vm(svm, vm)					\
+		list_for_each_entry(vm, &svm->vm_list, svm_link)
+
 extern struct xe_svm *xe_create_svm(void);
 void xe_destroy_svm(struct xe_svm *svm);
 extern struct xe_svm *xe_lookup_svm_by_mm(struct mm_struct *mm);
@@ -79,4 +82,5 @@ int xe_devm_alloc_pages(struct xe_tile *tile,
 
 void xe_devm_free_blocks(struct list_head *blocks);
 void xe_devm_page_free(struct page *page);
+vm_fault_t xe_svm_migrate_to_sram(struct vm_fault *vmf);
 #endif
diff --git a/drivers/gpu/drm/xe/xe_svm_devmem.c b/drivers/gpu/drm/xe/xe_svm_devmem.c
index 088ac209ad80..32ada458f1dd 100644
--- a/drivers/gpu/drm/xe/xe_svm_devmem.c
+++ b/drivers/gpu/drm/xe/xe_svm_devmem.c
@@ -37,11 +37,6 @@ struct xe_svm_block_meta {
 	unsigned long bitmap[];
 };
 
-static vm_fault_t xe_devm_migrate_to_ram(struct vm_fault *vmf)
-{
-	return 0;
-}
-
 static u64 block_offset_to_pfn(struct xe_mem_region *mr, u64 offset)
 {
 	/** DRM buddy's block offset is 0-based*/
@@ -168,7 +163,7 @@ void xe_devm_free_blocks(struct list_head *blocks)
 
 static const struct dev_pagemap_ops xe_devm_pagemap_ops = {
 	.page_free = xe_devm_page_free,
-	.migrate_to_ram = xe_devm_migrate_to_ram,
+	.migrate_to_ram = xe_svm_migrate_to_sram,
 };
 
 /**
diff --git a/drivers/gpu/drm/xe/xe_svm_migrate.c b/drivers/gpu/drm/xe/xe_svm_migrate.c
new file mode 100644
index 000000000000..0db831af098e
--- /dev/null
+++ b/drivers/gpu/drm/xe/xe_svm_migrate.c
@@ -0,0 +1,222 @@
+// SPDX-License-Identifier: MIT
+/*
+ * Copyright © 2023 Intel Corporation
+ */
+
+#include <linux/gfp.h>
+#include <linux/migrate.h>
+#include <linux/dma-mapping.h>
+#include <linux/dma-fence.h>
+#include <linux/bitops.h>
+#include <linux/bitmap.h>
+#include <linux/kernel.h>
+#include <linux/slab.h>
+#include <drm/drm_buddy.h>
+#include "xe_device_types.h"
+#include "xe_device.h"
+#include "xe_trace.h"
+#include "xe_migrate.h"
+#include "xe_ttm_vram_mgr_types.h"
+#include "xe_assert.h"
+#include "xe_pt.h"
+#include "xe_svm.h"
+#include "xe_vm.h"
+
+
+/**
+ * alloc_host_page() - allocate one host page for the fault vma
+ *
+ * @dev: (GPU) device that will access the allocated page
+ * @vma: the fault vma that we need to allocate a page for
+ * @addr: the fault address. The allocated page is for this address
+ * @dma_addr: used to return the dma address of the allocated page.
+ * The GPU will use this dma address to access the page, since a
+ * host page is only reachable by the GPU through a dma mapping.
+ * @pfn: used to return the pfn of the allocated page.
+ *
+ * This function allocates one host page for the specified vma. It
+ * also does some preparation for GPU access to this page, such as
+ * mapping the page through the iommu (by calling dma_map_page).
+ *
+ * When this function returns, the page is locked.
+ *
+ * Return: the struct page pointer on success,
+ * NULL otherwise
+ */
+static struct page *alloc_host_page(struct device *dev,
+							 struct vm_area_struct *vma,
+							 unsigned long addr,
+							 dma_addr_t *dma_addr,
+							 unsigned long *pfn)
+{
+	struct page *page;
+
+	page = alloc_page_vma(GFP_HIGHUSER, vma, addr);
+	if (unlikely(!page))
+		return NULL;
+
+	/* Lock the page per hmm requirement, see hmm.rst */
+	lock_page(page);
+	*dma_addr = dma_map_page(dev, page, 0, PAGE_SIZE, DMA_FROM_DEVICE);
+	if (unlikely(dma_mapping_error(dev, *dma_addr))) {
+		unlock_page(page);
+		__free_page(page);
+		return NULL;
+	}
+
+	*pfn = migrate_pfn(page_to_pfn(page));
+	return page;
+}
+
+static void free_host_page(struct page *page)
+{
+	unlock_page(page);
+	put_page(page);
+}
+
+/**
+ * migrate_page_vram_to_ram() - migrate one page from vram to ram
+ *
+ * @vma: The vma that the page is mapped to
+ * @addr: The virtual address that the page is mapped to
+ * @src_pfn: src page's page frame number
+ * @dst_pfn: used to return the destination page's (in system ram) pfn
+ *
+ * Allocate one page in system ram and copy memory from device memory
+ * to system ram.
+ *
+ * Return: 0 if this page is already in sram (no need to migrate),
+ * 1 if this page was successfully migrated from vram to sram,
+ * a negative error code otherwise
+ */
+static int migrate_page_vram_to_ram(struct vm_area_struct *vma, unsigned long addr,
+						unsigned long src_pfn, unsigned long *dst_pfn)
+{
+	struct xe_mem_region *mr;
+	struct xe_tile *tile;
+	struct xe_device *xe;
+	struct device *dev;
+	dma_addr_t dma_addr = 0;
+	struct dma_fence *fence;
+	struct page *host_page;
+	struct page *src_page;
+	u64 src_dpa;
+
+	src_page = migrate_pfn_to_page(src_pfn);
+	if (unlikely(!src_page || !(src_pfn & MIGRATE_PFN_MIGRATE)))
+		return 0;
+
+	mr = xe_page_to_mem_region(src_page);
+	tile = xe_mem_region_to_tile(mr);
+	xe = tile_to_xe(tile);
+	dev = xe->drm.dev;
+
+	src_dpa = xe_mem_region_pfn_to_dpa(mr, src_pfn);
+	host_page = alloc_host_page(dev, vma, addr, &dma_addr, dst_pfn);
+	if (!host_page)
+		return -ENOMEM;
+
+	fence = xe_migrate_pa(tile->migrate, src_dpa, true,
+						dma_addr, false, PAGE_SIZE);
+	if (IS_ERR(fence)) {
+		dma_unmap_page(dev, dma_addr, PAGE_SIZE, DMA_FROM_DEVICE);
+		free_host_page(host_page);
+		return PTR_ERR(fence);
+	}
+
+	dma_fence_wait(fence, false);
+	dma_fence_put(fence);
+	dma_unmap_page(dev, dma_addr, PAGE_SIZE, DMA_FROM_DEVICE);
+	return 1;
+}
+
+/**
+ * xe_svm_migrate_to_sram() - Migrate memory back to sram on CPU page fault
+ *
+ * @vmf: cpu vm fault structure, contains fault information such as vma etc.
+ *
+ * Note, this is in the CPU's vm fault handler; the caller holds the mmap
+ * read lock.
+ *
+ * This function migrates the gpu vma which contains the fault address to
+ * sram. We try to maintain a 1:1 mapping b/t a cpu vma and a gpu vma (i.e.,
+ * create one gpu vma per cpu vma initially and try not to split it), so
+ * this scheme ends up migrating at vma granularity. This might not be the
+ * most performant scheme.
+ *
+ * This could be tuned with a migration granularity for performance, e.g.,
+ * migrate 2M for each CPU page fault, or let the user specify how much to
+ * migrate. That is more complex due to vma splitting.
+ *
+ * This function should also update the GPU page table, so the fault virtual
+ * address points to the same sram location from the GPU side. This is TBD.
+ *
+ * Return:
+ * 0 on success
+ * VM_FAULT_SIGBUS: failed to migrate the page to system memory; the
+ * application will be sent a SIGBUS signal
+ */
+vm_fault_t xe_svm_migrate_to_sram(struct vm_fault *vmf)
+{
+	struct xe_mem_region *mr = xe_page_to_mem_region(vmf->page);
+	struct xe_tile *tile = xe_mem_region_to_tile(mr);
+	struct xe_device *xe = tile_to_xe(tile);
+	struct vm_area_struct *vma = vmf->vma;
+	struct mm_struct *mm = vma->vm_mm;
+	struct xe_svm *svm = xe_lookup_svm_by_mm(mm);
+	unsigned long addr = vma->vm_start;
+	u64 npages = vma_pages(vma);
+	struct xe_vma *xe_vma;
+	vm_fault_t ret = 0;
+	struct xe_vm *vm;
+	void *buf;
+	int i;
+
+	struct migrate_vma migrate_vma = {
+		.vma		= vmf->vma,
+		.start		= vma->vm_start,
+		.end		= vma->vm_end,
+		.pgmap_owner	= xe,
+		.flags		= MIGRATE_VMA_SELECT_DEVICE_PRIVATE,
+		.fault_page = vmf->page,
+	};
+
+	buf = kvcalloc(npages, 2 * sizeof(*migrate_vma.src), GFP_KERNEL);
+	if (!buf)
+		return VM_FAULT_OOM;
+	migrate_vma.src = buf;
+	migrate_vma.dst = buf + npages;
+	if (migrate_vma_setup(&migrate_vma) < 0) {
+		ret = VM_FAULT_SIGBUS;
+		goto free_buf;
+	}
+
+	if (!migrate_vma.cpages)
+		goto free_buf;
+
+	for (i = 0; i < npages; i++) {
+		int err = migrate_page_vram_to_ram(vma, addr, migrate_vma.src[i],
+							migrate_vma.dst + i);
+		if (err < 0) {
+			ret = VM_FAULT_SIGBUS;
+			break;
+		}
+
+		/* Migration was successful, free the source page */
+		if (err == 1) {
+			struct page *src_page = migrate_pfn_to_page(migrate_vma.src[i]);
+
+			xe_devm_page_free(src_page);
+		}
+
+		addr += PAGE_SIZE;
+	}
+
+	xe_svm_for_each_vm(svm, vm) {
+		xe_assert(xe, vm->mm == mm);
+		xe_vma = xe_vm_lookup_vma(vm, vmf->address);
+		if (xe_vma)
+			xe_vm_invalidate_vma(xe_vma);
+	}
+	migrate_vma_pages(&migrate_vma);
+	migrate_vma_finalize(&migrate_vma);
+free_buf:
+	kvfree(buf);
+	return ret;
+}
-- 
2.26.3


^ permalink raw reply related	[flat|nested] 72+ messages in thread

* [v2 28/31] drm/xe/svm: Introduce helper to migrate vma to vram
  2024-04-09 20:17 [v2 00/31] Basic system allocator support in xe driver Oak Zeng
                   ` (26 preceding siblings ...)
  2024-04-09 20:17 ` [v2 27/31] drm/xe/svm: Handle CPU page fault Oak Zeng
@ 2024-04-09 20:17 ` Oak Zeng
  2024-04-11  2:49   ` Matthew Brost
  2024-04-09 20:17 ` [v2 29/31] drm/xe/svm: trace svm migration Oak Zeng
                   ` (3 subsequent siblings)
  31 siblings, 1 reply; 72+ messages in thread
From: Oak Zeng @ 2024-04-09 20:17 UTC (permalink / raw)
  To: intel-xe
  Cc: himal.prasad.ghimiray, krishnaiah.bommu, matthew.brost,
	Thomas.Hellstrom, brian.welty

Introduce a helper function xe_svm_migrate_vma_to_vram.

Since the source pages of an svm range may not be physically
contiguous, and the destination vram pages may also not be
contiguous, there is no easy way to migrate multiple pages with
one blitter command. We do page-by-page migration for now.

Migration is best effort. Even if we fail to migrate some pages,
we still try to migrate the remaining pages.

FIXME: Use one blitter command for the copy when both src and dst
are physically contiguous.

FIXME: When a vma is partially migrated, split the vma as we
assume no mixed placement within one vma.
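One way to approach the first FIXME is to coalesce runs where both the
source and destination pfns are contiguous. The sketch below is plain
userspace C, not driver code; count_copy_chunks is a hypothetical helper
that just counts how many blitter commands such coalescing would need.

```c
#include <assert.h>

/*
 * If both src and dst pfns of neighbouring pages are contiguous, they
 * could be coalesced into one blitter copy.  Count how many copy
 * commands such coalescing would require; a pfn of 0 marks a
 * non-migratable page, which breaks a run.
 */
static int count_copy_chunks(const unsigned long *src,
			     const unsigned long *dst, int npages)
{
	int chunks = 0;

	for (int i = 0; i < npages; i++) {
		if (!src[i])
			continue;	/* skipped page ends any run */
		/* start a new chunk unless both sides extend the previous run */
		if (i == 0 || !src[i - 1] ||
		    src[i] != src[i - 1] + 1 || dst[i] != dst[i - 1] + 1)
			chunks++;
	}
	return chunks;
}
```

With the current page-by-page scheme, every migratable page costs one
blitter command; coalescing reduces that to one per contiguous run.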

Signed-off-by: Oak Zeng <oak.zeng@intel.com>
Co-developed-by: Niranjana Vishwanathapura <niranjana.vishwanathapura@intel.com>
Signed-off-by: Niranjana Vishwanathapura <niranjana.vishwanathapura@intel.com>
Cc: Matthew Brost <matthew.brost@intel.com>
Cc: Thomas Hellström <thomas.hellstrom@intel.com>
Cc: Brian Welty <brian.welty@intel.com>
---
 drivers/gpu/drm/xe/xe_svm.h         |   2 +
 drivers/gpu/drm/xe/xe_svm_migrate.c | 115 ++++++++++++++++++++++++++++
 2 files changed, 117 insertions(+)

diff --git a/drivers/gpu/drm/xe/xe_svm.h b/drivers/gpu/drm/xe/xe_svm.h
index c9e4239c44b4..18ce2e3757c5 100644
--- a/drivers/gpu/drm/xe/xe_svm.h
+++ b/drivers/gpu/drm/xe/xe_svm.h
@@ -83,4 +83,6 @@ int xe_devm_alloc_pages(struct xe_tile *tile,
 void xe_devm_free_blocks(struct list_head *blocks);
 void xe_devm_page_free(struct page *page);
 vm_fault_t xe_svm_migrate_to_sram(struct vm_fault *vmf);
+int xe_svm_migrate_vma_to_vram(struct xe_vm *vm, struct xe_vma *vma,
+							struct xe_tile *tile);
 #endif
diff --git a/drivers/gpu/drm/xe/xe_svm_migrate.c b/drivers/gpu/drm/xe/xe_svm_migrate.c
index 0db831af098e..ab8dd1f58aa4 100644
--- a/drivers/gpu/drm/xe/xe_svm_migrate.c
+++ b/drivers/gpu/drm/xe/xe_svm_migrate.c
@@ -220,3 +220,118 @@ vm_fault_t xe_svm_migrate_to_sram(struct vm_fault *vmf)
 	kvfree(buf);
 	return 0;
 }
+
+/**
+ * xe_svm_migrate_vma_to_vram() - migrate the backing store of a vma to vram
+ *
+ * @vm: the vm that the vma belongs to
+ * @vma: the vma to migrate
+ * @tile: the destination tile which holds the new backing store of the range
+ *
+ * Must be called with mmap_read_lock held.
+ *
+ * Return: negative errno on failure, 0 on success
+ */
+int xe_svm_migrate_vma_to_vram(struct xe_vm *vm,
+							struct xe_vma *vma,
+							struct xe_tile *tile)
+{
+	struct mm_struct *mm = vm->mm;
+	unsigned long start = xe_vma_start(vma);
+	unsigned long end = xe_vma_end(vma);
+	unsigned long npages = (end - start) >> PAGE_SHIFT;
+	struct xe_mem_region *mr = &tile->mem.vram;
+	struct vm_area_struct *vas;
+
+	struct migrate_vma migrate = {
+		.start		= start,
+		.end		= end,
+		.pgmap_owner	= tile->xe,
+		.flags          = MIGRATE_VMA_SELECT_SYSTEM,
+	};
+	struct device *dev = tile->xe->drm.dev;
+	dma_addr_t *src_dma_addr;
+	struct dma_fence *fence;
+	struct page *src_page;
+	LIST_HEAD(blocks);
+	int ret = 0, i;
+	u64 dst_dpa;
+	void *buf;
+
+	mmap_assert_locked(mm);
+
+	vas = find_vma_intersection(mm, start, start + 4);
+	if (!vas)
+		return -ENOENT;
+
+	migrate.vma = vas;
+	buf = kvcalloc(npages, 2 * sizeof(*migrate.src) + sizeof(*src_dma_addr),
+					GFP_KERNEL);
+	if (!buf)
+		return -ENOMEM;
+	migrate.src = buf;
+	migrate.dst = migrate.src + npages;
+	src_dma_addr = (dma_addr_t *) (migrate.dst + npages);
+	ret = xe_devm_alloc_pages(tile, npages, &blocks, migrate.dst);
+	if (ret)
+		goto kfree_buf;
+
+	ret = migrate_vma_setup(&migrate);
+	if (ret) {
+		drm_err(&tile->xe->drm, "vma setup returned %d for range [%lx - %lx]\n",
+				ret, start, end);
+		goto free_dst_pages;
+	}
+
+	/* FIXME: partial migration of a range prints a warning for now.
+	 * If this message is printed, we need to split the xe_vma as we
+	 * don't support mixed placement within one vma.
+	 */
+	if (migrate.cpages != npages)
+		drm_warn(&tile->xe->drm, "Partial migration for range [%lx - %lx], range is %ld pages, migrate only %ld pages\n",
+				start, end, npages, migrate.cpages);
+
+	/* Migrate page by page for now.
+	 * Both source and destination pages can be physically non-contiguous,
+	 * so there is no good way to migrate multiple pages per blitter command.
+	 */
+	for (i = 0; i < npages; i++) {
+		src_page = migrate_pfn_to_page(migrate.src[i]);
+		if (unlikely(!src_page || !(migrate.src[i] & MIGRATE_PFN_MIGRATE)))
+			goto free_dst_page;
+
+		xe_assert(tile->xe, !is_zone_device_page(src_page));
+		src_dma_addr[i] = dma_map_page(dev, src_page, 0, PAGE_SIZE, DMA_TO_DEVICE);
+		if (unlikely(dma_mapping_error(dev, src_dma_addr[i]))) {
+			drm_warn(&tile->xe->drm, "dma map error for host pfn %lx\n", migrate.src[i]);
+			goto free_dst_page;
+		}
+		dst_dpa = xe_mem_region_pfn_to_dpa(mr, migrate.dst[i]);
+		fence = xe_migrate_pa(tile->migrate, src_dma_addr[i], false,
+				dst_dpa, true, PAGE_SIZE);
+		if (IS_ERR(fence)) {
+			drm_warn(&tile->xe->drm, "migrate host page (pfn: %lx) to vram failed\n",
+					migrate.src[i]);
+			/* Migration is best effort. Even if we fail here, we continue */
+			goto free_dst_page;
+		}
+		/* FIXME: Use the first migration's out fence as the second
+		 * migration's input fence, and so on. Only wait on the out
+		 * fence of the last migration?
+		 */
+		dma_fence_wait(fence, false);
+		dma_fence_put(fence);
+free_dst_page:
+		xe_devm_page_free(pfn_to_page(migrate.dst[i]));
+	}
+
+	for (i = 0; i < npages; i++)
+		if (!(dma_mapping_error(dev, src_dma_addr[i])))
+			dma_unmap_page(dev, src_dma_addr[i], PAGE_SIZE, DMA_TO_DEVICE);
+
+	migrate_vma_pages(&migrate);
+	migrate_vma_finalize(&migrate);
+free_dst_pages:
+	if (ret)
+		xe_devm_free_blocks(&blocks);
+kfree_buf:
+	kvfree(buf);
+	return ret;
+}
-- 
2.26.3


^ permalink raw reply related	[flat|nested] 72+ messages in thread

* [v2 29/31] drm/xe/svm: trace svm migration
  2024-04-09 20:17 [v2 00/31] Basic system allocator support in xe driver Oak Zeng
                   ` (27 preceding siblings ...)
  2024-04-09 20:17 ` [v2 28/31] drm/xe/svm: Introduce helper to migrate vma to vram Oak Zeng
@ 2024-04-09 20:17 ` Oak Zeng
  2024-04-09 20:17 ` [v2 30/31] drm/xe/svm: Add a helper to determine a vma is fault userptr Oak Zeng
                   ` (2 subsequent siblings)
  31 siblings, 0 replies; 72+ messages in thread
From: Oak Zeng @ 2024-04-09 20:17 UTC (permalink / raw)
  To: intel-xe
  Cc: himal.prasad.ghimiray, krishnaiah.bommu, matthew.brost,
	Thomas.Hellstrom, brian.welty

Add two trace points to trace svm migrations.

Signed-off-by: Oak Zeng <oak.zeng@intel.com>
---
 drivers/gpu/drm/xe/xe_svm_migrate.c |  5 ++++-
 drivers/gpu/drm/xe/xe_trace.h       | 11 +++++++++++
 2 files changed, 15 insertions(+), 1 deletion(-)

diff --git a/drivers/gpu/drm/xe/xe_svm_migrate.c b/drivers/gpu/drm/xe/xe_svm_migrate.c
index ab8dd1f58aa4..69096d81bf02 100644
--- a/drivers/gpu/drm/xe/xe_svm_migrate.c
+++ b/drivers/gpu/drm/xe/xe_svm_migrate.c
@@ -211,8 +211,10 @@ vm_fault_t xe_svm_migrate_to_sram(struct vm_fault *vmf)
 	xe_svm_for_each_vm(svm, vm) {
 		xe_assert(xe, vm->mm == mm);
 		xe_vma = xe_vm_lookup_vma(vm, vmf->address);
-		if (xe_vma)
+		if (xe_vma) {
+			trace_xe_svm_migrate_to_sram(xe_vma);
 			xe_vm_invalidate_vma(xe_vma);
+		}
 	}
 	migrate_vma_pages(&migrate_vma);
 	migrate_vma_finalize(&migrate_vma);
@@ -328,6 +330,7 @@ int xe_svm_migrate_vma_to_vram(struct xe_vm *vm,
 
 	migrate_vma_pages(&migrate);
 	migrate_vma_finalize(&migrate);
+	trace_xe_svm_migrate_to_vram(vma);
 free_dst_pages:
 	if (ret)
 		xe_devm_free_blocks(&blocks);
diff --git a/drivers/gpu/drm/xe/xe_trace.h b/drivers/gpu/drm/xe/xe_trace.h
index f3fcce9f1434..12e0c9856540 100644
--- a/drivers/gpu/drm/xe/xe_trace.h
+++ b/drivers/gpu/drm/xe/xe_trace.h
@@ -480,6 +480,17 @@ DEFINE_EVENT(xe_vma, xe_vma_userptr_invalidate_complete,
 	     TP_ARGS(vma)
 );
 
+DEFINE_EVENT(xe_vma, xe_svm_migrate_to_sram,
+		    TP_PROTO(struct xe_vma *vma),
+		    TP_ARGS(vma)
+);
+
+DEFINE_EVENT(xe_vma, xe_svm_migrate_to_vram,
+		    TP_PROTO(struct xe_vma *vma),
+		    TP_ARGS(vma)
+);
+
 DECLARE_EVENT_CLASS(xe_vm,
 		    TP_PROTO(struct xe_vm *vm),
 		    TP_ARGS(vm),
-- 
2.26.3


^ permalink raw reply related	[flat|nested] 72+ messages in thread

* [v2 30/31] drm/xe/svm: Add a helper to determine a vma is fault userptr
  2024-04-09 20:17 [v2 00/31] Basic system allocator support in xe driver Oak Zeng
                   ` (28 preceding siblings ...)
  2024-04-09 20:17 ` [v2 29/31] drm/xe/svm: trace svm migration Oak Zeng
@ 2024-04-09 20:17 ` Oak Zeng
  2024-04-11  2:50   ` Matthew Brost
  2024-04-09 20:17 ` [v2 31/31] drm/xe/svm: Migration from sram to vram for system allocator Oak Zeng
  2024-04-09 20:52 ` ✗ CI.Patch_applied: failure for Basic system allocator support in xe driver Patchwork
  31 siblings, 1 reply; 72+ messages in thread
From: Oak Zeng @ 2024-04-09 20:17 UTC (permalink / raw)
  To: intel-xe
  Cc: himal.prasad.ghimiray, krishnaiah.bommu, matthew.brost,
	Thomas.Hellstrom, brian.welty

Add xe_vma_is_fault_userptr to determine whether a vma is a
fault userptr.

Signed-off-by: Oak Zeng <oak.zeng@intel.com>
---
 drivers/gpu/drm/xe/xe_vm.h | 5 +++++
 1 file changed, 5 insertions(+)

diff --git a/drivers/gpu/drm/xe/xe_vm.h b/drivers/gpu/drm/xe/xe_vm.h
index d55330988e32..a718f927e362 100644
--- a/drivers/gpu/drm/xe/xe_vm.h
+++ b/drivers/gpu/drm/xe/xe_vm.h
@@ -166,6 +166,11 @@ static inline bool xe_vma_is_userptr(struct xe_vma *vma)
 		!xe_vma_is_system_allocator(vma);
 }
 
+static inline bool xe_vma_is_fault_userptr(struct xe_vma *vma)
+{
+	return xe_vma_is_userptr(vma) && (vma->gpuva.flags & XE_VMA_FAULT_USERPTR);
+}
+
 /**
  * to_userptr_vma() - Return a pointer to an embedding userptr vma
  * @vma: Pointer to the embedded struct xe_vma
-- 
2.26.3


^ permalink raw reply related	[flat|nested] 72+ messages in thread

* [v2 31/31] drm/xe/svm: Migration from sram to vram for system allocator
  2024-04-09 20:17 [v2 00/31] Basic system allocator support in xe driver Oak Zeng
                   ` (29 preceding siblings ...)
  2024-04-09 20:17 ` [v2 30/31] drm/xe/svm: Add a helper to determine a vma is fault userptr Oak Zeng
@ 2024-04-09 20:17 ` Oak Zeng
  2024-04-11  2:55   ` Matthew Brost
  2024-04-09 20:52 ` ✗ CI.Patch_applied: failure for Basic system allocator support in xe driver Patchwork
  31 siblings, 1 reply; 72+ messages in thread
From: Oak Zeng @ 2024-04-09 20:17 UTC (permalink / raw)
  To: intel-xe
  Cc: himal.prasad.ghimiray, krishnaiah.bommu, matthew.brost,
	Thomas.Hellstrom, brian.welty

If applicable, migrate a vma from sram to vram for the system
allocator. A traditional userptr is not migrated. Only a userptr
created during a fault (i.e., a userptr split from a system
allocator vma) can be migrated.

FIXME: The migration should be conditional on user memory attribute
settings. Add this logic when memory attributes are supported.
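As a rough illustration of what the attribute-driven policy from the
FIXME could look like, here is a userspace sketch under stated
assumptions: struct mem_attr, the location enum, and
should_migrate_to_vram are all invented placeholders, not existing xe
code.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical per-range memory attributes; names are illustrative
 * only and do not exist in the xe driver yet. */
enum mem_location { LOC_ANY, LOC_SRAM, LOC_VRAM };

struct mem_attr {
	enum mem_location preferred;	/* user's preferred placement hint */
	bool atomic_required;		/* device atomics force vram placement */
};

/*
 * Policy sketch: on a GPU fault, migrate a fault-userptr range to vram
 * unless the user pinned it to system memory; traditional userptrs are
 * never migrated.
 */
static bool should_migrate_to_vram(bool is_fault_userptr,
				   const struct mem_attr *attr)
{
	if (!is_fault_userptr)
		return false;
	if (attr->atomic_required)
		return true;
	return attr->preferred != LOC_SRAM;
}
```

The unconditional xe_svm_migrate_vma_to_vram() call in this patch would
then be guarded by such a check once memory attributes land.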

Signed-off-by: Oak Zeng <oak.zeng@intel.com>
---
 drivers/gpu/drm/xe/xe_gt_pagefault.c | 9 ++++++++-
 drivers/gpu/drm/xe/xe_vm.c           | 4 ----
 2 files changed, 8 insertions(+), 5 deletions(-)

diff --git a/drivers/gpu/drm/xe/xe_gt_pagefault.c b/drivers/gpu/drm/xe/xe_gt_pagefault.c
index 668984f0769e..c6ba00049964 100644
--- a/drivers/gpu/drm/xe/xe_gt_pagefault.c
+++ b/drivers/gpu/drm/xe/xe_gt_pagefault.c
@@ -20,6 +20,7 @@
 #include "xe_guc_ct.h"
 #include "xe_migrate.h"
 #include "xe_trace.h"
+#include "xe_svm.h"
 #include "xe_vm.h"
 
 struct pagefault {
@@ -209,12 +210,18 @@ static int handle_pagefault(struct xe_gt *gt, struct pagefault *pf)
 
 	if (xe_vma_is_userptr(vma) && write_locked) {
 		struct xe_userptr_vma *uvma = to_userptr_vma(vma);
+		struct xe_userptr *userptr = &uvma->userptr;
 
 		spin_lock(&vm->userptr.invalidated_lock);
-		list_del_init(&uvma->userptr.invalidate_link);
+		list_del_init(&userptr->invalidate_link);
 		spin_unlock(&vm->userptr.invalidated_lock);
 
+		mmap_read_lock(userptr->notifier.mm);
+		/* FIXME: Add migration policy here */
+		if (xe_vma_is_fault_userptr(vma))
+			xe_svm_migrate_vma_to_vram(vm, vma, tile);
 		ret = xe_vma_userptr_pin_pages(uvma);
+		mmap_read_unlock(userptr->notifier.mm);
 		if (ret)
 			goto unlock_vm;
 
diff --git a/drivers/gpu/drm/xe/xe_vm.c b/drivers/gpu/drm/xe/xe_vm.c
index 498b36469d00..8a58fe144a02 100644
--- a/drivers/gpu/drm/xe/xe_vm.c
+++ b/drivers/gpu/drm/xe/xe_vm.c
@@ -71,16 +71,12 @@ int xe_vma_userptr_pin_pages(struct xe_userptr_vma *uvma)
 	struct xe_vma *vma = &uvma->vma;
 	struct xe_vm *vm = xe_vma_vm(vma);
 	struct xe_device *xe = vm->xe;
-	struct xe_userptr *userptr;
 	int ret;
 
 	lockdep_assert_held(&vm->lock);
 	xe_assert(xe, xe_vma_is_userptr(vma));
 
-	userptr = &uvma->userptr;
-	mmap_read_lock(userptr->notifier.mm);
 	ret = xe_userptr_populate_range(uvma);
-	mmap_read_unlock(userptr->notifier.mm);
 
 	return ret;
 }
-- 
2.26.3


^ permalink raw reply related	[flat|nested] 72+ messages in thread

* ✗ CI.Patch_applied: failure for Basic system allocator support in xe driver
  2024-04-09 20:17 [v2 00/31] Basic system allocator support in xe driver Oak Zeng
                   ` (30 preceding siblings ...)
  2024-04-09 20:17 ` [v2 31/31] drm/xe/svm: Migration from sram to vram for system allocator Oak Zeng
@ 2024-04-09 20:52 ` Patchwork
  31 siblings, 0 replies; 72+ messages in thread
From: Patchwork @ 2024-04-09 20:52 UTC (permalink / raw)
  To: Oak Zeng; +Cc: intel-xe

== Series Details ==

Series: Basic system allocator support in xe driver
URL   : https://patchwork.freedesktop.org/series/132229/
State : failure

== Summary ==

=== Applying kernel patches on branch 'drm-tip' with base: ===
Base commit: 7be27f645de2 drm-tip: 2024y-04m-09d-17h-53m-12s UTC integration manifest
=== git am output follows ===
error: patch failed: drivers/gpu/drm/xe/xe_bo.h:10
error: drivers/gpu/drm/xe/xe_bo.h: patch does not apply
error: patch failed: drivers/gpu/drm/xe/xe_device.c:226
error: drivers/gpu/drm/xe/xe_device.c: patch does not apply
error: patch failed: drivers/gpu/drm/xe/xe_exec.c:135
error: drivers/gpu/drm/xe/xe_exec.c: patch does not apply
error: patch failed: drivers/gpu/drm/xe/xe_exec_queue_types.h:78
error: drivers/gpu/drm/xe/xe_exec_queue_types.h: patch does not apply
error: patch failed: drivers/gpu/drm/xe/xe_pci.c:375
error: drivers/gpu/drm/xe/xe_pci.c: patch does not apply
error: patch failed: drivers/gpu/drm/xe/xe_pt.c:1161
error: drivers/gpu/drm/xe/xe_pt.c: patch does not apply
error: patch failed: drivers/gpu/drm/xe/xe_sched_job_types.h:39
error: drivers/gpu/drm/xe/xe_sched_job_types.h: patch does not apply
error: patch failed: drivers/gpu/drm/xe/xe_trace.h:264
error: drivers/gpu/drm/xe/xe_trace.h: patch does not apply
error: patch failed: drivers/gpu/drm/xe/xe_uc_fw.c:105
error: drivers/gpu/drm/xe/xe_uc_fw.c: patch does not apply
error: patch failed: drivers/gpu/drm/xe/xe_vm.c:515
error: drivers/gpu/drm/xe/xe_vm.c: patch does not apply
error: patch failed: drivers/gpu/drm/xe/xe_vm.h:207
error: drivers/gpu/drm/xe/xe_vm.h: patch does not apply
error: patch failed: drivers/gpu/drm/xe/xe_vm_types.h:180
error: drivers/gpu/drm/xe/xe_vm_types.h: patch does not apply
hint: Use 'git am --show-current-patch=diff' to see the failed patch
Applying: drm/xe: Refactor vm_bind
Patch failed at 0001 drm/xe: Refactor vm_bind
When you have resolved this problem, run "git am --continue".
If you prefer to skip this patch, run "git am --skip" instead.
To restore the original branch and stop patching, run "git am --abort".



^ permalink raw reply	[flat|nested] 72+ messages in thread

* Re: [v2 12/31] drm/xe/svm: Remap and provide memmap backing for GPU vram
  2024-04-09 20:17 ` [v2 12/31] drm/xe/svm: Remap and provide memmap backing for GPU vram Oak Zeng
@ 2024-04-10 21:09   ` Matthew Brost
  2024-04-16 19:01   ` Matthew Brost
  1 sibling, 0 replies; 72+ messages in thread
From: Matthew Brost @ 2024-04-10 21:09 UTC (permalink / raw)
  To: Oak Zeng
  Cc: intel-xe, himal.prasad.ghimiray, krishnaiah.bommu,
	Thomas.Hellstrom, brian.welty

On Tue, Apr 09, 2024 at 04:17:23PM -0400, Oak Zeng wrote:

Oak - Doing a very high level review due to early stage of the code.

> Memory remap GPU vram using devm_memremap_pages, so each GPU vram
> page is backed by a struct page.
> 
> Those struct pages are created to allow hmm migrate buffer b/t
> GPU vram and CPU system memory using existing Linux migration
> mechanism (i.e., migrating b/t CPU system memory and hard disk).
> 
> This is prepare work to enable svm (shared virtual memory) through
> Linux kernel hmm framework. The memory remap's page map type is set
> to MEMORY_DEVICE_PRIVATE for now. This means even though each GPU
> vram page get a struct page and can be mapped in CPU page table,
> but such pages are treated as GPU's private resource, so CPU can't
> access them. If CPU access such page, a page fault is triggered
> and page will be migrate to system memory.
> 
> For GPU device which supports coherent memory protocol b/t CPU and
> GPU (such as CXL and CAPI protocol), we can remap device memory as
> MEMORY_DEVICE_COHERENT. This is TBD.
> 
> v1:
> Changes per code review feedback from Matt:
>     change .o order in Makefile
>     fix indentation
>     change code order in mmio_fini
>     remove unnecessary header file
>     uniform xe_svm_devm_add/_remove parameter
>     use tile (vs dev) as pagemap.owner during memremap
>     only remap vram for platform that support usm
> Changes per review feedback from Brian:
>     s/xe_svm_devm_add/xe_devm_add
>     s/xe_svm_devm_remove/xe_devm_remove
>     move calling of xe_devm_add to xe_tile.c
> 
> Signed-off-by: Oak Zeng <oak.zeng@intel.com>
> Co-developed-by: Niranjana Vishwanathapura <niranjana.vishwanathapura@intel.com>
> Signed-off-by: Niranjana Vishwanathapura <niranjana.vishwanathapura@intel.com>
> Cc: Matthew Brost <matthew.brost@intel.com>
> Cc: Thomas Hellström <thomas.hellstrom@intel.com>
> Cc: Brian Welty <brian.welty@intel.com>
> ---
>  drivers/gpu/drm/xe/Makefile          |  1 +
>  drivers/gpu/drm/xe/xe_device_types.h |  8 +++
>  drivers/gpu/drm/xe/xe_mmio.c         |  6 ++
>  drivers/gpu/drm/xe/xe_svm.h          | 15 +++++
>  drivers/gpu/drm/xe/xe_svm_devmem.c   | 89 ++++++++++++++++++++++++++++
>  drivers/gpu/drm/xe/xe_tile.c         |  4 ++
>  6 files changed, 123 insertions(+)
>  create mode 100644 drivers/gpu/drm/xe/xe_svm.h
>  create mode 100644 drivers/gpu/drm/xe/xe_svm_devmem.c
> 
> diff --git a/drivers/gpu/drm/xe/Makefile b/drivers/gpu/drm/xe/Makefile
> index fff70fc9a09e..cd5213ba182b 100644
> --- a/drivers/gpu/drm/xe/Makefile
> +++ b/drivers/gpu/drm/xe/Makefile
> @@ -129,6 +129,7 @@ xe-y += xe_bb.o \
>  	xe_sa.o \
>  	xe_sched_job.o \
>  	xe_step.o \
> +	xe_svm_devmem.o \

This goes for the series - IMO let's have all the svm code in one file
unless we have a really good reason not to. For the series I see 4:

mbrost@lstrano-desk:xe$ ls -la *.c | grep svm
-rw-rw-r-- 1 mbrost mbrost  8975 Apr 10 11:17 xe_svm.c
-rw-rw-r-- 1 mbrost mbrost  6774 Apr 10 11:17 xe_svm_devmem.c
-rw-rw-r-- 1 mbrost mbrost 10636 Apr 10 11:17 xe_svm_migrate.c
-rw-rw-r-- 1 mbrost mbrost  5940 Apr 10 11:17 xe_svm_range.c

Personally I'd prefer the name xe_devmem.c (or xe_devm.c), though I'm
open to xe_svm.c too.

Whatever name we land on, let's also try to make sure all exported
functions (in the *.h file) start with the same prefix as the file.

>  	xe_sync.o \
>  	xe_tile.o \
>  	xe_tile_sysfs.o \
> diff --git a/drivers/gpu/drm/xe/xe_device_types.h b/drivers/gpu/drm/xe/xe_device_types.h
> index e73b9a086718..d6a14327986b 100644
> --- a/drivers/gpu/drm/xe/xe_device_types.h
> +++ b/drivers/gpu/drm/xe/xe_device_types.h
> @@ -103,6 +103,14 @@ struct xe_mem_region {
>  	resource_size_t actual_physical_size;
>  	/** @mapping: pointer to VRAM mappable space */
>  	void __iomem *mapping;
> +	/** @pagemap: Used to remap device memory as ZONE_DEVICE */
> +	struct dev_pagemap pagemap;
> +	/**
> +	 * @hpa_base: base host physical address
> +	 *
> +	 * This is generated when remap device memory as ZONE_DEVICE
> +	 */
> +	resource_size_t hpa_base;
>  };
>  
>  /**
> diff --git a/drivers/gpu/drm/xe/xe_mmio.c b/drivers/gpu/drm/xe/xe_mmio.c
> index 7ba2477452d7..12923fe6abae 100644
> --- a/drivers/gpu/drm/xe/xe_mmio.c
> +++ b/drivers/gpu/drm/xe/xe_mmio.c
> @@ -22,6 +22,7 @@
>  #include "xe_module.h"
>  #include "xe_sriov.h"
>  #include "xe_tile.h"
> +#include "xe_svm.h"
>  
>  #define XEHP_MTCFG_ADDR		XE_REG(0x101800)
>  #define TILE_COUNT		REG_GENMASK(15, 8)
> @@ -354,6 +355,11 @@ void xe_mmio_probe_tiles(struct xe_device *xe)
>  static void mmio_fini(struct drm_device *drm, void *arg)
>  {
>  	struct xe_device *xe = arg;
> +	struct xe_tile *tile;
> +	u8 id;
> +
> +	for_each_tile(tile, xe, id)
> +		xe_devm_remove(tile, &tile->mem.vram);
>  
>  	pci_iounmap(to_pci_dev(xe->drm.dev), xe->mmio.regs);
>  	if (xe->mem.vram.mapping)
> diff --git a/drivers/gpu/drm/xe/xe_svm.h b/drivers/gpu/drm/xe/xe_svm.h
> new file mode 100644
> index 000000000000..e944971cfc6d
> --- /dev/null
> +++ b/drivers/gpu/drm/xe/xe_svm.h
> @@ -0,0 +1,15 @@
> +// SPDX-License-Identifier: MIT
> +/*
> + * Copyright © 2023 Intel Corporation
> + */
> +
> +#ifndef __XE_SVM_H
> +#define __XE_SVM_H
> +
> +struct xe_tile;
> +struct xe_mem_region;
> +
> +int xe_devm_add(struct xe_tile *tile, struct xe_mem_region *mr);
> +void xe_devm_remove(struct xe_tile *tile, struct xe_mem_region *mr);
> +
> +#endif
> diff --git a/drivers/gpu/drm/xe/xe_svm_devmem.c b/drivers/gpu/drm/xe/xe_svm_devmem.c
> new file mode 100644
> index 000000000000..31af56e8285a
> --- /dev/null
> +++ b/drivers/gpu/drm/xe/xe_svm_devmem.c
> @@ -0,0 +1,89 @@
> +// SPDX-License-Identifier: MIT
> +/*
> + * Copyright © 2023 Intel Corporation
> + */
> +
> +#include <linux/mm_types.h>
> +#include <linux/sched/mm.h>
> +
> +#include "xe_device_types.h"
> +#include "xe_svm.h"
> +
> +
> +static vm_fault_t xe_devm_migrate_to_ram(struct vm_fault *vmf)
> +{
> +	return 0;
> +}
> +
> +static void xe_devm_page_free(struct page *page)
> +{
> +}
> +
> +static const struct dev_pagemap_ops xe_devm_pagemap_ops = {
> +	.page_free = xe_devm_page_free,
> +	.migrate_to_ram = xe_devm_migrate_to_ram,
> +};
> +
> +/**
> + * xe_devm_add: Remap and provide memmap backing for device memory
> + * @tile: tile that the memory region belongs to
> + * @mr: memory region to remap
> + *
> + * This remaps device memory to the host physical address space and creates
> + * struct pages to back the device memory.
> + *
> + * Return: 0 on success, standard error code otherwise
> + */
> +int xe_devm_add(struct xe_tile *tile, struct xe_mem_region *mr)
> +{
> +	struct xe_device *xe = tile_to_xe(tile);
> +	struct device *dev = &to_pci_dev(xe->drm.dev)->dev;
> +	struct resource *res;
> +	void *addr;
> +	int ret;
> +
> +	res = devm_request_free_mem_region(dev, &iomem_resource,
> +					   mr->usable_size);
> +	if (IS_ERR(res)) {
> +		ret = PTR_ERR(res);
> +		return ret;
> +	}
> +
> +	mr->pagemap.type = MEMORY_DEVICE_PRIVATE;
> +	mr->pagemap.range.start = res->start;
> +	mr->pagemap.range.end = res->end;
> +	mr->pagemap.nr_range = 1;
> +	mr->pagemap.ops = &xe_devm_pagemap_ops;
> +	mr->pagemap.owner = xe;

Nit: I know I suggested this in another series too - add a helper to go
from xe -> owner which can be used in the various places we set this.
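A minimal userspace sketch of what such a helper could look like; the name `xe_devm_owner` is hypothetical and `struct xe_device` is stubbed out here:

```c
#include <assert.h>
#include <stddef.h>

/* Stand-in for struct xe_device; only its identity matters here. */
struct xe_device { int dummy; };

/*
 * Hypothetical helper centralizing the pagemap owner choice, so every
 * site that fills dev_pagemap.owner (and any hmm_range_fault() caller
 * setting range->dev_private_owner) agrees on the same pointer.
 */
static inline void *xe_devm_owner(struct xe_device *xe)
{
	return xe;
}
```

The point of the helper is that if the owner ever changes (e.g. from the device to the tile), only one place needs updating.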

> +	addr = devm_memremap_pages(dev, &mr->pagemap);
> +	if (IS_ERR(addr)) {
> +		devm_release_mem_region(dev, res->start, resource_size(res));
> +		ret = PTR_ERR(addr);
> +		drm_err(&xe->drm, "Failed to remap tile %d memory, errno %d\n",
> +				tile->id, ret);
> +		return ret;
> +	}
> +	mr->hpa_base = res->start;
> +
> +	drm_info(&xe->drm, "Added tile %d memory [%llx-%llx] to devm, remapped to %pr\n",
> +			tile->id, mr->io_start, mr->io_start + mr->usable_size, res);
> +	return 0;
> +}
> +
> +/**
> + * xe_devm_remove: Unmap device memory and free resources
> + * @tile: xe tile
> + * @mr: memory region to remove
> + */
> +void xe_devm_remove(struct xe_tile *tile, struct xe_mem_region *mr)
> +{
> +	struct device *dev = &to_pci_dev(tile->xe->drm.dev)->dev;
> +
> +	/* FIXME: Does the below cause a kernel hang during module remove? */

This goes for the series - try to resolve issues rather than leaving FIXMEs.

Matt

> +	if (mr->hpa_base) {
> +		devm_memunmap_pages(dev, &mr->pagemap);
> +		devm_release_mem_region(dev, mr->pagemap.range.start,
> +			mr->pagemap.range.end - mr->pagemap.range.start + 1);
> +	}
> +}
> +
> diff --git a/drivers/gpu/drm/xe/xe_tile.c b/drivers/gpu/drm/xe/xe_tile.c
> index 0650b2fa75ef..f1c4f9de51df 100644
> --- a/drivers/gpu/drm/xe/xe_tile.c
> +++ b/drivers/gpu/drm/xe/xe_tile.c
> @@ -14,6 +14,7 @@
>  #include "xe_tile_sysfs.h"
>  #include "xe_ttm_vram_mgr.h"
>  #include "xe_wa.h"
> +#include "xe_svm.h"
>  
>  /**
>   * DOC: Multi-tile Design
> @@ -158,6 +159,7 @@ static int tile_ttm_mgr_init(struct xe_tile *tile)
>   */
>  int xe_tile_init_noalloc(struct xe_tile *tile)
>  {
> +	struct xe_device *xe = tile_to_xe(tile);
>  	int err;
>  
>  	xe_device_mem_access_get(tile_to_xe(tile));
> @@ -175,6 +177,8 @@ int xe_tile_init_noalloc(struct xe_tile *tile)
>  
>  	xe_tile_sysfs_init(tile);
>  
> +	if (xe->info.has_usm)
> +		xe_devm_add(tile, &tile->mem.vram);
>  err_mem_access:
>  	xe_device_mem_access_put(tile_to_xe(tile));
>  	return err;
> -- 
> 2.26.3
> 

^ permalink raw reply	[flat|nested] 72+ messages in thread

* Re: [v2 13/31] drm/xe/svm: Introduce DRM_XE_SVM kernel config
  2024-04-09 20:17 ` [v2 13/31] drm/xe/svm: Introduce DRM_XE_SVM kernel config Oak Zeng
@ 2024-04-10 21:13   ` Matthew Brost
  2024-06-04 18:57     ` Zeng, Oak
  0 siblings, 1 reply; 72+ messages in thread
From: Matthew Brost @ 2024-04-10 21:13 UTC (permalink / raw)
  To: Oak Zeng
  Cc: intel-xe, himal.prasad.ghimiray, krishnaiah.bommu,
	Thomas.Hellstrom, brian.welty

On Tue, Apr 09, 2024 at 04:17:24PM -0400, Oak Zeng wrote:
> Introduce a DRM_XE_SVM kernel config entry for

Maybe consider another name for this? I could see use cases for non-SVM
where we still want private pages mapped (e.g. VRAM userptrs on
non-faulting devices). I don't really have a suggestion, but it's worth
considering.

> the xe svm feature. The xe svm feature allows sharing the
> virtual address space between CPU and GPU programs.
> 
> v1: Improve commit message (Thomas)
>     Avoid using #if directive (Thomas)
> 
> Signed-off-by: Oak Zeng <oak.zeng@intel.com>
> Co-developed-by: Niranjana Vishwanathapura <niranjana.vishwanathapura@intel.com>
> Signed-off-by: Niranjana Vishwanathapura <niranjana.vishwanathapura@intel.com>
> Cc: Matthew Brost <matthew.brost@intel.com>
> Cc: Thomas Hellström <thomas.hellstrom@intel.com>
> Cc: Brian Welty <brian.welty@intel.com>
> ---
>  drivers/gpu/drm/xe/Kconfig   | 21 +++++++++++++++++++++
>  drivers/gpu/drm/xe/xe_tile.c |  7 +++++--
>  2 files changed, 26 insertions(+), 2 deletions(-)
> 
> diff --git a/drivers/gpu/drm/xe/Kconfig b/drivers/gpu/drm/xe/Kconfig
> index 449a1ecbc92a..0accb2cb81d6 100644
> --- a/drivers/gpu/drm/xe/Kconfig
> +++ b/drivers/gpu/drm/xe/Kconfig
> @@ -84,6 +84,27 @@ config DRM_XE_FORCE_PROBE
>  	  4571.
>  
>  	  Use "!*" to block the probe of the driver for all known devices.
> +config DRM_XE_SVM
> +	bool "Enable Shared Virtual Memory support in xe"
> +	depends on DRM_XE
> +	depends on ARCH_ENABLE_MEMORY_HOTPLUG
> +	depends on ARCH_ENABLE_MEMORY_HOTREMOVE
> +	depends on MEMORY_HOTPLUG
> +	depends on MEMORY_HOTREMOVE
> +	depends on ARCH_HAS_PTE_DEVMAP
> +	depends on SPARSEMEM_VMEMMAP
> +	depends on ZONE_DEVICE
> +	depends on DEVICE_PRIVATE
> +	depends on MMU
> +	select HMM_MIRROR
> +	select MMU_NOTIFIER
> +	default y
> +	help
> +	  Choose this option if you want Shared Virtual Memory (SVM)
> +	  support in xe. With SVM, virtual address space is shared
> +	  between CPU and GPU. This means any virtual address such
> +	  as malloc or mmap returns, variables on stack, or global
> +	  memory pointers, can be used for GPU transparently.
>  
>  menu "drm/Xe Debugging"
>  depends on DRM_XE
> diff --git a/drivers/gpu/drm/xe/xe_tile.c b/drivers/gpu/drm/xe/xe_tile.c
> index f1c4f9de51df..a1a436912fe3 100644
> --- a/drivers/gpu/drm/xe/xe_tile.c
> +++ b/drivers/gpu/drm/xe/xe_tile.c
> @@ -159,9 +159,12 @@ static int tile_ttm_mgr_init(struct xe_tile *tile)
>   */
>  int xe_tile_init_noalloc(struct xe_tile *tile)
>  {
> -	struct xe_device *xe = tile_to_xe(tile);
> +	struct xe_device __maybe_unused *xe;

Just assign this here unconditionally? The __maybe_unused should suppress the
warning when CONFIG_DRM_XE_SVM is false, and the assignment should just
compile out if it is.

Matt 
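A standalone sketch of the pattern being suggested, with a stand-in for the kernel's IS_ENABLED() macro (the real one, from <linux/kconfig.h>, also handles undefined and module symbols) and a hypothetical stub for xe_devm_add():

```c
#include <assert.h>

/* Stand-in for the kernel's IS_ENABLED(); assumes the symbol expands to 1. */
#define CONFIG_DRM_XE_SVM 1
#define IS_ENABLED(option) (option)

static int devm_add_calls;

/* Hypothetical stub standing in for xe_devm_add(). */
static void xe_devm_add_stub(void)
{
	devm_add_calls++;
}

/*
 * The suggested shape: guard only the call site with IS_ENABLED().
 * When the config is off, the condition is constant-false and the
 * compiler eliminates the branch, so no #ifdef or runtime guard
 * around the earlier assignment is needed.
 */
static void tile_init_noalloc(int has_usm)
{
	if (IS_ENABLED(CONFIG_DRM_XE_SVM) && has_usm)
		xe_devm_add_stub();
}
```

This is the usual kernel idiom: preferring IS_ENABLED() over #if keeps both branches visible to the compiler, so dead code is still type-checked.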

>  	int err;
>  
> +	if (IS_ENABLED(CONFIG_DRM_XE_SVM))
> +		xe = tile_to_xe(tile);
> +
>  	xe_device_mem_access_get(tile_to_xe(tile));
>  
>  	err = tile_ttm_mgr_init(tile);
> @@ -177,7 +180,7 @@ int xe_tile_init_noalloc(struct xe_tile *tile)
>  
>  	xe_tile_sysfs_init(tile);
>  
> -	if (xe->info.has_usm)
> +	if (IS_ENABLED(CONFIG_DRM_XE_SVM) && xe->info.has_usm)
>  		xe_devm_add(tile, &tile->mem.vram);
>  err_mem_access:
>  	xe_device_mem_access_put(tile_to_xe(tile));
> -- 
> 2.26.3
> 

^ permalink raw reply	[flat|nested] 72+ messages in thread

* Re: [v2 14/31] drm/xe: Introduce helper to get tile from memory region
  2024-04-09 20:17 ` [v2 14/31] drm/xe: Introduce helper to get tile from memory region Oak Zeng
@ 2024-04-10 21:17   ` Matthew Brost
  0 siblings, 0 replies; 72+ messages in thread
From: Matthew Brost @ 2024-04-10 21:17 UTC (permalink / raw)
  To: Oak Zeng
  Cc: intel-xe, himal.prasad.ghimiray, krishnaiah.bommu,
	Thomas.Hellstrom, brian.welty

On Tue, Apr 09, 2024 at 04:17:25PM -0400, Oak Zeng wrote:
> Introduce a simple helper to retrieve tile from memory region
> 
> v1: move the function to xe_device.h (Matt)
>     improve commit message, add kerneldoc (Thomas)
> 
> Signed-off-by: Oak Zeng <oak.zeng@intel.com>

This LGTM but can it be moved to xe_tile.h? That might be a better place
but I know xe_device.h and xe_tile.h are intertwined a bit.

Matt

> ---
>  drivers/gpu/drm/xe/xe_device.h | 8 ++++++++
>  1 file changed, 8 insertions(+)
> 
> diff --git a/drivers/gpu/drm/xe/xe_device.h b/drivers/gpu/drm/xe/xe_device.h
> index 74eb9833d4d8..68082357aebd 100644
> --- a/drivers/gpu/drm/xe/xe_device.h
> +++ b/drivers/gpu/drm/xe/xe_device.h
> @@ -178,4 +178,12 @@ u64 xe_device_uncanonicalize_addr(struct xe_device *xe, u64 address);
>  
>  void xe_device_put_deferred(struct xe_device *xe, struct llist_node *deferred);
>  
> +/**
> + * xe_mem_region_to_tile() - retrieve tile from memory region
> + * @mr: the memory region we retrieve tile from
> + */
> +static inline struct xe_tile *xe_mem_region_to_tile(struct xe_mem_region *mr)
> +{
> +	return container_of(mr, struct xe_tile, mem.vram);
> +}
>  #endif
> -- 
> 2.26.3
> 

^ permalink raw reply	[flat|nested] 72+ messages in thread

* Re: [v2 15/31] drm/xe: Introduce a helper to get dpa from pfn
  2024-04-09 20:17 ` [v2 15/31] drm/xe: Introduce a helper to get dpa from pfn Oak Zeng
@ 2024-04-10 21:35   ` Matthew Brost
  0 siblings, 0 replies; 72+ messages in thread
From: Matthew Brost @ 2024-04-10 21:35 UTC (permalink / raw)
  To: Oak Zeng
  Cc: intel-xe, himal.prasad.ghimiray, krishnaiah.bommu,
	Thomas.Hellstrom, brian.welty

On Tue, Apr 09, 2024 at 04:17:26PM -0400, Oak Zeng wrote:
> Since we now create struct page backing for each vram page,
> each vram page now also has a pfn, just like system memory.
> This allows us to calculate the device physical address from the pfn.
> 
> v1: move the function to xe_svm.h (Matt)
>     s/vram_pfn_to_dpa/xe_mem_region_pfn_to_dpa (Matt)
>     add kernel document for the helper (Thomas)
> 
> Signed-off-by: Oak Zeng <oak.zeng@intel.com>
> ---
>  drivers/gpu/drm/xe/xe_svm.h | 27 +++++++++++++++++++++++++--
>  1 file changed, 25 insertions(+), 2 deletions(-)
> 
> diff --git a/drivers/gpu/drm/xe/xe_svm.h b/drivers/gpu/drm/xe/xe_svm.h
> index e944971cfc6d..8a34429eb674 100644
> --- a/drivers/gpu/drm/xe/xe_svm.h
> +++ b/drivers/gpu/drm/xe/xe_svm.h
> @@ -6,8 +6,31 @@
>  #ifndef __XE_SVM_H
>  #define __XE_SVM_H
>  
> -struct xe_tile;
> -struct xe_mem_region;
> +#include "xe_device_types.h"
> +#include "xe_device.h"
> +#include "xe_assert.h"

Hmm, including all these headers is frowned upon and indicates to me
this is likely the wrong location. The new header should ideally be clean,
with only forward declarations and function definitions. I know the Xe
headers are not great at this, but let's not make this worse than it is.

Maybe this should be in xe_device.h? Also, if we move the entire
implementation into one *.c file, it is possible this function can be
private to that C file too.

> +
> +/**
> + * xe_mem_region_pfn_to_dpa() - Calculate page's dpa from pfn
> + *
> + * @mr: The memory region that page resides in
> + * @pfn: page frame number of the page
> + *
> + * Returns: the device physical address of the page
> + */
> +static inline u64 xe_mem_region_pfn_to_dpa(struct xe_mem_region *mr, u64 pfn)

I'd change this to xe_mem_region_page_to_dpa with a struct page argument
rather than a pfn. The pfn can then be derived from the page.

I think this is better, as we will have 3 types of pfns, all with
different values / shifts:

- hmm pfn
- migrate pfn
- linux core pfn

If a migrate pfn or hmm pfn were passed in as an argument, we'd get the
wrong dpa. I think passing in a page is safer and less bug prone. In my
example if we had a migrate pfn or hmm pfn, we'd use the appropriate
helper to get the struct page.

This also aligns with how a similar AMD helper (svm_migrate_addr, [1]) is
implemented.

Matt

[1] https://elixir.bootlin.com/linux/latest/source/drivers/gpu/drm/amd/amdkfd/kfd_migrate.c#L234
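For readers following along, the address translation under discussion boils down to a rebasing of offsets. Here is a standalone userspace sketch of the arithmetic; the struct is trimmed to the two fields involved, and in the kernel the pfn would come from page_to_pfn() on the struct page argument:

```c
#include <assert.h>
#include <stdint.h>

#define PAGE_SHIFT 12

/* Trimmed stand-in for struct xe_mem_region. */
struct xe_mem_region {
	uint64_t hpa_base; /* host physical base from devm_request_free_mem_region() */
	uint64_t dpa_base; /* device physical base of this VRAM region */
};

/*
 * Core of the proposed helper: translate a host-physical pfn into a
 * device physical address by rebasing its offset into the region.
 */
static uint64_t pfn_to_dpa(const struct xe_mem_region *mr, uint64_t pfn)
{
	uint64_t hpa = pfn << PAGE_SHIFT;

	assert(hpa >= mr->hpa_base); /* page must lie inside this region */
	return mr->dpa_base + (hpa - mr->hpa_base);
}
```

Taking a struct page (and deriving the pfn internally) makes it impossible to accidentally feed this a migrate pfn or hmm pfn, which carry extra flag bits and shifts.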

> +{
> +	u64 dpa;
> +	struct xe_tile *tile = xe_mem_region_to_tile(mr);
> +	struct xe_device *xe = tile_to_xe(tile);
> +	u64 offset;
> +
> +	xe_assert(xe, (pfn << PAGE_SHIFT) >= mr->hpa_base);
> +	offset = (pfn << PAGE_SHIFT) - mr->hpa_base;
> +	dpa = mr->dpa_base + offset;
> +
> +	return dpa;
> +}
>  
>  int xe_devm_add(struct xe_tile *tile, struct xe_mem_region *mr);
>  void xe_devm_remove(struct xe_tile *tile, struct xe_mem_region *mr);
> -- 
> 2.26.3
> 

^ permalink raw reply	[flat|nested] 72+ messages in thread

* Re: [v2 16/31] drm/xe/svm: Get xe memory region from page
  2024-04-09 20:17 ` [v2 16/31] drm/xe/svm: Get xe memory region from page Oak Zeng
@ 2024-04-10 21:38   ` Matthew Brost
  0 siblings, 0 replies; 72+ messages in thread
From: Matthew Brost @ 2024-04-10 21:38 UTC (permalink / raw)
  To: Oak Zeng
  Cc: intel-xe, himal.prasad.ghimiray, krishnaiah.bommu,
	Thomas.Hellstrom, brian.welty

On Tue, Apr 09, 2024 at 04:17:27PM -0400, Oak Zeng wrote:
> Each gpu vram page now has a struct page backing
> it. The struct page's pgmap points to the xe_memory_region's
> pagemap. This allows us to retrieve the xe_memory_region
> from a struct page.
> 
> v1: move the function to xe_svm.h
> 
> Signed-off-by: Oak Zeng <oak.zeng@intel.com>
> ---
>  drivers/gpu/drm/xe/xe_svm.h | 11 +++++++++++
>  1 file changed, 11 insertions(+)
> 
> diff --git a/drivers/gpu/drm/xe/xe_svm.h b/drivers/gpu/drm/xe/xe_svm.h
> index 8a34429eb674..624c1581f8ba 100644
> --- a/drivers/gpu/drm/xe/xe_svm.h
> +++ b/drivers/gpu/drm/xe/xe_svm.h

See my comments about the location in the previous patch; also check whether
this can be a private function if the implementation is in one *.c file.

> @@ -6,6 +6,7 @@
>  #ifndef __XE_SVM_H
>  #define __XE_SVM_H
>  
> +#include <linux/mm_types.h>
>  #include "xe_device_types.h"
>  #include "xe_device.h"
>  #include "xe_assert.h"
> @@ -35,4 +36,14 @@ static inline u64 xe_mem_region_pfn_to_dpa(struct xe_mem_region *mr, u64 pfn)
>  int xe_devm_add(struct xe_tile *tile, struct xe_mem_region *mr);
>  void xe_devm_remove(struct xe_tile *tile, struct xe_mem_region *mr);
>  
> +/**
> + * xe_page_to_mem_region() - Get a page's memory region
> + *
> + * @page: a struct page pointer pointing to a page in vram memory region
> + */
> +static inline struct xe_mem_region *xe_page_to_mem_region(struct page *page)
> +{
> +	return container_of(page->pgmap, struct xe_mem_region, pagemap);
> +}

If the previous patch's helper becomes xe_mem_region_page_to_dpa and we
want very robust code, we could add an assert to that function:

xe_assert(xe, mr == xe_page_to_mem_region(page));

Matt

> +
>  #endif
> -- 
> 2.26.3
> 

^ permalink raw reply	[flat|nested] 72+ messages in thread

* Re: [v2 17/31] drm/xe: Get xe_vma from xe_userptr
  2024-04-09 20:17 ` [v2 17/31] drm/xe: Get xe_vma from xe_userptr Oak Zeng
@ 2024-04-10 21:42   ` Matthew Brost
  0 siblings, 0 replies; 72+ messages in thread
From: Matthew Brost @ 2024-04-10 21:42 UTC (permalink / raw)
  To: Oak Zeng
  Cc: intel-xe, himal.prasad.ghimiray, krishnaiah.bommu,
	Thomas.Hellstrom, brian.welty

On Tue, Apr 09, 2024 at 04:17:28PM -0400, Oak Zeng wrote:
> Introduce a helper to get xe_vma from xe_userptr.
> 
> Signed-off-by: Oak Zeng <oak.zeng@intel.com>
> ---
>  drivers/gpu/drm/xe/xe_vm.h | 14 ++++++++++++++
>  1 file changed, 14 insertions(+)
> 
> diff --git a/drivers/gpu/drm/xe/xe_vm.h b/drivers/gpu/drm/xe/xe_vm.h
> index 0b2790f697db..4860747592ad 100644
> --- a/drivers/gpu/drm/xe/xe_vm.h
> +++ b/drivers/gpu/drm/xe/xe_vm.h
> @@ -178,6 +178,20 @@ static inline struct xe_userptr_vma *to_userptr_vma(struct xe_vma *vma)
>  	return container_of(vma, struct xe_userptr_vma, vma);
>  }
>  
> +/**
> + * xe_userptr_to_vma() - Return xe_vma from a xe_userptr pointer
> + *
> + * @userptr: The userptr struct pointer
> + */
> +

Extra newline. Otherwise LGTM.

Matt
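For readers unfamiliar with the idiom, the helper below is a plain container_of() lookup: given a pointer to an embedded member, recover the enclosing struct. A self-contained userspace sketch, with the structs reduced to stubs and a simplified container_of (the kernel's version adds type checking):

```c
#include <assert.h>
#include <stddef.h>

/* Simplified container_of(); the kernel's macro also type-checks ptr. */
#define container_of(ptr, type, member) \
	((type *)((char *)(ptr) - offsetof(type, member)))

struct xe_vma { int id; };
struct xe_userptr { unsigned long notifier_seq; };

/* Mirrors the patch's layout: both members embedded in one struct. */
struct xe_userptr_vma {
	struct xe_vma vma;
	struct xe_userptr userptr;
};

static struct xe_vma *xe_userptr_to_vma(struct xe_userptr *userptr)
{
	struct xe_userptr_vma *uvma =
		container_of(userptr, struct xe_userptr_vma, userptr);

	return &uvma->vma;
}
```

The lookup is pure pointer arithmetic, valid only when the xe_userptr really is embedded in an xe_userptr_vma.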

> +static inline struct xe_vma *xe_userptr_to_vma(struct xe_userptr *userptr)
> +{
> +	struct xe_userptr_vma *uvma;
> +
> +	uvma = container_of(userptr, struct xe_userptr_vma, userptr);
> +	return &uvma->vma;
> +}
> +
>  u64 xe_vm_pdp4_descriptor(struct xe_vm *vm, struct xe_tile *tile);
>  
>  int xe_vm_create_ioctl(struct drm_device *dev, void *data,
> -- 
> 2.26.3
> 

^ permalink raw reply	[flat|nested] 72+ messages in thread

* Re: [v2 18/31] drm/xe/svm: Build userptr sg table for device pages
  2024-04-09 20:17 ` [v2 18/31] drm/xe/svm: Build userptr sg table for device pages Oak Zeng
@ 2024-04-10 21:52   ` Matthew Brost
  0 siblings, 0 replies; 72+ messages in thread
From: Matthew Brost @ 2024-04-10 21:52 UTC (permalink / raw)
  To: Oak Zeng
  Cc: intel-xe, himal.prasad.ghimiray, krishnaiah.bommu,
	Thomas.Hellstrom, brian.welty

On Tue, Apr 09, 2024 at 04:17:29PM -0400, Oak Zeng wrote:
> Previously the function xe_build_sg only supported userptrs backed by
> system memory pages. Now this function is extended to support
> userptrs backed by device pages as well.
> 
> For device pages, there is no need of dma-mapping. Instead, we
> calculated the device page's dpa (device physical address) and
> use dpa to fill sg table.
> 
> As of now, we assume each userptr is only backed either by all
> system memory pages or all by device pages. There is no support
> of mixture backing of device and system memory pages.
> 

I'm not sure if this is allowed, per Jason's suggestion (or rather
insistence) not to use an sg list for a collection of dpas [1].

For a single device we should just be able to use the buddy blocks as
the cursor, which I suggest in [1]. Maybe this doesn't work in the
multi-device case, but it certainly should work for a single device.
Since we are working on a single device first, let's get it working
without abusing an SG list.

Matt

[1] https://patchwork.freedesktop.org/patch/574894/?series=128910&rev=1

> Signed-off-by: Oak Zeng <oak.zeng@intel.com>
> ---
>  drivers/gpu/drm/xe/xe_hmm.c      | 121 +++++++++++++++++++++++++------
>  drivers/gpu/drm/xe/xe_vm_types.h |   2 +
>  2 files changed, 100 insertions(+), 23 deletions(-)
> 
> diff --git a/drivers/gpu/drm/xe/xe_hmm.c b/drivers/gpu/drm/xe/xe_hmm.c
> index 427c6bc49949..a261c1dd2060 100644
> --- a/drivers/gpu/drm/xe/xe_hmm.c
> +++ b/drivers/gpu/drm/xe/xe_hmm.c
> @@ -11,6 +11,7 @@
>  #include <linux/hmm.h>
>  #include <linux/mm.h>
>  #include "xe_hmm.h"
> +#include "xe_svm.h"
>  #include "xe_vm.h"
>  #include "xe_bo.h"
>  
> @@ -43,15 +44,90 @@ static void xe_mark_range_accessed(struct hmm_range *range, bool write)
>  	}
>  }
>  
> +/**
> + * xe_build_sg_device_pages() - build sg table for userptr when the backing store
> + * is device pages
> + *
> + * @st: sg table to build
> + * @hmm_pfns: pfn array of the userptr
> + * @pages: struct page array of this userptr
> + * @npages: how many pages in this userptr
> + */
> +static int xe_build_sg_device_pages(struct sg_table *st, unsigned long *hmm_pfns,
> +						struct page **pages, uint64_t npages)
> +{
> +	struct scatterlist *sg;
> +	int i;
> +
> +	sg = NULL;
> +	st->nents = 0;
> +	if (unlikely(sg_alloc_table(st, npages, GFP_KERNEL)))
> +		return -ENOMEM;
> +
> +	for (i = 0; i < npages; i++) {
> +		unsigned long addr;
> +		struct xe_mem_region *mr;
> +
> +		mr = xe_page_to_mem_region(pages[i]);
> +		addr = xe_mem_region_pfn_to_dpa(mr, hmm_pfns[i]);
> +		if (sg && (addr == (sg_dma_address(sg) + sg->length))) {
> +			sg->length += PAGE_SIZE;
> +			sg_dma_len(sg) += PAGE_SIZE;
> +			continue;
> +		}
> +
> +		sg =  sg ? sg_next(sg) : st->sgl;
> +		sg_dma_address(sg) = addr;
> +		sg_dma_len(sg) = PAGE_SIZE;
> +		sg->length = PAGE_SIZE;
> +		st->nents++;
> +	}
> +
> +	sg_mark_end(sg);
> +	return 0;
> +}
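As an aside for readers: the loop above is a run-length coalescing of contiguous device physical addresses into (address, length) entries. A standalone userspace sketch of just that logic, with a hypothetical `chunk` type standing in for a scatterlist entry:

```c
#include <stdint.h>
#include <stddef.h>

#define PAGE_SIZE 0x1000ULL

/* Hypothetical stand-in for one scatterlist entry. */
struct chunk {
	uint64_t addr;
	uint64_t len;
};

/*
 * Merge consecutive page-sized device addresses into chunks, exactly as
 * the quoted loop merges them into scatterlist entries. Returns the
 * number of chunks produced; out must have room for n entries.
 */
static size_t coalesce(const uint64_t *dpa, size_t n, struct chunk *out)
{
	size_t nents = 0;

	for (size_t i = 0; i < n; i++) {
		if (nents && dpa[i] == out[nents - 1].addr + out[nents - 1].len) {
			out[nents - 1].len += PAGE_SIZE; /* extend current entry */
			continue;
		}
		out[nents].addr = dpa[i];
		out[nents].len = PAGE_SIZE;
		nents++;
	}
	return nents;
}
```

Whether this lives in an sg table or a buddy-block cursor (as suggested elsewhere in the review), the coalescing itself is the same.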
> +
> +/**
> + * xe_validate_hmm_pfns() - validate all pages in a userptr belong to one memory
> + * region, and populate the pages array.
> + *
> + * @userptr: The userptr to validate
> + * @hmm_pfns: an array holding hmm pfns
> + * @npages: number of pages of this userptr
> + * @pages: output parameter to hold the populated pages from pfn.
> + */
> +static void xe_validate_hmm_pfns(struct xe_userptr *userptr, unsigned long *hmm_pfns,
> +						uint64_t npages, struct page **pages)
> +{
> +	int i;
> +	struct xe_vma *vma = xe_userptr_to_vma(userptr);
> +	struct xe_vm *vm = xe_vma_vm(vma);
> +
> +	pages[0] = hmm_pfn_to_page(hmm_pfns[0]);
> +	userptr->is_device_pages = is_device_private_page(pages[0]);
> +	for (i = 1; i < npages; i++) {
> +		pages[i] = hmm_pfn_to_page(hmm_pfns[i]);
> +		/**
> +		 * We currently assume no mixture of device pages and system memory
> +		 * pages in one userptr. If it turns out this is not true, we will
> +		 * either split the userptr into device-page based and system-memory
> +		 * based parts, or support a mixed backing store in one userptr.
> +		 */
> +		xe_assert(vm->xe,
> +			userptr->is_device_pages == is_device_private_page(pages[i]));
> +	}
> +}
> +
> +
>  /**
>   * xe_build_sg() - build a scatter gather table for all the physical pages/pfn
>   * in a hmm_range. dma-map pages if necessary. dma-address is save in sg table
>   * and will be used to program GPU page table later.
>   *
>   * @xe: the xe device who will access the dma-address in sg table
> + * @userptr: the userptr that we build the sg table for
>   * @range: the hmm range that we build the sg table from. range->hmm_pfns[]
>   * has the pfn numbers of pages that back up this hmm address range.
> - * @st: pointer to the sg table.
>   * @write: whether we write to this range. This decides dma map direction
> +	 * for system pages. If write we map it bi-directional; otherwise
>   * DMA_TO_DEVICE
> @@ -64,11 +140,6 @@ static void xe_mark_range_accessed(struct hmm_range *range, bool write)
>   * access memory. So if the memory is system memory, we need to
>   * do a dma-mapping so it can be accessed by GPU/DMA.
>   *
> - * FIXME: This function currently only support pages in system
> - * memory. If the memory is GPU local memory (of the GPU who
> - * is going to access memory), we need gpu dpa (device physical
> - * address), and there is no need of dma-mapping. This is TBD.
> - *
>   * FIXME: dma-mapping for peer gpu device to access remote gpu's
>   * memory. Add this when you support p2p
>   *
> @@ -77,12 +148,13 @@ static void xe_mark_range_accessed(struct hmm_range *range, bool write)
>   *
>   * Returns 0 if successful; -ENOMEM if fails to allocate memory
>   */
> -static int xe_build_sg(struct xe_device *xe, struct hmm_range *range,
> -			     struct sg_table *st, bool write)
> +static int xe_build_sg(struct xe_device *xe, struct xe_userptr *userptr,
> +					struct hmm_range *range, bool write)
>  {
> +	struct sg_table *st = &userptr->sgt;
>  	struct device *dev = xe->drm.dev;
>  	struct page **pages;
> -	u64 i, npages;
> +	u64 npages;
>  	int ret;
>  
>  	npages = xe_npages_in_range(range->start, range->end);
> @@ -90,19 +162,22 @@ static int xe_build_sg(struct xe_device *xe, struct hmm_range *range,
>  	if (!pages)
>  		return -ENOMEM;
>  
> -	for (i = 0; i < npages; i++) {
> -		pages[i] = hmm_pfn_to_page(range->hmm_pfns[i]);
> -		xe_assert(xe, !is_device_private_page(pages[i]));
> -	}
> -
> -	ret = sg_alloc_table_from_pages_segment(st, pages, npages, 0,
> -			npages << PAGE_SHIFT, xe_sg_segment_size(dev), GFP_KERNEL);
> -	if (ret)
> -		goto free_pages;
> +	xe_validate_hmm_pfns(userptr, range->hmm_pfns, npages, pages);
> +	if (!userptr->is_device_pages) {
> +		ret = sg_alloc_table_from_pages_segment(st, pages, npages, 0,
> +				npages << PAGE_SHIFT, xe_sg_segment_size(dev), GFP_KERNEL);
> +		if (ret)
> +			goto free_pages;
>  
> -	ret = dma_map_sgtable(dev, st, write ? DMA_BIDIRECTIONAL : DMA_TO_DEVICE,
> -			DMA_ATTR_SKIP_CPU_SYNC | DMA_ATTR_NO_KERNEL_MAPPING);
> +		ret = dma_map_sgtable(dev, st, write ? DMA_BIDIRECTIONAL : DMA_TO_DEVICE,
> +				DMA_ATTR_SKIP_CPU_SYNC | DMA_ATTR_NO_KERNEL_MAPPING);
> +	} else {
> +		ret = xe_build_sg_device_pages(st, range->hmm_pfns, pages, npages);
> +		if (ret)
> +			goto free_pages;
> +	}
>  
> +	userptr->sg = st;
>  free_pages:
>  	kvfree(pages);
>  	return ret;
> @@ -127,7 +202,8 @@ void xe_userptr_free_sg(struct xe_userptr_vma *uvma)
>  	struct device *dev = xe->drm.dev;
>  
>  	xe_assert(xe, userptr->sg);
> -	dma_unmap_sgtable(dev, userptr->sg,
> +	if (!userptr->is_device_pages)
> +		dma_unmap_sgtable(dev, userptr->sg,
>  			write ? DMA_BIDIRECTIONAL : DMA_TO_DEVICE, 0);
>  
>  	sg_free_table(userptr->sg);
> @@ -239,12 +315,11 @@ int xe_userptr_populate_range(struct xe_userptr_vma *uvma)
>  	if (ret)
>  		goto free_pfns;
>  
> -	ret = xe_build_sg(vm->xe, &hmm_range, &userptr->sgt, write);
> +	ret = xe_build_sg(vm->xe, userptr, &hmm_range, write);
>  	if (ret)
>  		goto free_pfns;
>  
>  	xe_mark_range_accessed(&hmm_range, write);
> -	userptr->sg = &userptr->sgt;
>  	userptr->notifier_seq = hmm_range.notifier_seq;
>  
>  free_pfns:
> diff --git a/drivers/gpu/drm/xe/xe_vm_types.h b/drivers/gpu/drm/xe/xe_vm_types.h
> index fbf6bfcf59a8..3b4debfecc9b 100644
> --- a/drivers/gpu/drm/xe/xe_vm_types.h
> +++ b/drivers/gpu/drm/xe/xe_vm_types.h
> @@ -64,6 +64,8 @@ struct xe_userptr {
>  	struct sg_table *sg;
>  	/** @notifier_seq: notifier sequence number */
>  	unsigned long notifier_seq;
> +	/** @is_device_pages: the backing store is in device memory*/
> +	bool is_device_pages;
>  	/**
>  	 * @initial_bind: user pointer has been bound at least once.
>  	 * write: vm->userptr.notifier_lock in read mode and vm->resv held.
> -- 
> 2.26.3
> 

^ permalink raw reply	[flat|nested] 72+ messages in thread

* Re: [v2 19/31] drm/xe/svm: Determine a vma is backed by device memory
  2024-04-09 20:17 ` [v2 19/31] drm/xe/svm: Determine a vma is backed by device memory Oak Zeng
@ 2024-04-10 21:56   ` Matthew Brost
  2024-06-05  2:29     ` Zeng, Oak
  0 siblings, 1 reply; 72+ messages in thread
From: Matthew Brost @ 2024-04-10 21:56 UTC (permalink / raw)
  To: Oak Zeng
  Cc: intel-xe, himal.prasad.ghimiray, krishnaiah.bommu,
	Thomas.Hellstrom, brian.welty

On Tue, Apr 09, 2024 at 04:17:30PM -0400, Oak Zeng wrote:
> With the system allocator, a userptr can now be backed by device
> memory also. Introduce a helper function xe_vma_is_devmem
> to determine whether a vma is backed by device memory.
> 
> Signed-off-by: Oak Zeng <oak.zeng@intel.com>
> ---
>  drivers/gpu/drm/xe/xe_pt.c | 14 ++++++++++++--
>  1 file changed, 12 insertions(+), 2 deletions(-)
> 
> diff --git a/drivers/gpu/drm/xe/xe_pt.c b/drivers/gpu/drm/xe/xe_pt.c
> index 846e896edcb5..525092111be9 100644
> --- a/drivers/gpu/drm/xe/xe_pt.c
> +++ b/drivers/gpu/drm/xe/xe_pt.c
> @@ -577,6 +577,17 @@ static const struct xe_pt_walk_ops xe_pt_stage_bind_ops = {
>  	.pt_entry = xe_pt_stage_bind_entry,
>  };
>  
> +static bool xe_vma_is_devmem(struct xe_vma *vma)

At some point we probably want to scrub the driver, as we intermix
devmem, vram, and lmem nomenclature. I think in each case we mean the
same thing. Anyway, that is a little out of scope here.

> +{
> +	if (xe_vma_is_userptr(vma)) {
> +		struct xe_userptr_vma *uvma = to_userptr_vma(vma);
> +		return uvma->userptr.is_device_pages;

Helper itself LGTM. Maybe promote to xe_vm.c/xe_vm.h?

Also consider other options rather than the userptr.is_device_pages flag
here (e.g. look for buddy blocks, check gpuvm flags, etc.). I can live
with a flag, but if we can do without it, great.

Matt

> +	} else {
> +		struct xe_bo *bo = xe_vma_bo(vma);
> +		return bo && (xe_bo_is_vram(bo) || xe_bo_is_stolen_devmem(bo));
> +	}
> +}
> +
>  /**
>   * xe_pt_stage_bind() - Build a disconnected page-table tree for a given address
>   * range.
> @@ -601,8 +612,7 @@ xe_pt_stage_bind(struct xe_tile *tile, struct xe_vma *vma,
>  {
>  	struct xe_device *xe = tile_to_xe(tile);
>  	struct xe_bo *bo = xe_vma_bo(vma);
> -	bool is_devmem = !xe_vma_is_userptr(vma) && bo &&
> -		(xe_bo_is_vram(bo) || xe_bo_is_stolen_devmem(bo));
> +	bool is_devmem = xe_vma_is_devmem(vma);
>  	struct xe_res_cursor curs;
>  	struct xe_pt_stage_bind_walk xe_walk = {
>  		.base = {
> -- 
> 2.26.3
> 

^ permalink raw reply	[flat|nested] 72+ messages in thread

* Re: [v2 21/31] drm/xe/svm: Introduce svm migration function
  2024-04-09 20:17 ` [v2 21/31] drm/xe/svm: Introduce svm migration function Oak Zeng
@ 2024-04-10 22:06   ` Matthew Brost
  0 siblings, 0 replies; 72+ messages in thread
From: Matthew Brost @ 2024-04-10 22:06 UTC (permalink / raw)
  To: Oak Zeng
  Cc: intel-xe, himal.prasad.ghimiray, krishnaiah.bommu,
	Thomas.Hellstrom, brian.welty

On Tue, Apr 09, 2024 at 04:17:32PM -0400, Oak Zeng wrote:
> Introduce xe_migrate_pa function for data migration.
> This function is similar to the xe_migrate_copy function
> but has different parameters. Instead of BO and ttm
> resource parameters, it takes the source and destination
> buffers' physical addresses as parameters. This function is
> intended to be used by the svm sub-system, which doesn't
> have the BO and TTM concepts.
> 
> Signed-off-by: Oak Zeng <oak.zeng@intel.com>
> Cc: Niranjana Vishwanathapura <niranjana.vishwanathapura@intel.com>
> Cc: Matthew Brost <matthew.brost@intel.com>
> Cc: Thomas Hellström <thomas.hellstrom@intel.com>
> Cc: Brian Welty <brian.welty@intel.com>
> ---
>  drivers/gpu/drm/xe/xe_migrate.c | 217 ++++++++++++++++++++++++++++++++
>  drivers/gpu/drm/xe/xe_migrate.h |   7 ++
>  2 files changed, 224 insertions(+)
> 
> diff --git a/drivers/gpu/drm/xe/xe_migrate.c b/drivers/gpu/drm/xe/xe_migrate.c
> index 82b63bdb9c47..f1d53911253b 100644
> --- a/drivers/gpu/drm/xe/xe_migrate.c
> +++ b/drivers/gpu/drm/xe/xe_migrate.c
> @@ -462,6 +462,37 @@ static bool xe_migrate_allow_identity(u64 size, const struct xe_res_cursor *cur)
>  	return cur->size >= size;
>  }
>  
> +/**
> + * pte_update_cmd_size() - calculate the batch buffer command size
> + * to update a flat page table.
> + *
> + * @size: The virtual address range size of the page table to update
> + *
> + * The page table to update is supposed to be a flat 1 level page
> + * table with all entries pointing to 4k pages.
> + *
> + * Return the number of dwords of the update command
> + */
> +static u32 pte_update_cmd_size(u64 size)
> +{
> +	u32 dword;
> +	u64 entries = DIV_ROUND_UP(size, XE_PAGE_SIZE);
> +
> +	XE_WARN_ON(size > MAX_PREEMPTDISABLE_TRANSFER);
> +	/*
> +	 * MI_STORE_DATA_IMM command is used to update page table. Each
> +	 * instruction can update maximumly 0x1ff pte entries. To update
> +	 * n (n <= 0x1ff) pte entries, we need:
> +	 * 1 dword for the MI_STORE_DATA_IMM command header (opcode etc)
> +	 * 2 dword for the page table's physical location
> +	 * 2*n dword for value of pte to fill (each pte entry is 2 dwords)
> +	 */
> +	dword = (1 + 2) * DIV_ROUND_UP(entries, 0x1ff);
> +	dword += entries * 2;
> +
> +	return dword;
> +}
> +
>  static u32 pte_update_size(struct xe_migrate *m,
>  			   bool is_vram,
>  			   struct ttm_resource *res,
> @@ -562,6 +593,48 @@ static void emit_pte(struct xe_migrate *m,
>  	}
>  }
>  
> +/**
> + * build_pt_update_batch_sram() - build batch buffer commands to update
> + * migration vm page table for system memory
> + *
> + * @m: The migration context
> + * @bb: The batch buffer which hold the page table update commands
> + * @pt_offset: The offset of page table to update, in byte
> + * @pa: device physical address you want the page table to point to
> + * @size: size of the virtual address space you want the page table to cover
> + */
> +static void build_pt_update_batch_sram(struct xe_migrate *m,
> +		     struct xe_bb *bb, u32 pt_offset,
> +		     u64 pa, u32 size)
> +{
> +	u16 pat_index = tile_to_xe(m->tile)->pat.idx[XE_CACHE_WB];
> +	u32 ptes;
> +
> +	ptes = DIV_ROUND_UP(size, XE_PAGE_SIZE);
> +	while (ptes) {
> +		u32 chunk = min(0x1ffU, ptes);
> +
> +		bb->cs[bb->len++] = MI_STORE_DATA_IMM | MI_SDI_NUM_QW(chunk);
> +		bb->cs[bb->len++] = pt_offset;
> +		bb->cs[bb->len++] = 0;
> +
> +		pt_offset += chunk * 8;
> +		ptes -= chunk;
> +
> +		while (chunk--) {
> +			u64 addr;
> +
> +			addr = pa & PAGE_MASK;
> +			addr = m->q->vm->pt_ops->pte_encode_addr(m->tile->xe,
> +								 addr, pat_index,
> +								 0, false, 0);
> +			bb->cs[bb->len++] = lower_32_bits(addr);
> +			bb->cs[bb->len++] = upper_32_bits(addr);
> +			pa += XE_PAGE_SIZE;
> +		}
> +	}
> +}
> +
>  #define EMIT_COPY_CCS_DW 5
>  static void emit_copy_ccs(struct xe_gt *gt, struct xe_bb *bb,
>  			  u64 dst_ofs, bool dst_is_indirect,
> @@ -879,6 +952,150 @@ struct dma_fence *xe_migrate_copy(struct xe_migrate *m,
>  	return fence;
>  }
>  
> +/**
> + * xe_migrate_pa() - Migrate buffers with src and dst physical address
> + *
> + * @m: The migration context
> + * @src_pa: physical address of source, from GPU's point of view. This is a
> + * device physical address (dpa) when source is in vram. When source is in
> + * system memory, this is a dma mapped host physical address
> + * @src_is_vram: True if source buffer is in vram.
> + * @dst_pa: physical address of destination, from GPU's point of view. This is a
> + * device physical address (dpa) when source is in vram. When source is in
> + * system memory, this is a dma mapped host physical address
> + * @dst_is_vram: True if destination buffer is in vram.
> + * @size: The size of data to copy.
> + *
> + * Copy @size bytes of data from @src_pa to @dst_pa. The functionality
> + * and behavior of this function is similar to xe_migrate_copy function, but
> + * the interface is different. This function is a helper function supposed to
> + * be used by SVM subsytem. Since in SVM subsystem there is no buffer object
> + * and ttm, there is no src/dst bo as function input. Instead, we directly use
> + * src/dst's physical address as function input.
> + *
> + * Since the back store of any user malloc'ed or mmap'ed memory can be placed in
> + * system  memory, it can not be compressed. Thus this function doesn't need
> + * to consider copy CCS (compression control surface) data as xe_migrate_copy did.
> + *
> + * This function assumes the source buffer and destination buffer are all physically
> + * contiguous.
> + *
> + * We use gpu blitter to copy data. Source and destination are first mapped to
> + * migration vm which is a flat one level (L0) page table, then blitter is used to
> + * perform the copy.
> + *
> + * Return: Pointer to a dma_fence representing the last copy batch, or
> + * an error pointer on failure. If there is a failure, any copy operation
> + * started by the function call has been synced.
> + */
> +struct dma_fence *xe_migrate_pa(struct xe_migrate *m,
> +				  u64 src_pa,
> +				  bool src_is_vram,
> +				  u64 dst_pa,
> +				  bool dst_is_vram,
> +				  u64 size)

This assumes both addresses are contiguous if size > 4k.

I don't think that needs to be the case when one of the addresses is
sram (dma_addr), as we dynamically map sram pages into PT entries, i.e.
only VRAM addresses need to be contiguous.

I'd suggest this function take an array of dma_addr and one vram
address to maximize copy efficiency. Also add a direction variable
(i.e. whether vram is the source or the destination).

> +{
> +#define NUM_PT_PER_BLIT (MAX_PREEMPTDISABLE_TRANSFER / SZ_2M)
> +	struct xe_gt *gt = m->tile->primary_gt;
> +	struct xe_device *xe = gt_to_xe(gt);
> +	struct dma_fence *fence = NULL;
> +	u64 src_L0_ofs, dst_L0_ofs;
> +	u64 round_update_size;
> +	/* A slot is a 4K page of page table, covers 2M virtual address*/
> +	u32 pt_slot;
> +	int err;
> +
> +	while (size) {

We might not need this loop either if we make the caller enforce the
chunking (i.e. cap size at 2 MB or whatever MAX_PREEMPTDISABLE_TRANSFER
is).

> +		u32 batch_size = 2; /* arb_clear() + MI_BATCH_BUFFER_END */
> +		struct xe_sched_job *job;
> +		struct xe_bb *bb;
> +		u32 update_idx;
> +
> +		/* Maximumly copy MAX_PREEMPTDISABLE_TRANSFER bytes. Why?*/
> +		round_update_size = min_t(u64, size, MAX_PREEMPTDISABLE_TRANSFER);
> +
> +		/* src pte update*/
> +		if (!src_is_vram)
> +			batch_size += pte_update_cmd_size(round_update_size);
> +		/* dst pte update*/
> +		if (!dst_is_vram)
> +			batch_size += pte_update_cmd_size(round_update_size);
> +
> +		/* Copy command size*/
> +		batch_size += EMIT_COPY_DW;
> +
> +		bb = xe_bb_new(gt, batch_size, true);
> +		if (IS_ERR(bb)) {
> +			err = PTR_ERR(bb);
> +			goto err_sync;
> +		}
> +
> +		if (!src_is_vram) {
> +			pt_slot = 0;
> +			build_pt_update_batch_sram(m, bb, pt_slot * XE_PAGE_SIZE,
> +					src_pa, round_update_size);
> +			src_L0_ofs = xe_migrate_vm_addr(pt_slot, 0);
> +		}
> +		else
> +			src_L0_ofs = xe_migrate_vram_ofs(xe, src_pa);
> +
> +		if (!dst_is_vram) {
> +			pt_slot = NUM_PT_PER_BLIT;
> +			build_pt_update_batch_sram(m, bb, pt_slot * XE_PAGE_SIZE,
> +					dst_pa, round_update_size);
> +			dst_L0_ofs = xe_migrate_vm_addr(pt_slot, 0);
> +		}
> +		else
> +			dst_L0_ofs = xe_migrate_vram_ofs(xe, dst_pa);
> +
> +
> +		bb->cs[bb->len++] = MI_BATCH_BUFFER_END;
> +		update_idx = bb->len;
> +
> +		emit_copy(gt, bb, src_L0_ofs, dst_L0_ofs, round_update_size,
> +			  XE_PAGE_SIZE);
> +
> +		mutex_lock(&m->job_mutex);
> +		job = xe_bb_create_migration_job(m->q, bb,
> +						 xe_migrate_batch_base(m, true),
> +						 update_idx);
> +		if (IS_ERR(job)) {
> +			err = PTR_ERR(job);
> +			goto err;
> +		}
> +
> +		xe_sched_job_add_migrate_flush(job, 0);
> +		xe_sched_job_arm(job);
> +		dma_fence_put(fence);
> +		fence = dma_fence_get(&job->drm.s_fence->finished);
> +		xe_sched_job_push(job);
> +		dma_fence_put(m->fence);
> +		m->fence = dma_fence_get(fence);
> +
> +		mutex_unlock(&m->job_mutex);
> +
> +		xe_bb_free(bb, fence);
> +		size -= round_update_size;
> +		src_pa += round_update_size;
> +		dst_pa += round_update_size;
> +		continue;
> +
> +err:
> +		mutex_unlock(&m->job_mutex);
> +		xe_bb_free(bb, NULL);
> +
> +err_sync:
> +		/* Sync partial copy if any. FIXME: under job_mutex? */
> +		if (fence) {
> +			dma_fence_wait(fence, false);
> +			dma_fence_put(fence);
> +		}
> +
> +		return ERR_PTR(err);
> +	}
> +
> +	return fence;
> +}
>  static void emit_clear_link_copy(struct xe_gt *gt, struct xe_bb *bb, u64 src_ofs,
>  				 u32 size, u32 pitch)
>  {
> diff --git a/drivers/gpu/drm/xe/xe_migrate.h b/drivers/gpu/drm/xe/xe_migrate.h
> index 701bb27349b0..98b480244265 100644
> --- a/drivers/gpu/drm/xe/xe_migrate.h
> +++ b/drivers/gpu/drm/xe/xe_migrate.h
> @@ -101,6 +101,13 @@ struct dma_fence *xe_migrate_copy(struct xe_migrate *m,
>  				  struct ttm_resource *dst,
>  				  bool copy_only_ccs);
>  
> +struct dma_fence *xe_migrate_pa(struct xe_migrate *m,
> +				  u64 src_pa,
> +				  bool src_is_vram,
> +				  u64 dst_pa,
> +				  bool dst_is_vram,
> +				  u64 size);
> +

An option would be to export xe_migrate_from_vram / xe_migrate_to_vram
and have them internally call the function I suggested above with the
correct direction argument.

Matt

>  struct dma_fence *xe_migrate_clear(struct xe_migrate *m,
>  				   struct xe_bo *bo,
>  				   struct ttm_resource *dst);
> -- 
> 2.26.3
> 

^ permalink raw reply	[flat|nested] 72+ messages in thread

* Re: [v2 22/31] drm/xe/svm: implement functions to allocate and free device memory
  2024-04-09 20:17 ` [v2 22/31] drm/xe/svm: implement functions to allocate and free device memory Oak Zeng
@ 2024-04-10 22:23   ` Matthew Brost
  2024-04-15 20:13     ` Zeng, Oak
  2024-06-05 22:16     ` Zeng, Oak
  2024-04-17 20:55   ` Matthew Brost
  1 sibling, 2 replies; 72+ messages in thread
From: Matthew Brost @ 2024-04-10 22:23 UTC (permalink / raw)
  To: Oak Zeng
  Cc: intel-xe, himal.prasad.ghimiray, krishnaiah.bommu,
	Thomas.Hellstrom, brian.welty

On Tue, Apr 09, 2024 at 04:17:33PM -0400, Oak Zeng wrote:
> Function xe_devm_alloc_pages allocate pages from drm buddy and perform
> house keeping work for all the pages allocated, such as get a page
> refcount, keep a bitmap of all pages to denote whether a page is in
> use, put pages to a drm lru list for eviction purpose.
> 
> Function xe_devm_free_blocks return list of memory blocks to drm buddy
> allocator.
> 
> Function xe_devm_free_page is a call back function from hmm layer. It
> is called whenever a page's refcount reaches to 1. This function clears
> the bit of this page in the bitmap. If all the bits in the bitmap is
> cleared, it means all the pages have been freed, we return all the pages
> in this memory block back to drm buddy.
> 
> Signed-off-by: Oak Zeng <oak.zeng@intel.com>
> Co-developed-by: Niranjana Vishwanathapura <niranjana.vishwanathapura@intel.com>
> Signed-off-by: Niranjana Vishwanathapura <niranjana.vishwanathapura@intel.com>
> Cc: Matthew Brost <matthew.brost@intel.com>
> Cc: Thomas Hellström <thomas.hellstrom@intel.com>
> Cc: Brian Welty <brian.welty@intel.com>
> ---
>  drivers/gpu/drm/xe/xe_svm.h        |   7 ++
>  drivers/gpu/drm/xe/xe_svm_devmem.c | 147 ++++++++++++++++++++++++++++-

See comments about file organization in previous patches; they apply here too.

>  2 files changed, 152 insertions(+), 2 deletions(-)
> 
> diff --git a/drivers/gpu/drm/xe/xe_svm.h b/drivers/gpu/drm/xe/xe_svm.h
> index 624c1581f8ba..92a3ee90d5a7 100644
> --- a/drivers/gpu/drm/xe/xe_svm.h
> +++ b/drivers/gpu/drm/xe/xe_svm.h
> @@ -46,4 +46,11 @@ static inline struct xe_mem_region *xe_page_to_mem_region(struct page *page)
>  	return container_of(page->pgmap, struct xe_mem_region, pagemap);
>  }
>  
> +int xe_devm_alloc_pages(struct xe_tile *tile,
> +						unsigned long npages,
> +						struct list_head *blocks,
> +						unsigned long *pfn);
> +
> +void xe_devm_free_blocks(struct list_head *blocks);
> +void xe_devm_page_free(struct page *page);
>  #endif
> diff --git a/drivers/gpu/drm/xe/xe_svm_devmem.c b/drivers/gpu/drm/xe/xe_svm_devmem.c
> index 31af56e8285a..5ba0cd9a70b0 100644
> --- a/drivers/gpu/drm/xe/xe_svm_devmem.c
> +++ b/drivers/gpu/drm/xe/xe_svm_devmem.c
> @@ -5,18 +5,161 @@
>  
>  #include <linux/mm_types.h>
>  #include <linux/sched/mm.h>
> -
> +#include <linux/gfp.h>
> +#include <linux/migrate.h>
> +#include <linux/dma-mapping.h>
> +#include <linux/dma-fence.h>
> +#include <linux/bitops.h>
> +#include <linux/bitmap.h>
> +#include <drm/drm_buddy.h>
>  #include "xe_device_types.h"
>  #include "xe_svm.h"
> +#include "xe_migrate.h"
> +#include "xe_ttm_vram_mgr_types.h"
> +#include "xe_assert.h"
>  
> +/**
> + * struct xe_svm_block_meta - svm uses this data structure to manage each
> + * block allocated from drm buddy. This will be set to the drm_buddy_block's
> + * private field.
> + *
> + * @lru: used to link this block to drm's lru lists. This will be replace
> + * with struct drm_lru_entity later.
> + * @tile: tile from which we allocated this block
> + * @bitmap: A bitmap of each page in this block. 1 means this page is used,
> + * 0 means this page is idle. When all bits of this block are 0, it is time
> + * to return this block to drm buddy subsystem.
> + */
> +struct xe_svm_block_meta {
> +	struct list_head lru;
> +	struct xe_tile *tile;
> +	unsigned long bitmap[];
> +};

This doesn't look needed to me, though admittedly I haven't looked at the LRU stuff.

I am thinking roughly...

- I think we drop all this special tracking (kill xe_svm_block_meta)
- Have functions to allocate / free the buddy blocks, store buddy blocks in userptr
- Blocks are allocated before migration to VRAM
- Blocks can be freed on either CPU fault after migration or on VMA
  destroy (probably depends on madvise hints for the VMA where we free
  blocks)
- Blocks allocated / freed at a chunk (xe_vma in this code) granularity
  (conceptually the same if we switch to a 1 to N ratio between xe_vma &
  pt_state)
- block->private == memory region so we can get pfn from block
- When we need migrate_pfns we loop over buddy blocks populating migrate.dst

Also I noticed the drm_buddy_* calls in this file are not protected by a
lock; we will need that. Currently it is tile->mem.vram_mgr->lock in the
VRAM mgr code, so we either need to reach into there or move the lock to
a common place so the VRAM manager and block allocations for SVM don't
race with each other.

Matt

>  
>  static vm_fault_t xe_devm_migrate_to_ram(struct vm_fault *vmf)
>  {
>  	return 0;
>  }
>  
> -static void xe_devm_page_free(struct page *page)
> +static u64 block_offset_to_pfn(struct xe_mem_region *mr, u64 offset)
> +{
> +	/** DRM buddy's block offset is 0-based*/
> +	offset += mr->hpa_base;
> +
> +	return PHYS_PFN(offset);
> +}
> +
> +/** FIXME: we locked page by calling zone_device_page_init
> + *  in xe_devm_alloc_pages. Should we unlock pages here?
> + */
> +static void free_block(struct drm_buddy_block *block)
> +{
> +	struct xe_svm_block_meta *meta =
> +		(struct xe_svm_block_meta *)block->private;
> +	struct xe_tile *tile  = meta->tile;
> +	struct drm_buddy *mm = &tile->mem.vram_mgr->mm;
> +
> +	kfree(block->private);
> +	drm_buddy_free_block(mm, block);
> +}
> +
> +void xe_devm_page_free(struct page *page)
> +{
> +	struct drm_buddy_block *block =
> +					(struct drm_buddy_block *)page->zone_device_data;
> +	struct xe_svm_block_meta *meta =
> +					(struct xe_svm_block_meta *)block->private;
> +	struct xe_tile *tile  = meta->tile;
> +	struct xe_mem_region *mr = &tile->mem.vram;
> +	struct drm_buddy *mm = &tile->mem.vram_mgr->mm;
> +	u64 size = drm_buddy_block_size(mm, block);
> +	u64 pages_per_block = size >> PAGE_SHIFT;
> +	u64 block_pfn_first =
> +					block_offset_to_pfn(mr, drm_buddy_block_offset(block));
> +	u64 page_pfn = page_to_pfn(page);
> +	u64 i = page_pfn - block_pfn_first;
> +
> +	xe_assert(tile->xe, i < pages_per_block);
> +	clear_bit(i, meta->bitmap);
> +	if (bitmap_empty(meta->bitmap, pages_per_block))
> +		free_block(block);
> +}
> +
> +/**
> + * xe_devm_alloc_pages() - allocate device pages from buddy allocator
> + *
> + * @xe_tile: which tile to allocate device memory from
> + * @npages: how many pages to allocate
> + * @blocks: used to return the allocated blocks
> + * @pfn: used to return the pfn of all allocated pages. Must be big enough
> + * to hold at @npages entries.
> + *
> + * This function allocate blocks of memory from drm buddy allocator, and
> + * performs initialization work: set struct page::zone_device_data to point
> + * to the memory block; set/initialize drm_buddy_block::private field;
> + * lock_page for each page allocated; add memory block to lru managers lru
> + * list - this is TBD.
> + *
> + * return: 0 on success
> + * error code otherwise
> + */
> +int xe_devm_alloc_pages(struct xe_tile *tile,
> +						unsigned long npages,
> +						struct list_head *blocks,
> +						unsigned long *pfn)
> +{
> +	struct drm_buddy *mm = &tile->mem.vram_mgr->mm;
> +	struct drm_buddy_block *block, *tmp;
> +	u64 size = npages << PAGE_SHIFT;
> +	int ret = 0, i, j = 0;
> +
> +	ret = drm_buddy_alloc_blocks(mm, 0, mm->size, size, PAGE_SIZE,
> +						blocks, DRM_BUDDY_TOPDOWN_ALLOCATION);
> +
> +	if (unlikely(ret))
> +		return ret;
> +
> +	list_for_each_entry_safe(block, tmp, blocks, link) {
> +		struct xe_mem_region *mr = &tile->mem.vram;
> +		u64 block_pfn_first, pages_per_block;
> +		struct xe_svm_block_meta *meta;
> +		u32 meta_size;
> +
> +		size = drm_buddy_block_size(mm, block);
> +		pages_per_block = size >> PAGE_SHIFT;
> +		meta_size = BITS_TO_BYTES(pages_per_block) +
> +					sizeof(struct xe_svm_block_meta);
> +		meta = kzalloc(meta_size, GFP_KERNEL);
> +		bitmap_fill(meta->bitmap, pages_per_block);
> +		meta->tile = tile;
> +		block->private = meta;
> +		block_pfn_first =
> +					block_offset_to_pfn(mr, drm_buddy_block_offset(block));
> +		for(i = 0; i < pages_per_block; i++) {
> +			struct page *page;
> +
> +			pfn[j++] = block_pfn_first + i;
> +			page = pfn_to_page(block_pfn_first + i);
> +			/**Lock page per hmm requirement, see hmm.rst.*/
> +			zone_device_page_init(page);
> +			page->zone_device_data = block;
> +		}
> +	}
> +
> +	return ret;
> +}
> +
> +/**
> + * xe_devm_free_blocks() - free all memory blocks
> + *
> + * @blocks: memory blocks list head
> + */
> +void xe_devm_free_blocks(struct list_head *blocks)
>  {
> +	struct drm_buddy_block *block, *tmp;
> +
> +	list_for_each_entry_safe(block, tmp, blocks, link)
> +		free_block(block);
>  }
>  
>  static const struct dev_pagemap_ops xe_devm_pagemap_ops = {
> -- 
> 2.26.3
> 

^ permalink raw reply	[flat|nested] 72+ messages in thread

* Re: [v2 24/31] drm/xe/svm: Create and destroy xe svm
  2024-04-09 20:17 ` [v2 24/31] drm/xe/svm: Create and destroy xe svm Oak Zeng
@ 2024-04-10 22:25   ` Matthew Brost
  0 siblings, 0 replies; 72+ messages in thread
From: Matthew Brost @ 2024-04-10 22:25 UTC (permalink / raw)
  To: Oak Zeng
  Cc: intel-xe, himal.prasad.ghimiray, krishnaiah.bommu,
	Thomas.Hellstrom, brian.welty

On Tue, Apr 09, 2024 at 04:17:35PM -0400, Oak Zeng wrote:
> Introduce a data structure xe_svm to represent a shared virtual
> address space b/t CPU program and GPU program. Each process can
> only have maximumly one xe_svm instance. One xe_svm can have
> multiple gpu vm.
> 
> Introduce helper functions to create and destroy xe_svm instance.
> Once xe_svm instance is created, it is added to a global hash table
> keyed by mm_struct. Later on we can retrieve xe_svm using mm_struct.
> 

I don't think this is needed at all; I will explain a bit later in the
series, but I'm quite sure this can be dropped entirely.

Matt

> Signed-off-by: Oak Zeng <oak.zeng@intel.com>
> ---
>  drivers/gpu/drm/xe/Makefile |  1 +
>  drivers/gpu/drm/xe/xe_svm.c | 77 +++++++++++++++++++++++++++++++++++++
>  drivers/gpu/drm/xe/xe_svm.h | 23 +++++++++++
>  3 files changed, 101 insertions(+)
>  create mode 100644 drivers/gpu/drm/xe/xe_svm.c
> 
> diff --git a/drivers/gpu/drm/xe/Makefile b/drivers/gpu/drm/xe/Makefile
> index cd5213ba182b..f89d77b6d654 100644
> --- a/drivers/gpu/drm/xe/Makefile
> +++ b/drivers/gpu/drm/xe/Makefile
> @@ -129,6 +129,7 @@ xe-y += xe_bb.o \
>  	xe_sa.o \
>  	xe_sched_job.o \
>  	xe_step.o \
> +	xe_svm.o \
>  	xe_svm_devmem.o \
>  	xe_sync.o \
>  	xe_tile.o \
> diff --git a/drivers/gpu/drm/xe/xe_svm.c b/drivers/gpu/drm/xe/xe_svm.c
> new file mode 100644
> index 000000000000..416cfc81c053
> --- /dev/null
> +++ b/drivers/gpu/drm/xe/xe_svm.c
> @@ -0,0 +1,77 @@
> +// SPDX-License-Identifier: MIT
> +/*
> + * Copyright © 2023 Intel Corporation
> + */
> +
> +#include <linux/mutex.h>
> +#include <linux/mm_types.h>
> +#include <linux/kernel.h>
> +#include <linux/hashtable.h>
> +#include "xe_svm.h"
> +
> +#define XE_MAX_SVM_PROCESS 5 /* Maximumly support 32 SVM process*/
> +DEFINE_HASHTABLE(xe_svm_table, XE_MAX_SVM_PROCESS);
> +
> +/**
> + * xe_create_svm() - create a svm instance
> + *
> + * one xe_svm struct represent a shared address space
> + * between cpu and gpu program. So one xe_svm is associated
> + * to one mm_struct.
> + *
> + * If xe_svm for this process already exists, just return
> + * it; otherwise create one.
> + *
> + * Return the created xe svm struct pointer
> + */
> +struct xe_svm *xe_create_svm(void)
> +{
> +	struct mm_struct *mm = current->mm;
> +	struct xe_svm *svm;
> +
> +	svm = xe_lookup_svm_by_mm(mm);
> +	if (svm)
> +		return svm;
> +
> +	svm = kzalloc(sizeof(struct xe_svm), GFP_KERNEL);
> +	svm->mm = mm;
> +	mutex_init(&svm->mutex);
> +	INIT_LIST_HEAD(&svm->vm_list);
> +	/** Add svm to global xe_svm_table hash table
> +	 *  use mm as key so later we can retrieve svm using mm
> +	 */
> +	hash_add_rcu(xe_svm_table, &svm->hnode, (uintptr_t)mm);
> +	return svm;
> +}
> +
> +/**
> + * xe_destroy_svm() - destroy a svm process
> + *
> + * @svm: the xe_svm to destroy
> + */
> +void xe_destroy_svm(struct xe_svm *svm)
> +{
> +	BUG_ON(list_empty(&svm->vm_list));
> +	hash_del_rcu(&svm->hnode);
> +	mutex_destroy(&svm->mutex);
> +	kfree(svm);
> +}
> +
> +
> +/**
> + * xe_lookup_svm_by_mm() - retrieve xe_svm from mm struct
> + *
> + * @mm: the mm struct of the svm to retrieve
> + *
> + * Return the xe_svm struct pointer, or NULL if fail
> + */
> +struct xe_svm *xe_lookup_svm_by_mm(struct mm_struct *mm)
> +{
> +	struct xe_svm *svm;
> +
> +	hash_for_each_possible_rcu(xe_svm_table, svm, hnode, (uintptr_t)mm)
> +		if (svm->mm == mm)
> +			return svm;
> +
> +	return NULL;
> +}
> diff --git a/drivers/gpu/drm/xe/xe_svm.h b/drivers/gpu/drm/xe/xe_svm.h
> index 92a3ee90d5a7..066740fb93f5 100644
> --- a/drivers/gpu/drm/xe/xe_svm.h
> +++ b/drivers/gpu/drm/xe/xe_svm.h
> @@ -11,6 +11,29 @@
>  #include "xe_device.h"
>  #include "xe_assert.h"
>  
> +
> +/**
> + * struct xe_svm - data structure to represent a shared
> + * virtual address space from device side. xe_svm and
> + * mm_struct has a 1:1 relationship.
> + */
> +struct xe_svm {
> +	/** @mm: The mm_struct corresponding to this xe_svm */
> +	struct mm_struct *mm;
> +	/**
> +	 * @mutex: A lock protects below vm_list
> +	 */
> +	struct mutex mutex;
> +	/** @hnode: used to add this svm to a global xe_svm_hash table*/
> +	struct hlist_node hnode;
> +	/** @vm_list: a list gpu vm in this svm space */
> +	struct list_head vm_list;
> +};
> +
> +extern struct xe_svm *xe_create_svm(void);
> +void xe_destroy_svm(struct xe_svm *svm);
> +extern struct xe_svm *xe_lookup_svm_by_mm(struct mm_struct *mm);
> +
>  /**
>   * xe_mem_region_pfn_to_dpa() - Calculate page's dpa from pfn
>   *
> -- 
> 2.26.3
> 

^ permalink raw reply	[flat|nested] 72+ messages in thread

* Re: [v2 26/31] drm/xe: Make function lookup_vma public
  2024-04-09 20:17 ` [v2 26/31] drm/xe: Make function lookup_vma public Oak Zeng
@ 2024-04-10 22:26   ` Matthew Brost
  0 siblings, 0 replies; 72+ messages in thread
From: Matthew Brost @ 2024-04-10 22:26 UTC (permalink / raw)
  To: Oak Zeng
  Cc: intel-xe, himal.prasad.ghimiray, krishnaiah.bommu,
	Thomas.Hellstrom, brian.welty

On Tue, Apr 09, 2024 at 04:17:37PM -0400, Oak Zeng wrote:
> Public this function as it will be used by later patches. Also
> rename it to xe_vm_lookup_vma
> 

Like the previous patch, I'm pretty sure this can be dropped too. Again,
I will fully explain later.

Matt

> Signed-off-by: Oak Zeng <oak.zeng@intel.com>
> ---
>  drivers/gpu/drm/xe/xe_gt_pagefault.c | 10 ++++++++--
>  drivers/gpu/drm/xe/xe_vm.h           |  1 +
>  2 files changed, 9 insertions(+), 2 deletions(-)
> 
> diff --git a/drivers/gpu/drm/xe/xe_gt_pagefault.c b/drivers/gpu/drm/xe/xe_gt_pagefault.c
> index 707a3466f36b..668984f0769e 100644
> --- a/drivers/gpu/drm/xe/xe_gt_pagefault.c
> +++ b/drivers/gpu/drm/xe/xe_gt_pagefault.c
> @@ -80,7 +80,13 @@ static bool vma_matches(struct xe_vma *vma, u64 page_addr)
>  	return true;
>  }
>  
> -static struct xe_vma *lookup_vma(struct xe_vm *vm, u64 page_addr)
> +/**
> + * xe_vm_lookup_vma() - look up a vma from address
> + *
> + * @vm: the xe_vm that the vma resides in
> + * @page_address: address to look up
> + */
> +struct xe_vma *xe_vm_lookup_vma(struct xe_vm *vm, u64 page_addr)
>  {
>  	struct xe_vma *vma = NULL;
>  
> @@ -166,7 +172,7 @@ static int handle_pagefault(struct xe_gt *gt, struct pagefault *pf)
>  		ret = -ENOENT;
>  		goto unlock_vm;
>  	}
> -	vma = lookup_vma(vm, pf->page_addr);
> +	vma = xe_vm_lookup_vma(vm, pf->page_addr);
>  	if (!vma) {
>  		ret = -EINVAL;
>  		goto unlock_vm;
> diff --git a/drivers/gpu/drm/xe/xe_vm.h b/drivers/gpu/drm/xe/xe_vm.h
> index 4860747592ad..d55330988e32 100644
> --- a/drivers/gpu/drm/xe/xe_vm.h
> +++ b/drivers/gpu/drm/xe/xe_vm.h
> @@ -306,3 +306,4 @@ struct xe_vm_snapshot *xe_vm_snapshot_capture(struct xe_vm *vm);
>  void xe_vm_snapshot_capture_delayed(struct xe_vm_snapshot *snap);
>  void xe_vm_snapshot_print(struct xe_vm_snapshot *snap, struct drm_printer *p);
>  void xe_vm_snapshot_free(struct xe_vm_snapshot *snap);
> +struct xe_vma *xe_vm_lookup_vma(struct xe_vm *vm, u64 page_addr);
> -- 
> 2.26.3
> 

^ permalink raw reply	[flat|nested] 72+ messages in thread

* Re: [v2 27/31] drm/xe/svm: Handle CPU page fault
  2024-04-09 20:17 ` [v2 27/31] drm/xe/svm: Handle CPU page fault Oak Zeng
@ 2024-04-11  2:07   ` Matthew Brost
  2024-04-12 17:24     ` Zeng, Oak
  2024-06-07  4:30     ` Zeng, Oak
  0 siblings, 2 replies; 72+ messages in thread
From: Matthew Brost @ 2024-04-11  2:07 UTC (permalink / raw)
  To: Oak Zeng
  Cc: intel-xe, himal.prasad.ghimiray, krishnaiah.bommu,
	Thomas.Hellstrom, brian.welty

On Tue, Apr 09, 2024 at 04:17:38PM -0400, Oak Zeng wrote:
> Under the picture of svm, CPU and GPU program share one same
> virtual address space. The backing store of this virtual address
> space can be either in system memory or device memory. Since GPU
> device memory is remaped as DEVICE_PRIVATE, CPU can't access it.
> Any CPU access to device memory causes a page fault. Implement
> a page fault handler to migrate memory back to system memory and
> map it to CPU page table so the CPU program can proceed.
> 
> Also unbind this page from GPU side, and free the original GPU
> device page
> 
> Signed-off-by: Oak Zeng <oak.zeng@intel.com>
> Co-developed-by: Niranjana Vishwanathapura <niranjana.vishwanathapura@intel.com>
> Signed-off-by: Niranjana Vishwanathapura <niranjana.vishwanathapura@intel.com>
> Cc: Matthew Brost <matthew.brost@intel.com>
> Cc: Thomas Hellström <thomas.hellstrom@intel.com>
> Cc: Brian Welty <brian.welty@intel.com>
> ---
>  drivers/gpu/drm/xe/Makefile         |   1 +
>  drivers/gpu/drm/xe/xe_svm.h         |   8 +-
>  drivers/gpu/drm/xe/xe_svm_devmem.c  |   7 +-
>  drivers/gpu/drm/xe/xe_svm_migrate.c | 222 ++++++++++++++++++++++++++++
>  4 files changed, 230 insertions(+), 8 deletions(-)
>  create mode 100644 drivers/gpu/drm/xe/xe_svm_migrate.c
> 
> diff --git a/drivers/gpu/drm/xe/Makefile b/drivers/gpu/drm/xe/Makefile
> index f89d77b6d654..65289acdd563 100644
> --- a/drivers/gpu/drm/xe/Makefile
> +++ b/drivers/gpu/drm/xe/Makefile
> @@ -131,6 +131,7 @@ xe-y += xe_bb.o \
>  	xe_step.o \
>  	xe_svm.o \
>  	xe_svm_devmem.o \
> +	xe_svm_migrate.o \

See comments about file organization; the same thing applies here. Let's
put all of the svm implementation in a single file.

>  	xe_sync.o \
>  	xe_tile.o \
>  	xe_tile_sysfs.o \
> diff --git a/drivers/gpu/drm/xe/xe_svm.h b/drivers/gpu/drm/xe/xe_svm.h
> index f601dffe3fc1..c9e4239c44b4 100644
> --- a/drivers/gpu/drm/xe/xe_svm.h
> +++ b/drivers/gpu/drm/xe/xe_svm.h
> @@ -7,11 +7,11 @@
>  #define __XE_SVM_H
>  
>  #include <linux/mm_types.h>
> +#include <linux/mm.h>
>  #include "xe_device_types.h"
>  #include "xe_device.h"
>  #include "xe_assert.h"
> -
> -struct xe_vm;
> +#include "xe_vm_types.h"
>  
>  /**
>   * struct xe_svm - data structure to represent a shared
> @@ -31,6 +31,9 @@ struct xe_svm {
>  	struct list_head vm_list;
>  };
>  
> +#define xe_svm_for_each_vm(svm, vm)					\
> +		list_for_each_entry(vm, &svm->vm_list, svm_link)
> +

Don't think this is needed, see below.

>  extern struct xe_svm *xe_create_svm(void);
>  void xe_destroy_svm(struct xe_svm *svm);
>  extern struct xe_svm *xe_lookup_svm_by_mm(struct mm_struct *mm);
> @@ -79,4 +82,5 @@ int xe_devm_alloc_pages(struct xe_tile *tile,
>  
>  void xe_devm_free_blocks(struct list_head *blocks);
>  void xe_devm_page_free(struct page *page);
> +vm_fault_t xe_svm_migrate_to_sram(struct vm_fault *vmf);
>  #endif
> diff --git a/drivers/gpu/drm/xe/xe_svm_devmem.c b/drivers/gpu/drm/xe/xe_svm_devmem.c
> index 088ac209ad80..32ada458f1dd 100644
> --- a/drivers/gpu/drm/xe/xe_svm_devmem.c
> +++ b/drivers/gpu/drm/xe/xe_svm_devmem.c
> @@ -37,11 +37,6 @@ struct xe_svm_block_meta {
>  	unsigned long bitmap[];
>  };
>  
> -static vm_fault_t xe_devm_migrate_to_ram(struct vm_fault *vmf)
> -{
> -	return 0;
> -}
> -
>  static u64 block_offset_to_pfn(struct xe_mem_region *mr, u64 offset)
>  {
>  	/** DRM buddy's block offset is 0-based*/
> @@ -168,7 +163,7 @@ void xe_devm_free_blocks(struct list_head *blocks)
>  
>  static const struct dev_pagemap_ops xe_devm_pagemap_ops = {
>  	.page_free = xe_devm_page_free,
> -	.migrate_to_ram = xe_devm_migrate_to_ram,
> +	.migrate_to_ram = xe_svm_migrate_to_sram,

Again, with a single file this will be a static function; no reason to
export it.

>  };
>  
>  /**
> diff --git a/drivers/gpu/drm/xe/xe_svm_migrate.c b/drivers/gpu/drm/xe/xe_svm_migrate.c
> new file mode 100644
> index 000000000000..0db831af098e
> --- /dev/null
> +++ b/drivers/gpu/drm/xe/xe_svm_migrate.c
> @@ -0,0 +1,222 @@
> +// SPDX-License-Identifier: MIT
> +/*
> + * Copyright © 2023 Intel Corporation
> + */
> +
> +#include <linux/gfp.h>
> +#include <linux/migrate.h>
> +#include <linux/dma-mapping.h>
> +#include <linux/dma-fence.h>
> +#include <linux/bitops.h>
> +#include <linux/bitmap.h>
> +#include <linux/kernel.h>
> +#include <linux/slab.h>
> +#include <drm/drm_buddy.h>
> +#include "xe_device_types.h"
> +#include "xe_device.h"
> +#include "xe_trace.h"
> +#include "xe_migrate.h"
> +#include "xe_ttm_vram_mgr_types.h"
> +#include "xe_assert.h"
> +#include "xe_pt.h"
> +#include "xe_svm.h"
> +#include "xe_vm.h"
> +
> +
> +/**
> + * alloc_host_page() - allocate one host page for the fault vma
> + *
> + * @dev: (GPU) device that will access the allocated page
> + * @vma: the fault vma that we need allocate page for
> + * @addr: the fault address. The allocated page is for this address
> + * @dma_addr: used to output the dma address of the allocated page.
> + * This dma address will be used for gpu to access this page. GPU
> + * access host page through a dma mapped address.
> + * @pfn: used to output the pfn of the allocated page.
> + *
> + * This function allocates one host page for the specified vma. It
> + * also does some preparation work for GPU access to this page, such
> + * as mapping this page to the iommu (by calling dma_map_page).
> + *
> + * When this function returns, the page is locked.
> + *
> + * Return struct page pointer when success
> + * NULL otherwise
> + */
> +static struct page *alloc_host_page(struct device *dev,
> +							 struct vm_area_struct *vma,
> +							 unsigned long addr,
> +							 dma_addr_t *dma_addr,
> +							 unsigned long *pfn)

Weird alignment, and also I don't think we want to allocate a page at a
time...

Beyond that, can't say I'm a fan of 2 arguments being returned and
populated here either (dma_addr_t *dma_addr, unsigned long *pfn). I
haven't seen a lot of that style of function in Linux.

Probably makes more sense to have a function which allocates all the
pages, locks them, and populates the pfn array (migrate_pfn) rather than
doing this a page at a time.
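To make that concrete, here is a rough userspace sketch of the all-or-nothing shape (malloc() stands in for alloc_page_vma() and the array slot for the migrate_pfn entry; the names are hypothetical and this is purely illustrative, not xe code):

```c
#include <stdlib.h>
#include <stddef.h>

/*
 * Allocate npages "pages", populating the pages array; on any
 * failure, free what was already allocated and report an error so
 * the caller sees an all-or-nothing result instead of handling
 * failure one page at a time.
 */
static int alloc_all_pages(void **pages, size_t npages)
{
	for (size_t i = 0; i < npages; i++) {
		pages[i] = malloc(4096);
		if (!pages[i]) {
			while (i--)
				free(pages[i]);
			return -1;
		}
	}
	return 0;
}
```

The kernel version would additionally lock each page and fill the migrate pfn array in the same loop.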

> +{
> +	struct page *page;
> +
> +	page = alloc_page_vma(GFP_HIGHUSER, vma, addr);
> +	if (unlikely(!page))
> +		return NULL;
> +
> +	/**Lock page per hmm requirement, see hmm.rst*/
> +	lock_page(page);
> +	*dma_addr = dma_map_page(dev, page, 0, PAGE_SIZE, DMA_FROM_DEVICE);

The device is writing to these pages so I think DMA_BIDIRECTIONAL is
needed, right? As mentioned above I think this should be broken out into
a different step too.

> +	if (unlikely(dma_mapping_error(dev, *dma_addr))) {
> +		unlock_page(page);
> +		__free_page(page);
> +		return NULL;
> +	}
> +
> +	*pfn = migrate_pfn(page_to_pfn(page));
> +	return page;
> +}
> +
> +static void free_host_page(struct page *page)
> +{
> +	unlock_page(page);
> +	put_page(page);
> +}
> +
> +/**
> + * migrate_page_vram_to_ram() - migrate one page from vram to ram
> + *
> + * @vma: The vma that the page is mapped to
> + * @addr: The virtual address that the page is mapped to
> + * @src_pfn: src page's page frame number
> + * @dst_pfn: used to return destination page (in system ram)'s pfn
> + *
> + * Allocate one page in system ram and copy memory from device memory
> + * to system ram.
> + *
> + * Return: 0 if this page is already in sram (no need to migrate)
> + * 1: successfully migrated this page from vram to sram.
> + * error code otherwise
> + */
> +static int migrate_page_vram_to_ram(struct vm_area_struct *vma, unsigned long addr,
> +						unsigned long src_pfn, unsigned long *dst_pfn)
> +{

We definitely don't want to copy 1 page at a time. I touch on this in [1].
Basically this is going to perform poorly unless we use larger copies; the
migrate code supports non-contiguous sram addresses, and vram addresses
will likely be contiguous due to the buddy allocator.
 
[1] https://patchwork.freedesktop.org/patch/588548/?series=132229&rev=1

> +	struct xe_mem_region *mr;
> +	struct xe_tile *tile;
> +	struct xe_device *xe;
> +	struct device *dev;
> +	dma_addr_t dma_addr = 0;
> +	struct dma_fence *fence;
> +	struct page *host_page;
> +	struct page *src_page;
> +	u64 src_dpa;
> +
> +	src_page = migrate_pfn_to_page(src_pfn);
> +	if (unlikely(!src_page || !(src_pfn & MIGRATE_PFN_MIGRATE)))

I'm going to say this is a bug if !src_page ||
!is_zone_device_page(src_page) || !(src_pfn & MIGRATE_PFN_MIGRATE) and
we return -EFAULT (or another error code if that makes more sense). We
are migrating from the device where we know we have backing store from
the original fault.

> +		return 0;
> +
> +	mr = xe_page_to_mem_region(src_page);
> +	tile = xe_mem_region_to_tile(mr);
> +	xe = tile_to_xe(tile);
> +	dev = xe->drm.dev;
> +
> +	src_dpa = xe_mem_region_pfn_to_dpa(mr, src_pfn);
> +	host_page = alloc_host_page(dev, vma, addr, &dma_addr, dst_pfn);
> +	if (!host_page)
> +		return -ENOMEM;
> +
> +	fence = xe_migrate_pa(tile->migrate, src_dpa, true,
> +						dma_addr, false, PAGE_SIZE);
> +	if (IS_ERR(fence)) {
> +		dma_unmap_page(dev, dma_addr, PAGE_SIZE, DMA_FROM_DEVICE);
> +		free_host_page(host_page);
> +		return PTR_ERR(fence);
> +	}
> +
> +	dma_fence_wait(fence, false);

Even if we did want to migrate a page at a time, we only need to wait on
the last fence due to the ordered nature of exec queues.

> +	dma_fence_put(fence);
> +	dma_unmap_page(dev, dma_addr, PAGE_SIZE, DMA_FROM_DEVICE);

With the above, we will likely unmap all dma pages in a single function
once the last fence is signaled.

> +	return 1;
> +}
> +
> +/**
> + * xe_svm_migrate_to_sram() - Migrate memory back to sram on CPU page fault
> + *
> + * @vmf: cpu vm fault structure, contains fault information such as vma etc.
> + *
> + * Note, this is in CPU's vm fault handler, caller holds the mmap read lock.
> + *
> + * This function migrates one gpu vma which contains the fault address to sram.
> + * We try to maintain a 1:1 mapping b/t the CPU vma and gpu vma (i.e., create one
> + * gpu vma for one cpu vma initially and try not to split it). So this scheme ends
> + * up migrating at the vma granularity. This might not be the best performant scheme.
> + *
> + * This can be tuned with a migration granularity for performance, for example,
> + * migrate 2M for each CPU page fault, or let user specify how much to migrate.
> + * This is more complex due to vma splitting.
> + *
> + * This function should also update GPU page table, so the fault virtual address
> + * points to the same sram location from GPU side. This is TBD.
> + *
> + * Return:
> + * 0 on success
> + * VM_FAULT_SIGBUS: failed to migrate page to system memory, application
> + * will be signaled a SIGBUS
> + */
> +vm_fault_t xe_svm_migrate_to_sram(struct vm_fault *vmf)
> +{
> +	struct xe_mem_region *mr = xe_page_to_mem_region(vmf->page);
> +	struct xe_tile *tile = xe_mem_region_to_tile(mr);
> +	struct xe_device *xe = tile_to_xe(tile);
> +	struct vm_area_struct *vma = vmf->vma;
> +	struct mm_struct *mm = vma->vm_mm;
> +	struct xe_svm *svm = xe_lookup_svm_by_mm(mm);

I don't think this is needed... More below.

> +	unsigned long addr = vma->vm_start;
> +	u64 npages = vma_pages(vma);
> +	struct xe_vma *xe_vma;
> +	vm_fault_t ret = 0;
> +	struct xe_vm *vm;
> +	void *buf;
> +	int i;
> +
> +	struct migrate_vma migrate_vma = {
> +		.vma		= vmf->vma,
> +		.start		= vma->vm_start,
> +		.end		= vma->vm_end,

So I know in my PoC I had the fault user pointer (xe_vma) == struct
vm_area_struct of the GPU fault. That is definitely wrong. We likely
want to allocate a sub-range of the vm_area_struct for the xe_vma; we can
call this a chunk size. Logical chunk sizes would be aligned 2MB, 64k, and
finally 4k, trying the largest first... The chunk sizes are
trivial as we likely can just have a table with values; the key here is
that the vm_area_struct vm_start / vm_end are not what we want to use here,
rather xe_vma_start(vma) and xe_vma_end(vma). I think we get the xe_vma
from the faulting page's vmf->page->zone_device_data field unless you have
another use for that field...

I also commented on my patch with my suggestion to implement chunk sizes too.
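Roughly, the chunk selection could look like this userspace sketch (the
size table, helper name, and fallback are hypothetical, not existing xe
code): walk a table of aligned sizes, largest first, and take the first
aligned chunk around the fault address that still fits inside the vma:

```c
#include <stdint.h>
#include <stddef.h>

/* Hypothetical chunk-size table: 2M, 64K, 4K, largest first. */
static const uint64_t chunk_sizes[] = { 0x200000, 0x10000, 0x1000 };

/*
 * Pick the largest aligned chunk containing fault_addr that fits
 * within [vm_start, vm_end). Falls back to a single 4K page.
 */
static uint64_t pick_chunk(uint64_t fault_addr, uint64_t vm_start,
			   uint64_t vm_end, uint64_t *size_out)
{
	for (size_t i = 0; i < sizeof(chunk_sizes) / sizeof(chunk_sizes[0]); i++) {
		uint64_t size = chunk_sizes[i];
		uint64_t start = fault_addr & ~(size - 1); /* align down */

		if (start >= vm_start && start + size <= vm_end) {
			*size_out = size;
			return start;
		}
	}
	*size_out = 0x1000;
	return fault_addr & ~0xfffull;
}
```

The returned range would then become the xe_vma_start/xe_vma_end of the
migrated chunk rather than the whole CPU vma.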

> +		.pgmap_owner	= xe,

Again helper for this.

> +		.flags		= MIGRATE_VMA_SELECT_DEVICE_PRIVATE,
> +		.fault_page = vmf->page,
> +	};
> +
> +	buf = kvcalloc(npages, 2* sizeof(*migrate_vma.src), GFP_KERNEL);
> +	migrate_vma.src = buf;
> +	migrate_vma.dst = buf + npages;
> +	if (migrate_vma_setup(&migrate_vma) < 0) {
> +		ret = VM_FAULT_SIGBUS;
> +		goto free_buf;
> +	}
> +
> +	if (!migrate_vma.cpages)

This is an error; we need to set a return value.

> +		goto free_buf;
> +

We should probably check migrate.cpages != npages too, as I also think
this is an error.

> +	for (i = 0; i < npages; i++) {
> +		ret = migrate_page_vram_to_ram(vma, addr, migrate_vma.src[i],
> +							migrate_vma.dst + i);
> +		if (ret < 0) {
> +			ret = VM_FAULT_SIGBUS;
> +			break;
> +		}
> +
> +		/** Migration has been successful, free source page */
> +		if (ret == 1) {
> +			struct page *src_page = migrate_pfn_to_page(migrate_vma.src[i]);
> +
> +			xe_devm_page_free(src_page);
> +		}
> +
> +		addr += PAGE_SIZE;
> +	}

I touched on this above; this should be reworked to roughly:

- alloc pages and populate migrate_vma.dst
- dma map sram pages
- migrate a chunk of contigous vram addresses at a time
- wait on last dma fence from migrate
- unmap dma pages
- unlock and free all pages
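The "chunk of contiguous vram addresses" step can be illustrated with a
small userspace sketch (hypothetical helper, not xe code): coalesce the
pfn array into runs, so each run becomes a single blitter copy instead of
one copy per page:

```c
#include <stdint.h>
#include <stddef.h>

/*
 * Coalesce a pfn array into contiguous runs. Each run
 * (start pfn, length) can then be handed to the copy engine as a
 * single transfer.
 */
static size_t coalesce_pfn_runs(const uint64_t *pfns, size_t npages,
				uint64_t *run_start, size_t *run_len)
{
	size_t nruns = 0;

	for (size_t i = 0; i < npages; i++) {
		if (nruns &&
		    pfns[i] == run_start[nruns - 1] + run_len[nruns - 1]) {
			run_len[nruns - 1]++;	/* extend current run */
		} else {
			run_start[nruns] = pfns[i];
			run_len[nruns] = 1;
			nruns++;
		}
	}
	return nruns;
}
```

With the buddy allocator the vram side will often collapse into very few
runs, which is exactly why batching pays off here.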

> +
> +	xe_svm_for_each_vm(svm, vm) {
> +		xe_assert(xe, vm->mm == mm);
> +		xe_vma = xe_vm_lookup_vma(vm, vmf->address);
> +		if (xe_vma)
> +			xe_vm_invalidate_vma(xe_vma);
> +	}

I've touched on why this isn't needed... I think one of these
migrate_vma_* functions will trigger all MMU notifiers registered for
the range. The notifier owns the invalidate then.

Beyond this, maybe I'm confused about a few things and how this all fits
together. Doesn't every user process have its own unique mm, fd, and vm
(e.g. its own address space)? If a user wants a shared address space, then
use threads with a single mm, fd, and vm.

So even if we had to resolve the xe_vma's and do an invalidate here, I'm
very confused about what this is doing. Is this the case with multiple
devices where each VM points to a different device? Again, in that case I
don't think an xe_svm structure would be needed; on GPU fault we should
be able to detect from the faulting page's zone_device_data and pgmap
owner if the fault already has a physical backing on another GPU, and
resolve how to map it into the GPU with a fault... Jason suggests this in
the following thread [2] and I think I agree with him.

[2] https://lore.kernel.org/all/5495090e-dee1-4c8e-91bc-240632fd3e35@amd.com/T/

> +	migrate_vma_pages(&migrate_vma);

This logic is going to change but ... 

On an error I think we only want to call migrate_vma_finalize to revert
pages back to the original state (i.e. migrate_vma_pages commits the
page changes which we don't want to do on an error).

> +	migrate_vma_finalize(&migrate_vma);
> +free_buf:
> +	kvfree(buf);
> +	return 0;

I don't think 0 should blindly be returned here; if there is an error,
return VM_FAULT_SIGBUS. We likely want a high-level error message too.

Matt

> +}
> -- 
> 2.26.3
> 

^ permalink raw reply	[flat|nested] 72+ messages in thread

* Re: [v2 28/31] drm/xe/svm: Introduce helper to migrate vma to vram
  2024-04-09 20:17 ` [v2 28/31] drm/xe/svm: Introduce helper to migrate vma to vram Oak Zeng
@ 2024-04-11  2:49   ` Matthew Brost
  2024-04-12 21:21     ` Zeng, Oak
  0 siblings, 1 reply; 72+ messages in thread
From: Matthew Brost @ 2024-04-11  2:49 UTC (permalink / raw)
  To: Oak Zeng
  Cc: intel-xe, himal.prasad.ghimiray, krishnaiah.bommu,
	Thomas.Hellstrom, brian.welty

On Tue, Apr 09, 2024 at 04:17:39PM -0400, Oak Zeng wrote:
> Introduce a helper function xe_svm_migrate_vma_to_vram.
> 
> Since the source pages of the svm range can be physically not
> contiguous, and the destination vram pages can also be not
> contiguous, there is no easy way to migrate multiple pages per
> blitter command. We do page by page migration for now.
> 
> Migration is best effort. Even if we fail to migrate some pages,
> we will try to migrate the remaining pages.
> 
> FIXME: Use one blitter command to copy when both src and dst are
> physically contiguous
> 

Yep, touched on this throughout the series. Only vram needs to be
contiguous though, as we dynamically create PT mappings for sram pages in
the migrate code. Getting this in is a must and should be done immediately
IMO, as this is a very, very basic performance thing we know needs to be
done. We will likely have to tune this code quite a bit for performance so
getting known things done would be helpful.

> FIXME: when a vma is partially migrated, split vma as we assume
> no mixture vma placement.
> 

Agree we do not want to support partial migrations. We likely want to
return -EAGAIN for something and fall back to a smaller xe_vma chunk
size, which I discussed in [1] and added a comment on in [2].

Migration should be best effort too; if we fail to migrate we can always
leave the backing store in sram.

I do have a question though: when do we get partial migrations? A user
having called mlock on some of the pages? I just want to make sure I
fully understand that case.

[1] https://patchwork.freedesktop.org/patch/588526/?series=132229&rev=1
[2] https://patchwork.freedesktop.org/patch/588528/?series=132229&rev=1

> Signed-off-by: Oak Zeng <oak.zeng@intel.com>
> Co-developed-by: Niranjana Vishwanathapura <niranjana.vishwanathapura@intel.com>
> Signed-off-by: Niranjana Vishwanathapura <niranjana.vishwanathapura@intel.com>
> Cc: Matthew Brost <matthew.brost@intel.com>
> Cc: Thomas Hellström <thomas.hellstrom@intel.com>
> Cc: Brian Welty <brian.welty@intel.com>
> ---
>  drivers/gpu/drm/xe/xe_svm.h         |   2 +
>  drivers/gpu/drm/xe/xe_svm_migrate.c | 115 ++++++++++++++++++++++++++++

Same comment on file structure throughout the series apply here too.

>  2 files changed, 117 insertions(+)
> 
> diff --git a/drivers/gpu/drm/xe/xe_svm.h b/drivers/gpu/drm/xe/xe_svm.h
> index c9e4239c44b4..18ce2e3757c5 100644
> --- a/drivers/gpu/drm/xe/xe_svm.h
> +++ b/drivers/gpu/drm/xe/xe_svm.h
> @@ -83,4 +83,6 @@ int xe_devm_alloc_pages(struct xe_tile *tile,
>  void xe_devm_free_blocks(struct list_head *blocks);
>  void xe_devm_page_free(struct page *page);
>  vm_fault_t xe_svm_migrate_to_sram(struct vm_fault *vmf);
> +int xe_svm_migrate_vma_to_vram(struct xe_vm *vm, struct xe_vma *vma,
> +							struct xe_tile *tile);
>  #endif
> diff --git a/drivers/gpu/drm/xe/xe_svm_migrate.c b/drivers/gpu/drm/xe/xe_svm_migrate.c
> index 0db831af098e..ab8dd1f58aa4 100644
> --- a/drivers/gpu/drm/xe/xe_svm_migrate.c
> +++ b/drivers/gpu/drm/xe/xe_svm_migrate.c
> @@ -220,3 +220,118 @@ vm_fault_t xe_svm_migrate_to_sram(struct vm_fault *vmf)
>  	kvfree(buf);
>  	return 0;
>  }
> +
> +/**
> + * xe_svm_migrate_vma_to_vram() - migrate backing store of a vma to vram
> + * Must be called with mmap_read_lock held.
> + * @vm: the vm that the vma belongs to
> + * @vma: the vma to migrate.
> + * @tile: the destination tile which holds the new backing store of the range
> + *
> + * Returns: negative errno on failure, 0 on success
> + */
> +int xe_svm_migrate_vma_to_vram(struct xe_vm *vm,
> +							struct xe_vma *vma,
> +							struct xe_tile *tile)
> +{
> +	struct mm_struct *mm = vm->mm;
> +	unsigned long start = xe_vma_start(vma);
> +	unsigned long end = xe_vma_end(vma);
> +	unsigned long npages = (end - start) >> PAGE_SHIFT;
> +	struct xe_mem_region *mr = &tile->mem.vram;
> +	struct vm_area_struct *vas;
> +
> +	struct migrate_vma migrate = {
> +		.start		= start,
> +		.end		= end,
> +		.pgmap_owner	= tile->xe,

Again helper to assign owner.

> +		.flags          = MIGRATE_VMA_SELECT_SYSTEM,
> +	};
> +	struct device *dev = tile->xe->drm.dev;
> +	dma_addr_t *src_dma_addr;
> +	struct dma_fence *fence;
> +	struct page *src_page;
> +	LIST_HEAD(blocks);
> +	int ret = 0, i;
> +	u64 dst_dpa;
> +	void *buf;
> +
> +	mmap_assert_locked(mm);

This mmap_assert_locked is ambiguous; we should make it clear whether this
is read or write locked. Doesn't it have to be write locked to do the
migrate pages?

A larger question about the locking... The CPU fault handler holds the
mmap lock in write mode, right? 

I'm asking as basically I think at least initially we want to hold the
mmap lock in a way that the GPU handler and CPU handler do not race.
i.e. From fault userptr create in GPU fault handler to issuing the bind
we prevent the CPU fault handler from running.

I'm having issues figuring out how to prevent races between initial
binds of userptrs and userptr invalidates on faulting VMs. This race is
seen any xe_exec_fault_mode for example... So preventing races between
CPU / GPU fault handler with the mmap probably is a good idea initially.
Likely can make the locking finer grained once this is all working and I
figure out how to handle this race better.

> +
> +	vas = find_vma_intersection(mm, start, start + 4);

find_vma should work fine here.

> +	if (!vas)
> +		return -ENOENT;
> +
> +	migrate.vma = vas;
> +	buf = kvcalloc(npages, 2* sizeof(*migrate.src) + sizeof(*src_dma_addr),
> +					GFP_KERNEL);
> +	if(!buf)
> +		return -ENOMEM;
> +	migrate.src = buf;
> +	migrate.dst = migrate.src + npages;
> +	src_dma_addr = (dma_addr_t *) (migrate.dst + npages);
> +	ret = xe_devm_alloc_pages(tile, npages, &blocks, migrate.dst);

Again as I discussed in [3] I think this should be broken out into a
different step with the blocks allocated before this, and here just
populate migrate.dst from the existing blocks.

[3] https://patchwork.freedesktop.org/patch/588523/?series=132229&rev=1

> +	if (ret)
> +		goto kfree_buf;
> +
> +	ret = migrate_vma_setup(&migrate);
> +	if (ret) {
> +		drm_err(&tile->xe->drm, "vma setup returned %d for range [%lx - %lx]\n",
> +				ret, start, end);
> +		goto free_dst_pages;
> +	}
> +
> +	/**FIXME: partial migration of a range print a warning for now.
> +	 * If this message is printed, we need to split xe_vma as we
> +	 * don't support a mixture placement of one vma
> +	 */
> +	if (migrate.cpages != npages)
> +		drm_warn(&tile->xe->drm, "Partial migration for range [%lx - %lx], range is %ld pages, migrate only %ld pages\n",
> +				start, end, npages, migrate.cpages);

As discussed above, we shouldn't support this. We should fall back to
smaller xe_vma chunk size until we find one that works or simply leave
the pages in sram and map those pages to GPU.

> +
> +	/**Migrate page by page for now.
> +	 * Both source pages and destination pages can physically not contiguous,
> +	 * there is no good way to migrate multiple pages per blitter command.
> +	 */

Touched on this a bunch throughout the series, let's do better than
page-at-a-time migration.

The algorithm should be very similar to what I discussed here [4] but with
a few key differences.

- I think the sram pages can be unpopulated (page == NULL) if the user
  has not yet touched the page
- Also I think the MIGRATE_PFN_MIGRATE bit being clear is valid

In these cases this indicates we have to issue a copy for the pages we
are accumulating with contiguous vram addresses.

[4] https://patchwork.freedesktop.org/patch/588526/?series=132229&rev=1
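The accumulate-and-flush loop described above can be sketched in
userspace like this (page_ok stands in for "page populated and
MIGRATE_PFN_MIGRATE set"; the helper is hypothetical, not xe code): grow
a batch while the destination stays contiguous and the page is
migratable, and flush one copy per batch:

```c
#include <stdint.h>
#include <stddef.h>

/*
 * Count how many copies would be issued: a batch grows while pages
 * are valid and dst pfns are contiguous; an invalid page or a
 * discontinuity flushes the pending batch as one copy.
 */
static size_t count_copies(const int *page_ok, const uint64_t *dst_pfn,
			   size_t npages)
{
	size_t copies = 0, batch = 0;

	for (size_t i = 0; i < npages; i++) {
		if (!page_ok[i]) {	/* NULL page or !MIGRATE_PFN_MIGRATE */
			if (batch)
				copies++;	/* flush pending batch */
			batch = 0;
			continue;
		}
		if (batch && dst_pfn[i] != dst_pfn[i - 1] + 1) {
			copies++;	/* discontinuity: flush */
			batch = 0;
		}
		batch++;
	}
	if (batch)
		copies++;
	return copies;
}
```

In the real code each flush would be one xe_migrate call covering the
whole batch, with only the final fence waited on.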

> +	for (i = 0; i < npages; i++) {
> +		src_page = migrate_pfn_to_page(migrate.src[i]);
> +		if (unlikely(!src_page || !(migrate.src[i] & MIGRATE_PFN_MIGRATE)))

Discussed this in the CPU fault patch: once we call migrate_vma_setup,
on subsequent errors we need to call migrate_vma_finalize to revert the
pages to the original state. At least I think so, if I am reading the doc
after this correctly.

Here on error we just free the pages...

Matt

> +			goto free_dst_page;
> +
> +		xe_assert(tile->xe, !is_zone_device_page(src_page));
> +		src_dma_addr[i] = dma_map_page(dev, src_page, 0, PAGE_SIZE, DMA_TO_DEVICE);
> +		if (unlikely(dma_mapping_error(dev, src_dma_addr[i]))) {
> +			drm_warn(&tile->xe->drm, "dma map error for host pfn %lx\n", migrate.src[i]);
> +			goto free_dst_page;
> +		}
> +		dst_dpa = xe_mem_region_pfn_to_dpa(mr, migrate.dst[i]);
> +		fence = xe_migrate_pa(tile->migrate, src_dma_addr[i], false,
> +				dst_dpa, true, PAGE_SIZE);
> +		if (IS_ERR(fence)) {
> +			drm_warn(&tile->xe->drm, "migrate host page (pfn: %lx) to vram failed\n",
> +					migrate.src[i]);
> +			/**Migration is best effort. Even we failed here, we continue*/
> +			goto free_dst_page;
> +		}
> +		/**FIXME: Use the first migration's out fence as the second migration's input fence,
> +		 * and so on. Only wait the out fence of last migration?
> +		 */
> +		dma_fence_wait(fence, false);
> +		dma_fence_put(fence);
> +free_dst_page:
> +		xe_devm_page_free(pfn_to_page(migrate.dst[i]));
> +	}
> +
> +	for (i = 0; i < npages; i++)
> +		if (!(dma_mapping_error(dev, src_dma_addr[i])))
> +			dma_unmap_page(dev, src_dma_addr[i], PAGE_SIZE, DMA_TO_DEVICE);
> +
> +	migrate_vma_pages(&migrate);
> +	migrate_vma_finalize(&migrate);
> +free_dst_pages:
> +	if (ret)
> +		xe_devm_free_blocks(&blocks);
> +kfree_buf:
> +	kfree(buf);
> +	return ret;
> +}
> -- 
> 2.26.3
> 

^ permalink raw reply	[flat|nested] 72+ messages in thread

* Re: [v2 30/31] drm/xe/svm: Add a helper to determine a vma is fault userptr
  2024-04-09 20:17 ` [v2 30/31] drm/xe/svm: Add a helper to determine a vma is fault userptr Oak Zeng
@ 2024-04-11  2:50   ` Matthew Brost
  0 siblings, 0 replies; 72+ messages in thread
From: Matthew Brost @ 2024-04-11  2:50 UTC (permalink / raw)
  To: Oak Zeng
  Cc: intel-xe, himal.prasad.ghimiray, krishnaiah.bommu,
	Thomas.Hellstrom, brian.welty

On Tue, Apr 09, 2024 at 04:17:41PM -0400, Oak Zeng wrote:
> xe_vma_is_fault_userptr is added to determine the vma is
> a fault userptr.
> 
> Signed-off-by: Oak Zeng <oak.zeng@intel.com>
> ---
>  drivers/gpu/drm/xe/xe_vm.h | 5 +++++
>  1 file changed, 5 insertions(+)
> 
> diff --git a/drivers/gpu/drm/xe/xe_vm.h b/drivers/gpu/drm/xe/xe_vm.h
> index d55330988e32..a718f927e362 100644
> --- a/drivers/gpu/drm/xe/xe_vm.h
> +++ b/drivers/gpu/drm/xe/xe_vm.h
> @@ -166,6 +166,11 @@ static inline bool xe_vma_is_userptr(struct xe_vma *vma)
>  		!xe_vma_is_system_allocator(vma);
>  }
>  
> +static inline bool xe_vma_is_fault_userptr(struct xe_vma *vma)
> +{
> +	return xe_vma_is_userptr(vma) && (vma->gpuva.flags & XE_VMA_FAULT_USERPTR);

Presumably we never set XE_VMA_FAULT_USERPTR when xe_vma_is_userptr is
false, so it's probably safe to just check XE_VMA_FAULT_USERPTR here.

Matt

> +}
> +
>  /**
>   * to_userptr_vma() - Return a pointer to an embedding userptr vma
>   * @vma: Pointer to the embedded struct xe_vma
> -- 
> 2.26.3
> 

^ permalink raw reply	[flat|nested] 72+ messages in thread

* Re: [v2 31/31] drm/xe/svm: Migration from sram to vram for system allocator
  2024-04-09 20:17 ` [v2 31/31] drm/xe/svm: Migration from sram to vram for system allocator Oak Zeng
@ 2024-04-11  2:55   ` Matthew Brost
  2024-06-07 17:22     ` Zeng, Oak
  0 siblings, 1 reply; 72+ messages in thread
From: Matthew Brost @ 2024-04-11  2:55 UTC (permalink / raw)
  To: Oak Zeng
  Cc: intel-xe, himal.prasad.ghimiray, krishnaiah.bommu,
	Thomas.Hellstrom, brian.welty

On Tue, Apr 09, 2024 at 04:17:42PM -0400, Oak Zeng wrote:
> If applicable, migrate a vma from sram to vram for system allocator.
> Traditional userptr is not migrated. Only userptr created during
> fault (aka userptr split from a system allocator vma) can be
> migrated.
> 
> FIXME: The migration should be conditional on user memory attributes
> setting. Add this logic when memory attributes are supported
> 
> Signed-off-by: Oak Zeng <oak.zeng@intel.com>
> ---
>  drivers/gpu/drm/xe/xe_gt_pagefault.c | 9 ++++++++-
>  drivers/gpu/drm/xe/xe_vm.c           | 4 ----
>  2 files changed, 8 insertions(+), 5 deletions(-)
> 
> diff --git a/drivers/gpu/drm/xe/xe_gt_pagefault.c b/drivers/gpu/drm/xe/xe_gt_pagefault.c
> index 668984f0769e..c6ba00049964 100644
> --- a/drivers/gpu/drm/xe/xe_gt_pagefault.c
> +++ b/drivers/gpu/drm/xe/xe_gt_pagefault.c
> @@ -20,6 +20,7 @@
>  #include "xe_guc_ct.h"
>  #include "xe_migrate.h"
>  #include "xe_trace.h"
> +#include "xe_svm.h"
>  #include "xe_vm.h"
>  
>  struct pagefault {
> @@ -209,12 +210,18 @@ static int handle_pagefault(struct xe_gt *gt, struct pagefault *pf)
>  
>  	if (xe_vma_is_userptr(vma) && write_locked) {
>  		struct xe_userptr_vma *uvma = to_userptr_vma(vma);
> +		struct xe_userptr *userptr = &uvma->userptr;
>  
>  		spin_lock(&vm->userptr.invalidated_lock);
> -		list_del_init(&uvma->userptr.invalidate_link);
> +		list_del_init(&userptr->invalidate_link);
>  		spin_unlock(&vm->userptr.invalidated_lock);
>  
> +		mmap_read_lock(userptr->notifier.mm);
> +		/**FIXME: Add migration policy here*/
> +		if (xe_vma_is_fault_userptr(vma))
> +			xe_svm_migrate_vma_to_vram(vm, vma, tile);

Agree we need a policy here...

See my comments about locking in [1]; thinking if we migrate we likely
want to hold the mmap lock until at least the bind is issued to
prevent races with the CPU fault handler, at least initially.

[1] https://patchwork.freedesktop.org/patch/588542/?series=132229&rev=1

>  		ret = xe_vma_userptr_pin_pages(uvma);
> +		mmap_read_unlock(userptr->notifier.mm);
>  		if (ret)
>  			goto unlock_vm;
>  
> diff --git a/drivers/gpu/drm/xe/xe_vm.c b/drivers/gpu/drm/xe/xe_vm.c
> index 498b36469d00..8a58fe144a02 100644
> --- a/drivers/gpu/drm/xe/xe_vm.c
> +++ b/drivers/gpu/drm/xe/xe_vm.c
> @@ -71,16 +71,12 @@ int xe_vma_userptr_pin_pages(struct xe_userptr_vma *uvma)
>  	struct xe_vma *vma = &uvma->vma;
>  	struct xe_vm *vm = xe_vma_vm(vma);
>  	struct xe_device *xe = vm->xe;
> -	struct xe_userptr *userptr;
>  	int ret;
>  
>  	lockdep_assert_held(&vm->lock);
>  	xe_assert(xe, xe_vma_is_userptr(vma));
>  
> -	userptr = &uvma->userptr;
> -	mmap_read_lock(userptr->notifier.mm);
>  	ret = xe_userptr_populate_range(uvma);
> -	mmap_read_unlock(userptr->notifier.mm);

Now you won't have the lock here for other callers of this function...
Probably need to have locked / unlocked versions or an argument here.

Matt

>  
>  	return ret;
>  }
> -- 
> 2.26.3
> 

^ permalink raw reply	[flat|nested] 72+ messages in thread

* RE: [v2 27/31] drm/xe/svm: Handle CPU page fault
  2024-04-11  2:07   ` Matthew Brost
@ 2024-04-12 17:24     ` Zeng, Oak
  2024-04-12 18:10       ` Matthew Brost
  2024-06-07  4:30     ` Zeng, Oak
  1 sibling, 1 reply; 72+ messages in thread
From: Zeng, Oak @ 2024-04-12 17:24 UTC (permalink / raw)
  To: Brost, Matthew
  Cc: intel-xe@lists.freedesktop.org, Ghimiray, Himal Prasad,
	Bommu, Krishnaiah, Thomas.Hellstrom@linux.intel.com, Welty, Brian



> -----Original Message-----
> From: Brost, Matthew <matthew.brost@intel.com>
> Sent: Wednesday, April 10, 2024 10:07 PM
> To: Zeng, Oak <oak.zeng@intel.com>
> Cc: intel-xe@lists.freedesktop.org; Ghimiray, Himal Prasad
> <himal.prasad.ghimiray@intel.com>; Bommu, Krishnaiah
> <krishnaiah.bommu@intel.com>; Thomas.Hellstrom@linux.intel.com; Welty,
> Brian <brian.welty@intel.com>
> Subject: Re: [v2 27/31] drm/xe/svm: Handle CPU page fault
> 
> On Tue, Apr 09, 2024 at 04:17:38PM -0400, Oak Zeng wrote:
> > Under the picture of svm, CPU and GPU program share one same
> > virtual address space. The backing store of this virtual address
> > space can be either in system memory or device memory. Since GPU
> > device memory is remaped as DEVICE_PRIVATE, CPU can't access it.
> > Any CPU access to device memory causes a page fault. Implement
> > a page fault handler to migrate memory back to system memory and
> > map it to CPU page table so the CPU program can proceed.
> >
> > Also unbind this page from GPU side, and free the original GPU
> > device page
> >
> > Signed-off-by: Oak Zeng <oak.zeng@intel.com>
> > Co-developed-by: Niranjana Vishwanathapura
> <niranjana.vishwanathapura@intel.com>
> > Signed-off-by: Niranjana Vishwanathapura
> <niranjana.vishwanathapura@intel.com>
> > Cc: Matthew Brost <matthew.brost@intel.com>
> > Cc: Thomas Hellström <thomas.hellstrom@intel.com>
> > Cc: Brian Welty <brian.welty@intel.com>
> > ---
> >  drivers/gpu/drm/xe/Makefile         |   1 +
> >  drivers/gpu/drm/xe/xe_svm.h         |   8 +-
> >  drivers/gpu/drm/xe/xe_svm_devmem.c  |   7 +-
> >  drivers/gpu/drm/xe/xe_svm_migrate.c | 222
> ++++++++++++++++++++++++++++
> >  4 files changed, 230 insertions(+), 8 deletions(-)
> >  create mode 100644 drivers/gpu/drm/xe/xe_svm_migrate.c
> >
> > diff --git a/drivers/gpu/drm/xe/Makefile b/drivers/gpu/drm/xe/Makefile
> > index f89d77b6d654..65289acdd563 100644
> > --- a/drivers/gpu/drm/xe/Makefile
> > +++ b/drivers/gpu/drm/xe/Makefile
> > @@ -131,6 +131,7 @@ xe-y += xe_bb.o \
> >  	xe_step.o \
> >  	xe_svm.o \
> >  	xe_svm_devmem.o \
> > +	xe_svm_migrate.o \
> 
> See comments about file org, same thing applies here. Let's put all of
> the svm implementation in a single file.
> 
> >  	xe_sync.o \
> >  	xe_tile.o \
> >  	xe_tile_sysfs.o \
> > diff --git a/drivers/gpu/drm/xe/xe_svm.h b/drivers/gpu/drm/xe/xe_svm.h
> > index f601dffe3fc1..c9e4239c44b4 100644
> > --- a/drivers/gpu/drm/xe/xe_svm.h
> > +++ b/drivers/gpu/drm/xe/xe_svm.h
> > @@ -7,11 +7,11 @@
> >  #define __XE_SVM_H
> >
> >  #include <linux/mm_types.h>
> > +#include <linux/mm.h>
> >  #include "xe_device_types.h"
> >  #include "xe_device.h"
> >  #include "xe_assert.h"
> > -
> > -struct xe_vm;
> > +#include "xe_vm_types.h"
> >
> >  /**
> >   * struct xe_svm - data structure to represent a shared
> > @@ -31,6 +31,9 @@ struct xe_svm {
> >  	struct list_head vm_list;
> >  };
> >
> > +#define xe_svm_for_each_vm(svm, vm)
> 	\
> > +		list_for_each_entry(vm, &svm->vm_list, svm_link)
> > +
> 
> Don't think this is need, see below.
> 
> >  extern struct xe_svm *xe_create_svm(void);
> >  void xe_destroy_svm(struct xe_svm *svm);
> >  extern struct xe_svm *xe_lookup_svm_by_mm(struct mm_struct *mm);
> > @@ -79,4 +82,5 @@ int xe_devm_alloc_pages(struct xe_tile *tile,
> >
> >  void xe_devm_free_blocks(struct list_head *blocks);
> >  void xe_devm_page_free(struct page *page);
> > +vm_fault_t xe_svm_migrate_to_sram(struct vm_fault *vmf);
> >  #endif
> > diff --git a/drivers/gpu/drm/xe/xe_svm_devmem.c
> b/drivers/gpu/drm/xe/xe_svm_devmem.c
> > index 088ac209ad80..32ada458f1dd 100644
> > --- a/drivers/gpu/drm/xe/xe_svm_devmem.c
> > +++ b/drivers/gpu/drm/xe/xe_svm_devmem.c
> > @@ -37,11 +37,6 @@ struct xe_svm_block_meta {
> >  	unsigned long bitmap[];
> >  };
> >
> > -static vm_fault_t xe_devm_migrate_to_ram(struct vm_fault *vmf)
> > -{
> > -	return 0;
> > -}
> > -
> >  static u64 block_offset_to_pfn(struct xe_mem_region *mr, u64 offset)
> >  {
> >  	/** DRM buddy's block offset is 0-based*/
> > @@ -168,7 +163,7 @@ void xe_devm_free_blocks(struct list_head *blocks)
> >
> >  static const struct dev_pagemap_ops xe_devm_pagemap_ops = {
> >  	.page_free = xe_devm_page_free,
> > -	.migrate_to_ram = xe_devm_migrate_to_ram,
> > +	.migrate_to_ram = xe_svm_migrate_to_sram,
> 
> Again single file so this will be static function, no reason to export
> this.
> 
> >  };
> >
> >  /**
> > diff --git a/drivers/gpu/drm/xe/xe_svm_migrate.c
> b/drivers/gpu/drm/xe/xe_svm_migrate.c
> > new file mode 100644
> > index 000000000000..0db831af098e
> > --- /dev/null
> > +++ b/drivers/gpu/drm/xe/xe_svm_migrate.c
> > @@ -0,0 +1,222 @@
> > +// SPDX-License-Identifier: MIT
> > +/*
> > + * Copyright © 2023 Intel Corporation
> > + */
> > +
> > +#include <linux/gfp.h>
> > +#include <linux/migrate.h>
> > +#include <linux/dma-mapping.h>
> > +#include <linux/dma-fence.h>
> > +#include <linux/bitops.h>
> > +#include <linux/bitmap.h>
> > +#include <linux/kernel.h>
> > +#include <linux/slab.h>
> > +#include <drm/drm_buddy.h>
> > +#include "xe_device_types.h"
> > +#include "xe_device.h"
> > +#include "xe_trace.h"
> > +#include "xe_migrate.h"
> > +#include "xe_ttm_vram_mgr_types.h"
> > +#include "xe_assert.h"
> > +#include "xe_pt.h"
> > +#include "xe_svm.h"
> > +#include "xe_vm.h"
> > +
> > +
> > +/**
> > + * alloc_host_page() - allocate one host page for the fault vma
> > + *
> > + * @dev: (GPU) device that will access the allocated page
> > + * @vma: the fault vma that we need allocate page for
> > + * @addr: the fault address. The allocated page is for this address
> > + * @dma_addr: used to output the dma address of the allocated page.
> > + * This dma address will be used for gpu to access this page. GPU
> > + * access host page through a dma mapped address.
> > + * @pfn: used to output the pfn of the allocated page.
> > + *
> > + * This function allocate one host page for the specified vma. It
> > + * also does some prepare work for GPU to access this page, such
> > + * as map this page to iommu (by calling dma_map_page).
> > + *
> > + * When this function returns, the page is locked.
> > + *
> > + * Return struct page pointer when success
> > + * NULL otherwise
> > + */
> > +static struct page *alloc_host_page(struct device *dev,
> > +							 struct vm_area_struct
> *vma,
> > +							 unsigned long addr,
> > +							 dma_addr_t
> *dma_addr,
> > +							 unsigned long *pfn)
> 
> Weird alignment, also I don't think we want to allocate a page at a
> time...
> 
> Beyond that, can't say I'm a fan of 2 arguments being returned and
> populated here either (dma_addr_t *dma_addr, unsigned long *pfn). I
> haven't seen many functions in that style in Linux.
> 
> Probably makes more sense to have a function which allocates pages,
> locks them, and populates the pfn array (migrate_pfn) rather than doing
> this a page at a time.
> 
> > +{
> > +	struct page *page;
> > +
> > +	page = alloc_page_vma(GFP_HIGHUSER, vma, addr);
> > +	if (unlikely(!page))
> > +		return NULL;
> > +
> > +	/**Lock page per hmm requirement, see hmm.rst*/
> > +	lock_page(page);
> > +	*dma_addr = dma_map_page(dev, page, 0, PAGE_SIZE, DMA_FROM_DEVICE);
> 
> The device is writing to these pages so I think DMA_BIDIRECTIONAL is
> needed, right? As mentioned above I think this should be broken out into
> a different step too.
> 
> > +	if (unlikely(dma_mapping_error(dev, *dma_addr))) {
> > +		unlock_page(page);
> > +		__free_page(page);
> > +		return NULL;
> > +	}
> > +
> > +	*pfn = migrate_pfn(page_to_pfn(page));
> > +	return page;
> > +}
> > +
> > +static void free_host_page(struct page *page)
> > +{
> > +	unlock_page(page);
> > +	put_page(page);
> > +}
> > +
> > +/**
> > + * migrate_page_vram_to_ram() - migrate one page from vram to ram
> > + *
> > + * @vma: The vma that the page is mapped to
> > + * @addr: The virtual address that the page is mapped to
> > + * @src_pfn: src page's page frame number
> > + * @dst_pfn: used to return the destination page (in system ram)'s pfn
> > + *
> > + * Allocate one page in system ram and copy memory from device memory
> > + * to system ram.
> > + *
> > + * Return: 0 if this page is already in sram (no need to migrate)
> > + * 1: successfully migrated this page from vram to sram.
> > + * error code otherwise
> > + */
> > +static int migrate_page_vram_to_ram(struct vm_area_struct *vma, unsigned long addr,
> > +						unsigned long src_pfn, unsigned long *dst_pfn)
> > +{
> 
> We definitely don't want to copy 1 page at a time. I touch on this in [1].
> Basically this is going to perform poorly unless we use larger copies; the
> migrate code supports non-contiguous sram addresses, and vram addresses
> will likely be contiguous due to the buddy allocator.
> 
> [1] https://patchwork.freedesktop.org/patch/588548/?series=132229&rev=1
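Understood. Just to capture the coalescing idea in a self-contained form, here is a userspace sketch (the helper name is made up, and this is only a model of the scan, not driver code): walk the migrate pfn array and find maximal runs of physically contiguous vram pages, so each run can be handed to one blit instead of one copy per page.

```c
#include <stddef.h>

/*
 * Find the end of a run of physically contiguous pfns starting at
 * index i. The half-open range [i, j) is then one chunk that can be
 * copied with a single blit.
 */
static size_t next_run(const unsigned long *pfn, size_t i, size_t n)
{
	size_t j = i + 1;

	while (j < n && pfn[j] == pfn[j - 1] + 1)
		j++;
	return j;
}
```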
> 
> > +	struct xe_mem_region *mr;
> > +	struct xe_tile *tile;
> > +	struct xe_device *xe;
> > +	struct device *dev;
> > +	dma_addr_t dma_addr = 0;
> > +	struct dma_fence *fence;
> > +	struct page *host_page;
> > +	struct page *src_page;
> > +	u64 src_dpa;
> > +
> > +	src_page = migrate_pfn_to_page(src_pfn);
> > +	if (unlikely(!src_page || !(src_pfn & MIGRATE_PFN_MIGRATE)))
> 
> I'm going to say this is a bug if !src_page ||
> !is_zone_device_page(src_page) || !(src_pfn & MIGRATE_PFN_MIGRATE) and
> we return -EFAULT (or another error code if that makes more sense). We
> are migrating from the device where we know we have backing store from
> the original fault.
> 
> > +		return 0;
> > +
> > +	mr = xe_page_to_mem_region(src_page);
> > +	tile = xe_mem_region_to_tile(mr);
> > +	xe = tile_to_xe(tile);
> > +	dev = xe->drm.dev;
> > +
> > +	src_dpa = xe_mem_region_pfn_to_dpa(mr, src_pfn);
> > +	host_page = alloc_host_page(dev, vma, addr, &dma_addr, dst_pfn);
> > +	if (!host_page)
> > +		return -ENOMEM;
> > +
> > +	fence = xe_migrate_pa(tile->migrate, src_dpa, true,
> > +						dma_addr, false, PAGE_SIZE);
> > +	if (IS_ERR(fence)) {
> > +		dma_unmap_page(dev, dma_addr, PAGE_SIZE, DMA_FROM_DEVICE);
> > +		free_host_page(host_page);
> > +		return PTR_ERR(fence);
> > +	}
> > +
> > +	dma_fence_wait(fence, false);
> 
> Even if we did want to migrate a page at a time, we only need to wait on
> the last fence due to the ordered nature of exec queues.
> 
> > +	dma_fence_put(fence);
> > +	dma_unmap_page(dev, dma_addr, PAGE_SIZE, DMA_FROM_DEVICE);
> 
> With above, will likely unmap all dma pages in a single function once
> the last fence is signaled.
> 
> > +	return 1;
> > +}
> > +
> > +/**
> > + * xe_svm_migrate_to_sram() - Migrate memory back to sram on CPU page fault
> > + *
> > + * @vmf: cpu vm fault structure, contains fault information such as vma etc.
> > + *
> > + * Note, this is in CPU's vm fault handler, caller holds the mmap read lock.
> > + *
> > + * This function migrates one gpu vma which contains the fault address to sram.
> > + * We try to maintain a 1:1 mapping b/t the CPU vma and gpu vma (i.e., create
> > + * one gpu vma for one cpu vma initially and try not to split it). So this
> > + * scheme ends up migrating at the vma granularity. This might not be the most
> > + * performant scheme.
> > + *
> > + * This can be tuned with a migration granularity for performance, for example,
> > + * migrate 2M for each CPU page fault, or let the user specify how much to
> > + * migrate. This is more complex due to vma splitting.
> > + *
> > + * This function should also update the GPU page table, so the fault virtual
> > + * address points to the same sram location from the GPU side. This is TBD.
> > + *
> > + * Return:
> > + * 0 on success
> > + * VM_FAULT_SIGBUS: failed to migrate the page to system memory, the
> > + * application will be signaled a SIGBUS
> > + */
> > +vm_fault_t xe_svm_migrate_to_sram(struct vm_fault *vmf)
> > +{
> > +	struct xe_mem_region *mr = xe_page_to_mem_region(vmf->page);
> > +	struct xe_tile *tile = xe_mem_region_to_tile(mr);
> > +	struct xe_device *xe = tile_to_xe(tile);
> > +	struct vm_area_struct *vma = vmf->vma;
> > +	struct mm_struct *mm = vma->vm_mm;
> > +	struct xe_svm *svm = xe_lookup_svm_by_mm(mm);
> 
> I don't think this is needed... More below.
> 
> > +	unsigned long addr = vma->vm_start;
> > +	u64 npages = vma_pages(vma);
> > +	struct xe_vma *xe_vma;
> > +	vm_fault_t ret = 0;
> > +	struct xe_vm *vm;
> > +	void *buf;
> > +	int i;
> > +
> > +	struct migrate_vma migrate_vma = {
> > +		.vma		= vmf->vma,
> > +		.start		= vma->vm_start,
> > +		.end		= vma->vm_end,
> 
> So I know in my PoC I had the fault user pointer (xe_vma) == struct
> vm_area_struct of the GPU fault. That is definitely wrong. We likely
> want to allocate a sub-range of the vm_area_struct for the xe_vma; we can
> call this a chunk size. Logical chunk sizes would be aligned 2MB, 64k, and
> finally 4k, trying the largest first... The chunk sizes are trivial as we
> can likely just have a table of values; the key here is that the
> vm_area_struct vm_start / vm_end are not what we want to use here but
> rather xe_vma_start(vma) and xe_vma_end(vma). I think we get the xe_vma
> from the faulting page's vmf->page->zone_device_data field unless you have
> another use for that field...

You are right. Such work is planned in the memory attributes part that Himal is working on. We have a migration_granularity attribute which allows the user to set the chunk size.
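To make sure we mean the same thing by largest-aligned-chunk-first, here is a userspace sketch (the sizes and helper name are illustrative only, not the planned uAPI): try 2M, then 64k, then 4k, taking the largest aligned span around the fault address that still fits inside the CPU vma.

```c
#include <stdint.h>

#define SZ_4K	0x1000ULL
#define SZ_64K	0x10000ULL
#define SZ_2M	0x200000ULL

/*
 * Pick the migration chunk for a fault address: largest aligned size
 * first, falling back until the aligned span fits inside [vm_start,
 * vm_end).
 */
static uint64_t pick_chunk(uint64_t fault, uint64_t vm_start, uint64_t vm_end)
{
	static const uint64_t sizes[] = { SZ_2M, SZ_64K, SZ_4K };

	for (int i = 0; i < 3; i++) {
		uint64_t start = fault & ~(sizes[i] - 1);

		if (start >= vm_start && start + sizes[i] <= vm_end)
			return sizes[i];
	}
	return 0;	/* vma too small or unaligned even for 4k */
}
```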

> 
> I also comment on my patch with my suggestion implement chunk sizes too.
> 
> > +		.pgmap_owner	= xe,
> 
> Again helper for this.
> 
> > +		.flags		= MIGRATE_VMA_SELECT_DEVICE_PRIVATE,
> > +		.fault_page = vmf->page,
> > +	};
> > +
> > +	buf = kvcalloc(npages, 2* sizeof(*migrate_vma.src), GFP_KERNEL);
> > +	migrate_vma.src = buf;
> > +	migrate_vma.dst = buf + npages;
> > +	if (migrate_vma_setup(&migrate_vma) < 0) {
> > +		ret = VM_FAULT_SIGBUS;
> > +		goto free_buf;
> > +	}
> > +
> > +	if (!migrate_vma.cpages)
> 
> This is an error, need to set a return value.
> 
> > +		goto free_buf;
> > +
> 
> We probably should check migrate.cpages != npages too as I also think
> this is an error.
> 
> > +	for (i = 0; i < npages; i++) {
> > +		ret = migrate_page_vram_to_ram(vma, addr, migrate_vma.src[i],
> > +							migrate_vma.dst + i);
> > +		if (ret < 0) {
> > +			ret = VM_FAULT_SIGBUS;
> > +			break;
> > +		}
> > +
> > +		/** Migration has been successful, free source page */
> > +		if (ret == 1) {
> > +			struct page *src_page = migrate_pfn_to_page(migrate_vma.src[i]);
> > +
> > +			xe_devm_page_free(src_page);
> > +		}
> > +
> > +		addr += PAGE_SIZE;
> > +	}
> 
> I touch on this above, this should be reworked to roughly:
> 
> - alloc pages and populate migrate_vma.dst
> - dma map sram pages
> - migrate a chunk of contigous vram addresses at a time
> - wait on last dma fence from migrate
> - unmap dma pages
> - unlock and free all pages
> 
> > +
> > +	xe_svm_for_each_vm(svm, vm) {
> > +		xe_assert(xe, vm->mm == mm);
> > +		xe_vma = xe_vm_lookup_vma(vm, vmf->address);
> > +		if (xe_vma)
> > +			xe_vm_invalidate_vma(xe_vma);
> > +	}
> 
> I've touched on why this isn't needed... I think one of these
> migrate_vma_* functions will trigger all MMU notifiers registered for
> the range. The notifier owns the invalidate then.

Very good point. After reading the migrate_vma_setup function, I agree that it calls the mmu notifiers with the MMU_NOTIFY_MIGRATE event. Today we invalidate the vma on this event, so yes, the code above is not needed.

I did identify one potential improvement: when the mmu notifier is called back with the MMU_NOTIFY_MIGRATE event, and migrate_vma_setup was called from the gpu page fault path, we can skip the gpu vma invalidation, since we will update the gpu page table after migration anyway. I think a page table invalidation is not needed in that case. But this should be just a minor improvement.


> 
> Beyond this, maybe I'm confused about a few things and how this all fits
> together. Doesn't every user process have its own unique mm, fd, and vm
> (e.g. its own address space)? If a user wants a shared address space, then
> use threads with a single mm, fd, and vm.

Yes, this is also my understanding. Each user process has its own mm struct and device fds. 

In the shared address space case, such as multiple pthreads created through pthread_create in one process, each pthread has its own kernel task_struct, but all those task_structs (say we get them from the "current" macro) share the same mm struct, which means they all live inside one cpu address space.
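As a quick userspace illustration of that point (nothing xe-specific): threads created with pthread_create each get their own task_struct but share one mm, so a store made by one thread is read back by another through the very same virtual address.

```c
#include <pthread.h>

static int shared;	/* one mm: the same virtual address in every thread */

static void *writer(void *arg)
{
	(void)arg;
	shared = 42;	/* store through the shared address space */
	return NULL;
}

/*
 * Spawn a second thread (its own task_struct) and read its store back:
 * both task_structs point at a single mm_struct.
 */
static int run_demo(void)
{
	pthread_t t;

	if (pthread_create(&t, NULL, writer, NULL))
		return -1;
	pthread_join(t, NULL);	/* join orders the store before our load */
	return shared;
}
```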

With this work, we are basically extending this shared cpu address space to the gpu program, so the cpu program and gpu program can share one address space.

Since we allow the user to create multiple gpu vms for one device (let's focus on one device for now), each shared address space can have multiple gpu vms... each gpuvm should be able to register its own mmu notifier with core mm, even if those notifiers cover the same address range. But I will have to test this out. If it all works, the code above is not needed. If different gpuvms can't register mmu notifiers for the same address range, then we would need a fix....


> 
> So even if we had to resolve the xe_vma's here and do an invalidate here,
> I'm very confused about what this is doing. Is this the case with multiple
> devices where each VM points to a different device?

Right now I am only focusing on a single device; see above. This was meant to handle the one-gpu-device, multiple-gpu-vm case, but as said above, for now I don't think it is needed. I need to test the mmu notifier behavior further: whether it allows us to insert two notifiers for the same range in one mm....

Oak

> Again, in that case I
> don't think a xe_svm structure would be needed; on GPU fault we should
> be able to detect from the faulting page zone_device_data and pgmap owner
> if the fault already has a physical backing on another GPU and resolve
> how to map it into the GPU with a fault... Jason suggests this in the
> following thread [2] and I think I agree with him.
> 
> [2] https://lore.kernel.org/all/5495090e-dee1-4c8e-91bc-240632fd3e35@amd.com/T/
> 
> > +	migrate_vma_pages(&migrate_vma);
> 
> This logic is going to change but ...
> 
> On an error I think we only want to call migrate_vma_finalize to revert
> pages back to the original state (i.e. migrate_vma_pages commits the
> page changes which we don't want to do on an error).
> 
> > +	migrate_vma_finalize(&migrate_vma);
> > +free_buf:
> > +	kvfree(buf);
> > +	return 0;
> 
> I don't think 0 should blindly be returned here; if there is an error,
> return VM_FAULT_SIGBUS. We likely want a high level error message too.
> 
> Matt
> 
> > +}
> > --
> > 2.26.3
> >

^ permalink raw reply	[flat|nested] 72+ messages in thread

* Re: [v2 27/31] drm/xe/svm: Handle CPU page fault
  2024-04-12 17:24     ` Zeng, Oak
@ 2024-04-12 18:10       ` Matthew Brost
  2024-04-12 18:39         ` Zeng, Oak
  2024-06-07  4:44         ` Zeng, Oak
  0 siblings, 2 replies; 72+ messages in thread
From: Matthew Brost @ 2024-04-12 18:10 UTC (permalink / raw)
  To: Zeng, Oak
  Cc: intel-xe@lists.freedesktop.org, Ghimiray, Himal Prasad,
	Bommu, Krishnaiah, Thomas.Hellstrom@linux.intel.com, Welty, Brian

On Fri, Apr 12, 2024 at 11:24:06AM -0600, Zeng, Oak wrote:
> 
> 
> > -----Original Message-----
> > From: Brost, Matthew <matthew.brost@intel.com>
> > Sent: Wednesday, April 10, 2024 10:07 PM
> > To: Zeng, Oak <oak.zeng@intel.com>
> > Cc: intel-xe@lists.freedesktop.org; Ghimiray, Himal Prasad
> > <himal.prasad.ghimiray@intel.com>; Bommu, Krishnaiah
> > <krishnaiah.bommu@intel.com>; Thomas.Hellstrom@linux.intel.com; Welty,
> > Brian <brian.welty@intel.com>
> > Subject: Re: [v2 27/31] drm/xe/svm: Handle CPU page fault
> > 
> > On Tue, Apr 09, 2024 at 04:17:38PM -0400, Oak Zeng wrote:
> > > Under the picture of svm, CPU and GPU program share one same
> > > virtual address space. The backing store of this virtual address
> > > space can be either in system memory or device memory. Since GPU
> > > device memory is remapped as DEVICE_PRIVATE, CPU can't access it.
> > > Any CPU access to device memory causes a page fault. Implement
> > > a page fault handler to migrate memory back to system memory and
> > > map it to CPU page table so the CPU program can proceed.
> > >
> > > Also unbind this page from GPU side, and free the original GPU
> > > device page
> > >
> > > Signed-off-by: Oak Zeng <oak.zeng@intel.com>
> > > Co-developed-by: Niranjana Vishwanathapura <niranjana.vishwanathapura@intel.com>
> > > Signed-off-by: Niranjana Vishwanathapura <niranjana.vishwanathapura@intel.com>
> > > Cc: Matthew Brost <matthew.brost@intel.com>
> > > Cc: Thomas Hellström <thomas.hellstrom@intel.com>
> > > Cc: Brian Welty <brian.welty@intel.com>
> > > ---
> > >  drivers/gpu/drm/xe/Makefile         |   1 +
> > >  drivers/gpu/drm/xe/xe_svm.h         |   8 +-
> > >  drivers/gpu/drm/xe/xe_svm_devmem.c  |   7 +-
> > >  drivers/gpu/drm/xe/xe_svm_migrate.c | 222 ++++++++++++++++
> > >  4 files changed, 230 insertions(+), 8 deletions(-)
> > >  create mode 100644 drivers/gpu/drm/xe/xe_svm_migrate.c
> > >
> > > diff --git a/drivers/gpu/drm/xe/Makefile b/drivers/gpu/drm/xe/Makefile
> > > index f89d77b6d654..65289acdd563 100644
> > > --- a/drivers/gpu/drm/xe/Makefile
> > > +++ b/drivers/gpu/drm/xe/Makefile
> > > @@ -131,6 +131,7 @@ xe-y += xe_bb.o \
> > >  	xe_step.o \
> > >  	xe_svm.o \
> > >  	xe_svm_devmem.o \
> > > +	xe_svm_migrate.o \
> > 
> > See comments about file org, same thing applies here. Let's put all of
> > the svm implementation in a single file.
> > 
> > >  	xe_sync.o \
> > >  	xe_tile.o \
> > >  	xe_tile_sysfs.o \
> > > diff --git a/drivers/gpu/drm/xe/xe_svm.h b/drivers/gpu/drm/xe/xe_svm.h
> > > index f601dffe3fc1..c9e4239c44b4 100644
> > > --- a/drivers/gpu/drm/xe/xe_svm.h
> > > +++ b/drivers/gpu/drm/xe/xe_svm.h
> > > @@ -7,11 +7,11 @@
> > >  #define __XE_SVM_H
> > >
> > >  #include <linux/mm_types.h>
> > > +#include <linux/mm.h>
> > >  #include "xe_device_types.h"
> > >  #include "xe_device.h"
> > >  #include "xe_assert.h"
> > > -
> > > -struct xe_vm;
> > > +#include "xe_vm_types.h"
> > >
> > >  /**
> > >   * struct xe_svm - data structure to represent a shared
> > > @@ -31,6 +31,9 @@ struct xe_svm {
> > >  	struct list_head vm_list;
> > >  };
> > >
> > > +#define xe_svm_for_each_vm(svm, vm)					\
> > > +		list_for_each_entry(vm, &svm->vm_list, svm_link)
> > > +
> > 
> > Don't think this is need, see below.
> > 
> > >  extern struct xe_svm *xe_create_svm(void);
> > >  void xe_destroy_svm(struct xe_svm *svm);
> > >  extern struct xe_svm *xe_lookup_svm_by_mm(struct mm_struct *mm);
> > > @@ -79,4 +82,5 @@ int xe_devm_alloc_pages(struct xe_tile *tile,
> > >
> > >  void xe_devm_free_blocks(struct list_head *blocks);
> > >  void xe_devm_page_free(struct page *page);
> > > +vm_fault_t xe_svm_migrate_to_sram(struct vm_fault *vmf);
> > >  #endif
> > > diff --git a/drivers/gpu/drm/xe/xe_svm_devmem.c b/drivers/gpu/drm/xe/xe_svm_devmem.c
> > > index 088ac209ad80..32ada458f1dd 100644
> > > --- a/drivers/gpu/drm/xe/xe_svm_devmem.c
> > > +++ b/drivers/gpu/drm/xe/xe_svm_devmem.c
> > > @@ -37,11 +37,6 @@ struct xe_svm_block_meta {
> > >  	unsigned long bitmap[];
> > >  };
> > >
> > > -static vm_fault_t xe_devm_migrate_to_ram(struct vm_fault *vmf)
> > > -{
> > > -	return 0;
> > > -}
> > > -
> > >  static u64 block_offset_to_pfn(struct xe_mem_region *mr, u64 offset)
> > >  {
> > >  	/** DRM buddy's block offset is 0-based*/
> > > @@ -168,7 +163,7 @@ void xe_devm_free_blocks(struct list_head *blocks)
> > >
> > >  static const struct dev_pagemap_ops xe_devm_pagemap_ops = {
> > >  	.page_free = xe_devm_page_free,
> > > -	.migrate_to_ram = xe_devm_migrate_to_ram,
> > > +	.migrate_to_ram = xe_svm_migrate_to_sram,
> > 
> > Again single file so this will be static function, no reason to export
> > this.
> > 
> > >  };
> > >
> > >  /**
> > > diff --git a/drivers/gpu/drm/xe/xe_svm_migrate.c b/drivers/gpu/drm/xe/xe_svm_migrate.c
> > > new file mode 100644
> > > index 000000000000..0db831af098e
> > > --- /dev/null
> > > +++ b/drivers/gpu/drm/xe/xe_svm_migrate.c
> > > @@ -0,0 +1,222 @@
> > > +// SPDX-License-Identifier: MIT
> > > +/*
> > > + * Copyright © 2023 Intel Corporation
> > > + */
> > > +
> > > +#include <linux/gfp.h>
> > > +#include <linux/migrate.h>
> > > +#include <linux/dma-mapping.h>
> > > +#include <linux/dma-fence.h>
> > > +#include <linux/bitops.h>
> > > +#include <linux/bitmap.h>
> > > +#include <linux/kernel.h>
> > > +#include <linux/slab.h>
> > > +#include <drm/drm_buddy.h>
> > > +#include "xe_device_types.h"
> > > +#include "xe_device.h"
> > > +#include "xe_trace.h"
> > > +#include "xe_migrate.h"
> > > +#include "xe_ttm_vram_mgr_types.h"
> > > +#include "xe_assert.h"
> > > +#include "xe_pt.h"
> > > +#include "xe_svm.h"
> > > +#include "xe_vm.h"
> > > +
> > > +
> > > +/**
> > > + * alloc_host_page() - allocate one host page for the fault vma
> > > + *
> > > + * @dev: (GPU) device that will access the allocated page
> > > + * @vma: the fault vma that we need allocate page for
> > > + * @addr: the fault address. The allocated page is for this address
> > > + * @dma_addr: used to output the dma address of the allocated page.
> > > + * This dma address will be used for gpu to access this page. GPU
> > > + * access host page through a dma mapped address.
> > > + * @pfn: used to output the pfn of the allocated page.
> > > + *
> > > + * This function allocate one host page for the specified vma. It
> > > + * also does some prepare work for GPU to access this page, such
> > > + * as map this page to iommu (by calling dma_map_page).
> > > + *
> > > + * When this function returns, the page is locked.
> > > + *
> > > + * Return struct page pointer when success
> > > + * NULL otherwise
> > > + */
> > > +static struct page *alloc_host_page(struct device *dev,
> > > +							 struct vm_area_struct *vma,
> > > +							 unsigned long addr,
> > > +							 dma_addr_t *dma_addr,
> > > +							 unsigned long *pfn)
> > 
> > Weird alignment, also I don't think we are want to allocate a page at
> > time...
> > 
> > Beyond that, can't say I'm a fan of 2 arguments being return and
> > populated here either (dma_addr_t *dma_addr, unsigned long *pfn). I
> > haven't seen a lot that style function in Linux.
> > 
> > Probably makes more sense to have a function which allocates pages,
> > locks them, and populates the pfn array (migrate_pfn) rather than doing
> > this a page at a time.
> > 
> > > +{
> > > +	struct page *page;
> > > +
> > > +	page = alloc_page_vma(GFP_HIGHUSER, vma, addr);
> > > +	if (unlikely(!page))
> > > +		return NULL;
> > > +
> > > +	/**Lock page per hmm requirement, see hmm.rst*/
> > > +	lock_page(page);
> > > +	*dma_addr = dma_map_page(dev, page, 0, PAGE_SIZE, DMA_FROM_DEVICE);
> > 
> > The device is writing to these pages so I think DMA_BIDIRECTIONAL is
> > needed, right? As mentioned above I think this should be broken out into
> > a different step too.
> > 
> > > +	if (unlikely(dma_mapping_error(dev, *dma_addr))) {
> > > +		unlock_page(page);
> > > +		__free_page(page);
> > > +		return NULL;
> > > +	}
> > > +
> > > +	*pfn = migrate_pfn(page_to_pfn(page));
> > > +	return page;
> > > +}
> > > +
> > > +static void free_host_page(struct page *page)
> > > +{
> > > +	unlock_page(page);
> > > +	put_page(page);
> > > +}
> > > +
> > > +/**
> > > + * migrate_page_vram_to_ram() - migrate one page from vram to ram
> > > + *
> > > + * @vma: The vma that the page is mapped to
> > > + * @addr: The virtual address that the page is mapped to
> > > + * @src_pfn: src page's page frame number
> > > + * @dst_pfn: used to return the destination page (in system ram)'s pfn
> > > + *
> > > + * Allocate one page in system ram and copy memory from device memory
> > > + * to system ram.
> > > + *
> > > + * Return: 0 if this page is already in sram (no need to migrate)
> > > + * 1: successfully migrated this page from vram to sram.
> > > + * error code otherwise
> > > + */
> > > +static int migrate_page_vram_to_ram(struct vm_area_struct *vma, unsigned long addr,
> > > +						unsigned long src_pfn, unsigned long *dst_pfn)
> > > +{
> > 
> > We definitely don't want to copy 1 page at time. I touch on this in [1].
> > Basically this going to perform poorly unless we use larger copies, the
> > migrate code supports non-contigous sram addresses, and vram addresses
> > will likely be contigous due to the buddy allocator.
> > 
> > [1] https://patchwork.freedesktop.org/patch/588548/?series=132229&rev=1
> > 
> > > +	struct xe_mem_region *mr;
> > > +	struct xe_tile *tile;
> > > +	struct xe_device *xe;
> > > +	struct device *dev;
> > > +	dma_addr_t dma_addr = 0;
> > > +	struct dma_fence *fence;
> > > +	struct page *host_page;
> > > +	struct page *src_page;
> > > +	u64 src_dpa;
> > > +
> > > +	src_page = migrate_pfn_to_page(src_pfn);
> > > +	if (unlikely(!src_page || !(src_pfn & MIGRATE_PFN_MIGRATE)))
> > 
> > I'm going to say this is a bug if !src_page ||
> > !is_zone_device_page(src_page) || !(src_pfn & MIGRATE_PFN_MIGRATE) and
> > we return -EFAULT (or another error code if that makes more sense). We
> > are migrating from the device where we know we have backing store from
> > the original fault.
> > 
> > > +		return 0;
> > > +
> > > +	mr = xe_page_to_mem_region(src_page);
> > > +	tile = xe_mem_region_to_tile(mr);
> > > +	xe = tile_to_xe(tile);
> > > +	dev = xe->drm.dev;
> > > +
> > > +	src_dpa = xe_mem_region_pfn_to_dpa(mr, src_pfn);
> > > +	host_page = alloc_host_page(dev, vma, addr, &dma_addr, dst_pfn);
> > > +	if (!host_page)
> > > +		return -ENOMEM;
> > > +
> > > +	fence = xe_migrate_pa(tile->migrate, src_dpa, true,
> > > +						dma_addr, false, PAGE_SIZE);
> > > +	if (IS_ERR(fence)) {
> > > +		dma_unmap_page(dev, dma_addr, PAGE_SIZE, DMA_FROM_DEVICE);
> > > +		free_host_page(host_page);
> > > +		return PTR_ERR(fence);
> > > +	}
> > > +
> > > +	dma_fence_wait(fence, false);
> > 
> > Even if we did want to migrate a page at a time, we only need to wait on
> > the last fence due to the ordered nature of exec queues.
> > 
> > > +	dma_fence_put(fence);
> > > +	dma_unmap_page(dev, dma_addr, PAGE_SIZE, DMA_FROM_DEVICE);
> > 
> > With above, will likely unmap all dma pages in a single function once
> > the last fence is signaled.
> > 
> > > +	return 1;
> > > +}
> > > +
> > > +/**
> > > + * xe_svm_migrate_to_sram() - Migrate memory back to sram on CPU page fault
> > > + *
> > > + * @vmf: cpu vm fault structure, contains fault information such as vma etc.
> > > + *
> > > + * Note, this is in CPU's vm fault handler, caller holds the mmap read lock.
> > > + *
> > > + * This function migrate one gpu vma which contains the fault address to sram.
> > > + * We try to maintain a 1:1 mapping b/t the CPU vma and gpu vma (i.e., create one
> > > + * gpu vma for one cpu vma initially and try not to split it). So this scheme end
> > > + * up migrate at the vma granularity. This might not be the best performant scheme
> > > + *
> > > + * This can be tunned with a migration granularity for performance, for example,
> > > + * migration 2M for each CPU page fault, or let user specify how much to migrate.
> > > + * This is more complex due to vma splitting.
> > > + *
> > > + * This function should also update GPU page table, so the fault virtual address
> > > + * points to the same sram location from GPU side. This is TBD.
> > > + *
> > > + * Return:
> > > + * 0 on success
> > > + * VM_FAULT_SIGBUS: failed to migrate page to system memory, application
> > > + * will be signaled a SIGBUG
> > > + */
> > > +vm_fault_t xe_svm_migrate_to_sram(struct vm_fault *vmf)
> > > +{
> > > +	struct xe_mem_region *mr = xe_page_to_mem_region(vmf->page);
> > > +	struct xe_tile *tile = xe_mem_region_to_tile(mr);
> > > +	struct xe_device *xe = tile_to_xe(tile);
> > > +	struct vm_area_struct *vma = vmf->vma;
> > > +	struct mm_struct *mm = vma->vm_mm;
> > > +	struct xe_svm *svm = xe_lookup_svm_by_mm(mm);
> > 
> > I don't think this is needed... More below.
> > 
> > > +	unsigned long addr = vma->vm_start;
> > > +	u64 npages = vma_pages(vma);
> > > +	struct xe_vma *xe_vma;
> > > +	vm_fault_t ret = 0;
> > > +	struct xe_vm *vm;
> > > +	void *buf;
> > > +	int i;
> > > +
> > > +	struct migrate_vma migrate_vma = {
> > > +		.vma		= vmf->vma,
> > > +		.start		= vma->vm_start,
> > > +		.end		= vma->vm_end,
> > 
> > So I know in my PoC I had the fault user pointer (xe_vma) == struct
> > vm_area_struct of the GPU fault. That is definitely wrong. We likely
> > want to allocate sub-range of vm_area_struct for the xe_vma, we can call
> > this a chunk size. Logical chunks sizes would be aligned 2MB, 64k, and
> > finally 4k in sizes trying the largest first... The chunk sizes are
> > trivial as we likely can just have table with values, the key here is
> > the vm_area_struct vm_start / vm_end are not what we want to use here
> > rather xe_vma_start(vma) and xe_vma_end(vma). I think we get the xe_vma

After I typed this, I realized I made a mistake here...

s/xe_vma_start/xe_vma_userptr/
s/xe_vma_end/xe_vma_userptr + xe_vma_size/

But you get the idea - zone_device_data points to Xe-specific chunk
data (currently an xe_vma; it could be an xe_pt_state or something else
later if we switch to 1:N).

Check AMD's + Nvidia's drivers and they both use this field in a similar
way.

> > from the faulting page vmf->page->zone_device_data field unless you have
> > another use that field...
> 
> You are right. Such work is planned in the memory attributes part that Himal is working on. We have a migration_granularity attribute which allow user to set the chunk size.
> 

That makes sense. The chunk size is always just a hint though, right?

> > 
> > I also comment on my patch with my suggestion implement chunk sizes too.
> > 
> > > +		.pgmap_owner	= xe,
> > 
> > Again helper for this.
> > 
> > > +		.flags		= MIGRATE_VMA_SELECT_DEVICE_PRIVATE,
> > > +		.fault_page = vmf->page,
> > > +	};
> > > +
> > > +	buf = kvcalloc(npages, 2* sizeof(*migrate_vma.src), GFP_KERNEL);
> > > +	migrate_vma.src = buf;
> > > +	migrate_vma.dst = buf + npages;
> > > +	if (migrate_vma_setup(&migrate_vma) < 0) {
> > > +		ret = VM_FAULT_SIGBUS;
> > > +		goto free_buf;
> > > +	}
> > > +
> > > +	if (!migrate_vma.cpages)
> > 
> > This is an error, need to set a return value.
> > 
> > > +		goto free_buf;
> > > +
> > 
> > We probably should check migrate.cpages != npages too as I also think
> > this is an error.
> > 
> > > +	for (i = 0; i < npages; i++) {
> > > +		ret = migrate_page_vram_to_ram(vma, addr, migrate_vma.src[i],
> > > +							migrate_vma.dst + i);
> > > +		if (ret < 0) {
> > > +			ret = VM_FAULT_SIGBUS;
> > > +			break;
> > > +		}
> > > +
> > > +		/** Migration has been successful, free source page */
> > > +		if (ret == 1) {
> > > +			struct page *src_page = migrate_pfn_to_page(migrate_vma.src[i]);
> > > +
> > > +			xe_devm_page_free(src_page);
> > > +		}
> > > +
> > > +		addr += PAGE_SIZE;
> > > +	}
> > 
> > I touch on this above, this should be reworked to roughly:
> > 
> > - alloc pages and populate migrate_vma.dst
> > - dma map sram pages
> > - migrate a chunk of contigous vram addresses at a time
> > - wait on last dma fence from migrate
> > - unmap dma pages
> > - unlock and free all pages
> > 
> > > +
> > > +	xe_svm_for_each_vm(svm, vm) {
> > > +		xe_assert(xe, vm->mm == mm);
> > > +		xe_vma = xe_vm_lookup_vma(vm, vmf->address);
> > > +		if (xe_vma)
> > > +			xe_vm_invalidate_vma(xe_vma);
> > > +	}
> > 
> > I've touched on why this isn't needed... I think one of these
> > migrate_vma_* functions will trigger all MMU notifiers registered for
> > the range. The notifier owns the invalidate then.
> 
> Very good point. Yes after read migrate_vma_setup function, I agree this function will call mmu notifiers with MMU_NOTIFY_MIGRATE event. Today we invalidate vma with this event. So yes, above codes are not needed.
> 
> I do identified one potential improvement: when mmu notifier is called back with MMU_NOTIFY_MIGRATE event, if the migrate_vma_setup is called from the gpu page fault path, we can ignore the gpu vma invalidation as we will update gpu page table later after migration anyway. I think an page table invalidation is not needed in such case. But this should be just a minor improvement.
>

We skip invalidations if the initial_bind flag is clear, which should
cover the initial GPU fault. There is certainly room for improvement /
optimization in the MMU notifier though; it is kind of messy right now
too. IMO work like this can be done once the basic design is working and
tests are in place to verify changes / optimizations.
 
> 
> > 
> > Beyond this, maybe I'm confused about a few things and how this fits all
> > together. Doesn't every user process have its own unique mm, fd, and vm
> > (e.g. own address space)? If a user want a shared address space then use
> > threads with a single mm, fd, and vm.
> 
> Yes, this is also my understanding. Each user process has its own mm struct and device fds. 
> 
> In a shared address space case, such as there are multiple pthread created through pthread_create in one process, all those pthreads should have different kernel task_struct, but all those task_struct (say we get it from "current" macro) should share one same mm struct, which means they all lives inside one cpu address space.
> 
> Now with this work, we are now basically extending this shared cpu address space to gpu program. So both cpu program and gpu program can share one address space.
> 
> Since we allow user to create multiple gpu vm for one device (lets focus on one device for now), so each shared address space can have multiple gpu vm... each gpuvm should be able to register its own mmu notifier to core mm, even if those notifier has the same address range. But I will have to test this out. If all this works, above codes are not needed. If different gpuvm can't register mmu notifier for same address range, then we would need a fix....
>

The mmu notifier code is implemented with an interval tree which
supports overlapping ranges (i.e. we can have multiple VMs register
notifiers with the same address range in a single MM).
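A toy userspace model of that overlap semantics (purely illustrative - the kernel uses an interval tree, this just shows the behavior): several registrations may cover the same span, and an invalidation walks all of them.

```c
#include <stddef.h>

struct reg {				/* one notifier registration per VM */
	unsigned long start, end;	/* half-open range [start, end) */
	int hits;
};

/*
 * Invalidate [start, end): call back every registration that overlaps,
 * including multiple registrations covering the same range (e.g. two
 * VMs on one mm). Returns how many registrations were notified.
 */
static int invalidate(struct reg *regs, size_t n,
		      unsigned long start, unsigned long end)
{
	int called = 0;

	for (size_t i = 0; i < n; i++) {
		if (regs[i].start < end && start < regs[i].end) {
			regs[i].hits++;
			called++;
		}
	}
	return called;
}
```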
 
> 
> > 
> > So even if we had to resolve the xe_vma's here and do an invalidate here
> > very confused what this is doing. This is this the case with multiple
> > devices and each VM points to a different device? 
> 
> Right now I only focus on single device. See above. This is to solve one gpu device but multiple gpu vm case. But as said above, for now I don't think this is needed. I need to test more on the mmu notifier behavior: whether it allow us to insert two notifiers for the same range for one mm....
> 

Agree that our focus should be on a single device now. If that design is
well thought out I don't think extending this to multiple devices will
be a huge change either.

Matt

> Oak
> 
> Again so that case I
> > don't think a xe_svm structure would be needed, on GPU fault we should
> > be able to detect from the faulting page zone_device_data and pgmap owner
> > if the fault already has a physical backing on another GPU and resolve
> > how to map it into GPU with a fault... Jason suggests this in the
> > following thread [2] and I think I agree with him.
> > 
> > [2] https://lore.kernel.org/all/5495090e-dee1-4c8e-91bc-
> > 240632fd3e35@amd.com/T/
> > 
> > > +	migrate_vma_pages(&migrate_vma);
> > 
> > This logic is going to change but ...
> > 
> > On an error I think we only want to call migrate_vma_finalize to revert
> > pages back to the original state (i.e. migrate_vma_pages commits the
> > page changes which we don't want to do on an error).
> > 
> > > +	migrate_vma_finalize(&migrate_vma);
> > > +free_buf:
> > > +	kvfree(buf);
> > > +	return 0;
> > 
> > I don't think 0 should blindly be return here, if there is an error
> > return VM_FAULT_SIGBUS. We likely want a high level error message too.
> > 
> > Matt
> > 
> > > +}
> > > --
> > > 2.26.3
> > >

^ permalink raw reply	[flat|nested] 72+ messages in thread

* RE: [v2 27/31] drm/xe/svm: Handle CPU page fault
  2024-04-12 18:10       ` Matthew Brost
@ 2024-04-12 18:39         ` Zeng, Oak
  2024-06-07  4:44         ` Zeng, Oak
  1 sibling, 0 replies; 72+ messages in thread
From: Zeng, Oak @ 2024-04-12 18:39 UTC (permalink / raw)
  To: Brost, Matthew
  Cc: intel-xe@lists.freedesktop.org, Ghimiray, Himal Prasad,
	Bommu, Krishnaiah, Thomas.Hellstrom@linux.intel.com, Welty, Brian



> -----Original Message-----
> From: Brost, Matthew <matthew.brost@intel.com>
> Sent: Friday, April 12, 2024 2:11 PM
> To: Zeng, Oak <oak.zeng@intel.com>
> Cc: intel-xe@lists.freedesktop.org; Ghimiray, Himal Prasad
> <himal.prasad.ghimiray@intel.com>; Bommu, Krishnaiah
> <krishnaiah.bommu@intel.com>; Thomas.Hellstrom@linux.intel.com; Welty,
> Brian <brian.welty@intel.com>
> Subject: Re: [v2 27/31] drm/xe/svm: Handle CPU page fault
> 
> On Fri, Apr 12, 2024 at 11:24:06AM -0600, Zeng, Oak wrote:
> >
> >
> > > -----Original Message-----
> > > From: Brost, Matthew <matthew.brost@intel.com>
> > > Sent: Wednesday, April 10, 2024 10:07 PM
> > > To: Zeng, Oak <oak.zeng@intel.com>
> > > Cc: intel-xe@lists.freedesktop.org; Ghimiray, Himal Prasad
> > > <himal.prasad.ghimiray@intel.com>; Bommu, Krishnaiah
> > > <krishnaiah.bommu@intel.com>; Thomas.Hellstrom@linux.intel.com; Welty,
> > > Brian <brian.welty@intel.com>
> > > Subject: Re: [v2 27/31] drm/xe/svm: Handle CPU page fault
> > >
> > > On Tue, Apr 09, 2024 at 04:17:38PM -0400, Oak Zeng wrote:
> > > > Under the picture of svm, CPU and GPU program share one same
> > > > virtual address space. The backing store of this virtual address
> > > > space can be either in system memory or device memory. Since GPU
> > > > device memory is remaped as DEVICE_PRIVATE, CPU can't access it.
> > > > Any CPU access to device memory causes a page fault. Implement
> > > > a page fault handler to migrate memory back to system memory and
> > > > map it to CPU page table so the CPU program can proceed.
> > > >
> > > > Also unbind this page from GPU side, and free the original GPU
> > > > device page
> > > >
> > > > Signed-off-by: Oak Zeng <oak.zeng@intel.com>
> > > > Co-developed-by: Niranjana Vishwanathapura
> > > <niranjana.vishwanathapura@intel.com>
> > > > Signed-off-by: Niranjana Vishwanathapura
> > > <niranjana.vishwanathapura@intel.com>
> > > > Cc: Matthew Brost <matthew.brost@intel.com>
> > > > Cc: Thomas Hellström <thomas.hellstrom@intel.com>
> > > > Cc: Brian Welty <brian.welty@intel.com>
> > > > ---
> > > >  drivers/gpu/drm/xe/Makefile         |   1 +
> > > >  drivers/gpu/drm/xe/xe_svm.h         |   8 +-
> > > >  drivers/gpu/drm/xe/xe_svm_devmem.c  |   7 +-
> > > >  drivers/gpu/drm/xe/xe_svm_migrate.c | 222
> > > ++++++++++++++++++++++++++++
> > > >  4 files changed, 230 insertions(+), 8 deletions(-)
> > > >  create mode 100644 drivers/gpu/drm/xe/xe_svm_migrate.c
> > > >
> > > > diff --git a/drivers/gpu/drm/xe/Makefile b/drivers/gpu/drm/xe/Makefile
> > > > index f89d77b6d654..65289acdd563 100644
> > > > --- a/drivers/gpu/drm/xe/Makefile
> > > > +++ b/drivers/gpu/drm/xe/Makefile
> > > > @@ -131,6 +131,7 @@ xe-y += xe_bb.o \
> > > >  	xe_step.o \
> > > >  	xe_svm.o \
> > > >  	xe_svm_devmem.o \
> > > > +	xe_svm_migrate.o \
> > >
> > > See comments about file org, same thing applies here. Let's put all of
> > > the svm implementation in a single file.
> > >
> > > >  	xe_sync.o \
> > > >  	xe_tile.o \
> > > >  	xe_tile_sysfs.o \
> > > > diff --git a/drivers/gpu/drm/xe/xe_svm.h b/drivers/gpu/drm/xe/xe_svm.h
> > > > index f601dffe3fc1..c9e4239c44b4 100644
> > > > --- a/drivers/gpu/drm/xe/xe_svm.h
> > > > +++ b/drivers/gpu/drm/xe/xe_svm.h
> > > > @@ -7,11 +7,11 @@
> > > >  #define __XE_SVM_H
> > > >
> > > >  #include <linux/mm_types.h>
> > > > +#include <linux/mm.h>
> > > >  #include "xe_device_types.h"
> > > >  #include "xe_device.h"
> > > >  #include "xe_assert.h"
> > > > -
> > > > -struct xe_vm;
> > > > +#include "xe_vm_types.h"
> > > >
> > > >  /**
> > > >   * struct xe_svm - data structure to represent a shared
> > > > @@ -31,6 +31,9 @@ struct xe_svm {
> > > >  	struct list_head vm_list;
> > > >  };
> > > >
> > > > +#define xe_svm_for_each_vm(svm, vm)
> > > 	\
> > > > +		list_for_each_entry(vm, &svm->vm_list, svm_link)
> > > > +
> > >
> > > Don't think this is need, see below.
> > >
> > > >  extern struct xe_svm *xe_create_svm(void);
> > > >  void xe_destroy_svm(struct xe_svm *svm);
> > > >  extern struct xe_svm *xe_lookup_svm_by_mm(struct mm_struct *mm);
> > > > @@ -79,4 +82,5 @@ int xe_devm_alloc_pages(struct xe_tile *tile,
> > > >
> > > >  void xe_devm_free_blocks(struct list_head *blocks);
> > > >  void xe_devm_page_free(struct page *page);
> > > > +vm_fault_t xe_svm_migrate_to_sram(struct vm_fault *vmf);
> > > >  #endif
> > > > diff --git a/drivers/gpu/drm/xe/xe_svm_devmem.c
> > > b/drivers/gpu/drm/xe/xe_svm_devmem.c
> > > > index 088ac209ad80..32ada458f1dd 100644
> > > > --- a/drivers/gpu/drm/xe/xe_svm_devmem.c
> > > > +++ b/drivers/gpu/drm/xe/xe_svm_devmem.c
> > > > @@ -37,11 +37,6 @@ struct xe_svm_block_meta {
> > > >  	unsigned long bitmap[];
> > > >  };
> > > >
> > > > -static vm_fault_t xe_devm_migrate_to_ram(struct vm_fault *vmf)
> > > > -{
> > > > -	return 0;
> > > > -}
> > > > -
> > > >  static u64 block_offset_to_pfn(struct xe_mem_region *mr, u64 offset)
> > > >  {
> > > >  	/** DRM buddy's block offset is 0-based*/
> > > > @@ -168,7 +163,7 @@ void xe_devm_free_blocks(struct list_head
> *blocks)
> > > >
> > > >  static const struct dev_pagemap_ops xe_devm_pagemap_ops = {
> > > >  	.page_free = xe_devm_page_free,
> > > > -	.migrate_to_ram = xe_devm_migrate_to_ram,
> > > > +	.migrate_to_ram = xe_svm_migrate_to_sram,
> > >
> > > Again single file so this will be static function, no reason to export
> > > this.
> > >
> > > >  };
> > > >
> > > >  /**
> > > > diff --git a/drivers/gpu/drm/xe/xe_svm_migrate.c
> > > b/drivers/gpu/drm/xe/xe_svm_migrate.c
> > > > new file mode 100644
> > > > index 000000000000..0db831af098e
> > > > --- /dev/null
> > > > +++ b/drivers/gpu/drm/xe/xe_svm_migrate.c
> > > > @@ -0,0 +1,222 @@
> > > > +// SPDX-License-Identifier: MIT
> > > > +/*
> > > > + * Copyright © 2023 Intel Corporation
> > > > + */
> > > > +
> > > > +#include <linux/gfp.h>
> > > > +#include <linux/migrate.h>
> > > > +#include <linux/dma-mapping.h>
> > > > +#include <linux/dma-fence.h>
> > > > +#include <linux/bitops.h>
> > > > +#include <linux/bitmap.h>
> > > > +#include <linux/kernel.h>
> > > > +#include <linux/slab.h>
> > > > +#include <drm/drm_buddy.h>
> > > > +#include "xe_device_types.h"
> > > > +#include "xe_device.h"
> > > > +#include "xe_trace.h"
> > > > +#include "xe_migrate.h"
> > > > +#include "xe_ttm_vram_mgr_types.h"
> > > > +#include "xe_assert.h"
> > > > +#include "xe_pt.h"
> > > > +#include "xe_svm.h"
> > > > +#include "xe_vm.h"
> > > > +
> > > > +
> > > > +/**
> > > > + * alloc_host_page() - allocate one host page for the fault vma
> > > > + *
> > > > + * @dev: (GPU) device that will access the allocated page
> > > > + * @vma: the fault vma that we need allocate page for
> > > > + * @addr: the fault address. The allocated page is for this address
> > > > + * @dma_addr: used to output the dma address of the allocated page.
> > > > + * This dma address will be used for gpu to access this page. GPU
> > > > + * access host page through a dma mapped address.
> > > > + * @pfn: used to output the pfn of the allocated page.
> > > > + *
> > > > + * This function allocates one host page for the specified vma. It
> > > > + * also does some prepare work for GPU to access this page, such
> > > > + * as map this page to iommu (by calling dma_map_page).
> > > > + *
> > > > + * When this function returns, the page is locked.
> > > > + *
> > > > + * Return struct page pointer when success
> > > > + * NULL otherwise
> > > > + */
> > > > +static struct page *alloc_host_page(struct device *dev,
> > > > +							 struct vm_area_struct
> > > *vma,
> > > > +							 unsigned long addr,
> > > > +							 dma_addr_t
> > > *dma_addr,
> > > > +							 unsigned long *pfn)
> > >
> > > Weird alignment, also I don't think we want to allocate a page at
> > > a time...
> > >
> > > Beyond that, can't say I'm a fan of 2 arguments being return and
> > > populated here either (dma_addr_t *dma_addr, unsigned long *pfn). I
> > > haven't seen a lot that style function in Linux.
> > >
> > > Probably makes more sense to have a function which allocates pages,
> > > locks them, and populates the pfn array (migrate_pfn) rather than doing
> > > this a page at a time.
> > >
> > > > +{
> > > > +	struct page *page;
> > > > +
> > > > +	page = alloc_page_vma(GFP_HIGHUSER, vma, addr);
> > > > +	if (unlikely(!page))
> > > > +		return NULL;
> > > > +
> > > > +	/**Lock page per hmm requirement, see hmm.rst*/
> > > > +	lock_page(page);
> > > > +	*dma_addr = dma_map_page(dev, page, 0, PAGE_SIZE,
> > > DMA_FROM_DEVICE);
> > >
> > > The device is writing to these pages so I think DMA_BIDIRECTIONAL is
> > > needed, right? As mentioned above I think this should be broken out into
> > > a different step too.
> > >
> > > > +	if (unlikely(dma_mapping_error(dev, *dma_addr))) {
> > > > +		unlock_page(page);
> > > > +		__free_page(page);
> > > > +		return NULL;
> > > > +	}
> > > > +
> > > > +	*pfn = migrate_pfn(page_to_pfn(page));
> > > > +	return page;
> > > > +}
> > > > +
> > > > +static void free_host_page(struct page *page)
> > > > +{
> > > > +	unlock_page(page);
> > > > +	put_page(page);
> > > > +}
> > > > +
> > > > +/**
> > > > + * migrate_page_vram_to_ram() - migrate one page from vram to ram
> > > > + *
> > > > + * @vma: The vma that the page is mapped to
> > > > + * @addr: The virtual address that the page is mapped to
> > > > + * @src_pfn: src page's page frame number
> > > > + * @dst_pfn: used to return destination page (in system ram)'s pfn
> > > > + *
> > > > + * Allocate one page in system ram and copy memory from device
> memory
> > > > + * to system ram.
> > > > + *
> > > > + * Return: 0 if this page is already in sram (no need to migrate)
> > > > + * 1: successfully migrated this page from vram to sram.
> > > > + * error code otherwise
> > > > + */
> > > > +static int migrate_page_vram_to_ram(struct vm_area_struct *vma,
> > > unsigned long addr,
> > > > +						unsigned long src_pfn,
> > > unsigned long *dst_pfn)
> > > > +{
> > >
> > > We definitely don't want to copy 1 page at a time. I touch on this in [1].
> > > Basically this is going to perform poorly unless we use larger copies; the
> > > migrate code supports non-contiguous sram addresses, and vram addresses
> > > will likely be contiguous due to the buddy allocator.
> > >
> > > [1]
> https://patchwork.freedesktop.org/patch/588548/?series=132229&rev=1
> > >
> > > > +	struct xe_mem_region *mr;
> > > > +	struct xe_tile *tile;
> > > > +	struct xe_device *xe;
> > > > +	struct device *dev;
> > > > +	dma_addr_t dma_addr = 0;
> > > > +	struct dma_fence *fence;
> > > > +	struct page *host_page;
> > > > +	struct page *src_page;
> > > > +	u64 src_dpa;
> > > > +
> > > > +	src_page = migrate_pfn_to_page(src_pfn);
> > > > +	if (unlikely(!src_page || !(src_pfn & MIGRATE_PFN_MIGRATE)))
> > >
> > > I'm going to say this is a bug if !src_page ||
> > > !is_zone_device_page(src_page) || !(src_pfn & MIGRATE_PFN_MIGRATE)
> and
> > > we return -EFAULT (or another error code if that makes more sense). We
> > > are migrating from the device where we know we have backing store from
> > > the original fault.
> > >
> > > > +		return 0;
> > > > +
> > > > +	mr = xe_page_to_mem_region(src_page);
> > > > +	tile = xe_mem_region_to_tile(mr);
> > > > +	xe = tile_to_xe(tile);
> > > > +	dev = xe->drm.dev;
> > > > +
> > > > +	src_dpa = xe_mem_region_pfn_to_dpa(mr, src_pfn);
> > > > +	host_page = alloc_host_page(dev, vma, addr, &dma_addr, dst_pfn);
> > > > +	if (!host_page)
> > > > +		return -ENOMEM;
> > > > +
> > > > +	fence = xe_migrate_pa(tile->migrate, src_dpa, true,
> > > > +						dma_addr, false, PAGE_SIZE);
> > > > +	if (IS_ERR(fence)) {
> > > > +		dma_unmap_page(dev, dma_addr, PAGE_SIZE,
> > > DMA_FROM_DEVICE);
> > > > +		free_host_page(host_page);
> > > > +		return PTR_ERR(fence);
> > > > +	}
> > > > +
> > > > +	dma_fence_wait(fence, false);
> > >
> > > Even if we did want to migrate a page at a time, we only need to wait on
> > > the last fence due to the ordered nature of exec queues.
> > >
> > > > +	dma_fence_put(fence);
> > > > +	dma_unmap_page(dev, dma_addr, PAGE_SIZE, DMA_FROM_DEVICE);
> > >
> > > With above, will likely unmap all dma pages in a single function once
> > > the last fence is signaled.
> > >
> > > > +	return 1;
> > > > +}
> > > > +
> > > > +/**
> > > > + * xe_svm_migrate_to_sram() - Migrate memory back to sram on CPU
> page
> > > fault
> > > > + *
> > > > + * @vmf: cpu vm fault structure, contains fault information such as vma
> etc.
> > > > + *
> > > > + * Note, this is in CPU's vm fault handler, caller holds the mmap read
> lock.
> > > > + *
> > > > + * This function migrates one gpu vma which contains the fault address to
> > > sram.
> > > > + * We try to maintain a 1:1 mapping b/t the CPU vma and gpu vma (i.e.,
> > > create one
> > > > + * gpu vma for one cpu vma initially and try not to split it). So this
> scheme
> > > end
> > > > + * up migrate at the vma granularity. This might not be the best
> performant
> > > scheme
> > > > + *
> > > > + * This can be tuned with a migration granularity for performance, for
> > > example,
> > > > + * migration 2M for each CPU page fault, or let user specify how much
> to
> > > migrate.
> > > > + * This is more complex due to vma splitting.
> > > > + *
> > > > + * This function should also update GPU page table, so the fault virtual
> > > address
> > > > + * points to the same sram location from GPU side. This is TBD.
> > > > + *
> > > > + * Return:
> > > > + * 0 on success
> > > > + * VM_FAULT_SIGBUS: failed to migrate page to system memory,
> > > application
> > > > + * will be signaled a SIGBUS
> > > > + */
> > > > +vm_fault_t xe_svm_migrate_to_sram(struct vm_fault *vmf)
> > > > +{
> > > > +	struct xe_mem_region *mr = xe_page_to_mem_region(vmf->page);
> > > > +	struct xe_tile *tile = xe_mem_region_to_tile(mr);
> > > > +	struct xe_device *xe = tile_to_xe(tile);
> > > > +	struct vm_area_struct *vma = vmf->vma;
> > > > +	struct mm_struct *mm = vma->vm_mm;
> > > > +	struct xe_svm *svm = xe_lookup_svm_by_mm(mm);
> > >
> > > I don't think this is needed... More below.
> > >
> > > > +	unsigned long addr = vma->vm_start;
> > > > +	u64 npages = vma_pages(vma);
> > > > +	struct xe_vma *xe_vma;
> > > > +	vm_fault_t ret = 0;
> > > > +	struct xe_vm *vm;
> > > > +	void *buf;
> > > > +	int i;
> > > > +
> > > > +	struct migrate_vma migrate_vma = {
> > > > +		.vma		= vmf->vma,
> > > > +		.start		= vma->vm_start,
> > > > +		.end		= vma->vm_end,
> > >
> > > So I know in my PoC I had the fault user pointer (xe_vma) == struct
> > > vm_area_struct of the GPU fault. That is definitely wrong. We likely
> > > want to allocate sub-range of vm_area_struct for the xe_vma, we can call
> > > this a chunk size. Logical chunks sizes would be aligned 2MB, 64k, and
> > > finally 4k in sizes trying the largest first... The chunk sizes are
> > > trivial as we likely can just have table with values, the key here is
> > > the vm_area_struct vm_start / vm_end are not what we want to use here
> > > rather xe_vma_start(vma) and xe_vma_end(vma). I think we get the
> xe_vma
> 
> After I typed this, realized I made a mistake here...
> 
> s/xe_vma_start/xe_vma_userptr/
> s/xe_vma_end/xe_vma_userptr + xe_vma_size/
> 
> But you get the idea - the zone_device_data points to Xe specific chunk
> data (currently xe_vma, could be xe_pt_state or something later if we
> switch to 1:N).
> 
> Check AMD's + Nvidia's drivers and they both use this field in a similar
> way.
> 
> > > from the faulting page vmf->page->zone_device_data field unless you have
> > > another use that field...
> >
> > You are right. Such work is planned in the memory attributes part that Himal
> is working on. We have a migration_granularity attribute which allow user to
> set the chunk size.
> >
> 
> That makes sense. The chunk size is always just a hint though, right?


I believe we should have a default chunk size, such as 2M, and the user can override it through hints.
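A sketch of the "largest aligned chunk first" selection mentioned earlier in the thread, with 2M as the preferred default (illustrative only; the function name and exact size table are assumptions, not xe code):

```c
#include <stddef.h>
#include <stdint.h>

/* Candidate chunk sizes, largest first: 2M, 64K, 4K. */
static const uint64_t chunk_sizes[] = { 0x200000, 0x10000, 0x1000 };

/* Return the largest chunk size that is naturally aligned at addr and
 * still fits before end; 0 if even a 4K chunk does not fit. */
static uint64_t pick_chunk_size(uint64_t addr, uint64_t end)
{
	for (size_t i = 0; i < sizeof(chunk_sizes) / sizeof(chunk_sizes[0]); i++) {
		uint64_t sz = chunk_sizes[i];

		if (!(addr & (sz - 1)) && addr + sz <= end)
			return sz;
	}
	return 0;
}
```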

> 
> > >
> > > I also comment on my patch with my suggestion implement chunk sizes too.
> > >
> > > > +		.pgmap_owner	= xe,
> > >
> > > Again helper for this.
> > >
> > > > +		.flags		= MIGRATE_VMA_SELECT_DEVICE_PRIVATE,
> > > > +		.fault_page = vmf->page,
> > > > +	};
> > > > +
> > > > +	buf = kvcalloc(npages, 2* sizeof(*migrate_vma.src), GFP_KERNEL);
> > > > +	migrate_vma.src = buf;
> > > > +	migrate_vma.dst = buf + npages;
> > > > +	if (migrate_vma_setup(&migrate_vma) < 0) {
> > > > +		ret = VM_FAULT_SIGBUS;
> > > > +		goto free_buf;
> > > > +	}
> > > > +
> > > > +	if (!migrate_vma.cpages)
> > >
> > > This is an error, need to set a return value.
> > >
> > > > +		goto free_buf;
> > > > +
> > >
> > > We probably should check migrate.cpages != npages too as I also think
> > > this is an error.
> > >
> > > > +	for (i = 0; i < npages; i++) {
> > > > +		ret = migrate_page_vram_to_ram(vma, addr,
> > > migrate_vma.src[i],
> > > > +							migrate_vma.dst + i);
> > > > +		if (ret < 0) {
> > > > +			ret = VM_FAULT_SIGBUS;
> > > > +			break;
> > > > +		}
> > > > +
> > > > +		/** Migration has been successful, free source page */
> > > > +		if (ret == 1) {
> > > > +			struct page *src_page =
> > > migrate_pfn_to_page(migrate_vma.src[i]);
> > > > +
> > > > +			xe_devm_page_free(src_page);
> > > > +		}
> > > > +
> > > > +		addr += PAGE_SIZE;
> > > > +	}
> > >
> > > I touch on this above, this should be reworked to roughly:
> > >
> > > - alloc pages and populate migrate_vma.dst
> > > - dma map sram pages
> > > - migrate a chunk of contiguous vram addresses at a time
> > > - wait on last dma fence from migrate
> > > - unmap dma pages
> > > - unlock and free all pages
> > >
> > > > +
> > > > +	xe_svm_for_each_vm(svm, vm) {
> > > > +		xe_assert(xe, vm->mm == mm);
> > > > +		xe_vma = xe_vm_lookup_vma(vm, vmf->address);
> > > > +		if (xe_vma)
> > > > +			xe_vm_invalidate_vma(xe_vma);
> > > > +	}
> > >
> > > I've touched on why this isn't needed... I think one of these
> > > migrate_vma_* functions will trigger all MMU notifiers registered for
> > > the range. The notifier owns the invalidate then.
> >
> > Very good point. Yes after read migrate_vma_setup function, I agree this
> function will call mmu notifiers with MMU_NOTIFY_MIGRATE event. Today we
> invalidate vma with this event. So yes, above codes are not needed.
> >
> > I do identified one potential improvement: when mmu notifier is called back
> with MMU_NOTIFY_MIGRATE event, if the migrate_vma_setup is called from
> the gpu page fault path, we can ignore the gpu vma invalidation as we will
> update gpu page table later after migration anyway. I think an page table
> invalidation is not needed in such case. But this should be just a minor
> improvement.
> >
> 
We skip invalidations if the initial_bind flag is clear, which should
cover the initial GPU fault. There is certainly room for improvement
/ optimizations in the MMU notifier though, it is kinda messy right now
too. IMO work like this can be done once the basic design is working +
tests are in place to verify changes / optimizations.
> 

agreed

> >
> > >
> > > Beyond this, maybe I'm confused about a few things and how this fits all
> > > together. Doesn't every user process have its own unique mm, fd, and vm
> > > (e.g. own address space)? If a user want a shared address space then use
> > > threads with a single mm, fd, and vm.
> >
> > Yes, this is also my understanding. Each user process has its own mm struct
> and device fds.
> >
> > In a shared address space case, such as there are multiple pthread created
> through pthread_create in one process, all those pthreads should have different
> kernel task_struct, but all those task_struct (say we get it from "current" macro)
> should share one same mm struct, which means they all lives inside one cpu
> address space.
> >
> > Now with this work, we are now basically extending this shared cpu address
> space to gpu program. So both cpu program and gpu program can share one
> address space.
> >
> > Since we allow user to create multiple gpu vm for one device (lets focus on
> one device for now), so each shared address space can have multiple gpu vm...
> each gpuvm should be able to register its own mmu notifier to core mm, even
> if those notifier has the same address range. But I will have to test this out. If
> all this works, above codes are not needed. If different gpuvm can't register
> mmu notifier for same address range, then we would need a fix....
> >
> 
> The mmu notifier code is implemented with an interval tree which
> supports overlapping ranges (i.e. we can have multiple VMs register
> notifiers with the same address range in a single MM).

Ok, that is great. I will delete the xe_svm struct.

Oak

> 
> >
> > >
> > > So even if we had to resolve the xe_vma's here and do an invalidate here
> > > very confused what this is doing. This is this the case with multiple
> > > devices and each VM points to a different device?
> >
> > Right now I only focus on single device. See above. This is to solve one gpu
> device but multiple gpu vm case. But as said above, for now I don't think this is
> needed. I need to test more on the mmu notifier behavior: whether it allow us
> to insert two notifiers for the same range for one mm....
> >
> 
> Agree that our focus should be on a single device now. If that design is
> well thought out I don't think extending this to multiple devices will
> be a huge change either.
> 
> Matt
> 
> > Oak
> >
> > Again so that case I
> > > don't think a xe_svm structure would be needed, on GPU fault we should
> > > be able to detect from the faulting page zone_device_data and pgmap owner
> > > if the fault already has a physical backing on another GPU and resolve
> > > how to map it into GPU with a fault... Jason suggests this in the
> > > following thread [2] and I think I agree with him.
> > >
> > > [2] https://lore.kernel.org/all/5495090e-dee1-4c8e-91bc-
> > > 240632fd3e35@amd.com/T/
> > >
> > > > +	migrate_vma_pages(&migrate_vma);
> > >
> > > This logic is going to change but ...
> > >
> > > On an error I think we only want to call migrate_vma_finalize to revert
> > > pages back to the original state (i.e. migrate_vma_pages commits the
> > > page changes which we don't want to do on an error).
> > >
> > > > +	migrate_vma_finalize(&migrate_vma);
> > > > +free_buf:
> > > > +	kvfree(buf);
> > > > +	return 0;
> > >
> > > I don't think 0 should blindly be return here, if there is an error
> > > return VM_FAULT_SIGBUS. We likely want a high level error message too.
> > >
> > > Matt
> > >
> > > > +}
> > > > --
> > > > 2.26.3
> > > >


* RE: [v2 28/31] drm/xe/svm: Introduce helper to migrate vma to vram
  2024-04-11  2:49   ` Matthew Brost
@ 2024-04-12 21:21     ` Zeng, Oak
  2024-04-15 19:40       ` Matthew Brost
  0 siblings, 1 reply; 72+ messages in thread
From: Zeng, Oak @ 2024-04-12 21:21 UTC (permalink / raw)
  To: Brost, Matthew
  Cc: intel-xe@lists.freedesktop.org, Ghimiray, Himal Prasad,
	Bommu, Krishnaiah, Thomas.Hellstrom@linux.intel.com, Welty, Brian



> -----Original Message-----
> From: Brost, Matthew <matthew.brost@intel.com>
> Sent: Wednesday, April 10, 2024 10:49 PM
> To: Zeng, Oak <oak.zeng@intel.com>
> Cc: intel-xe@lists.freedesktop.org; Ghimiray, Himal Prasad
> <himal.prasad.ghimiray@intel.com>; Bommu, Krishnaiah
> <krishnaiah.bommu@intel.com>; Thomas.Hellstrom@linux.intel.com; Welty,
> Brian <brian.welty@intel.com>
> Subject: Re: [v2 28/31] drm/xe/svm: Introduce helper to migrate vma to vram
> 
> On Tue, Apr 09, 2024 at 04:17:39PM -0400, Oak Zeng wrote:
> > Introduce a helper function xe_svm_migrate_vma_to_vram.
> >
> > Since the source pages of the svm range can be physically not
> > contiguous, and the destination vram pages can also be not
> > contiguous, there is no easy way to migrate multiple pages per
> > blitter command. We do page by page migration for now.
> >
> > Migration is best effort. Even if we fail to migrate some pages,
> > we will try to migrate the rest pages.
> >
> > FIXME: Use one blitter command to copy when both src and dst are
> > physically contiguous
> >
> 
> Yep, touched on this throughout the series. Only vram needs to be
> contiguous though as we dynamically create PT mappings for sram pages in
> the migrate code. Getting this in is a must and it should be done immediately
> IMO as this is a very, very basic performance thing we know needs to be done.
> We will likely have to tune this code quite a bit for performance so
> getting known things done would be helpful.
> 
> > FIXME: when a vma is partially migrated, split vma as we assume
> > no mixture vma placement.
> >
> 
> Agree we do not want to support partial migrations. We likely want to
> return -EAGAIN for something and fall back to a smaller xe_vma chunk
> size which I discussed in [1] and added a comment on in [2].
> 
> Migration should be best effort too; if we fail to migrate we can always
> leave the backing store in sram.
> 
> I do have question though, when do we get partial migrations? A user
> having called mlock on some of the pages? I just want to make sure I
> fully understand that case.

Yah, mlock could be one case...

I also looked at the hmm code. There are a few other cases where MIGRATE_PFN_MIGRATE is not set (so we skip migration afterwards), such as when the pte is NULL, when the vma is file-backed (not anonymous), or when the entry is swapped out to hard disk, etc. See more details in function migrate_vma_collect_pmd.
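To make the skip cases concrete: after migrate_vma_setup() returns, each src[] entry either has MIGRATE_PFN_MIGRATE set or must be skipped. A minimal userspace model of that filtering (the flag values mirror include/linux/migrate.h; migrate_vma itself is kernel-only, so this is only an illustration):

```c
#include <stddef.h>

/* Flag bits as defined in include/linux/migrate.h. */
#define MIGRATE_PFN_VALID	(1UL << 0)
#define MIGRATE_PFN_MIGRATE	(1UL << 1)

/* Count the src[] entries that migrate_vma_setup() marked migratable.
 * Entries for mlock'ed, file-backed, or swapped-out pages lack
 * MIGRATE_PFN_MIGRATE; a caller can skip them or retry with a smaller
 * chunk size. */
static size_t count_migratable(const unsigned long *src, size_t npages)
{
	size_t n = 0;

	for (size_t i = 0; i < npages; i++)
		if (src[i] & MIGRATE_PFN_MIGRATE)
			n++;
	return n;
}
```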


> 
> [1] https://patchwork.freedesktop.org/patch/588526/?series=132229&rev=1
> [2] https://patchwork.freedesktop.org/patch/588528/?series=132229&rev=1
> 
> > Signed-off-by: Oak Zeng <oak.zeng@intel.com>
> > Co-developed-by: Niranjana Vishwanathapura
> <niranjana.vishwanathapura@intel.com>
> > Signed-off-by: Niranjana Vishwanathapura
> <niranjana.vishwanathapura@intel.com>
> > Cc: Matthew Brost <matthew.brost@intel.com>
> > Cc: Thomas Hellström <thomas.hellstrom@intel.com>
> > Cc: Brian Welty <brian.welty@intel.com>
> > ---
> >  drivers/gpu/drm/xe/xe_svm.h         |   2 +
> >  drivers/gpu/drm/xe/xe_svm_migrate.c | 115
> ++++++++++++++++++++++++++++
> 
> Same comment on file structure throughout the series apply here too.
> 
> >  2 files changed, 117 insertions(+)
> >
> > diff --git a/drivers/gpu/drm/xe/xe_svm.h b/drivers/gpu/drm/xe/xe_svm.h
> > index c9e4239c44b4..18ce2e3757c5 100644
> > --- a/drivers/gpu/drm/xe/xe_svm.h
> > +++ b/drivers/gpu/drm/xe/xe_svm.h
> > @@ -83,4 +83,6 @@ int xe_devm_alloc_pages(struct xe_tile *tile,
> >  void xe_devm_free_blocks(struct list_head *blocks);
> >  void xe_devm_page_free(struct page *page);
> >  vm_fault_t xe_svm_migrate_to_sram(struct vm_fault *vmf);
> > +int xe_svm_migrate_vma_to_vram(struct xe_vm *vm, struct xe_vma *vma,
> > +							struct xe_tile *tile);
> >  #endif
> > diff --git a/drivers/gpu/drm/xe/xe_svm_migrate.c
> b/drivers/gpu/drm/xe/xe_svm_migrate.c
> > index 0db831af098e..ab8dd1f58aa4 100644
> > --- a/drivers/gpu/drm/xe/xe_svm_migrate.c
> > +++ b/drivers/gpu/drm/xe/xe_svm_migrate.c
> > @@ -220,3 +220,118 @@ vm_fault_t xe_svm_migrate_to_sram(struct
> vm_fault *vmf)
> >  	kvfree(buf);
> >  	return 0;
> >  }
> > +
> > +/**
> > + * xe_svm_migrate_vma_to_vram() - migrate backing store of a vma to
> vram
> > + * Must be called with mmap_read_lock held.
> > + * @vm: the vm that the vma belongs to
> > + * @vma: the vma to migrate.
> > + * @tile: the destination tile which holds the new backing store of the range
> > + *
> > + * Returns: negative errno on failure, 0 on success
> > + */
> > +int xe_svm_migrate_vma_to_vram(struct xe_vm *vm,
> > +							struct xe_vma *vma,
> > +							struct xe_tile *tile)
> > +{
> > +	struct mm_struct *mm = vm->mm;
> > +	unsigned long start = xe_vma_start(vma);
> > +	unsigned long end = xe_vma_end(vma);
> > +	unsigned long npages = (end - start) >> PAGE_SHIFT;
> > +	struct xe_mem_region *mr = &tile->mem.vram;
> > +	struct vm_area_struct *vas;
> > +
> > +	struct migrate_vma migrate = {
> > +		.start		= start,
> > +		.end		= end,
> > +		.pgmap_owner	= tile->xe,
> 
> Again helper to assign owner.
> 
> > +		.flags          = MIGRATE_VMA_SELECT_SYSTEM,
> > +	};
> > +	struct device *dev = tile->xe->drm.dev;
> > +	dma_addr_t *src_dma_addr;
> > +	struct dma_fence *fence;
> > +	struct page *src_page;
> > +	LIST_HEAD(blocks);
> > +	int ret = 0, i;
> > +	u64 dst_dpa;
> > +	void *buf;
> > +
> > +	mmap_assert_locked(mm);
> 
> This mmap_assert_locked is ambiguous, we should make it clear if this is
> read or write locked. Doesn't it have to be write to do the migrate
> pages?

I followed the hmm document (Documentation/mm/hmm.rst), see section "Migration to and from device memory". It explicitly shows a read_lock in this document.

I believe a read_lock is enough for the migrate_vma_setup/migrate_vma_finalize().

As I understand it, the mm.mmap_lock protects the process's address space. When we modify the process's address space, such as in mmap/munmap, we need to hold the lock in write mode; if we only read the process's address space, such as in migrate_vma_setup/finalize, or in the cpu page fault handler case, we only need the lock in read mode.

I also cross-checked the amd driver. They also use a read lock... see function svm_range_restore_pages in kfd_svm.c.


> 
> A larger question about the locking... The CPU fault handler holds the
> mmap lock in write mode, right?

No. Since the cpu fault handler doesn't modify the process address space, and instead only fills up the cpu page table for some valid address range, a read lock is enough.

> 
> I'm asking as basically I think at least initially we want to hold the
> mmap lock in a way that the GPU handler and CPU handler do not race.
> i.e. From fault userptr create in GPU fault handler to issuing the bind
> we prevent the CPU fault handler from running.

Yes, we hold mmap_read_lock in both the cpu and gpu fault handlers to avoid that race.

In user mmap/munmap (such as the kernel function vm_munmap), we hold mmap_write_lock, which prevents it from racing with the cpu and gpu fault handlers.


> 
> I'm having issues figuring out how to prevent races between initial
> binds of userptrs and userptr invalidates on faulting VMs. This race is
> seen in xe_exec_fault_mode for example... So preventing races between
> CPU / GPU fault handler with the mmap probably is a good idea initially.
> Likely can make the locking finer grained once this is all working and I
> figure out how to handle this race better.


I don't quite follow here.

Initial bind of a userptr... If you meant the bind in the gpu page fault handler, then the race with invalidation goes roughly like below:
Invalidation is called with mmap_write_lock held.
In userptr_pin_page, we hold mmap_read_lock, so we know that during page pinning, invalidation is excluded.
After pinning, before programming the gpu page table, we check whether an invalidation happened *after the last pin but before programming the page table*; if one happened, we retry.



Oak

> 
> > +
> > +	vas = find_vma_intersection(mm, start, start + 4);
> 
> find_vma should work fine here.
> 
> > +	if (!vas)
> > +		return -ENOENT;
> > +
> > +	migrate.vma = vas;
> > +	buf = kvcalloc(npages, 2* sizeof(*migrate.src) + sizeof(*src_dma_addr),
> > +					GFP_KERNEL);
> > +	if(!buf)
> > +		return -ENOMEM;
> > +	migrate.src = buf;
> > +	migrate.dst = migrate.src + npages;
> > +	src_dma_addr = (dma_addr_t *) (migrate.dst + npages);
> > +	ret = xe_devm_alloc_pages(tile, npages, &blocks, migrate.dst);
> 
> Again as I discussed in [3] I think this should be broken out into a
> different step with the blocks allocated before this, and here just
> populate migrate.dst from the existing blocks.
> 
> [3] https://patchwork.freedesktop.org/patch/588523/?series=132229&rev=1
> 
> > +	if (ret)
> > +		goto kfree_buf;
> > +
> > +	ret = migrate_vma_setup(&migrate);
> > +	if (ret) {
> > +		drm_err(&tile->xe->drm, "vma setup returned %d for range
> [%lx - %lx]\n",
> > +				ret, start, end);
> > +		goto free_dst_pages;
> > +	}
> > +
> > +	/**FIXME: partial migration of a range print a warning for now.
> > +	 * If this message is printed, we need to split xe_vma as we
> > +	 * don't support a mixture placement of one vma
> > +	 */
> > +	if (migrate.cpages != npages)
> > +		drm_warn(&tile->xe->drm, "Partial migration for range [%lx -
>  %lx], range is %ld pages, migrate only %ld pages\n",
> > +				start, end, npages, migrate.cpages);
> 
> As discussed above, we shouldn't support this. We should fall back to
> smaller xe_vma chunk size until we find one that works or simply leave
> the pages in sram and map those pages to GPU.
> 
> > +
> > +	/**Migrate page by page for now.
> > +	 * Both source pages and destination pages can physically not
> contiguous,
> > +	 * there is no good way to migrate multiple pages per blitter command.
> > +	 */
> 
> Touched on this a bunch throughout the series, let's do better than a
> page-at-a-time migration.
> 
> Algorithm should be very similar to what I discussed here [4] but with a
> few key differences.
> 
> - I think the sram pages can be unpopulated (page == NULL) if the user
>   has not yet touched the page
> - Also I think the MIGRATE_PFN_MIGRATE bit being clear is valid
> 
> In these cases this indicates we have to issue a copy for the pages we
> are accumulating with contiguous vram addresses.
> 
> [4] https://patchwork.freedesktop.org/patch/588526/?series=132229&rev=1
> 
> > +	for (i = 0; i < npages; i++) {
> > +		src_page = migrate_pfn_to_page(migrate.src[i]);
> > +		if (unlikely(!src_page || !(migrate.src[i] &
> MIGRATE_PFN_MIGRATE)))
> 
> Discussed this in the CPU fault patch, once we call migrate_vma_setup,
> on subsequent errors we need to call migrate_vma_finalize to revert the
> pages to the original state. At least I think if I am reading the doc
> after this correctly.
> 
> Here on error we just free the pages...
> 
> Matt
> 
> > +			goto free_dst_page;
> > +
> > +		xe_assert(tile->xe, !is_zone_device_page(src_page));
> > +		src_dma_addr[i] = dma_map_page(dev, src_page, 0,
> PAGE_SIZE, DMA_TO_DEVICE);
> > +		if (unlikely(dma_mapping_error(dev, src_dma_addr[i]))) {
> > +			drm_warn(&tile->xe->drm, "dma map error for host
> pfn %lx\n", migrate.src[i]);
> > +			goto free_dst_page;
> > +		}
> > +		dst_dpa = xe_mem_region_pfn_to_dpa(mr, migrate.dst[i]);
> > +		fence = xe_migrate_pa(tile->migrate, src_dma_addr[i], false,
> > +				dst_dpa, true, PAGE_SIZE);
> > +		if (IS_ERR(fence)) {
> > +			drm_warn(&tile->xe->drm, "migrate host page
> (pfn: %lx) to vram failed\n",
> > +					migrate.src[i]);
> > +			/**Migration is best effort. Even we failed here, we
> continue*/
> > +			goto free_dst_page;
> > +		}
> > +		/**FIXME: Use the first migration's out fence as the second
> migration's input fence,
> > +		 * and so on. Only wait the out fence of last migration?
> > +		 */
> > +		dma_fence_wait(fence, false);
> > +		dma_fence_put(fence);
> > +free_dst_page:
> > +		xe_devm_page_free(pfn_to_page(migrate.dst[i]));
> > +	}
> > +
> > +	for (i = 0; i < npages; i++)
> > +		if (!(dma_mapping_error(dev, src_dma_addr[i])))
> > +			dma_unmap_page(dev, src_dma_addr[i], PAGE_SIZE,
> DMA_TO_DEVICE);
> > +
> > +	migrate_vma_pages(&migrate);
> > +	migrate_vma_finalize(&migrate);
> > +free_dst_pages:
> > +	if (ret)
> > +		xe_devm_free_blocks(&blocks);
> > +kfree_buf:
> > +	kfree(buf);
> > +	return ret;
> > +}
> > --
> > 2.26.3
> >

^ permalink raw reply	[flat|nested] 72+ messages in thread

* Re: [v2 28/31] drm/xe/svm: Introduce helper to migrate vma to vram
  2024-04-12 21:21     ` Zeng, Oak
@ 2024-04-15 19:40       ` Matthew Brost
  2024-06-07 17:12         ` Zeng, Oak
  0 siblings, 1 reply; 72+ messages in thread
From: Matthew Brost @ 2024-04-15 19:40 UTC (permalink / raw)
  To: Zeng, Oak
  Cc: intel-xe@lists.freedesktop.org, Ghimiray, Himal Prasad,
	Bommu, Krishnaiah, Thomas.Hellstrom@linux.intel.com, Welty, Brian

On Fri, Apr 12, 2024 at 03:21:04PM -0600, Zeng, Oak wrote:
> 
> 
> > -----Original Message-----
> > From: Brost, Matthew <matthew.brost@intel.com>
> > Sent: Wednesday, April 10, 2024 10:49 PM
> > To: Zeng, Oak <oak.zeng@intel.com>
> > Cc: intel-xe@lists.freedesktop.org; Ghimiray, Himal Prasad
> > <himal.prasad.ghimiray@intel.com>; Bommu, Krishnaiah
> > <krishnaiah.bommu@intel.com>; Thomas.Hellstrom@linux.intel.com; Welty,
> > Brian <brian.welty@intel.com>
> > Subject: Re: [v2 28/31] drm/xe/svm: Introduce helper to migrate vma to vram
> > 
> > On Tue, Apr 09, 2024 at 04:17:39PM -0400, Oak Zeng wrote:
> > > Introduce a helper function xe_svm_migrate_vma_to_vram.
> > >
> > > Since the source pages of the svm range can be physically not
> > > contiguous, and the destination vram pages can also be not
> > > contiguous, there is no easy way to migrate multiple pages per
> > > blitter command. We do page by page migration for now.
> > >
> > > Migration is best effort. Even if we fail to migrate some pages,
> > > we will try to migrate the rest pages.
> > >
> > > FIXME: Use one blitter command to copy when both src and dst are
> > > physically contiguous
> > >
> > 
> > Yep, touched on this throughout the series. Only vram needs to be
> > contiguous, though, as we dynamically create PT mappings for sram pages in
> > the migrate code. Getting this in is a must and should be done immediately
> > IMO, as this is a very, very basic performance thing we know needs to be done.
> > We will likely have to tune this code quite a bit for performance, so
> > getting known things done would be helpful.
> > 
> > > FIXME: when a vma is partially migrated, split vma as we assume
> > > no mixture vma placement.
> > >
> > 
> > Agree, we do not want to support partial migrations. We likely want to
> > return -EAGAIN for something and fall back to a smaller xe_vma chunk
> > size, which I discussed in [1] and commented on in [2].
> > 
> > Migration should be best effort too: if we fail to migrate we can always
> > leave the backing store in sram.
> > 
> > I do have question though, when do we get partial migrations? A user
> > having called mlock on some of the pages? I just want to make sure I
> > fully understand that case.
> 
> Yah, mlock could be one case...
> 
> I also looked at the hmm code. There are a few other cases where MIGRATE_PFN_MIGRATE is not set (so we skip migration afterwards), such as when the pte is NULL, the vma is file-backed (not anonymous), or the entry is swapped out to disk; see more details in function migrate_vma_collect_pmd.
> 
> 
> > 
> > [1] https://patchwork.freedesktop.org/patch/588526/?series=132229&rev=1
> > [2] https://patchwork.freedesktop.org/patch/588528/?series=132229&rev=1
> > 
> > > Signed-off-by: Oak Zeng <oak.zeng@intel.com>
> > > Co-developed-by: Niranjana Vishwanathapura
> > <niranjana.vishwanathapura@intel.com>
> > > Signed-off-by: Niranjana Vishwanathapura
> > <niranjana.vishwanathapura@intel.com>
> > > Cc: Matthew Brost <matthew.brost@intel.com>
> > > Cc: Thomas Hellström <thomas.hellstrom@intel.com>
> > > Cc: Brian Welty <brian.welty@intel.com>
> > > ---
> > >  drivers/gpu/drm/xe/xe_svm.h         |   2 +
> > >  drivers/gpu/drm/xe/xe_svm_migrate.c | 115
> > ++++++++++++++++++++++++++++
> > 
> > Same comment on file structure throughout the series apply here too.
> > 
> > >  2 files changed, 117 insertions(+)
> > >
> > > diff --git a/drivers/gpu/drm/xe/xe_svm.h b/drivers/gpu/drm/xe/xe_svm.h
> > > index c9e4239c44b4..18ce2e3757c5 100644
> > > --- a/drivers/gpu/drm/xe/xe_svm.h
> > > +++ b/drivers/gpu/drm/xe/xe_svm.h
> > > @@ -83,4 +83,6 @@ int xe_devm_alloc_pages(struct xe_tile *tile,
> > >  void xe_devm_free_blocks(struct list_head *blocks);
> > >  void xe_devm_page_free(struct page *page);
> > >  vm_fault_t xe_svm_migrate_to_sram(struct vm_fault *vmf);
> > > +int xe_svm_migrate_vma_to_vram(struct xe_vm *vm, struct xe_vma *vma,
> > > +							struct xe_tile *tile);
> > >  #endif
> > > diff --git a/drivers/gpu/drm/xe/xe_svm_migrate.c
> > b/drivers/gpu/drm/xe/xe_svm_migrate.c
> > > index 0db831af098e..ab8dd1f58aa4 100644
> > > --- a/drivers/gpu/drm/xe/xe_svm_migrate.c
> > > +++ b/drivers/gpu/drm/xe/xe_svm_migrate.c
> > > @@ -220,3 +220,118 @@ vm_fault_t xe_svm_migrate_to_sram(struct
> > vm_fault *vmf)
> > >  	kvfree(buf);
> > >  	return 0;
> > >  }
> > > +
> > > +/**
> > > + * xe_svm_migrate_vma_to_vram() - migrate backing store of a vma to
> > vram
> > > + * Must be called with mmap_read_lock held.
> > > + * @vm: the vm that the vma belongs to
> > > + * @vma: the vma to migrate.
> > > + * @tile: the destination tile which holds the new backing store of the range
> > > + *
> > > + * Returns: negative errno on failure, 0 on success
> > > + */
> > > +int xe_svm_migrate_vma_to_vram(struct xe_vm *vm,
> > > +							struct xe_vma *vma,
> > > +							struct xe_tile *tile)
> > > +{
> > > +	struct mm_struct *mm = vm->mm;
> > > +	unsigned long start = xe_vma_start(vma);
> > > +	unsigned long end = xe_vma_end(vma);
> > > +	unsigned long npages = (end - start) >> PAGE_SHIFT;
> > > +	struct xe_mem_region *mr = &tile->mem.vram;
> > > +	struct vm_area_struct *vas;
> > > +
> > > +	struct migrate_vma migrate = {
> > > +		.start		= start,
> > > +		.end		= end,
> > > +		.pgmap_owner	= tile->xe,
> > 
> > Again helper to assign owner.
> > 
> > > +		.flags          = MIGRATE_VMA_SELECT_SYSTEM,
> > > +	};
> > > +	struct device *dev = tile->xe->drm.dev;
> > > +	dma_addr_t *src_dma_addr;
> > > +	struct dma_fence *fence;
> > > +	struct page *src_page;
> > > +	LIST_HEAD(blocks);
> > > +	int ret = 0, i;
> > > +	u64 dst_dpa;
> > > +	void *buf;
> > > +
> > > +	mmap_assert_locked(mm);
> > 
> > This mmap_assert_locked is ambiguous, we should make it clear if this
> > is read or write locked. Doesn't it have to be write locked to do the
> > migrate pages?
> 
> I followed the hmm document (Documentation/mm/hmm.rst); see the section "Migration to and from device memory". It explicitly shows a read_lock in this document.
> 
> I believe a read_lock is enough for the migrate_vma_setup/migrate_vma_finalize().
> 
> As I understand it, mm.mmap_lock protects the process's address space. When we modify the process's address space, such as in mmap/munmap, we need to hold the lock in write mode; if we only read the process's address space, such as in migrate_vma_setup/finalize or in the cpu page fault handler, we only need the lock in read mode.
> 
> I also cross-checked the amd driver. It also uses a read lock; see function svm_range_restore_pages in kfd_svm.c.
> 

Yeah, I see that too. I'm trying to figure out the locking; IMO the locking
document might actually be wrong, or at the very least the locking design
is ill-conceived. We can discuss internally a bit before I publicly share
my grievances.

> 
> > 
> > A larger question about the locking... The CPU fault handler holds the
> > mmap lock in write mode, right?
> 
> No. Since the cpu fault handler doesn't modify the process address space, and instead only fills in the cpu page table for some valid address range, a read lock is enough.
> 

Ah, yes after digging around a bit I see this.

> > 
> > I'm asking as basically I think at least initially we want to hold the
> > mmap lock in a way that the GPU handler and CPU handler do not race.
> > i.e. From fault userptr create in GPU fault handler to issuing the bind
> > we prevent the CPU fault handler from running.
> 
> Yes, we hold mmap_read_lock in both the cpu and gpu fault handlers to avoid that race.
>

That's not how rw locks work. Two threads can both hold the read lock in
parallel (shared read access); only one thread can hold the write lock
(exclusive write access, during which no one can hold the read lock either).
Thus my concern about the cpu and gpu fault handlers running in parallel,
and the larger locking design questions. Again, we can talk through this in
detail internally.
 
> In user mmap/munmap (such as the kernel function vm_munmap), we hold mmap_write_lock, which prevents it from racing with the cpu and gpu fault handlers.
> 
> 
> > 
> > I'm having issues figuring out how to prevent races between initial
> > binds of userptrs and userptr invalidates on faulting VMs. This race is
> > seen in xe_exec_fault_mode for example... So preventing races between
> > CPU / GPU fault handler with the mmap probably is a good idea initially.
> > Likely can make the locking finer grained once this is all working and I
> > figure out how to handle this race better.
> 
> 
> I don't quite follow here. 
> 
> Initial bind of a userptr... If you meant the bind in the gpu page fault handler, then the race with invalidation goes roughly like below:
> Invalidation is called with mmap_write_lock held.

Is it? If the notifier does the invalidation via migrate_vma_setup in the
CPU fault handler, we established above that only the mmap_read_lock is
held.

> In userptr_pin_page, we hold mmap_read_lock, so we know that during page pinning, invalidation is excluded.

Nope, see above: invalidation can happen while userptr_pin_page is
executing because of the read lock. The seqno check (described below) is
what prevents programming of bad page tables.

> After pinning, before programming the gpu page table, we check whether an invalidation happened *after the last pin but before programming the page table*; if one happened, we retry.
>

Yes, that is how it works on tip, but I am refactoring it in [1]. I was
trying to avoid the retry loop by turning PDE/PTE writes into clears if an
invalidation happened, but I'm not sure that works without a larger
refactor, due to the nature of PDEs being shared. I may need the retry
loop, but that also gets tricky with an array of binds... A few options
here; I will land on a solution once [2] is merged.

Regardless, my point here is still valid: it may not be the worst idea,
when getting this code initially up and running, just to grab
mmap_write_lock in the GPU fault handler as a BKL (big kernel lock) to
prevent all races. Once the code is stable and stress testing is in place,
switch to finer-grained locking as defined in the HMM document (or newly
defined, if we find the locking design is insufficient).

Matt

[1] https://patchwork.freedesktop.org/series/125608/
[2] https://patchwork.freedesktop.org/series/132246/

> 
> 
> Oak
> 
> > 
> > > +
> > > +	vas = find_vma_intersection(mm, start, start + 4);
> > 
> > find_vma should work fine here.
> > 
> > > +	if (!vas)
> > > +		return -ENOENT;
> > > +
> > > +	migrate.vma = vas;
> > > +	buf = kvcalloc(npages, 2* sizeof(*migrate.src) + sizeof(*src_dma_addr),
> > > +					GFP_KERNEL);
> > > +	if(!buf)
> > > +		return -ENOMEM;
> > > +	migrate.src = buf;
> > > +	migrate.dst = migrate.src + npages;
> > > +	src_dma_addr = (dma_addr_t *) (migrate.dst + npages);
> > > +	ret = xe_devm_alloc_pages(tile, npages, &blocks, migrate.dst);
> > 
> > Again as I discussed in [3] I think this should be broken out into a
> > different step with the blocks allocated before this, and here just
> > populate migrate.dst from the existing blocks.
> > 
> > [3] https://patchwork.freedesktop.org/patch/588523/?series=132229&rev=1
> > 
> > > +	if (ret)
> > > +		goto kfree_buf;
> > > +
> > > +	ret = migrate_vma_setup(&migrate);
> > > +	if (ret) {
> > > +		drm_err(&tile->xe->drm, "vma setup returned %d for range
> > [%lx - %lx]\n",
> > > +				ret, start, end);
> > > +		goto free_dst_pages;
> > > +	}
> > > +
> > > +	/**FIXME: partial migration of a range print a warning for now.
> > > +	 * If this message is printed, we need to split xe_vma as we
> > > +	 * don't support a mixture placement of one vma
> > > +	 */
> > > +	if (migrate.cpages != npages)
> > > +		drm_warn(&tile->xe->drm, "Partial migration for range [%lx -
> >  %lx], range is %ld pages, migrate only %ld pages\n",
> > > +				start, end, npages, migrate.cpages);
> > 
> > As discussed above, we shouldn't support this. We should fall back to
> > smaller xe_vma chunk size until we find one that works or simply leave
> > the pages in sram and map those pages to GPU.
> > 
> > > +
> > > +	/**Migrate page by page for now.
> > > +	 * Both source pages and destination pages can physically not
> > contiguous,
> > > +	 * there is no good way to migrate multiple pages per blitter command.
> > > +	 */
> > 
> > Touched on this a bunch throughout the series, let's do better than a
> > page-at-a-time migration.
> > 
> > Algorithm should be very similar to what I discussed here [4] but with a
> > few key differences.
> > 
> > - I think the sram pages can be unpopulated (page == NULL) if the user
> >   has not yet touched the page
> > - Also I think the MIGRATE_PFN_MIGRATE bit being clear is valid
> > 
> > In these cases this indicates we have to issue a copy for the pages we
> > are accumulating with contiguous vram addresses.
> > 
> > [4] https://patchwork.freedesktop.org/patch/588526/?series=132229&rev=1
> > 
> > > +	for (i = 0; i < npages; i++) {
> > > +		src_page = migrate_pfn_to_page(migrate.src[i]);
> > > +		if (unlikely(!src_page || !(migrate.src[i] &
> > MIGRATE_PFN_MIGRATE)))
> > 
> > Discussed this in the CPU fault patch, once we call migrate_vma_setup,
> > on subsequent errors we need to call migrate_vma_finalize to revert the
> > pages to the original state. At least I think if I am reading the doc
> > after this correctly.
> > 
> > Here on error we just free the pages...
> > 
> > Matt
> > 
> > > +			goto free_dst_page;
> > > +
> > > +		xe_assert(tile->xe, !is_zone_device_page(src_page));
> > > +		src_dma_addr[i] = dma_map_page(dev, src_page, 0,
> > PAGE_SIZE, DMA_TO_DEVICE);
> > > +		if (unlikely(dma_mapping_error(dev, src_dma_addr[i]))) {
> > > +			drm_warn(&tile->xe->drm, "dma map error for host
> > pfn %lx\n", migrate.src[i]);
> > > +			goto free_dst_page;
> > > +		}
> > > +		dst_dpa = xe_mem_region_pfn_to_dpa(mr, migrate.dst[i]);
> > > +		fence = xe_migrate_pa(tile->migrate, src_dma_addr[i], false,
> > > +				dst_dpa, true, PAGE_SIZE);
> > > +		if (IS_ERR(fence)) {
> > > +			drm_warn(&tile->xe->drm, "migrate host page
> > (pfn: %lx) to vram failed\n",
> > > +					migrate.src[i]);
> > > +			/**Migration is best effort. Even we failed here, we
> > continue*/
> > > +			goto free_dst_page;
> > > +		}
> > > +		/**FIXME: Use the first migration's out fence as the second
> > migration's input fence,
> > > +		 * and so on. Only wait the out fence of last migration?
> > > +		 */
> > > +		dma_fence_wait(fence, false);
> > > +		dma_fence_put(fence);
> > > +free_dst_page:
> > > +		xe_devm_page_free(pfn_to_page(migrate.dst[i]));
> > > +	}
> > > +
> > > +	for (i = 0; i < npages; i++)
> > > +		if (!(dma_mapping_error(dev, src_dma_addr[i])))
> > > +			dma_unmap_page(dev, src_dma_addr[i], PAGE_SIZE,
> > DMA_TO_DEVICE);
> > > +
> > > +	migrate_vma_pages(&migrate);
> > > +	migrate_vma_finalize(&migrate);
> > > +free_dst_pages:
> > > +	if (ret)
> > > +		xe_devm_free_blocks(&blocks);
> > > +kfree_buf:
> > > +	kfree(buf);
> > > +	return ret;
> > > +}
> > > --
> > > 2.26.3
> > >

^ permalink raw reply	[flat|nested] 72+ messages in thread

* RE: [v2 22/31] drm/xe/svm: implement functions to allocate and free device memory
  2024-04-10 22:23   ` Matthew Brost
@ 2024-04-15 20:13     ` Zeng, Oak
  2024-04-15 21:19       ` Matthew Brost
  2024-06-05 22:16     ` Zeng, Oak
  1 sibling, 1 reply; 72+ messages in thread
From: Zeng, Oak @ 2024-04-15 20:13 UTC (permalink / raw)
  To: Brost, Matthew
  Cc: intel-xe@lists.freedesktop.org, Ghimiray, Himal Prasad,
	Bommu, Krishnaiah, Thomas.Hellstrom@linux.intel.com, Welty, Brian

Hi Matt,

> -----Original Message-----
> From: Brost, Matthew <matthew.brost@intel.com>
> Sent: Wednesday, April 10, 2024 6:24 PM
> To: Zeng, Oak <oak.zeng@intel.com>
> Cc: intel-xe@lists.freedesktop.org; Ghimiray, Himal Prasad
> <himal.prasad.ghimiray@intel.com>; Bommu, Krishnaiah
> <krishnaiah.bommu@intel.com>; Thomas.Hellstrom@linux.intel.com; Welty,
> Brian <brian.welty@intel.com>
> Subject: Re: [v2 22/31] drm/xe/svm: implement functions to allocate and free
> device memory
> 
> On Tue, Apr 09, 2024 at 04:17:33PM -0400, Oak Zeng wrote:
> > Function xe_devm_alloc_pages allocate pages from drm buddy and perform
> > house keeping work for all the pages allocated, such as get a page
> > refcount, keep a bitmap of all pages to denote whether a page is in
> > use, put pages to a drm lru list for eviction purpose.
> >
> > Function xe_devm_free_blocks return list of memory blocks to drm buddy
> > allocator.
> >
> > Function xe_devm_free_page is a call back function from hmm layer. It
> > is called whenever a page's refcount reaches to 1. This function clears
> > the bit of this page in the bitmap. If all the bits in the bitmap is
> > cleared, it means all the pages have been freed, we return all the pages
> > in this memory block back to drm buddy.
> >
> > Signed-off-by: Oak Zeng <oak.zeng@intel.com>
> > Co-developed-by: Niranjana Vishwanathapura
> <niranjana.vishwanathapura@intel.com>
> > Signed-off-by: Niranjana Vishwanathapura
> <niranjana.vishwanathapura@intel.com>
> > Cc: Matthew Brost <matthew.brost@intel.com>
> > Cc: Thomas Hellström <thomas.hellstrom@intel.com>
> > Cc: Brian Welty <brian.welty@intel.com>
> > ---
> >  drivers/gpu/drm/xe/xe_svm.h        |   7 ++
> >  drivers/gpu/drm/xe/xe_svm_devmem.c | 147
> ++++++++++++++++++++++++++++-
> 
> See comments about file in previous patches, they apply here too.
> 
> >  2 files changed, 152 insertions(+), 2 deletions(-)
> >
> > diff --git a/drivers/gpu/drm/xe/xe_svm.h b/drivers/gpu/drm/xe/xe_svm.h
> > index 624c1581f8ba..92a3ee90d5a7 100644
> > --- a/drivers/gpu/drm/xe/xe_svm.h
> > +++ b/drivers/gpu/drm/xe/xe_svm.h
> > @@ -46,4 +46,11 @@ static inline struct xe_mem_region
> *xe_page_to_mem_region(struct page *page)
> >  	return container_of(page->pgmap, struct xe_mem_region, pagemap);
> >  }
> >
> > +int xe_devm_alloc_pages(struct xe_tile *tile,
> > +						unsigned long npages,
> > +						struct list_head *blocks,
> > +						unsigned long *pfn);
> > +
> > +void xe_devm_free_blocks(struct list_head *blocks);
> > +void xe_devm_page_free(struct page *page);
> >  #endif
> > diff --git a/drivers/gpu/drm/xe/xe_svm_devmem.c
> b/drivers/gpu/drm/xe/xe_svm_devmem.c
> > index 31af56e8285a..5ba0cd9a70b0 100644
> > --- a/drivers/gpu/drm/xe/xe_svm_devmem.c
> > +++ b/drivers/gpu/drm/xe/xe_svm_devmem.c
> > @@ -5,18 +5,161 @@
> >
> >  #include <linux/mm_types.h>
> >  #include <linux/sched/mm.h>
> > -
> > +#include <linux/gfp.h>
> > +#include <linux/migrate.h>
> > +#include <linux/dma-mapping.h>
> > +#include <linux/dma-fence.h>
> > +#include <linux/bitops.h>
> > +#include <linux/bitmap.h>
> > +#include <drm/drm_buddy.h>
> >  #include "xe_device_types.h"
> >  #include "xe_svm.h"
> > +#include "xe_migrate.h"
> > +#include "xe_ttm_vram_mgr_types.h"
> > +#include "xe_assert.h"
> >
> > +/**
> > + * struct xe_svm_block_meta - svm uses this data structure to manage each
> > + * block allocated from drm buddy. This will be set to the
> drm_buddy_block's
> > + * private field.
> > + *
> > + * @lru: used to link this block to drm's lru lists. This will be replace
> > + * with struct drm_lru_entity later.
> > + * @tile: tile from which we allocated this block
> > + * @bitmap: A bitmap of each page in this block. 1 means this page is used,
> > + * 0 means this page is idle. When all bits of this block are 0, it is time
> > + * to return this block to drm buddy subsystem.
> > + */
> > +struct xe_svm_block_meta {
> > +	struct list_head lru;
> > +	struct xe_tile *tile;
> > +	unsigned long bitmap[];
> > +};
> 
> This looks not needed to me but admittedly haven't looked at the LRU stuff.
> 
> I am thinking roughly...
> 
> - I think we drop all this special tracking (kill xe_svm_block_meta)
> - Have functions to allocate / free the buddy blocks, store buddy blocks in
> userptr
> - Blocks are allocated before migration to VRAM
> - Blocks can be freed on either CPU fault after migration or on VMA
>   destroy (probably depends on madvive hints for VMA where we free
>   blocks)
> - Blocks allocated / freed at a chunk (xe_vma in this code) granularity
>   (conceptually the same if we switch to 1 to N ratio between xe_vma &
>   pt_state)
> - block->private == memory region so we can get pfn from block
> - When we need migrate_pfns we loop over buddy blocks populating
> migrate.dst

I thought about your scheme. The freeing of device memory is not completely controlled by the driver: core mm can call back into the driver to free a device memory page. The xe_devm_page_free in this series is a callback function registered with core mm. This is why, in the above data structure, I have to have a bitmap. The bitmap marks which pages are freed; when all pages in a buddy block are freed, it is time to free the whole buddy block.

In your scheme, we allocate/free at xe_vma granularity. So I imagine you would have a list of buddy blocks in the userptr, and free all blocks in the list once every page in all blocks is freed, while my scheme frees memory at buddy block granularity. I think that is natural, because the buddy free interface is also block based.

You would eventually need to introduce a lru link to attach each buddy block to a lru list when vram eviction comes into the picture.

So I just explained why xe_svm_block_meta above was introduced, and why the bitmap and lru fields are necessary to me. If you drop this data structure, they will have to show up in another way.

> 
> Also I noticed the drm_buddy_* calls in this file are not protected by a
> lock, we will need that. Currently it is tile->mem.vram_mgr->lock in the
> VRAM mgr code, we either need to reach into there or move this lock to
> common place so the VRAM manager and block allocations for SVM don't
> race with each other.
> 

Yes, the lock has to be added. Thanks for pointing this out. Maybe move tile->mem.vram_mgr->lock to the xe_tile level so it can be shared between the BO driver and the system allocator?

Oak

> Matt
> 
> >
> >  static vm_fault_t xe_devm_migrate_to_ram(struct vm_fault *vmf)
> >  {
> >  	return 0;
> >  }
> >
> > -static void xe_devm_page_free(struct page *page)
> > +static u64 block_offset_to_pfn(struct xe_mem_region *mr, u64 offset)
> > +{
> > +	/** DRM buddy's block offset is 0-based*/
> > +	offset += mr->hpa_base;
> > +
> > +	return PHYS_PFN(offset);
> > +}
> > +
> > +/** FIXME: we locked page by calling zone_device_page_init
> > + *  in xe_devm_alloc_pages. Should we unlock pages here?
> > + */
> > +static void free_block(struct drm_buddy_block *block)
> > +{
> > +	struct xe_svm_block_meta *meta =
> > +		(struct xe_svm_block_meta *)block->private;
> > +	struct xe_tile *tile  = meta->tile;
> > +	struct drm_buddy *mm = &tile->mem.vram_mgr->mm;
> > +
> > +	kfree(block->private);
> > +	drm_buddy_free_block(mm, block);
> > +}
> > +
> > +void xe_devm_page_free(struct page *page)
> > +{
> > +	struct drm_buddy_block *block =
> > +					(struct drm_buddy_block *)page-
> >zone_device_data;
> > +	struct xe_svm_block_meta *meta =
> > +					(struct xe_svm_block_meta *)block-
> >private;
> > +	struct xe_tile *tile  = meta->tile;
> > +	struct xe_mem_region *mr = &tile->mem.vram;
> > +	struct drm_buddy *mm = &tile->mem.vram_mgr->mm;
> > +	u64 size = drm_buddy_block_size(mm, block);
> > +	u64 pages_per_block = size >> PAGE_SHIFT;
> > +	u64 block_pfn_first =
> > +					block_offset_to_pfn(mr,
> drm_buddy_block_offset(block));
> > +	u64 page_pfn = page_to_pfn(page);
> > +	u64 i = page_pfn - block_pfn_first;
> > +
> > +	xe_assert(tile->xe, i < pages_per_block);
> > +	clear_bit(i, meta->bitmap);
> > +	if (bitmap_empty(meta->bitmap, pages_per_block))
> > +		free_block(block);
> > +}
> > +
> > +/**
> > + * xe_devm_alloc_pages() - allocate device pages from buddy allocator
> > + *
> > + * @xe_tile: which tile to allocate device memory from
> > + * @npages: how many pages to allocate
> > + * @blocks: used to return the allocated blocks
> > + * @pfn: used to return the pfn of all allocated pages. Must be big enough
> > + * to hold at @npages entries.
> > + *
> > + * This function allocate blocks of memory from drm buddy allocator, and
> > + * performs initialization work: set struct page::zone_device_data to point
> > + * to the memory block; set/initialize drm_buddy_block::private field;
> > + * lock_page for each page allocated; add memory block to lru managers lru
> > + * list - this is TBD.
> > + *
> > + * return: 0 on success
> > + * error code otherwise
> > + */
> > +int xe_devm_alloc_pages(struct xe_tile *tile,
> > +						unsigned long npages,
> > +						struct list_head *blocks,
> > +						unsigned long *pfn)
> > +{
> > +	struct drm_buddy *mm = &tile->mem.vram_mgr->mm;
> > +	struct drm_buddy_block *block, *tmp;
> > +	u64 size = npages << PAGE_SHIFT;
> > +	int ret = 0, i, j = 0;
> > +
> > +	ret = drm_buddy_alloc_blocks(mm, 0, mm->size, size, PAGE_SIZE,
> > +						blocks,
> DRM_BUDDY_TOPDOWN_ALLOCATION);
> > +
> > +	if (unlikely(ret))
> > +		return ret;
> > +
> > +	list_for_each_entry_safe(block, tmp, blocks, link) {
> > +		struct xe_mem_region *mr = &tile->mem.vram;
> > +		u64 block_pfn_first, pages_per_block;
> > +		struct xe_svm_block_meta *meta;
> > +		u32 meta_size;
> > +
> > +		size = drm_buddy_block_size(mm, block);
> > +		pages_per_block = size >> PAGE_SHIFT;
> > +		meta_size = BITS_TO_BYTES(pages_per_block) +
> > +					sizeof(struct xe_svm_block_meta);
> > +		meta = kzalloc(meta_size, GFP_KERNEL);
> > +		bitmap_fill(meta->bitmap, pages_per_block);
> > +		meta->tile = tile;
> > +		block->private = meta;
> > +		block_pfn_first =
> > +					block_offset_to_pfn(mr,
> drm_buddy_block_offset(block));
> > +		for(i = 0; i < pages_per_block; i++) {
> > +			struct page *page;
> > +
> > +			pfn[j++] = block_pfn_first + i;
> > +			page = pfn_to_page(block_pfn_first + i);
> > +			/**Lock page per hmm requirement, see hmm.rst.*/
> > +			zone_device_page_init(page);
> > +			page->zone_device_data = block;
> > +		}
> > +	}
> > +
> > +	return ret;
> > +}
> > +
> > +/**
> > + * xe_devm_free_blocks() - free all memory blocks
> > + *
> > + * @blocks: memory blocks list head
> > + */
> > +void xe_devm_free_blocks(struct list_head *blocks)
> >  {
> > +	struct drm_buddy_block *block, *tmp;
> > +
> > +	list_for_each_entry_safe(block, tmp, blocks, link)
> > +		free_block(block);
> >  }
> >
> >  static const struct dev_pagemap_ops xe_devm_pagemap_ops = {
> > --
> > 2.26.3
> >

^ permalink raw reply	[flat|nested] 72+ messages in thread

* Re: [v2 22/31] drm/xe/svm: implement functions to allocate and free device memory
  2024-04-15 20:13     ` Zeng, Oak
@ 2024-04-15 21:19       ` Matthew Brost
  0 siblings, 0 replies; 72+ messages in thread
From: Matthew Brost @ 2024-04-15 21:19 UTC (permalink / raw)
  To: Zeng, Oak
  Cc: intel-xe@lists.freedesktop.org, Ghimiray, Himal Prasad,
	Bommu, Krishnaiah, Thomas.Hellstrom@linux.intel.com, Welty, Brian

On Mon, Apr 15, 2024 at 02:13:55PM -0600, Zeng, Oak wrote:
> Hi Matt,
> 
> > -----Original Message-----
> > From: Brost, Matthew <matthew.brost@intel.com>
> > Sent: Wednesday, April 10, 2024 6:24 PM
> > To: Zeng, Oak <oak.zeng@intel.com>
> > Cc: intel-xe@lists.freedesktop.org; Ghimiray, Himal Prasad
> > <himal.prasad.ghimiray@intel.com>; Bommu, Krishnaiah
> > <krishnaiah.bommu@intel.com>; Thomas.Hellstrom@linux.intel.com; Welty,
> > Brian <brian.welty@intel.com>
> > Subject: Re: [v2 22/31] drm/xe/svm: implement functions to allocate and free
> > device memory
> > 
> > On Tue, Apr 09, 2024 at 04:17:33PM -0400, Oak Zeng wrote:
> > > Function xe_devm_alloc_pages allocate pages from drm buddy and perform
> > > house keeping work for all the pages allocated, such as get a page
> > > refcount, keep a bitmap of all pages to denote whether a page is in
> > > use, put pages to a drm lru list for eviction purpose.
> > >
> > > Function xe_devm_free_blocks return list of memory blocks to drm buddy
> > > allocator.
> > >
> > > Function xe_devm_free_page is a call back function from hmm layer. It
> > > is called whenever a page's refcount reaches to 1. This function clears
> > > the bit of this page in the bitmap. If all the bits in the bitmap is
> > > cleared, it means all the pages have been freed, we return all the pages
> > > in this memory block back to drm buddy.
> > >
> > > Signed-off-by: Oak Zeng <oak.zeng@intel.com>
> > > Co-developed-by: Niranjana Vishwanathapura
> > <niranjana.vishwanathapura@intel.com>
> > > Signed-off-by: Niranjana Vishwanathapura
> > <niranjana.vishwanathapura@intel.com>
> > > Cc: Matthew Brost <matthew.brost@intel.com>
> > > Cc: Thomas Hellström <thomas.hellstrom@intel.com>
> > > Cc: Brian Welty <brian.welty@intel.com>
> > > ---
> > >  drivers/gpu/drm/xe/xe_svm.h        |   7 ++
> > >  drivers/gpu/drm/xe/xe_svm_devmem.c | 147
> > ++++++++++++++++++++++++++++-
> > 
> > See comments about file in previous patches, they apply here too.
> > 
> > >  2 files changed, 152 insertions(+), 2 deletions(-)
> > >
> > > diff --git a/drivers/gpu/drm/xe/xe_svm.h b/drivers/gpu/drm/xe/xe_svm.h
> > > index 624c1581f8ba..92a3ee90d5a7 100644
> > > --- a/drivers/gpu/drm/xe/xe_svm.h
> > > +++ b/drivers/gpu/drm/xe/xe_svm.h
> > > @@ -46,4 +46,11 @@ static inline struct xe_mem_region
> > *xe_page_to_mem_region(struct page *page)
> > >  	return container_of(page->pgmap, struct xe_mem_region, pagemap);
> > >  }
> > >
> > > +int xe_devm_alloc_pages(struct xe_tile *tile,
> > > +						unsigned long npages,
> > > +						struct list_head *blocks,
> > > +						unsigned long *pfn);
> > > +
> > > +void xe_devm_free_blocks(struct list_head *blocks);
> > > +void xe_devm_page_free(struct page *page);
> > >  #endif
> > > diff --git a/drivers/gpu/drm/xe/xe_svm_devmem.c
> > b/drivers/gpu/drm/xe/xe_svm_devmem.c
> > > index 31af56e8285a..5ba0cd9a70b0 100644
> > > --- a/drivers/gpu/drm/xe/xe_svm_devmem.c
> > > +++ b/drivers/gpu/drm/xe/xe_svm_devmem.c
> > > @@ -5,18 +5,161 @@
> > >
> > >  #include <linux/mm_types.h>
> > >  #include <linux/sched/mm.h>
> > > -
> > > +#include <linux/gfp.h>
> > > +#include <linux/migrate.h>
> > > +#include <linux/dma-mapping.h>
> > > +#include <linux/dma-fence.h>
> > > +#include <linux/bitops.h>
> > > +#include <linux/bitmap.h>
> > > +#include <drm/drm_buddy.h>
> > >  #include "xe_device_types.h"
> > >  #include "xe_svm.h"
> > > +#include "xe_migrate.h"
> > > +#include "xe_ttm_vram_mgr_types.h"
> > > +#include "xe_assert.h"
> > >
> > > +/**
> > > + * struct xe_svm_block_meta - svm uses this data structure to manage each
> > > + * block allocated from drm buddy. This will be set to the
> > drm_buddy_block's
> > > + * private field.
> > > + *
> > > + * @lru: used to link this block to drm's lru lists. This will be replace
> > > + * with struct drm_lru_entity later.
> > > + * @tile: tile from which we allocated this block
> > > + * @bitmap: A bitmap of each page in this block. 1 means this page is used,
> > > + * 0 means this page is idle. When all bits of this block are 0, it is time
> > > + * to return this block to drm buddy subsystem.
> > > + */
> > > +struct xe_svm_block_meta {
> > > +	struct list_head lru;
> > > +	struct xe_tile *tile;
> > > +	unsigned long bitmap[];
> > > +};
> > 
> > This looks not needed to me but admittedly haven't looked at the LRU stuff.
> > 
> > I am thinking roughly...
> > 
> > - I think we drop all this special tracking (kill xe_svm_block_meta)
> > - Have functions to allocate / free the buddy blocks, store buddy blocks in
> > userptr
> > - Blocks are allocated before migration to VRAM
> > - Blocks can be freed on either CPU fault after migration or on VMA
> >   destroy (probably depends on madvive hints for VMA where we free
> >   blocks)
> > - Blocks allocated / freed at ia chunk (xe_vma in this code) granularity
> >   (conceptually the same if we switch to 1 to N ratio between xe_vma &
> >   pt_state)
> > - block->private == memory region so we can get pfn from block
> > - When we need migrate_pfns we loop over buddy blocks populating
> > migrate.dst
> 
> I thought through your scheme. The freeing of device memory is not completely controlled by the driver: core mm can call back into the driver to free a device memory page. The xe_devm_page_free in this series is a callback function registered with core mm. This is why the above data structure has to have a bitmap. The bitmap is used to mark which pages have been freed; when all pages in a buddy block are freed, it is time to free the whole buddy block.
>

Certainly in this scenario we'd also get a mmu notifier with
MMU_NOTIFY_UNMAP when pages are being freed too, right? The notifier
puts the VMA in the garbage collector and the blocks are freed when the
VMA is destroyed.

The garbage collector only supports complete unmaps at the moment, but if
we need to support partial unmaps (we likely do, as users can munmap
partial buffers) we can, with the garbage collector transferring ownership
of blocks from one VMA to another as needed.

It is possible I don't fully understand the ref counting scheme for pages
either and we will need to implement dev_pagemap_ops.page_free
(seems likely now that I am typing) rather than the notifier scheme
described above...

If we need to do this, then roughly...

- page->zone_device_data is still the Xe chunk (xe_vma currently)
- Ref count device pages in Xe chunk (or perhaps individual block?,
  need to think about this more but certainly bitmap is overkill)
- free_pages decrements ref count
- When ref count goes to zero, free blocks
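The bullets above can be modeled in a few lines. This is only a userspace sketch of the proposed alternative, not driver code: the names (`chunk`, `chunk_page_free`) are made up, and a kernel version would use `refcount_t` and be driven by `dev_pagemap_ops.page_free`.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical per-chunk state standing in for an xe_vma-granularity
 * allocation; replaces the per-block bitmap with a simple ref count. */
struct chunk {
	int device_pages;   /* pages currently backed by this chunk's blocks */
	bool blocks_freed;  /* whether the buddy blocks were returned */
};

/* Called once per device page handed out during migration to VRAM. */
static void chunk_get_page(struct chunk *c)
{
	c->device_pages++;
}

/* Models the page_free callback: drop one page's reference and return
 * all of the chunk's buddy blocks once the last device page is freed. */
static void chunk_page_free(struct chunk *c)
{
	if (--c->device_pages == 0)
		c->blocks_freed = true;  /* stand-in for freeing the blocks */
}
```

The point of the sketch is that a single counter per chunk is enough to know when the blocks can go back to the buddy allocator, whereas the bitmap additionally records *which* pages are free, which nothing consumes.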

Again this seems to align with Nouveau and AMD (haven't checked Nvidia's
open source driver) and aligns with the design in Xe of everything being
at chunk granularity (e.g. no partial migrations within a chunk).

I guess we will need a test that does partial unmaps to figure out all
of these details...

> In your scheme, we allocate/free at xe_vma granularity. So I imagine you would have a list of buddy blocks in the userptr, and free all blocks in the list when every page in all blocks has been freed. My scheme instead frees memory at the buddy block granularity - I think that is natural because the buddy free interface is also block based.
> 
> You would eventually need to introduce a lru link to link each buddy block to a lru list when vram eviction comes into the picture.
> 
> So I just explained why the above xe_svm_block_meta was introduced, and why the bitmap and lru fields are necessary. If you drop this data structure, they will have to show up in another way.
>

LRU will likely be at chunk granularity too (i.e. xe_vma, not at block
level).

Also in most cases xe_vma == 1 block if the buddy allocator is doing its
job so no reason to optimize for block level here.
 
> > 
> > Also I noticed the drm_buddy_* calls in this file are not protected by a
> > lock, we will need that. Currently it is tile->mem.vram_mgr->lock in the
> > VRAM mgr code, we either need to reach into there or move this lock to
> > common place so the VRAM manager and block allocations for SVM don't
> > race with each other.
> > 
> 
> Yes, the lock has to be added. Thanks for pointing this out. Maybe move the tile->mem.vram_mgr->lock to the xe_tile level so it can be shared b/t BO-driver and system allocator?
>

Yea, tile->mem.vram_lock might be a better location.
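The shape of that shared lock can be sketched in userspace as follows; the field placement and names are assumptions (kernel code would use `struct mutex` and the real `drm_buddy` state, not a page counter).

```c
#include <assert.h>
#include <pthread.h>

/* Hypothetical tile-level state: one lock shared by the BO path (TTM
 * VRAM manager) and the SVM path so their buddy calls cannot race. */
struct tile_vram {
	pthread_mutex_t vram_lock;  /* kernel code: struct mutex */
	long free_pages;            /* stand-in for drm_buddy state */
};

/* Both the TTM manager and xe_devm_alloc_pages() would take the same
 * lock around their drm_buddy_alloc_blocks()/free_block() calls. */
static int vram_alloc(struct tile_vram *t, long npages)
{
	int ret = 0;

	pthread_mutex_lock(&t->vram_lock);
	if (t->free_pages >= npages)
		t->free_pages -= npages;
	else
		ret = -1;  /* would be -ENOSPC in the kernel */
	pthread_mutex_unlock(&t->vram_lock);
	return ret;
}
```

Hoisting the lock to the tile keeps one owner for the whole VRAM buddy state instead of reaching into `vram_mgr` internals from the SVM code.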

Matt
 
> Oak
> 
> > Matt
> > 
> > >
> > >  static vm_fault_t xe_devm_migrate_to_ram(struct vm_fault *vmf)
> > >  {
> > >  	return 0;
> > >  }
> > >
> > > -static void xe_devm_page_free(struct page *page)
> > > +static u64 block_offset_to_pfn(struct xe_mem_region *mr, u64 offset)
> > > +{
> > > +	/** DRM buddy's block offset is 0-based*/
> > > +	offset += mr->hpa_base;
> > > +
> > > +	return PHYS_PFN(offset);
> > > +}
> > > +
> > > +/** FIXME: we locked page by calling zone_device_page_init
> > > + *  in xe_devm_alloc_pages. Should we unlock pages here?
> > > + */
> > > +static void free_block(struct drm_buddy_block *block)
> > > +{
> > > +	struct xe_svm_block_meta *meta =
> > > +		(struct xe_svm_block_meta *)block->private;
> > > +	struct xe_tile *tile  = meta->tile;
> > > +	struct drm_buddy *mm = &tile->mem.vram_mgr->mm;
> > > +
> > > +	kfree(block->private);
> > > +	drm_buddy_free_block(mm, block);
> > > +}
> > > +
> > > +void xe_devm_page_free(struct page *page)
> > > +{
> > > +	struct drm_buddy_block *block =
> > > +					(struct drm_buddy_block *)page-
> > >zone_device_data;
> > > +	struct xe_svm_block_meta *meta =
> > > +					(struct xe_svm_block_meta *)block-
> > >private;
> > > +	struct xe_tile *tile  = meta->tile;
> > > +	struct xe_mem_region *mr = &tile->mem.vram;
> > > +	struct drm_buddy *mm = &tile->mem.vram_mgr->mm;
> > > +	u64 size = drm_buddy_block_size(mm, block);
> > > +	u64 pages_per_block = size >> PAGE_SHIFT;
> > > +	u64 block_pfn_first =
> > > +					block_offset_to_pfn(mr,
> > drm_buddy_block_offset(block));
> > > +	u64 page_pfn = page_to_pfn(page);
> > > +	u64 i = page_pfn - block_pfn_first;
> > > +
> > > +	xe_assert(tile->xe, i < pages_per_block);
> > > +	clear_bit(i, meta->bitmap);
> > > +	if (bitmap_empty(meta->bitmap, pages_per_block))
> > > +		free_block(block);
> > > +}
> > > +
> > > +/**
> > > + * xe_devm_alloc_pages() - allocate device pages from buddy allocator
> > > + *
> > > + * @xe_tile: which tile to allocate device memory from
> > > + * @npages: how many pages to allocate
> > > + * @blocks: used to return the allocated blocks
> > > + * @pfn: used to return the pfn of all allocated pages. Must be big enough
> > > + * to hold at @npages entries.
> > > + *
> > > + * This function allocate blocks of memory from drm buddy allocator, and
> > > + * performs initialization work: set struct page::zone_device_data to point
> > > + * to the memory block; set/initialize drm_buddy_block::private field;
> > > + * lock_page for each page allocated; add memory block to lru managers lru
> > > + * list - this is TBD.
> > > + *
> > > + * return: 0 on success
> > > + * error code otherwise
> > > + */
> > > +int xe_devm_alloc_pages(struct xe_tile *tile,
> > > +						unsigned long npages,
> > > +						struct list_head *blocks,
> > > +						unsigned long *pfn)
> > > +{
> > > +	struct drm_buddy *mm = &tile->mem.vram_mgr->mm;
> > > +	struct drm_buddy_block *block, *tmp;
> > > +	u64 size = npages << PAGE_SHIFT;
> > > +	int ret = 0, i, j = 0;
> > > +
> > > +	ret = drm_buddy_alloc_blocks(mm, 0, mm->size, size, PAGE_SIZE,
> > > +						blocks,
> > DRM_BUDDY_TOPDOWN_ALLOCATION);
> > > +
> > > +	if (unlikely(ret))
> > > +		return ret;
> > > +
> > > +	list_for_each_entry_safe(block, tmp, blocks, link) {
> > > +		struct xe_mem_region *mr = &tile->mem.vram;
> > > +		u64 block_pfn_first, pages_per_block;
> > > +		struct xe_svm_block_meta *meta;
> > > +		u32 meta_size;
> > > +
> > > +		size = drm_buddy_block_size(mm, block);
> > > +		pages_per_block = size >> PAGE_SHIFT;
> > > +		meta_size = BITS_TO_BYTES(pages_per_block) +
> > > +					sizeof(struct xe_svm_block_meta);
> > > +		meta = kzalloc(meta_size, GFP_KERNEL);
> > > +		bitmap_fill(meta->bitmap, pages_per_block);
> > > +		meta->tile = tile;
> > > +		block->private = meta;
> > > +		block_pfn_first =
> > > +					block_offset_to_pfn(mr,
> > drm_buddy_block_offset(block));
> > > +		for(i = 0; i < pages_per_block; i++) {
> > > +			struct page *page;
> > > +
> > > +			pfn[j++] = block_pfn_first + i;
> > > +			page = pfn_to_page(block_pfn_first + i);
> > > +			/**Lock page per hmm requirement, see hmm.rst.*/
> > > +			zone_device_page_init(page);
> > > +			page->zone_device_data = block;
> > > +		}
> > > +	}
> > > +
> > > +	return ret;
> > > +}
> > > +
> > > +/**
> > > + * xe_devm_free_blocks() - free all memory blocks
> > > + *
> > > + * @blocks: memory blocks list head
> > > + */
> > > +void xe_devm_free_blocks(struct list_head *blocks)
> > >  {
> > > +	struct drm_buddy_block *block, *tmp;
> > > +
> > > +	list_for_each_entry_safe(block, tmp, blocks, link)
> > > +		free_block(block);
> > >  }
> > >
> > >  static const struct dev_pagemap_ops xe_devm_pagemap_ops = {
> > > --
> > > 2.26.3
> > >

^ permalink raw reply	[flat|nested] 72+ messages in thread

* Re: [v2 12/31] drm/xe/svm: Remap and provide memmap backing for GPU vram
  2024-04-09 20:17 ` [v2 12/31] drm/xe/svm: Remap and provide memmap backing for GPU vram Oak Zeng
  2024-04-10 21:09   ` Matthew Brost
@ 2024-04-16 19:01   ` Matthew Brost
  1 sibling, 0 replies; 72+ messages in thread
From: Matthew Brost @ 2024-04-16 19:01 UTC (permalink / raw)
  To: Oak Zeng
  Cc: intel-xe, himal.prasad.ghimiray, krishnaiah.bommu,
	Thomas.Hellstrom, brian.welty

On Tue, Apr 09, 2024 at 04:17:23PM -0400, Oak Zeng wrote:
> Memory remap GPU vram using devm_memremap_pages, so each GPU vram
> page is backed by a struct page.
> 
> Those struct pages are created to allow hmm migrate buffer b/t
> GPU vram and CPU system memory using existing Linux migration
> mechanism (i.e., migrating b/t CPU system memory and hard disk).
> 
> This is prepare work to enable svm (shared virtual memory) through
> Linux kernel hmm framework. The memory remap's page map type is set
> to MEMORY_DEVICE_PRIVATE for now. This means even though each GPU
> vram page get a struct page and can be mapped in CPU page table,
> but such pages are treated as GPU's private resource, so CPU can't
> access them. If CPU access such page, a page fault is triggered
> and page will be migrate to system memory.
> 
> For GPU device which supports coherent memory protocol b/t CPU and
> GPU (such as CXL and CAPI protocol), we can remap device memory as
> MEMORY_DEVICE_COHERENT. This is TBD.
> 
> v1:
> Changes per code review feedback from Matt:
>     change .o order in Makefile
>     fix indentation
>     change code order in mmio_fini
>     remove unnecessary header file
>     uniform xe_svm_devm_add/_remove parameter
>     use tile (vs dev) as pagemap.owner during memremap
>     only remap vram for platform that support usm
> Changes per review feedback from Brian:
>     s/xe_svm_devm_add/xe_devm_add
>     s/xe_svm_devm_remove/xe_devm_remove
>     move calling of xe_devm_add to xe_tile.c
> 
> Signed-off-by: Oak Zeng <oak.zeng@intel.com>
> Co-developed-by: Niranjana Vishwanathapura <niranjana.vishwanathapura@intel.com>
> Signed-off-by: Niranjana Vishwanathapura <niranjana.vishwanathapura@intel.com>
> Cc: Matthew Brost <matthew.brost@intel.com>
> Cc: Thomas Hellström <thomas.hellstrom@intel.com>
> Cc: Brian Welty <brian.welty@intel.com>
> ---
>  drivers/gpu/drm/xe/Makefile          |  1 +
>  drivers/gpu/drm/xe/xe_device_types.h |  8 +++
>  drivers/gpu/drm/xe/xe_mmio.c         |  6 ++
>  drivers/gpu/drm/xe/xe_svm.h          | 15 +++++
>  drivers/gpu/drm/xe/xe_svm_devmem.c   | 89 ++++++++++++++++++++++++++++
>  drivers/gpu/drm/xe/xe_tile.c         |  4 ++
>  6 files changed, 123 insertions(+)
>  create mode 100644 drivers/gpu/drm/xe/xe_svm.h
>  create mode 100644 drivers/gpu/drm/xe/xe_svm_devmem.c
> 
> diff --git a/drivers/gpu/drm/xe/Makefile b/drivers/gpu/drm/xe/Makefile
> index fff70fc9a09e..cd5213ba182b 100644
> --- a/drivers/gpu/drm/xe/Makefile
> +++ b/drivers/gpu/drm/xe/Makefile
> @@ -129,6 +129,7 @@ xe-y += xe_bb.o \
>  	xe_sa.o \
>  	xe_sched_job.o \
>  	xe_step.o \
> +	xe_svm_devmem.o \
>  	xe_sync.o \
>  	xe_tile.o \
>  	xe_tile_sysfs.o \
> diff --git a/drivers/gpu/drm/xe/xe_device_types.h b/drivers/gpu/drm/xe/xe_device_types.h
> index e73b9a086718..d6a14327986b 100644
> --- a/drivers/gpu/drm/xe/xe_device_types.h
> +++ b/drivers/gpu/drm/xe/xe_device_types.h
> @@ -103,6 +103,14 @@ struct xe_mem_region {
>  	resource_size_t actual_physical_size;
>  	/** @mapping: pointer to VRAM mappable space */
>  	void __iomem *mapping;
> +	/** @pagemap: Used to remap device memory as ZONE_DEVICE */
> +	struct dev_pagemap pagemap;
> +	/**
> +	 * @hpa_base: base host physical address
> +	 *
> +	 * This is generated when remap device memory as ZONE_DEVICE
> +	 */
> +	resource_size_t hpa_base;
>  };
>  
>  /**
> diff --git a/drivers/gpu/drm/xe/xe_mmio.c b/drivers/gpu/drm/xe/xe_mmio.c
> index 7ba2477452d7..12923fe6abae 100644
> --- a/drivers/gpu/drm/xe/xe_mmio.c
> +++ b/drivers/gpu/drm/xe/xe_mmio.c
> @@ -22,6 +22,7 @@
>  #include "xe_module.h"
>  #include "xe_sriov.h"
>  #include "xe_tile.h"
> +#include "xe_svm.h"
>  
>  #define XEHP_MTCFG_ADDR		XE_REG(0x101800)
>  #define TILE_COUNT		REG_GENMASK(15, 8)
> @@ -354,6 +355,11 @@ void xe_mmio_probe_tiles(struct xe_device *xe)
>  static void mmio_fini(struct drm_device *drm, void *arg)
>  {
>  	struct xe_device *xe = arg;
> +	struct xe_tile *tile;
> +	u8 id;
> +
> +	for_each_tile(tile, xe, id)
> +		xe_devm_remove(tile, &tile->mem.vram);
>  
>  	pci_iounmap(to_pci_dev(xe->drm.dev), xe->mmio.regs);
>  	if (xe->mem.vram.mapping)
> diff --git a/drivers/gpu/drm/xe/xe_svm.h b/drivers/gpu/drm/xe/xe_svm.h
> new file mode 100644
> index 000000000000..e944971cfc6d
> --- /dev/null
> +++ b/drivers/gpu/drm/xe/xe_svm.h
> @@ -0,0 +1,15 @@
> +// SPDX-License-Identifier: MIT
> +/*
> + * Copyright © 2023 Intel Corporation
> + */
> +
> +#ifndef __XE_SVM_H
> +#define __XE_SVM_H
> +
> +struct xe_tile;
> +struct xe_mem_region;
> +
> +int xe_devm_add(struct xe_tile *tile, struct xe_mem_region *mr);
> +void xe_devm_remove(struct xe_tile *tile, struct xe_mem_region *mr);
> +
> +#endif
> diff --git a/drivers/gpu/drm/xe/xe_svm_devmem.c b/drivers/gpu/drm/xe/xe_svm_devmem.c
> new file mode 100644
> index 000000000000..31af56e8285a
> --- /dev/null
> +++ b/drivers/gpu/drm/xe/xe_svm_devmem.c
> @@ -0,0 +1,89 @@
> +// SPDX-License-Identifier: MIT
> +/*
> + * Copyright © 2023 Intel Corporation
> + */
> +
> +#include <linux/mm_types.h>
> +#include <linux/sched/mm.h>
> +
> +#include "xe_device_types.h"
> +#include "xe_svm.h"
> +
> +
> +static vm_fault_t xe_devm_migrate_to_ram(struct vm_fault *vmf)
> +{
> +	return 0;
> +}
> +
> +static void xe_devm_page_free(struct page *page)
> +{
> +}
> +
> +static const struct dev_pagemap_ops xe_devm_pagemap_ops = {
> +	.page_free = xe_devm_page_free,
> +	.migrate_to_ram = xe_devm_migrate_to_ram,
> +};
> +
> +/**
> + * xe_devm_add: Remap and provide memmap backing for device memory
> + * @tile: tile that the memory region blongs to
> + * @mr: memory region to remap
> + *
> + * This remap device memory to host physical address space and create
> + * struct page to back device memory
> + *
> + * Return: 0 on success standard error code otherwise
> + */
> +int xe_devm_add(struct xe_tile *tile, struct xe_mem_region *mr)
> +{
> +	struct xe_device *xe = tile_to_xe(tile);
> +	struct device *dev = &to_pci_dev(xe->drm.dev)->dev;
> +	struct resource *res;
> +	void *addr;
> +	int ret;
> +
> +	res = devm_request_free_mem_region(dev, &iomem_resource,
> +					   mr->usable_size);
> +	if (IS_ERR(res)) {
> +		ret = PTR_ERR(res);
> +		return ret;
> +	}
> +
> +	mr->pagemap.type = MEMORY_DEVICE_PRIVATE;
> +	mr->pagemap.range.start = res->start;
> +	mr->pagemap.range.end = res->end;
> +	mr->pagemap.nr_range = 1;
> +	mr->pagemap.ops = &xe_devm_pagemap_ops;
> +	mr->pagemap.owner = xe;
> +	addr = devm_memremap_pages(dev, &mr->pagemap);
> +	if (IS_ERR(addr)) {
> +		devm_release_mem_region(dev, res->start, resource_size(res));
> +		ret = PTR_ERR(addr);
> +		drm_err(&xe->drm, "Failed to remap tile %d memory, errno %d\n",
> +				tile->id, ret);
> +		return ret;
> +	}
> +	mr->hpa_base = res->start;
> +
> +	drm_info(&xe->drm, "Added tile %d memory [%llx-%llx] to devm, remapped to %pr\n",
> +			tile->id, mr->io_start, mr->io_start + mr->usable_size, res);
> +	return 0;
> +}
> +
> +/**
> + * xe_devm_remove: Unmap device memory and free resources
> + * @tile: xe tile
> + * @mr: memory region to remove
> + */
> +void xe_devm_remove(struct xe_tile *tile, struct xe_mem_region *mr)

Also I don't think this function is needed...

devm_memremap_pages registers devm_memremap_pages_release via
devm_add_action_or_reset...

And if it were needed, we'd want to register a devm fini action rather than
exporting a function and calling it from the mmio layer.

Matt

> +{
> +	struct device *dev = &to_pci_dev(tile->xe->drm.dev)->dev;
> +
> +	/*FIXME: Does below cause a kernel hange during moduel remove?*/
> +	if (mr->hpa_base) {
> +		devm_memunmap_pages(dev, &mr->pagemap);
> +		devm_release_mem_region(dev, mr->pagemap.range.start,
> +			mr->pagemap.range.end - mr->pagemap.range.start + 1);
> +	}
> +}
> +
> diff --git a/drivers/gpu/drm/xe/xe_tile.c b/drivers/gpu/drm/xe/xe_tile.c
> index 0650b2fa75ef..f1c4f9de51df 100644
> --- a/drivers/gpu/drm/xe/xe_tile.c
> +++ b/drivers/gpu/drm/xe/xe_tile.c
> @@ -14,6 +14,7 @@
>  #include "xe_tile_sysfs.h"
>  #include "xe_ttm_vram_mgr.h"
>  #include "xe_wa.h"
> +#include "xe_svm.h"
>  
>  /**
>   * DOC: Multi-tile Design
> @@ -158,6 +159,7 @@ static int tile_ttm_mgr_init(struct xe_tile *tile)
>   */
>  int xe_tile_init_noalloc(struct xe_tile *tile)
>  {
> +	struct xe_device *xe = tile_to_xe(tile);
>  	int err;
>  
>  	xe_device_mem_access_get(tile_to_xe(tile));
> @@ -175,6 +177,8 @@ int xe_tile_init_noalloc(struct xe_tile *tile)
>  
>  	xe_tile_sysfs_init(tile);
>  
> +	if (xe->info.has_usm)
> +		xe_devm_add(tile, &tile->mem.vram);
>  err_mem_access:
>  	xe_device_mem_access_put(tile_to_xe(tile));
>  	return err;
> -- 
> 2.26.3
> 

^ permalink raw reply	[flat|nested] 72+ messages in thread

* Re: [v2 22/31] drm/xe/svm: implement functions to allocate and free device memory
  2024-04-09 20:17 ` [v2 22/31] drm/xe/svm: implement functions to allocate and free device memory Oak Zeng
  2024-04-10 22:23   ` Matthew Brost
@ 2024-04-17 20:55   ` Matthew Brost
  1 sibling, 0 replies; 72+ messages in thread
From: Matthew Brost @ 2024-04-17 20:55 UTC (permalink / raw)
  To: Oak Zeng
  Cc: intel-xe, himal.prasad.ghimiray, krishnaiah.bommu,
	Thomas.Hellstrom, brian.welty

On Tue, Apr 09, 2024 at 04:17:33PM -0400, Oak Zeng wrote:
> Function xe_devm_alloc_pages allocate pages from drm buddy and perform
> house keeping work for all the pages allocated, such as get a page
> refcount, keep a bitmap of all pages to denote whether a page is in
> use, put pages to a drm lru list for eviction purpose.
> 
> Function xe_devm_free_blocks return list of memory blocks to drm buddy
> allocator.
> 
> Function xe_devm_free_page is a call back function from hmm layer. It
> is called whenever a page's refcount reaches to 1. This function clears
> the bit of this page in the bitmap. If all the bits in the bitmap is
> cleared, it means all the pages have been freed, we return all the pages
> in this memory block back to drm buddy.
> 
> Signed-off-by: Oak Zeng <oak.zeng@intel.com>
> Co-developed-by: Niranjana Vishwanathapura <niranjana.vishwanathapura@intel.com>
> Signed-off-by: Niranjana Vishwanathapura <niranjana.vishwanathapura@intel.com>
> Cc: Matthew Brost <matthew.brost@intel.com>
> Cc: Thomas Hellström <thomas.hellstrom@intel.com>
> Cc: Brian Welty <brian.welty@intel.com>
> ---
>  drivers/gpu/drm/xe/xe_svm.h        |   7 ++
>  drivers/gpu/drm/xe/xe_svm_devmem.c | 147 ++++++++++++++++++++++++++++-
>  2 files changed, 152 insertions(+), 2 deletions(-)
> 
> diff --git a/drivers/gpu/drm/xe/xe_svm.h b/drivers/gpu/drm/xe/xe_svm.h
> index 624c1581f8ba..92a3ee90d5a7 100644
> --- a/drivers/gpu/drm/xe/xe_svm.h
> +++ b/drivers/gpu/drm/xe/xe_svm.h
> @@ -46,4 +46,11 @@ static inline struct xe_mem_region *xe_page_to_mem_region(struct page *page)
>  	return container_of(page->pgmap, struct xe_mem_region, pagemap);
>  }
>  
> +int xe_devm_alloc_pages(struct xe_tile *tile,
> +						unsigned long npages,
> +						struct list_head *blocks,
> +						unsigned long *pfn);
> +
> +void xe_devm_free_blocks(struct list_head *blocks);
> +void xe_devm_page_free(struct page *page);
>  #endif
> diff --git a/drivers/gpu/drm/xe/xe_svm_devmem.c b/drivers/gpu/drm/xe/xe_svm_devmem.c
> index 31af56e8285a..5ba0cd9a70b0 100644
> --- a/drivers/gpu/drm/xe/xe_svm_devmem.c
> +++ b/drivers/gpu/drm/xe/xe_svm_devmem.c
> @@ -5,18 +5,161 @@
>  
>  #include <linux/mm_types.h>
>  #include <linux/sched/mm.h>
> -
> +#include <linux/gfp.h>
> +#include <linux/migrate.h>
> +#include <linux/dma-mapping.h>
> +#include <linux/dma-fence.h>
> +#include <linux/bitops.h>
> +#include <linux/bitmap.h>
> +#include <drm/drm_buddy.h>
>  #include "xe_device_types.h"
>  #include "xe_svm.h"
> +#include "xe_migrate.h"
> +#include "xe_ttm_vram_mgr_types.h"
> +#include "xe_assert.h"
>  
> +/**
> + * struct xe_svm_block_meta - svm uses this data structure to manage each
> + * block allocated from drm buddy. This will be set to the drm_buddy_block's
> + * private field.
> + *
> + * @lru: used to link this block to drm's lru lists. This will be replace
> + * with struct drm_lru_entity later.
> + * @tile: tile from which we allocated this block
> + * @bitmap: A bitmap of each page in this block. 1 means this page is used,
> + * 0 means this page is idle. When all bits of this block are 0, it is time
> + * to return this block to drm buddy subsystem.
> + */
> +struct xe_svm_block_meta {
> +	struct list_head lru;
> +	struct xe_tile *tile;
> +	unsigned long bitmap[];
> +};
>  
>  static vm_fault_t xe_devm_migrate_to_ram(struct vm_fault *vmf)
>  {
>  	return 0;
>  }
>  
> -static void xe_devm_page_free(struct page *page)
> +static u64 block_offset_to_pfn(struct xe_mem_region *mr, u64 offset)
> +{
> +	/** DRM buddy's block offset is 0-based*/
> +	offset += mr->hpa_base;
> +
> +	return PHYS_PFN(offset);
> +}
> +
> +/** FIXME: we locked page by calling zone_device_page_init
> + *  in xe_devm_alloc_pages. Should we unlock pages here?
> + */
> +static void free_block(struct drm_buddy_block *block)
> +{
> +	struct xe_svm_block_meta *meta =
> +		(struct xe_svm_block_meta *)block->private;
> +	struct xe_tile *tile  = meta->tile;
> +	struct drm_buddy *mm = &tile->mem.vram_mgr->mm;
> +
> +	kfree(block->private);
> +	drm_buddy_free_block(mm, block);
> +}
> +
> +void xe_devm_page_free(struct page *page)
> +{
> +	struct drm_buddy_block *block =
> +					(struct drm_buddy_block *)page->zone_device_data;
> +	struct xe_svm_block_meta *meta =
> +					(struct xe_svm_block_meta *)block->private;
> +	struct xe_tile *tile  = meta->tile;
> +	struct xe_mem_region *mr = &tile->mem.vram;
> +	struct drm_buddy *mm = &tile->mem.vram_mgr->mm;
> +	u64 size = drm_buddy_block_size(mm, block);
> +	u64 pages_per_block = size >> PAGE_SHIFT;
> +	u64 block_pfn_first =
> +					block_offset_to_pfn(mr, drm_buddy_block_offset(block));
> +	u64 page_pfn = page_to_pfn(page);
> +	u64 i = page_pfn - block_pfn_first;
> +
> +	xe_assert(tile->xe, i < pages_per_block);
> +	clear_bit(i, meta->bitmap);
> +	if (bitmap_empty(meta->bitmap, pages_per_block))
> +		free_block(block);
> +}
> +
> +/**
> + * xe_devm_alloc_pages() - allocate device pages from buddy allocator
> + *
> + * @xe_tile: which tile to allocate device memory from
> + * @npages: how many pages to allocate
> + * @blocks: used to return the allocated blocks
> + * @pfn: used to return the pfn of all allocated pages. Must be big enough
> + * to hold at @npages entries.
> + *
> + * This function allocate blocks of memory from drm buddy allocator, and
> + * performs initialization work: set struct page::zone_device_data to point
> + * to the memory block; set/initialize drm_buddy_block::private field;
> + * lock_page for each page allocated; add memory block to lru managers lru
> + * list - this is TBD.
> + *
> + * return: 0 on success
> + * error code otherwise
> + */
> +int xe_devm_alloc_pages(struct xe_tile *tile,
> +						unsigned long npages,
> +						struct list_head *blocks,
> +						unsigned long *pfn)
> +{
> +	struct drm_buddy *mm = &tile->mem.vram_mgr->mm;
> +	struct drm_buddy_block *block, *tmp;
> +	u64 size = npages << PAGE_SHIFT;
> +	int ret = 0, i, j = 0;
> +
> +	ret = drm_buddy_alloc_blocks(mm, 0, mm->size, size, PAGE_SIZE,
> +						blocks, DRM_BUDDY_TOPDOWN_ALLOCATION);

Realized this while discussing ref counting off the list: the buddy
allocation size can be either PAGE_SIZE or SZ_64K depending on the
platform. We store this in the VM via the XE_VM_FLAG_64K flag.
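That platform dependency could be folded into the allocation path when choosing the buddy minimum block size, roughly like this. The helper name is made up and the flag's bit value here is purely illustrative (the real `XE_VM_FLAG_64K` lives in the xe VM code).

```c
#include <assert.h>

#define SZ_4K  0x1000u
#define SZ_64K 0x10000u
#define XE_VM_FLAG_64K (1u << 3)  /* illustrative bit, not the real value */

/* Hypothetical helper: pick the minimum block size to pass to
 * drm_buddy_alloc_blocks() for this VM's page-size requirement. */
static unsigned int xe_svm_min_block_size(unsigned long vm_flags)
{
	return (vm_flags & XE_VM_FLAG_64K) ? SZ_64K : SZ_4K;
}
```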

Matt

> +
> +	if (unlikely(ret))
> +		return ret;
> +
> +	list_for_each_entry_safe(block, tmp, blocks, link) {
> +		struct xe_mem_region *mr = &tile->mem.vram;
> +		u64 block_pfn_first, pages_per_block;
> +		struct xe_svm_block_meta *meta;
> +		u32 meta_size;
> +
> +		size = drm_buddy_block_size(mm, block);
> +		pages_per_block = size >> PAGE_SHIFT;
> +		meta_size = BITS_TO_BYTES(pages_per_block) +
> +					sizeof(struct xe_svm_block_meta);
> +		meta = kzalloc(meta_size, GFP_KERNEL);
> +		bitmap_fill(meta->bitmap, pages_per_block);
> +		meta->tile = tile;
> +		block->private = meta;
> +		block_pfn_first =
> +					block_offset_to_pfn(mr, drm_buddy_block_offset(block));
> +		for(i = 0; i < pages_per_block; i++) {
> +			struct page *page;
> +
> +			pfn[j++] = block_pfn_first + i;
> +			page = pfn_to_page(block_pfn_first + i);
> +			/**Lock page per hmm requirement, see hmm.rst.*/
> +			zone_device_page_init(page);
> +			page->zone_device_data = block;
> +		}
> +	}
> +
> +	return ret;
> +}
> +
> +/**
> + * xe_devm_free_blocks() - free all memory blocks
> + *
> + * @blocks: memory blocks list head
> + */
> +void xe_devm_free_blocks(struct list_head *blocks)
>  {
> +	struct drm_buddy_block *block, *tmp;
> +
> +	list_for_each_entry_safe(block, tmp, blocks, link)
> +		free_block(block);
>  }
>  
>  static const struct dev_pagemap_ops xe_devm_pagemap_ops = {
> -- 
> 2.26.3
> 

^ permalink raw reply	[flat|nested] 72+ messages in thread

* RE: [v2 13/31] drm/xe/svm: Introduce DRM_XE_SVM kernel config
  2024-04-10 21:13   ` Matthew Brost
@ 2024-06-04 18:57     ` Zeng, Oak
  0 siblings, 0 replies; 72+ messages in thread
From: Zeng, Oak @ 2024-06-04 18:57 UTC (permalink / raw)
  To: Brost, Matthew
  Cc: intel-xe@lists.freedesktop.org, Ghimiray, Himal Prasad,
	Bommu, Krishnaiah, Thomas.Hellstrom@linux.intel.com, Welty, Brian

Hi Matt,

I replied to a few of those review comments. Then I was interrupted by the v3 respin of this series.

I am now coming back to revisit those comments, to make sure each of them is either addressed in v3, or replied to if I have further comments.

> -----Original Message-----
> From: Brost, Matthew <matthew.brost@intel.com>
> Sent: Wednesday, April 10, 2024 5:13 PM
> To: Zeng, Oak <oak.zeng@intel.com>
> Cc: intel-xe@lists.freedesktop.org; Ghimiray, Himal Prasad
> <himal.prasad.ghimiray@intel.com>; Bommu, Krishnaiah
> <krishnaiah.bommu@intel.com>; Thomas.Hellstrom@linux.intel.com; Welty,
> Brian <brian.welty@intel.com>
> Subject: Re: [v2 13/31] drm/xe/svm: Introduce DRM_XE_SVM kernel config
> 
> On Tue, Apr 09, 2024 at 04:17:24PM -0400, Oak Zeng wrote:
> > Introduce a DRM_XE_SVM kernel config entry for
> 
> Maybe consider another name for this? I could see use cases for non-SVM
> where we still want private pages mapped (e.g. VRAM userptrs on
> non-faulting devices). Don't really have suggestion but worth
> considering.

This is the first time I have heard of the concept of a VRAM userptr. Do we plan to support it?

On a non-faulting device, you would have to allocate vram for the userptr at vm_bind time, right?

> 
> > xe svm feature. xe svm feature allows share
> > virtual address space between CPU and GPU program.
> >
> > v1: Improve commit message (Thomas)
> >     Avoid using #if directive (Thomas)
> >
> > Signed-off-by: Oak Zeng <oak.zeng@intel.com>
> > Co-developed-by: Niranjana Vishwanathapura
> <niranjana.vishwanathapura@intel.com>
> > Signed-off-by: Niranjana Vishwanathapura
> <niranjana.vishwanathapura@intel.com>
> > Cc: Matthew Brost <matthew.brost@intel.com>
> > Cc: Thomas Hellström <thomas.hellstrom@intel.com>
> > Cc: Brian Welty <brian.welty@intel.com>
> > ---
> >  drivers/gpu/drm/xe/Kconfig   | 21 +++++++++++++++++++++
> >  drivers/gpu/drm/xe/xe_tile.c |  7 +++++--
> >  2 files changed, 26 insertions(+), 2 deletions(-)
> >
> > diff --git a/drivers/gpu/drm/xe/Kconfig b/drivers/gpu/drm/xe/Kconfig
> > index 449a1ecbc92a..0accb2cb81d6 100644
> > --- a/drivers/gpu/drm/xe/Kconfig
> > +++ b/drivers/gpu/drm/xe/Kconfig
> > @@ -84,6 +84,27 @@ config DRM_XE_FORCE_PROBE
> >  	  4571.
> >
> >  	  Use "!*" to block the probe of the driver for all known devices.
> > +config DRM_XE_SVM
> > +	bool "Enable Shared Virtual Memory support in xe"
> > +	depends on DRM_XE
> > +	depends on ARCH_ENABLE_MEMORY_HOTPLUG
> > +	depends on ARCH_ENABLE_MEMORY_HOTREMOVE
> > +	depends on MEMORY_HOTPLUG
> > +	depends on MEMORY_HOTREMOVE
> > +	depends on ARCH_HAS_PTE_DEVMAP
> > +	depends on SPARSEMEM_VMEMMAP
> > +	depends on ZONE_DEVICE
> > +	depends on DEVICE_PRIVATE
> > +	depends on MMU
> > +	select HMM_MIRROR
> > +	select MMU_NOTIFIER
> > +	default y
> > +	help
> > +	  Choose this option if you want Shared Virtual Memory (SVM)
> > +	  support in xe. With SVM, virtual address space is shared
> > +	  between CPU and GPU. This means any virtual address such
> > +	  as malloc or mmap returns, variables on stack, or global
> > +	  memory pointers, can be used for GPU transparently.
> >
> >  menu "drm/Xe Debugging"
> >  depends on DRM_XE
> > diff --git a/drivers/gpu/drm/xe/xe_tile.c b/drivers/gpu/drm/xe/xe_tile.c
> > index f1c4f9de51df..a1a436912fe3 100644
> > --- a/drivers/gpu/drm/xe/xe_tile.c
> > +++ b/drivers/gpu/drm/xe/xe_tile.c
> > @@ -159,9 +159,12 @@ static int tile_ttm_mgr_init(struct xe_tile *tile)
> >   */
> >  int xe_tile_init_noalloc(struct xe_tile *tile)
> >  {
> > -	struct xe_device *xe = tile_to_xe(tile);
> > +	struct xe_device __maybe_unused *xe;
> 
> Just assign this here blindly? The __maybe_unused should suppress the
> warning CONFIG_DRM_XE_SVM is false and should just compile out if it is.

Did you mean:

struct xe_device __maybe_unused *xe = tile_to_xe(tile);

With that, we have an extra xe assignment when DRM_XE_SVM is not enabled.

Also, if we assign blindly we wouldn't need __maybe_unused, since xe is always assigned (and therefore used).

Oak


> 
> Matt
> 
> >  	int err;
> >
> > +	if (IS_ENABLED(CONFIG_DRM_XE_SVM))
> > +		xe = tile_to_xe(tile);
> > +
> >  	xe_device_mem_access_get(tile_to_xe(tile));
> >
> >  	err = tile_ttm_mgr_init(tile);
> > @@ -177,7 +180,7 @@ int xe_tile_init_noalloc(struct xe_tile *tile)
> >
> >  	xe_tile_sysfs_init(tile);
> >
> > -	if (xe->info.has_usm)
> > +	if (IS_ENABLED(CONFIG_DRM_XE_SVM) && xe->info.has_usm)
> >  		xe_devm_add(tile, &tile->mem.vram);
> >  err_mem_access:
> >  	xe_device_mem_access_put(tile_to_xe(tile));
> > --
> > 2.26.3
> >


* RE: [v2 19/31] drm/xe/svm: Determine a vma is backed by device memory
  2024-04-10 21:56   ` Matthew Brost
@ 2024-06-05  2:29     ` Zeng, Oak
  0 siblings, 0 replies; 72+ messages in thread
From: Zeng, Oak @ 2024-06-05  2:29 UTC (permalink / raw)
  To: Brost, Matthew
  Cc: intel-xe@lists.freedesktop.org, Ghimiray, Himal Prasad,
	Bommu, Krishnaiah, Thomas.Hellstrom@linux.intel.com, Welty, Brian



> -----Original Message-----
> From: Brost, Matthew <matthew.brost@intel.com>
> Sent: Wednesday, April 10, 2024 5:57 PM
> To: Zeng, Oak <oak.zeng@intel.com>
> Cc: intel-xe@lists.freedesktop.org; Ghimiray, Himal Prasad
> <himal.prasad.ghimiray@intel.com>; Bommu, Krishnaiah
> <krishnaiah.bommu@intel.com>; Thomas.Hellstrom@linux.intel.com; Welty,
> Brian <brian.welty@intel.com>
> Subject: Re: [v2 19/31] drm/xe/svm: Determine a vma is backed by device
> memory
> 
> On Tue, Apr 09, 2024 at 04:17:30PM -0400, Oak Zeng wrote:
> > With system allocator, a userptr can now be back by device
> > memory also. Introduce a helper function xe_vma_is_devmem
> > to determine whether a vma is backed by device memory.
> >
> > Signed-off-by: Oak Zeng <oak.zeng@intel.com>
> > ---
> >  drivers/gpu/drm/xe/xe_pt.c | 14 ++++++++++++--
> >  1 file changed, 12 insertions(+), 2 deletions(-)
> >
> > diff --git a/drivers/gpu/drm/xe/xe_pt.c b/drivers/gpu/drm/xe/xe_pt.c
> > index 846e896edcb5..525092111be9 100644
> > --- a/drivers/gpu/drm/xe/xe_pt.c
> > +++ b/drivers/gpu/drm/xe/xe_pt.c
> > @@ -577,6 +577,17 @@ static const struct xe_pt_walk_ops
> xe_pt_stage_bind_ops = {
> >  	.pt_entry = xe_pt_stage_bind_entry,
> >  };
> >
> > +static bool xe_vma_is_devmem(struct xe_vma *vma)
> 
> At some point we probably want to scrub the driver as we intermix
> devmem, vram, and lmem nomenclature. I think in case we mean the same
> thing too. Anwyays that is a little out of scope here.

Agreed, we are mixing those terms.

Devmem still works for me in this case; I will keep it for now.

> 
> > +{
> > +	if (xe_vma_is_userptr(vma)) {
> > +		struct xe_userptr_vma *uvma = to_userptr_vma(vma);
> > +		return uvma->userptr.is_device_pages;
> 
> Helper itself LGTM. Maybe promote to xe_vm.c/xe_vm.h?
> 
> Also consider other options rather than userptr.is_device_pages flag
> here (e.g. look for buddy blocks, check a gpuvm flags, etc...). Can live
> with a flag but we can do without it, great.

For now this helper is only used in this file, so I will keep it static in xe_pt.c. We can consider moving it if another file needs it in the future.

In v3, I introduced range-based page table updates: instead of updating the page table for a whole vma, we can now update only a sub-range of a vma. 

For the system allocator, the backing store of a vma can now be a mixed placement of system memory and device memory. I will update this helper to reflect that change.

Oak

> 
> Matt
> 
> > +	} else {
> > +		struct xe_bo *bo = xe_vma_bo(vma);
> > +		return bo && (xe_bo_is_vram(bo) ||
> xe_bo_is_stolen_devmem(bo));
> > +	}
> > +}
> > +
> >  /**
> >   * xe_pt_stage_bind() - Build a disconnected page-table tree for a given
> address
> >   * range.
> > @@ -601,8 +612,7 @@ xe_pt_stage_bind(struct xe_tile *tile, struct
> xe_vma *vma,
> >  {
> >  	struct xe_device *xe = tile_to_xe(tile);
> >  	struct xe_bo *bo = xe_vma_bo(vma);
> > -	bool is_devmem = !xe_vma_is_userptr(vma) && bo &&
> > -		(xe_bo_is_vram(bo) || xe_bo_is_stolen_devmem(bo));
> > +	bool is_devmem = xe_vma_is_devmem(vma);
> >  	struct xe_res_cursor curs;
> >  	struct xe_pt_stage_bind_walk xe_walk = {
> >  		.base = {
> > --
> > 2.26.3
> >


* RE: [v2 22/31] drm/xe/svm: implement functions to allocate and free device memory
  2024-04-10 22:23   ` Matthew Brost
  2024-04-15 20:13     ` Zeng, Oak
@ 2024-06-05 22:16     ` Zeng, Oak
  2024-06-05 23:37       ` Matthew Brost
  1 sibling, 1 reply; 72+ messages in thread
From: Zeng, Oak @ 2024-06-05 22:16 UTC (permalink / raw)
  To: Brost, Matthew
  Cc: intel-xe@lists.freedesktop.org, Ghimiray, Himal Prasad,
	Bommu, Krishnaiah, Thomas.Hellstrom@linux.intel.com, Welty, Brian

Hi Matt,

> -----Original Message-----
> From: Brost, Matthew <matthew.brost@intel.com>
> Sent: Wednesday, April 10, 2024 6:24 PM
> To: Zeng, Oak <oak.zeng@intel.com>
> Cc: intel-xe@lists.freedesktop.org; Ghimiray, Himal Prasad
> <himal.prasad.ghimiray@intel.com>; Bommu, Krishnaiah
> <krishnaiah.bommu@intel.com>; Thomas.Hellstrom@linux.intel.com; Welty,
> Brian <brian.welty@intel.com>
> Subject: Re: [v2 22/31] drm/xe/svm: implement functions to allocate and
> free device memory
> 
> On Tue, Apr 09, 2024 at 04:17:33PM -0400, Oak Zeng wrote:
> > Function xe_devm_alloc_pages allocate pages from drm buddy and
> perform
> > house keeping work for all the pages allocated, such as get a page
> > refcount, keep a bitmap of all pages to denote whether a page is in
> > use, put pages to a drm lru list for eviction purpose.
> >
> > Function xe_devm_free_blocks return list of memory blocks to drm buddy
> > allocator.
> >
> > Function xe_devm_free_page is a call back function from hmm layer. It
> > is called whenever a page's refcount reaches to 1. This function clears
> > the bit of this page in the bitmap. If all the bits in the bitmap is
> > cleared, it means all the pages have been freed, we return all the pages
> > in this memory block back to drm buddy.
> >
> > Signed-off-by: Oak Zeng <oak.zeng@intel.com>
> > Co-developed-by: Niranjana Vishwanathapura
> <niranjana.vishwanathapura@intel.com>
> > Signed-off-by: Niranjana Vishwanathapura
> <niranjana.vishwanathapura@intel.com>
> > Cc: Matthew Brost <matthew.brost@intel.com>
> > Cc: Thomas Hellström <thomas.hellstrom@intel.com>
> > Cc: Brian Welty <brian.welty@intel.com>
> > ---
> >  drivers/gpu/drm/xe/xe_svm.h        |   7 ++
> >  drivers/gpu/drm/xe/xe_svm_devmem.c | 147 ++++++++++++++++++++++++++++-
> 
> See comments about file in previous patches, they apply here too.
> 
> >  2 files changed, 152 insertions(+), 2 deletions(-)
> >
> > diff --git a/drivers/gpu/drm/xe/xe_svm.h
> b/drivers/gpu/drm/xe/xe_svm.h
> > index 624c1581f8ba..92a3ee90d5a7 100644
> > --- a/drivers/gpu/drm/xe/xe_svm.h
> > +++ b/drivers/gpu/drm/xe/xe_svm.h
> > @@ -46,4 +46,11 @@ static inline struct xe_mem_region
> *xe_page_to_mem_region(struct page *page)
> >  	return container_of(page->pgmap, struct xe_mem_region,
> pagemap);
> >  }
> >
> > +int xe_devm_alloc_pages(struct xe_tile *tile,
> > +						unsigned long npages,
> > +						struct list_head *blocks,
> > +						unsigned long *pfn);
> > +
> > +void xe_devm_free_blocks(struct list_head *blocks);
> > +void xe_devm_page_free(struct page *page);
> >  #endif
> > diff --git a/drivers/gpu/drm/xe/xe_svm_devmem.c
> b/drivers/gpu/drm/xe/xe_svm_devmem.c
> > index 31af56e8285a..5ba0cd9a70b0 100644
> > --- a/drivers/gpu/drm/xe/xe_svm_devmem.c
> > +++ b/drivers/gpu/drm/xe/xe_svm_devmem.c
> > @@ -5,18 +5,161 @@
> >
> >  #include <linux/mm_types.h>
> >  #include <linux/sched/mm.h>
> > -
> > +#include <linux/gfp.h>
> > +#include <linux/migrate.h>
> > +#include <linux/dma-mapping.h>
> > +#include <linux/dma-fence.h>
> > +#include <linux/bitops.h>
> > +#include <linux/bitmap.h>
> > +#include <drm/drm_buddy.h>
> >  #include "xe_device_types.h"
> >  #include "xe_svm.h"
> > +#include "xe_migrate.h"
> > +#include "xe_ttm_vram_mgr_types.h"
> > +#include "xe_assert.h"
> >
> > +/**
> > + * struct xe_svm_block_meta - svm uses this data structure to manage each
> > + * block allocated from drm buddy. This will be set to the drm_buddy_block's
> > + * private field.
> > + *
> > + * @lru: used to link this block to drm's lru lists. This will be replace
> > + * with struct drm_lru_entity later.
> > + * @tile: tile from which we allocated this block
> > + * @bitmap: A bitmap of each page in this block. 1 means this page is used,
> > + * 0 means this page is idle. When all bits of this block are 0, it is time
> > + * to return this block to drm buddy subsystem.
> > + */
> > +struct xe_svm_block_meta {
> > +	struct list_head lru;
> > +	struct xe_tile *tile;
> > +	unsigned long bitmap[];
> > +};
> 
> This looks not needed to me but admittedly haven't looked at the LRU stuff.

I am moving to page granularity memory eviction, so we can either use the lru in the struct page itself, or I will have to introduce some other data structure which has an lru.

Yes, I removed block_meta in v3.

> 
> I am thinking roughly...
> 
> - I think we drop all this special tracking (kill xe_svm_block_meta)
Agreed.

> - Have functions to allocate / free the buddy blocks, store buddy blocks in
> userptr

Why do we need to store buddy blocks in the userptr? If you do this, you would need another block_list in the userptr.

I currently use the struct page's zone_device_data to point to the buddy block. It seems to work for me.

> - Blocks are allocated before migration to VRAM

Agreed

> - Blocks can be freed on either CPU fault after migration or on VMA
>   destroy (probably depends on madvive hints for VMA where we free
>   blocks)

My current understanding is that once device pages are allocated and handed over to core mm/hmm, the driver doesn't need to worry about their life cycle; core mm/hmm will take care of it by calling back into the page_free vfunc.

> - Blocks allocated / freed at ia chunk (xe_vma in this code) granularity
>   (conceptually the same if we switch to 1 to N ratio between xe_vma &
>   pt_state)

As said, I am moving to page granularity and away from xe_vma granularity, to address concerns from the drm community discussion.

> - block->private == memory region so we can get pfn from block

In the v3 code, block->private is not used. I will use it if needed.

Each struct page has a pgmap pointer which points to the xe_mem_region's pgmap member. We can use this to get a page/pfn's memory region.

> - When we need migrate_pfns we loop over buddy blocks populating migrate.dst
> 
> Also I noticed the drm_buddy_* calls in this file are not protected by a
> lock, we will need that. Currently it is tile->mem.vram_mgr->lock in the
> VRAM mgr code, we either need to reach into there or move this lock to
> common place so the VRAM manager and block allocations for SVM don't
> race with each other.

OK, will add this lock. Let's keep it as vram_mgr->lock for now, for simplicity.

Oak


> 
> Matt
> 
> >
> >  static vm_fault_t xe_devm_migrate_to_ram(struct vm_fault *vmf)
> >  {
> >  	return 0;
> >  }
> >
> > -static void xe_devm_page_free(struct page *page)
> > +static u64 block_offset_to_pfn(struct xe_mem_region *mr, u64 offset)
> > +{
> > +	/** DRM buddy's block offset is 0-based*/
> > +	offset += mr->hpa_base;
> > +
> > +	return PHYS_PFN(offset);
> > +}
> > +
> > +/** FIXME: we locked page by calling zone_device_page_init
> > + *  in xe_devm_alloc_pages. Should we unlock pages here?
> > + */
> > +static void free_block(struct drm_buddy_block *block)
> > +{
> > +	struct xe_svm_block_meta *meta =
> > +		(struct xe_svm_block_meta *)block->private;
> > +	struct xe_tile *tile  = meta->tile;
> > +	struct drm_buddy *mm = &tile->mem.vram_mgr->mm;
> > +
> > +	kfree(block->private);
> > +	drm_buddy_free_block(mm, block);
> > +}
> > +
> > +void xe_devm_page_free(struct page *page)
> > +{
> > +	struct drm_buddy_block *block =
> > +				(struct drm_buddy_block *)page->zone_device_data;
> > +	struct xe_svm_block_meta *meta =
> > +				(struct xe_svm_block_meta *)block->private;
> > +	struct xe_tile *tile  = meta->tile;
> > +	struct xe_mem_region *mr = &tile->mem.vram;
> > +	struct drm_buddy *mm = &tile->mem.vram_mgr->mm;
> > +	u64 size = drm_buddy_block_size(mm, block);
> > +	u64 pages_per_block = size >> PAGE_SHIFT;
> > +	u64 block_pfn_first =
> > +				block_offset_to_pfn(mr, drm_buddy_block_offset(block));
> > +	u64 page_pfn = page_to_pfn(page);
> > +	u64 i = page_pfn - block_pfn_first;
> > +
> > +	xe_assert(tile->xe, i < pages_per_block);
> > +	clear_bit(i, meta->bitmap);
> > +	if (bitmap_empty(meta->bitmap, pages_per_block))
> > +		free_block(block);
> > +}
> > +
> > +/**
> > + * xe_devm_alloc_pages() - allocate device pages from buddy allocator
> > + *
> > + * @xe_tile: which tile to allocate device memory from
> > + * @npages: how many pages to allocate
> > + * @blocks: used to return the allocated blocks
> > + * @pfn: used to return the pfn of all allocated pages. Must be big enough
> > + * to hold at @npages entries.
> > + *
> > + * This function allocate blocks of memory from drm buddy allocator, and
> > + * performs initialization work: set struct page::zone_device_data to point
> > + * to the memory block; set/initialize drm_buddy_block::private field;
> > + * lock_page for each page allocated; add memory block to lru managers lru
> > + * list - this is TBD.
> > + *
> > + * return: 0 on success
> > + * error code otherwise
> > + */
> > +int xe_devm_alloc_pages(struct xe_tile *tile,
> > +						unsigned long npages,
> > +						struct list_head *blocks,
> > +						unsigned long *pfn)
> > +{
> > +	struct drm_buddy *mm = &tile->mem.vram_mgr->mm;
> > +	struct drm_buddy_block *block, *tmp;
> > +	u64 size = npages << PAGE_SHIFT;
> > +	int ret = 0, i, j = 0;
> > +
> > +	ret = drm_buddy_alloc_blocks(mm, 0, mm->size, size, PAGE_SIZE,
> > +				blocks, DRM_BUDDY_TOPDOWN_ALLOCATION);
> > +
> > +	if (unlikely(ret))
> > +		return ret;
> > +
> > +	list_for_each_entry_safe(block, tmp, blocks, link) {
> > +		struct xe_mem_region *mr = &tile->mem.vram;
> > +		u64 block_pfn_first, pages_per_block;
> > +		struct xe_svm_block_meta *meta;
> > +		u32 meta_size;
> > +
> > +		size = drm_buddy_block_size(mm, block);
> > +		pages_per_block = size >> PAGE_SHIFT;
> > +		meta_size = BITS_TO_BYTES(pages_per_block) +
> > +					sizeof(struct xe_svm_block_meta);
> > +		meta = kzalloc(meta_size, GFP_KERNEL);
> > +		bitmap_fill(meta->bitmap, pages_per_block);
> > +		meta->tile = tile;
> > +		block->private = meta;
> > +		block_pfn_first =
> > +				block_offset_to_pfn(mr, drm_buddy_block_offset(block));
> > +		for(i = 0; i < pages_per_block; i++) {
> > +			struct page *page;
> > +
> > +			pfn[j++] = block_pfn_first + i;
> > +			page = pfn_to_page(block_pfn_first + i);
> > +			/**Lock page per hmm requirement, see hmm.rst.*/
> > +			zone_device_page_init(page);
> > +			page->zone_device_data = block;
> > +		}
> > +	}
> > +
> > +	return ret;
> > +}
> > +
> > +/**
> > + * xe_devm_free_blocks() - free all memory blocks
> > + *
> > + * @blocks: memory blocks list head
> > + */
> > +void xe_devm_free_blocks(struct list_head *blocks)
> >  {
> > +	struct drm_buddy_block *block, *tmp;
> > +
> > +	list_for_each_entry_safe(block, tmp, blocks, link)
> > +		free_block(block);
> >  }
> >
> >  static const struct dev_pagemap_ops xe_devm_pagemap_ops = {
> > --
> > 2.26.3
> >


* Re: [v2 22/31] drm/xe/svm: implement functions to allocate and free device memory
  2024-06-05 22:16     ` Zeng, Oak
@ 2024-06-05 23:37       ` Matthew Brost
  2024-06-06  3:30         ` Zeng, Oak
  0 siblings, 1 reply; 72+ messages in thread
From: Matthew Brost @ 2024-06-05 23:37 UTC (permalink / raw)
  To: Zeng, Oak
  Cc: intel-xe@lists.freedesktop.org, Ghimiray, Himal Prasad,
	Bommu, Krishnaiah, Thomas.Hellstrom@linux.intel.com, Welty, Brian

On Wed, Jun 05, 2024 at 04:16:32PM -0600, Zeng, Oak wrote:
> Hi Matt,
> 
> > -----Original Message-----
> > From: Brost, Matthew <matthew.brost@intel.com>
> > Sent: Wednesday, April 10, 2024 6:24 PM
> > To: Zeng, Oak <oak.zeng@intel.com>
> > Cc: intel-xe@lists.freedesktop.org; Ghimiray, Himal Prasad
> > <himal.prasad.ghimiray@intel.com>; Bommu, Krishnaiah
> > <krishnaiah.bommu@intel.com>; Thomas.Hellstrom@linux.intel.com; Welty,
> > Brian <brian.welty@intel.com>
> > Subject: Re: [v2 22/31] drm/xe/svm: implement functions to allocate and
> > free device memory
> > 
> > On Tue, Apr 09, 2024 at 04:17:33PM -0400, Oak Zeng wrote:
> > > Function xe_devm_alloc_pages allocate pages from drm buddy and
> > perform
> > > house keeping work for all the pages allocated, such as get a page
> > > refcount, keep a bitmap of all pages to denote whether a page is in
> > > use, put pages to a drm lru list for eviction purpose.
> > >
> > > Function xe_devm_free_blocks return list of memory blocks to drm buddy
> > > allocator.
> > >
> > > Function xe_devm_free_page is a call back function from hmm layer. It
> > > is called whenever a page's refcount reaches to 1. This function clears
> > > the bit of this page in the bitmap. If all the bits in the bitmap is
> > > cleared, it means all the pages have been freed, we return all the pages
> > > in this memory block back to drm buddy.
> > >
> > > Signed-off-by: Oak Zeng <oak.zeng@intel.com>
> > > Co-developed-by: Niranjana Vishwanathapura
> > <niranjana.vishwanathapura@intel.com>
> > > Signed-off-by: Niranjana Vishwanathapura
> > <niranjana.vishwanathapura@intel.com>
> > > Cc: Matthew Brost <matthew.brost@intel.com>
> > > Cc: Thomas Hellström <thomas.hellstrom@intel.com>
> > > Cc: Brian Welty <brian.welty@intel.com>
> > > ---
> > >  drivers/gpu/drm/xe/xe_svm.h        |   7 ++
> > >  drivers/gpu/drm/xe/xe_svm_devmem.c | 147 ++++++++++++++++++++++++++++-
> > 
> > See comments about file in previous patches, they apply here too.
> > 
> > >  2 files changed, 152 insertions(+), 2 deletions(-)
> > >
> > > diff --git a/drivers/gpu/drm/xe/xe_svm.h
> > b/drivers/gpu/drm/xe/xe_svm.h
> > > index 624c1581f8ba..92a3ee90d5a7 100644
> > > --- a/drivers/gpu/drm/xe/xe_svm.h
> > > +++ b/drivers/gpu/drm/xe/xe_svm.h
> > > @@ -46,4 +46,11 @@ static inline struct xe_mem_region
> > *xe_page_to_mem_region(struct page *page)
> > >  	return container_of(page->pgmap, struct xe_mem_region,
> > pagemap);
> > >  }
> > >
> > > +int xe_devm_alloc_pages(struct xe_tile *tile,
> > > +						unsigned long npages,
> > > +						struct list_head *blocks,
> > > +						unsigned long *pfn);
> > > +
> > > +void xe_devm_free_blocks(struct list_head *blocks);
> > > +void xe_devm_page_free(struct page *page);
> > >  #endif
> > > diff --git a/drivers/gpu/drm/xe/xe_svm_devmem.c
> > b/drivers/gpu/drm/xe/xe_svm_devmem.c
> > > index 31af56e8285a..5ba0cd9a70b0 100644
> > > --- a/drivers/gpu/drm/xe/xe_svm_devmem.c
> > > +++ b/drivers/gpu/drm/xe/xe_svm_devmem.c
> > > @@ -5,18 +5,161 @@
> > >
> > >  #include <linux/mm_types.h>
> > >  #include <linux/sched/mm.h>
> > > -
> > > +#include <linux/gfp.h>
> > > +#include <linux/migrate.h>
> > > +#include <linux/dma-mapping.h>
> > > +#include <linux/dma-fence.h>
> > > +#include <linux/bitops.h>
> > > +#include <linux/bitmap.h>
> > > +#include <drm/drm_buddy.h>
> > >  #include "xe_device_types.h"
> > >  #include "xe_svm.h"
> > > +#include "xe_migrate.h"
> > > +#include "xe_ttm_vram_mgr_types.h"
> > > +#include "xe_assert.h"
> > >
> > > +/**
> > > + * struct xe_svm_block_meta - svm uses this data structure to manage each
> > > + * block allocated from drm buddy. This will be set to the drm_buddy_block's
> > > + * private field.
> > > + *
> > > + * @lru: used to link this block to drm's lru lists. This will be replace
> > > + * with struct drm_lru_entity later.
> > > + * @tile: tile from which we allocated this block
> > > + * @bitmap: A bitmap of each page in this block. 1 means this page is used,
> > > + * 0 means this page is idle. When all bits of this block are 0, it is time
> > > + * to return this block to drm buddy subsystem.
> > > + */
> > > +struct xe_svm_block_meta {
> > > +	struct list_head lru;
> > > +	struct xe_tile *tile;
> > > +	unsigned long bitmap[];
> > > +};
> > 
> > This looks not needed to me but admittedly haven't looked at the LRU stuff.
> 
> I am moving to page granularity memory eviction, so we can either use the lru in the struct page itself, or I will have to introduce some other data structure which has an lru.
> 

You almost certainly cannot use struct page; I'm pretty sure putting a
subsystem memory management feature into the core memory management
code will not be well received.

I'm going to say this again: I disagree with this design decision, as I
do not think using a BO / migration + eviction at allocation
granularity should be dismissed. Using a BO offers eviction more or
less for free and possibly dma-buf reuse for multi-GPU. Please study my
PoC [1]. It has SVM full featured and largely working.

If you have a different design, great. But I'd expect the next post to
have feature parity, thorough testing, and well thought out design
choices with explanations of those choices, beyond "someone in the
community said something, so I am doing it this way" (i.e. deep thought
and understanding of how all the pieces fit together and why this
design was chosen).

[1] https://gitlab.freedesktop.org/mbrost/xe-kernel-driver-svm-post/-/tree/post?ref_type=heads

> Yes, I removed block_meta in v3.
>

I don't like speculation, as I've said many times. Let's see what rev3
looks like and I will review it then.

> > 
> > I am thinking roughly...
> > 
> > - I think we drop all this special tracking (kill xe_svm_block_meta)
> Agreed.
> 
> > - Have functions to allocate / free the buddy blocks, store buddy blocks in
> > userptr
> 
> Why do we need to store buddy blocks in the userptr? If you do this, you would need another block_list in the userptr.
> 
> I currently use the struct page's zone_device_data to point to the buddy block. It seems to work for me.
> 

It doesn't. See the zdd structure in my working PoC [1]; almost certainly
something like that will be needed for a variety of reasons. Also, if
you allocate a buddy block then it really isn't page granularity, nor
does it offer an advantage over a BO wrt page granularity. I have
stated this multiple times in lengthy explanations and have not heard a
reasonable argument against it.

> > - Blocks are allocated before migration to VRAM
> 
> Agreed
> 
> > - Blocks can be freed on either CPU fault after migration or on VMA
> >   destroy (probably depends on madvive hints for VMA where we free
> >   blocks)
> 
> My current understanding is that once device pages are allocated and handed over to core mm/hmm, the driver doesn't need to worry about their life cycle; core mm/hmm will take care of it by calling back into the page_free vfunc.
>

Yes, we definitely need the page_free vfunc callback implemented.
This was a misunderstanding on my part; certainly you can make it work
without it, but that would not be correct.
 
> > - Blocks allocated / freed at ia chunk (xe_vma in this code) granularity
> >   (conceptually the same if we switch to 1 to N ratio between xe_vma &
> >   pt_state)
> 
> As said, I am moving to page granularity and away from xe_vma granularity, to address concerns from the drm community discussion.
>

That's not what the community is saying, or at least not how I
interpret it. The community is saying: on a fault, allocate at the page
granularity of the CPU mapping. E.g. if the CPU mapping is 4k, allocate
4k; if the CPU mapping is 2M, allocate 2M. There is quite a bit of core
MM work that would need to be done for this to work properly though,
namely that the migration layer splits THPs into smaller pages. I think
that should be fixed (e.g. teach core MM to understand device private
pages that are larger than 4k, and likewise coherent pages).
Conceptually I think this is the right approach, but it would take
quite a bit of work. With proper layering, only the DRM layer needs to
be aware of this: we could build SVM without it, like I do in my PoC,
and then switch over to this model once the core MM work is done,
mostly only modifying the DRM layer. This would eliminate the partial
unmapping / invalidation scenarios which my PoC implements and provides
documentation for.

A general comment:
If we post something that we view as good and working, community
feedback can be pushed back on with a justifiable design. Blindly doing
things without understanding the larger issues (e.g. "the community
said do it this way") is not a great way to work. Nor is posting
untested code with a partial implementation.
 
> > - block->private == memory region so we can get pfn from block
> 
> In the v3 code, block->private is not used. I will use it if needed.
> 
> Each struct page has a pgmap pointer which points to the xe_mem_region's pgmap member. We can use this to get a page/pfn's memory region.
> 

That probably works, but there may be scenarios where you need to go
from block to mr and the device pages are not readily available. You
can always go from block -> pages -> mr, but it may be advantageous to
short-circuit that with block -> mr, especially in hot
(performance-critical) paths.

Matt

> > - When we need migrate_pfns we loop over buddy blocks populating migrate.dst
> > 
> > Also I noticed the drm_buddy_* calls in this file are not protected by a
> > lock, we will need that. Currently it is tile->mem.vram_mgr->lock in the
> > VRAM mgr code, we either need to reach into there or move this lock to
> > common place so the VRAM manager and block allocations for SVM don't
> > race with each other.
> 
> Ok, will add this lock. Lets keep the lock in vram_mgr->lock for now for simplicity
> 
> Oak
> 
> 
> > 
> > Matt
> > 
> > >
> > >  static vm_fault_t xe_devm_migrate_to_ram(struct vm_fault *vmf)
> > >  {
> > >  	return 0;
> > >  }
> > >
> > > -static void xe_devm_page_free(struct page *page)
> > > +static u64 block_offset_to_pfn(struct xe_mem_region *mr, u64 offset)
> > > +{
> > > +	/** DRM buddy's block offset is 0-based*/
> > > +	offset += mr->hpa_base;
> > > +
> > > +	return PHYS_PFN(offset);
> > > +}
> > > +
> > > +/** FIXME: we locked page by calling zone_device_page_init
> > > + *  in xe_devm_alloc_pages. Should we unlock pages here?
> > > + */
> > > +static void free_block(struct drm_buddy_block *block)
> > > +{
> > > +	struct xe_svm_block_meta *meta =
> > > +		(struct xe_svm_block_meta *)block->private;
> > > +	struct xe_tile *tile  = meta->tile;
> > > +	struct drm_buddy *mm = &tile->mem.vram_mgr->mm;
> > > +
> > > +	kfree(block->private);
> > > +	drm_buddy_free_block(mm, block);
> > > +}
> > > +
> > > +void xe_devm_page_free(struct page *page)
> > > +{
> > > +	struct drm_buddy_block *block =
> > > +				(struct drm_buddy_block *)page->zone_device_data;
> > > +	struct xe_svm_block_meta *meta =
> > > +				(struct xe_svm_block_meta *)block->private;
> > > +	struct xe_tile *tile  = meta->tile;
> > > +	struct xe_mem_region *mr = &tile->mem.vram;
> > > +	struct drm_buddy *mm = &tile->mem.vram_mgr->mm;
> > > +	u64 size = drm_buddy_block_size(mm, block);
> > > +	u64 pages_per_block = size >> PAGE_SHIFT;
> > > +	u64 block_pfn_first =
> > > +				block_offset_to_pfn(mr, drm_buddy_block_offset(block));
> > > +	u64 page_pfn = page_to_pfn(page);
> > > +	u64 i = page_pfn - block_pfn_first;
> > > +
> > > +	xe_assert(tile->xe, i < pages_per_block);
> > > +	clear_bit(i, meta->bitmap);
> > > +	if (bitmap_empty(meta->bitmap, pages_per_block))
> > > +		free_block(block);
> > > +}
> > > +
> > > +/**
> > > + * xe_devm_alloc_pages() - allocate device pages from buddy allocator
> > > + *
> > > + * @xe_tile: which tile to allocate device memory from
> > > + * @npages: how many pages to allocate
> > > + * @blocks: used to return the allocated blocks
> > > + * @pfn: used to return the pfn of all allocated pages. Must be big enough
> > > + * to hold at @npages entries.
> > > + *
> > > + * This function allocate blocks of memory from drm buddy allocator, and
> > > + * performs initialization work: set struct page::zone_device_data to point
> > > + * to the memory block; set/initialize drm_buddy_block::private field;
> > > + * lock_page for each page allocated; add memory block to lru managers lru
> > > + * list - this is TBD.
> > > + *
> > > + * return: 0 on success
> > > + * error code otherwise
> > > + */
> > > +int xe_devm_alloc_pages(struct xe_tile *tile,
> > > +						unsigned long npages,
> > > +						struct list_head *blocks,
> > > +						unsigned long *pfn)
> > > +{
> > > +	struct drm_buddy *mm = &tile->mem.vram_mgr->mm;
> > > +	struct drm_buddy_block *block, *tmp;
> > > +	u64 size = npages << PAGE_SHIFT;
> > > +	int ret = 0, i, j = 0;
> > > +
> > > +	ret = drm_buddy_alloc_blocks(mm, 0, mm->size, size, PAGE_SIZE,
> > > +				blocks, DRM_BUDDY_TOPDOWN_ALLOCATION);
> > > +
> > > +	if (unlikely(ret))
> > > +		return ret;
> > > +
> > > +	list_for_each_entry_safe(block, tmp, blocks, link) {
> > > +		struct xe_mem_region *mr = &tile->mem.vram;
> > > +		u64 block_pfn_first, pages_per_block;
> > > +		struct xe_svm_block_meta *meta;
> > > +		u32 meta_size;
> > > +
> > > +		size = drm_buddy_block_size(mm, block);
> > > +		pages_per_block = size >> PAGE_SHIFT;
> > > +		meta_size = BITS_TO_BYTES(pages_per_block) +
> > > +					sizeof(struct xe_svm_block_meta);
> > > +		meta = kzalloc(meta_size, GFP_KERNEL);
> > > +		bitmap_fill(meta->bitmap, pages_per_block);
> > > +		meta->tile = tile;
> > > +		block->private = meta;
> > > +		block_pfn_first =
> > > +					block_offset_to_pfn(mr,
> > drm_buddy_block_offset(block));
> > > +		for(i = 0; i < pages_per_block; i++) {
> > > +			struct page *page;
> > > +
> > > +			pfn[j++] = block_pfn_first + i;
> > > +			page = pfn_to_page(block_pfn_first + i);
> > > +			/**Lock page per hmm requirement, see hmm.rst.*/
> > > +			zone_device_page_init(page);
> > > +			page->zone_device_data = block;
> > > +		}
> > > +	}
> > > +
> > > +	return ret;
> > > +}
> > > +
> > > +/**
> > > + * xe_devm_free_blocks() - free all memory blocks
> > > + *
> > > + * @blocks: memory blocks list head
> > > + */
> > > +void xe_devm_free_blocks(struct list_head *blocks)
> > >  {
> > > +	struct drm_buddy_block *block, *tmp;
> > > +
> > > +	list_for_each_entry_safe(block, tmp, blocks, link)
> > > +		free_block(block);
> > >  }
> > >
> > >  static const struct dev_pagemap_ops xe_devm_pagemap_ops = {
> > > --
> > > 2.26.3
> > >

^ permalink raw reply	[flat|nested] 72+ messages in thread

* RE: [v2 22/31] drm/xe/svm: implement functions to allocate and free device memory
  2024-06-05 23:37       ` Matthew Brost
@ 2024-06-06  3:30         ` Zeng, Oak
  2024-06-06  4:44           ` Matthew Brost
  0 siblings, 1 reply; 72+ messages in thread
From: Zeng, Oak @ 2024-06-06  3:30 UTC (permalink / raw)
  To: Brost, Matthew
  Cc: intel-xe@lists.freedesktop.org, Ghimiray, Himal Prasad,
	Bommu, Krishnaiah, Thomas.Hellstrom@linux.intel.com, Welty, Brian,
	Auld, Matthew

Hi Matt,

> -----Original Message-----
> From: Brost, Matthew <matthew.brost@intel.com>
> Sent: Wednesday, June 5, 2024 7:37 PM
> To: Zeng, Oak <oak.zeng@intel.com>
> Cc: intel-xe@lists.freedesktop.org; Ghimiray, Himal Prasad
> <himal.prasad.ghimiray@intel.com>; Bommu, Krishnaiah
> <krishnaiah.bommu@intel.com>; Thomas.Hellstrom@linux.intel.com; Welty,
> Brian <brian.welty@intel.com>
> Subject: Re: [v2 22/31] drm/xe/svm: implement functions to allocate and
> free device memory
> 
> On Wed, Jun 05, 2024 at 04:16:32PM -0600, Zeng, Oak wrote:
> > Hi Matt,
> >
> > > -----Original Message-----
> > > From: Brost, Matthew <matthew.brost@intel.com>
> > > Sent: Wednesday, April 10, 2024 6:24 PM
> > > To: Zeng, Oak <oak.zeng@intel.com>
> > > Cc: intel-xe@lists.freedesktop.org; Ghimiray, Himal Prasad
> > > <himal.prasad.ghimiray@intel.com>; Bommu, Krishnaiah
> > > <krishnaiah.bommu@intel.com>; Thomas.Hellstrom@linux.intel.com;
> Welty,
> > > Brian <brian.welty@intel.com>
> > > Subject: Re: [v2 22/31] drm/xe/svm: implement functions to allocate and
> > > free device memory
> > >
> > > On Tue, Apr 09, 2024 at 04:17:33PM -0400, Oak Zeng wrote:
> > > > Function xe_devm_alloc_pages allocates pages from drm buddy and
> > > performs
> > > > housekeeping work for all the pages allocated, such as getting a page
> > > > refcount, keeping a bitmap of all pages to denote whether a page is in
> > > > use, and putting pages on a drm lru list for eviction purposes.
> > > >
> > > > Function xe_devm_free_blocks returns a list of memory blocks to drm
> buddy
> > > > allocator.
> > > >
> > > > Function xe_devm_free_page is a callback function from the hmm layer. It
> > > > is called whenever a page's refcount reaches 1. This function clears
> > > > the bit of this page in the bitmap. If all the bits in the bitmap are
> > > > cleared, it means all the pages have been freed, and we return all the pages
> > > > in this memory block back to drm buddy.
> > > >
> > > > Signed-off-by: Oak Zeng <oak.zeng@intel.com>
> > > > Co-developed-by: Niranjana Vishwanathapura
> > > <niranjana.vishwanathapura@intel.com>
> > > > Signed-off-by: Niranjana Vishwanathapura
> > > <niranjana.vishwanathapura@intel.com>
> > > > Cc: Matthew Brost <matthew.brost@intel.com>
> > > > Cc: Thomas Hellström <thomas.hellstrom@intel.com>
> > > > Cc: Brian Welty <brian.welty@intel.com>
> > > > ---
> > > >  drivers/gpu/drm/xe/xe_svm.h        |   7 ++
> > > >  drivers/gpu/drm/xe/xe_svm_devmem.c | 147
> > > ++++++++++++++++++++++++++++-
> > >
> > > See comments about file in previous patches, they apply here too.
> > >
> > > >  2 files changed, 152 insertions(+), 2 deletions(-)
> > > >
> > > > diff --git a/drivers/gpu/drm/xe/xe_svm.h
> > > b/drivers/gpu/drm/xe/xe_svm.h
> > > > index 624c1581f8ba..92a3ee90d5a7 100644
> > > > --- a/drivers/gpu/drm/xe/xe_svm.h
> > > > +++ b/drivers/gpu/drm/xe/xe_svm.h
> > > > @@ -46,4 +46,11 @@ static inline struct xe_mem_region
> > > *xe_page_to_mem_region(struct page *page)
> > > >  	return container_of(page->pgmap, struct xe_mem_region,
> > > pagemap);
> > > >  }
> > > >
> > > > +int xe_devm_alloc_pages(struct xe_tile *tile,
> > > > +						unsigned long npages,
> > > > +						struct list_head *blocks,
> > > > +						unsigned long *pfn);
> > > > +
> > > > +void xe_devm_free_blocks(struct list_head *blocks);
> > > > +void xe_devm_page_free(struct page *page);
> > > >  #endif
> > > > diff --git a/drivers/gpu/drm/xe/xe_svm_devmem.c
> > > b/drivers/gpu/drm/xe/xe_svm_devmem.c
> > > > index 31af56e8285a..5ba0cd9a70b0 100644
> > > > --- a/drivers/gpu/drm/xe/xe_svm_devmem.c
> > > > +++ b/drivers/gpu/drm/xe/xe_svm_devmem.c
> > > > @@ -5,18 +5,161 @@
> > > >
> > > >  #include <linux/mm_types.h>
> > > >  #include <linux/sched/mm.h>
> > > > -
> > > > +#include <linux/gfp.h>
> > > > +#include <linux/migrate.h>
> > > > +#include <linux/dma-mapping.h>
> > > > +#include <linux/dma-fence.h>
> > > > +#include <linux/bitops.h>
> > > > +#include <linux/bitmap.h>
> > > > +#include <drm/drm_buddy.h>
> > > >  #include "xe_device_types.h"
> > > >  #include "xe_svm.h"
> > > > +#include "xe_migrate.h"
> > > > +#include "xe_ttm_vram_mgr_types.h"
> > > > +#include "xe_assert.h"
> > > >
> > > > +/**
> > > > + * struct xe_svm_block_meta - svm uses this data structure to manage
> > > each
> > > > + * block allocated from drm buddy. This will be set to the
> > > drm_buddy_block's
> > > > + * private field.
> > > > + *
> > > > + * @lru: used to link this block to drm's lru lists. This will be replaced
> > > > + * with struct drm_lru_entity later.
> > > > + * @tile: tile from which we allocated this block
> > > > + * @bitmap: A bitmap of each page in this block. 1 means this page is
> used,
> > > > + * 0 means this page is idle. When all bits of this block are 0, it is time
> > > > + * to return this block to drm buddy subsystem.
> > > > + */
> > > > +struct xe_svm_block_meta {
> > > > +	struct list_head lru;
> > > > +	struct xe_tile *tile;
> > > > +	unsigned long bitmap[];
> > > > +};
> > >
> > > This looks not needed to me but admittedly haven't looked at the LRU
> stuff.
> >
> > I am moving to page granularity memory eviction, so we can either use the
> lru in the struct page itself, or I will have to introduce some other data
> structure which has an lru.
> >
> 
> You almost certainly cannot use struct page, pretty sure that will not
> be well received putting a subsystem memory management feature into the
> core memory management.

The reason I was thinking of struct page's lru is that the lru field is not used right now for device pages; it is only used for system pages. So I don't see a conflict there. It is just a member not used by core mm for device pages, and we can make use of it if we want.

But it is very possible that we don't use struct page's lru. In this series https://patchwork.freedesktop.org/patch/565501/?series=125879&rev=1, I have some drm_lru_entity concept. I need to go back to that series to work out the page granularity vram eviction part.


> 
> I'm going to say this again, I disagree with this design decision as I
> do not think using a BO / or migration + eviction at allocation
> granularity should be dismissed. Using a BO offers eviction more or
> less for free and possibly dma-buf reuse for multi-GPU. Please study my
> PoC [1]. It has SVM full featured and largely working.

It is completely fine that people disagree with a design. Removing the BO from the system allocator was aligned with the DRM community a long time ago. The main problem with a BO is that memory eviction and migration are not page-granularity.

I know in v2 I am not truly page granularity either. But I am moving to true page granularity in v3: memory allocation, free, migration and eviction are all at page granularity. It is very similar to the scheme Linux core mm has. There is much more work at the buddy allocator and eviction lru, compared to the BO approach.

Please also note, when I say page granularity, it could be one page, but it can also be multiple pages.
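To make this concrete, here is a minimal userspace model of the per-block bitmap scheme in this patch (one bit per page, block returned to the buddy allocator when the last bit clears). The names and the 64-page block size are illustrative only, not xe code:

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Illustrative stand-in for struct xe_svm_block_meta: one bit per page. */
#define PAGES_PER_BLOCK 64

struct block_meta {
	uint64_t bitmap;   /* bit i set: page i is in use */
	bool returned;     /* block handed back to the buddy allocator */
};

/* All pages of the block are handed out at allocation time. */
static void block_alloc_pages(struct block_meta *m)
{
	m->bitmap = ~0ULL;
	m->returned = false;
}

/* Models the dev_pagemap page_free callback for a single page. */
static void block_free_page(struct block_meta *m, unsigned int page_idx)
{
	m->bitmap &= ~(1ULL << page_idx);
	if (m->bitmap == 0)
		m->returned = true;   /* stands in for free_block() */
}
```

Freeing all but one page keeps the block alive; only the final page_free returns it, matching the bitmap_empty() check in xe_devm_page_free().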


> 
> If you have a different design, great. But I'd expect the next post to
> have feature parity, thorough testing, and well thought out design
> choices with explanations of said design choices beyond someone in the
> community said something so I am doing it this way (i.e. deep thought
> and understanding of how all the pieces fit together and why this design
> was chosen).

I understand feature parity and testing are important. The previous posts (v1 and v2) were mainly to get some feedback on the design. Since we are at the 3rd respin of this series, it is probably a good idea to do more testing before posting.


> 
> [1] https://gitlab.freedesktop.org/mbrost/xe-kernel-driver-svm-post/-
> /tree/post?ref_type=heads
> 
> > Yes, I removed block_meta in v3.
> >
> 
> I don't like speculation, said this many times. Let's see what rev3
> looks like and will review then.
> 
> > >
> > > I am thinking roughly...
> > >
> > > - I think we drop all this special tracking (kill xe_svm_block_meta)
> > Agreed.
> >
> > > - Have functions to allocate / free the buddy blocks, store buddy blocks in
> > > userptr
> >
> > Why do we need to store buddy blocks in userptr? If you do this, you would
> need another block_list in userptr.
> >
> > I currently use the struct page's zone_device_data to point to buddy_block.
> It seems to work for me.
> >
> 
> It doesn't. See my working PoC [1] zdd structure, almost certainly
> something like that will be needed for a variety of reasons. Also if you
> allocate a buddy block then it really isn't page granularity nor does
> offer an advantage over a BO wrt page granularity. I stated this multiple
> times in lengthy explanations and have not heard a reasonable argument
> against this.

It is true that the drm buddy interface is not page granularity. But this should mainly be looked at as a drawback of the buddy interface. If you look at the core mm buddy allocator, the interface is page-centric. We are working on the drm buddy allocator to improve it in this respect.

> 
> > > - Blocks are allocated before migration to VRAM
> >
> > Agreed
> >
> > > - Blocks can be freed on either CPU fault after migration or on VMA
> > >   destroy (probably depends on madvive hints for VMA where we free
> > >   blocks)
> >
> > My current understanding is, once device pages are allocated and handed
> over to core mm/hmm, the driver doesn't need to worry about the life cycle of
> device pages, i.e., core mm/hmm will take care of it by calling back to the
> page_free vfunc.
> >
> 
> Yes, we definitely need the page_free vfunc callback implemented.
> This was a misunderstanding on my part, certainly you can make it work
> without this but it is not correct.
> 
> > > - Blocks allocated / freed at a chunk (xe_vma in this code) granularity
> > >   (conceptually the same if we switch to 1 to N ratio between xe_vma &
> > >   pt_state)
> >
> > As said, I am moving to page granularity, moving away from xe_vma
> granularity, to address concerns from the drm community discussion.
> >
> 
> That's not what the community is saying, or at least it is not how I
> interpret it. The community is saying on a fault, allocate at the page
> granularity of the CPU mapping. e.g. If the CPU mapping is 4k, allocate
> 4k. If the CPU mapping is 2M, allocate 2M. 

I guess by CPU mapping you meant CPU vma.

I think we can create an mmu notifier to match the range of the cpu vma.

Regarding how much gpu memory should be allocated and migrated, my current thinking is that this can be decided by a memory hint attribute. I.e., if the CPU vma is 16M but the memory attribute says 2M, we can allocate/migrate 2M. If the memory attribute says 32M, we will migrate 16M.

So I don't understand why you want to allocate the same size as CPU vma...
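For illustration, the sizing policy I describe above could be as simple as clamping the hint granularity to the CPU vma size (a hypothetical helper, not a real xe function):

```c
#include <assert.h>
#include <stdint.h>

/*
 * Hypothetical sketch: how much to allocate/migrate on a GPU fault.
 * The migration-granularity hint is clamped to the CPU vma size, so a
 * 2M hint in a 16M vma migrates 2M, while a 32M hint migrates only 16M.
 */
static uint64_t migrate_chunk_size(uint64_t cpu_vma_size, uint64_t hint_granularity)
{
	return hint_granularity < cpu_vma_size ? hint_granularity : cpu_vma_size;
}
```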


> There is quite a bit of core MM
> work that would need to be done for this to properly work though, namely
> the migration layer splits THP into smaller pages. I think that should be
> fixed (e.g. teach core MM to understand device private pages that are
> larger than 4k, likewise coherent pages). Conceptually I think this is
> the right approach but would take quite a bit of work. But with that, if
> we layer the code properly only the DRM layer needs to be aware of this.
> e.g. We could build SVM without this like I do in my PoC and then switch
> over to this model once the core MM work is done, mostly only
> modifying the DRM layer. This would eliminate the partial unmapping /
> invalidation scenarios which my PoC implements and provides
> documentation for.

I don't quite follow here. Afaik, THP is only enabled in Linux core mm, managing system memory. I am not aware of it in xekmd. By THP, do you mean THP in xekmd? Or huge pages (compound pages, 2M or 1G etc.) backing a cpu vma in Linux core mm?


> 
> Below general comment:
> If we post something that we view as good and working, community
> feedback can be pushed back on with a justifiable design. Blindly doing
> things without understanding (e.g. community said do it this way) of the
> larger issues is not a great way to work. Nor is posting untested code
> with a partial implementation.
> 
> > > - block->private == memory region so we can get pfn from block
> >
> > In v3 code, block->private is not used. Will use it if needed.
> >
> > Each struct page has a pgmap pointer which points to xe_mem_region's
> pgmap memory. We can use this information to get a page/pfn's memory
> region.
> >
> 
> That probably works but there may be scenarios where you need to go from
> block to mr in which the device pages are not readily available. You
> can always go from block -> pages -> mr but it may be advantageous to
> short circuit that with block -> mr, especially in hot (performance
> critical) paths.

In my v3, I don't have a need for block to mr, but it is a good idea to use block private for this purpose.

Oak

> 
> Matt
> 
> > > - When we need migrate_pfns we loop over buddy blocks populating
> > > migrate.dst
> > >
> > > Also I noticed the drm_buddy_* calls in this file are not protected by a
> > > lock, we will need that. Currently it is tile->mem.vram_mgr->lock in the
> > > VRAM mgr code, we either need to reach into there or move this lock to
> > > common place so the VRAM manager and block allocations for SVM don't
> > > race with each other.
> >
> > Ok, will add this lock. Let's keep the lock in vram_mgr->lock for now for
> simplicity
> >
> > Oak
> >
> >
> > >
> > > Matt
> > >
> > > >
> > > >  static vm_fault_t xe_devm_migrate_to_ram(struct vm_fault *vmf)
> > > >  {
> > > >  	return 0;
> > > >  }
> > > >
> > > > -static void xe_devm_page_free(struct page *page)
> > > > +static u64 block_offset_to_pfn(struct xe_mem_region *mr, u64 offset)
> > > > +{
> > > > +	/** DRM buddy's block offset is 0-based*/
> > > > +	offset += mr->hpa_base;
> > > > +
> > > > +	return PHYS_PFN(offset);
> > > > +}
> > > > +
> > > > +/** FIXME: we locked page by calling zone_device_page_init
> > > > + *  in xe_devm_alloc_pages. Should we unlock pages here?
> > > > + */
> > > > +static void free_block(struct drm_buddy_block *block)
> > > > +{
> > > > +	struct xe_svm_block_meta *meta =
> > > > +		(struct xe_svm_block_meta *)block->private;
> > > > +	struct xe_tile *tile  = meta->tile;
> > > > +	struct drm_buddy *mm = &tile->mem.vram_mgr->mm;
> > > > +
> > > > +	kfree(block->private);
> > > > +	drm_buddy_free_block(mm, block);
> > > > +}
> > > > +
> > > > +void xe_devm_page_free(struct page *page)
> > > > +{
> > > > +	struct drm_buddy_block *block =
> > > > +					(struct drm_buddy_block *)page-
> > > >zone_device_data;
> > > > +	struct xe_svm_block_meta *meta =
> > > > +					(struct xe_svm_block_meta *)block-
> > > >private;
> > > > +	struct xe_tile *tile  = meta->tile;
> > > > +	struct xe_mem_region *mr = &tile->mem.vram;
> > > > +	struct drm_buddy *mm = &tile->mem.vram_mgr->mm;
> > > > +	u64 size = drm_buddy_block_size(mm, block);
> > > > +	u64 pages_per_block = size >> PAGE_SHIFT;
> > > > +	u64 block_pfn_first =
> > > > +					block_offset_to_pfn(mr,
> > > drm_buddy_block_offset(block));
> > > > +	u64 page_pfn = page_to_pfn(page);
> > > > +	u64 i = page_pfn - block_pfn_first;
> > > > +
> > > > +	xe_assert(tile->xe, i < pages_per_block);
> > > > +	clear_bit(i, meta->bitmap);
> > > > +	if (bitmap_empty(meta->bitmap, pages_per_block))
> > > > +		free_block(block);
> > > > +}
> > > > +
> > > > +/**
> > > > + * xe_devm_alloc_pages() - allocate device pages from buddy
> allocator
> > > > + *
> > > > + * @xe_tile: which tile to allocate device memory from
> > > > + * @npages: how many pages to allocate
> > > > + * @blocks: used to return the allocated blocks
> > > > + * @pfn: used to return the pfn of all allocated pages. Must be big
> enough
> > > > + * to hold at least @npages entries.
> > > > + *
> > > > + * This function allocates blocks of memory from drm buddy allocator,
> and
> > > > + * performs initialization work: set struct page::zone_device_data to
> point
> > > > + * to the memory block; set/initialize drm_buddy_block::private field;
> > > > + * lock_page for each page allocated; add memory block to lru
> managers
> > > lru
> > > > + * list - this is TBD.
> > > > + *
> > > > + * return: 0 on success
> > > > + * error code otherwise
> > > > + */
> > > > +int xe_devm_alloc_pages(struct xe_tile *tile,
> > > > +						unsigned long npages,
> > > > +						struct list_head *blocks,
> > > > +						unsigned long *pfn)
> > > > +{
> > > > +	struct drm_buddy *mm = &tile->mem.vram_mgr->mm;
> > > > +	struct drm_buddy_block *block, *tmp;
> > > > +	u64 size = npages << PAGE_SHIFT;
> > > > +	int ret = 0, i, j = 0;
> > > > +
> > > > +	ret = drm_buddy_alloc_blocks(mm, 0, mm->size, size, PAGE_SIZE,
> > > > +						blocks,
> > > DRM_BUDDY_TOPDOWN_ALLOCATION);
> > > > +
> > > > +	if (unlikely(ret))
> > > > +		return ret;
> > > > +
> > > > +	list_for_each_entry_safe(block, tmp, blocks, link) {
> > > > +		struct xe_mem_region *mr = &tile->mem.vram;
> > > > +		u64 block_pfn_first, pages_per_block;
> > > > +		struct xe_svm_block_meta *meta;
> > > > +		u32 meta_size;
> > > > +
> > > > +		size = drm_buddy_block_size(mm, block);
> > > > +		pages_per_block = size >> PAGE_SHIFT;
> > > > +		meta_size = BITS_TO_BYTES(pages_per_block) +
> > > > +					sizeof(struct xe_svm_block_meta);
> > > > +		meta = kzalloc(meta_size, GFP_KERNEL);
> > > > +		bitmap_fill(meta->bitmap, pages_per_block);
> > > > +		meta->tile = tile;
> > > > +		block->private = meta;
> > > > +		block_pfn_first =
> > > > +					block_offset_to_pfn(mr,
> > > drm_buddy_block_offset(block));
> > > > +		for(i = 0; i < pages_per_block; i++) {
> > > > +			struct page *page;
> > > > +
> > > > +			pfn[j++] = block_pfn_first + i;
> > > > +			page = pfn_to_page(block_pfn_first + i);
> > > > +			/**Lock page per hmm requirement, see hmm.rst.*/
> > > > +			zone_device_page_init(page);
> > > > +			page->zone_device_data = block;
> > > > +		}
> > > > +	}
> > > > +
> > > > +	return ret;
> > > > +}
> > > > +
> > > > +/**
> > > > + * xe_devm_free_blocks() - free all memory blocks
> > > > + *
> > > > + * @blocks: memory blocks list head
> > > > + */
> > > > +void xe_devm_free_blocks(struct list_head *blocks)
> > > >  {
> > > > +	struct drm_buddy_block *block, *tmp;
> > > > +
> > > > +	list_for_each_entry_safe(block, tmp, blocks, link)
> > > > +		free_block(block);
> > > >  }
> > > >
> > > >  static const struct dev_pagemap_ops xe_devm_pagemap_ops = {
> > > > --
> > > > 2.26.3
> > > >

^ permalink raw reply	[flat|nested] 72+ messages in thread

* Re: [v2 22/31] drm/xe/svm: implement functions to allocate and free device memory
  2024-06-06  3:30         ` Zeng, Oak
@ 2024-06-06  4:44           ` Matthew Brost
  0 siblings, 0 replies; 72+ messages in thread
From: Matthew Brost @ 2024-06-06  4:44 UTC (permalink / raw)
  To: Zeng, Oak
  Cc: intel-xe@lists.freedesktop.org, Ghimiray, Himal Prasad,
	Bommu, Krishnaiah, Thomas.Hellstrom@linux.intel.com, Welty, Brian,
	Auld, Matthew

On Wed, Jun 05, 2024 at 09:30:01PM -0600, Zeng, Oak wrote:
> Hi Matt,
> 
> > -----Original Message-----
> > From: Brost, Matthew <matthew.brost@intel.com>
> > Sent: Wednesday, June 5, 2024 7:37 PM
> > To: Zeng, Oak <oak.zeng@intel.com>
> > Cc: intel-xe@lists.freedesktop.org; Ghimiray, Himal Prasad
> > <himal.prasad.ghimiray@intel.com>; Bommu, Krishnaiah
> > <krishnaiah.bommu@intel.com>; Thomas.Hellstrom@linux.intel.com; Welty,
> > Brian <brian.welty@intel.com>
> > Subject: Re: [v2 22/31] drm/xe/svm: implement functions to allocate and
> > free device memory
> > 
> > On Wed, Jun 05, 2024 at 04:16:32PM -0600, Zeng, Oak wrote:
> > > Hi Matt,
> > >
> > > > -----Original Message-----
> > > > From: Brost, Matthew <matthew.brost@intel.com>
> > > > Sent: Wednesday, April 10, 2024 6:24 PM
> > > > To: Zeng, Oak <oak.zeng@intel.com>
> > > > Cc: intel-xe@lists.freedesktop.org; Ghimiray, Himal Prasad
> > > > <himal.prasad.ghimiray@intel.com>; Bommu, Krishnaiah
> > > > <krishnaiah.bommu@intel.com>; Thomas.Hellstrom@linux.intel.com;
> > Welty,
> > > > Brian <brian.welty@intel.com>
> > > > Subject: Re: [v2 22/31] drm/xe/svm: implement functions to allocate and
> > > > free device memory
> > > >
> > > > On Tue, Apr 09, 2024 at 04:17:33PM -0400, Oak Zeng wrote:
> > > > > Function xe_devm_alloc_pages allocates pages from drm buddy and
> > > > performs
> > > > > housekeeping work for all the pages allocated, such as getting a page
> > > > > refcount, keeping a bitmap of all pages to denote whether a page is in
> > > > > use, and putting pages on a drm lru list for eviction purposes.
> > > > >
> > > > > Function xe_devm_free_blocks returns a list of memory blocks to drm
> > buddy
> > > > > allocator.
> > > > >
> > > > > Function xe_devm_free_page is a callback function from the hmm layer. It
> > > > > is called whenever a page's refcount reaches 1. This function clears
> > > > > the bit of this page in the bitmap. If all the bits in the bitmap are
> > > > > cleared, it means all the pages have been freed, and we return all the pages
> > > > > in this memory block back to drm buddy.
> > > > >
> > > > > Signed-off-by: Oak Zeng <oak.zeng@intel.com>
> > > > > Co-developed-by: Niranjana Vishwanathapura
> > > > <niranjana.vishwanathapura@intel.com>
> > > > > Signed-off-by: Niranjana Vishwanathapura
> > > > <niranjana.vishwanathapura@intel.com>
> > > > > Cc: Matthew Brost <matthew.brost@intel.com>
> > > > > Cc: Thomas Hellström <thomas.hellstrom@intel.com>
> > > > > Cc: Brian Welty <brian.welty@intel.com>
> > > > > ---
> > > > >  drivers/gpu/drm/xe/xe_svm.h        |   7 ++
> > > > >  drivers/gpu/drm/xe/xe_svm_devmem.c | 147
> > > > ++++++++++++++++++++++++++++-
> > > >
> > > > See comments about file in previous patches, they apply here too.
> > > >
> > > > >  2 files changed, 152 insertions(+), 2 deletions(-)
> > > > >
> > > > > diff --git a/drivers/gpu/drm/xe/xe_svm.h
> > > > b/drivers/gpu/drm/xe/xe_svm.h
> > > > > index 624c1581f8ba..92a3ee90d5a7 100644
> > > > > --- a/drivers/gpu/drm/xe/xe_svm.h
> > > > > +++ b/drivers/gpu/drm/xe/xe_svm.h
> > > > > @@ -46,4 +46,11 @@ static inline struct xe_mem_region
> > > > *xe_page_to_mem_region(struct page *page)
> > > > >  	return container_of(page->pgmap, struct xe_mem_region,
> > > > pagemap);
> > > > >  }
> > > > >
> > > > > +int xe_devm_alloc_pages(struct xe_tile *tile,
> > > > > +						unsigned long npages,
> > > > > +						struct list_head *blocks,
> > > > > +						unsigned long *pfn);
> > > > > +
> > > > > +void xe_devm_free_blocks(struct list_head *blocks);
> > > > > +void xe_devm_page_free(struct page *page);
> > > > >  #endif
> > > > > diff --git a/drivers/gpu/drm/xe/xe_svm_devmem.c
> > > > b/drivers/gpu/drm/xe/xe_svm_devmem.c
> > > > > index 31af56e8285a..5ba0cd9a70b0 100644
> > > > > --- a/drivers/gpu/drm/xe/xe_svm_devmem.c
> > > > > +++ b/drivers/gpu/drm/xe/xe_svm_devmem.c
> > > > > @@ -5,18 +5,161 @@
> > > > >
> > > > >  #include <linux/mm_types.h>
> > > > >  #include <linux/sched/mm.h>
> > > > > -
> > > > > +#include <linux/gfp.h>
> > > > > +#include <linux/migrate.h>
> > > > > +#include <linux/dma-mapping.h>
> > > > > +#include <linux/dma-fence.h>
> > > > > +#include <linux/bitops.h>
> > > > > +#include <linux/bitmap.h>
> > > > > +#include <drm/drm_buddy.h>
> > > > >  #include "xe_device_types.h"
> > > > >  #include "xe_svm.h"
> > > > > +#include "xe_migrate.h"
> > > > > +#include "xe_ttm_vram_mgr_types.h"
> > > > > +#include "xe_assert.h"
> > > > >
> > > > > +/**
> > > > > + * struct xe_svm_block_meta - svm uses this data structure to manage
> > > > each
> > > > > + * block allocated from drm buddy. This will be set to the
> > > > drm_buddy_block's
> > > > > + * private field.
> > > > > + *
> > > > > + * @lru: used to link this block to drm's lru lists. This will be replaced
> > > > > + * with struct drm_lru_entity later.
> > > > > + * @tile: tile from which we allocated this block
> > > > > + * @bitmap: A bitmap of each page in this block. 1 means this page is
> > used,
> > > > > + * 0 means this page is idle. When all bits of this block are 0, it is time
> > > > > + * to return this block to drm buddy subsystem.
> > > > > + */
> > > > > +struct xe_svm_block_meta {
> > > > > +	struct list_head lru;
> > > > > +	struct xe_tile *tile;
> > > > > +	unsigned long bitmap[];
> > > > > +};
> > > >
> > > > This looks not needed to me but admittedly haven't looked at the LRU
> > stuff.
> > >
> > > I am moving to page granularity memory eviction, so we can either use the
> > lru in the struct page itself, or I will have to introduce some other data
> > structure which has an lru.
> > >
> > 
> > You almost certainly cannot use struct page, pretty sure that will not
> > be well received putting a subsystem memory management feature into the
> > core memory management.
> 
> The reason I was thinking of struct page's lru is that the lru field is not used right now for device pages; it is only used for system pages. So I don't see a conflict there. It is just a member not used by core mm for device pages, and we can make use of it if we want.
> 

If the core implementation changes, then you are broken. If it's not an
explicitly provided private field to the upper layers, then no, I don't
think you should just use it. At least, that's my understanding.

> But it is very possible that we don't use struct page's lru. In this series https://patchwork.freedesktop.org/patch/565501/?series=125879&rev=1, I have some drm_lru_entity concept. I need to go back to that series to work out the page granularity vram eviction part.
>

I've seen that series and it raises some pretty serious concerns which I
won't get into here.
 
> 
> > 
> > I'm going to say this again, I disagree with this design decession as I
> > do not think using a BO / or migration + eviction at allocation
> > grainularity should be dismissed. Using a BO offers eviction more or
> > less for free and possibly dma-buf reuse for multi-GPU. Please study my
> > PoC [1]. It has SVM full featured and largely working.
> 
> It is completely fine that people disagree with a design. Removing BO from system allocator was aligned with DRM community long time ago. The main problem of BO is the memory eviction and migration is not page granularity.
>

Of course, it's okay to disagree. This is about how you're responding.
I'm not buying, or even hearing an argument about how using a buddy
allocation solves the page granularity problem. Simply stating something
as a fact without providing examples of how it solves the problem means
nothing. This is a continued pattern of behavior which I interpret as
feedback being ignored.

Furthermore, I don't believe the VRAM backing actually has anything to
do with page migration. I see this as a core MM and DRM layer issue
rather than a decision made at the driver level regarding the VRAM
backing store. I'll elaborate on this below.

My concern here is that a design decision is being made without valid
reasoning. If I hear valid reasoning, of course, I'm open to other
ideas.

I wasn't involved in the discussions a long time ago; I am involved now and
have an opinion on a Xe level design choice.

> I know in v2 I am not truly page granularity either. But I am moving to true page granularity in v3: memory allocation, free, migration and eviction are all at page granularity. It is very similar to the scheme Linux core mm has. There is much more work at the buddy allocator and eviction lru, compared to the BO approach.
> 
> Please also note, when I say page granularity, it could be one page, but it can also be multiple pages.
> 
> 
> > 
> > If you have a different design, great. But I'd expect the next post to
> > have feature parity, thorough testing, and well thought out design
> > choices with explanations of said design choices beyond someone in the
> > community said something so I am doing it this way (i.e. deep thought
> > and understanding of how all the pieces fit together and why this design
> > was chosen).
> 
> I understand feature parity and testing are important. The previous posts (v1 and v2) were mainly to get some feedback on the design. Since we are at the 3rd respin of this series, it is probably a good idea to do more testing before posting.
> 
> 
> > 
> > [1] https://gitlab.freedesktop.org/mbrost/xe-kernel-driver-svm-post/-
> > /tree/post?ref_type=heads
> > 
> > > Yes, I removed block_meta in v3.
> > >
> > 
> > I don't like speculation, said this many times. Let's see what rev3
> > looks like and will review then.
> > 
> > > >
> > > > I am thinking roughly...
> > > >
> > > > - I think we drop all this special tracking (kill xe_svm_block_meta)
> > > Agreed.
> > >
> > > > - Have functions to allocate / free the buddy blocks, store buddy blocks in
> > > > userptr
> > >
> > > Why do we need to store buddy blocks in userptr? If you do this, you would
> > need another block_list in userptr.
> > >
> > > I currently use the struct page's zone_device_data to point to buddy_block.
> > It seems to work for me.
> > >
> > 
> > It doesn't. See my working PoC [1] zdd structure, almost certainly
> > something like that will be needed for a variety of reasons. Also if you
> > allocate a buddy block then it really isn't page granularity nor does
> > offer an advantage over a BO wrt page granularity. I stated this multiple
> > times in lengthy explanations and have not heard a reasonable argument
> > against this.
> 
> It is true that the drm buddy interface is not page granularity. But this should mainly be looked at as a drawback of the buddy interface. If you look at the core mm buddy allocator, the interface is page-centric. We are working on the drm buddy allocator to improve it in this respect.
> 

Let's put this another way: installing buddy_block to a zone_device_data
doesn't work from a layering perspective. A DRM SVM layer cannot enforce
a driver to use a buddy allocation for the VRAM backing — the VRAM backing
needs to be opaque, with the driver making a choice of the backing store
(BO, buddy allocation, or potentially something else).

Again, this is why in my PoC the VRAM allocation is a void * and a
driver vfunc to signal release of the memory. I'm not saying I have this
100% correct; I probably don't. But conceptually, the VRAM store needs
to be transparent to the DRM SVM layer. Any design that requires a
specific type of backing store at the DRM level is not correct.

To give an example of this, neither a GEM BO nor a TTM BO is aware of
the backing store. This is why TTM has memory managers which do the
actual allocation/freeing of memory. Granted, it provides a TT manager
for driver use for system memory, but that is just a reference
implementation too — a driver can override it (at least that is my
understanding). For VRAM allocations, the driver owns this. See
xe_ttm_vram_mgr.c, which happens to be built on top of the buddy
allocator — this is not a requirement.

> > 
> > > > - Blocks are allocated before migration to VRAM
> > >
> > > Agreed
> > >
> > > > - Blocks can be freed on either CPU fault after migration or on VMA
> > > >   destroy (probably depends on madvive hints for VMA where we free
> > > >   blocks)
> > >
> > > My current understanding is, once device pages are allocated and handed
> > over to core mm/hmm, the driver doesn't need to worry about the life cycle of
> > device pages, i.e., core mm/hmm will take care of it by calling back to the
> > page_free vfunc.
> > >
> > 
> > Yes, we definitely need the page_free vfunc callback implemented.
> > This was a misunderstanding on my part; certainly you can make it work
> > without this, but it is not correct.
> > 
> > > > - Blocks allocated / freed at a chunk (xe_vma in this code) granularity
> > > >   (conceptually the same if we switch to 1 to N ratio between xe_vma &
> > > >   pt_state)
> > >
> > > As said, I am moving to page granularity, moving away from xe_vma
> > granularity, to address concerns from the drm community discussion.
> > >
> > 
> > That's not what the community is saying, or at least not how I
> > interpret it. The community is saying on a fault, allocate at the page
> > granularity of the CPU mapping. e.g. If the CPU mapping is 4k, allocate
> > 4k. If the CPU mapping is 2M, allocate 2M. 
> 
> I guess by CPU mapping you meant CPU vma.
>

No. This would be the order size returned in the pfn when
hmm_range_fault is called. This is page-level granularity.
 
> I think we can create a mmu notifier to match the range of the cpu vma.
>

That has nothing to do with page granularity. We should be able to
create arbitrarily sized notifiers which span many pages / cpu vmas / gpu
mappings. Your design likely should support that, per feedback from Jason,
which I also happen to agree with.

> Regarding how much gpu memory is allocated and migrated, my current thinking is that this can be decided by a memory hint attribute, i.e., if the CPU vma is 16M but the memory attribute says 2M, we allocate/migrate 2M; if the memory attribute says 32M, we migrate 16M.
> 

By definition, this is not page granularity. If you want to migrate more
CPU pages upon GPU fault (which we might), that is fine, but it needs to
be done in a way that ensures allocations match CPU page sizes. So, upon
CPU fault or MMU notifier event, we are operating on the CPU page size.
Hopefully, a user is smart enough to use THP allocations in a
performance-critical app. This is my view of page granularity. Maybe I
am way off here, but I really don't think so.

I'm not arguing that page granularity is a must; in fact, I didn't
implement it in my PoC. I'm saying that if we did, this is how it would
look.

> So I don't understand why you want to allocate the same size as CPU vma...
>

I'm not saying that. See above.

> 
> > There is quite a bit of core MM
> > work that would need to be done for this to work properly though, namely
> > that the migration layer splits THP into smaller pages. I think that should be
> > fixed (e.g. teach core MM to understand device private pages that are
> > larger than 4k, likewise coherent pages). Conceptually I think this is
> > the right approach, but it would take quite a bit of work. With that, if
> > we layer the code properly, only the DRM layer needs to be aware of this.
> > e.g. We could build SVM without this, like I do in my PoC, and then switch
> > over to this model once the core MM work is done, mostly only
> > modifying the DRM layer. This would eliminate the partial unmapping /
> > invalidation scenarios which my PoC implements and provides
> > documentation for.
> 
> I don't quite follow here. Afaik, THP is only enabled in linux core mm, managing system memory. I am not aware of it in xekmd. By THP do you mean THP in xekmd? Or a huge page (compound page, 2M or 1G, etc.) backing a cpu vma in linux core mm? 
> 

Page granularity would be matching the CPU page size in migrations and
not splitting THP upon migration.

With all of the above considered, the VRAM backing actually has nothing
to do with page granularity. That is, the allocation size of the backing
store is the key; the implementation of the backing store is immaterial
here.

I concede that using a BO for every 4k allocation will be quite
wasteful, but a buddy allocation wouldn't be all that much better
either. Really, it is up to the user to ensure 2M (THP) allocations are
used if supporting page granularity. If we can't trust the user to do
so, then don't support page granularity and create arbitrary sizes based
on the CPU VMA, as I do in my PoC.

Matt

> 
> > 
> > Below general comment:
> > If we post something that we view as good and working, community
> > feedback can be pushed back on with a justifiable design. Blindly doing
> > things without understanding the larger issues (e.g. "the community said
> > do it this way") is not a great way to work. Nor is posting untested
> > code with a partial implementation.
> > 
> > > > - block->private == memory region so we can get pfn from block
> > >
> > > In v3 code, block->private is not used. Will use it if needed.
> > >
> > > Each struct page has a pgmap pointer which points to xe_mem_region's
> > pgmap memory. We can use this information to get a page/pfn's memory
> > region.
> > >
> > 
> > That probably works, but there may be scenarios where you need to go from
> > block to mr in which the device pages are not readily available. You
> > can always go from block -> pages -> mr, but it may be advantageous to
> > short-circuit that with block -> mr, especially in hot (performance
> > critical) paths.
> 
> In my v3, I don't have a need for block to mr, but it is a good idea to use block private for this purpose.
> 
> Oak
> 
> > 
> > Matt
> > 
> > > > - When we need migrate_pfns we loop over buddy blocks populating
> > > > migrate.dst
> > > >
> > > > Also I noticed the drm_buddy_* calls in this file are not protected by a
> > > > lock, we will need that. Currently it is tile->mem.vram_mgr->lock in the
> > > > VRAM mgr code, we either need to reach into there or move this lock to
> > > > common place so the VRAM manager and block allocations for SVM don't
> > > > race with each other.
> > >
> > > Ok, will add this lock. Let's keep the lock in vram_mgr->lock for now for
> > simplicity.
> > >
> > > Oak
> > >
> > >
> > > >
> > > > Matt
> > > >
> > > > >
> > > > >  static vm_fault_t xe_devm_migrate_to_ram(struct vm_fault *vmf)
> > > > >  {
> > > > >  	return 0;
> > > > >  }
> > > > >
> > > > > -static void xe_devm_page_free(struct page *page)
> > > > > +static u64 block_offset_to_pfn(struct xe_mem_region *mr, u64 offset)
> > > > > +{
> > > > > +	/** DRM buddy's block offset is 0-based*/
> > > > > +	offset += mr->hpa_base;
> > > > > +
> > > > > +	return PHYS_PFN(offset);
> > > > > +}
> > > > > +
> > > > > +/** FIXME: we locked page by calling zone_device_page_init
> > > > > + *  in xe_devm_alloc_pages. Should we unlock pages here?
> > > > > + */
> > > > > +static void free_block(struct drm_buddy_block *block)
> > > > > +{
> > > > > +	struct xe_svm_block_meta *meta =
> > > > > +		(struct xe_svm_block_meta *)block->private;
> > > > > +	struct xe_tile *tile  = meta->tile;
> > > > > +	struct drm_buddy *mm = &tile->mem.vram_mgr->mm;
> > > > > +
> > > > > +	kfree(block->private);
> > > > > +	drm_buddy_free_block(mm, block);
> > > > > +}
> > > > > +
> > > > > +void xe_devm_page_free(struct page *page)
> > > > > +{
> > > > > +	struct drm_buddy_block *block =
> > > > > +		(struct drm_buddy_block *)page->zone_device_data;
> > > > > +	struct xe_svm_block_meta *meta =
> > > > > +		(struct xe_svm_block_meta *)block->private;
> > > > > +	struct xe_tile *tile  = meta->tile;
> > > > > +	struct xe_mem_region *mr = &tile->mem.vram;
> > > > > +	struct drm_buddy *mm = &tile->mem.vram_mgr->mm;
> > > > > +	u64 size = drm_buddy_block_size(mm, block);
> > > > > +	u64 pages_per_block = size >> PAGE_SHIFT;
> > > > > +	u64 block_pfn_first =
> > > > > +		block_offset_to_pfn(mr, drm_buddy_block_offset(block));
> > > > > +	u64 page_pfn = page_to_pfn(page);
> > > > > +	u64 i = page_pfn - block_pfn_first;
> > > > > +
> > > > > +	xe_assert(tile->xe, i < pages_per_block);
> > > > > +	clear_bit(i, meta->bitmap);
> > > > > +	if (bitmap_empty(meta->bitmap, pages_per_block))
> > > > > +		free_block(block);
> > > > > +}
> > > > > +
> > > > > +/**
> > > > > + * xe_devm_alloc_pages() - allocate device pages from buddy
> > allocator
> > > > > + *
> > > > > + * @xe_tile: which tile to allocate device memory from
> > > > > + * @npages: how many pages to allocate
> > > > > + * @blocks: used to return the allocated blocks
> > > > > + * @pfn: used to return the pfn of all allocated pages. Must be big
> > > > > + * enough to hold at least @npages entries.
> > > > > + *
> > > > > + * This function allocates blocks of memory from the drm buddy
> > > > > + * allocator, and performs initialization work: set struct
> > > > > + * page::zone_device_data to point to the memory block; set/initialize
> > > > > + * the drm_buddy_block::private field; lock_page for each page
> > > > > + * allocated; add the memory block to the lru manager's lru list -
> > > > > + * this is TBD.
> > > > > + *
> > > > > + * return: 0 on success
> > > > > + * error code otherwise
> > > > > + */
> > > > > +int xe_devm_alloc_pages(struct xe_tile *tile,
> > > > > +						unsigned long npages,
> > > > > +						struct list_head *blocks,
> > > > > +						unsigned long *pfn)
> > > > > +{
> > > > > +	struct drm_buddy *mm = &tile->mem.vram_mgr->mm;
> > > > > +	struct drm_buddy_block *block, *tmp;
> > > > > +	u64 size = npages << PAGE_SHIFT;
> > > > > +	int ret = 0, i, j = 0;
> > > > > +
> > > > > +	ret = drm_buddy_alloc_blocks(mm, 0, mm->size, size, PAGE_SIZE,
> > > > > +				     blocks, DRM_BUDDY_TOPDOWN_ALLOCATION);
> > > > > +
> > > > > +	if (unlikely(ret))
> > > > > +		return ret;
> > > > > +
> > > > > +	list_for_each_entry_safe(block, tmp, blocks, link) {
> > > > > +		struct xe_mem_region *mr = &tile->mem.vram;
> > > > > +		u64 block_pfn_first, pages_per_block;
> > > > > +		struct xe_svm_block_meta *meta;
> > > > > +		u32 meta_size;
> > > > > +
> > > > > +		size = drm_buddy_block_size(mm, block);
> > > > > +		pages_per_block = size >> PAGE_SHIFT;
> > > > > +		meta_size = BITS_TO_BYTES(pages_per_block) +
> > > > > +					sizeof(struct xe_svm_block_meta);
> > > > > +		meta = kzalloc(meta_size, GFP_KERNEL);
> > > > > +		bitmap_fill(meta->bitmap, pages_per_block);
> > > > > +		meta->tile = tile;
> > > > > +		block->private = meta;
> > > > > +		block_pfn_first =
> > > > > +			block_offset_to_pfn(mr, drm_buddy_block_offset(block));
> > > > > +		for(i = 0; i < pages_per_block; i++) {
> > > > > +			struct page *page;
> > > > > +
> > > > > +			pfn[j++] = block_pfn_first + i;
> > > > > +			page = pfn_to_page(block_pfn_first + i);
> > > > > +			/**Lock page per hmm requirement, see hmm.rst.*/
> > > > > +			zone_device_page_init(page);
> > > > > +			page->zone_device_data = block;
> > > > > +		}
> > > > > +	}
> > > > > +
> > > > > +	return ret;
> > > > > +}
> > > > > +
> > > > > +/**
> > > > > + * xe_devm_free_blocks() - free all memory blocks
> > > > > + *
> > > > > + * @blocks: memory blocks list head
> > > > > + */
> > > > > +void xe_devm_free_blocks(struct list_head *blocks)
> > > > >  {
> > > > > +	struct drm_buddy_block *block, *tmp;
> > > > > +
> > > > > +	list_for_each_entry_safe(block, tmp, blocks, link)
> > > > > +		free_block(block);
> > > > >  }
> > > > >
> > > > >  static const struct dev_pagemap_ops xe_devm_pagemap_ops = {
> > > > > --
> > > > > 2.26.3
> > > > >

^ permalink raw reply	[flat|nested] 72+ messages in thread

* RE: [v2 27/31] drm/xe/svm: Handle CPU page fault
  2024-04-11  2:07   ` Matthew Brost
  2024-04-12 17:24     ` Zeng, Oak
@ 2024-06-07  4:30     ` Zeng, Oak
  1 sibling, 0 replies; 72+ messages in thread
From: Zeng, Oak @ 2024-06-07  4:30 UTC (permalink / raw)
  To: Brost, Matthew
  Cc: intel-xe@lists.freedesktop.org, Ghimiray, Himal Prasad,
	Bommu, Krishnaiah, Thomas.Hellstrom@linux.intel.com, Welty, Brian



> -----Original Message-----
> From: Brost, Matthew <matthew.brost@intel.com>
> Sent: Wednesday, April 10, 2024 10:07 PM
> To: Zeng, Oak <oak.zeng@intel.com>
> Cc: intel-xe@lists.freedesktop.org; Ghimiray, Himal Prasad
> <himal.prasad.ghimiray@intel.com>; Bommu, Krishnaiah
> <krishnaiah.bommu@intel.com>; Thomas.Hellstrom@linux.intel.com; Welty,
> Brian <brian.welty@intel.com>
> Subject: Re: [v2 27/31] drm/xe/svm: Handle CPU page fault
> 
> On Tue, Apr 09, 2024 at 04:17:38PM -0400, Oak Zeng wrote:
> > Under the picture of svm, CPU and GPU program share one same
> > virtual address space. The backing store of this virtual address
> > space can be either in system memory or device memory. Since GPU
> > device memory is remaped as DEVICE_PRIVATE, CPU can't access it.
> > Any CPU access to device memory causes a page fault. Implement
> > a page fault handler to migrate memory back to system memory and
> > map it to CPU page table so the CPU program can proceed.
> >
> > Also unbind this page from GPU side, and free the original GPU
> > device page
> >
> > Signed-off-by: Oak Zeng <oak.zeng@intel.com>
> > Co-developed-by: Niranjana Vishwanathapura
> <niranjana.vishwanathapura@intel.com>
> > Signed-off-by: Niranjana Vishwanathapura
> <niranjana.vishwanathapura@intel.com>
> > Cc: Matthew Brost <matthew.brost@intel.com>
> > Cc: Thomas Hellström <thomas.hellstrom@intel.com>
> > Cc: Brian Welty <brian.welty@intel.com>
> > ---
> >  drivers/gpu/drm/xe/Makefile         |   1 +
> >  drivers/gpu/drm/xe/xe_svm.h         |   8 +-
> >  drivers/gpu/drm/xe/xe_svm_devmem.c  |   7 +-
> >  drivers/gpu/drm/xe/xe_svm_migrate.c | 222
> ++++++++++++++++++++++++++++
> >  4 files changed, 230 insertions(+), 8 deletions(-)
> >  create mode 100644 drivers/gpu/drm/xe/xe_svm_migrate.c
> >
> > diff --git a/drivers/gpu/drm/xe/Makefile b/drivers/gpu/drm/xe/Makefile
> > index f89d77b6d654..65289acdd563 100644
> > --- a/drivers/gpu/drm/xe/Makefile
> > +++ b/drivers/gpu/drm/xe/Makefile
> > @@ -131,6 +131,7 @@ xe-y += xe_bb.o \
> >  	xe_step.o \
> >  	xe_svm.o \
> >  	xe_svm_devmem.o \
> > +	xe_svm_migrate.o \
> 
> See comments about file org, same thing applies here. Let's put all of
> the svm implementation in a single file.

Did this in v3.

> 
> >  	xe_sync.o \
> >  	xe_tile.o \
> >  	xe_tile_sysfs.o \
> > diff --git a/drivers/gpu/drm/xe/xe_svm.h
> b/drivers/gpu/drm/xe/xe_svm.h
> > index f601dffe3fc1..c9e4239c44b4 100644
> > --- a/drivers/gpu/drm/xe/xe_svm.h
> > +++ b/drivers/gpu/drm/xe/xe_svm.h
> > @@ -7,11 +7,11 @@
> >  #define __XE_SVM_H
> >
> >  #include <linux/mm_types.h>
> > +#include <linux/mm.h>
> >  #include "xe_device_types.h"
> >  #include "xe_device.h"
> >  #include "xe_assert.h"
> > -
> > -struct xe_vm;
> > +#include "xe_vm_types.h"
> >
> >  /**
> >   * struct xe_svm - data structure to represent a shared
> > @@ -31,6 +31,9 @@ struct xe_svm {
> >  	struct list_head vm_list;
> >  };
> >
> > +#define xe_svm_for_each_vm(svm, vm)				\
> > +		list_for_each_entry(vm, &svm->vm_list, svm_link)
> > +
> 
> Don't think this is need, see below.
> 
> >  extern struct xe_svm *xe_create_svm(void);
> >  void xe_destroy_svm(struct xe_svm *svm);
> >  extern struct xe_svm *xe_lookup_svm_by_mm(struct mm_struct *mm);
> > @@ -79,4 +82,5 @@ int xe_devm_alloc_pages(struct xe_tile *tile,
> >
> >  void xe_devm_free_blocks(struct list_head *blocks);
> >  void xe_devm_page_free(struct page *page);
> > +vm_fault_t xe_svm_migrate_to_sram(struct vm_fault *vmf);
> >  #endif
> > diff --git a/drivers/gpu/drm/xe/xe_svm_devmem.c
> b/drivers/gpu/drm/xe/xe_svm_devmem.c
> > index 088ac209ad80..32ada458f1dd 100644
> > --- a/drivers/gpu/drm/xe/xe_svm_devmem.c
> > +++ b/drivers/gpu/drm/xe/xe_svm_devmem.c
> > @@ -37,11 +37,6 @@ struct xe_svm_block_meta {
> >  	unsigned long bitmap[];
> >  };
> >
> > -static vm_fault_t xe_devm_migrate_to_ram(struct vm_fault *vmf)
> > -{
> > -	return 0;
> > -}
> > -
> >  static u64 block_offset_to_pfn(struct xe_mem_region *mr, u64 offset)
> >  {
> >  	/** DRM buddy's block offset is 0-based*/
> > @@ -168,7 +163,7 @@ void xe_devm_free_blocks(struct list_head *blocks)
> >
> >  static const struct dev_pagemap_ops xe_devm_pagemap_ops = {
> >  	.page_free = xe_devm_page_free,
> > -	.migrate_to_ram = xe_devm_migrate_to_ram,
> > +	.migrate_to_ram = xe_svm_migrate_to_sram,
> 
> Again single file so this will be static function, no reason to export
> this.

Agreed.

> 
> >  };
> >
> >  /**
> > diff --git a/drivers/gpu/drm/xe/xe_svm_migrate.c
> b/drivers/gpu/drm/xe/xe_svm_migrate.c
> > new file mode 100644
> > index 000000000000..0db831af098e
> > --- /dev/null
> > +++ b/drivers/gpu/drm/xe/xe_svm_migrate.c
> > @@ -0,0 +1,222 @@
> > +// SPDX-License-Identifier: MIT
> > +/*
> > + * Copyright © 2023 Intel Corporation
> > + */
> > +
> > +#include <linux/gfp.h>
> > +#include <linux/migrate.h>
> > +#include <linux/dma-mapping.h>
> > +#include <linux/dma-fence.h>
> > +#include <linux/bitops.h>
> > +#include <linux/bitmap.h>
> > +#include <linux/kernel.h>
> > +#include <linux/slab.h>
> > +#include <drm/drm_buddy.h>
> > +#include "xe_device_types.h"
> > +#include "xe_device.h"
> > +#include "xe_trace.h"
> > +#include "xe_migrate.h"
> > +#include "xe_ttm_vram_mgr_types.h"
> > +#include "xe_assert.h"
> > +#include "xe_pt.h"
> > +#include "xe_svm.h"
> > +#include "xe_vm.h"
> > +
> > +
> > +/**
> > + * alloc_host_page() - allocate one host page for the fault vma
> > + *
> > + * @dev: (GPU) device that will access the allocated page
> > + * @vma: the fault vma that we need allocate page for
> > + * @addr: the fault address. The allocated page is for this address
> > + * @dma_addr: used to output the dma address of the allocated page.
> > + * This dma address will be used for gpu to access this page. GPU
> > + * access host page through a dma mapped address.
> > + * @pfn: used to output the pfn of the allocated page.
> > + *
> > + * This function allocates one host page for the specified vma. It
> > + * also does some preparation work for GPU access to this page, such
> > + * as mapping this page into the iommu (by calling dma_map_page).
> > + *
> > + * When this function returns, the page is locked.
> > + *
> > + * Return struct page pointer when success
> > + * NULL otherwise
> > + */
> > +static struct page *alloc_host_page(struct device *dev,
> > +				    struct vm_area_struct *vma,
> > +				    unsigned long addr,
> > +				    dma_addr_t *dma_addr,
> > +				    unsigned long *pfn)
> 
> Weird alignment; also, I don't think we want to allocate a page at a
> time...
> 
> Beyond that, can't say I'm a fan of 2 arguments being returned and
> populated here either (dma_addr_t *dma_addr, unsigned long *pfn). I
> haven't seen a lot of that style of function in Linux.
> 
> Probably makes more sense to have a function which allocates pages,
> locks them, and populates the pfn array (migrate_pfn) rather than doing
> this a page at a time.


Agreed. In v3, I adopted Nvidia's new dma-mapping API, which also requires a two-step approach: 1) allocate pages, 2) dma-map them.

> 
> > +{
> > +	struct page *page;
> > +
> > +	page = alloc_page_vma(GFP_HIGHUSER, vma, addr);
> > +	if (unlikely(!page))
> > +		return NULL;
> > +
> > +	/**Lock page per hmm requirement, see hmm.rst*/
> > +	lock_page(page);
> > +	*dma_addr = dma_map_page(dev, page, 0, PAGE_SIZE, DMA_FROM_DEVICE);
> 
> The device is writing to these pages so I think DMA_BIDIRECTIONAL is
> needed, right? 

We are copying from device memory to system memory here. Yes device writes to those pages, but the direction parameter is about the copy direction, not about read or write. I think FROM_DEVICE is still correct.


> As mentioned above, I think this should be broken out into
> a different step too.


Agreed.

> 
> > +	if (unlikely(dma_mapping_error(dev, *dma_addr))) {
> > +		unlock_page(page);
> > +		__free_page(page);
> > +		return NULL;
> > +	}
> > +
> > +	*pfn = migrate_pfn(page_to_pfn(page));
> > +	return page;
> > +}
> > +
> > +static void free_host_page(struct page *page)
> > +{
> > +	unlock_page(page);
> > +	put_page(page);
> > +}
> > +
> > +/**
> > + * migrate_page_vram_to_ram() - migrate one page from vram to ram
> > + *
> > + * @vma: The vma that the page is mapped to
> > + * @addr: The virtual address that the page is mapped to
> > + * @src_pfn: src page's page frame number
> > + * @dst_pfn: used to return the destination page's (in system ram) pfn
> > + *
> > + * Allocate one page in system ram and copy memory from device memory
> > + * to system ram.
> > + *
> > + * Return: 0 if this page is already in sram (no need to migrate)
> > + * 1: successfully migrated this page from vram to sram.
> > + * error code otherwise
> > + */
> > +static int migrate_page_vram_to_ram(struct vm_area_struct *vma,
> > +				    unsigned long addr, unsigned long src_pfn,
> > +				    unsigned long *dst_pfn)
> > +{
> 
> We definitely don't want to copy 1 page at a time. I touch on this in [1].
> Basically, this is going to perform poorly unless we use larger copies; the
> migrate code supports non-contiguous sram addresses, and vram addresses
> will likely be contiguous due to the buddy allocator.

Totally agreed we need to use one blitter command to copy many pages. I was aware of this when I wrote this v2, and it is addressed in v3.

But also be aware, we might end up migrating multiple pages from system memory to device memory on each gpu page fault. But on a CPU page fault, we only want to migrate 1 page to cover the fault address. I know in this v2 I migrated the whole CPU VMA; I realized this is not correct after reading more core mm and hmm code. The main reason is that the core mm only programs one pte entry (covering 4k) per fault, even if we migrate multiple pages. See the logic in handle_pte_fault and do_swap_page. So I switched to a one-page scheme in v3. At least this is my current understanding.

Regarding vram contiguity, I think if we don't pass DRM_BUDDY_CONTIGUOUS_ALLOCATION, we can still run into non-contiguous vram. Even if the original allocation is contiguous, a portion of the contiguous vram can be migrated to system memory (and freed from the device side), then migrated back to vram. We end up with non-contiguous vram in that case. 

I am thinking of not using an identity mapping for the vram ppgtt mapping in this case, just like what we did for the system memory ppgtt mapping. This way we don't care whether vram is contiguous or not; a one-shot blitter command can do the work. Otherwise, we will still have to split the copy task per contiguous vram range. This is marked as FIXME in v3. What do you think? 


> 
> [1] https://patchwork.freedesktop.org/patch/588548/?series=132229&rev=1
> 
> > +	struct xe_mem_region *mr;
> > +	struct xe_tile *tile;
> > +	struct xe_device *xe;
> > +	struct device *dev;
> > +	dma_addr_t dma_addr = 0;
> > +	struct dma_fence *fence;
> > +	struct page *host_page;
> > +	struct page *src_page;
> > +	u64 src_dpa;
> > +
> > +	src_page = migrate_pfn_to_page(src_pfn);
> > +	if (unlikely(!src_page || !(src_pfn & MIGRATE_PFN_MIGRATE)))
> 
> I'm going to say this is a bug if !src_page ||
> !is_zone_device_page(src_page) || !(src_pfn & MIGRATE_PFN_MIGRATE),
> and we should return -EFAULT (or another error code if that makes more
> sense). We are migrating from the device, where we know we have backing
> store from the original fault.

Agreed under the context of this v2.

In v3, I am moving to a more page-centric scheme. Think of this scenario: a virtual address range, let's say 2M, is migrated to gpu device memory. Now the CPU accesses one page in the middle of this 2M range; this single 4k page will be migrated back to system memory. HMM also triggers an invalidation callback to the driver to invalidate this 4k range. Now the gpu accesses this 4k range. Let's say our GPU migration granularity is 2M (the concept of a chunk in your series) and the migration policy says we need to migrate back to device memory: migrate_vma_setup will be set up to cover a 2M range, but hmm will tell us there is only one 4k page in system memory that needs to be migrated, while all the rest (2M - 4k) needs no migration. In this scenario MIGRATE_PFN_MIGRATE is not set for the 2M - 4k. Of course, this scenario doesn't apply to the above vram_to_sram migration.

> 
> > +		return 0;
> > +
> > +	mr = xe_page_to_mem_region(src_page);
> > +	tile = xe_mem_region_to_tile(mr);
> > +	xe = tile_to_xe(tile);
> > +	dev = xe->drm.dev;
> > +
> > +	src_dpa = xe_mem_region_pfn_to_dpa(mr, src_pfn);
> > +	host_page = alloc_host_page(dev, vma, addr, &dma_addr, dst_pfn);
> > +	if (!host_page)
> > +		return -ENOMEM;
> > +
> > +	fence = xe_migrate_pa(tile->migrate, src_dpa, true,
> > +						dma_addr, false, PAGE_SIZE);
> > +	if (IS_ERR(fence)) {
> > +		dma_unmap_page(dev, dma_addr, PAGE_SIZE, DMA_FROM_DEVICE);
> > +		free_host_page(host_page);
> > +		return PTR_ERR(fence);
> > +	}
> > +
> > +	dma_fence_wait(fence, false);
> 
> Even if we did want to migrate a page at a time, we only need to wait on
> the last fence due to the ordered nature of exec queues.

Sure. As said, we migrate all pages in one shot in v3, so the above comment doesn't apply anymore.

> 
> > +	dma_fence_put(fence);
> > +	dma_unmap_page(dev, dma_addr, PAGE_SIZE, DMA_FROM_DEVICE);
> 
> With above, will likely unmap all dma pages in a single function once
> the last fence is signaled.

Yes, this was handled properly in v3.

> 
> > +	return 1;
> > +}
> > +
> > +/**
> > + * xe_svm_migrate_to_sram() - Migrate memory back to sram on CPU page fault
> > + *
> > + * @vmf: cpu vm fault structure, contains fault information such as vma etc.
> > + *
> > + * Note, this is in the CPU's vm fault handler; the caller holds the mmap
> > + * read lock.
> > + *
> > + * This function migrates the gpu vma which contains the fault address to
> > + * sram. We try to maintain a 1:1 mapping b/t the CPU vma and gpu vma (i.e.,
> > + * create one gpu vma for one cpu vma initially and try not to split it). So
> > + * this scheme ends up migrating at the vma granularity. This might not be
> > + * the most performant scheme.
> > + *
> > + * This can be tuned with a migration granularity for performance, for
> > + * example, migrating 2M for each CPU page fault, or letting the user specify
> > + * how much to migrate. This is more complex due to vma splitting.
> > + *
> > + * This function should also update the GPU page table, so the fault virtual
> > + * address points to the same sram location from the GPU side. This is TBD.
> > + *
> > + * Return:
> > + * 0 on success
> > + * VM_FAULT_SIGBUS: failed to migrate page to system memory; the application
> > + * will be signaled a SIGBUS
> > + */
> > +vm_fault_t xe_svm_migrate_to_sram(struct vm_fault *vmf)
> > +{
> > +	struct xe_mem_region *mr = xe_page_to_mem_region(vmf->page);
> > +	struct xe_tile *tile = xe_mem_region_to_tile(mr);
> > +	struct xe_device *xe = tile_to_xe(tile);
> > +	struct vm_area_struct *vma = vmf->vma;
> > +	struct mm_struct *mm = vma->vm_mm;
> > +	struct xe_svm *svm = xe_lookup_svm_by_mm(mm);
> 
> I don't think this is needed... More below.

Yes. I removed the xe_svm concept in v3. 

In my v3, we have a mm_struct pointer in gpuvm. If a gpuvm participates in svm, the mm_struct pointer will be set. This is from your PoC, and I think it is the simplest implementation.

Depending on how people view it, we might still bring back the xe_svm or drm_gpusvm (as in your new series) though. It's not a huge deal. Let's see how things go.

> 
> > +	unsigned long addr = vma->vm_start;
> > +	u64 npages = vma_pages(vma);
> > +	struct xe_vma *xe_vma;
> > +	vm_fault_t ret = 0;
> > +	struct xe_vm *vm;
> > +	void *buf;
> > +	int i;
> > +
> > +	struct migrate_vma migrate_vma = {
> > +		.vma		= vmf->vma,
> > +		.start		= vma->vm_start,
> > +		.end		= vma->vm_end,
> 
> So I know in my PoC I had the fault user pointer (xe_vma) == struct
> vm_area_struct of the GPU fault. That is definitely wrong. 

Can you explain why this is definitely wrong? I still think creating an xe_vma/fault userptr to cover the whole struct vm_area_struct range is not a bad idea. Our mmu notifier is xe-vma based, so basically for each vm_area_struct/xe-vma we have an mmu interval notifier.

The migration and gpu page table update can be any sub-range of the xe-vma, as long as the boundary of the sub-range is page aligned. 

> We likely want to allocate a sub-range of the vm_area_struct for the
> xe_vma; we can call this a chunk size. Logical chunk sizes would be
> aligned 2MB, 64k, and finally 4k, trying the largest first... The chunk
> sizes are trivial as we likely can just have a table with values; the key
> here is that the vm_area_struct vm_start / vm_end are not what we want to
> use here, but rather xe_vma_start(vma) and xe_vma_end(vma). 

This is the cpu page fault handler. I think it is just fine to use vm_area_struct.vm_start/end here. In my opinion, referring to a gpu structure's xe-vma start/end in the cpu page fault handler is a little strange. 

As said, I have moved to a one-page based scheme in cpu page fault handling. But even if we migrate the whole vma, I don't see a problem with the above code. Can you explain why you want to use the xe-vma here? My guess is that in your picture, you want one xe-vma to be backed either by system memory or by gpu device memory, but not a mixture. I am moving away from this scheme and adopting a page-centric design where an xe-vma can be backed by a mixture of sram and vram.


> I think we get the xe_vma from the faulting page's
> vmf->page->zone_device_data field, unless you have
> another use for that field...

I know this scheme works for you. But it is a design choice of how to use zone_device_data. In my v3, this is still used to hold the buddy block. I am not saying we can't do what you did, just that we can do it differently.

Struct page is all about physical memory management. Coupling it with a virtual memory structure such as xe-vma is not a good sign to me.

> 
> I also comment on my patch with my suggestion implement chunk sizes too.

I did look at your patch, and basically you are creating svm-ranges at chunk size...

In my v3, I am staying with the same concept as v2: create an xe-vma per cpu vma, but change the migration/gpu pt mapping to be chunk-size based, roughly.


> 
> > +		.pgmap_owner	= xe,
> 
> Again helper for this.


Addressed in v3
> 
> > +		.flags		= MIGRATE_VMA_SELECT_DEVICE_PRIVATE,
> > +		.fault_page = vmf->page,
> > +	};
> > +
> > +	buf = kvcalloc(npages, 2* sizeof(*migrate_vma.src), GFP_KERNEL);
> > +	migrate_vma.src = buf;
> > +	migrate_vma.dst = buf + npages;
> > +	if (migrate_vma_setup(&migrate_vma) < 0) {
> > +		ret = VM_FAULT_SIGBUS;
> > +		goto free_buf;
> > +	}
> > +
> > +	if (!migrate_vma.cpages)
> 
> This is an error, need to set a return value.

Agreed

> 
> > +		goto free_buf;
> > +
> 
> We probably should check migrate.cpages != npages too as I also think
> this is an error.

In the scheme of "no mixed placement within a vma", yes. As said, I am moving away from this scheme. Let's see how it goes.

> 
> > +	for (i = 0; i < npages; i++) {
> > +		ret = migrate_page_vram_to_ram(vma, addr,
> migrate_vma.src[i],
> > +							migrate_vma.dst + i);
> > +		if (ret < 0) {
> > +			ret = VM_FAULT_SIGBUS;
> > +			break;
> > +		}
> > +
> > +		/** Migration has been successful, free source page */
> > +		if (ret == 1) {
> > +			struct page *src_page =
> migrate_pfn_to_page(migrate_vma.src[i]);
> > +
> > +			xe_devm_page_free(src_page);
> > +		}
> > +
> > +		addr += PAGE_SIZE;
> > +	}
> 
> I touch on this above, this should be reworked to roughly:
> 
> - alloc pages and populate migrate_vma.dst
> - dma map sram pages
> - migrate a chunk of contigous vram addresses at a time
> - wait on last dma fence from migrate
> - unmap dma pages
> - unlock and free all pages

In v3 I roughly follow the above process, but a lot of things have changed; let's review it once it is ready.


> 
> > +
> > +	xe_svm_for_each_vm(svm, vm) {
> > +		xe_assert(xe, vm->mm == mm);
> > +		xe_vma = xe_vm_lookup_vma(vm, vmf->address);
> > +		if (xe_vma)
> > +			xe_vm_invalidate_vma(xe_vma);
> > +	}
> 
> I've touched on why this isn't needed... I think one of these
> migrate_vma_* functions will trigger all MMU notifiers registered for
> the range. The notifier owns the invalidate then.
> 
> Beyond this, maybe I'm confused about a few things and how this fits all
> together. Doesn't every user process have its own unique mm, fd, and vm
> (e.g. own address space)? If a user want a shared address space then use
> threads with a single mm, fd, and vm.
> 
> So even if we had to resolve the xe_vma's here and do an invalidate here
> very confused what this is doing. This is this the case with multiple
> devices and each VM points to a different device? Again so that case I
> don't think a xe_svm structure would be needed, on GPU fault we should
> be to detect from the faulting page zone_device_data and pgmap owner
> if the fault already has a physical backing on another GPU and resolve
> how to map it into GPU with a fault... Jason suggests this in the
> following thread [2] and I think I agree with him.
> 
> [2] https://lore.kernel.org/all/5495090e-dee1-4c8e-91bc-
> 240632fd3e35@amd.com/T/

Thanks, I agree with the above analysis. Yes, invalidation is already triggered during migrate_vma_setup, and the above code is deleted in v3.

> 
> > +	migrate_vma_pages(&migrate_vma);
> 
> This logic is going to change but ...
> 
> On an error I think we only want to call migrate_vma_finalize to revert
> pages back to the original state (i.e. migrate_vma_pages commits the
> page changes which we don't want to do on an error).

Agreed. I found this logic in your series. I will fix it in v3. 

> 
> > +	migrate_vma_finalize(&migrate_vma);
> > +free_buf:
> > +	kvfree(buf);
> > +	return 0;
> 
> I don't think 0 should blindly be return here, if there is an error
> return VM_FAULT_SIGBUS. We likely want a high level error message too.

That is correct. Fixed in v3.

Regards,
Oak
> 
> Matt
> 
> > +}
> > --
> > 2.26.3
> >

^ permalink raw reply	[flat|nested] 72+ messages in thread

* RE: [v2 27/31] drm/xe/svm: Handle CPU page fault
  2024-04-12 18:10       ` Matthew Brost
  2024-04-12 18:39         ` Zeng, Oak
@ 2024-06-07  4:44         ` Zeng, Oak
  1 sibling, 0 replies; 72+ messages in thread
From: Zeng, Oak @ 2024-06-07  4:44 UTC (permalink / raw)
  To: Brost, Matthew
  Cc: intel-xe@lists.freedesktop.org, Ghimiray, Himal Prasad,
	Bommu, Krishnaiah, Thomas.Hellstrom@linux.intel.com, Welty, Brian

Hi Matt,

> -----Original Message-----
> From: Brost, Matthew <matthew.brost@intel.com>
> Sent: Friday, April 12, 2024 2:11 PM
> To: Zeng, Oak <oak.zeng@intel.com>
> Cc: intel-xe@lists.freedesktop.org; Ghimiray, Himal Prasad
> <himal.prasad.ghimiray@intel.com>; Bommu, Krishnaiah
> <krishnaiah.bommu@intel.com>; Thomas.Hellstrom@linux.intel.com; Welty,
> Brian <brian.welty@intel.com>
> Subject: Re: [v2 27/31] drm/xe/svm: Handle CPU page fault
> 
> On Fri, Apr 12, 2024 at 11:24:06AM -0600, Zeng, Oak wrote:
> >
> >
> > > -----Original Message-----
> > > From: Brost, Matthew <matthew.brost@intel.com>
> > > Sent: Wednesday, April 10, 2024 10:07 PM
> > > To: Zeng, Oak <oak.zeng@intel.com>
> > > Cc: intel-xe@lists.freedesktop.org; Ghimiray, Himal Prasad
> > > <himal.prasad.ghimiray@intel.com>; Bommu, Krishnaiah
> > > <krishnaiah.bommu@intel.com>; Thomas.Hellstrom@linux.intel.com;
> Welty,
> > > Brian <brian.welty@intel.com>
> > > Subject: Re: [v2 27/31] drm/xe/svm: Handle CPU page fault
> > >
> > > On Tue, Apr 09, 2024 at 04:17:38PM -0400, Oak Zeng wrote:
> > > > Under the picture of svm, CPU and GPU program share one same
> > > > virtual address space. The backing store of this virtual address
> > > > space can be either in system memory or device memory. Since GPU
> > > > device memory is remaped as DEVICE_PRIVATE, CPU can't access it.
> > > > Any CPU access to device memory causes a page fault. Implement
> > > > a page fault handler to migrate memory back to system memory and
> > > > map it to CPU page table so the CPU program can proceed.
> > > >
> > > > Also unbind this page from GPU side, and free the original GPU
> > > > device page
> > > >
> > > > Signed-off-by: Oak Zeng <oak.zeng@intel.com>
> > > > Co-developed-by: Niranjana Vishwanathapura
> > > <niranjana.vishwanathapura@intel.com>
> > > > Signed-off-by: Niranjana Vishwanathapura
> > > <niranjana.vishwanathapura@intel.com>
> > > > Cc: Matthew Brost <matthew.brost@intel.com>
> > > > Cc: Thomas Hellström <thomas.hellstrom@intel.com>
> > > > Cc: Brian Welty <brian.welty@intel.com>
> > > > ---
> > > >  drivers/gpu/drm/xe/Makefile         |   1 +
> > > >  drivers/gpu/drm/xe/xe_svm.h         |   8 +-
> > > >  drivers/gpu/drm/xe/xe_svm_devmem.c  |   7 +-
> > > >  drivers/gpu/drm/xe/xe_svm_migrate.c | 222
> > > ++++++++++++++++++++++++++++
> > > >  4 files changed, 230 insertions(+), 8 deletions(-)
> > > >  create mode 100644 drivers/gpu/drm/xe/xe_svm_migrate.c
> > > >
> > > > diff --git a/drivers/gpu/drm/xe/Makefile
> b/drivers/gpu/drm/xe/Makefile
> > > > index f89d77b6d654..65289acdd563 100644
> > > > --- a/drivers/gpu/drm/xe/Makefile
> > > > +++ b/drivers/gpu/drm/xe/Makefile
> > > > @@ -131,6 +131,7 @@ xe-y += xe_bb.o \
> > > >  	xe_step.o \
> > > >  	xe_svm.o \
> > > >  	xe_svm_devmem.o \
> > > > +	xe_svm_migrate.o \
> > >
> > > See comments about file org, same thing applies here. Let's put all of
> > > the svm implementation in a single file.
> > >
> > > >  	xe_sync.o \
> > > >  	xe_tile.o \
> > > >  	xe_tile_sysfs.o \
> > > > diff --git a/drivers/gpu/drm/xe/xe_svm.h
> b/drivers/gpu/drm/xe/xe_svm.h
> > > > index f601dffe3fc1..c9e4239c44b4 100644
> > > > --- a/drivers/gpu/drm/xe/xe_svm.h
> > > > +++ b/drivers/gpu/drm/xe/xe_svm.h
> > > > @@ -7,11 +7,11 @@
> > > >  #define __XE_SVM_H
> > > >
> > > >  #include <linux/mm_types.h>
> > > > +#include <linux/mm.h>
> > > >  #include "xe_device_types.h"
> > > >  #include "xe_device.h"
> > > >  #include "xe_assert.h"
> > > > -
> > > > -struct xe_vm;
> > > > +#include "xe_vm_types.h"
> > > >
> > > >  /**
> > > >   * struct xe_svm - data structure to represent a shared
> > > > @@ -31,6 +31,9 @@ struct xe_svm {
> > > >  	struct list_head vm_list;
> > > >  };
> > > >
> > > > +#define xe_svm_for_each_vm(svm, vm)
> > > 	\
> > > > +		list_for_each_entry(vm, &svm->vm_list, svm_link)
> > > > +
> > >
> > > Don't think this is need, see below.
> > >
> > > >  extern struct xe_svm *xe_create_svm(void);
> > > >  void xe_destroy_svm(struct xe_svm *svm);
> > > >  extern struct xe_svm *xe_lookup_svm_by_mm(struct mm_struct
> *mm);
> > > > @@ -79,4 +82,5 @@ int xe_devm_alloc_pages(struct xe_tile *tile,
> > > >
> > > >  void xe_devm_free_blocks(struct list_head *blocks);
> > > >  void xe_devm_page_free(struct page *page);
> > > > +vm_fault_t xe_svm_migrate_to_sram(struct vm_fault *vmf);
> > > >  #endif
> > > > diff --git a/drivers/gpu/drm/xe/xe_svm_devmem.c
> > > b/drivers/gpu/drm/xe/xe_svm_devmem.c
> > > > index 088ac209ad80..32ada458f1dd 100644
> > > > --- a/drivers/gpu/drm/xe/xe_svm_devmem.c
> > > > +++ b/drivers/gpu/drm/xe/xe_svm_devmem.c
> > > > @@ -37,11 +37,6 @@ struct xe_svm_block_meta {
> > > >  	unsigned long bitmap[];
> > > >  };
> > > >
> > > > -static vm_fault_t xe_devm_migrate_to_ram(struct vm_fault *vmf)
> > > > -{
> > > > -	return 0;
> > > > -}
> > > > -
> > > >  static u64 block_offset_to_pfn(struct xe_mem_region *mr, u64 offset)
> > > >  {
> > > >  	/** DRM buddy's block offset is 0-based*/
> > > > @@ -168,7 +163,7 @@ void xe_devm_free_blocks(struct list_head
> *blocks)
> > > >
> > > >  static const struct dev_pagemap_ops xe_devm_pagemap_ops = {
> > > >  	.page_free = xe_devm_page_free,
> > > > -	.migrate_to_ram = xe_devm_migrate_to_ram,
> > > > +	.migrate_to_ram = xe_svm_migrate_to_sram,
> > >
> > > Again single file so this will be static function, no reason to export
> > > this.
> > >
> > > >  };
> > > >
> > > >  /**
> > > > diff --git a/drivers/gpu/drm/xe/xe_svm_migrate.c
> > > b/drivers/gpu/drm/xe/xe_svm_migrate.c
> > > > new file mode 100644
> > > > index 000000000000..0db831af098e
> > > > --- /dev/null
> > > > +++ b/drivers/gpu/drm/xe/xe_svm_migrate.c
> > > > @@ -0,0 +1,222 @@
> > > > +// SPDX-License-Identifier: MIT
> > > > +/*
> > > > + * Copyright © 2023 Intel Corporation
> > > > + */
> > > > +
> > > > +#include <linux/gfp.h>
> > > > +#include <linux/migrate.h>
> > > > +#include <linux/dma-mapping.h>
> > > > +#include <linux/dma-fence.h>
> > > > +#include <linux/bitops.h>
> > > > +#include <linux/bitmap.h>
> > > > +#include <linux/kernel.h>
> > > > +#include <linux/slab.h>
> > > > +#include <drm/drm_buddy.h>
> > > > +#include "xe_device_types.h"
> > > > +#include "xe_device.h"
> > > > +#include "xe_trace.h"
> > > > +#include "xe_migrate.h"
> > > > +#include "xe_ttm_vram_mgr_types.h"
> > > > +#include "xe_assert.h"
> > > > +#include "xe_pt.h"
> > > > +#include "xe_svm.h"
> > > > +#include "xe_vm.h"
> > > > +
> > > > +
> > > > +/**
> > > > + * alloc_host_page() - allocate one host page for the fault vma
> > > > + *
> > > > + * @dev: (GPU) device that will access the allocated page
> > > > + * @vma: the fault vma that we need allocate page for
> > > > + * @addr: the fault address. The allocated page is for this address
> > > > + * @dma_addr: used to output the dma address of the allocated page.
> > > > + * This dma address will be used for gpu to access this page. GPU
> > > > + * access host page through a dma mapped address.
> > > > + * @pfn: used to output the pfn of the allocated page.
> > > > + *
> > > > + * This function allocate one host page for the specified vma. It
> > > > + * also does some prepare work for GPU to access this page, such
> > > > + * as map this page to iommu (by calling dma_map_page).
> > > > + *
> > > > + * When this function returns, the page is locked.
> > > > + *
> > > > + * Return struct page pointer when success
> > > > + * NULL otherwise
> > > > + */
> > > > +static struct page *alloc_host_page(struct device *dev,
> > > > +							 struct
> vm_area_struct
> > > *vma,
> > > > +							 unsigned long addr,
> > > > +							 dma_addr_t
> > > *dma_addr,
> > > > +							 unsigned long *pfn)
> > >
> > > Weird alignment, also I don't think we are want to allocate a page at
> > > time...
> > >
> > > Beyond that, can't say I'm a fan of 2 arguments being return and
> > > populated here either (dma_addr_t *dma_addr, unsigned long *pfn). I
> > > haven't seen a lot that style function in Linux.
> > >
> > > Probably makes more sense to have a function which allocates pages,
> > > locks them, and populates the pfn array (migrate_pfn) rather than doing
> > > this a page at a time.
> > >
> > > > +{
> > > > +	struct page *page;
> > > > +
> > > > +	page = alloc_page_vma(GFP_HIGHUSER, vma, addr);
> > > > +	if (unlikely(!page))
> > > > +		return NULL;
> > > > +
> > > > +	/**Lock page per hmm requirement, see hmm.rst*/
> > > > +	lock_page(page);
> > > > +	*dma_addr = dma_map_page(dev, page, 0, PAGE_SIZE,
> > > DMA_FROM_DEVICE);
> > >
> > > The device is writing to these pages so I think DMA_BIDIRECTIONAL is
> > > needed, right? As mentioned above I think this should be broken out into
> > > a different step too.
> > >
> > > > +	if (unlikely(dma_mapping_error(dev, *dma_addr))) {
> > > > +		unlock_page(page);
> > > > +		__free_page(page);
> > > > +		return NULL;
> > > > +	}
> > > > +
> > > > +	*pfn = migrate_pfn(page_to_pfn(page));
> > > > +	return page;
> > > > +}
> > > > +
> > > > +static void free_host_page(struct page *page)
> > > > +{
> > > > +	unlock_page(page);
> > > > +	put_page(page);
> > > > +}
> > > > +
> > > > +/**
> > > > + * migrate_page_vram_to_ram() - migrate one page from vram to ram
> > > > + *
> > > > + * @vma: The vma that the page is mapped to
> > > > + * @addr: The virtual address that the page is mapped to
> > > > + * @src_pfn: src page's page frame number
> > > > + * @dst_pfn: used to return dstination page (in system ram)'s pfn
> > > > + *
> > > > + * Allocate one page in system ram and copy memory from device
> memory
> > > > + * to system ram.
> > > > + *
> > > > + * Return: 0 if this page is already in sram (no need to migrate)
> > > > + * 1: successfully migrated this page from vram to sram.
> > > > + * error code otherwise
> > > > + */
> > > > +static int migrate_page_vram_to_ram(struct vm_area_struct *vma,
> > > unsigned long addr,
> > > > +						unsigned long src_pfn,
> > > unsigned long *dst_pfn)
> > > > +{
> > >
> > > We definitely don't want to copy 1 page at time. I touch on this in [1].
> > > Basically this going to perform poorly unless we use larger copies, the
> > > migrate code supports non-contigous sram addresses, and vram
> addresses
> > > will likely be contigous due to the buddy allocator.
> > >
> > > [1]
> https://patchwork.freedesktop.org/patch/588548/?series=132229&rev=1
> > >
> > > > +	struct xe_mem_region *mr;
> > > > +	struct xe_tile *tile;
> > > > +	struct xe_device *xe;
> > > > +	struct device *dev;
> > > > +	dma_addr_t dma_addr = 0;
> > > > +	struct dma_fence *fence;
> > > > +	struct page *host_page;
> > > > +	struct page *src_page;
> > > > +	u64 src_dpa;
> > > > +
> > > > +	src_page = migrate_pfn_to_page(src_pfn);
> > > > +	if (unlikely(!src_page || !(src_pfn & MIGRATE_PFN_MIGRATE)))
> > >
> > > I'm going to say this is a bug if !src_page ||
> > > !is_zone_device_page(src_page) || !(src_pfn &
> MIGRATE_PFN_MIGRATE) and
> > > we return -EFAULT (or another error code if that makes more sense). We
> > > are migrating from the device where we know we have backing store
> from
> > > the original fault.
> > >
> > > > +		return 0;
> > > > +
> > > > +	mr = xe_page_to_mem_region(src_page);
> > > > +	tile = xe_mem_region_to_tile(mr);
> > > > +	xe = tile_to_xe(tile);
> > > > +	dev = xe->drm.dev;
> > > > +
> > > > +	src_dpa = xe_mem_region_pfn_to_dpa(mr, src_pfn);
> > > > +	host_page = alloc_host_page(dev, vma, addr, &dma_addr, dst_pfn);
> > > > +	if (!host_page)
> > > > +		return -ENOMEM;
> > > > +
> > > > +	fence = xe_migrate_pa(tile->migrate, src_dpa, true,
> > > > +						dma_addr, false, PAGE_SIZE);
> > > > +	if (IS_ERR(fence)) {
> > > > +		dma_unmap_page(dev, dma_addr, PAGE_SIZE,
> > > DMA_FROM_DEVICE);
> > > > +		free_host_page(host_page);
> > > > +		return PTR_ERR(fence);
> > > > +	}
> > > > +
> > > > +	dma_fence_wait(fence, false);
> > >
> > > Even if we did want to migrate a page at a time, we only need to wait on
> > > the last fence due to the ordered nature of exec queues.
> > >
> > > > +	dma_fence_put(fence);
> > > > +	dma_unmap_page(dev, dma_addr, PAGE_SIZE,
> DMA_FROM_DEVICE);
> > >
> > > With above, will likely unmap all dma pages in a single function once
> > > the last fence is signaled.
> > >
> > > > +	return 1;
> > > > +}
> > > > +
> > > > +/**
> > > > + * xe_svm_migrate_to_sram() - Migrate memory back to sram on CPU
> page
> > > fault
> > > > + *
> > > > + * @vmf: cpu vm fault structure, contains fault information such as
> vma etc.
> > > > + *
> > > > + * Note, this is in CPU's vm fault handler, caller holds the mmap read
> lock.
> > > > + *
> > > > + * This function migrate one gpu vma which contains the fault address
> to
> > > sram.
> > > > + * We try to maintain a 1:1 mapping b/t the CPU vma and gpu vma (i.e.,
> > > create one
> > > > + * gpu vma for one cpu vma initially and try not to split it). So this
> scheme
> > > end
> > > > + * up migrate at the vma granularity. This might not be the best
> performant
> > > scheme
> > > > + *
> > > > + * This can be tunned with a migration granularity for  performance,
> for
> > > example,
> > > > + * migration 2M for each CPU page fault, or let user specify how much
> to
> > > migrate.
> > > > + * This is more complex due to vma splitting.
> > > > + *
> > > > + * This function should also update GPU page table, so the fault virtual
> > > address
> > > > + * points to the same sram location from GPU side. This is TBD.
> > > > + *
> > > > + * Return:
> > > > + * 0 on success
> > > > + * VM_FAULT_SIGBUS: failed to migrate page to system memory,
> > > application
> > > > + * will be signaled a SIGBUG
> > > > + */
> > > > +vm_fault_t xe_svm_migrate_to_sram(struct vm_fault *vmf)
> > > > +{
> > > > +	struct xe_mem_region *mr = xe_page_to_mem_region(vmf->page);
> > > > +	struct xe_tile *tile = xe_mem_region_to_tile(mr);
> > > > +	struct xe_device *xe = tile_to_xe(tile);
> > > > +	struct vm_area_struct *vma = vmf->vma;
> > > > +	struct mm_struct *mm = vma->vm_mm;
> > > > +	struct xe_svm *svm = xe_lookup_svm_by_mm(mm);
> > >
> > > I don't think this is needed... More below.
> > >
> > > > +	unsigned long addr = vma->vm_start;
> > > > +	u64 npages = vma_pages(vma);
> > > > +	struct xe_vma *xe_vma;
> > > > +	vm_fault_t ret = 0;
> > > > +	struct xe_vm *vm;
> > > > +	void *buf;
> > > > +	int i;
> > > > +
> > > > +	struct migrate_vma migrate_vma = {
> > > > +		.vma		= vmf->vma,
> > > > +		.start		= vma->vm_start,
> > > > +		.end		= vma->vm_end,
> > >
> > > So I know in my PoC I had the fault user pointer (xe_vma) == struct
> > > vm_area_struct of the GPU fault. That is definitely wrong. We likely
> > > want to allocate sub-range of vm_area_struct for the xe_vma, we can call
> > > this a chunk size. Logical chunks sizes would be aligned 2MB, 64k, and
> > > finally 4k in sizes trying the largest first... The chunk sizes are
> > > trivial as we likely can just have table with values, the key here is
> > > the vm_area_struct vm_start / vm_end are not what we want to use
> here
> > > rather xe_vma_start(vma) and xe_vma_end(vma). I think we get the
> xe_vma
> 
> After I typed this, realized I made a mistake here...
> 
> s/xe_vma_start/xe_vma_userptr/
> s/xe_vma_end/xe_vma_userptr + xe_vma_size/
> 
> But you get the idea - the zone_device_data points Xe specific chunk
> data (currently xe_vma, could be xe_pt_state our something later if we
> switch to 1:N).
> 
> Check AMD's + Nvidia's drivers and they both use this field in a similar
> way.

I got your idea, but as explained in an earlier reply, I am not convinced it is a good idea to use zone_device_data to point to an xe_vma.

> 
> > > from the faulting page vmf->page->zone_device_data field unless you
> have
> > > another use that field...
> >
> > You are right. Such work is planned in the memory attributes part that
> Himal is working on. We have a migration_granularity attribute which allow
> user to set the chunk size.
> >
> 
> That makes sense. The chunk size is always just hint though, right?

We will have a default chunk size of 2M, which can be overridden by a hint.


> 
> > >
> > > I also comment on my patch with my suggestion implement chunk sizes
> too.
> > >
> > > > +		.pgmap_owner	= xe,
> > >
> > > Again helper for this.
> > >
> > > > +		.flags		= MIGRATE_VMA_SELECT_DEVICE_PRIVATE,
> > > > +		.fault_page = vmf->page,
> > > > +	};
> > > > +
> > > > +	buf = kvcalloc(npages, 2* sizeof(*migrate_vma.src), GFP_KERNEL);
> > > > +	migrate_vma.src = buf;
> > > > +	migrate_vma.dst = buf + npages;
> > > > +	if (migrate_vma_setup(&migrate_vma) < 0) {
> > > > +		ret = VM_FAULT_SIGBUS;
> > > > +		goto free_buf;
> > > > +	}
> > > > +
> > > > +	if (!migrate_vma.cpages)
> > >
> > > This is an error, need to set a return value.
> > >
> > > > +		goto free_buf;
> > > > +
> > >
> > > We probably should check migrate.cpages != npages too as I also think
> > > this is an error.
> > >
> > > > +	for (i = 0; i < npages; i++) {
> > > > +		ret = migrate_page_vram_to_ram(vma, addr,
> > > migrate_vma.src[i],
> > > > +							migrate_vma.dst + i);
> > > > +		if (ret < 0) {
> > > > +			ret = VM_FAULT_SIGBUS;
> > > > +			break;
> > > > +		}
> > > > +
> > > > +		/** Migration has been successful, free source page */
> > > > +		if (ret == 1) {
> > > > +			struct page *src_page =
> > > migrate_pfn_to_page(migrate_vma.src[i]);
> > > > +
> > > > +			xe_devm_page_free(src_page);
> > > > +		}
> > > > +
> > > > +		addr += PAGE_SIZE;
> > > > +	}
> > >
> > > I touch on this above, this should be reworked to roughly:
> > >
> > > - alloc pages and populate migrate_vma.dst
> > > - dma map sram pages
> > > - migrate a chunk of contigous vram addresses at a time
> > > - wait on last dma fence from migrate
> > > - unmap dma pages
> > > - unlock and free all pages
> > >
> > > > +
> > > > +	xe_svm_for_each_vm(svm, vm) {
> > > > +		xe_assert(xe, vm->mm == mm);
> > > > +		xe_vma = xe_vm_lookup_vma(vm, vmf->address);
> > > > +		if (xe_vma)
> > > > +			xe_vm_invalidate_vma(xe_vma);
> > > > +	}
> > >
> > > I've touched on why this isn't needed... I think one of these
> > > migrate_vma_* functions will trigger all MMU notifiers registered for
> > > the range. The notifier owns the invalidate then.
> >
> > Very good point. Yes after read migrate_vma_setup function, I agree this
> function will call mmu notifiers with MMU_NOTIFY_MIGRATE event. Today
> we invalidate vma with this event. So yes, above codes are not needed.
> >
> > I do identified one potential improvement: when mmu notifier is called
> back with MMU_NOTIFY_MIGRATE event, if the migrate_vma_setup is
> called from the gpu page fault path, we can ignore the gpu vma invalidation
> as we will update gpu page table later after migration anyway. I think an page
> table invalidation is not needed in such case. But this should be just a minor
> improvement.
> >
> 
> We skip invalidations if the initial_bind flag is clear which should
> cover at the initial GPU fault. There is certainly room for improvement
> / optimizations in the MMU notifier though, it is kinda messy right now
> too. IMO work like this can be done once the basic design is working +
> tests in place to verify changes / optimizations.

Agreed.

> 
> >
> > >
> > > Beyond this, maybe I'm confused about a few things and how this fits all
> > > together. Doesn't every user process have its own unique mm, fd, and
> vm
> > > (e.g. own address space)? If a user want a shared address space then use
> > > threads with a single mm, fd, and vm.
> >
> > Yes, this is also my understanding. Each user process has its own mm struct
> and device fds.
> >
> > In a shared address space case, such as there are multiple pthread created
> through pthread_create in one process, all those pthreads should have
> different kernel task_struct, but all those task_struct (say we get it from
> "current" macro) should share one same mm struct, which means they all
> lives inside one cpu address space.
> >
> > Now with this work, we are now basically extending this shared cpu
> address space to gpu program. So both cpu program and gpu program can
> share one address space.
> >
> > Since we allow user to create multiple gpu vm for one device (lets focus on
> one device for now), so each shared address space can have multiple gpu
> vm... each gpuvm should be able to register its own mmu notifier to core mm,
> even if those notifier has the same address range. But I will have to test this
> out. If all this works, above codes are not needed. If different gpuvm can't
> register mmu notifier for same address range, then we would need a fix....
> >
> 
> The mmu notifier code is implemented with the interval tree which
> supports overlapping rangers (i.e. we can have multiple VM's register
> notifiers with the sam address range in a single MM).

Cool. I understand it now.


> 
> >
> > >
> > > So even if we had to resolve the xe_vma's here and do an invalidate here
> > > very confused what this is doing. This is this the case with multiple
> > > devices and each VM points to a different device?
> >
> > Right now I only focus on single device. See above. This is to solve one gpu
> device but multiple gpu vm case. But as said above, for now I don't think this
> is needed. I need to test more on the mmu notifier behavior: whether it
> allow us to insert two notifiers for the same range for one mm....
> >
> 
> Agree that our focus should be on single device now. If that design it
> well thought out I don't think extending this to multiple devices will
> be a huge change either.

Agreed.

Regards,
Oak

> 
> Matt
> 
> > Oak
> >
> > Again so that case I
> > > don't think a xe_svm structure would be needed, on GPU fault we should
> > > be to detect from the faulting page zone_device_data and pgmap owner
> > > if the fault already has a physical backing on another GPU and resolve
> > > how to map it into GPU with a fault... Jason suggests this in the
> > > following thread [2] and I think I agree with him.
> > >
> > > [2] https://lore.kernel.org/all/5495090e-dee1-4c8e-91bc-
> > > 240632fd3e35@amd.com/T/
> > >
> > > > +	migrate_vma_pages(&migrate_vma);
> > >
> > > This logic is going to change but ...
> > >
> > > On an error I think we only want to call migrate_vma_finalize to revert
> > > pages back to the original state (i.e. migrate_vma_pages commits the
> > > page changes which we don't want to do on an error).
> > >
> > > > +	migrate_vma_finalize(&migrate_vma);
> > > > +free_buf:
> > > > +	kvfree(buf);
> > > > +	return 0;
> > >
> > > I don't think 0 should blindly be return here, if there is an error
> > > return VM_FAULT_SIGBUS. We likely want a high level error message too.
> > >
> > > Matt
> > >
> > > > +}
> > > > --
> > > > 2.26.3
> > > >

^ permalink raw reply	[flat|nested] 72+ messages in thread

* RE: [v2 28/31] drm/xe/svm: Introduce helper to migrate vma to vram
  2024-04-15 19:40       ` Matthew Brost
@ 2024-06-07 17:12         ` Zeng, Oak
  2024-06-07 17:56           ` Matthew Brost
  0 siblings, 1 reply; 72+ messages in thread
From: Zeng, Oak @ 2024-06-07 17:12 UTC (permalink / raw)
  To: Brost, Matthew, Bommu, Krishnaiah
  Cc: intel-xe@lists.freedesktop.org, Ghimiray, Himal Prasad,
	Thomas.Hellstrom@linux.intel.com, Welty, Brian

Hi Matt,

> -----Original Message-----
> From: Brost, Matthew <matthew.brost@intel.com>
> Sent: Monday, April 15, 2024 3:40 PM
> To: Zeng, Oak <oak.zeng@intel.com>
> Cc: intel-xe@lists.freedesktop.org; Ghimiray, Himal Prasad
> <himal.prasad.ghimiray@intel.com>; Bommu, Krishnaiah
> <krishnaiah.bommu@intel.com>; Thomas.Hellstrom@linux.intel.com; Welty,
> Brian <brian.welty@intel.com>
> Subject: Re: [v2 28/31] drm/xe/svm: Introduce helper to migrate vma to
> vram
> 
> On Fri, Apr 12, 2024 at 03:21:04PM -0600, Zeng, Oak wrote:
> >
> >
> > > -----Original Message-----
> > > From: Brost, Matthew <matthew.brost@intel.com>
> > > Sent: Wednesday, April 10, 2024 10:49 PM
> > > To: Zeng, Oak <oak.zeng@intel.com>
> > > Cc: intel-xe@lists.freedesktop.org; Ghimiray, Himal Prasad
> > > <himal.prasad.ghimiray@intel.com>; Bommu, Krishnaiah
> > > <krishnaiah.bommu@intel.com>; Thomas.Hellstrom@linux.intel.com;
> Welty,
> > > Brian <brian.welty@intel.com>
> > > Subject: Re: [v2 28/31] drm/xe/svm: Introduce helper to migrate vma to
> vram
> > >
> > > On Tue, Apr 09, 2024 at 04:17:39PM -0400, Oak Zeng wrote:
> > > > Introduce a helper function xe_svm_migrate_vma_to_vram.
> > > >
> > > > Since the source pages of the svm range can be physically not
> > > > contiguous, and the destination vram pages can also be not
> > > > contiguous, there is no easy way to migrate multiple pages per
> > > > blitter command. We do page by page migration for now.
> > > >
> > > > Migration is best effort. Even if we fail to migrate some pages,
> > > > we will try to migrate the rest pages.
> > > >
> > > > FIXME: Use one blitter command to copy when both src and dst are
> > > > physically contiguous
> > > >
> > >
> > > Yep, touch in this throughout the series. Only vram needs to be
> > > contiguous though as we dynamically create PT mappings for sram pages
> in
> > > the migrate code. Getting this in a must and should be done immediately
> > > IMO as this a very, very basic perform thing we know needs to be done.
> > > We will likely have to tune this code quite a bit for performance so
> > > getting known things done would be helpful.
> > >
> > > > FIXME: when a vma is partially migrated, split vma as we assume
> > > > no mixture vma placement.
> > > >
> > >
> > > Agree we do not want support partial migrations. We likely want to
> > > return -EAGAIN for something and fault back to a smaller xe_vma chunk
> > > size which I discussed in [1] and add comment on in [2].


I thought about this when I worked on v3. I ended up keeping the xe_vma unsplit but migrating only part of a vma, so a vma can now have a mixed backing store. To support this I had to make the page-table update and invalidation range based (vs. whole-xe_vma based). I am testing it; not very sure how well this goes.

One thing I realized might not work well is the range-based invalidation. Say we have a PDE entry covering a huge 2M or 1G page: if we only zap 4K in this range, we need to split the PDE entry and replace it with many PDE/PTE entries. Even though the zap_ptes/page-table-walk functions are range based, I doubt they handle such a case. I will have to test it out. Even if it doesn't work, I think it is worth making it work as an improvement area; core mm obviously supports this, see zap_pte_range.



> > >
> > > Migration should be best case too if we fail migrate we can always leave
> > > backing store in sram too.
> > >
> > > I do have question though, when do we get partial migrations? A user
> > > having called mlock on some of the pages? I just want to make sure I
> > > fully understand that case.
> >
> > Yah, mlock could be one case...
> >
> > I also looked the hmm code. There are few other cases where
> MIGRATE_PFN_MIGRATE is not set (so we skip migration after), such as pte
> is NULL and vma is file-backed (not anonymous); entry is swapped out to
> hard disk etc. see more details in function migrate_vma_collect_pmd.
> >
> >
> > >
> > > [1]
> https://patchwork.freedesktop.org/patch/588526/?series=132229&rev=1
> > > [2]
> https://patchwork.freedesktop.org/patch/588528/?series=132229&rev=1
> > >
> > > > Signed-off-by: Oak Zeng <oak.zeng@intel.com>
> > > > Co-developed-by: Niranjana Vishwanathapura
> > > <niranjana.vishwanathapura@intel.com>
> > > > Signed-off-by: Niranjana Vishwanathapura
> > > <niranjana.vishwanathapura@intel.com>
> > > > Cc: Matthew Brost <matthew.brost@intel.com>
> > > > Cc: Thomas Hellström <thomas.hellstrom@intel.com>
> > > > Cc: Brian Welty <brian.welty@intel.com>
> > > > ---
> > > >  drivers/gpu/drm/xe/xe_svm.h         |   2 +
> > > >  drivers/gpu/drm/xe/xe_svm_migrate.c | 115
> > > ++++++++++++++++++++++++++++
> > >
> > > Same comment on file structure throughout the series apply here too.
> > >
> > > >  2 files changed, 117 insertions(+)
> > > >
> > > > diff --git a/drivers/gpu/drm/xe/xe_svm.h
> b/drivers/gpu/drm/xe/xe_svm.h
> > > > index c9e4239c44b4..18ce2e3757c5 100644
> > > > --- a/drivers/gpu/drm/xe/xe_svm.h
> > > > +++ b/drivers/gpu/drm/xe/xe_svm.h
> > > > @@ -83,4 +83,6 @@ int xe_devm_alloc_pages(struct xe_tile *tile,
> > > >  void xe_devm_free_blocks(struct list_head *blocks);
> > > >  void xe_devm_page_free(struct page *page);
> > > >  vm_fault_t xe_svm_migrate_to_sram(struct vm_fault *vmf);
> > > > +int xe_svm_migrate_vma_to_vram(struct xe_vm *vm, struct xe_vma
> *vma,
> > > > +							struct xe_tile *tile);
> > > >  #endif
> > > > diff --git a/drivers/gpu/drm/xe/xe_svm_migrate.c
> > > b/drivers/gpu/drm/xe/xe_svm_migrate.c
> > > > index 0db831af098e..ab8dd1f58aa4 100644
> > > > --- a/drivers/gpu/drm/xe/xe_svm_migrate.c
> > > > +++ b/drivers/gpu/drm/xe/xe_svm_migrate.c
> > > > @@ -220,3 +220,118 @@ vm_fault_t xe_svm_migrate_to_sram(struct
> > > vm_fault *vmf)
> > > >  	kvfree(buf);
> > > >  	return 0;
> > > >  }
> > > > +
> > > > +/**
> > > > + * xe_svm_migrate_vma_to_vram() - migrate backing store of a vma
> to
> > > vram
> > > > + * Must be called with mmap_read_lock held.
> > > > + * @vm: the vm that the vma belongs to
> > > > + * @vma: the vma to migrate.
> > > > + * @tile: the destination tile which holds the new backing store of the
> range
> > > > + *
> > > > + * Returns: negative errno on faiure, 0 on success
> > > > + */
> > > > +int xe_svm_migrate_vma_to_vram(struct xe_vm *vm,
> > > > +							struct xe_vma *vma,
> > > > +							struct xe_tile *tile)
> > > > +{
> > > > +	struct mm_struct *mm = vm->mm;
> > > > +	unsigned long start = xe_vma_start(vma);
> > > > +	unsigned long end = xe_vma_end(vma);
> > > > +	unsigned long npages = (end - start) >> PAGE_SHIFT;
> > > > +	struct xe_mem_region *mr = &tile->mem.vram;
> > > > +	struct vm_area_struct *vas;
> > > > +
> > > > +	struct migrate_vma migrate = {
> > > > +		.start		= start,
> > > > +		.end		= end,
> > > > +		.pgmap_owner	= tile->xe,
> > >
> > > Again helper to assign owner.
> > >
> > > > +		.flags          = MIGRATE_VMA_SELECT_SYSTEM,
> > > > +	};
> > > > +	struct device *dev = tile->xe->drm.dev;
> > > > +	dma_addr_t *src_dma_addr;
> > > > +	struct dma_fence *fence;
> > > > +	struct page *src_page;
> > > > +	LIST_HEAD(blocks);
> > > > +	int ret = 0, i;
> > > > +	u64 dst_dpa;
> > > > +	void *buf;
> > > > +
> > > > +	mmap_assert_locked(mm);
> > >
> > > This mmap_assert_locked is ambiguous, we should make it clear if this
> > > read or write locked. Doesn't it have to be write to do the migrate
> > > pages?
> >
> > I followed hmm document (Documents/mm/hmm.rst), see section
> "Migration to and from device memory". It explicitly write a read_lock in this
> document.
> >
> > I believe a read_lock is enough for the
> migrate_vma_setup/migrate_vma_finalize().
> >
> > As I understand it, the mm.mmap_lock protect the process's address space.
> When we modify process's address space such as mmap/munmap, we need
> to hold a write mode lock; if we only read process's address space, such as in
> the migrate_vma_setup/finalize, or in the cpu page fault handler case, we
> only need a read mode lock.
> >
> > I also cross checked amd driver. They also use a read lock.. see function
> svm_range_restore_pages in kfd_svm.c....
> >
> 
> Yea, I see that too. Trying to figure out the locking, IMO the locking
> document might actually be wrong, or the very least the locking design
> is very ill-conceived. We can discuss internally a bit before I
> publically share my grievances.
> 
> >
> > >
> > > A larger question about the locking... The CPU fault handler holds the
> > > mmap lock in write mode, right?
> >
> > No. since cpu fault handler doesn't modify process address space, instead it
> only fill up cpu page table for some valid address range, a read lock is enough.
> >
> 
> Ah, yes after digging around a bit I see this.
> 
> > >
> > > I'm asking as basically I think at least initially we want to hold the
> > > mmap lock in a way that the GPU handler and CPU handler do not race.
> > > i.e. From fault userptr create in GPU fault handler to issuing the bind
> > > we prevent the CPU fault handler from running.
> >
> > Yes we hold mmap_read_lock in both cpu and gpu fault handler to avoid
> that race.
> >
> 
> That's not how rw lock work. 2 threads can both hold the read lock in
> parallel (shared read access), only 1 thread hold the write lock
> (exclusive write access, no one can hold read lock either). Thus my
> concern about the cpu and gpu fault handler running in parallel and the
> larger locking design questions. Again we can talk through this in
> detail internally a bit.
> 


The GPU/CPU handler race is a great question. Obviously my thinking above was wrong. Here is my new understanding:

Consider this scenario: the backing store of an address is in vram. A CPU access faults, and the CPU fault handler calls migrate_vma_setup, which invalidates the GPU page table. The GPU then accesses the same address and faults, and the GPU fault handler calls its own migrate_vma_setup. Because the CPU fault is not finished (migrate_vma_finalize has not been called yet), this reads back the pfns of the vram backing store. But since the GPU side only selects system memory to migrate, migrate_vma_setup on the GPU side returns no valid pfns to migrate, so the GPU page fault handler returns successfully and GPU execution resumes. It immediately triggers another GPU page fault, and this repeats until the CPU-side migration is done.


The same reasoning applies when a GPU fault happens during the other stages of the CPU fault handler, and when a CPU fault handler interrupts a GPU fault handler.

This scheme seems to work to me, but we need to test it out. We will add a test case for this race condition. @Bommu, Krishnaiah fyi.

> > In user mmap/munmap (such as kernel function vm_munmap), we hold a
> mmap_write_lock which prevent it racing with cpu and gpu fault handler.
> >
> >
> > >
> > > I'm having issues figuring out how to prevent races between initial
> > > binds of userptrs and userptr invalidates on faulting VMs. This race is
> > > seen any xe_exec_fault_mode for example... So preventing races
> between
> > > CPU / GPU fault handler with the mmap probably is a good idea initially.
> > > Likely can make the locking finer grained once this is all working and I
> > > figure out how to handle this race better.
> >
> >
> > I don't quite follow here.
> >
> > Initial bind of user ptr... if you meant the bind in gpu page fault handler,
> then the racing with invalidation is roughly like below:
> > Invalidation is called with mmap_write_lock
> 
> Is it? If the notifer does the invalidation via migrate_vma_setup in the
> CPU fault handler we established about that only the mmap_read_lock is
> held.

Yes, the CPU-fault-triggered migrate_vma_setup/invalidation only holds the read lock.

It was the munmap-triggered invalidation I had in mind. For that case, we hold mmap_write_lock during invalidation, so it is mutually exclusive with GPU-triggered migration.

But a CPU-fault-triggered invalidation can proceed during a GPU-fault-triggered migration. As explained above, this causes an extra GPU page table invalidation, which is fine because the consequence is just an extra GPU page fault (and a later migration). So functionally I don't see a problem with this scheme.

The main reason we hold a read lock in the GPU fault handler is that the GPU fault handler doesn't modify the process address space; it only traverses the CPU page table.

> 
> > In userptr_pin_page, we hold a mmap_read_lock, so we know during pin
> page, invalidation is excluded.
> 
> Nope, see above invalidation can happen when userptr_pin_page is
> executing because of the read lock. The seqno check (described below) is
> what prevents programming of bad page tables.

You are right.


> 
> > After pin, before programming gpu page table, we check whether there is
> invalidation happen *after last pin but before programming page table*, if
> that happened, we retry
> >
> 
> Yes, that is how it works on tip but I am refactoring it in [1]. I was
> trying to avoid the retry loop by turning PDE/PTE writes into clears if an
> invalidation happened but not sure if that works without a larger
> refactor due to nature PDEs being shared. I may need the retry loop but
> that also gets tricky with array of binds... A few options here and will
> land a on solution once [2] is merged.

I discussed this retry loop (the one in the xe_pt_userptr_pre_commit function before your change) with Thomas some time ago. I agree the retry loop scheme works.

But to me other schemes without a loop also work. As long as we have a retry check before programming the GPU page table, it is fine with me.

If you have further fixes in this area, I will keep watching and pick them up.

> 
> Regardless my point here is still valid - it may not be the worst idea
> when getting this code initially up and running just to grab
> mmap_write_lock in GPU fault handler as a BLK (big kernel lock) to
> prevent all races. Once the code is stable and stress testing in place,
> switch to finer grained locking as define in HMM document (or newly
> defined if we fine locking design is insufficient).

As explained, I will keep the read lock for now, and we will test it. If we do run into a race condition, I will try the write lock and come back.

Thanks a lot for the great questions, Matt! They are very helpful!


Oak

> 
> Matt
> 
> [1] https://patchwork.freedesktop.org/series/125608/
> [2] https://patchwork.freedesktop.org/series/132246/
> 
> >
> >
> > Oak
> >
> > >
> > > > +
> > > > +	vas = find_vma_intersection(mm, start, start + 4);
> > >
> > > find_vma should work fine here.
> > >
> > > > +	if (!vas)
> > > > +		return -ENOENT;
> > > > +
> > > > +	migrate.vma = vas;
> > > > +	buf = kvcalloc(npages, 2* sizeof(*migrate.src) +
> sizeof(*src_dma_addr),
> > > > +					GFP_KERNEL);
> > > > +	if(!buf)
> > > > +		return -ENOMEM;
> > > > +	migrate.src = buf;
> > > > +	migrate.dst = migrate.src + npages;
> > > > +	src_dma_addr = (dma_addr_t *) (migrate.dst + npages);
> > > > +	ret = xe_devm_alloc_pages(tile, npages, &blocks, migrate.dst);
> > >
> > > Again as I discussed in [3] I think this should be broken out into a
> > > different step with the blocks allocated before this, and here just
> > > populate migrate.dst from the existing blocks.
> > >
> > > [3]
> https://patchwork.freedesktop.org/patch/588523/?series=132229&rev=1
> > >
> > > > +	if (ret)
> > > > +		goto kfree_buf;
> > > > +
> > > > +	ret = migrate_vma_setup(&migrate);
> > > > +	if (ret) {
> > > > +		drm_err(&tile->xe->drm, "vma setup returned %d for range
> > > [%lx - %lx]\n",
> > > > +				ret, start, end);
> > > > +		goto free_dst_pages;
> > > > +	}
> > > > +
> > > > +	/**FIXME: partial migration of a range print a warning for now.
> > > > +	 * If this message is printed, we need to split xe_vma as we
> > > > +	 * don't support a mixture placement of one vma
> > > > +	 */
> > > > +	if (migrate.cpages != npages)
> > > > +		drm_warn(&tile->xe->drm, "Partial migration for range [%lx -
> > >  %lx], range is %ld pages, migrate only %ld pages\n",
> > > > +				start, end, npages, migrate.cpages);
> > >
> > > As discussed above, we shouldn't support this. We should fall back to
> > > smaller xe_vma chunk size until we find one that works or simply leave
> > > the pages in sram and map those pages to GPU.
> > >
> > > > +
> > > > +	/**Migrate page by page for now.
> > > > +	 * Both source pages and destination pages can physically not
> > > contiguous,
> > > > +	 * there is no good way to migrate multiple pages per blitter
> command.
> > > > +	 */
> > >
> > > Touched on this a bunch throughout the series, lets do better than a
> > > page a time migration.
> > >
> > > Algorithm should be very similar to what I discussed here [4] but with a
> > > few key differences.
> > >
> > > - I think the sram pages can be unpopulated (page == NULL) if the user
> > >   has not yet touched the page
> > > - Also I think the MIGRATE_PFN_MIGRATE bit being clear is valid
> > >
> > > In these cases this indicate we have to issue a copy for the pages we
> > > are accumulating with contigous vram addresses.
> > >
> > > [4]
> https://patchwork.freedesktop.org/patch/588526/?series=132229&rev=1
> > >
> > > > +	for (i = 0; i < npages; i++) {
> > > > +		src_page = migrate_pfn_to_page(migrate.src[i]);
> > > > +		if (unlikely(!src_page || !(migrate.src[i] &
> > > MIGRATE_PFN_MIGRATE)))
> > >
> > > Discussed this in the CPU fault patch, once we call migrate_vma_setup,
> > > on subsequent errors we need to call migrate_vma_finalize to revert the
> > > pages to the original state. At least I think if I am reading the doc
> > > after this correctly.
> > >
> > > Here on error we just free the pages...
> > >
> > > Matt
> > >
> > > > +			goto free_dst_page;
> > > > +
> > > > +		xe_assert(tile->xe, !is_zone_device_page(src_page));
> > > > +		src_dma_addr[i] = dma_map_page(dev, src_page, 0,
> > > PAGE_SIZE, DMA_TO_DEVICE);
> > > > +		if (unlikely(dma_mapping_error(dev, src_dma_addr[i]))) {
> > > > +			drm_warn(&tile->xe->drm, "dma map error for host
> > > pfn %lx\n", migrate.src[i]);
> > > > +			goto free_dst_page;
> > > > +		}
> > > > +		dst_dpa = xe_mem_region_pfn_to_dpa(mr, migrate.dst[i]);
> > > > +		fence = xe_migrate_pa(tile->migrate, src_dma_addr[i], false,
> > > > +				dst_dpa, true, PAGE_SIZE);
> > > > +		if (IS_ERR(fence)) {
> > > > +			drm_warn(&tile->xe->drm, "migrate host page
> > > (pfn: %lx) to vram failed\n",
> > > > +					migrate.src[i]);
> > > > +			/**Migration is best effort. Even we failed here, we
> > > continue*/
> > > > +			goto free_dst_page;
> > > > +		}
> > > > +		/**FIXME: Use the first migration's out fence as the second
> > > migration's input fence,
> > > > +		 * and so on. Only wait the out fence of last migration?
> > > > +		 */
> > > > +		dma_fence_wait(fence, false);
> > > > +		dma_fence_put(fence);
> > > > +free_dst_page:
> > > > +		xe_devm_page_free(pfn_to_page(migrate.dst[i]));
> > > > +	}
> > > > +
> > > > +	for (i = 0; i < npages; i++)
> > > > +		if (!(dma_mapping_error(dev, src_dma_addr[i])))
> > > > +			dma_unmap_page(dev, src_dma_addr[i], PAGE_SIZE,
> > > DMA_TO_DEVICE);
> > > > +
> > > > +	migrate_vma_pages(&migrate);
> > > > +	migrate_vma_finalize(&migrate);
> > > > +free_dst_pages:
> > > > +	if (ret)
> > > > +		xe_devm_free_blocks(&blocks);
> > > > +kfree_buf:
> > > > +	kfree(buf);
> > > > +	return ret;
> > > > +}
> > > > --
> > > > 2.26.3
> > > >

^ permalink raw reply	[flat|nested] 72+ messages in thread

* RE: [v2 31/31] drm/xe/svm: Migration from sram to vram for system allocator
  2024-04-11  2:55   ` Matthew Brost
@ 2024-06-07 17:22     ` Zeng, Oak
  2024-06-07 18:18       ` Matthew Brost
  0 siblings, 1 reply; 72+ messages in thread
From: Zeng, Oak @ 2024-06-07 17:22 UTC (permalink / raw)
  To: Brost, Matthew
  Cc: intel-xe@lists.freedesktop.org, Ghimiray, Himal Prasad,
	Bommu, Krishnaiah, Thomas.Hellstrom@linux.intel.com, Welty, Brian

Hi Matt,

> -----Original Message-----
> From: Brost, Matthew <matthew.brost@intel.com>
> Sent: Wednesday, April 10, 2024 10:55 PM
> To: Zeng, Oak <oak.zeng@intel.com>
> Cc: intel-xe@lists.freedesktop.org; Ghimiray, Himal Prasad
> <himal.prasad.ghimiray@intel.com>; Bommu, Krishnaiah
> <krishnaiah.bommu@intel.com>; Thomas.Hellstrom@linux.intel.com; Welty,
> Brian <brian.welty@intel.com>
> Subject: Re: [v2 31/31] drm/xe/svm: Migration from sram to vram for system
> allocator
> 
> On Tue, Apr 09, 2024 at 04:17:42PM -0400, Oak Zeng wrote:
> > If applicable, migrate a vma from sram to vram for system allocator.
> > Traditional userptr is not migrated. Only userptr created during
> > fault (aka userptr splitted from system allocator vma) can be
> > migrated.
> >
> > FIXME: The migration should be conditional on user memory attributes
> > setting. Add this logic when memory attributes are supported
> >
> > Signed-off-by: Oak Zeng <oak.zeng@intel.com>
> > ---
> >  drivers/gpu/drm/xe/xe_gt_pagefault.c | 9 ++++++++-
> >  drivers/gpu/drm/xe/xe_vm.c           | 4 ----
> >  2 files changed, 8 insertions(+), 5 deletions(-)
> >
> > diff --git a/drivers/gpu/drm/xe/xe_gt_pagefault.c
> b/drivers/gpu/drm/xe/xe_gt_pagefault.c
> > index 668984f0769e..c6ba00049964 100644
> > --- a/drivers/gpu/drm/xe/xe_gt_pagefault.c
> > +++ b/drivers/gpu/drm/xe/xe_gt_pagefault.c
> > @@ -20,6 +20,7 @@
> >  #include "xe_guc_ct.h"
> >  #include "xe_migrate.h"
> >  #include "xe_trace.h"
> > +#include "xe_svm.h"
> >  #include "xe_vm.h"
> >
> >  struct pagefault {
> > @@ -209,12 +210,18 @@ static int handle_pagefault(struct xe_gt *gt,
> struct pagefault *pf)
> >
> >  	if (xe_vma_is_userptr(vma) && write_locked) {
> >  		struct xe_userptr_vma *uvma = to_userptr_vma(vma);
> > +		struct xe_userptr *userptr = &uvma->userptr;
> >
> >  		spin_lock(&vm->userptr.invalidated_lock);
> > -		list_del_init(&uvma->userptr.invalidate_link);
> > +		list_del_init(&userptr->invalidate_link);
> >  		spin_unlock(&vm->userptr.invalidated_lock);
> >
> > +		mmap_read_lock(userptr->notifier.mm);
> > +		/**FIXME: Add migration policy here*/
> > +		if (xe_vma_is_fault_userptr(vma))
> > +			xe_svm_migrate_vma_to_vram(vm, vma, tile);
> 
> Agree we need a policy here...
> 
> See my comments about locking in [1] thinking if we migrate we likely
> want to hold the mmap lock until at least the bind being issued to
> prevent races with the CPU fault handler, at least initially.

As explained in [1], I will continue to use the mmap read lock for now. The read lock only covers the migration and userptr page pinning, but not the bind. Bind correctness is guaranteed by a retry mechanism and the userptr's notifier_lock.

> 
> [1] https://patchwork.freedesktop.org/patch/588542/?series=132229&rev=1
> 
> >  		ret = xe_vma_userptr_pin_pages(uvma);
> > +		mmap_read_unlock(userptr->notifier.mm);
> >  		if (ret)
> >  			goto unlock_vm;
> >
> > diff --git a/drivers/gpu/drm/xe/xe_vm.c b/drivers/gpu/drm/xe/xe_vm.c
> > index 498b36469d00..8a58fe144a02 100644
> > --- a/drivers/gpu/drm/xe/xe_vm.c
> > +++ b/drivers/gpu/drm/xe/xe_vm.c
> > @@ -71,16 +71,12 @@ int xe_vma_userptr_pin_pages(struct
> xe_userptr_vma *uvma)
> >  	struct xe_vma *vma = &uvma->vma;
> >  	struct xe_vm *vm = xe_vma_vm(vma);
> >  	struct xe_device *xe = vm->xe;
> > -	struct xe_userptr *userptr;
> >  	int ret;
> >
> >  	lockdep_assert_held(&vm->lock);
> >  	xe_assert(xe, xe_vma_is_userptr(vma));
> >
> > -	userptr = &uvma->userptr;
> > -	mmap_read_lock(userptr->notifier.mm);
> >  	ret = xe_userptr_populate_range(uvma);
> > -	mmap_read_unlock(userptr->notifier.mm);
> 
> Now you won't have the lock here other callers of this function...
> Probably need to have locked / unlocked version or arguments here.

This is addressed in v3.

Oak

> 
> Matt
> 
> >
> >  	return ret;
> >  }
> > --
> > 2.26.3
> >

^ permalink raw reply	[flat|nested] 72+ messages in thread

* Re: [v2 28/31] drm/xe/svm: Introduce helper to migrate vma to vram
  2024-06-07 17:12         ` Zeng, Oak
@ 2024-06-07 17:56           ` Matthew Brost
  2024-06-07 18:10             ` Matthew Brost
  0 siblings, 1 reply; 72+ messages in thread
From: Matthew Brost @ 2024-06-07 17:56 UTC (permalink / raw)
  To: Zeng, Oak
  Cc: Bommu, Krishnaiah, intel-xe@lists.freedesktop.org,
	Ghimiray,  Himal Prasad, Thomas.Hellstrom@linux.intel.com,
	Welty, Brian

On Fri, Jun 07, 2024 at 11:12:00AM -0600, Zeng, Oak wrote:
> Hi Matt,
> 
> > -----Original Message-----
> > From: Brost, Matthew <matthew.brost@intel.com>
> > Sent: Monday, April 15, 2024 3:40 PM
> > To: Zeng, Oak <oak.zeng@intel.com>
> > Cc: intel-xe@lists.freedesktop.org; Ghimiray, Himal Prasad
> > <himal.prasad.ghimiray@intel.com>; Bommu, Krishnaiah
> > <krishnaiah.bommu@intel.com>; Thomas.Hellstrom@linux.intel.com; Welty,
> > Brian <brian.welty@intel.com>
> > Subject: Re: [v2 28/31] drm/xe/svm: Introduce helper to migrate vma to
> > vram
> > 
> > On Fri, Apr 12, 2024 at 03:21:04PM -0600, Zeng, Oak wrote:
> > >
> > >
> > > > -----Original Message-----
> > > > From: Brost, Matthew <matthew.brost@intel.com>
> > > > Sent: Wednesday, April 10, 2024 10:49 PM
> > > > To: Zeng, Oak <oak.zeng@intel.com>
> > > > Cc: intel-xe@lists.freedesktop.org; Ghimiray, Himal Prasad
> > > > <himal.prasad.ghimiray@intel.com>; Bommu, Krishnaiah
> > > > <krishnaiah.bommu@intel.com>; Thomas.Hellstrom@linux.intel.com;
> > Welty,
> > > > Brian <brian.welty@intel.com>
> > > > Subject: Re: [v2 28/31] drm/xe/svm: Introduce helper to migrate vma to
> > vram
> > > >
> > > > On Tue, Apr 09, 2024 at 04:17:39PM -0400, Oak Zeng wrote:
> > > > > Introduce a helper function xe_svm_migrate_vma_to_vram.
> > > > >
> > > > > Since the source pages of the svm range can be physically not
> > > > > contiguous, and the destination vram pages can also be not
> > > > > contiguous, there is no easy way to migrate multiple pages per
> > > > > blitter command. We do page by page migration for now.
> > > > >
> > > > > Migration is best effort. Even if we fail to migrate some pages,
> > > > > we will try to migrate the rest pages.
> > > > >
> > > > > FIXME: Use one blitter command to copy when both src and dst are
> > > > > physically contiguous
> > > > >
> > > >
> > > > Yep, touch in this throughout the series. Only vram needs to be
> > > > contiguous though as we dynamically create PT mappings for sram pages
> > in
> > > > the migrate code. Getting this in a must and should be done immediately
> > > > IMO as this a very, very basic perform thing we know needs to be done.
> > > > We will likely have to tune this code quite a bit for performance so
> > > > getting known things done would be helpful.
> > > >
> > > > > FIXME: when a vma is partially migrated, split vma as we assume
> > > > > no mixture vma placement.
> > > > >
> > > >
> > > > Agree we do not want support partial migrations. We likely want to
> > > > return -EAGAIN for something and fault back to a smaller xe_vma chunk
> > > > size which I discussed in [1] and add comment on in [2].
> 
> 
> I thought about this when I worked on v3. I ended up keeping the xe_vma
> unsplit, but migrating a vma partially. A vma can have mixed backing
> store now... To support this, I had to make the page table update and
> invalidation range-based (vs. whole-xe_vma based). I am testing it; not
> very sure how well this goes.
> 

A few things. I strongly oppose mixed storage within a single GPU
mapping. I view this as an unnecessary complication.

A GPU mapping is not a VMA - it is a subset of a VMA. It has been agreed
upon that we have a 1:N relationship in SVM, with '1' being a GPU VMA which
reflects user IOCTLs (e.g. the initial large bind, or madvise calls which can
split the VMA), and 'N' being mappings dynamically created upon GPU
page fault. This is an xe_svm_range (subclass of drm_svm_range) in my PoC.
My expectation is that any post has something similar to this agreed-upon
relationship implemented.

Here my view is that the 'N' is never mixed storage.

See my comments about 'page granularity' migration in 'implement
functions to allocate and free device memory'. If you are doing page
granularity migration, then 'N' is exactly 1 CPU page (could be a THP),
thus you never have a case where partial migration or mixed storage is
needed. Any splitting / mixed storage is by definition not page
granularity.

If we don't fully support page granularity, I think the granularity
for any operation is the allocation size of 'N'. e.g. upon CPU fault, CPU
unmap, or evict, 'N' is fully moved, invalidated, and GPU unmapped. This
simplifies the code quite a bit. See 'Migration', 'Partial Unmapping of
Ranges', and 'Examples' in drm_gpusvm.c. Of course I'm open to other ideas
here, but for anything more complicated than what I'm suggesting,
expect push back.

> One thing I realized that might not work well is the range-based invalidation. Say we have a pde entry covering a huge 2M or 1G page. If we only zap 4k in this range, we will need to split the pde entry and replace it with many pde/pte entries. Even though the zap_ptes/page table walk functions are range-based, I doubt it works for such a case. I will have to test it out. I think even if it doesn't work, it is worth making it work as an improvement area. For example, core mm obviously supports such a function, see zap_pte_range.
>

If we have GPU page table entries larger than the CPU pages, then we are
not doing 'page granularity'.

This is a large source of confusion for me, as you seem to have conflicting
ideas from reply to reply. Do you plan on supporting 'page granularity' or
not? This is unclear to me.

If we don't support 'page granularity', then see above for how I
expect the impedance mismatch between the CPU page size and 'N' to be
handled. That applies here as well.

> 
> 
> > > >
> > > > Migration should be best case too if we fail migrate we can always leave
> > > > backing store in sram too.
> > > >
> > > > I do have question though, when do we get partial migrations? A user
> > > > having called mlock on some of the pages? I just want to make sure I
> > > > fully understand that case.
> > >
> > > Yah, mlock could be one case...
> > >
> > > I also looked the hmm code. There are few other cases where
> > MIGRATE_PFN_MIGRATE is not set (so we skip migration after), such as pte
> > is NULL and vma is file-backed (not anonymous); entry is swapped out to
> > hard disk etc. see more details in function migrate_vma_collect_pmd.
> > >
> > >
> > > >
> > > > [1]
> > https://patchwork.freedesktop.org/patch/588526/?series=132229&rev=1
> > > > [2]
> > https://patchwork.freedesktop.org/patch/588528/?series=132229&rev=1
> > > >
> > > > > Signed-off-by: Oak Zeng <oak.zeng@intel.com>
> > > > > Co-developed-by: Niranjana Vishwanathapura
> > > > <niranjana.vishwanathapura@intel.com>
> > > > > Signed-off-by: Niranjana Vishwanathapura
> > > > <niranjana.vishwanathapura@intel.com>
> > > > > Cc: Matthew Brost <matthew.brost@intel.com>
> > > > > Cc: Thomas Hellström <thomas.hellstrom@intel.com>
> > > > > Cc: Brian Welty <brian.welty@intel.com>
> > > > > ---
> > > > >  drivers/gpu/drm/xe/xe_svm.h         |   2 +
> > > > >  drivers/gpu/drm/xe/xe_svm_migrate.c | 115
> > > > ++++++++++++++++++++++++++++
> > > >
> > > > Same comment on file structure throughout the series apply here too.
> > > >
> > > > >  2 files changed, 117 insertions(+)
> > > > >
> > > > > diff --git a/drivers/gpu/drm/xe/xe_svm.h
> > b/drivers/gpu/drm/xe/xe_svm.h
> > > > > index c9e4239c44b4..18ce2e3757c5 100644
> > > > > --- a/drivers/gpu/drm/xe/xe_svm.h
> > > > > +++ b/drivers/gpu/drm/xe/xe_svm.h
> > > > > @@ -83,4 +83,6 @@ int xe_devm_alloc_pages(struct xe_tile *tile,
> > > > >  void xe_devm_free_blocks(struct list_head *blocks);
> > > > >  void xe_devm_page_free(struct page *page);
> > > > >  vm_fault_t xe_svm_migrate_to_sram(struct vm_fault *vmf);
> > > > > +int xe_svm_migrate_vma_to_vram(struct xe_vm *vm, struct xe_vma
> > *vma,
> > > > > +							struct xe_tile *tile);
> > > > >  #endif
> > > > > diff --git a/drivers/gpu/drm/xe/xe_svm_migrate.c
> > > > b/drivers/gpu/drm/xe/xe_svm_migrate.c
> > > > > index 0db831af098e..ab8dd1f58aa4 100644
> > > > > --- a/drivers/gpu/drm/xe/xe_svm_migrate.c
> > > > > +++ b/drivers/gpu/drm/xe/xe_svm_migrate.c
> > > > > @@ -220,3 +220,118 @@ vm_fault_t xe_svm_migrate_to_sram(struct
> > > > vm_fault *vmf)
> > > > >  	kvfree(buf);
> > > > >  	return 0;
> > > > >  }
> > > > > +
> > > > > +/**
> > > > > + * xe_svm_migrate_vma_to_vram() - migrate backing store of a vma
> > to
> > > > vram
> > > > > + * Must be called with mmap_read_lock held.
> > > > > + * @vm: the vm that the vma belongs to
> > > > > + * @vma: the vma to migrate.
> > > > > + * @tile: the destination tile which holds the new backing store of the
> > range
> > > > > + *
> > > > > + * Returns: negative errno on faiure, 0 on success
> > > > > + */
> > > > > +int xe_svm_migrate_vma_to_vram(struct xe_vm *vm,
> > > > > +							struct xe_vma *vma,
> > > > > +							struct xe_tile *tile)
> > > > > +{
> > > > > +	struct mm_struct *mm = vm->mm;
> > > > > +	unsigned long start = xe_vma_start(vma);
> > > > > +	unsigned long end = xe_vma_end(vma);
> > > > > +	unsigned long npages = (end - start) >> PAGE_SHIFT;
> > > > > +	struct xe_mem_region *mr = &tile->mem.vram;
> > > > > +	struct vm_area_struct *vas;
> > > > > +
> > > > > +	struct migrate_vma migrate = {
> > > > > +		.start		= start,
> > > > > +		.end		= end,
> > > > > +		.pgmap_owner	= tile->xe,
> > > >
> > > > Again helper to assign owner.
> > > >
> > > > > +		.flags          = MIGRATE_VMA_SELECT_SYSTEM,
> > > > > +	};
> > > > > +	struct device *dev = tile->xe->drm.dev;
> > > > > +	dma_addr_t *src_dma_addr;
> > > > > +	struct dma_fence *fence;
> > > > > +	struct page *src_page;
> > > > > +	LIST_HEAD(blocks);
> > > > > +	int ret = 0, i;
> > > > > +	u64 dst_dpa;
> > > > > +	void *buf;
> > > > > +
> > > > > +	mmap_assert_locked(mm);
> > > >
> > > > This mmap_assert_locked is ambiguous, we should make it clear if this
> > > > read or write locked. Doesn't it have to be write to do the migrate
> > > > pages?
> > >
> > > I followed hmm document (Documents/mm/hmm.rst), see section
> > "Migration to and from device memory". It explicitly write a read_lock in this
> > document.
> > >
> > > I believe a read_lock is enough for the
> > migrate_vma_setup/migrate_vma_finalize().
> > >
> > > As I understand it, the mm.mmap_lock protect the process's address space.
> > When we modify process's address space such as mmap/munmap, we need
> > to hold a write mode lock; if we only read process's address space, such as in
> > the migrate_vma_setup/finalize, or in the cpu page fault handler case, we
> > only need a read mode lock.
> > >
> > > I also cross checked amd driver. They also use a read lock.. see function
> > svm_range_restore_pages in kfd_svm.c....
> > >
> > 
> > Yea, I see that too. Trying to figure out the locking, IMO the locking
> > document might actually be wrong, or the very least the locking design
> > is very ill-conceived. We can discuss internally a bit before I
> > publically share my grievances.
> > 
> > >
> > > >
> > > > A larger question about the locking... The CPU fault handler holds the
> > > > mmap lock in write mode, right?
> > >
> > > No. since cpu fault handler doesn't modify process address space, instead it
> > only fill up cpu page table for some valid address range, a read lock is enough.
> > >
> > 
> > Ah, yes after digging around a bit I see this.
> > 
> > > >
> > > > I'm asking as basically I think at least initially we want to hold the
> > > > mmap lock in a way that the GPU handler and CPU handler do not race.
> > > > i.e. From fault userptr create in GPU fault handler to issuing the bind
> > > > we prevent the CPU fault handler from running.
> > >
> > > Yes we hold mmap_read_lock in both cpu and gpu fault handler to avoid
> > that race.
> > >
> > 
> > That's not how rw lock work. 2 threads can both hold the read lock in
> > parallel (shared read access), only 1 thread hold the write lock
> > (exclusive write access, no one can hold read lock either). Thus my
> > concern about the cpu and gpu fault handler running in parallel and the
> > larger locking design questions. Again we can talk through this in
> > detail internally a bit.
> > 
> 
> 
> It is a great question of the gpu/cpu handler race. Obviously my above thinking is wrong. Here is my new understanding: 
> 
> think about this scenario: address backing store is in vram - cpu access - cpu fault- migrate_vma_setup - invalidate gpu pt  - gpu access same address - gpu fault - migrate_vma_setup triggered by gpu fault - this read back pfn of vram backing store because cpu fault is not finished/migrate_vma_finalized is not called yet, but since gpu side would only migrate memory in sram, so migrate_vma_setup from gpu side wouldn't return any valid pfn to migrate and gpu page fault handler would return successfully and gpu hw execution is resumed but immediately trigger another gpu page fault, until cpu side migration is done. 
>

This is hard to follow, especially given your comments don't wrap at 80
columns or so. Please use an email client which wraps. I don't think your
reasoning is right.

I have worked this out and landed on: the HMM locking documentation is
wrong, or at least doesn't work unless a notifier is installed upon every
GPU fault. Basically the GPU handler takes the mmap lock in write mode to
work around this (a notifier installation takes mmap in write mode, which
is why I reasoned the HMM locking doc suggestion works in some cases).
Long term this will need some deep thought to fix. It might be an extra
MMU invalidation event upon migrate_vma_finalize or something... Again,
see my PoC; it has comments in the code explaining this.

> 
> Similar when gpu fault happens during the other stage of cpu fault handler. Same thing when CPU fault handler interrupt a gpu fault handler.
> 
> This seems work to me but we need to test it out. We will add a test case for this race condition. @Bommu, Krishnaiah fyi.
>

I have a test for this, see race sections in [1].

[1] https://patchwork.freedesktop.org/series/133846/
 
> > > In user mmap/munmap (such as kernel function vm_munmap), we hold a
> > mmap_write_lock which prevent it racing with cpu and gpu fault handler.
> > >
> > >
> > > >
> > > > I'm having issues figuring out how to prevent races between initial
> > > > binds of userptrs and userptr invalidates on faulting VMs. This race is
> > > > seen any xe_exec_fault_mode for example... So preventing races
> > between
> > > > CPU / GPU fault handler with the mmap probably is a good idea initially.
> > > > Likely can make the locking finer grained once this is all working and I
> > > > figure out how to handle this race better.
> > >
> > >
> > > I don't quite follow here.
> > >
> > > Initial bind of user ptr... if you meant the bind in gpu page fault handler,
> > then the racing with invalidation is roughly like below:
> > > Invalidation is called with mmap_write_lock
> > 
> > Is it? If the notifier does the invalidation via migrate_vma_setup in the
> > CPU fault handler, we established above that only the mmap_read_lock is
> > held.
> 
> Yes, the cpu fault triggered migrate_vma_setup/invalidation only holds the read_lock. 
> 
> It was the munmap triggered invalidation I had in mind. For this case, we hold the mmap_write_lock during invalidation, so it is mutually exclusive with gpu triggered migration.
> 
> But a cpu fault triggered invalidation can proceed during a gpu fault triggered migration. As explained above, it causes an extra gpu page table invalidation, which is fine because the consequence is just an extra gpu page fault (and later migration). So functionally I don't see a problem with this scheme.
> 
> The main reason we hold a read lock in the gpu fault handler is that the gpu fault handler doesn't modify the process address space; it only traverses the cpu page table.
>

Again not following, but I am quite sure your understanding is not correct. See
above.
 
> > 
> > > In userptr_pin_page, we hold a mmap_read_lock, so we know during pin
> > page, invalidation is excluded.
> > 
> > Nope, see above: invalidation can happen while userptr_pin_page is
> > executing because of the read lock. The seqno check (described below) is
> > what prevents programming of bad page tables.
> 
> You are right.
> 
> 
> > 
> > > After pin, before programming gpu page table, we check whether there is
> > invalidation happen *after last pin but before programming page table*, if
> > that happened, we retry
> > >
> > 
> > Yes, that is how it works on tip but I am refactoring it in [1]. I was
> > trying to avoid the retry loop by turning PDE/PTE writes into clears if an
> > invalidation happened, but not sure if that works without a larger
> > refactor due to the nature of PDEs being shared. I may need the retry loop but
> > that also gets tricky with arrays of binds... A few options here and I will
> > land on a solution once [2] is merged.
> 
> I discussed this retry loop (I refer to the retry loop in the xe_pt_userptr_pre_commit function before your change) with Thomas some time ago. I agree the retry loop scheme works.
> 
> But to me a scheme without the loop also works. As long as we have a retry check before programming the gpu page table, it is fine to me. 
> 
> If you have a further fix in this area, I will keep watching and pick it up. 
>

I have this properly fixed now in [2]. Faulting VMs will return with
-EAGAIN if the seqno check fails in a bind. It will be up to the page
fault handler to retry. In the case of user binds hitting this race, I
currently just kick the -EAGAIN back to userspace to retry. I could
change the latter if people disagree and retry in the kernel too.

[2] https://patchwork.freedesktop.org/series/133034/.
 
> > 
> > Regardless my point here is still valid - it may not be the worst idea
> > when getting this code initially up and running just to grab
> > mmap_write_lock in GPU fault handler as a BLK (big kernel lock) to
> > prevent all races. Once the code is stable and stress testing is in place,
> > switch to finer grained locking as defined in the HMM document (or newly
> > defined if we find the locking design is insufficient).
> 
> As explained, I will keep the read-lock for now. And we will test it. If we do run into a race condition, I will try the write-lock and come back.
>

See above, you will hit a race.

I strongly suggest you check out my working PoC and IGT and play around
with that code. If you change from write lock to read lock in the GPU
handler around VRAM migration and hmm range fault, the 'race' sections
fail.

Matt
 
> Thanks a lot for the great questions Matt! It is very helpful!
> 
> 
> Oak
> 
> > 
> > Matt
> > 
> > [1] https://patchwork.freedesktop.org/series/125608/
> > [2] https://patchwork.freedesktop.org/series/132246/
> > 
> > >
> > >
> > > Oak
> > >
> > > >
> > > > > +
> > > > > +	vas = find_vma_intersection(mm, start, start + 4);
> > > >
> > > > find_vma should work fine here.
> > > >
> > > > > +	if (!vas)
> > > > > +		return -ENOENT;
> > > > > +
> > > > > +	migrate.vma = vas;
> > > > > +	buf = kvcalloc(npages, 2* sizeof(*migrate.src) +
> > sizeof(*src_dma_addr),
> > > > > +					GFP_KERNEL);
> > > > > +	if(!buf)
> > > > > +		return -ENOMEM;
> > > > > +	migrate.src = buf;
> > > > > +	migrate.dst = migrate.src + npages;
> > > > > +	src_dma_addr = (dma_addr_t *) (migrate.dst + npages);
> > > > > +	ret = xe_devm_alloc_pages(tile, npages, &blocks, migrate.dst);
> > > >
> > > > Again as I discussed in [3] I think this should be broken out into a
> > > > different step with the blocks allocated before this, and here just
> > > > populate migrate.dst from the existing blocks.
> > > >
> > > > [3]
> > https://patchwork.freedesktop.org/patch/588523/?series=132229&rev=1
> > > >
> > > > > +	if (ret)
> > > > > +		goto kfree_buf;
> > > > > +
> > > > > +	ret = migrate_vma_setup(&migrate);
> > > > > +	if (ret) {
> > > > > +		drm_err(&tile->xe->drm, "vma setup returned %d for range
> > > > [%lx - %lx]\n",
> > > > > +				ret, start, end);
> > > > > +		goto free_dst_pages;
> > > > > +	}
> > > > > +
> > > > > +	/**FIXME: partial migration of a range print a warning for now.
> > > > > +	 * If this message is printed, we need to split xe_vma as we
> > > > > +	 * don't support a mixture placement of one vma
> > > > > +	 */
> > > > > +	if (migrate.cpages != npages)
> > > > > +		drm_warn(&tile->xe->drm, "Partial migration for range [%lx -
> > > >  %lx], range is %ld pages, migrate only %ld pages\n",
> > > > > +				start, end, npages, migrate.cpages);
> > > >
> > > > As discussed above, we shouldn't support this. We should fall back to
> > > > smaller xe_vma chunk size until we find one that works or simply leave
> > > > the pages in sram and map those pages to GPU.
> > > >
> > > > > +
> > > > > +	/**Migrate page by page for now.
> > > > > +	 * Both source pages and destination pages can physically not
> > > > contiguous,
> > > > > +	 * there is no good way to migrate multiple pages per blitter
> > command.
> > > > > +	 */
> > > >
> > > > Touched on this a bunch throughout the series, let's do better than a
> > > > page-at-a-time migration.
> > > >
> > > > Algorithm should be very similar to what I discussed here [4] but with a
> > > > few key differences.
> > > >
> > > > - I think the sram pages can be unpopulated (page == NULL) if the user
> > > >   has not yet touched the page
> > > > - Also I think the MIGRATE_PFN_MIGRATE bit being clear is valid
> > > >
> > > > In these cases this indicates we have to issue a copy for the pages we
> > > > are accumulating with contiguous vram addresses.
> > > >
> > > > [4]
> > https://patchwork.freedesktop.org/patch/588526/?series=132229&rev=1
> > > >
> > > > > +	for (i = 0; i < npages; i++) {
> > > > > +		src_page = migrate_pfn_to_page(migrate.src[i]);
> > > > > +		if (unlikely(!src_page || !(migrate.src[i] &
> > > > MIGRATE_PFN_MIGRATE)))
> > > >
> > > > Discussed this in the CPU fault patch: once we call migrate_vma_setup,
> > > > on subsequent errors we need to call migrate_vma_finalize to revert the
> > > > pages to the original state. At least I think so, if I am reading the doc
> > > > correctly.
> > > >
> > > > Here on error we just free the pages...
> > > >
> > > > Matt
> > > >
> > > > > +			goto free_dst_page;
> > > > > +
> > > > > +		xe_assert(tile->xe, !is_zone_device_page(src_page));
> > > > > +		src_dma_addr[i] = dma_map_page(dev, src_page, 0,
> > > > PAGE_SIZE, DMA_TO_DEVICE);
> > > > > +		if (unlikely(dma_mapping_error(dev, src_dma_addr[i]))) {
> > > > > +			drm_warn(&tile->xe->drm, "dma map error for host
> > > > pfn %lx\n", migrate.src[i]);
> > > > > +			goto free_dst_page;
> > > > > +		}
> > > > > +		dst_dpa = xe_mem_region_pfn_to_dpa(mr, migrate.dst[i]);
> > > > > +		fence = xe_migrate_pa(tile->migrate, src_dma_addr[i], false,
> > > > > +				dst_dpa, true, PAGE_SIZE);
> > > > > +		if (IS_ERR(fence)) {
> > > > > +			drm_warn(&tile->xe->drm, "migrate host page
> > > > (pfn: %lx) to vram failed\n",
> > > > > +					migrate.src[i]);
> > > > > +			/**Migration is best effort. Even we failed here, we
> > > > continue*/
> > > > > +			goto free_dst_page;
> > > > > +		}
> > > > > +		/**FIXME: Use the first migration's out fence as the second
> > > > migration's input fence,
> > > > > +		 * and so on. Only wait the out fence of last migration?
> > > > > +		 */
> > > > > +		dma_fence_wait(fence, false);
> > > > > +		dma_fence_put(fence);
> > > > > +free_dst_page:
> > > > > +		xe_devm_page_free(pfn_to_page(migrate.dst[i]));
> > > > > +	}
> > > > > +
> > > > > +	for (i = 0; i < npages; i++)
> > > > > +		if (!(dma_mapping_error(dev, src_dma_addr[i])))
> > > > > +			dma_unmap_page(dev, src_dma_addr[i], PAGE_SIZE,
> > > > DMA_TO_DEVICE);
> > > > > +
> > > > > +	migrate_vma_pages(&migrate);
> > > > > +	migrate_vma_finalize(&migrate);
> > > > > +free_dst_pages:
> > > > > +	if (ret)
> > > > > +		xe_devm_free_blocks(&blocks);
> > > > > +kfree_buf:
> > > > > +	kfree(buf);
> > > > > +	return ret;
> > > > > +}
> > > > > --
> > > > > 2.26.3
> > > > >


* Re: [v2 28/31] drm/xe/svm: Introduce helper to migrate vma to vram
  2024-06-07 17:56           ` Matthew Brost
@ 2024-06-07 18:10             ` Matthew Brost
  0 siblings, 0 replies; 72+ messages in thread
From: Matthew Brost @ 2024-06-07 18:10 UTC (permalink / raw)
  To: Zeng, Oak
  Cc: Bommu, Krishnaiah, intel-xe@lists.freedesktop.org,
	Ghimiray,  Himal Prasad, Thomas.Hellstrom@linux.intel.com,
	Welty, Brian

On Fri, Jun 07, 2024 at 05:56:16PM +0000, Matthew Brost wrote:
> On Fri, Jun 07, 2024 at 11:12:00AM -0600, Zeng, Oak wrote:
> > Hi Matt,
> > 
> > > -----Original Message-----
> > > From: Brost, Matthew <matthew.brost@intel.com>
> > > Sent: Monday, April 15, 2024 3:40 PM
> > > To: Zeng, Oak <oak.zeng@intel.com>
> > > Cc: intel-xe@lists.freedesktop.org; Ghimiray, Himal Prasad
> > > <himal.prasad.ghimiray@intel.com>; Bommu, Krishnaiah
> > > <krishnaiah.bommu@intel.com>; Thomas.Hellstrom@linux.intel.com; Welty,
> > > Brian <brian.welty@intel.com>
> > > Subject: Re: [v2 28/31] drm/xe/svm: Introduce helper to migrate vma to
> > > vram
> > > 
> > > On Fri, Apr 12, 2024 at 03:21:04PM -0600, Zeng, Oak wrote:
> > > >
> > > >
> > > > > -----Original Message-----
> > > > > From: Brost, Matthew <matthew.brost@intel.com>
> > > > > Sent: Wednesday, April 10, 2024 10:49 PM
> > > > > To: Zeng, Oak <oak.zeng@intel.com>
> > > > > Cc: intel-xe@lists.freedesktop.org; Ghimiray, Himal Prasad
> > > > > <himal.prasad.ghimiray@intel.com>; Bommu, Krishnaiah
> > > > > <krishnaiah.bommu@intel.com>; Thomas.Hellstrom@linux.intel.com;
> > > Welty,
> > > > > Brian <brian.welty@intel.com>
> > > > > Subject: Re: [v2 28/31] drm/xe/svm: Introduce helper to migrate vma to
> > > vram
> > > > >
> > > > > On Tue, Apr 09, 2024 at 04:17:39PM -0400, Oak Zeng wrote:
> > > > > > Introduce a helper function xe_svm_migrate_vma_to_vram.
> > > > > >
> > > > > > Since the source pages of the svm range can be physically not
> > > > > > contiguous, and the destination vram pages can also be not
> > > > > > contiguous, there is no easy way to migrate multiple pages per
> > > > > > blitter command. We do page by page migration for now.
> > > > > >
> > > > > > Migration is best effort. Even if we fail to migrate some pages,
> > > > > > we will try to migrate the rest pages.
> > > > > >
> > > > > > FIXME: Use one blitter command to copy when both src and dst are
> > > > > > physically contiguous
> > > > > >
> > > > >
> > > > > Yep, touched on this throughout the series. Only vram needs to be
> > > > > contiguous though, as we dynamically create PT mappings for sram pages in
> > > > > the migrate code. Getting this in is a must and should be done immediately
> > > > > IMO as this is a very, very basic performance thing we know needs to be done.
> > > > > We will likely have to tune this code quite a bit for performance, so
> > > > > getting known things done would be helpful.
> > > > >
> > > > > > FIXME: when a vma is partially migrated, split vma as we assume
> > > > > > no mixture vma placement.
> > > > > >
> > > > >
> > > > > Agree we do not want to support partial migrations. We likely want to
> > > > > return -EAGAIN for something and fall back to a smaller xe_vma chunk
> > > > > size, which I discussed in [1] and added comments on in [2].
> > 
> > 
> > I thought about this when I worked on v3. I ended up keeping the xe_vma unsplit, but migrating part of a vma. A vma can have mixed backing store now... To support this, I had to make the page table update and invalidation range based (vs whole xe_vma based). I am testing it. Not very sure how well this goes.
> > 
> 
> A few things, I strongly oppose mixed storage within a single GPU
> mapping. I view this as an unnecessary complication.
> 
> A GPU mapping is not a VMA - it is a subset of a VMA. It has been agreed
> upon that we have a 1:N relationship in SVM. With '1' being a GPU VMA which
> reflects user IOCTLs (e.g. initial large bind, madvise calls which can
> split the VMA). The 'N' being mappings dynamically created upon GPU
> page fault. This is an xe_svm_range (subclass of drm_svm_range) in my PoC.
> My expectation is that any post has something similar to this agreed
> upon relationship implemented.
> 
> Here my view is the 'N' never is mixed storage.
> 
> See my comments about 'page granularity' migration in 'implement
> functions to allocate and free device memory'. If you are doing page
> granularity migration, then 'N' is exactly 1 CPU page (could be a THP),
> thus you never have a case where partial migrations or mixed storage is
> needed. Any splitting / mixed storage by definition is not page
> granularity.
> 
> If we don't fully support page granularity, I think the granularity
> for any operation is the allocation size of 'N'. e.g. upon CPU fault, CPU
> unmap, or evict, 'N' is fully moved, invalidated, or GPU unmapped. This
> simplifies the code quite a bit. See 'Migration', 'Partial Unmapping of
> Ranges', and 'Examples' in drm_gpusvm.c. Of course I'm open to other ideas
> here, but for anything more complicated than what I'm suggesting,
> expect push back.
> 
> > One thing I realized might not work well is the range based invalidation. Say we have a pde entry covering a huge 2M or 1G page. If we only zap 4k in this range, we will need to split the pde entry and replace it with many pde/pte entries. Even though the zap_ptes/page table walk functions are range based, I doubt they work for such a case. I will have to test it out. I think even if it doesn't work, it is worth making it work as an improvement area. For example, core mm obviously supports such a function; see zap_pte_range.
> >
> 
> If we have GPU page tables larger than the CPU pages, then we are not
> doing 'page granularity'.
> 
> This is a large source of confusion for me, as you seem to have conflicting
> ideas from reply to reply. Do you plan on supporting 'page granularity' or
> not? This is unclear to me.
> 
> If we don't support 'page granularity', then see above for how I
> expect an impedance mismatch between the CPU page size and 'N'. That
> applies here as well.
> 
> > 
> > 
> > > > >
> > > > > Migration should be best case too if we fail migrate we can always leave
> > > > > backing store in sram too.
> > > > >
> > > > > I do have question though, when do we get partial migrations? A user
> > > > > having called mlock on some of the pages? I just want to make sure I
> > > > > fully understand that case.
> > > >
> > > > Yah, mlock could be one case...
> > > >
> > > > I also looked the hmm code. There are few other cases where
> > > MIGRATE_PFN_MIGRATE is not set (so we skip migration after), such as pte
> > > is NULL and vma is file-backed (not anonymous); entry is swapped out to
> > > hard disk etc. see more details in function migrate_vma_collect_pmd.
> > > >
> > > >
> > > > >
> > > > > [1]
> > > https://patchwork.freedesktop.org/patch/588526/?series=132229&rev=1
> > > > > [2]
> > > https://patchwork.freedesktop.org/patch/588528/?series=132229&rev=1
> > > > >
> > > > > > Signed-off-by: Oak Zeng <oak.zeng@intel.com>
> > > > > > Co-developed-by: Niranjana Vishwanathapura
> > > > > <niranjana.vishwanathapura@intel.com>
> > > > > > Signed-off-by: Niranjana Vishwanathapura
> > > > > <niranjana.vishwanathapura@intel.com>
> > > > > > Cc: Matthew Brost <matthew.brost@intel.com>
> > > > > > Cc: Thomas Hellström <thomas.hellstrom@intel.com>
> > > > > > Cc: Brian Welty <brian.welty@intel.com>
> > > > > > ---
> > > > > >  drivers/gpu/drm/xe/xe_svm.h         |   2 +
> > > > > >  drivers/gpu/drm/xe/xe_svm_migrate.c | 115
> > > > > ++++++++++++++++++++++++++++
> > > > >
> > > > > Same comment on file structure throughout the series apply here too.
> > > > >
> > > > > >  2 files changed, 117 insertions(+)
> > > > > >
> > > > > > diff --git a/drivers/gpu/drm/xe/xe_svm.h
> > > b/drivers/gpu/drm/xe/xe_svm.h
> > > > > > index c9e4239c44b4..18ce2e3757c5 100644
> > > > > > --- a/drivers/gpu/drm/xe/xe_svm.h
> > > > > > +++ b/drivers/gpu/drm/xe/xe_svm.h
> > > > > > @@ -83,4 +83,6 @@ int xe_devm_alloc_pages(struct xe_tile *tile,
> > > > > >  void xe_devm_free_blocks(struct list_head *blocks);
> > > > > >  void xe_devm_page_free(struct page *page);
> > > > > >  vm_fault_t xe_svm_migrate_to_sram(struct vm_fault *vmf);
> > > > > > +int xe_svm_migrate_vma_to_vram(struct xe_vm *vm, struct xe_vma
> > > *vma,
> > > > > > +							struct xe_tile *tile);
> > > > > >  #endif
> > > > > > diff --git a/drivers/gpu/drm/xe/xe_svm_migrate.c
> > > > > b/drivers/gpu/drm/xe/xe_svm_migrate.c
> > > > > > index 0db831af098e..ab8dd1f58aa4 100644
> > > > > > --- a/drivers/gpu/drm/xe/xe_svm_migrate.c
> > > > > > +++ b/drivers/gpu/drm/xe/xe_svm_migrate.c
> > > > > > @@ -220,3 +220,118 @@ vm_fault_t xe_svm_migrate_to_sram(struct
> > > > > vm_fault *vmf)
> > > > > >  	kvfree(buf);
> > > > > >  	return 0;
> > > > > >  }
> > > > > > +
> > > > > > +/**
> > > > > > + * xe_svm_migrate_vma_to_vram() - migrate backing store of a vma
> > > to
> > > > > vram
> > > > > > + * Must be called with mmap_read_lock held.
> > > > > > + * @vm: the vm that the vma belongs to
> > > > > > + * @vma: the vma to migrate.
> > > > > > + * @tile: the destination tile which holds the new backing store of the
> > > range
> > > > > > + *
> > > > > > + * Returns: negative errno on faiure, 0 on success
> > > > > > + */
> > > > > > +int xe_svm_migrate_vma_to_vram(struct xe_vm *vm,
> > > > > > +							struct xe_vma *vma,
> > > > > > +							struct xe_tile *tile)
> > > > > > +{
> > > > > > +	struct mm_struct *mm = vm->mm;
> > > > > > +	unsigned long start = xe_vma_start(vma);
> > > > > > +	unsigned long end = xe_vma_end(vma);
> > > > > > +	unsigned long npages = (end - start) >> PAGE_SHIFT;
> > > > > > +	struct xe_mem_region *mr = &tile->mem.vram;
> > > > > > +	struct vm_area_struct *vas;
> > > > > > +
> > > > > > +	struct migrate_vma migrate = {
> > > > > > +		.start		= start,
> > > > > > +		.end		= end,
> > > > > > +		.pgmap_owner	= tile->xe,
> > > > >
> > > > > Again helper to assign owner.
> > > > >
> > > > > > +		.flags          = MIGRATE_VMA_SELECT_SYSTEM,
> > > > > > +	};
> > > > > > +	struct device *dev = tile->xe->drm.dev;
> > > > > > +	dma_addr_t *src_dma_addr;
> > > > > > +	struct dma_fence *fence;
> > > > > > +	struct page *src_page;
> > > > > > +	LIST_HEAD(blocks);
> > > > > > +	int ret = 0, i;
> > > > > > +	u64 dst_dpa;
> > > > > > +	void *buf;
> > > > > > +
> > > > > > +	mmap_assert_locked(mm);
> > > > >
> > > > > This mmap_assert_locked is ambiguous; we should make it clear if this
> > > > > is read or write locked. Doesn't it have to be write to do the migrate
> > > > > pages?
> > > >
> > > > I followed the hmm document (Documentation/mm/hmm.rst); see the section
> > > > "Migration to and from device memory". It explicitly uses a read_lock in this
> > > > document.
> > > >
> > > > I believe a read_lock is enough for the
> > > migrate_vma_setup/migrate_vma_finalize().
> > > >
> > > > As I understand it, the mm.mmap_lock protects the process's address space.
> > > When we modify process's address space such as mmap/munmap, we need
> > > to hold a write mode lock; if we only read process's address space, such as in
> > > the migrate_vma_setup/finalize, or in the cpu page fault handler case, we
> > > only need a read mode lock.
> > > >
> > > > I also cross-checked the amd driver. They also use a read lock; see the function
> > > svm_range_restore_pages in kfd_svm.c.
> > > >
> > > 
> > > Yea, I see that too. Trying to figure out the locking, IMO the locking
> > > document might actually be wrong, or at the very least the locking design
> > > is very ill-conceived. We can discuss internally a bit before I
> > > publicly share my grievances.
> > > 
> > > >
> > > > >
> > > > > A larger question about the locking... The CPU fault handler holds the
> > > > > mmap lock in write mode, right?
> > > >
> > > > No. Since the cpu fault handler doesn't modify the process address space, and instead
> > > > only fills in the cpu page table for some valid address range, a read lock is enough.
> > > >
> > > 
> > > Ah, yes after digging around a bit I see this.
> > > 
> > > > >
> > > > > I'm asking as basically I think at least initially we want to hold the
> > > > > mmap lock in a way that the GPU handler and CPU handler do not race.
> > > > > i.e. From fault userptr create in GPU fault handler to issuing the bind
> > > > > we prevent the CPU fault handler from running.
> > > >
> > > > Yes we hold mmap_read_lock in both cpu and gpu fault handler to avoid
> > > that race.
> > > >
> > > 
> > > That's not how rw locks work. Two threads can both hold the read lock in
> > > parallel (shared read access); only one thread can hold the write lock
> > > (exclusive write access, during which no one can hold the read lock either). Thus my
> > > concern about the cpu and gpu fault handler running in parallel and the
> > > larger locking design questions. Again we can talk through this in
> > > detail internally a bit.
> > > 
> > 
> > 
> > The gpu/cpu handler race is a great question. Obviously my above thinking was wrong. Here is my new understanding: 
> > 
> > think about this scenario: address backing store is in vram - cpu access - cpu fault- migrate_vma_setup - invalidate gpu pt  - gpu access same address - gpu fault - migrate_vma_setup triggered by gpu fault - this read back pfn of vram backing store because cpu fault is not finished/migrate_vma_finalized is not called yet, but since gpu side would only migrate memory in sram, so migrate_vma_setup from gpu side wouldn't return any valid pfn to migrate and gpu page fault handler would return successfully and gpu hw execution is resumed but immediately trigger another gpu page fault, until cpu side migration is done. 
> >
> 
> This is hard to follow, especially given your comments don't wrap at 80
> characters or so. Please use an email client which wraps. I don't think your
> reasoning is right.
> 
> I have worked this out and concluded the HMM locking documentation is wrong,
> or at least doesn't work unless a notifier is installed upon every GPU
> fault. Basically the GPU handler takes the mmap lock in write mode to work
> around this (a notifier installation takes the mmap lock in write mode, which
> is why I reasoned the HMM lock doc suggestion works in some cases). Long term
> this will need some deep thought to fix properly. It might be an extra
> MMU invalidation event upon migrate_vma_finalize or something... Again,
> see my PoC, it has comments in the code explaining this.
> 
> > 
> > Similarly when a gpu fault happens during the other stages of the cpu fault handler, and the same when the CPU fault handler interrupts a gpu fault handler.
> > 
> > This seems to work to me but we need to test it out. We will add a test case for this race condition. @Bommu, Krishnaiah fyi.
> >
> 
> I have a test for this, see race sections in [1].
> 
> [1] https://patchwork.freedesktop.org/series/133846/
>  
> > > > In user mmap/munmap (such as kernel function vm_munmap), we hold a
> > > mmap_write_lock which prevent it racing with cpu and gpu fault handler.
> > > >
> > > >
> > > > >
> > > > > I'm having issues figuring out how to prevent races between initial
> > > > > binds of userptrs and userptr invalidates on faulting VMs. This race is
> > > > > seen any xe_exec_fault_mode for example... So preventing races
> > > between
> > > > > CPU / GPU fault handler with the mmap probably is a good idea initially.
> > > > > Likely can make the locking finer grained once this is all working and I
> > > > > figure out how to handle this race better.
> > > >
> > > >
> > > > I don't quite follow here.
> > > >
> > > > Initial bind of user ptr... if you meant the bind in gpu page fault handler,
> > > then the racing with invalidation is roughly like below:
> > > > Invalidation is called with mmap_write_lock
> > > 
> > > Is it? If the notifier does the invalidation via migrate_vma_setup in the
> > > CPU fault handler, we established above that only the mmap_read_lock is
> > > held.
> > 
> > Yes, the cpu fault triggered migrate_vma_setup/invalidation only holds the read_lock. 
> > 
> > It was the munmap triggered invalidation I had in mind. For this case, we hold the mmap_write_lock during invalidation, so it is mutually exclusive with gpu triggered migration.
> > 
> > But a cpu fault triggered invalidation can proceed during a gpu fault triggered migration. As explained above, it causes an extra gpu page table invalidation, which is fine because the consequence is just an extra gpu page fault (and later migration). So functionally I don't see a problem with this scheme.
> > 
> > The main reason we hold a read lock in the gpu fault handler is that the gpu fault handler doesn't modify the process address space; it only traverses the cpu page table.
> >
> 
> Again not following, but I am quite sure your understanding is not correct. See
> above.
>  
> > > 
> > > > In userptr_pin_page, we hold a mmap_read_lock, so we know during pin
> > > page, invalidation is excluded.
> > > 
> > > Nope, see above: invalidation can happen while userptr_pin_page is
> > > executing because of the read lock. The seqno check (described below) is
> > > what prevents programming of bad page tables.
> > 
> > You are right.
> > 
> > 
> > > 
> > > > After pin, before programming gpu page table, we check whether there is
> > > invalidation happen *after last pin but before programming page table*, if
> > > that happened, we retry
> > > >
> > > 
> > > Yes, that is how it works on tip but I am refactoring it in [1]. I was
> > > trying to avoid the retry loop by turning PDE/PTE writes into clears if an
> > > invalidation happened, but not sure if that works without a larger
> > > refactor due to the nature of PDEs being shared. I may need the retry loop but
> > > that also gets tricky with arrays of binds... A few options here and I will
> > > land on a solution once [2] is merged.
> > 
> > I discussed this retry loop (I refer to the retry loop in the xe_pt_userptr_pre_commit function before your change) with Thomas some time ago. I agree the retry loop scheme works.
> > 
> > But to me a scheme without the loop also works. As long as we have a retry check before programming the gpu page table, it is fine to me. 
> > 
> > If you have a further fix in this area, I will keep watching and pick it up. 
> >
> 
> I have this properly fixed now in [2]. Faulting VMs will return with
> -EAGAIN if the seqno check fails in a bind. It will be up to the page
> fault handler to retry. In the case of user binds hitting this race, I
> currently just kick the -EAGAIN back to userspace to retry. I could
> change the latter if people disagree and retry in the kernel too.
> 
> [2] https://patchwork.freedesktop.org/series/133034/.
>  
> > > 
> > > Regardless my point here is still valid - it may not be the worst idea
> > > when getting this code initially up and running just to grab
> > > mmap_write_lock in GPU fault handler as a BLK (big kernel lock) to
> > > prevent all races. Once the code is stable and stress testing is in place,
> > > switch to finer grained locking as defined in the HMM document (or newly
> > > defined if we find the locking design is insufficient).
> > 
> > As explained, I will keep the read-lock for now. And we will test it. If we do run into a race condition, I will try the write-lock and come back.
> >
> 
> See above, you will hit a race.
> 

Hmm, I'm thinking now that some of the races might actually be avoidable
with 'page granularity' migration... You'd have to play around with this, but
my initial thought is it probably at least helps, if not outright fixes,
the races.

Matt

> I strongly suggest you check out my working PoC and IGT and play around
> with that code. If you change from write lock to read lock in the GPU
> handler around VRAM migration and hmm range fault, the 'race' sections
> fail.
> 
> Matt
>  
> > Thanks a lot for the great questions Matt! It is very helpful!
> > 
> > 
> > Oak
> > 
> > > 
> > > Matt
> > > 
> > > [1] https://patchwork.freedesktop.org/series/125608/
> > > [2] https://patchwork.freedesktop.org/series/132246/
> > > 
> > > >
> > > >
> > > > Oak
> > > >
> > > > >
> > > > > > +
> > > > > > +	vas = find_vma_intersection(mm, start, start + 4);
> > > > >
> > > > > find_vma should work fine here.
> > > > >
> > > > > > +	if (!vas)
> > > > > > +		return -ENOENT;
> > > > > > +
> > > > > > +	migrate.vma = vas;
> > > > > > +	buf = kvcalloc(npages, 2* sizeof(*migrate.src) +
> > > sizeof(*src_dma_addr),
> > > > > > +					GFP_KERNEL);
> > > > > > +	if(!buf)
> > > > > > +		return -ENOMEM;
> > > > > > +	migrate.src = buf;
> > > > > > +	migrate.dst = migrate.src + npages;
> > > > > > +	src_dma_addr = (dma_addr_t *) (migrate.dst + npages);
> > > > > > +	ret = xe_devm_alloc_pages(tile, npages, &blocks, migrate.dst);
> > > > >
> > > > > Again as I discussed in [3] I think this should be broken out into a
> > > > > different step with the blocks allocated before this, and here just
> > > > > populate migrate.dst from the existing blocks.
> > > > >
> > > > > [3]
> > > https://patchwork.freedesktop.org/patch/588523/?series=132229&rev=1
> > > > >
> > > > > > +	if (ret)
> > > > > > +		goto kfree_buf;
> > > > > > +
> > > > > > +	ret = migrate_vma_setup(&migrate);
> > > > > > +	if (ret) {
> > > > > > +		drm_err(&tile->xe->drm, "vma setup returned %d for range
> > > > > [%lx - %lx]\n",
> > > > > > +				ret, start, end);
> > > > > > +		goto free_dst_pages;
> > > > > > +	}
> > > > > > +
> > > > > > +	/**FIXME: partial migration of a range print a warning for now.
> > > > > > +	 * If this message is printed, we need to split xe_vma as we
> > > > > > +	 * don't support a mixture placement of one vma
> > > > > > +	 */
> > > > > > +	if (migrate.cpages != npages)
> > > > > > +		drm_warn(&tile->xe->drm, "Partial migration for range [%lx -
> > > > >  %lx], range is %ld pages, migrate only %ld pages\n",
> > > > > > +				start, end, npages, migrate.cpages);
> > > > >
> > > > > As discussed above, we shouldn't support this. We should fall back to
> > > > > smaller xe_vma chunk size until we find one that works or simply leave
> > > > > the pages in sram and map those pages to GPU.
> > > > >
> > > > > > +
> > > > > > +	/**Migrate page by page for now.
> > > > > > +	 * Both source pages and destination pages can physically not
> > > > > contiguous,
> > > > > > +	 * there is no good way to migrate multiple pages per blitter
> > > command.
> > > > > > +	 */
> > > > >
> > > > > Touched on this a bunch throughout the series, let's do better than a
> > > > > page-at-a-time migration.
> > > > >
> > > > > Algorithm should be very similar to what I discussed here [4] but with a
> > > > > few key differences.
> > > > >
> > > > > - I think the sram pages can be unpopulated (page == NULL) if the user
> > > > >   has not yet touched the page
> > > > > - Also I think the MIGRATE_PFN_MIGRATE bit being clear is valid
> > > > >
> > > > > In these cases this indicate we have to issue a copy for the pages we
> > > > > are accumulating with contigous vram addresses.
> > > > >
> > > > > [4] https://patchwork.freedesktop.org/patch/588526/?series=132229&rev=1
> > > > >
> > > > > > +	for (i = 0; i < npages; i++) {
> > > > > > +		src_page = migrate_pfn_to_page(migrate.src[i]);
> > > > > > +		if (unlikely(!src_page || !(migrate.src[i] & MIGRATE_PFN_MIGRATE)))
> > > > >
> > > > > Discussed this in the CPU fault patch: once we call migrate_vma_setup,
> > > > > on subsequent errors we need to call migrate_vma_finalize to revert the
> > > > > pages to their original state. At least I think that is the case, if I
> > > > > am reading the doc correctly.
> > > > >
> > > > > Here on error we just free the pages...
> > > > >
> > > > > Matt
> > > > >
> > > > > > +			goto free_dst_page;
> > > > > > +
> > > > > > +		xe_assert(tile->xe, !is_zone_device_page(src_page));
> > > > > > +		src_dma_addr[i] = dma_map_page(dev, src_page, 0, PAGE_SIZE, DMA_TO_DEVICE);
> > > > > > +		if (unlikely(dma_mapping_error(dev, src_dma_addr[i]))) {
> > > > > > +			drm_warn(&tile->xe->drm, "dma map error for host pfn %lx\n", migrate.src[i]);
> > > > > > +			goto free_dst_page;
> > > > > > +		}
> > > > > > +		dst_dpa = xe_mem_region_pfn_to_dpa(mr, migrate.dst[i]);
> > > > > > +		fence = xe_migrate_pa(tile->migrate, src_dma_addr[i], false,
> > > > > > +				dst_dpa, true, PAGE_SIZE);
> > > > > > +		if (IS_ERR(fence)) {
> > > > > > +			drm_warn(&tile->xe->drm, "migrate host page (pfn: %lx) to vram failed\n",
> > > > > > +					migrate.src[i]);
> > > > > > +			/**Migration is best effort. Even if we fail here, we continue*/
> > > > > > +			goto free_dst_page;
> > > > > > +		}
> > > > > > +		/**FIXME: Use the first migration's out fence as the second
> > > > > > +		 * migration's input fence, and so on. Only wait on the out
> > > > > > +		 * fence of the last migration?
> > > > > > +		 */
> > > > > > +		dma_fence_wait(fence, false);
> > > > > > +		dma_fence_put(fence);
> > > > > > +free_dst_page:
> > > > > > +		xe_devm_page_free(pfn_to_page(migrate.dst[i]));
> > > > > > +	}
> > > > > > +
> > > > > > +	for (i = 0; i < npages; i++)
> > > > > > +		if (!(dma_mapping_error(dev, src_dma_addr[i])))
> > > > > > +			dma_unmap_page(dev, src_dma_addr[i], PAGE_SIZE, DMA_TO_DEVICE);
> > > > > > +
> > > > > > +	migrate_vma_pages(&migrate);
> > > > > > +	migrate_vma_finalize(&migrate);
> > > > > > +free_dst_pages:
> > > > > > +	if (ret)
> > > > > > +		xe_devm_free_blocks(&blocks);
> > > > > > +kfree_buf:
> > > > > > +	kfree(buf);
> > > > > > +	return ret;
> > > > > > +}
> > > > > > --
> > > > > > 2.26.3
> > > > > >
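For reference, the fence-chaining question in the FIXME above could resolve to keeping only the most recent fence and waiting once after the loop. This is an untested kernel-style sketch, assuming the migrate queue executes copy jobs in submission order so intermediate fences can simply be dropped:

```
struct dma_fence *last = NULL;

for (i = 0; i < npages; i++) {
	...
	fence = xe_migrate_pa(tile->migrate, src_dma_addr[i], false,
			      dst_dpa, true, PAGE_SIZE);
	if (IS_ERR(fence))
		goto free_dst_page;
	/* drop the previous fence; ordering is preserved by the queue */
	dma_fence_put(last);
	last = fence;
}

if (last) {
	dma_fence_wait(last, false);
	dma_fence_put(last);
}
```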

^ permalink raw reply	[flat|nested] 72+ messages in thread

* Re: [v2 31/31] drm/xe/svm: Migration from sram to vram for system allocator
  2024-06-07 17:22     ` Zeng, Oak
@ 2024-06-07 18:18       ` Matthew Brost
  2024-06-07 18:23         ` Matthew Brost
  0 siblings, 1 reply; 72+ messages in thread
From: Matthew Brost @ 2024-06-07 18:18 UTC (permalink / raw)
  To: Zeng, Oak
  Cc: intel-xe@lists.freedesktop.org, Ghimiray, Himal Prasad,
	Bommu, Krishnaiah, Thomas.Hellstrom@linux.intel.com, Welty, Brian

On Fri, Jun 07, 2024 at 11:22:42AM -0600, Zeng, Oak wrote:
> Hi Matt,
> 
> > -----Original Message-----
> > From: Brost, Matthew <matthew.brost@intel.com>
> > Sent: Wednesday, April 10, 2024 10:55 PM
> > To: Zeng, Oak <oak.zeng@intel.com>
> > Cc: intel-xe@lists.freedesktop.org; Ghimiray, Himal Prasad
> > <himal.prasad.ghimiray@intel.com>; Bommu, Krishnaiah
> > <krishnaiah.bommu@intel.com>; Thomas.Hellstrom@linux.intel.com; Welty,
> > Brian <brian.welty@intel.com>
> > Subject: Re: [v2 31/31] drm/xe/svm: Migration from sram to vram for system
> > allocator
> > 
> > On Tue, Apr 09, 2024 at 04:17:42PM -0400, Oak Zeng wrote:
> > > If applicable, migrate a vma from sram to vram for system allocator.
> > > Traditional userptr is not migrated. Only userptr created during
> > > fault (aka userptr split from system allocator vma) can be
> > > migrated.
> > >
> > > FIXME: The migration should be conditional on user memory attributes
> > > setting. Add this logic when memory attributes are supported
> > >
> > > Signed-off-by: Oak Zeng <oak.zeng@intel.com>
> > > ---
> > >  drivers/gpu/drm/xe/xe_gt_pagefault.c | 9 ++++++++-
> > >  drivers/gpu/drm/xe/xe_vm.c           | 4 ----
> > >  2 files changed, 8 insertions(+), 5 deletions(-)
> > >
> > > diff --git a/drivers/gpu/drm/xe/xe_gt_pagefault.c b/drivers/gpu/drm/xe/xe_gt_pagefault.c
> > > index 668984f0769e..c6ba00049964 100644
> > > --- a/drivers/gpu/drm/xe/xe_gt_pagefault.c
> > > +++ b/drivers/gpu/drm/xe/xe_gt_pagefault.c
> > > @@ -20,6 +20,7 @@
> > >  #include "xe_guc_ct.h"
> > >  #include "xe_migrate.h"
> > >  #include "xe_trace.h"
> > > +#include "xe_svm.h"
> > >  #include "xe_vm.h"
> > >
> > >  struct pagefault {
> > > @@ -209,12 +210,18 @@ static int handle_pagefault(struct xe_gt *gt, struct pagefault *pf)
> > >
> > >  	if (xe_vma_is_userptr(vma) && write_locked) {
> > >  		struct xe_userptr_vma *uvma = to_userptr_vma(vma);
> > > +		struct xe_userptr *userptr = &uvma->userptr;
> > >
> > >  		spin_lock(&vm->userptr.invalidated_lock);
> > > -		list_del_init(&uvma->userptr.invalidate_link);
> > > +		list_del_init(&userptr->invalidate_link);
> > >  		spin_unlock(&vm->userptr.invalidated_lock);
> > >
> > > +		mmap_read_lock(userptr->notifier.mm);
> > > +		/**FIXME: Add migration policy here*/
> > > +		if (xe_vma_is_fault_userptr(vma))
> > > +			xe_svm_migrate_vma_to_vram(vm, vma, tile);
> > 
> > Agree we need a policy here...
> > 
> > See my comments about locking in [1] thinking if we migrate we likely
> > want to hold the mmap lock until at least the bind being issued to
> > prevent races with the CPU fault handler, at least initially.
> 
> As explained in [1], I will continue to use the mmap read lock for now. The read lock only covers migration and userptr-pin-pages, but not the bind. Bind correctness is guaranteed by a retry mechanism and the userptr's notifier_lock.
> 

See my reply. I think VRAM migration and grabbing VRAM pages via
hmm_range_fault either need the write lock, or some core refactoring
needs to be done to avoid races.

One more possibility wrt these races: maybe we simply don't care, and
racing access between the CPU and GPU on shared memory is akin to a user
fault (i.e. the user must synchronize access or the behavior is
undefined, possibly resulting in a segfault). If I recall correctly, my
tests showed that when these races occurred, either memory corruption
happened or the user got a segfault.

FWIW, CUDA seems to have synchronize calls in its SVM documentation
(e.g. pass a malloc'ed address to a shader, execute the shader, call
synchronize, then access the malloc'ed address in the user program). I
can dig up this documentation if you like.

Agree binds certainly do not need the mmap lock.

Matt

> > 
> > [1] https://patchwork.freedesktop.org/patch/588542/?series=132229&rev=1
> > 
> > >  		ret = xe_vma_userptr_pin_pages(uvma);
> > > +		mmap_read_unlock(userptr->notifier.mm);
> > >  		if (ret)
> > >  			goto unlock_vm;
> > >
> > > diff --git a/drivers/gpu/drm/xe/xe_vm.c b/drivers/gpu/drm/xe/xe_vm.c
> > > index 498b36469d00..8a58fe144a02 100644
> > > --- a/drivers/gpu/drm/xe/xe_vm.c
> > > +++ b/drivers/gpu/drm/xe/xe_vm.c
> > > @@ -71,16 +71,12 @@ int xe_vma_userptr_pin_pages(struct xe_userptr_vma *uvma)
> > >  	struct xe_vma *vma = &uvma->vma;
> > >  	struct xe_vm *vm = xe_vma_vm(vma);
> > >  	struct xe_device *xe = vm->xe;
> > > -	struct xe_userptr *userptr;
> > >  	int ret;
> > >
> > >  	lockdep_assert_held(&vm->lock);
> > >  	xe_assert(xe, xe_vma_is_userptr(vma));
> > >
> > > -	userptr = &uvma->userptr;
> > > -	mmap_read_lock(userptr->notifier.mm);
> > >  	ret = xe_userptr_populate_range(uvma);
> > > -	mmap_read_unlock(userptr->notifier.mm);
> > 
> > Now you won't have the lock here for other callers of this function...
> > Probably need locked / unlocked versions or an argument here.
> 
> This is addressed in v3.
> 
> Oak
> 
> > 
> > Matt
> > 
> > >
> > >  	return ret;
> > >  }
> > > --
> > > 2.26.3
> > >
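The locked/unlocked split suggested in the review above is a common kernel idiom. A hypothetical sketch (names invented, not the actual v3 code; the double-underscore variant expects the caller to already hold the mmap read lock):

```
static int __xe_vma_userptr_pin_pages(struct xe_userptr_vma *uvma)
{
	mmap_assert_locked(uvma->userptr.notifier.mm);

	return xe_userptr_populate_range(uvma);
}

int xe_vma_userptr_pin_pages(struct xe_userptr_vma *uvma)
{
	struct mm_struct *mm = uvma->userptr.notifier.mm;
	int ret;

	mmap_read_lock(mm);
	ret = __xe_vma_userptr_pin_pages(uvma);
	mmap_read_unlock(mm);

	return ret;
}
```

Callers that already hold the lock (e.g. the page fault path after migration) would call the double-underscore variant directly.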


* Re: [v2 31/31] drm/xe/svm: Migration from sram to vram for system allocator
  2024-06-07 18:18       ` Matthew Brost
@ 2024-06-07 18:23         ` Matthew Brost
  0 siblings, 0 replies; 72+ messages in thread
From: Matthew Brost @ 2024-06-07 18:23 UTC (permalink / raw)
  To: Zeng, Oak
  Cc: intel-xe@lists.freedesktop.org, Ghimiray, Himal Prasad,
	Bommu, Krishnaiah, Thomas.Hellstrom@linux.intel.com, Welty, Brian

On Fri, Jun 07, 2024 at 06:18:57PM +0000, Matthew Brost wrote:
> On Fri, Jun 07, 2024 at 11:22:42AM -0600, Zeng, Oak wrote:
> > Hi Matt,
> > 
> > > -----Original Message-----
> > > From: Brost, Matthew <matthew.brost@intel.com>
> > > Sent: Wednesday, April 10, 2024 10:55 PM
> > > To: Zeng, Oak <oak.zeng@intel.com>
> > > Cc: intel-xe@lists.freedesktop.org; Ghimiray, Himal Prasad
> > > <himal.prasad.ghimiray@intel.com>; Bommu, Krishnaiah
> > > <krishnaiah.bommu@intel.com>; Thomas.Hellstrom@linux.intel.com; Welty,
> > > Brian <brian.welty@intel.com>
> > > Subject: Re: [v2 31/31] drm/xe/svm: Migration from sram to vram for system
> > > allocator
> > > 
> > > On Tue, Apr 09, 2024 at 04:17:42PM -0400, Oak Zeng wrote:
> > > > If applicable, migrate a vma from sram to vram for system allocator.
> > > > Traditional userptr is not migrated. Only userptr created during
> > > > fault (aka userptr split from system allocator vma) can be
> > > > migrated.
> > > >
> > > > FIXME: The migration should be conditional on user memory attributes
> > > > setting. Add this logic when memory attributes are supported
> > > >
> > > > Signed-off-by: Oak Zeng <oak.zeng@intel.com>
> > > > ---
> > > >  drivers/gpu/drm/xe/xe_gt_pagefault.c | 9 ++++++++-
> > > >  drivers/gpu/drm/xe/xe_vm.c           | 4 ----
> > > >  2 files changed, 8 insertions(+), 5 deletions(-)
> > > >
> > > > diff --git a/drivers/gpu/drm/xe/xe_gt_pagefault.c b/drivers/gpu/drm/xe/xe_gt_pagefault.c
> > > > index 668984f0769e..c6ba00049964 100644
> > > > --- a/drivers/gpu/drm/xe/xe_gt_pagefault.c
> > > > +++ b/drivers/gpu/drm/xe/xe_gt_pagefault.c
> > > > @@ -20,6 +20,7 @@
> > > >  #include "xe_guc_ct.h"
> > > >  #include "xe_migrate.h"
> > > >  #include "xe_trace.h"
> > > > +#include "xe_svm.h"
> > > >  #include "xe_vm.h"
> > > >
> > > >  struct pagefault {
> > > > @@ -209,12 +210,18 @@ static int handle_pagefault(struct xe_gt *gt, struct pagefault *pf)
> > > >
> > > >  	if (xe_vma_is_userptr(vma) && write_locked) {
> > > >  		struct xe_userptr_vma *uvma = to_userptr_vma(vma);
> > > > +		struct xe_userptr *userptr = &uvma->userptr;
> > > >
> > > >  		spin_lock(&vm->userptr.invalidated_lock);
> > > > -		list_del_init(&uvma->userptr.invalidate_link);
> > > > +		list_del_init(&userptr->invalidate_link);
> > > >  		spin_unlock(&vm->userptr.invalidated_lock);
> > > >
> > > > +		mmap_read_lock(userptr->notifier.mm);
> > > > +		/**FIXME: Add migration policy here*/
> > > > +		if (xe_vma_is_fault_userptr(vma))
> > > > +			xe_svm_migrate_vma_to_vram(vm, vma, tile);
> > > 
> > > Agree we need a policy here...
> > > 
> > > See my comments about locking in [1] thinking if we migrate we likely
> > > want to hold the mmap lock until at least the bind being issued to
> > > prevent races with the CPU fault handler, at least initially.
> > 
> > As explained in [1], I will continue to use the mmap read lock for now. The read lock only covers migration and userptr-pin-pages, but not the bind. Bind correctness is guaranteed by a retry mechanism and the userptr's notifier_lock.
> > 
> 
> See my reply. I think VRAM migration and grabbing VRAM pages via
> hmm_range_fault either need the write lock, or some core refactoring
> needs to be done to avoid races.
> 
> One more possibility wrt these races: maybe we simply don't care, and
> racing access between the CPU and GPU on shared memory is akin to a user
> fault (i.e. the user must synchronize access or the behavior is
> undefined, possibly resulting in a segfault). If I recall correctly, my
> tests showed that when these races occurred, either memory corruption
> happened or the user got a segfault.
> 
> FWIW, CUDA seems to have synchronize calls in its SVM documentation
> (e.g. pass a malloc'ed address to a shader, execute the shader, call
> synchronize, then access the malloc'ed address in the user program). I
> can dig up this documentation if you like.
>

Sorry, one more comment. I haven't looked at the Level Zero spec for
SVM; we should also look at that and perhaps spec it out such that CPU
and GPU access of SVM memory is mutually exclusive and concurrent
access is undefined.

Also missed a comment below.
 
> Agree binds certainly do not need the mmap lock.
> 
> Matt
> 
> > > 
> > > [1] https://patchwork.freedesktop.org/patch/588542/?series=132229&rev=1
> > > 
> > > >  		ret = xe_vma_userptr_pin_pages(uvma);
> > > > +		mmap_read_unlock(userptr->notifier.mm);
> > > >  		if (ret)
> > > >  			goto unlock_vm;
> > > >
> > > > diff --git a/drivers/gpu/drm/xe/xe_vm.c b/drivers/gpu/drm/xe/xe_vm.c
> > > > index 498b36469d00..8a58fe144a02 100644
> > > > --- a/drivers/gpu/drm/xe/xe_vm.c
> > > > +++ b/drivers/gpu/drm/xe/xe_vm.c
> > > > @@ -71,16 +71,12 @@ int xe_vma_userptr_pin_pages(struct xe_userptr_vma *uvma)
> > > >  	struct xe_vma *vma = &uvma->vma;
> > > >  	struct xe_vm *vm = xe_vma_vm(vma);
> > > >  	struct xe_device *xe = vm->xe;
> > > > -	struct xe_userptr *userptr;
> > > >  	int ret;
> > > >
> > > >  	lockdep_assert_held(&vm->lock);
> > > >  	xe_assert(xe, xe_vma_is_userptr(vma));
> > > >
> > > > -	userptr = &uvma->userptr;
> > > > -	mmap_read_lock(userptr->notifier.mm);
> > > >  	ret = xe_userptr_populate_range(uvma);
> > > > -	mmap_read_unlock(userptr->notifier.mm);
> > > 
> > > Now you won't have the lock here for other callers of this function...
> > > Probably need locked / unlocked versions or an argument here.
> > 
> > This is addressed in v3.
> > 

Missed this. Also to be clear, in your design Xe shouldn't touch the
mmap lock either; that lock should be owned by the DRM layer.

Matt

> > Oak
> > 
> > > 
> > > Matt
> > > 
> > > >
> > > >  	return ret;
> > > >  }
> > > > --
> > > > 2.26.3
> > > >


end of thread, other threads:[~2024-06-07 18:25 UTC | newest]

Thread overview: 72+ messages (download: mbox.gz follow: Atom feed
-- links below jump to the message on this page --
2024-04-09 20:17 [v2 00/31] Basic system allocator support in xe driver Oak Zeng
2024-04-09 20:17 ` [v2 01/31] drm/xe: Refactor vm_bind Oak Zeng
2024-04-09 20:17 ` [v2 02/31] drm/xe/svm: Add SVM document Oak Zeng
2024-04-09 20:17 ` [v2 03/31] drm/xe: Invalidate userptr VMA on page pin fault Oak Zeng
2024-04-09 20:17 ` [v2 04/31] drm/xe: Drop unused arguments from vm_bind_ioctl_ops_parse Oak Zeng
2024-04-09 20:17 ` [v2 05/31] drm/xe: Fix op->tile_mask for fault mode Oak Zeng
2024-04-09 20:17 ` [v2 06/31] drm/xe/uapi: Add DRM_XE_VM_BIND_FLAG_SYSTEM_ALLOCATOR flag Oak Zeng
2024-04-09 20:17 ` [v2 07/31] drm/xe: Create userptr if page fault occurs on system_allocator VMA Oak Zeng
2024-04-09 20:17 ` [v2 08/31] drm/xe: Add faulted userptr VMA garbage collector Oak Zeng
2024-04-09 20:17 ` [v2 09/31] drm/xe: Introduce helper to populate userptr Oak Zeng
2024-04-09 20:17 ` [v2 10/31] drm/xe: Introduce a helper to free sg table Oak Zeng
2024-04-09 20:17 ` [v2 11/31] drm/xe: Use hmm_range_fault to populate user pages Oak Zeng
2024-04-09 20:17 ` [v2 12/31] drm/xe/svm: Remap and provide memmap backing for GPU vram Oak Zeng
2024-04-10 21:09   ` Matthew Brost
2024-04-16 19:01   ` Matthew Brost
2024-04-09 20:17 ` [v2 13/31] drm/xe/svm: Introduce DRM_XE_SVM kernel config Oak Zeng
2024-04-10 21:13   ` Matthew Brost
2024-06-04 18:57     ` Zeng, Oak
2024-04-09 20:17 ` [v2 14/31] drm/xe: Introduce helper to get tile from memory region Oak Zeng
2024-04-10 21:17   ` Matthew Brost
2024-04-09 20:17 ` [v2 15/31] drm/xe: Introduce a helper to get dpa from pfn Oak Zeng
2024-04-10 21:35   ` Matthew Brost
2024-04-09 20:17 ` [v2 16/31] drm/xe/svm: Get xe memory region from page Oak Zeng
2024-04-10 21:38   ` Matthew Brost
2024-04-09 20:17 ` [v2 17/31] drm/xe: Get xe_vma from xe_userptr Oak Zeng
2024-04-10 21:42   ` Matthew Brost
2024-04-09 20:17 ` [v2 18/31] drm/xe/svm: Build userptr sg table for device pages Oak Zeng
2024-04-10 21:52   ` Matthew Brost
2024-04-09 20:17 ` [v2 19/31] drm/xe/svm: Determine a vma is backed by device memory Oak Zeng
2024-04-10 21:56   ` Matthew Brost
2024-06-05  2:29     ` Zeng, Oak
2024-04-09 20:17 ` [v2 20/31] drm/xe: add xe lock document Oak Zeng
2024-04-09 20:17 ` [v2 21/31] drm/xe/svm: Introduce svm migration function Oak Zeng
2024-04-10 22:06   ` Matthew Brost
2024-04-09 20:17 ` [v2 22/31] drm/xe/svm: implement functions to allocate and free device memory Oak Zeng
2024-04-10 22:23   ` Matthew Brost
2024-04-15 20:13     ` Zeng, Oak
2024-04-15 21:19       ` Matthew Brost
2024-06-05 22:16     ` Zeng, Oak
2024-06-05 23:37       ` Matthew Brost
2024-06-06  3:30         ` Zeng, Oak
2024-06-06  4:44           ` Matthew Brost
2024-04-17 20:55   ` Matthew Brost
2024-04-09 20:17 ` [v2 23/31] drm/xe/svm: Trace buddy block allocation and free Oak Zeng
2024-04-09 20:17 ` [v2 24/31] drm/xe/svm: Create and destroy xe svm Oak Zeng
2024-04-10 22:25   ` Matthew Brost
2024-04-09 20:17 ` [v2 25/31] drm/xe/svm: Add vm to xe_svm process Oak Zeng
2024-04-09 20:17 ` [v2 26/31] drm/xe: Make function lookup_vma public Oak Zeng
2024-04-10 22:26   ` Matthew Brost
2024-04-09 20:17 ` [v2 27/31] drm/xe/svm: Handle CPU page fault Oak Zeng
2024-04-11  2:07   ` Matthew Brost
2024-04-12 17:24     ` Zeng, Oak
2024-04-12 18:10       ` Matthew Brost
2024-04-12 18:39         ` Zeng, Oak
2024-06-07  4:44         ` Zeng, Oak
2024-06-07  4:30     ` Zeng, Oak
2024-04-09 20:17 ` [v2 28/31] drm/xe/svm: Introduce helper to migrate vma to vram Oak Zeng
2024-04-11  2:49   ` Matthew Brost
2024-04-12 21:21     ` Zeng, Oak
2024-04-15 19:40       ` Matthew Brost
2024-06-07 17:12         ` Zeng, Oak
2024-06-07 17:56           ` Matthew Brost
2024-06-07 18:10             ` Matthew Brost
2024-04-09 20:17 ` [v2 29/31] drm/xe/svm: trace svm migration Oak Zeng
2024-04-09 20:17 ` [v2 30/31] drm/xe/svm: Add a helper to determine a vma is fault userptr Oak Zeng
2024-04-11  2:50   ` Matthew Brost
2024-04-09 20:17 ` [v2 31/31] drm/xe/svm: Migration from sram to vram for system allocator Oak Zeng
2024-04-11  2:55   ` Matthew Brost
2024-06-07 17:22     ` Zeng, Oak
2024-06-07 18:18       ` Matthew Brost
2024-06-07 18:23         ` Matthew Brost
2024-04-09 20:52 ` ✗ CI.Patch_applied: failure for Basic system allocator support in xe driver Patchwork

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox